Last Updated on August 21, 2023
Unzipping a large zip file is typically slow.
It can become painfully slow when you need to unzip thousands of files to disk. The hope is that multithreading will speed up the unzipping operation.
In this tutorial, you will explore how to unzip large zip files containing many data files using concurrency.
Let’s dive in.
Create a Large Zip File
Before we can explore multithreaded file unzipping, we need a large zip file that contains many files.
We can create this file ourselves.
We can create .CSV data files full of random numbers then add them to one large zip file.
First, we can define a function that creates one line of ten random numeric data values separated by commas.
The generate_line() function below implements this.
# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])
Next, we can call this function 10,000 times (i.e. 10,000 lines) to create a CSV data file that is about 2 megabytes when unzipped.
The generate_file_data() function below implements this, returning a string that represents the content of a CSV data file full of random numbers.
# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)
Finally, we can create a zip file and fill it with 1,000 different CSV data files.
First, we can open the zip file, which we will call 'testing.zip', stored in our current working directory. This can be achieved using the zipfile.ZipFile class with a context manager to ensure the file is closed once we are finished with it.
...
# create the zip file
with ZipFile('testing.zip', 'w', compression=ZIP_DEFLATED) as handle:
    # ...
We can then iterate 1,000 times and in each iteration generate the data, specify a unique filename and store the data against the filename in the zip file.
...
# generate data
data = generate_file_data()
# create filenames
filename = f'data-{i:04d}.csv'
# save data file
handle.writestr(filename, data)
# report progress
print(f'.added {filename}')
Tying these elements together, the complete example of creating the large zip file full of smaller CSV files is listed below.
# SuperFastPython.com
# create a zip file containing many files
from random import random
from zipfile import ZipFile
from zipfile import ZIP_DEFLATED

# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])

# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)

# create a zip file
def main(num_files=1000):
    # create the zip file
    with ZipFile('testing.zip', 'w', compression=ZIP_DEFLATED) as handle:
        # create some files
        for i in range(num_files):
            # generate data
            data = generate_file_data()
            # create filenames
            filename = f'data-{i:04d}.csv'
            # save data file
            handle.writestr(filename, data)
            # report progress
            print(f'.added {filename}')
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example will create a single zip file named 'testing.zip' in your current working directory.
Progress is reported along the way and a “Done” message is reported at the end of the run.
...
.added data-0990.csv
.added data-0991.csv
.added data-0992.csv
.added data-0993.csv
.added data-0994.csv
.added data-0995.csv
.added data-0996.csv
.added data-0997.csv
.added data-0998.csv
.added data-0999.csv
Done
It may take a while depending on the speed of your machine.
On my system it takes 210.9 seconds to complete and the resulting zip file is about 847 megabytes in size.
testing.zip
This provides an excellent case study to see if we can use multithreading to speed up the decompression of a large zip file archive that contains thousands of smaller files.
Before we explore multithreading, let’s write a program to unzip the archive sequentially.
Unzip the Files One-by-One (slowly)
Unzipping a file with Python is straightforward.
We can open the archive using a context manager, as we did when creating the zip file, then call the ZipFile.extractall() function and specify a path.
For example:
...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # extract all files
    handle.extractall(path)
That’s all there is to it.
The complete example of unzipping the 847 megabyte zip file containing 1,000 data files is listed below.
It has been configured to unzip all files into a tmp/ sub-directory under the current working directory, to avoid clutter.
# SuperFastPython.com
# unzip a large number of files sequentially
from zipfile import ZipFile

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # extract all files
        handle.extractall(path)

# entry point
if __name__ == '__main__':
    main()
Running the example will open the 'testing.zip' archive and will uncompress all data files into a tmp/ sub-directory.
It is reasonably fast, depending on the speed of your hardware (e.g. hard drive).
On my system, it takes about 8.0 seconds.
[Finished in 8.1s]
[Finished in 8.0s]
[Finished in 7.9s]
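The timings reported throughout this tutorial come from my editor (the "[Finished in ...]" lines). If you want to measure the runs yourself, one simple option is to wrap the call to main() with time.perf_counter(). This is a minimal sketch of that idea (not part of the original listings), shown here against the sequential version:

# time the sequential unzip (optional timing harness)
from time import perf_counter
from zipfile import ZipFile

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # extract all files
        handle.extractall(path)

# entry point
if __name__ == '__main__':
    # record the start time
    start = perf_counter()
    main()
    # report the overall duration
    print(f'Took {perf_counter() - start:.1f} seconds')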
Looking in the tmp/ subdirectory shows 1,000 data files, each about 2 megabytes in size.
...
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0990.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0991.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0992.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0993.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0994.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0995.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0996.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0997.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0998.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0999.csv
We can also unzip the files one by one by calling the ZipFile.extract() function for each filename in the archive.
We can get a list of filenames in the archive by calling the ZipFile.namelist() function.
For example:
...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # iterate over all files in the archive
    for member in handle.namelist():
        # unzip a single file from the archive
        handle.extract(member, path)
        # report progress
        print(f'.unzipped {member}')
Tying this together, the complete example of extracting each file from the archive one at a time by name is listed below.
# SuperFastPython.com
# unzip a large number of files sequentially
from zipfile import ZipFile

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # iterate over all files in the archive
        for member in handle.namelist():
            # unzip a single file from the archive
            handle.extract(member, path)
            # report progress
            print(f'.unzipped {member}')

# entry point
if __name__ == '__main__':
    main()
Running the example does exactly the same thing as the previous example, that is, it extracts all files into a tmp/ sub-directory.
It also takes about the same amount of time to run, about 8.36 seconds on my machine, perhaps slightly slower because of the print messages reporting progress.
[Finished in 8.3s]
[Finished in 8.4s]
[Finished in 8.4s]
How long does the serial version take to run on your machine?
Let me know in the comments below.
Next, let’s explore how we might use multithreading to speed up the unzipping process.
How to Unzip Files Concurrently
We can adapt the program to be multithreaded with very few changes.
Unzipping files has three elements:
- Load the compressed data into memory.
- Decompress data.
- Save the decompressed data to disk.
Decompressing data in memory is purely algorithmic and intuitively we might think it is CPU-bound. That is, it is limited by the speed of the CPU.
Saving the decompressed data to file is IO-bound as it is limited by the speed that we can move data from main memory onto the hard drive. Intuitively, we would expect that the IO-bound part of the task is slower than the CPU-bound part of the task. That is, we expect to spend more time waiting for the hard drive than waiting for the CPU.
One approach would be to call the ZipFile.extract() function directly to first decompress the data into memory then save the data to disk.
An alternate approach might be to first decompress the data into memory as a string using the ZipFile.read() function, then create a path and save the file to disk manually using Python disk IO functions. This might offer a benefit if we wish to explore possible speed-ups by separating the two elements of unzipping each file.
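As a rough illustration of this second approach, the snippet below reads each member into memory and then writes it to disk manually. This is only a sketch (it assumes the 'testing.zip' archive exists, that join is imported from os.path, and that a path output directory has already been created); we build on this idea in the process-and-thread example later in the tutorial.

...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # iterate over all files in the archive
    for member in handle.namelist():
        # decompress the member into memory (CPU-bound)
        data = handle.read(member)
        # save the decompressed bytes to disk manually (IO-bound)
        with open(join(path, member), 'wb') as file:
            file.write(data)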
Next, let’s start to explore the use of threads to speed up file unzipping and saving.
Unzip Files Concurrently With Threads
The ThreadPoolExecutor provides a pool of worker threads that we can use.
Each file can be submitted as a task to the thread pool and worker threads can perform the task of unzipping the file and saving it to disk.
First, we can create the thread pool with 100 worker threads.
We will use the context manager to ensure the thread pool is closed once we are finished with it.
...
# create the thread pool
with ThreadPoolExecutor(100) as exe:
    # ...
We can then call the submit() function to issue a task to the worker threads for each file in the archive to unzip.
Each call to the ZipFile.extract() function can be a separate concurrent task, passing the name of the member to unzip and the path location on disk where to save the uncompressed data.
For example:
...
# unzip each file from the archive
_ = [exe.submit(handle.extract, m, path) for m in handle.namelist()]
It appears that reading the compressed data from the archive is thread safe in the ZipFile class, although there is no mention of this in the API documentation at the time of writing (Python 3.10).
Tying this together, the complete example of using multiple worker threads to unzip the 847 megabyte file full of smaller data files is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently
from zipfile import ZipFile
from concurrent.futures import ThreadPoolExecutor

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # start the thread pool
        with ThreadPoolExecutor(100) as exe:
            # unzip each file from the archive
            _ = [exe.submit(handle.extract, m, path) for m in handle.namelist()]

# entry point
if __name__ == '__main__':
    main()
Running the example unzips the archive into a tmp/ local sub-directory, just like the serial case.
Checking the tmp/ subdirectory shows 1,000 data files, each about 2 megabytes in size.
...
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0990.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0991.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0992.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0993.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0994.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0995.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0996.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0997.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0998.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0999.csv
In this case, we see a large speedup by unzipping files concurrently.
On my system, the example takes about 3.16 seconds, compared to about 8 seconds by unzipping the files sequentially. That is about 2.53x faster.
[Finished in 3.2s]
[Finished in 3.1s]
[Finished in 3.2s]
We can adapt the multithreaded unzip example to use a separate target function to unzip a single file.
This may provide more control over the unzip process which we may require later in more advanced examples.
We can define a function that takes the zip file handle, the name of the member to unzip, and the path where to write the file, then unzips the file and reports progress.
The unzip_file() function below implements this.
# unzip a file from an archive
def unzip_file(handle, filename, path):
    # unzip the file
    handle.extract(filename, path)
    # report progress
    print(f'.unzipped {filename}')
We can then call this function for each member in the zip file using the thread pool.
The complete example is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with threads
from zipfile import ZipFile
from concurrent.futures import ThreadPoolExecutor

# unzip a file from an archive
def unzip_file(handle, filename, path):
    # unzip the file
    handle.extract(filename, path)
    # report progress
    print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # start the thread pool
        with ThreadPoolExecutor(100) as exe:
            # unzip each file from the archive
            _ = [exe.submit(unzip_file, handle, m, path) for m in handle.namelist()]

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all 1,000 files as before, although now reports progress.
...
.unzipped data-0986.csv
.unzipped data-0989.csv
.unzipped data-0995.csv
.unzipped data-0987.csv
.unzipped data-0996.csv
.unzipped data-0997.csv
.unzipped data-0993.csv
.unzipped data-0994.csv
.unzipped data-0999.csv
.unzipped data-0998.csv
The example is functionally equivalent to the previous version, although it makes use of a custom target task function. Therefore, it should take about the same amount of time to complete.
On my system, the example takes about 3.4 seconds, compared to about 8.36 seconds for the sequential version. That is about 2.46x faster.
[Finished in 3.4s]
[Finished in 3.4s]
[Finished in 3.4s]
Unzip Files Concurrently with Threads in Batch
Each task in the thread pool adds overhead that slows down the overall task.
This includes function calls and object creation for each task, occurring internally within the ThreadPoolExecutor class.
We might reduce this overhead by performing file operations in a batch within each worker thread.
This can be achieved by updating the unzip_file() function to receive a list of files to unzip, and splitting up the files in the main() function into chunks to be submitted to worker threads for batch processing.
The hope is that this would reduce some of the overhead required for submitting so many small tasks by replacing it with a few larger batch tasks.
First, we can change the target function to receive a list of file names to unzip, listed below.
# unzip files from an archive
def unzip_files(handle, filenames, path):
    # unzip multiple files
    for filename in filenames:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')
Next, in the main() function we can select a chunksize based on the number of worker threads and the number of files to unzip.
First, we must create a list of all files we wish to unzip by calling the namelist() function on the zip file.
In this case, we will use 100 worker threads with 1,000 files to unzip. Dividing the files evenly between workers (1,000 / 100) gives 10 files to unzip per worker.
...
# determine chunksize
n_workers = 100
chunksize = round(len(files) / n_workers)
Next, we can create the thread pool with the parameterized number of worker threads.
...
# create the thread pool
with ThreadPoolExecutor(n_workers) as exe:
    ...
Next, we can iterate over the list of files to unzip and split them into chunks defined by a chunksize.
This can be achieved by iterating over the list of files and using the chunksize as the increment size in a for-loop. We can then split off a chunk of files to send to a worker thread as a task.
...
# split the unzip operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit the batch unzip task
    _ = exe.submit(unzip_files, handle, filenames, path)
And that’s it.
The complete example of unzipping files in a batch concurrently using worker threads is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with threads in batch
from zipfile import ZipFile
from concurrent.futures import ThreadPoolExecutor

# unzip files from an archive
def unzip_files(handle, filenames, path):
    # unzip multiple files
    for filename in filenames:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 100
        chunksize = round(len(files) / n_workers)
        # start the thread pool
        with ThreadPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, handle, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all 1,000 files to the tmp/ subdirectory as before.
...
.unzipped data-0389.csv
.unzipped data-0089.csv
.unzipped data-0789.csv
.unzipped data-0889.csv
.unzipped data-0659.csv
.unzipped data-0979.csv
.unzipped data-0109.csv
.unzipped data-0359.csv
.unzipped data-0969.csv
.unzipped data-0679.csv
In this case, we don’t see much additional benefit over the non-batch version in terms of reduced execution time.
On my system it takes about 3.3 seconds to complete, compared to about 8.36 seconds for the sequential version. That is about 2.53x faster.
It is slightly faster than the non-batch version that took about 3.4 seconds, although the difference may be within the margin of measurement error.
[Finished in 3.3s]
[Finished in 3.4s]
[Finished in 3.3s]
Unzip Files Concurrently with Processes
We can also try to unzip files concurrently with processes instead of threads.
It is unclear whether processes can offer a speed benefit in this case. Given that we can get a benefit using threads, we know it is possible. Nevertheless, using processes requires data for each task to be serialized which introduces additional overhead that might negate any speed-up from performing file operations with true parallelism via processes.
We can explore using processes to unzip files concurrently using the ProcessPoolExecutor.
Unlike the ThreadPoolExecutor, we cannot simply send the handle to the open zip file to each worker process as the ZipFile class cannot be serialized.
Instead, each process must open the zip archive and extract files separately.
First, we can update the unzip_file() function to take the name of the zip file instead of the file handle. The function then opens the zip file before unzipping a single file to the destination directory.
The updated version of the unzip_file() function is listed below.
# unzip a file from an archive
def unzip_file(zip_filename, filename, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')
Next, we must open the zip file in the main thread and get a list of all files to unzip.
...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # get a list of files to unzip
    files = handle.namelist()
We can then submit each filename as a task to the process pool.
In this case, we will use 8 workers in the pool because I have 8 logical CPU cores. You may want to adapt this for the number of logical CPU cores in your system or try different configurations in order to discover what works well.
...
# start the process pool
with ProcessPoolExecutor(8) as exe:
    # unzip each file from the archive
    _ = [exe.submit(unzip_file, zip_filename, name, path) for name in files]
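If you would rather size the pool to match your machine, one option is to query the number of logical CPU cores with os.cpu_count() instead of hard-coding 8. This is an optional variation (it assumes os is imported); the listings below keep the fixed value of 8 workers.

...
# size the pool based on the number of logical CPU cores (fall back to 8)
n_workers = os.cpu_count() or 8
# start the process pool
with ProcessPoolExecutor(n_workers) as exe:
    # unzip each file from the archive
    _ = [exe.submit(unzip_file, zip_filename, name, path) for name in files]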
And that’s it.
The complete example of unzipping files concurrently using processes is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with processes
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor

# unzip a file from an archive
def unzip_file(zip_filename, filename, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # get a list of files to unzip
        files = handle.namelist()
        # start the process pool
        with ProcessPoolExecutor(8) as exe:
            # unzip each file from the archive
            _ = [exe.submit(unzip_file, zip_filename, name, path) for name in files]

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all files from the archive as before.
...
.unzipped data-0928.csv
.unzipped data-0937.csv
.unzipped data-0944.csv
.unzipped data-0951.csv
.unzipped data-0960.csv
.unzipped data-0968.csv
.unzipped data-0976.csv
.unzipped data-0984.csv
.unzipped data-0992.csv
.unzipped data-0999.csv
In this case, using a process pool offers a benefit over the single-threaded version, although achieves about the same performance as the multithreaded example.
On my system, the example takes about 3.26 seconds to complete, compared to about 8.36 seconds for the single-threaded version. That is about 2.56x faster.
[Finished in 3.2s]
[Finished in 3.3s]
[Finished in 3.3s]
Unzip Files Concurrently with Processes in Batch
Opening the zip file for each of the 1,000 files to unzip likely results in too much overhead.
Instead, it may be more efficient to open the zip file once per process, then unzip a subset of the files in batch.
Therefore, we can adapt the example from the previous section to unzip multiple files per task, similar to the multithreaded batch example we developed previously.
First, we can update the unzip_files() function to take a list of files to unzip.
# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip multiple files
        for filename in filenames:
            # unzip the file
            handle.extract(filename, path)
            # report progress
            print(f'.unzipped {filename}')
We can then split the list of files into chunks.
In this case, we will use 8 processes and split the 1,000 files to unzip evenly giving 125 files to unzip per process. It may be interesting to explore different divisions of work among the processes.
...
# determine chunksize
n_workers = 8
chunksize = round(len(files) / n_workers)
Finally, we can iterate the list of files to unzip in chunks and send each into the process pool as a task, much like we did in the multithreaded batch unzipping example.
...
# split the unzip operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit the batch unzip task
    _ = exe.submit(unzip_files, zip_filename, filenames, path)
Tying this together, the complete example of unzipping files concurrently using the process pool in batch is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with processes in batch
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor

# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip multiple files
        for filename in filenames:
            # unzip the file
            handle.extract(filename, path)
            # report progress
            print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 8
        chunksize = round(len(files) / n_workers)
        # start the process pool
        with ProcessPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, zip_filename, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all files from the archive as before.
...
.unzipped data-0865.csv
.unzipped data-0866.csv
.unzipped data-0867.csv
.unzipped data-0868.csv
.unzipped data-0869.csv
.unzipped data-0870.csv
.unzipped data-0871.csv
.unzipped data-0872.csv
.unzipped data-0873.csv
.unzipped data-0874.csv
In this case, we see a benefit over all previous examples tried so far.
On my system, the example takes about 2.03 seconds, compared to about 8.36 seconds for the single-threaded version. That is about 4.12x faster.
[Finished in 2.1s]
[Finished in 2.0s]
[Finished in 2.0s]
Interestingly, the ZipFile.extractall() function allows a subset of files to be extracted in batch.
We could call this function from each process for a chunk of files. This requires fewer lines of code and could be slightly more efficient, although it means we cannot report progress along the way.
The unzip_files() function can be updated to call the ZipFile.extractall() function directly, specifying the directory in which the files are to be extracted and a list of names of files to extract from the archive.
The unzip_files() function with these changes is listed below.
# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip a batch of files
        handle.extractall(path=path, members=filenames)
Tying this together, the complete example is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with processes in batch
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor

# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip a batch of files
        handle.extractall(path=path, members=filenames)

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 8
        chunksize = round(len(files) / n_workers)
        # start the process pool
        with ProcessPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, zip_filename, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example extracts all 1,000 files from the archive as before.
In this case, we don’t see any significant speed improvement.
On my system, it takes about 2.0 seconds, compared to about 2.03 seconds when calling the ZipFile.extract() function manually for each file to decompress, which may be within the margin of measurement error.
Nevertheless, compared to the sequential version that took 8.36 seconds, this version is about 4.18x faster.
[Finished in 2.0s]
[Finished in 2.0s]
[Finished in 2.0s]
Unzip Files Concurrently with Processes and Threads in Batch
It is possible to separate the CPU-bound and IO-bound effort of each file unzip task.
We can decompress a file from the archive to main memory by calling the ZipFile.read() function, then save it to disk manually by creating a path and writing the bytes.
The decompression of each file in memory could be performed by processes and the writing of each file to disk could be performed by threads.
This could be achieved by first using a process pool to decompress batches of files from the archive then using a thread pool within each process to save the decompressed file data to disk.
There are a lot of moving parts and it is unclear whether this would offer a speed improvement. Nevertheless, it is an interesting exercise.
First, we can define a task to save data to disk to be executed in a thread pool.
The function takes the decompressed data, the filename and the directory in which to save the file. A relative path is constructed using the os.path.join() function, then the file is opened in binary write mode, the decompressed bytes are written to disk, and progress is reported.
The save_file() function below implements this.
# save file to disk
def save_file(data, filename, path):
    # create a path
    filepath = join(path, filename)
    # write to disk
    with open(filepath, 'wb') as file:
        file.write(data)
    # report progress
    print(f'.unzipped {filename}', flush=True)
Next, the unzip_files() function needs to be updated to first decompress each file, then issue a save task to a thread pool.
First, we must create a thread pool. We will use 10 workers per pool in this case, chosen arbitrarily. Recall, each of the eight processes will have its own thread pool.
...
# create a thread pool
with ThreadPoolExecutor(10) as exe:
    # ...
We can then iterate over the list of files that the process is responsible for unzipping and decompressing each and submit the data to the thread pool for saving.
Because the thread pool is within the process, no inter-process communication is required, avoiding additional computational overhead, such as if the main thread/process were to save the decompressed files.
...
# unzip each file
for filename in filenames:
    # decompress data
    data = handle.read(filename)
    # save to disk
    _ = exe.submit(save_file, data, filename, path)
The updated version of the unzip_files() function with these changes is listed below.
# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # create a thread pool
        with ThreadPoolExecutor(10) as exe:
            # unzip each file
            for filename in filenames:
                # decompress data
                data = handle.read(filename)
                # save to disk
                _ = exe.submit(save_file, data, filename, path)
And that is it.
No other changes from the previous example for batch unzipping with processes are required.
The complete example is listed below.
It is more complicated than the previous example, so let’s review what is happening. First, we split the list of files into chunks and send each chunk to a separate process for extraction. Each process then starts a thread pool, extracts the data for each file and throws tasks into the thread pool to save the decompressed data for each file.
The hope is that decompressing data is faster than saving data to disk so that the CPU can get through the decompression steps fast and then wait on the threads to save all the files.
# SuperFastPython.com
# unzip a large number of files concurrently with processes and threads in batch
from os import makedirs
from os.path import join
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import ThreadPoolExecutor

# save file to disk
def save_file(data, filename, path):
    # create a path
    filepath = join(path, filename)
    # write to disk
    with open(filepath, 'wb') as file:
        file.write(data)
    # report progress
    print(f'.unzipped {filename}', flush=True)

# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # create a thread pool
        with ThreadPoolExecutor(10) as exe:
            # unzip each file
            for filename in filenames:
                # decompress data
                data = handle.read(filename)
                # save to disk
                _ = exe.submit(save_file, data, filename, path)

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # create the target directory
    makedirs(path, exist_ok=True)
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 8
        chunksize = round(len(files) / n_workers)
        # start the process pool
        with ProcessPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, zip_filename, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example decompresses all 1,000 data files to the tmp/ sub-directory as with all previous examples and reports progress.
...
.unzipped data-0747.csv
.unzipped data-0998.csv
.unzipped data-0623.csv
.unzipped data-0873.csv
.unzipped data-0374.csv
.unzipped data-0748.csv
.unzipped data-0999.csv
.unzipped data-0624.csv
.unzipped data-0874.csv
.unzipped data-0749.csv
In this case, we may see a small additional speed increase.
On my system, it takes about 1.83 seconds to complete, compared to about 8.36 seconds for the sequential version. That is about 4.57x faster.
[Finished in 1.8s]
[Finished in 1.8s]
[Finished in 1.9s]
Unzip Files Concurrently with AsyncIO
We can also unzip files concurrently using asyncio.
Generally, Python does not support non-blocking IO operations when working with files. It is not provided in the asyncio module. This is because it is challenging to implement in a general cross-platform manner.
The third-party library aiofiles provides file operations for use in asyncio operations, but again, the operations are not true non-blocking IO and instead the concurrency is simulated using thread pools.
Nevertheless, we can use aiofiles in an asyncio program to save files concurrently. It does not provide an asynchronous version of the ZipFile class.
You can install the aiofiles library using your Python package manager, such as pip:
pip3 install aiofiles
Given that the aiofiles library does not provide asyncio support for ZipFile, we can develop an example that first decompresses each file in a blocking manner, then saves the decompressed data to disk using (simulated) non-blocking IO.
First, we can develop a function to extract data from the zip file and save data to disk using aiofiles.
The data can be extracted from the zip file, which requires a blocking read operation.
...
# decompress data, blocking
data = handle.read(filename)
Next, we can construct the output path for saving the file using the os.path.join() function.
...
# create a path
filepath = join(path, filename)
We can then open the file using aiofiles.open() and write the data to disk, yielding control until the task is completed.
...
# write to disk
async with aiofiles.open(filepath, 'wb') as fd:
    await fd.write(data)
The extract_and_save() function below implements this.
# extract a file from the archive and save to disk
async def extract_and_save(handle, filename, path):
    # decompress data, blocking
    data = handle.read(filename)
    # create a path
    filepath = join(path, filename)
    # write to disk
    async with aiofiles.open(filepath, 'wb') as fd:
        await fd.write(data)
    # report progress
    print(f'.unzipped {filename}')
Finally, we can update the main() function.
We can create the directory in which all files will be saved using the os.makedirs() function.
...
# create the target directory
makedirs(path, exist_ok=True)
After opening the ZipFile for unzipping, we can create a list of coroutines, each with a call to our extract_and_save() function for each filename in the zip archive.
This can be achieved using a list comprehension.
...
# create coroutines
tasks = [extract_and_save(handle, name, path) for name in handle.namelist()]
Recall that this will create a list of coroutines and will not call the extract_and_save() function.
We can execute all of the coroutines and await their completion using the asyncio.gather() function. We can unpack the coroutines for the gather function using the * operator, for example:
...
# execute all tasks
_ = await asyncio.gather(*tasks)
Finally, we can start the asyncio runtime and call the main() coroutine to begin the event loop and start processing coroutines.
# entry point
if __name__ == '__main__':
    asyncio.run(main())
Tying this together, the complete example of concurrently unzipping files using asyncio is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with asyncio
from os import makedirs
from os.path import join
from zipfile import ZipFile
import asyncio
import aiofiles

# extract a file from the archive and save to disk
async def extract_and_save(handle, filename, path):
    # decompress data, blocking
    data = handle.read(filename)
    # create a path
    filepath = join(path, filename)
    # write to disk
    async with aiofiles.open(filepath, 'wb') as fd:
        await fd.write(data)
    # report progress
    print(f'.unzipped {filename}')

# unzip a large number of files
async def main(path='tmp'):
    # create the target directory
    makedirs(path, exist_ok=True)
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # create coroutines
        tasks = [extract_and_save(handle, name, path) for name in handle.namelist()]
        # execute all tasks
        _ = await asyncio.gather(*tasks)

# entry point
if __name__ == '__main__':
    asyncio.run(main())
Running the example unzips all files to the tmp/ subdirectory as before.
...
.unzipped data-0987.csv
.unzipped data-0994.csv
.unzipped data-0989.csv
.unzipped data-0993.csv
.unzipped data-0995.csv
.unzipped data-0998.csv
.unzipped data-0999.csv
.unzipped data-0992.csv
.unzipped data-0997.csv
.unzipped data-0996.csv
In this case, we don’t see any speed benefit; in fact, the asyncio version performs much worse than the multithreaded versions.
On my system, the example takes about 7.6 seconds, compared to about 8.36 seconds with the sequential version. This is about 1.10x faster.
[Finished in 7.9s]
[Finished in 7.6s]
[Finished in 7.5s]
Unzip Files Concurrently with AsyncIO in Batch
Perhaps we can reduce some of the overhead in unzipping files concurrently with asyncio by performing tasks in batch.
In the previous example, each coroutine performed a single IO-bound operation. We can update the extract_and_save() function to process multiple files in a batch as we did in previous batch processing examples.
This will reduce the number of coroutines in the asyncio runtime and may reduce the amount of data moving around.
The updated version of the task function that will process a list of files is listed below. The task will yield control for each file write operation, but only to other coroutines. The for-loop within a task will not progress while it is waiting.
# extract files from the archive and save to disk
async def extract_and_save(handle, filenames, path):
    # extract each file
    for filename in filenames:
        # decompress data, blocking
        data = handle.read(filename)
        # create a path
        filepath = join(path, filename)
        # write to disk
        async with aiofiles.open(filepath, 'wb') as fd:
            await fd.write(data)
        # report progress
        print(f'.unzipped {filename}')
Next, we can prepare a list of all files we expect to unzip from the archive.
...
# determine all of the files to extract
files = handle.namelist()
We can then choose the number of coroutines we wish to use and then divide the total number of files up evenly into chunks.
We will use 100 in this case, chosen arbitrarily.
...
# determine chunksize
n_workers = 100
chunksize = round(len(files) / n_workers)
We can then iterate over the chunks and add each coroutine to a list for execution at the end.
...
# split the unzip operations into chunks
tasks = list()
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # add coroutine to list of tasks
    tasks.append(extract_and_save(handle, filenames, path))
Finally, we can execute all 100 coroutines concurrently.
...
# execute and wait for all coroutines
await asyncio.gather(*tasks)
Tying this together, a complete example of batch file unzipping using asyncio is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with asyncio in batch
from os import makedirs
from os.path import join
from zipfile import ZipFile
import asyncio
import aiofiles

# extract files from the archive and save to disk
async def extract_and_save(handle, filenames, path):
    # extract each file
    for filename in filenames:
        # decompress data, blocking
        data = handle.read(filename)
        # create a path
        filepath = join(path, filename)
        # write to disk
        async with aiofiles.open(filepath, 'wb') as fd:
            await fd.write(data)
        # report progress
        print(f'.unzipped {filename}')

# unzip a large number of files
async def main(path='tmp'):
    # create the target directory
    makedirs(path, exist_ok=True)
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # determine all of the files to extract
        files = handle.namelist()
        # determine chunksize
        n_workers = 100
        chunksize = round(len(files) / n_workers)
        # split the unzip operations into chunks
        tasks = list()
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # add coroutine to list of tasks
            tasks.append(extract_and_save(handle, filenames, path))
        # execute and wait for all coroutines
        await asyncio.gather(*tasks)

# entry point
if __name__ == '__main__':
    asyncio.run(main())
Running the example unzips all files to the tmp/ subdirectory as before.
.unzipped data-0769.csv
.unzipped data-0819.csv
.unzipped data-0989.csv
.unzipped data-0579.csv
.unzipped data-0959.csv
.unzipped data-0969.csv
.unzipped data-0879.csv
.unzipped data-0999.csv
.unzipped data-0979.csv
.unzipped data-0949.csv
In this case we see a modest speed improvement over the non-batch asyncio version of unzipping files.
On my system, the example takes about 6.4 seconds to complete, compared to 8.36 seconds for the sequential version. That is about 1.31x faster.
[Finished in 6.4s]
[Finished in 6.4s]
[Finished in 6.4s]
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Python File I/O APIs
- Built-in Functions
- os - Miscellaneous operating system interfaces
- os.path - Common pathname manipulations
- shutil - High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
Takeaways
In this tutorial, you discovered how to unzip a large zip file of many data files concurrently in Python.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Jan Derk says
Thanks for your very useful website.
One small remark: In the unzip batch code there is a line:
chunksize = round(len(files) / n_workers)
If the n_workers is more than twice the size of len(files) the chunksize becomes 0.
It’s better to use ceil() or to make sure chunksize is never zero.
Jason Brownlee says
You’re welcome.
Great tip, thank you!
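For readers who want to guard against that edge case, one possible fix (not part of the listings above) is to clamp the chunksize so it is never less than one:

...
# determine chunksize, ensuring it is never zero
n_workers = 100
chunksize = max(1, round(len(files) / n_workers))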
Dirk Heinrichs says
Hi,
nice tutorial, thanks a lot.
I’m trying to unzip large zips (>2G) on Windows using the ThreadPoolExecutor method, but when it has finished, several files were not extracted. It works if I set the number of threads to 1, but that’s certainly not what I want. It seems to work just fine on Linux, though.
Do you have any idea what could be the problem on Windows?
Thanks in advance…
Jason Brownlee says
Fascinating. I have not seen a problem like this.
Perhaps confirm the version of Python matches between the two systems?
Perhaps you can try two manual threading.Thread instances instead and see if it still fails?
Perhaps you can devise the simplest possible version of the program and run it on both systems?
Perhaps you can try searching or posting on StackOverflow to see if anyone else has had a similar problem?
Let me know how you go.