Last Updated on August 21, 2023
Unzipping a large zip file is typically slow.
It can become painfully slow when you need to unzip thousands of files to disk. The hope is that multithreading will speed up the unzipping operation.
In this tutorial, you will explore how to unzip large zip files containing many data files using concurrency.
Let’s dive in.
Create a Large Zip File
Before we can explore multithreaded file unzipping, we need a large zip file that contains many files.
We can create this file ourselves.
We can create .CSV data files full of random numbers then add them to one large zip file.
First, we can define a function that creates one line of ten random numeric data values separated by commas.
The generate_line() function below implements this.
# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])
Next, we can call this function 10,000 times (i.e. 10,000 lines) to create a CSV data file that is about 2 megabytes when unzipped.
The generate_file_data() function below implements this, returning a string that represents the content of a CSV data file full of random numbers.
# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)
Finally, we can create a zip file and fill it with 1,000 different CSV data files.
First, we can open the zip file, which we will call 'testing.zip', stored in our current working directory. This can be achieved using the zipfile.ZipFile class with a context manager to ensure the file is closed once we are finished with it.
...
# create the zip file
with ZipFile('testing.zip', 'w', compression=ZIP_DEFLATED) as handle:
    # ...
We can then iterate 1,000 times and in each iteration generate the data, specify a unique filename and store the data against the filename in the zip file.
...
# generate data
data = generate_file_data()
# create filenames
filename = f'data-{i:04d}.csv'
# save data file
handle.writestr(filename, data)
# report progress
print(f'.added {filename}')
Tying these elements together, the complete example of creating the large zip file full of smaller CSV files is listed below.
# SuperFastPython.com
# create a zip file containing many files
from random import random
from zipfile import ZipFile
from zipfile import ZIP_DEFLATED

# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])

# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)

# create a zip file
def main(num_files=1000):
    # create the zip file
    with ZipFile('testing.zip', 'w', compression=ZIP_DEFLATED) as handle:
        # create some files
        for i in range(num_files):
            # generate data
            data = generate_file_data()
            # create filenames
            filename = f'data-{i:04d}.csv'
            # save data file
            handle.writestr(filename, data)
            # report progress
            print(f'.added {filename}')
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example will create a single zip file named 'testing.zip' in your current working directory.
Progress is reported along the way and a “Done” message is reported at the end of the run.
...
.added data-0990.csv
.added data-0991.csv
.added data-0992.csv
.added data-0993.csv
.added data-0994.csv
.added data-0995.csv
.added data-0996.csv
.added data-0997.csv
.added data-0998.csv
.added data-0999.csv
Done
It may take a while depending on the speed of your machine.
On my system it takes 210.9 seconds to complete and the resulting zip file is about 847 megabytes in size.
testing.zip
This provides an excellent case study to see if we can use multithreading to speed up the decompression of a large zip file archive that contains thousands of smaller files.
Before we explore multithreading, let’s write a program to unzip the archive sequentially.
Unzip the Files One-by-One (slowly)
Unzipping a file with Python is straightforward.
We can open the archive using a context manager, as we did when creating the zip file, then call the ZipFile.extractall() function and specify a path.
For example:
...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # extract all files
    handle.extractall(path)
That’s all there is to it.
The complete example of unzipping the 847 megabyte zip file containing 1,000 data files is listed below.
It has been configured to unzip all files into a tmp/ sub-directory under the current working directory, to avoid clutter.
# SuperFastPython.com
# unzip a large number of files sequentially
from zipfile import ZipFile

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # extract all files
        handle.extractall(path)

# entry point
if __name__ == '__main__':
    main()
Running the example will open the 'testing.zip' archive and will uncompress all data files into a tmp/ sub-directory.
It is reasonably fast, depending on the speed of your hardware (e.g. hard drive).
On my system, it takes about 8.0 seconds.
[Finished in 8.1s]
[Finished in 8.0s]
[Finished in 7.9s]
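The timings reported throughout this tutorial come from my editor (the "[Finished in ...]" lines). If you want to measure the runs yourself, one simple option is to wrap the call to main() with time.perf_counter(). This is a minimal sketch of that idea (not part of the original listings), shown here against the sequential version:

# time the sequential unzip (optional timing harness)
from time import perf_counter
from zipfile import ZipFile

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # extract all files
        handle.extractall(path)

# entry point
if __name__ == '__main__':
    # record the start time
    start = perf_counter()
    main()
    # report the overall duration
    print(f'Took {perf_counter() - start:.1f} seconds')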
Looking in the tmp/ subdirectory shows 1,000 data files, each about 2 megabytes in size.
...
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0990.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0991.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0992.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0993.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0994.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0995.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0996.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0997.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0998.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:19 data-0999.csv
We can also unzip the files one by one by calling the ZipFile.extract() function for each filename in the archive.
We can get a list of filenames in the archive by calling the ZipFile.namelist() function.
For example:
...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # iterate over all files in the archive
    for member in handle.namelist():
        # unzip a single file from the archive
        handle.extract(member, path)
        # report progress
        print(f'.unzipped {member}')
Tying this together, the complete example of extracting each file from the archive one at a time by name is listed below.
# SuperFastPython.com
# unzip a large number of files sequentially
from zipfile import ZipFile

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # iterate over all files in the archive
        for member in handle.namelist():
            # unzip a single file from the archive
            handle.extract(member, path)
            # report progress
            print(f'.unzipped {member}')

# entry point
if __name__ == '__main__':
    main()
Running the example does exactly the same thing as the previous example, that is, it extracts all files into a tmp/ sub-directory.
It also takes about the same amount of time to run, about 8.36 seconds on my machine, perhaps slightly slower because of the print messages reporting progress.
[Finished in 8.3s]
[Finished in 8.4s]
[Finished in 8.4s]
How long does the serial version take to run on your machine?
Let me know in the comments below.
Next, let’s explore how we might use multithreading to speed up the unzipping process.
How to Unzip Files Concurrently
We can adapt the program to be multithreaded with very few changes.
Unzipping files has three elements:
- Load the compressed data into memory.
- Decompress data.
- Save the decompressed data to disk.
Decompressing data in memory is purely algorithmic and intuitively we might think it is CPU-bound. That is, it is limited by the speed of the CPU.
Saving the decompressed data to file is IO-bound as it is limited by the speed that we can move data from main memory onto the hard drive. Intuitively, we would expect that the IO-bound part of the task is slower than the CPU-bound part of the task. That is, we expect to spend more time waiting for the hard drive than waiting for the CPU.
One approach would be to call the ZipFile.extract() function directly to first decompress the data into memory then save the data to disk.
An alternate approach might be to first decompress the data into memory as a string using the ZipFile.read() function, then create a path and save the file to disk manually using Python disk IO functions. This might offer a benefit if we wish to explore possible speed-ups by separating the two elements of unzipping each file.
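As a rough illustration of this second approach, the snippet below reads each member into memory and then writes it to disk manually. This is only a sketch (it assumes the 'testing.zip' archive exists, that join is imported from os.path, and that a path output directory has already been created); we build on this idea in the process-and-thread example later in the tutorial.

...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # iterate over all files in the archive
    for member in handle.namelist():
        # decompress the member into memory (CPU-bound)
        data = handle.read(member)
        # save the decompressed bytes to disk manually (IO-bound)
        with open(join(path, member), 'wb') as file:
            file.write(data)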
Next, let’s start to explore the use of threads to speed up file unzipping and saving.
Unzip Files Concurrently With Threads
The ThreadPoolExecutor provides a pool of worker threads that we can use.
Each file can be submitted as a task to the thread pool and worker threads can perform the task of unzipping the file and saving it to disk.
First, we can create the thread pool with 100 worker threads.
We will use the context manager to ensure the thread pool is closed once we are finished with it.
...
# create the thread pool
with ThreadPoolExecutor(100) as exe:
    # ...
We can then call the submit() function to issue a task to the worker threads for each file in the archive to unzip.
Each call to the ZipFile.extract() function can be a separate concurrent task, passing the name of the member to unzip and the path location on disk where to save the uncompressed data.
For example:
...
# unzip each file from the archive
_ = [exe.submit(handle.extract, m, path) for m in handle.namelist()]
It appears that reading the compressed data from the archive is thread safe in the ZipFile class, although there is no mention of this in the API documentation at the time of writing (Python 3.10).
Tying this together, the complete example of using multiple worker threads to unzip the 847 megabyte file full of smaller data files is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently
from zipfile import ZipFile
from concurrent.futures import ThreadPoolExecutor

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # start the thread pool
        with ThreadPoolExecutor(100) as exe:
            # unzip each file from the archive
            _ = [exe.submit(handle.extract, m, path) for m in handle.namelist()]

# entry point
if __name__ == '__main__':
    main()
Running the example unzips the archive into a tmp/ local sub-directory, just like the serial case.
Checking the tmp/ subdirectory shows 1,000 data files, each about 2 megabytes in size.
...
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0990.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0991.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0992.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0993.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0994.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0995.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0996.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0997.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0998.csv
-rw-r--r-- 1 jasonb staff 1.8M 20 Dec 10:29 data-0999.csv
In this case, we see a large speedup by unzipping files concurrently.
On my system, the example takes about 3.16 seconds, compared to about 8 seconds by unzipping the files sequentially. That is about 2.53x faster.
[Finished in 3.2s]
[Finished in 3.1s]
[Finished in 3.2s]
We can adapt the multithreaded unzip example to use a separate target function to unzip a single file.
This may provide more control over the unzip process which we may require later in more advanced examples.
We can define a function that takes the zip file handle, the name of the member to unzip, and the path where to write the file, then unzips the file and reports progress.
The unzip_file() function below implements this.
# unzip a file from an archive
def unzip_file(handle, filename, path):
    # unzip the file
    handle.extract(filename, path)
    # report progress
    print(f'.unzipped {filename}')
We can then call this function for each member in the zip file using the thread pool.
The complete example is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with threads
from zipfile import ZipFile
from concurrent.futures import ThreadPoolExecutor

# unzip a file from an archive
def unzip_file(handle, filename, path):
    # unzip the file
    handle.extract(filename, path)
    # report progress
    print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # start the thread pool
        with ThreadPoolExecutor(100) as exe:
            # unzip each file from the archive
            _ = [exe.submit(unzip_file, handle, m, path) for m in handle.namelist()]

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all 1,000 files as before, although now reports progress.
...
.unzipped data-0986.csv
.unzipped data-0989.csv
.unzipped data-0995.csv
.unzipped data-0987.csv
.unzipped data-0996.csv
.unzipped data-0997.csv
.unzipped data-0993.csv
.unzipped data-0994.csv
.unzipped data-0999.csv
.unzipped data-0998.csv
The example is functionally equivalent to the previous version, although it makes use of a custom target task function. Therefore, it should take about the same amount of time to complete.
On my system, the example takes about 3.4 seconds, compared to about 8.36 seconds for the sequential version. That is about 2.46x faster.
[Finished in 3.4s]
[Finished in 3.4s]
[Finished in 3.4s]
Unzip Files Concurrently with Threads in Batch
Each task in the thread pool adds overhead that slows down the overall task.
This includes function calls and object creation for each task, occurring internally within the ThreadPoolExecutor class.
We might reduce this overhead by performing file operations in a batch within each worker thread.
This can be achieved by updating the unzip_file() function to receive a list of files to unzip, and splitting up the files in the main() function into chunks to be submitted to worker threads for batch processing.
The hope is that this would reduce some of the overhead required for submitting so many small tasks by replacing it with a few larger batch tasks.
First, we can change the target function to receive a list of file names to unzip, listed below.
# unzip files from an archive
def unzip_files(handle, filenames, path):
    # unzip multiple files
    for filename in filenames:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')
Next, in the main() function we can select a chunksize based on the number of worker threads and the number of files to unzip.
First, we must create a list of all files we wish to unzip by calling the namelist() function on the zip file.
In this case, we will use 100 worker threads with 1,000 files to unzip. Dividing the files evenly between workers (1,000 / 100) gives 10 files to unzip per worker.
...
# determine chunksize
n_workers = 100
chunksize = round(len(files) / n_workers)
Next, we can create the thread pool with the parameterized number of worker threads.
...
# create the thread pool
with ThreadPoolExecutor(n_workers) as exe:
    ...
Next, we can iterate over the list of files to unzip and split them into chunks defined by a chunksize.
This can be achieved by iterating over the list of files and using the chunksize as the increment size in a for-loop. We can then split off a chunk of files to send to a worker thread as a task.
...
# split the unzip operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit the batch unzip task
    _ = exe.submit(unzip_files, handle, filenames, path)
And that’s it.
The complete example of unzipping files in a batch concurrently using worker threads is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with threads in batch
from zipfile import ZipFile
from concurrent.futures import ThreadPoolExecutor

# unzip files from an archive
def unzip_files(handle, filenames, path):
    # unzip multiple files
    for filename in filenames:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp'):
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 100
        chunksize = round(len(files) / n_workers)
        # start the thread pool
        with ThreadPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, handle, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all 1,000 files to the tmp/ subdirectory as before.
...
.unzipped data-0389.csv
.unzipped data-0089.csv
.unzipped data-0789.csv
.unzipped data-0889.csv
.unzipped data-0659.csv
.unzipped data-0979.csv
.unzipped data-0109.csv
.unzipped data-0359.csv
.unzipped data-0969.csv
.unzipped data-0679.csv
In this case, we don’t see much additional benefit over the non-batch version in terms of reduced execution time.
On my system it takes about 3.3 seconds to complete, compared to about 8.36 seconds for the sequential version. That is about 2.53x faster.
It is slightly faster than the non-batch version that took about 3.4 seconds, although the difference may be within the margin of measurement error.
[Finished in 3.3s]
[Finished in 3.4s]
[Finished in 3.3s]
Unzip Files Concurrently with Processes
We can also try to unzip files concurrently with processes instead of threads.
It is unclear whether processes can offer a speed benefit in this case. Given that we can get a benefit using threads, we know it is possible. Nevertheless, using processes requires data for each task to be serialized which introduces additional overhead that might negate any speed-up from performing file operations with true parallelism via processes.
We can explore using processes to unzip files concurrently using the ProcessPoolExecutor.
Unlike the ThreadPoolExecutor, we cannot simply send the handle to the open zip file to each worker process as the ZipFile class cannot be serialized.
Instead, each process must open the zip archive and extract files separately.
First, we can update the unzip_file() function to take the name of the zip file instead of the file handle. The function then opens the zip file before unzipping a single file to the destination directory.
The updated version of the unzip_file() function is listed below.
# unzip a file from an archive
def unzip_file(zip_filename, filename, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')
Next, we must open the zip file in the main thread and get a list of all files to unzip.
...
# open the zip file
with ZipFile('testing.zip', 'r') as handle:
    # get a list of files to unzip
    files = handle.namelist()
We can then submit each filename as a task to the process pool.
In this case, we will use 8 workers in the pool because I have 8 logical CPU cores. You may want to adapt this for the number of logical CPU cores in your system or try different configurations in order to discover what works well.
...
# start the process pool
with ProcessPoolExecutor(8) as exe:
    # unzip each file from the archive
    _ = [exe.submit(unzip_file, zip_filename, name, path) for name in files]
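If you would rather size the pool to match your machine, one option is to query the number of logical CPU cores with os.cpu_count() instead of hard-coding 8. This is an optional variation (it assumes os is imported); the listings below keep the fixed value of 8 workers.

...
# size the pool based on the number of logical CPU cores (fall back to 8)
n_workers = os.cpu_count() or 8
# start the process pool
with ProcessPoolExecutor(n_workers) as exe:
    # unzip each file from the archive
    _ = [exe.submit(unzip_file, zip_filename, name, path) for name in files]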
And that’s it.
The complete example of unzipping files concurrently using processes is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with processes
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor

# unzip a file from an archive
def unzip_file(zip_filename, filename, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip the file
        handle.extract(filename, path)
        # report progress
        print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # get a list of files to unzip
        files = handle.namelist()
        # start the process pool
        with ProcessPoolExecutor(8) as exe:
            # unzip each file from the archive
            _ = [exe.submit(unzip_file, zip_filename, name, path) for name in files]

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all files from the archive as before.
...
.unzipped data-0928.csv
.unzipped data-0937.csv
.unzipped data-0944.csv
.unzipped data-0951.csv
.unzipped data-0960.csv
.unzipped data-0968.csv
.unzipped data-0976.csv
.unzipped data-0984.csv
.unzipped data-0992.csv
.unzipped data-0999.csv
In this case, using a process pool offers a benefit over the single-threaded version, although achieves about the same performance as the multithreaded example.
On my system, the example takes about 3.26 seconds to complete, compared to about 8.36 seconds for the single-threaded version. That is about 2.56x faster.
[Finished in 3.2s]
[Finished in 3.3s]
[Finished in 3.3s]
Unzip Files Concurrently with Processes in Batch
Opening the zip file for each of the 1,000 files to unzip likely results in too much overhead.
Instead, it may be more efficient to open the zip file once per process, then unzip a subset of the files in batch.
Therefore, we can adapt the example from the previous section to unzip multiple files per task, similar to the multithreaded batch example we developed previously.
First, we can update the unzip_files() function to take a list of files to unzip.
# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip multiple files
        for filename in filenames:
            # unzip the file
            handle.extract(filename, path)
            # report progress
            print(f'.unzipped {filename}')
We can then split the list of files into chunks.
In this case, we will use 8 processes and split the 1,000 files to unzip evenly giving 125 files to unzip per process. It may be interesting to explore different divisions of work among the processes.
...
# determine chunksize
n_workers = 8
chunksize = round(len(files) / n_workers)
Finally, we can iterate the list of files to unzip in chunks and send each into the process pool as a task, much like we did in the multithreaded batch unzipping example.
...
# split the unzip operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit the batch unzip task
    _ = exe.submit(unzip_files, zip_filename, filenames, path)
Tying this together, the complete example of unzipping files concurrently using the process pool in batch is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with processes in batch
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor

# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip multiple files
        for filename in filenames:
            # unzip the file
            handle.extract(filename, path)
            # report progress
            print(f'.unzipped {filename}')

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 8
        chunksize = round(len(files) / n_workers)
        # start the process pool
        with ProcessPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, zip_filename, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example unzips all files from the archive as before.
...
.unzipped data-0865.csv
.unzipped data-0866.csv
.unzipped data-0867.csv
.unzipped data-0868.csv
.unzipped data-0869.csv
.unzipped data-0870.csv
.unzipped data-0871.csv
.unzipped data-0872.csv
.unzipped data-0873.csv
.unzipped data-0874.csv
In this case, we see a benefit over all previous examples tried so far.
On my system, the example takes about 2.03 seconds, compared to about 8.36 seconds for the single-threaded version. That is about 4.12x faster.
[Finished in 2.1s]
[Finished in 2.0s]
[Finished in 2.0s]
Interestingly, the ZipFile.extractall() function allows a subset of files to be extracted in batch.
We could call this function from each process for a chunk of files. This requires fewer lines of code and could be slightly more efficient, although it means we cannot report progress along the way.
The unzip_files() function can be updated to call the ZipFile.extractall() function directly, specifying the directory in which the files are to be extracted and a list of names of files to extract from the archive.
The unzip_files() function with these changes is listed below.
# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip a batch of files
        handle.extractall(path=path, members=filenames)
Tying this together, the complete example is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with processes in batch
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor

# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # unzip a batch of files
        handle.extractall(path=path, members=filenames)

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 8
        chunksize = round(len(files) / n_workers)
        # start the process pool
        with ProcessPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, zip_filename, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example extracts all 1,000 files from the archive as before.
In this case, we don’t see any significant speed improvement.
On my system, it takes about 2.0 seconds, compared to about 2.03 seconds when calling the ZipFile.extract() function manually for each file to decompress, which may be within the margin of measurement error.
Nevertheless, compared to the sequential version that took 8.36 seconds, this version is about 4.18x faster.
[Finished in 2.0s]
[Finished in 2.0s]
[Finished in 2.0s]
Unzip Files Concurrently with Processes and Threads in Batch
It is possible to separate the CPU-bound and IO-bound effort of each file unzip task.
We can decompress a file from the archive to main memory by calling the ZipFile.read() function, then save it to disk manually by creating a path and writing the bytes.
The decompression of each file in memory could be performed by processes and the writing of each file to disk could be performed by threads.
This could be achieved by first using a process pool to decompress batches of files from the archive then using a thread pool within each process to save the decompressed file data to disk.
There are a lot of moving parts and it is unclear whether this would offer a speed improvement. Nevertheless, it is an interesting exercise.
First, we can define a task to save data to disk to be executed in a thread pool.
The function takes the decompressed data, the filename and the directory in which to save the file. A relative path is constructed using the os.path.join() function, then the file is opened in binary write mode, the decompressed bytes are written to disk, and progress is reported.
The save_file() function below implements this.
# save file to disk
def save_file(data, filename, path):
    # create a path
    filepath = join(path, filename)
    # write to disk
    with open(filepath, 'wb') as file:
        file.write(data)
    # report progress
    print(f'.unzipped {filename}', flush=True)
Next, the unzip_files() function needs to be updated to first decompress each file, then issue a save task to a thread pool.
First, we must create a thread pool. We will use 10 workers per pool in this case, chosen arbitrarily. Recall, each of the eight processes will have its own thread pool.
...
# create a thread pool
with ThreadPoolExecutor(10) as exe:
    # ...
We can then iterate over the list of files that the process is responsible for unzipping and decompressing each and submit the data to the thread pool for saving.
Because the thread pool is within the process, no inter-process communication is required, avoiding additional computational overhead, such as if the main thread/process were to save the decompressed files.
...
# unzip each file
for filename in filenames:
    # decompress data
    data = handle.read(filename)
    # save to disk
    _ = exe.submit(save_file, data, filename, path)
The updated version of the unzip_files() function with these changes is listed below.
# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # create a thread pool
        with ThreadPoolExecutor(10) as exe:
            # unzip each file
            for filename in filenames:
                # decompress data
                data = handle.read(filename)
                # save to disk
                _ = exe.submit(save_file, data, filename, path)
And that is it.
No other changes from the previous example for batch unzipping with processes are required.
The complete example is listed below.
It is more complicated than the previous example, so let’s review what is happening. First, we split the list of files into chunks and send each chunk to a separate process for extraction. Each process then starts a thread pool, extracts the data for each file and throws tasks into the thread pool to save the decompressed data for each file.
The hope is that decompressing data is faster than saving data to disk so that the CPU can get through the decompression steps fast and then wait on the threads to save all the files.
# SuperFastPython.com
# unzip a large number of files concurrently with processes and threads in batch
from os import makedirs
from os.path import join
from zipfile import ZipFile
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import ThreadPoolExecutor

# save file to disk
def save_file(data, filename, path):
    # create a path
    filepath = join(path, filename)
    # write to disk
    with open(filepath, 'wb') as file:
        file.write(data)
    # report progress
    print(f'.unzipped {filename}', flush=True)

# unzip files from an archive
def unzip_files(zip_filename, filenames, path):
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # create a thread pool
        with ThreadPoolExecutor(10) as exe:
            # unzip each file
            for filename in filenames:
                # decompress data
                data = handle.read(filename)
                # save to disk
                _ = exe.submit(save_file, data, filename, path)

# unzip a large number of files
def main(path='tmp', zip_filename='testing.zip'):
    # create the target directory
    makedirs(path, exist_ok=True)
    # open the zip file
    with ZipFile(zip_filename, 'r') as handle:
        # list of all files to unzip
        files = handle.namelist()
        # determine chunksize
        n_workers = 8
        chunksize = round(len(files) / n_workers)
        # start the process pool
        with ProcessPoolExecutor(n_workers) as exe:
            # split the unzip operations into chunks
            for i in range(0, len(files), chunksize):
                # select a chunk of filenames
                filenames = files[i:(i + chunksize)]
                # submit the batch unzip task
                _ = exe.submit(unzip_files, zip_filename, filenames, path)

# entry point
if __name__ == '__main__':
    main()
Running the example decompresses all 1,000 data files to the tmp/ sub-directory as with all previous examples and reports progress.
...
.unzipped data-0747.csv
.unzipped data-0998.csv
.unzipped data-0623.csv
.unzipped data-0873.csv
.unzipped data-0374.csv
.unzipped data-0748.csv
.unzipped data-0999.csv
.unzipped data-0624.csv
.unzipped data-0874.csv
.unzipped data-0749.csv
In this case, we may see a small additional speed increase.
On my system, it takes about 1.83 seconds to complete, compared to about 8.36 seconds for the sequential version. That is about 4.57x faster.
[Finished in 1.8s]
[Finished in 1.8s]
[Finished in 1.9s]
Unzip Files Concurrently with AsyncIO
We can also unzip files concurrently using asyncio.
Generally, Python does not support non-blocking IO operations when working with files. It is not provided in the asyncio module. This is because it is challenging to implement in a general cross-platform manner.
The third-party library aiofiles provides file operations for use in asyncio operations, but again, the operations are not true non-blocking IO and instead the concurrency is simulated using thread pools.
Nevertheless, we can use aiofiles in an asyncio program to save files concurrently. It does not provide an asynchronous version of the ZipFile class.
You can install the aiofiles library using your Python package manager, such as pip:
pip3 install aiofiles
Given that the aiofiles library does not provide asyncio support for ZipFile, we can develop an example that first decompresses each file in a blocking manner, then saves the decompressed data to disk using (simulated) non-blocking IO.
First, we can develop a function to extract data from the zip file and save data to disk using aiofiles.
The data can be extracted from the zip file, which requires a blocking read operation.
...
# decompress data, blocking
data = handle.read(filename)
Next, we can construct the output path for saving the file using the os.path.join() function.
...
# create a path
filepath = join(path, filename)
We can then open the file using aiofiles.open() and write the data to disk, yielding control until the task is completed.
...
# write to disk
async with aiofiles.open(filepath, 'wb') as fd:
    await fd.write(data)
The extract_and_save() function below implements this.
# extract a file from the archive and save to disk
async def extract_and_save(handle, filename, path):
    # decompress data, blocking
    data = handle.read(filename)
    # create a path
    filepath = join(path, filename)
    # write to disk
    async with aiofiles.open(filepath, 'wb') as fd:
        await fd.write(data)
    # report progress
    print(f'.unzipped {filename}')
Finally, we can update the main() function.
We can create the directory in which all files will be saved using the os.makedirs() function.
...
# create the target directory
makedirs(path, exist_ok=True)
After opening the ZipFile for unzipping, we can create a list of coroutines, each with a call to our extract_and_save() function for each filename in the zip archive.
This can be achieved using a list comprehension.
...
# create coroutines
tasks = [extract_and_save(handle, name, path) for name in handle.namelist()]
Recall that this will create a list of coroutines and will not call the extract_and_save() function.
We can execute all of the coroutines and await their completion using the asyncio.gather() function. We can unpack the coroutines for the gather function using the * operator, for example:
...
# execute all tasks
_ = await asyncio.gather(*tasks)
Finally, we can start the asyncio runtime and call the main() coroutine to begin the event loop and start processing coroutines.
# entry point
if __name__ == '__main__':
    asyncio.run(main())
Tying this together, the complete example of concurrently unzipping files using asyncio is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with asyncio
from os import makedirs
from os.path import join
from zipfile import ZipFile
import asyncio
import aiofiles

# extract a file from the archive and save to disk
async def extract_and_save(handle, filename, path):
    # decompress data, blocking
    data = handle.read(filename)
    # create a path
    filepath = join(path, filename)
    # write to disk
    async with aiofiles.open(filepath, 'wb') as fd:
        await fd.write(data)
    # report progress
    print(f'.unzipped {filename}')

# unzip a large number of files
async def main(path='tmp'):
    # create the target directory
    makedirs(path, exist_ok=True)
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # create coroutines
        tasks = [extract_and_save(handle, name, path) for name in handle.namelist()]
        # execute all tasks
        _ = await asyncio.gather(*tasks)

# entry point
if __name__ == '__main__':
    asyncio.run(main())
Running the example unzips all files to the tmp/ subdirectory as before.
...
.unzipped data-0987.csv
.unzipped data-0994.csv
.unzipped data-0989.csv
.unzipped data-0993.csv
.unzipped data-0995.csv
.unzipped data-0998.csv
.unzipped data-0999.csv
.unzipped data-0992.csv
.unzipped data-0997.csv
.unzipped data-0996.csv
In this case, we don’t see any speed benefit; in fact, the asyncio version performs much worse than the multithreaded versions.
On my system, the example takes about 7.6 seconds, compared to about 8.36 seconds with the sequential version. This is about 1.10x faster.
[Finished in 7.9s]
[Finished in 7.6s]
[Finished in 7.5s]
Unzip Files Concurrently with AsyncIO in Batch
Perhaps we can reduce some of the overhead in unzipping files concurrently with asyncio by performing tasks in batch.
In the previous example, each coroutine performed a single IO-bound operation. We can update the extract_and_save() function to process multiple files in a batch as we did in previous batch processing examples.
This will reduce the number of coroutines in the asyncio runtime and may reduce the amount of data moving around.
The updated version of the task function that will process a list of files is listed below. The task will yield control for each file write operation, but only to other coroutines. The for-loop within a task will not progress while it is waiting.
# extract files from the archive and save to disk
async def extract_and_save(handle, filenames, path):
    # extract each file
    for filename in filenames:
        # decompress data, blocking
        data = handle.read(filename)
        # create a path
        filepath = join(path, filename)
        # write to disk
        async with aiofiles.open(filepath, 'wb') as fd:
            await fd.write(data)
        # report progress
        print(f'.unzipped {filename}')
Next, we can prepare a list of all files we expect to unzip from the archive.
...
# determine all of the files to extract
files = handle.namelist()
We can then choose the number of coroutines we wish to use and then divide the total number of files up evenly into chunks.
We will use 100 in this case, chosen arbitrarily.
...
# determine chunksize
n_workers = 100
chunksize = round(len(files) / n_workers)
We can then iterate over the chunks and add each coroutine to a list for execution at the end.
...
# split the unzip operations into chunks
tasks = list()
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # add coroutine to list of tasks
    tasks.append(extract_and_save(handle, filenames, path))
Finally, we can execute all 100 coroutines concurrently.
...
# execute and wait for all coroutines
await asyncio.gather(*tasks)
Tying this together, a complete example of batch file unzipping using asyncio is listed below.
# SuperFastPython.com
# unzip a large number of files concurrently with asyncio in batch
from os import makedirs
from os.path import join
from zipfile import ZipFile
import asyncio
import aiofiles

# extract files from the archive and save to disk
async def extract_and_save(handle, filenames, path):
    # extract each file
    for filename in filenames:
        # decompress data, blocking
        data = handle.read(filename)
        # create a path
        filepath = join(path, filename)
        # write to disk
        async with aiofiles.open(filepath, 'wb') as fd:
            await fd.write(data)
        # report progress
        print(f'.unzipped {filename}')

# unzip a large number of files
async def main(path='tmp'):
    # create the target directory
    makedirs(path, exist_ok=True)
    # open the zip file
    with ZipFile('testing.zip', 'r') as handle:
        # determine all of the files to extract
        files = handle.namelist()
        # determine chunksize
        n_workers = 100
        chunksize = round(len(files) / n_workers)
        # split the unzip operations into chunks
        tasks = list()
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # add coroutine to list of tasks
            tasks.append(extract_and_save(handle, filenames, path))
        # execute and wait for all coroutines
        await asyncio.gather(*tasks)

# entry point
if __name__ == '__main__':
    asyncio.run(main())
Running the example unzips all files to the tmp/ subdirectory as before.
.unzipped data-0769.csv
.unzipped data-0819.csv
.unzipped data-0989.csv
.unzipped data-0579.csv
.unzipped data-0959.csv
.unzipped data-0969.csv
.unzipped data-0879.csv
.unzipped data-0999.csv
.unzipped data-0979.csv
.unzipped data-0949.csv
In this case we see a modest speed improvement over the non-batch asyncio version of unzipping files.
On my system, the example takes about 6.4 seconds to complete, compared to 8.36 seconds for the sequential version. That is about 1.31x faster.
[Finished in 6.4s]
[Finished in 6.4s]
[Finished in 6.4s]
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Python File I/O APIs
- Built-in Functions
- os - Miscellaneous operating system interfaces
- os.path - Common pathname manipulations
- shutil - High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
Takeaways
In this tutorial, you discovered how to unzip a large zip file of many data files concurrently in Python.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Jan Derk says
Thanks for your very useful website.
One small remark: In the unzip batch code there is a line:
chunksize = round(len(files) / n_workers)
If the n_workers is more than twice the size of len(files) the chunksize becomes 0.
It’s better to use ceil() or to make sure chunksize is never zero.
Jason Brownlee says
You’re welcome.
Great tip, thank you!
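For readers who want to guard against that edge case, one possible fix (not part of the listings above) is to clamp the chunksize so it is never less than one:

...
# determine chunksize, ensuring it is never zero
n_workers = 100
chunksize = max(1, round(len(files) / n_workers))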
Dirk Heinrichs says
Hi,
nice tutorial, thanks a lot.
I’m trying to unzip large zips (>2G) on Windows using the ThreadPoolExecutor method, but when it has finished, several files were not extracted. It works if I set the number of threads to 1, but that’s certainly not what I want. It seems to work just fine on Linux, though.
Do you have any idea what could be the problem on Windows?
Thanks in advance…
Jason Brownlee says
Fascinating. I have not seen a problem like this.
Perhaps confirm the version of Python matches between the two systems?
Perhaps you can try two manual threading.Thread instances instead and see if it still fails?
Perhaps you can devise the simplest possible version of the program and run it on both systems?
Perhaps you can try searching or posting on StackOverflow to see if anyone else has had a similar problem?
Let me know how you go.