How to Speed Up File IO with Concurrency in Python

January 6, 2022 Concurrent File I/O

You can speed up most file IO operations with concurrency in Python.

In this tutorial, you will discover specific techniques for speeding up the most common file IO operations using concurrency in Python.

Let's get started.

File IO is Slow

File IO refers to manipulating files on a hard disk.

This typically includes reading and writing data to files, but also includes a host of related operations such as renaming files, copying files, and deleting files. It may also refer to operations such as zipping files into common archive formats.

File IO might be one of the most common operations performed in Python programs, or programs generally. Files are where we store data, and programs need data to be useful.

Python provides a number of modules for manipulating files.

The most common Python file IO modules include the following:

- os and os.path: low-level file operations and path manipulation
- pathlib: object-oriented filesystem paths
- shutil: high-level file operations such as copying and moving
- zipfile: reading and writing ZIP archives
- tempfile: creating temporary files and directories

Manipulating files on disk is slow.

This is because file IO requires operating on data stored on a hard disk, and hard disks are significantly slower to access than main memory.

We might not notice how slow manipulating files is when working with just a few files. However, when you need to copy, move, or delete thousands or millions of files, the operations can be painfully slow.

How can we speed up file IO operations?

Speed Up File IO with Concurrency

Many file IO operations can be made faster using concurrency.

This means that if we have the same operation to perform on many files, such as renaming or deleting them, we can perform these operations at the same time from the perspective of the program.

This is because each operation is independent and the order in which the operations are performed does not matter.

Concurrent tasks may or may not be executed in parallel. Parallelism refers explicitly to the ability to execute tasks simultaneously, such as with multiple CPU cores.

The reference Python interpreter, CPython, provides four main modules for concurrency:

- multiprocessing: process-based concurrency
- threading: thread-based concurrency
- asyncio: coroutine-based concurrency
- concurrent.futures: pools of worker threads and worker processes

Only processes provide true parallelism in Python, that is, the ability to execute tasks simultaneously on multiple CPU cores. This is achieved by starting new child instances of the Python interpreter to execute tasks. Data sharing between processes is achieved via inter-process communication, in which data must be serialized at added computational cost.
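
As a minimal sketch of process-based parallelism using only the standard library (the task function and worker count here are illustrative, not from the experiments):

```python
from concurrent.futures import ProcessPoolExecutor

def task(value):
    # executed in a separate child Python process
    return value * value

if __name__ == "__main__":
    # the pool starts child interpreter processes; arguments and
    # results are serialized (pickled) between parent and children
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(task, range(4))))  # [0, 1, 4, 9]
```

The `if __name__ == "__main__"` guard matters here: on platforms that spawn child processes by re-importing the main module, unguarded pool creation would recurse.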

Python threads provide concurrency, but only limited parallelism. This is because of the Global Interpreter Lock (GIL), which keeps the Python interpreter thread-safe by allowing only a single thread to execute at any one time within a Python process. The GIL is released in some situations, such as when performing IO operations, allowing modest parallel execution when interacting with files, sockets, and devices.
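
For example, because the GIL is released during blocking writes, a pool of threads can create many small files concurrently. This is a minimal sketch using only the standard library; the write_file helper and worker count are illustrative:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_file(path, data):
    # the GIL is released while this call blocks on the disk,
    # so other worker threads can run at the same time
    with open(path, "w") as f:
        f.write(data)
    return path

# create ten small files concurrently in a temporary directory
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"file-{i}.txt") for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(write_file, paths, ["hello"] * len(paths)))
```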

Asynchronous I/O, or AsyncIO for short, is an implementation of asynchronous programming that provides the ability to create and run coroutines within an event loop. A coroutine is a type of routine that can be suspended and resumed, allowing the asyncio event loop to switch from one coroutine to another at each await keyword. The paradigm provides a limited set of non-blocking IO operations.
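
A minimal sketch of the coroutine model (the work and main coroutines are illustrative). Note that the standard library offers no truly asynchronous file operations, so real file workloads typically wrap blocking calls (e.g. via asyncio.to_thread) or use a third-party library:

```python
import asyncio

async def work(name):
    # a coroutine suspends at each await, letting the event loop
    # resume another coroutine in the meantime
    await asyncio.sleep(0.01)
    return name

async def main():
    # run three coroutines concurrently; gather preserves order
    return await asyncio.gather(work("a"), work("b"), work("c"))

print(asyncio.run(main()))  # ['a', 'b', 'c']
```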

Each of processes, threads, and asyncio has its own sweet spot.


So which type of concurrency should we use to speed up file IO?

Concurrency for File IO

Intuitively, we may think that threads, or perhaps asyncio, are best suited to speeding up file IO operations.

Sometimes this is the case.

Alternatively, we may think that hard disk drives are only able to perform one file IO operation at a time, and therefore file IO operations cannot be performed concurrently.

Again, sometimes this is the case.

Unfortunately, it is not obvious which file IO operations can be sped up with concurrency, nor which type of Python concurrency should be used. Further, it is possible for one approach to offer benefits on a specific operating system or type of hard drive, but not on another.

Therefore, it is critical that you perform experiments in order to discover what type of concurrency works well or best for your specific file IO operation and computer system.

I have performed systematic experiments on a modern operating system (macOS) with a solid-state drive (SSD) in order to discover how to speed up many common file IO operations.

The results are listed below and provide a guide that you can use as a starting point for speeding up file IO operations with concurrency on your own system.

For each file IO operation, we may consider a number of different concurrency set-ups:

- Thread: one worker thread per file
- Threads in Batch: files split into batches, one batch per worker thread
- Processes: one worker process per file
- Processes in Batch: files split into batches, one batch per worker process
- Processes with Threads: files split into batches across worker processes, with worker threads within each process
- AsyncIO: one coroutine per file
- AsyncIO in Batch: files split into batches, one batch per coroutine

Results are listed in terms of the speed-up offered over the sequential (non-concurrent) version of the same operation. The results are approximate and suggestive of the type of performance improvement you could achieve.

If no benefit was observed (within the margin of measurement error), then the result is listed as 1.0 (e.g. 1.0x the speed of the sequential version of the code). Any value above 1.0 demonstrates a benefit. If the concurrent version was slower than the sequential version, then the result is marked as "slower".

Not all combinations were trialled with each file IO operation. If a result is not available, it will be marked as "n/a".

Let's dive into the results.

Concurrent File Writing

Results for concurrently saving data to files on disk.

Method                  Result
Thread                  1.0x
Threads in Batch        n/a
Processes               2.96x
Processes in Batch      2.96x
Processes with Threads  2.96x
AsyncIO                 1.0x
AsyncIO in Batch        n/a

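The best result above came from processes. A hedged sketch of one file per worker process, using only the standard library (the save_file and save_all helpers and the worker count are illustrative):

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

def save_file(path, data):
    # runs in a child process; one file per task
    with open(path, "w") as f:
        f.write(data)
    return path

def save_all(paths, datas, workers=4):
    # dispatch one save task per file to a pool of worker processes
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(save_file, paths, datas))

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, f"data-{i}.txt") for i in range(8)]
    save_all(paths, ["x" * 1000] * len(paths))
    print(len(os.listdir(tmpdir)))  # 8
```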

Concurrent File Loading

Results for concurrently loading data files from disk into memory.

Method                  Result
Thread                  1.0x
Threads in Batch        1.0x
Processes               1.0x
Processes in Batch      slower
Processes with Threads  2.21x
AsyncIO                 1.13x
AsyncIO in Batch        slower

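The winning combination for loading was processes with threads. One way to sketch it (the load_chunk and load_all helpers and worker counts are illustrative) is to split the paths across worker processes and let each process overlap its reads with a small thread pool:

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def load_file(path):
    # a single blocking read; the GIL is released while waiting on the disk
    with open(path, "r") as f:
        return f.read()

def load_chunk(paths):
    # runs inside a child process: use threads to overlap the reads
    with ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(load_file, paths))

def load_all(paths, n_procs=4):
    # round-robin the paths into one chunk per worker process
    chunks = [paths[i::n_procs] for i in range(n_procs)]
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        results = []
        for chunk_result in executor.map(load_chunk, chunks):
            results.extend(chunk_result)
        return results

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i in range(8):
        p = os.path.join(tmpdir, f"f{i}.txt")
        with open(p, "w") as f:
            f.write(str(i))
        paths.append(p)
    print(sorted(load_all(paths)))
```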

Concurrent File Copying

Results for concurrently copying files.

Method                  Result
Thread                  2.24x
Threads in Batch        1.75x
Processes               1.64x
Processes in Batch      2.31x
Processes with Threads  n/a
AsyncIO                 n/a
AsyncIO in Batch        n/a

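A sketch of the thread-per-file approach, which performed well above, using shutil.copy from the standard library (the copy_all helper and worker count are illustrative):

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def copy_file(src, dst_dir):
    # shutil.copy blocks on disk IO, during which the GIL is released
    return shutil.copy(src, dst_dir)

def copy_all(srcs, dst_dir, workers=8):
    # one task per file submitted to a pool of worker threads
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(copy_file, srcs, [dst_dir] * len(srcs)))

# usage: copy every file from one temporary directory to another
src_dir, dst_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
for i in range(5):
    with open(os.path.join(src_dir, f"f{i}.txt"), "w") as f:
        f.write("data")
srcs = [os.path.join(src_dir, name) for name in os.listdir(src_dir)]
copied = copy_all(srcs, dst_dir)
print(len(copied))  # 5
```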

Concurrent File Moving

Results for concurrently moving files.

Method                  Result
Thread                  1.0x
Threads in Batch        1.63x
Processes               slower
Processes in Batch      1.38x
Processes with Threads  n/a
AsyncIO                 slower
AsyncIO in Batch        slower


Concurrent File Renaming

Results for concurrently renaming files.

Method                  Result
Thread                  1.41x
Threads in Batch        1.65x
Processes               slower
Processes in Batch      1.39x
Processes with Threads  1.65x
AsyncIO                 slower
AsyncIO in Batch        slower

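A sketch of the batched-threads set-up for renaming, using os.rename (the rename_batch and rename_all helpers and batch count are illustrative):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def rename_batch(pairs):
    # rename a batch of (old, new) path pairs in one thread
    for old, new in pairs:
        os.rename(old, new)

def rename_all(pairs, n_batches=4):
    # round-robin the pairs into one batch per worker thread
    batches = [pairs[i::n_batches] for i in range(n_batches)]
    with ThreadPoolExecutor(max_workers=n_batches) as executor:
        list(executor.map(rename_batch, batches))

tmpdir = tempfile.mkdtemp()
pairs = []
for i in range(8):
    old = os.path.join(tmpdir, f"old-{i}.txt")
    open(old, "w").close()
    pairs.append((old, os.path.join(tmpdir, f"new-{i}.txt")))
rename_all(pairs)
print(len(os.listdir(tmpdir)))  # 8
```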

Concurrent File Deleting

Results for concurrently deleting files.

Method                  Result
Thread                  1.0x
Threads in Batch        1.7x
Processes               slower
Processes in Batch      1.35x
Processes with Threads  n/a
AsyncIO                 slower
AsyncIO in Batch        slower

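Again, threads in batch was the best result. A sketch of batched deletion with os.remove (the delete_batch and delete_all helpers and batch count are illustrative):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def delete_batch(paths):
    # delete a whole batch of files from a single thread
    for path in paths:
        os.remove(path)

def delete_all(paths, n_batches=4):
    # round-robin the paths into one batch per worker thread
    batches = [paths[i::n_batches] for i in range(n_batches)]
    with ThreadPoolExecutor(max_workers=n_batches) as executor:
        list(executor.map(delete_batch, batches))

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"f{i}.txt") for i in range(12)]
for p in paths:
    open(p, "w").close()
delete_all(paths)
print(os.listdir(tmpdir))  # []
```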

Concurrent File Zipping

Results for concurrently zipping files.

Method                  Result
Thread                  1.0x
Threads in Batch        1.0x
Processes               1.0x
Processes in Batch      1.0x
Processes with Threads  1.0x
AsyncIO                 1.0x
AsyncIO in Batch        1.0x


Concurrent File Unzipping

Results for concurrently unzipping files.

Method                  Result
Thread                  2.46x
Threads in Batch        2.53x
Processes               2.56x
Processes in Batch      4.12x
Processes with Threads  4.57x
AsyncIO                 1.10x
AsyncIO in Batch        1.31x

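Unzipping benefits most from processes, plausibly because decompression involves CPU work as well as disk IO. A sketch of the batched-processes set-up, where each child process opens its own handle on the archive and extracts only its assigned members (the extract_members and unzip_all helpers and worker count are illustrative):

```python
import os
import tempfile
import zipfile
from concurrent.futures import ProcessPoolExecutor

def extract_members(zip_path, members, dst_dir):
    # runs in a child process; each process must open its own
    # ZipFile handle, since handles cannot be shared across processes
    with zipfile.ZipFile(zip_path) as zf:
        for member in members:
            zf.extract(member, dst_dir)

def unzip_all(zip_path, dst_dir, n_procs=4):
    # list the members once, then split them across worker processes
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()
    chunks = [members[i::n_procs] for i in range(n_procs)]
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        futures = [executor.submit(extract_members, zip_path, c, dst_dir) for c in chunks]
        for future in futures:
            future.result()

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    zip_path = os.path.join(tmpdir, "archive.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for i in range(8):
            zf.writestr(f"member-{i}.txt", "data")
    dst = tempfile.mkdtemp()
    unzip_all(zip_path, dst)
    print(len(os.listdir(dst)))  # 8
```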

Takeaways

You now know how to speed-up the most common file IO operations with concurrency in Python.



If you enjoyed this tutorial, you will love my book: Concurrent File I/O in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.