How to Speed Up File IO with Concurrency in Python

January 6, 2022 Concurrent File I/O

You can speed up most file IO operations with concurrency in Python.

In this tutorial, you will discover specific techniques for speeding up the most common file IO operations using concurrency in Python.

Let's get started.

File IO is Slow

File IO refers to manipulating files on a hard disk.

This typically includes reading and writing data to files, but also includes a host of related operations such as renaming files, copying files, and deleting files. It may also refer to operations such as zipping files into common archive formats.

File IO might be one of the most common operations performed in Python programs, or programs generally. Files are where we store data, and programs need data to be useful.

Python provides a number of modules for manipulating files.

The most common Python file IO modules include the following:

- os and os.path: low-level file operations and path manipulation
- pathlib: object-oriented filesystem paths
- shutil: high-level file operations such as copying and moving
- zipfile: reading and writing ZIP archives
- tempfile: creating temporary files and directories

Manipulating files on disk is slow.

This is because file IO requires operating on data stored on a hard disk, and hard disks are significantly slower to access than main memory.

We might not notice how slow manipulating files is when working with just a few files. However, when you need to copy, move, or delete thousands or millions of files, the operations can be painfully slow.

How can we speed up file IO operations?

Speed Up File IO with Concurrency

Many file IO operations can be made faster using concurrency.

This means that if we have the same operation to perform on many files, such as renaming or deleting them, we can perform these operations at the same time from the perspective of the program.

This is because each operation is independent and the order in which the operations are performed does not matter.

Concurrent tasks may or may not be executed in parallel. Parallelism refers explicitly to the ability to execute tasks simultaneously, such as with multiple CPU cores.

The reference Python interpreter, CPython, provides four main modules for concurrency:

- multiprocessing: process-based concurrency
- threading: thread-based concurrency
- asyncio: coroutine-based concurrency
- concurrent.futures: pools of worker threads and worker processes

Only processes provide true parallelism in Python, that is, the ability to execute tasks simultaneously on multiple CPU cores. This is achieved by starting new child instances of the Python interpreter to execute tasks. Data sharing between processes is achieved via inter-process communication, in which data must be serialized at added computational cost.
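
As a minimal sketch of process-based parallelism using only the standard library (the task function and worker count here are illustrative, not from the experiments):

```python
from concurrent.futures import ProcessPoolExecutor

def task(value):
    # executed in a separate child Python process
    return value * value

if __name__ == "__main__":
    # the pool starts child interpreter processes; arguments and
    # results are serialized (pickled) between parent and children
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(task, range(4))))  # [0, 1, 4, 9]
```

The `if __name__ == "__main__"` guard matters here: on platforms that spawn child processes by re-importing the main module, unguarded pool creation would recurse.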

Python threads provide concurrency, but only limited parallelism. This is because of the Global Interpreter Lock (GIL), which keeps the Python interpreter thread-safe by allowing only a single thread to execute at any one time within a Python process. The GIL is released in some situations, such as when performing IO operations, allowing modest parallel execution when interacting with files, sockets, and devices.
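
For example, because the GIL is released during blocking writes, a pool of threads can create many small files concurrently. This is a minimal sketch using only the standard library; the write_file helper and worker count are illustrative:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_file(path, data):
    # the GIL is released while this call blocks on the disk,
    # so other worker threads can run at the same time
    with open(path, "w") as f:
        f.write(data)
    return path

# create ten small files concurrently in a temporary directory
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"file-{i}.txt") for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(write_file, paths, ["hello"] * len(paths)))
```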

Asynchronous I/O, or AsyncIO for short, is an implementation of asynchronous programming that provides the ability to create and run coroutines within an event loop. A coroutine is a type of routine that can be suspended and resumed, allowing the asyncio event loop to switch from one coroutine to another at each await keyword. The paradigm provides a limited set of non-blocking IO operations.
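
A minimal sketch of the coroutine model (the work and main coroutines are illustrative). Note that the standard library offers no truly asynchronous file operations, so real file workloads typically wrap blocking calls (e.g. via asyncio.to_thread) or use a third-party library:

```python
import asyncio

async def work(name):
    # a coroutine suspends at each await, letting the event loop
    # resume another coroutine in the meantime
    await asyncio.sleep(0.01)
    return name

async def main():
    # run three coroutines concurrently; gather preserves order
    return await asyncio.gather(work("a"), work("b"), work("c"))

print(asyncio.run(main()))  # ['a', 'b', 'c']
```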

Each of processes, threads, and asyncio has its own sweet spot.


So which type of concurrency should we use to speed up file IO?

Concurrency for File IO

Intuitively, we may think that threads, or perhaps asyncio, are best suited to speeding up file IO operations.

Sometimes this is the case.

Alternatively, we may think that hard disk drives are only able to perform one file IO operation at a time, and therefore file IO operations cannot be performed concurrently.

Again, sometimes this is the case.

Unfortunately, it is not obvious which file IO operations can be sped up with concurrency, nor which type of Python concurrency should be used. Further, it is possible for one approach to offer benefits on a specific operating system or type of hard drive, but not on another.

Therefore, it is critical that you perform experiments in order to discover what type of concurrency works well or best for your specific file IO operation and computer system.

I have performed systematic experiments on a modern operating system (macOS) with a solid-state drive (SSD) in order to discover how to speed up many common file IO operations.

The results are listed below and provide a guide that you can use as a starting point for speeding up file IO operations with concurrency on your own system.

For each file IO operation, we may consider a number of different concurrency set-ups:

- Thread: one worker thread per file
- Threads in Batch: files split into batches, one batch per worker thread
- Processes: one worker process per file
- Processes in Batch: files split into batches, one batch per worker process
- Processes with Threads: files split into batches across worker processes, with worker threads within each process
- AsyncIO: one coroutine per file
- AsyncIO in Batch: files split into batches, one batch per coroutine

Results are listed in terms of the speed-up offered over the sequential (non-concurrent) version of the same operation. The results are approximate and suggestive of the type of performance improvement you could achieve.

If no benefit was observed (within the margin of measurement error), then the result is listed as 1.0 (e.g. 1.0x the speed of the sequential version of the code). Any value above 1.0 demonstrates a benefit. If the concurrent version was slower than the sequential version, then the result is marked as "slower".

Not all combinations were trialled with each file IO operation. If a result is not available, it will be marked as "n/a".

Let's dive into the results.

Concurrent File Writing

Results for concurrently saving data to files on disk.

Method                  Result
Thread                  1.0x
Threads in Batch        n/a
Processes               2.96x
Processes in Batch      2.96x
Processes with Threads  2.96x
AsyncIO                 1.0x
AsyncIO in Batch        n/a

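The best result above came from processes. A hedged sketch of one file per worker process, using only the standard library (the save_file and save_all helpers and the worker count are illustrative):

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

def save_file(path, data):
    # runs in a child process; one file per task
    with open(path, "w") as f:
        f.write(data)
    return path

def save_all(paths, datas, workers=4):
    # dispatch one save task per file to a pool of worker processes
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(save_file, paths, datas))

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, f"data-{i}.txt") for i in range(8)]
    save_all(paths, ["x" * 1000] * len(paths))
    print(len(os.listdir(tmpdir)))  # 8
```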

Concurrent File Loading

Results for concurrently loading data files from disk into memory.

Method                  Result
Thread                  1.0x
Threads in Batch        1.0x
Processes               1.0x
Processes in Batch      slower
Processes with Threads  2.21x
AsyncIO                 1.13x
AsyncIO in Batch        slower

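The winning combination for loading was processes with threads. One way to sketch it (the load_chunk and load_all helpers and worker counts are illustrative) is to split the paths across worker processes and let each process overlap its reads with a small thread pool:

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def load_file(path):
    # a single blocking read; the GIL is released while waiting on the disk
    with open(path, "r") as f:
        return f.read()

def load_chunk(paths):
    # runs inside a child process: use threads to overlap the reads
    with ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(load_file, paths))

def load_all(paths, n_procs=4):
    # round-robin the paths into one chunk per worker process
    chunks = [paths[i::n_procs] for i in range(n_procs)]
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        results = []
        for chunk_result in executor.map(load_chunk, chunks):
            results.extend(chunk_result)
        return results

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i in range(8):
        p = os.path.join(tmpdir, f"f{i}.txt")
        with open(p, "w") as f:
            f.write(str(i))
        paths.append(p)
    print(sorted(load_all(paths)))
```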

Concurrent File Copying

Results for concurrently copying files.

Method                  Result
Thread                  2.24x
Threads in Batch        1.75x
Processes               1.64x
Processes in Batch      2.31x
Processes with Threads  n/a
AsyncIO                 n/a
AsyncIO in Batch        n/a

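A sketch of the thread-per-file approach, which performed well above, using shutil.copy from the standard library (the copy_all helper and worker count are illustrative):

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def copy_file(src, dst_dir):
    # shutil.copy blocks on disk IO, during which the GIL is released
    return shutil.copy(src, dst_dir)

def copy_all(srcs, dst_dir, workers=8):
    # one task per file submitted to a pool of worker threads
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(copy_file, srcs, [dst_dir] * len(srcs)))

# usage: copy every file from one temporary directory to another
src_dir, dst_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
for i in range(5):
    with open(os.path.join(src_dir, f"f{i}.txt"), "w") as f:
        f.write("data")
srcs = [os.path.join(src_dir, name) for name in os.listdir(src_dir)]
copied = copy_all(srcs, dst_dir)
print(len(copied))  # 5
```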

Concurrent File Moving

Results for concurrently moving files.

Method                  Result
Thread                  1.0x
Threads in Batch        1.63x
Processes               slower
Processes in Batch      1.38x
Processes with Threads  n/a
AsyncIO                 slower
AsyncIO in Batch        slower


Concurrent File Renaming

Results for concurrently renaming files.

Method                  Result
Thread                  1.41x
Threads in Batch        1.65x
Processes               slower
Processes in Batch      1.39x
Processes with Threads  1.65x
AsyncIO                 slower
AsyncIO in Batch        slower

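A sketch of the batched-threads set-up for renaming, using os.rename (the rename_batch and rename_all helpers and batch count are illustrative):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def rename_batch(pairs):
    # rename a batch of (old, new) path pairs in one thread
    for old, new in pairs:
        os.rename(old, new)

def rename_all(pairs, n_batches=4):
    # round-robin the pairs into one batch per worker thread
    batches = [pairs[i::n_batches] for i in range(n_batches)]
    with ThreadPoolExecutor(max_workers=n_batches) as executor:
        list(executor.map(rename_batch, batches))

tmpdir = tempfile.mkdtemp()
pairs = []
for i in range(8):
    old = os.path.join(tmpdir, f"old-{i}.txt")
    open(old, "w").close()
    pairs.append((old, os.path.join(tmpdir, f"new-{i}.txt")))
rename_all(pairs)
print(len(os.listdir(tmpdir)))  # 8
```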

Concurrent File Deleting

Results for concurrently deleting files.

Method                  Result
Thread                  1.0x
Threads in Batch        1.7x
Processes               slower
Processes in Batch      1.35x
Processes with Threads  n/a
AsyncIO                 slower
AsyncIO in Batch        slower

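Again, threads in batch was the best result. A sketch of batched deletion with os.remove (the delete_batch and delete_all helpers and batch count are illustrative):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def delete_batch(paths):
    # delete a whole batch of files from a single thread
    for path in paths:
        os.remove(path)

def delete_all(paths, n_batches=4):
    # round-robin the paths into one batch per worker thread
    batches = [paths[i::n_batches] for i in range(n_batches)]
    with ThreadPoolExecutor(max_workers=n_batches) as executor:
        list(executor.map(delete_batch, batches))

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"f{i}.txt") for i in range(12)]
for p in paths:
    open(p, "w").close()
delete_all(paths)
print(os.listdir(tmpdir))  # []
```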

Concurrent File Zipping

Results for concurrently zipping files.

Method                  Result
Thread                  1.0x
Threads in Batch        1.0x
Processes               1.0x
Processes in Batch      1.0x
Processes with Threads  1.0x
AsyncIO                 1.0x
AsyncIO in Batch        1.0x


Concurrent File Unzipping

Results for concurrently unzipping files.

Method                  Result
Thread                  2.46x
Threads in Batch        2.53x
Processes               2.56x
Processes in Batch      4.12x
Processes with Threads  4.57x
AsyncIO                 1.10x
AsyncIO in Batch        1.31x

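Unzipping benefits most from processes, plausibly because decompression involves CPU work as well as disk IO. A sketch of the batched-processes set-up, where each child process opens its own handle on the archive and extracts only its assigned members (the extract_members and unzip_all helpers and worker count are illustrative):

```python
import os
import tempfile
import zipfile
from concurrent.futures import ProcessPoolExecutor

def extract_members(zip_path, members, dst_dir):
    # runs in a child process; each process must open its own
    # ZipFile handle, since handles cannot be shared across processes
    with zipfile.ZipFile(zip_path) as zf:
        for member in members:
            zf.extract(member, dst_dir)

def unzip_all(zip_path, dst_dir, n_procs=4):
    # list the members once, then split them across worker processes
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()
    chunks = [members[i::n_procs] for i in range(n_procs)]
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        futures = [executor.submit(extract_members, zip_path, c, dst_dir) for c in chunks]
        for future in futures:
            future.result()

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    zip_path = os.path.join(tmpdir, "archive.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for i in range(8):
            zf.writestr(f"member-{i}.txt", "data")
    dst = tempfile.mkdtemp()
    unzip_all(zip_path, dst)
    print(len(os.listdir(dst)))  # 8
```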

Takeaways

You now know how to speed-up the most common file IO operations with concurrency in Python.



If you enjoyed this tutorial, you will love my book: Concurrent File I/O in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.