Last Updated on August 21, 2023
You can speed up most file IO operations with concurrency in Python.
In this tutorial, you will discover the specific techniques to use to speed up the most common file IO operations using concurrency in Python.
Let’s get started.
File IO is Slow
File IO refers to manipulating files on a hard disk.
This typically includes reading and writing data to files, but also includes a host of related operations such as renaming files, copying files, and deleting files. It may also refer to operations such as zipping files into common archive formats.
File IO might be one of the most common operations performed in Python programs, or programs generally. Files are where we store data, and programs need data to be useful.
Python provides a number of modules for manipulating files.
The most common Python file IO modules include the following:
- built-in: with functions such as the open() function for opening a file.
- os: with functions such as makedirs(), remove(), rename(), and many more.
- os.path: with functions such as basename(), join(), and many more.
- shutil: with functions such as copy(), move(), and many more.
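As a brief, illustrative sketch, these modules can be used together like this (the file and directory names are made up for the example):

```python
import os
import shutil

# create a directory tree, write a file, copy it, then clean up
os.makedirs('tmp_demo/sub', exist_ok=True)   # os.makedirs()

path = os.path.join('tmp_demo', 'data.txt')  # os.path.join()
with open(path, 'w') as handle:              # built-in open()
    handle.write('hello world')

copy_path = os.path.join('tmp_demo', 'sub', 'copy.txt')
shutil.copy(path, copy_path)                 # shutil.copy()
print(os.path.basename(copy_path))           # -> copy.txt

os.rename(copy_path, os.path.join('tmp_demo', 'sub', 'renamed.txt'))
os.remove(path)                              # os.remove()
shutil.rmtree('tmp_demo')                    # delete the whole tree
```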
Manipulating files on disk is slow.
This is because file IO requires performing operations on data stored on a hard disk, and hard disks are significantly slower to access than data held in main memory.
We might not notice how slow manipulating files is when we are working with just a few files. Nevertheless, when you need to copy, move, or delete thousands or millions of files, the operations can be painfully slow.
How can we speed up file IO operations?
Speed-Up File IO With Concurrency
Many file IO operations can be made faster using concurrency.
This means that if we have the same operation to perform on many files, such as renaming or deleting, we can perform these operations at the same time from the perspective of the program.
This is because each operation is independent and the order in which the operations are performed does not matter.
Concurrent tasks may or may not be executed in parallel. Parallelism refers explicitly to the ability to execute tasks simultaneously, such as with multiple CPU cores.
The reference Python interpreter, CPython, provides four main modules for concurrency:
- multiprocessing: for process-based concurrency with the multiprocessing.Process class.
- threading: for thread-based concurrency with the threading.Thread class.
- concurrent.futures: for thread and process pools that extend the concurrent.futures.Executor class.
- asyncio: for coroutine-based concurrency with the async/await keywords.
Only processes provide true parallelism in Python, that is, the ability to execute tasks simultaneously on multiple CPU cores. This is achieved by starting new child instances of the Python interpreter process to execute tasks. Sharing data between processes requires inter-process communication, in which data must be serialized at an added computational cost.
Python threads provide concurrency, but only limited parallelism. This is because of the Global Interpreter Lock (GIL), which keeps the Python interpreter thread-safe by allowing only a single thread to execute at any one time within a Python process. The GIL is released in some situations, such as when performing IO operations, allowing modest parallel execution when interacting with files, sockets, and devices.
Asynchronous I/O, or AsyncIO for short, is an implementation of asynchronous programming that provides the ability to create and run coroutines within an event-loop. A coroutine is a type of routine that can be suspended and resumed, allowing the asyncio event loop to jump from one to another when instructed via the await keyword. The paradigm provides a limited set of non-blocking IO operations.
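A minimal sketch of coroutines being suspended at an await and resumed by the asyncio event loop (the task names and delay are illustrative):

```python
import asyncio

async def task(name):
    # suspend this coroutine at the await, letting the event loop run others
    await asyncio.sleep(0.01)
    return f'done: {name}'

async def main():
    # schedule three coroutines and wait for all of them concurrently
    results = await asyncio.gather(task('a'), task('b'), task('c'))
    print(results)  # -> ['done: a', 'done: b', 'done: c']
    return results

asyncio.run(main())
```

While one coroutine is suspended at its await, the event loop resumes another, which is how many IO waits can overlap within a single thread.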
Each of processes, threads, and asyncio has a sweet spot.
- Processes are suited to CPU-bound tasks, but are limited to perhaps tens of tasks and incur overhead when sharing data between processes.
- Threads are suited to IO-bound tasks, but are limited to perhaps thousands of tasks.
- AsyncIO is suited to large-scale IO-bound tasks, e.g. perhaps tens of thousands of tasks, but is limited to a subset of non-blocking IO operations and requires adopting the asynchronous programming paradigm.
So what type of concurrency should we use to speed up file IO?
Concurrency for File IO
Intuitively, we may think that threads, or perhaps asyncio, are best suited to speeding up file IO operations.
Sometimes this is the case.
Alternatively, we may think that hard disk drives are only able to perform one file IO operation at a time, and therefore file IO operations cannot be performed concurrently.
Again, sometimes this is the case.
Unfortunately, it is not obvious which file IO operations can be sped up with concurrency, nor what type of Python concurrency should be used. Further, one approach may offer a benefit on a specific operating system or type of hard drive, but not another.
Therefore, it is critical that you perform experiments to discover what type of concurrency works best for your specific file IO operation and computer system.
I have performed systematic experiments on a modern operating system (macOS) with a solid-state drive (SSD) in order to discover how to speed up many common file IO operations.
The results are listed below and provide a guide that you can use as a starting point to speed-up file IO operations with concurrency on your system.
For each file IO operation, we may consider a number of different concurrency set-ups:
- Threads: Use of a thread pool with one operation per worker.
- Threads in Batch: Use of a thread pool with batched operations per worker.
- Processes: Use of a process pool with one operation per worker.
- Processes in Batch: Use of a process pool with batched operations per worker.
- Processes with Threads: Use of a process pool with batched operations per worker, where each worker process executes its batch with a thread pool.
- AsyncIO: Use of aiofiles with asyncio with one coroutine per operation.
- AsyncIO in Batch: Use of aiofiles with asyncio with batched operations per coroutine.
Results are listed in terms of the speed-up offered over the sequential (non-concurrent) version of the same operation. The results are approximate and suggestive of the type of performance improvement you could achieve.
If no benefit was observed (within the margin of measurement error), then the result is listed as 1.0 (e.g. 1.0x the speed of the sequential version of the code). Any value above 1.0 demonstrates a benefit. If the concurrent version was slower than the sequential version, then the result is marked as “slower”.
Not all combinations were trialled with each file IO operation. If a result is not available, it will be marked as “n/a”.
Let’s dive into the results.
Concurrent File Writing
Results for concurrently saving data to files on disk.
| Method | Result |
|---|---|
| Threads | 1.0x |
| Threads in Batch | n/a |
| Processes | 2.96x |
| Processes in Batch | 2.96x |
| Processes with Threads | 2.96x |
| AsyncIO | 1.0x |
| AsyncIO in Batch | n/a |
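A hedged sketch of the winning process-pool set-up, saving many small files with one task per call (the directory name, file count, and file size are illustrative):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def save_file(path):
    # write 1 KiB of placeholder data to a single file
    with open(path, 'w') as handle:
        handle.write('x' * 1024)
    return path

def main():
    # create a directory of target paths (setup only)
    os.makedirs('tmp_out', exist_ok=True)
    paths = [os.path.join('tmp_out', f'data-{i:04d}.txt') for i in range(100)]
    # one save task per file, distributed across the process pool
    with ProcessPoolExecutor() as executor:
        done = list(executor.map(save_file, paths))
    print(f'saved {len(done)} files')

if __name__ == '__main__':
    main()
```

Note the `__main__` guard: process pools on platforms that spawn (rather than fork) child processes re-import the main module, so the entry point must be protected.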
Concurrent File Loading
Results for concurrently loading data files from disk into memory.
| Method | Result |
|---|---|
| Threads | 1.0x |
| Threads in Batch | 1.0x |
| Processes | 1.0x |
| Processes in Batch | slower |
| Processes with Threads | 2.21x |
| AsyncIO | 1.13x |
| AsyncIO in Batch | slower |
Concurrent File Copying
Results for concurrently copying files.
| Method | Result |
|---|---|
| Threads | 2.24x |
| Threads in Batch | 1.75x |
| Processes | 1.64x |
| Processes in Batch | 2.31x |
| Processes with Threads | n/a |
| AsyncIO | n/a |
| AsyncIO in Batch | n/a |
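The plain thread-pool set-up, which performed well above, can be sketched as follows, with one copy task per file (the directory names and file count are illustrative):

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def copy_file(job):
    # copy a single file from source to destination
    src, dst = job
    shutil.copy(src, dst)
    return dst

def main():
    # create some source files to copy (setup only)
    os.makedirs('tmp_src', exist_ok=True)
    os.makedirs('tmp_dst', exist_ok=True)
    jobs = []
    for i in range(50):
        src = os.path.join('tmp_src', f'f{i}.txt')
        with open(src, 'w') as handle:
            handle.write('data')
        jobs.append((src, os.path.join('tmp_dst', f'f{i}.txt')))
    # one copy task per file, distributed across the thread pool
    with ThreadPoolExecutor(max_workers=10) as pool:
        copied = list(pool.map(copy_file, jobs))
    print(f'copied {len(copied)} files')

main()
```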
Concurrent File Moving
Results for concurrently moving files.
| Method | Result |
|---|---|
| Threads | 1.0x |
| Threads in Batch | 1.63x |
| Processes | slower |
| Processes in Batch | 1.38x |
| Processes with Threads | n/a |
| AsyncIO | slower |
| AsyncIO in Batch | slower |
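The batched thread-pool set-up that performed best above can be sketched like this (the directory names, file count, and batch split are illustrative):

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def move_batch(jobs):
    # each worker moves a whole batch of files
    for src, dst in jobs:
        shutil.move(src, dst)
    return len(jobs)

def main():
    # create some files to move (setup only)
    os.makedirs('tmp_a', exist_ok=True)
    os.makedirs('tmp_b', exist_ok=True)
    jobs = []
    for i in range(40):
        src = os.path.join('tmp_a', f'f{i}.txt')
        with open(src, 'w') as handle:
            handle.write('data')
        jobs.append((src, os.path.join('tmp_b', f'f{i}.txt')))
    workers = 4
    # round-robin split of the moves into one batch per worker thread
    batches = [jobs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        moved = sum(pool.map(move_batch, batches))
    print(f'moved {moved} files')

main()
```

Batching amortizes the per-task overhead of the pool, which is why it can beat one task per file for cheap operations like moves.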
Concurrent File Renaming
Results for concurrently renaming files.
| Method | Result |
|---|---|
| Threads | 1.41x |
| Threads in Batch | 1.65x |
| Processes | slower |
| Processes in Batch | 1.39x |
| Processes with Threads | 1.65x |
| AsyncIO | slower |
| AsyncIO in Batch | slower |
Concurrent File Deleting
Results for concurrently deleting files.
| Method | Result |
|---|---|
| Threads | 1.0x |
| Threads in Batch | 1.7x |
| Processes | slower |
| Processes in Batch | 1.35x |
| Processes with Threads | n/a |
| AsyncIO | slower |
| AsyncIO in Batch | slower |
Concurrent File Zipping
Results for concurrently zipping files.
| Method | Result |
|---|---|
| Threads | 1.0x |
| Threads in Batch | 1.0x |
| Processes | 1.0x |
| Processes in Batch | 1.0x |
| Processes with Threads | 1.0x |
| AsyncIO | 1.0x |
| AsyncIO in Batch | 1.0x |
Concurrent File Unzipping
Results for concurrently unzipping files.
| Method | Result |
|---|---|
| Threads | 2.46x |
| Threads in Batch | 2.53x |
| Processes | 2.56x |
| Processes in Batch | 4.12x |
| Processes with Threads | 4.57x |
| AsyncIO | 1.10x |
| AsyncIO in Batch | 1.31x |
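A sketch of the "Processes in Batch" set-up, one of the strongest performers above, where each worker process extracts a batch of archive members (the archive name, member count, and batch split are illustrative):

```python
import os
import zipfile
from concurrent.futures import ProcessPoolExecutor

ZIP_PATH = 'tmp_archive.zip'
OUT_DIR = 'tmp_unzipped'

def extract_batch(names):
    # each worker process opens its own handle on the archive
    # and extracts its batch of members
    with zipfile.ZipFile(ZIP_PATH) as archive:
        for name in names:
            archive.extract(name, OUT_DIR)
    return len(names)

def main():
    # create a small archive to unzip (setup only)
    with zipfile.ZipFile(ZIP_PATH, 'w') as archive:
        for i in range(20):
            archive.writestr(f'member-{i}.txt', 'data')
    with zipfile.ZipFile(ZIP_PATH) as archive:
        names = archive.namelist()
    workers = 4
    # round-robin split of the members into one batch per worker process
    batches = [names[i::workers] for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(extract_batch, batches))
    print(f'extracted {total} members')

if __name__ == '__main__':
    main()
```

Each worker opens the archive independently because zipfile handles cannot be shared across processes; decompression is CPU work, which is why processes help here.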
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Guides
Python File I/O APIs
- Built-in Functions
- os — Miscellaneous operating system interfaces
- os.path — Common pathname manipulations
- shutil — High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
File I/O in Asyncio
Takeaways
You now know how to speed up the most common file IO operations with concurrency in Python.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.