Last Updated on August 21, 2023
You can use reusable concurrent programming patterns to speed up file IO in Python.
In this tutorial, you will discover patterns for concurrent file IO in Python.
Let’s get started.
Types of File I/O Operations
We may need to perform file IO in our Python programs.
There are many types of file IO operations we may want to perform in our programs, varying based on the type of task (e.g. read or write), the size of the file (small or large), and so on.
It is helpful to divide the type of IO we wish to perform into two main classes:
- One-off file IO operations.
- Multiple similar file IO operations.
Let’s take a closer look at each of these cases in turn.
One-Off File I/O Operations
A one-off file operation, as its name suggests, is a single file IO task performed on demand.
For example, we may need to load some data from a file, store state to a file, or append some data to a file.
Often, our program depends on this task being completed before moving on, in which case it cannot, or probably should not, be performed concurrently.
For example, a program may need to load data and then process that data. The processing task depends upon the load task and is blocked until the data is loaded. Concurrency offers no benefit in this case.
In other cases, the one-off operation may not be depended upon by subsequent tasks, in which case it can be performed asynchronously in a fire-and-forget manner.
For example, a program may need to save the state on request by the user. This activity can be performed in the background.
A single thread, or a worker thread in a thread pool, can be used. The task can be dispatched, and any error handling or notifications can be managed by the task itself.
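For example, a minimal sketch of this fire-and-forget approach might look as follows, where the save_state() function, the file name, and the data are hypothetical:

```python
from threading import Thread

# hypothetical task: save program state, handling any error internally
def save_state(path, data):
    try:
        with open(path, 'w') as handle:
            handle.write(data)
    except OSError as e:
        # report the failure from within the task itself
        print(f'Save failed: {e}')

# dispatch the save to a background thread and continue with other work
thread = Thread(target=save_state, args=('state.txt', 'some data'))
thread.start()
```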
Multiple File I/O Operations
Multiple file IO operations, as the name suggests, refer to tasks that involve performing many similar file operations.
For example, we may need to:
- Delete all files in a directory.
- Copy all files from one directory to another.
- Save each in-memory dataset as a separate file.
These tasks are both independent and collectively slow to perform.
- Independent: Each operation is similar or identical, applied to different data. As such, each task is reasonably independent and simple to implement and complete.
- Slow: There are many of these similar operations that must be performed on demand by the program, e.g. tens, hundreds, thousands, or more. As such, performing all file I/O operations takes a long time to complete.
Naively, these tasks are performed sequentially, one after the other.
The problem is that as more file IO operations are added, e.g. more files to load, copy, or delete, the longer the overall task takes to complete.
It is these tasks in particular that require careful attention and can benefit the most from concurrency.
For example, the duration required to save 1,000 files is the duration to save one file multiplied by 1,000, e.g. it scales linearly with the number of tasks.
If we can perform the operation using concurrency, it may be reduced by a factor of the number of CPU cores or a factor of the number of concurrent tasks, or similar, depending on the limiting factors of the task itself and the hardware on which it is run.
It is tasks of this type that we need to focus our attention on.
Concurrent Patterns For File I/O Operations
The main use case we need to address is how to perform many file IO operations faster using concurrency.
Because the tasks are generally independent and relatively simple, they are a good candidate for using a pool of workers, instead of creating many one-off threads or processes.
Therefore, the focus of faster file IO using concurrency will be:
- How to choose between pools of threads and processes, and how the same result can be achieved using coroutines.
- How to configure and arrange the pools of workers.
With these questions in mind, let’s consider some general reusable programming patterns for executing file IO tasks.
Four reusable programming patterns for concurrent IO we may consider are:
- Pattern 1: Single Worker
- Pattern 2: Pools of Workers
- Pattern 3: Batched Tasks And Pools of Workers
- Pattern 4: Pools of Workers within Each Worker
Let’s take a closer look at each pattern in turn.
Pattern 1: Single Worker
The simplest pattern is to create a single concurrent worker on demand for a file IO operation.
This approach would only be helpful in those cases where a single file IO operation can be executed safely in the background while the program continues on with other tasks not dependent on the file IO operation.
This could involve the creation of a single thread, a single process, or a single coroutine.
- New thread worker via the Thread class.
- New process worker via the Process class.
- New coroutine worker via the Task class.
A pool of workers could be used for a one-off task, but it may be overkill, e.g. more costly to setup and operate than a single standalone thread or process.
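For example, a sketch of the single coroutine worker variant is below; it uses asyncio.to_thread() (available in Python 3.9+) to run a blocking write without stalling the event loop, and the save_state() coroutine, file name, and data are hypothetical:

```python
import asyncio

# hypothetical blocking file write
def write_file(path, data):
    with open(path, 'w') as handle:
        handle.write(data)

# hypothetical task: save state without blocking the event loop
async def save_state(path, data):
    # run the blocking write in a worker thread (Python 3.9+)
    await asyncio.to_thread(write_file, path, data)

async def main():
    # fire-and-forget: schedule the save as an independent task
    task = asyncio.create_task(save_state('state.txt', 'some data'))
    # ...continue with other work, then ensure the save completed
    await task

asyncio.run(main())
```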
Pattern 2: Pools of Workers
We might consider using a pool of workers to complete a suite of similar file IO operations.
Pools of workers are provided in Python using the modern executor framework, including the ThreadPoolExecutor and the ProcessPoolExecutor. Each file IO task can be defined as a pure function that is provided the data it needs as arguments and provides any loaded data via a return value.
- Pools of threads in ThreadPoolExecutor.
- Pools of processes in ProcessPoolExecutor.
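For example, a minimal sketch of loading many files with a pool of worker threads might look as follows, where the load_file() function, pool size, and file paths are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical task: pure function that loads one file and returns its contents
def load_file(path):
    with open(path, 'r') as handle:
        return handle.read()

# hypothetical list of file paths to load
paths = [f'data/file{i}.txt' for i in range(1000)]

# a pool of 100 worker threads consumes queued tasks as workers become free
with ThreadPoolExecutor(max_workers=100) as executor:
    for data in executor.map(load_file, paths):
        print(len(data))
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor in this sketch is enough to test process-based workers, as both classes share the same API.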
Asyncio programs can use the third-party aiofiles library, which allows coroutines to issue file IO tasks asynchronously, like any other coroutine. Coroutines can be created and issued in batch and awaited using tools like asyncio.gather() and asyncio.wait().
- Pools of coroutines via asyncio.Task and asyncio.gather().
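For example, a sketch using the third-party aiofiles library (installed via pip install aiofiles) and asyncio.gather() might look as follows, where the file paths are hypothetical:

```python
import asyncio
import aiofiles  # third-party: pip install aiofiles

# hypothetical task: load one file asynchronously
async def load_file(path):
    async with aiofiles.open(path, 'r') as handle:
        return await handle.read()

async def main():
    # hypothetical list of file paths to load
    paths = [f'data/file{i}.txt' for i in range(1000)]
    # create the coroutines in batch, issue them, and await them all
    results = await asyncio.gather(*[load_file(p) for p in paths])
    print(f'Loaded {len(results)} files')

asyncio.run(main())
```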
Using one-off threads or processes would not be appropriate. The main reason is that we would expect to have many more tasks than workers. Using a pool of workers allows the program to issue tasks and have them queued until a worker in the pool is available to consume and execute the task.
If one-off processes or threads were used instead, we would have to develop mechanisms to limit the number of tasks that are able to execute concurrently. It would also result in many threads or processes being created and kept idle, and would consume a lot more memory than having just the tasks themselves sit idle in a pool.
Pattern 3: Batched Tasks And Pools of Workers
There may be efficiency gains to be had in grouping tasks into batches when executing them with a pool of workers.
Many file IO tasks may require data in order to be completed or may result in data being loaded.
If the tasks are performed within a pool of workers, then there may be some overhead in transmitting data from the main program to the workers for each task, and in transmitting data back from the workers to the main program.
We may overcome some of this overhead by executing file IO tasks in batches.
This might involve configuring the pool of workers itself to automatically collect groups of tasks into chunks for transmission to workers, such as via the “chunksize” argument on the map() method.
Alternatively, we may manually chunk the tasks, e.g. create batches of file names to load or delete, then issue each batch of tasks to a worker to process.
This batching may be used with each type of pool of workers, for example:
- Batched tasks in the ThreadPoolExecutor.
- Batched tasks in the ProcessPoolExecutor.
- Batched tasks within each asyncio coroutine.
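For example, a minimal sketch of manual batching with a pool of worker processes might look as follows, where the load_batch() function, batch size, pool size, and file paths are hypothetical (automatic batching via the chunksize argument to map() would avoid the manual chunking step):

```python
from concurrent.futures import ProcessPoolExecutor

# hypothetical task: process one batch of file paths in a single worker
def load_batch(paths):
    data = []
    for path in paths:
        with open(path, 'r') as handle:
            data.append(handle.read())
    return data

if __name__ == '__main__':
    # hypothetical file paths, manually chunked into batches of 100
    paths = [f'data/file{i}.txt' for i in range(1000)]
    batches = [paths[i:i+100] for i in range(0, len(paths), 100)]
    # each batch is transmitted to a worker process as a single task
    with ProcessPoolExecutor(max_workers=4) as executor:
        for batch_data in executor.map(load_batch, batches):
            print(f'Loaded {len(batch_data)} files')
```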
Pattern 4: Pools of Workers within Each Worker
We can take these patterns one step further by having a pool of workers within each worker in a pool of workers.
Not all variations of this pattern make sense.
A good use of this pattern is to have a pool of worker processes, where each worker process in the pool manages its own pool of worker threads.
In this way, a batch of file IO tasks can be issued to each worker in the pool for parallel execution and each batch is executed concurrently using threads.
- Batched tasks in the ProcessPoolExecutor, with a ThreadPoolExecutor in each worker.
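For example, a sketch of this pattern might look as follows, where each worker process loads its batch of files using its own pool of worker threads; the functions, pool sizes, batch size, and file paths are all hypothetical:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# hypothetical task: load one file and return its contents
def load_file(path):
    with open(path, 'r') as handle:
        return handle.read()

# each worker process loads its batch concurrently with its own thread pool
def load_batch(paths):
    with ThreadPoolExecutor(max_workers=10) as threads:
        return list(threads.map(load_file, paths))

if __name__ == '__main__':
    # hypothetical file paths, chunked into one batch per task
    paths = [f'data/file{i}.txt' for i in range(1000)]
    batches = [paths[i:i+250] for i in range(0, len(paths), 250)]
    # batches run in parallel across processes; files within each batch
    # are loaded concurrently by threads within a process
    with ProcessPoolExecutor(max_workers=4) as processes:
        for batch_data in processes.map(load_batch, batches):
            print(f'Loaded {len(batch_data)} files')
```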
How to Choose a File I/O Pattern
Should we use threads or processes for concurrent file IO?
Further, given the concurrency patterns explored in the previous section, which pattern should we use?
Intuitively, we may think that threads, or perhaps asyncio, are best suited to speeding up file IO operations. The reason is that threads will release the GIL when blocked performing a read or a write operation. In theory, threads should be best suited to concurrent file IO.
Sometimes this is the case.
Alternatively, we may think that hard disk drives are only able to perform one or a few file IO operations at a time, and therefore file IO operations cannot be performed concurrently.
Again, sometimes this is the case.
Unfortunately, it is not obvious which file IO operations can be sped up with concurrency and what type of Python concurrency should be used. Further, it is possible for one approach to offer benefits on a specific operating system or type of hard disk drive, and not on another.
Therefore, it is critical that we perform experiments in order to discover what type of concurrency works well, or best, for a specific file IO operation and computer system.
This involves three things:
- Testing threads and processes for a given file IO task.
- Testing different concurrency patterns for a given file IO task.
- Carefully benchmarking and recording the performance of a task under different conditions.
Testing threads and processes for a given task might be as simple as switching out a ThreadPoolExecutor for a ProcessPoolExecutor, both of which use the same APIs.
Testing different concurrency patterns is a little more involved, as it may require adding code to automatically or manually group file IO operations into batches for execution, or adding a thread pool to each worker in a process pool.
Benchmarking the performance under different conditions is essential. The basis for choosing between threads and processes, or between one pattern and another, is the time taken to complete the task.
As a baseline, it is a good idea to first develop a sequential version of the task that performs all file IO operations sequentially, one after the other. Recording how long this sequential version takes to complete provides a line in the sand that all concurrent versions of the program must out-perform in order to be considered “faster”.
The time taken to execute a program can be measured once and recorded. The problem is that a single run of the program may not be representative, especially if the operating system decides to perform another task at the same time. Therefore, it can be good to repeat a benchmark a few times and calculate the average time taken (e.g. the sum of all times divided by the number of runs).
Benchmarking can be performed in the program using code, such as the time.time() function: record the start time, record the end time, and subtract the start time from the end time to give the duration in seconds.
For example:
```python
...
# record start time
time_start = time.time()
...
# record end time
time_end = time.time()
# calculate duration
time_duration = time_end - time_start
# report the duration
print(f'Took {time_duration} seconds')
```
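Building on this, a minimal sketch of repeating a benchmark and averaging the durations might look as follows, where the task() function being timed is hypothetical:

```python
import time

# hypothetical task under test
def task():
    time.sleep(1)

# repeat a benchmark and report the average duration in seconds
def benchmark(target, repeats=3):
    durations = []
    for _ in range(repeats):
        time_start = time.time()
        target()
        durations.append(time.time() - time_start)
    return sum(durations) / len(durations)

print(f'Average: {benchmark(task)} seconds')
```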
Alternatively, on POSIX workstations like Linux and macOS, we can use the time command to measure how long it takes to execute a Python script with a given concurrency pattern.
For example:
```
time python ./program.py
```
The result reports the total time taken in seconds, as well as the portion of time used in user mode and kernel mode while running the script.
For example:
```
real    0m1.051s
user    0m0.026s
sys     0m0.012s
```
More sophisticated approaches are also available such as the timeit Python module.
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Python File I/O APIs
- Built-in Functions
- os - Miscellaneous operating system interfaces
- os.path - Common pathname manipulations
- shutil - High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
- File I/O in Asyncio
Takeaways
You now know about patterns for concurrent file IO in Python.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.