File I/O can be so much faster with the careful use of concurrency.
This includes using Python threads and processes to read, write, and copy files, as well as safely appending to a file from multiple threads.
This course provides you with a 7-day crash course in concurrency for File I/O.
You will get a taste of what is possible and hopefully, skills that you can bring to your next project.
Let’s get started.
Concurrent File I/O Overview
Hi, thanks for your interest in Concurrent File I/O.
You have a lot of fun ahead, including:
- Lesson 01: Faster File I/O With Concurrency (today)
- Lesson 02: One-off Background File I/O Tasks
- Lesson 03: Patterns for Concurrent File I/O
- Lesson 04: Concurrent File Writing
- Lesson 05: Concurrent File Copying
- Lesson 06: Concurrent File Reading
- Lesson 07: Concurrent File Appending
You can download a .zip of all the source code for this course here:
Your first lesson in this series is just below.
Until then, a quick question:
Why are you interested in Concurrent File I/O?
Let me know in the comments below.
Lesson 01: Faster File I/O With Concurrency
Hi, file I/O operations are inherently slow compared to working with data in main memory.
File I/O operations are essential in computer programming for tasks that involve data persistence, file manipulation, and data exchange with external systems.
This is nearly all Python programs that we write!
The performance of file I/O is constrained by the underlying storage hardware, resulting in significantly slower execution when reading from or writing to files.
There are several reasons why file I/O operations can be slow:
- Disk speed.
- File size.
- File fragmentation.
- Hardware limitations.
You can learn more about how to perform common file I/O tasks in Python in the tutorial:
The performance gap creates an opportunity for optimizing file I/O tasks with concurrency, which involves performing multiple file I/O operations simultaneously, freeing up the program to handle other tasks.
Concurrency is a programming paradigm that allows multiple operations to be performed out of sequence, such as asynchronously (at a later time) or simultaneously (in parallel).
Concurrent file I/O can allow a program to perform file I/O later in the background, while the program continues on with other tasks.
It may also reduce the burden of an I/O-bound program and perhaps allow it to transition to becoming CPU-bound or constrained by a resource other than disk access and transfer speeds.
Concurrency is important when performing file I/O for 5 main reasons:
- Improved performance
- Better resource utilization
- Parallelism and responsiveness
- Avoiding blocking and deadlocks
- Scalability
You can learn more about how to use Python concurrency in the tutorial:
We don’t have to adopt concurrency when performing file I/O, but not doing so can negatively impact the performance of our programs.
As a result, we may experience several consequences, such as:
- Reduced performance: Without utilizing concurrency, our file I/O operations may be slower and less efficient.
- Resource underutilization: Without concurrency, our program may not be making optimal use of system resources.
- Limited scalability: Without concurrency, our file I/O operations may have constrained scalability.
- Poor responsiveness: Lack of concurrency in file I/O can impact the responsiveness of our application.
We now know about the importance of concurrency for performing faster file I/O.
You can learn more about this in the tutorial:
Lesson 02: One-off Background File I/O Tasks
Hi, we can run one-off file I/O tasks in the background using a new thread or a new child process.
A one-off file operation is a single file I/O task performed on demand.
For example, we may need to load some data from a file, store state to a file, or append some data to a file.
The task can be dispatched and any error handling or notifications can be handled by the task itself.
Completing a one-off file I/O task in the background assumes some properties of the task, such as:
- Independence. The task does not depend upon another task and is not depended upon by another task, meaning the program is not waiting for the file IO operation to complete.
- Has Data. All data required by the task is provided to the task upfront, e.g. the data to save in the file and the name and path of the file where it is saved.
- Error Handling. The task should manage its own error handling, such as handling and reporting exceptions as the program does not explicitly depend upon the task.
- Isolated. There is only one or a very small number of tasks to perform, e.g. one-off. This means that we are not referring to batch file IO operations, like loading all files in a directory.
One-off file IO operations can be performed in the background of the main program.
This can be achieved using concurrency, such as a new thread.
The example below starts a new thread to save data to a file in the background.
We can use this as a template for background file I/O tasks in our own programs.
# SuperFastPython.com
# example of writing a file in the background with a thread
import threading

# task to write a file
def write_file(filename, data):
    # open a file for writing text
    with open(filename, 'w', encoding='utf-8') as handle:
        # write string data to the file
        handle.write(data)
    # report a message
    print(f'File saved: {filename}')

# protect the entry point
if __name__ == '__main__':
    # report a message
    print('Main is running')
    # save the file
    print('Saving file...')
    # create and configure the new thread
    thread = threading.Thread(target=write_file, args=('data.txt', 'Hello world'))
    # start running the new thread in the background
    thread.start()
    # report a message
    print('Main is done.')
Run the example.
Change the program to save more and different data and have the main program perform other tasks in the foreground.
Let me know what you come up with.
You can learn more about running file I/O tasks in the background in the tutorial:
Lesson 03: Patterns for Concurrent File I/O
Hi, we can use reusable programming patterns when speeding up file I/O with concurrency in Python.
We typically need to perform the same file I/O task on many files, for example:
- Delete all files in a directory.
- Copy all files from one directory to another.
- Save each in-memory dataset as a separate file.
These tasks are both independent and collectively slow to perform.
Naively these tasks are performed sequentially, one after the other.
If we perform the operations concurrently instead, the overall run time may be reduced by a factor related to the number of CPU cores or the number of concurrent tasks, depending on the limiting factors of the task itself and the hardware on which it is run.
There are 4 reusable programming patterns for concurrent file I/O we may consider:
- Pattern 1: Single Worker
- Pattern 2: Pools of Workers
- Pattern 3: Batched Tasks And Pools of Workers
- Pattern 4: Pools of Workers within Each Worker
Let’s take a closer look at each pattern in turn.
Pattern 1: Single Worker
The simplest pattern is to create a single concurrent worker on demand for a file IO operation.
This approach would only be helpful in those cases where a single file IO operation can be executed safely in the background while the program continues on with other tasks not dependent on the file IO operation.
Pattern 2: Pools of Workers
We might consider using a pool of workers to complete a suite of similar file I/O operations.
Pools of workers are provided in Python using the modern executor framework, including the ThreadPoolExecutor and the ProcessPoolExecutor.
A file I/O task can be defined as a pure function that is given the data it needs as arguments and returns any loaded data.
Using a pool of workers allows the program to issue tasks and have them queued until a worker in the pool is available to consume and execute the task.
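As a rough illustration, a minimal sketch of this pattern is shown below. The load_size() task and the tmp/ directory of files are assumptions for illustration, not code from the lessons.

# minimal sketch of Pattern 2: a pool of workers, one task per file
import os
import concurrent.futures

# hypothetical per-file task: load one file and return its size in characters
def load_size(filepath):
    with open(filepath, 'r') as handle:
        return filepath, len(handle.read())

if __name__ == '__main__':
    # assumed directory of files to process
    paths = [os.path.join('tmp', name) for name in os.listdir('tmp')]
    # tasks are queued and consumed by worker threads as they become free
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        for filepath, size in exe.map(load_size, paths):
            print(f'.loaded {filepath} ({size} characters)')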
Pattern 3: Batched Tasks And Pools of Workers
There may be efficiency gains to be had in grouping tasks into batches when executing them with a pool of workers.
Many file I/O tasks may require data in order to be completed or may result in data being loaded.
If the tasks are performed within a pool of workers, then there may be some overhead in transmitting data from the main program to the workers for each task, and in transmitting data back from the workers to the main program.
We may overcome some of this overhead by executing file I/O tasks in batches.
This might involve configuring the pool of workers itself to automatically collect groups of tasks into chunks for transmission to workers.
Alternatively, we may manually chunk the tasks, e.g. create batches of file names to load or delete, then issue each batch of tasks to a worker to process.
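For example, a minimal sketch of manual batching might look like the following; the load_batch() helper, the batch size, and the tmp/ directory are assumptions for illustration.

# minimal sketch of Pattern 3: batches of file tasks issued to a pool of workers
import os
import concurrent.futures

# hypothetical per-batch task: load every file in a batch and return the contents
def load_batch(filepaths):
    contents = list()
    for filepath in filepaths:
        with open(filepath, 'r') as handle:
            contents.append(handle.read())
    return contents

if __name__ == '__main__':
    # assumed directory of files and an assumed batch size
    paths = [os.path.join('tmp', name) for name in os.listdir('tmp')]
    batch_size = 1000
    # manually chunk the file paths into batches
    batches = [paths[i:i+batch_size] for i in range(0, len(paths), batch_size)]
    # one task per batch reduces the per-task transmission overhead
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        futures = [exe.submit(load_batch, batch) for batch in batches]
        for future in concurrent.futures.as_completed(futures):
            print(f'.loaded a batch of {len(future.result())} files')

Alternatively, the chunksize argument to ProcessPoolExecutor.map() (or multiprocessing.Pool.map()) asks the pool itself to group submitted tasks into chunks before sending them to the workers.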
Pattern 4: Pools of Workers within Each Worker
We can take these patterns one step further by having a pool of workers within each worker in a pool of workers.
Not all variations of this pattern make sense.
A good use of this pattern is to have a pool of worker processes, where each worker process in the pool manages its own pool of worker threads.
In this way, a batch of file IO tasks can be issued to each worker in the pool for parallel execution and each batch is executed concurrently using threads.
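A compact sketch of this structure might look like the following; the pool sizes, the batch size, and the tmp/ directory are assumptions, and Lesson 06 develops a complete version of this approach.

# minimal sketch of Pattern 4: a thread pool inside each process-pool worker
import os
import concurrent.futures

# thread-level task: load a single file
def load_file(filepath):
    with open(filepath, 'r') as handle:
        return handle.read()

# process-level task: load a batch of files using its own thread pool
def load_batch(filepaths):
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        return list(exe.map(load_file, filepaths))

if __name__ == '__main__':
    # assumed directory of files, split into assumed batches of 1,000 paths
    paths = [os.path.join('tmp', name) for name in os.listdir('tmp')]
    batches = [paths[i:i+1000] for i in range(0, len(paths), 1000)]
    # each worker process handles one batch and uses threads within the batch
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        for contents in exe.map(load_batch, batches):
            print(f'.loaded a batch of {len(contents)} files')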
We now know about patterns for concurrent file IO in Python.
It is critical that we perform experiments to discover which type of concurrency works best for our specific file I/O operations and systems.
You can learn more about concurrent file I/O patterns in the tutorial:
Lesson 04: Concurrent File Writing
Hi, saving a single file to disk can be slow.
It can become painfully slow in situations where we may need to save thousands of files. The hope is that concurrency can be used to speed up file saving.
Consider a CSV file with 10 values per line and 10,000 lines.
It can take about 200 seconds to save 10,000 files of this type sequentially, one by one.
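For reference, a sequential version of this workload might look like the following sketch; the data generation and file naming here are simplified assumptions, not the exact code used for the timing above.

# sketch of a sequential baseline: generate and save 10,000 CSV files one by one
import os

# generate one CSV file's worth of data: 10 values per line, 10,000 lines
def generate_data(identifier, n_values=10, n_lines=10000):
    lines = [','.join(str(identifier+i+j) for j in range(n_values)) for i in range(n_lines)]
    return '\n'.join(lines)

if __name__ == '__main__':
    # create a local directory to save files
    os.makedirs('tmp', exist_ok=True)
    # generate and save each file in turn
    for i in range(10000):
        with open(os.path.join('tmp', f'data-{i:04d}.csv'), 'w') as handle:
            handle.write(generate_data(i))
    print('Done')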
There are many ways to save files to disk concurrently.
For example, we could use a thread pool or a process pool and each task could save one file or a batch of files.
In this case, we will use a process pool via the ProcessPoolExecutor class and save one file per task.
We will use a pool of 8 worker processes. Each task will save one file with a unique filename and unique data.
All files will be saved in a tmp/ directory under the current directory where we run the program.
The complete example is listed below.
# SuperFastPython.com
# save many data files concurrently with a process pool
import os
import os.path
import concurrent.futures
import time

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a csv file with v=10 values per line and l=10k lines
def generate_file(identifier, n_values=10, n_lines=10000):
    # generate many lines of data
    data = list()
    for i in range(n_lines):
        # generate a list of numeric values as strings
        line = [str(identifier+i+j) for j in range(n_values)]
        # convert list to string with values separated by commas
        data.append(','.join(line))
    # convert list of lines to a string separated by new lines
    return '\n'.join(data)

# generate and save a file
def generate_and_save(path, identifier):
    # generate data
    data = generate_file(identifier)
    # create a unique filename
    filepath = os.path.join(path, f'data-{identifier:04d}.csv')
    # save data file to disk
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}', flush=True)

# save many data files in a directory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the local dir to save files
    path = 'tmp'
    # define the number of files to save
    n_files = 10000
    # create a local directory to save files
    os.makedirs(path, exist_ok=True)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        # submit tasks to generate files
        _ = [exe.submit(generate_and_save, path, i) for i in range(n_files)]
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 71 seconds to complete.
Depending on the speed of your hardware, it should be about 2.5x to 3x faster than saving the files sequentially.
Try varying the number of worker processes and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file saving in the tutorial:
Lesson 05: Concurrent File Copying
Hi, copying files is typically really slow.
We can explore how to speed up file copying using concurrency.
In the previous lesson, we saw how we can create many files on disk.
Re-run the program from the previous lesson to create a tmp/ directory with many files.
We can then develop a program to copy all 10,000 files from one directory to another.
If we use the shutil.copy() function to copy all files sequentially from the tmp/ directory to a new directory named tmp2/, it will take about 10 seconds.
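For reference, a sequential baseline might be as simple as the following sketch, assuming the tmp/ and tmp2/ directory names used in this lesson.

# sketch of a sequential baseline: copy all files one by one
import os
import shutil

if __name__ == '__main__':
    # source and destination directories for this lesson
    src = 'tmp'
    dest = 'tmp2'
    # create the destination directory if needed
    os.makedirs(dest, exist_ok=True)
    # copy each file from the source directory to the destination directory
    for name in os.listdir(src):
        shutil.copy(os.path.join(src, name), dest)
    print('Done')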
We can speed this up by using a thread pool.
Each task will copy one file from the source directory into the destination directory.
We can then issue one task per file in the source directory.
The example below implements this with a thread pool of 100 workers.
# SuperFastPython.com
# copy files from one directory to another concurrently with threads
import os
import os.path
import shutil
import concurrent.futures
import time

# copy a file from source to destination
def copy_file(src_path, dest_dir):
    # copy source file to dest file
    dest_path = shutil.copy(src_path, dest_dir)
    # report progress
    print(f'.copied {src_path} to {dest_path}')

# copy files from src to dest
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the source and destination dirs
    src = 'tmp'
    dest = 'tmp2'
    # create the destination directory if needed
    os.makedirs(dest, exist_ok=True)
    # create full paths for all files we wish to copy
    files = [os.path.join(src, name) for name in os.listdir(src)]
    # create the thread pool
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        # submit all copy tasks
        _ = [exe.submit(copy_file, path, dest) for path in files]
    print('Done')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 3.7 seconds to complete.
Depending on the speed of your hardware, it should be about 2x to 2.5x faster than copying the files sequentially.
Try varying the number of worker threads and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file copying in the tutorial:
Lesson 06: Concurrent File Reading
Hi, loading a single file from disk can be slow.
It can become painfully slow in situations where we may need to load thousands of files. The hope is that concurrency can be used to speed up file loading.
In a previous lesson, we saw how we can create many files on disk.
Re-run the program from a previous lesson to create a tmp/ directory with many files.
We can then develop a program to load all files.
Loading files is generally faster than saving them. Loading all 10,000 files sequentially should take about 2.5 seconds to complete.
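For reference, a sequential loading baseline might look like this sketch, assuming the tmp/ directory of files created in Lesson 04.

# sketch of a sequential baseline: load all files one by one
import os

if __name__ == '__main__':
    # directory of files created in Lesson 04
    path = 'tmp'
    # load the contents of every file in the directory into memory
    data = list()
    for name in os.listdir(path):
        with open(os.path.join(path, name), 'r') as handle:
            data.append(handle.read())
    print(f'Loaded {len(data)} files')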
We can develop a concurrent version that is more than 2x faster.
This is a more advanced approach that uses a process pool, where each worker process in the pool uses its own thread pool. Each worker thread then loads a single file.
This involves two parts.
The first part involves splitting the list of files to load into chunks. Each worker process will load one chunk of files. All chunks will be processed concurrently.
The second part involves the task that performs the loading. It will start a new thread pool with one worker thread per file to load. All files can then be loaded concurrently.
The complete example is listed below.
# SuperFastPython.com
# load many files concurrently with processes and threads in batch
import os
import os.path
import concurrent.futures
import time

# load a file and return the contents
def load_file(filepath):
    # open the file
    with open(filepath, 'r') as handle:
        # return the contents
        return handle.read()

# return the contents of many files
def load_files(filepaths):
    # create a thread pool
    with concurrent.futures.ThreadPoolExecutor(len(filepaths)) as exe:
        # load files
        futures = [exe.submit(load_file, name) for name in filepaths]
        # collect data
        data_list = [future.result() for future in futures]
        # return data and file paths
        return (data_list, filepaths)

# load all files in a directory into memory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the path of the files to load
    path = 'tmp'
    # prepare all of the paths
    paths = [os.path.join(path, filepath) for filepath in os.listdir(path)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(paths) / n_workers)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(n_workers) as exe:
        futures = list()
        # split the load operations into chunks
        for i in range(0, len(paths), chunksize):
            # select a chunk of filenames
            filepaths = paths[i:(i + chunksize)]
            # submit the task
            future = exe.submit(load_files, filepaths)
            futures.append(future)
        # process all results
        for future in concurrent.futures.as_completed(futures):
            # get the loaded data and the file paths
            _, filepaths = future.result()
            for filepath in filepaths:
                # report progress
                print(f'.loaded {filepath}')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 1.4 seconds to complete.
Depending on the speed of your hardware, it should be about 1.5x to 2x faster than loading the files sequentially.
Try varying the number of worker processes and worker threads and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file loading in the tutorial:
Lesson 07: Concurrent File Appending
Hi, appending to a file from multiple threads is not thread-safe and can result in overwritten data and a corrupted file.
We can explore how to safely append to a file from multiple threads.
One approach to safely append to a file from multiple threads is by using a mutual exclusion lock.
Python provides a mutual exclusion lock, also called a mutex, via the threading.Lock class.
A lock can be created and passed to each task executed in a new thread.
Then, before any thread attempts to write to the file, it must acquire the lock. Only one thread can hold the lock at a time, and if writes to the file only occur while the lock is held, then only one thread can write to the file at a time.
This makes writing to the file thread-safe and avoids any corruption of the written messages.
Tying this together, the complete example of writing to file from multiple threads using a lock is listed below.
# SuperFastPython.com
# write to a file from multiple threads using a shared lock
import time
import random
import threading
import concurrent.futures

# task that blocks for a moment and writes to a file
def task(name, lock, file_handle):
    # block for a moment
    value = random.random()
    time.sleep(value)
    # obtain the lock
    with lock:
        # write to a file
        file_handle.write(f'Task {name} blocked for {value} of a second.\n')

# write to a file from multiple threads
if __name__ == '__main__':
    # open a file
    with open('results.txt', 'a', encoding='utf-8') as handle:
        # create a lock to protect the file
        lock = threading.Lock()
        # create the thread pool
        with concurrent.futures.ThreadPoolExecutor(1000) as exe:
            # dispatch all tasks
            _ = [exe.submit(task, i, lock, handle) for i in range(10000)]
        # wait for all tasks to complete
        print('Done.')
Run the program and note how long it takes to execute.
Review the created file with the name “results.txt” and confirm that it contains 10,000 lines of data without corruption.
Try varying the number of worker threads and the number of tasks that write lines to the file.
Try the same program with and without the use of the lock and compare the results.txt file.
Let me know what you discover.
You can learn more about concurrent file appending in the tutorial:
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Guides
Python File I/O APIs
- Built-in Functions
- os - Miscellaneous operating system interfaces
- os.path - Common pathname manipulations
- shutil - High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
File I/O in Asyncio
References
Thank You
Hi, thank you for letting me help you learn more about concurrent File I/O.
If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.
If you enjoyed this course and want to know more, check out my new book:
This book distills only what you need to know to get started and be effective with concurrent File I/O, super fast.
The book is divided into 15 chapters spread across 3 parts. It contains everything you need to get started, then get really good at concurrent File I/O, all in one book.
It’s exactly how I would teach you concurrent File I/O if we were sitting together, pair programming.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Did you enjoy this course?
Let me know in the comments below.
Photo by Angel Rodarte on Unsplash