Concurrent File I/O 7-Day Course
File I/O can be so much faster with the careful use of concurrency.
This includes using Python threads and processes to read, write, and copy files, as well as how to safely append to a file from multiple threads.
This course provides you with a 7-day crash course in concurrency for File I/O.
You will get a taste of what is possible and hopefully, skills that you can bring to your next project.
Let's get started.
Concurrent File I/O Overview
Hi, thanks for your interest in Concurrent File I/O.
You have a lot of fun ahead, including:
- Lesson 01: Faster File I/O With Concurrency (today)
- Lesson 02: One-off Background File I/O Tasks
- Lesson 03: Patterns for Concurrent File I/O
- Lesson 04: Concurrent File Writing
- Lesson 05: Concurrent File Copying
- Lesson 06: Concurrent File Reading
- Lesson 07: Concurrent File Appending
You can download a .zip of all the source code for this course here:
Your first lesson in this series is just below.
Until then, a quick question:
Why are you interested in Concurrent File I/O?
Let me know in the comments below.
Lesson 01: Faster File I/O With Concurrency
Hi, file I/O operations are inherently slower compared to working with data in main memory.
File I/O operations are essential in computer programming for tasks that involve data persistence, file manipulation, and data exchange with external systems.
This is nearly all Python programs that we write!
The performance of file I/O is constrained by the underlying hardware of the hard drive, resulting in significantly slower execution times when reading from or writing to files.
There are several reasons why file I/O operations can be slow:
- Disk speed.
- File size.
- File fragmentation.
- Hardware limitations.
You can learn more about how to perform common file I/O tasks in Python in the tutorial:
The performance gap creates an opportunity for optimizing file I/O tasks with concurrency, which involves performing multiple file I/O operations simultaneously, freeing up the program to handle other tasks.
Concurrency is a programming paradigm that allows multiple operations to be performed out of sequence, such as asynchronously (at a later time) or simultaneously (in parallel).
Concurrent file I/O can allow a program to perform file I/O later in the background, while the program continues on with other tasks.
It may also reduce the burden of an I/O-bound program and perhaps allow it to transition to becoming CPU-bound or constrained by a resource other than disk access and transfer speeds.
Concurrency is important when performing file I/O for 5 main reasons:
- Improved performance
- Better resource utilization
- Parallelism and responsiveness
- Avoiding blocking and deadlocks
- Scalability
You can learn more about how to use Python concurrency in the tutorial:
We don't have to adopt concurrency when performing file I/O, but not doing so can negatively impact the performance of our programs.
As such, we may experience several consequences, such as:
- Reduced performance: Without utilizing concurrency, our file I/O operations may be slower and less efficient.
- Resource underutilization: Without concurrency, our program may not be making optimal use of system resources.
- Limited scalability: Without concurrency, our file I/O operations may have constrained scalability.
- Poor responsiveness: Lack of concurrency in file I/O can impact the responsiveness of our application.
We now know about the importance of concurrency for performing faster file I/O.
You can learn more about this in the tutorial:
Lesson 02: One-off Background File I/O Tasks
Hi, we can run one-off file IO tasks in the background using a new thread or new child process.
A one-off file operation is a single file I/O task performed on demand.
For example, we may need to load some data from a file, store state to a file, or append some data to a file.
The task can be dispatched and any error handling or notifications can be handled by the task itself.
For a one-off file IO task to be completed in the background, it assumes some properties of the task, such as:
- Independence. The task does not depend upon another task and is not depended upon by another task, meaning the program is not waiting for the file IO operation to complete.
- Has Data. All data required by the task is provided to the task upfront, e.g. the data to save in the file and the name and path of the file where it is saved.
- Error Handling. The task should manage its own error handling, such as handling and reporting exceptions as the program does not explicitly depend upon the task.
- Isolated. There is only one or a very small number of tasks to perform, e.g. one-off. This means that we are not referring to batch file IO operations, like loading all files in a directory.
One-off file IO operations can be performed in the background of the main program.
This can be achieved using concurrency, such as a new thread.
The example below starts a new thread to save data to a file in the background.
We can use this as a template for background file I/O tasks in our own programs.
# SuperFastPython.com
# example of writing a file in the background with a thread
import threading

# task to write a file
def write_file(filename, data):
    # open a file for writing text
    with open(filename, 'w', encoding='utf-8') as handle:
        # write string data to the file
        handle.write(data)
    # report a message
    print(f'File saved: {filename}')

# protect the entry point
if __name__ == '__main__':
    # report a message
    print('Main is running')
    # save the file
    print('Saving file...')
    # create and configure the new thread
    thread = threading.Thread(target=write_file, args=('data.txt', 'Hello world'))
    # start running the new thread in the background
    thread.start()
    # report a message
    print('Main is done.')
Run the example.
Change the program to save more and different data and have the main program perform other tasks in the foreground.
Let me know what you come up with.
You can learn more about running file I/O tasks in the background in the tutorial:
Lesson 03: Patterns for Concurrent File I/O
Hi, we can use concurrent programming patterns when speeding up file IO using concurrency in Python.
We typically need to perform the same file I/O task on many files, for example:
- Delete all files in a directory.
- Copy all files from one directory to another.
- Save each in-memory dataset as a separate file.
These tasks are both independent and collectively slow to perform.
Naively these tasks are performed sequentially, one after the other.
If we can perform these operations concurrently, the overall running time may be reduced by a factor of the number of CPU cores or the number of concurrent tasks, or similar, depending on the limiting factors of the task itself and the hardware on which it is run.
There are 4 reusable programming patterns for concurrent IO we may consider:
- Pattern 1: Single Worker
- Pattern 2: Pools of Workers
- Pattern 3: Batched Tasks And Pools of Workers
- Pattern 4: Pools of Workers within Each Worker
Let's take a closer look at each pattern in turn.
Pattern 1: Single Worker
The simplest pattern is to create a single concurrent worker on demand for a file IO operation.
This approach would only be helpful in those cases where a single file IO operation can be executed safely in the background while the program continues on with other tasks not dependent on the file IO operation.
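Pattern 1 can be sketched as a single worker thread started on demand, much like the example in Lesson 02. The file name and data below are illustrative assumptions, not part of any particular program:

```python
# Pattern 1 sketch: a single background worker for a one-off file I/O task.
# The file name 'state.txt' and its contents are illustrative assumptions.
import threading

# one-off task: save the given data to the given file
def save_state(filename, data):
    with open(filename, 'w', encoding='utf-8') as handle:
        handle.write(data)

# create and start a single worker thread for the task
worker = threading.Thread(target=save_state, args=('state.txt', 'checkpoint'))
worker.start()
# ... the main program continues with other, independent work here ...
# join only if we later need to be sure the file was written
worker.join()
```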
Pattern 2: Pools of Workers
We might consider using a pool of workers to complete a suite of similar file I/O operations.
Pools of workers are provided in Python using the modern executor framework, including the ThreadPoolExecutor and the ProcessPoolExecutor.
File IO tasks can be defined as a pure function that is given the data it needs as an argument and provides any loaded data via a return value.
Using a pool of workers allows the program to issue tasks and have them queued until a worker in the pool is available to consume and execute the task.
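A minimal sketch of Pattern 2 is below. The tmp_pool/ directory and its small files are created up front only so the sketch is self-contained; in a real program the files to load would already exist:

```python
# Pattern 2 sketch: a pool of workers, one task per file.
# The tmp_pool/ directory and its files are assumptions, created here
# only so the example is self-contained.
import os
from concurrent.futures import ThreadPoolExecutor

# create a few small sample files to load
os.makedirs('tmp_pool', exist_ok=True)
paths = []
for i in range(5):
    path = os.path.join('tmp_pool', f'data-{i}.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(f'record {i}')
    paths.append(path)

# pure function: given a path, return the loaded contents
def load_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as handle:
        return handle.read()

# tasks are queued until a worker in the pool is free to execute them
with ThreadPoolExecutor(4) as exe:
    contents = list(exe.map(load_file, paths))
```

The same structure works with a ProcessPoolExecutor, provided the task function and its arguments can be pickled.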
Pattern 3: Batched Tasks And Pools of Workers
There may be efficiency gains to be had in grouping tasks into batches when executing them with a pool of workers.
Many file I/O tasks may require data in order to be completed or may result in data being loaded.
If the tasks are performed within a pool of workers, then there may be some overhead in transmitting data from the main program to the workers for each task, and in transmitting data back from workers to the main program.
We may overcome some of this overhead by executing file I/O tasks in batches.
This might involve configuring the pool of workers itself to automatically collect groups of tasks into chunks for transmission to workers.
Alternatively, we may manually chunk the tasks, e.g. create batches of file names to load or delete, then issue each batch of tasks to a worker to process.
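Manual batching might look like the sketch below, here deleting files in batches. The tmp_batch/ directory, its files, and the batch size are all illustrative assumptions:

```python
# Pattern 3 sketch: manually chunk tasks into batches, one batch per task.
# The tmp_batch/ directory, its files, and the batch size of 4 are
# assumptions, created here only so the example is self-contained.
import os
from concurrent.futures import ThreadPoolExecutor

# create some sample files to delete
os.makedirs('tmp_batch', exist_ok=True)
names = []
for i in range(10):
    path = os.path.join('tmp_batch', f'data-{i}.csv')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('x')
    names.append(path)

# each task deletes a whole batch of files, amortizing per-task overhead
def delete_batch(batch):
    for filepath in batch:
        os.remove(filepath)
    return len(batch)

# manually chunk the file names into batches of 4
batch_size = 4
batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]

# issue one task per batch to the pool
with ThreadPoolExecutor(3) as exe:
    n_deleted = sum(exe.map(delete_batch, batches))
```

The automatic alternative is the chunksize argument to Executor.map(), which the ProcessPoolExecutor uses to group tasks into chunks before sending them to worker processes.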
Pattern 4: Pools of Workers within Each Worker
We can take these patterns one step further by having a pool of workers within each worker in a pool of workers.
Not all variations of this pattern make sense.
A good use of this pattern is to have a pool of worker processes, where each worker process in the pool manages its own pool of worker threads.
In this way, a batch of file IO tasks can be issued to each worker in the pool for parallel execution and each batch is executed concurrently using threads.
We now know about patterns for concurrent file IO in Python.
It is critical that we perform experiments in order to discover what type of concurrency works best for our specific file I/O operation and system.
You can learn more about concurrent file I/O patterns in the tutorial:
Lesson 04: Concurrent File Writing
Hi, saving a single file to disk can be slow.
It can become painfully slow in situations where we may need to save thousands of files. The hope is that concurrency can be used to speed up file saving.
Consider a CSV file with 10 values per line and 10,000 lines.
It can take about 200 seconds to save 10,000 files of this type sequentially, one by one.
There are many ways to save files to disk concurrently.
For example, we could use a thread pool or a process pool and each task could save one file or a batch of files.
In this case, we will use a process pool via the ProcessPoolExecutor class and save one file per task.
We will use a pool of 8 worker processes. Each task will save one file with a unique filename and unique data.
All files will be saved in a tmp/ directory under the current directory where we save the program.
The complete example is listed below.
# SuperFastPython.com
# save many data files concurrently with a process pool
import os
import os.path
import concurrent.futures
import time

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a csv file with n_values per line and n_lines lines
def generate_file(identifier, n_values=10, n_lines=10000):
    # generate many lines of data
    data = list()
    for i in range(n_lines):
        # generate a list of numeric values as strings
        line = [str(identifier+i+j) for j in range(n_values)]
        # convert list to string with values separated by commas
        data.append(','.join(line))
    # convert list of lines to a string separated by new lines
    return '\n'.join(data)

# generate and save a file
def generate_and_save(path, identifier):
    # generate data
    data = generate_file(identifier)
    # create a unique filename
    filepath = os.path.join(path, f'data-{identifier:04d}.csv')
    # save data file to disk
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}', flush=True)

# save many data files in a directory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the local dir to save files
    path = 'tmp'
    # define the number of files to save
    n_files = 10000
    # create a local directory to save files
    os.makedirs(path, exist_ok=True)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        # submit tasks to generate files
        _ = [exe.submit(generate_and_save, path, i) for i in range(n_files)]
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 71 seconds to complete.
Depending on the speed of your hardware it should be about 2.5x to 3x faster than saving the files sequentially.
Try varying the number of worker processes and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file saving in the tutorial:
Lesson 05: Concurrent File Copying
Hi, copying files is typically really slow.
We can explore how to speed up file copying using concurrency.
In the previous lesson, we saw how we can create many files on disk.
Re-run the program from the previous lesson to create a tmp/ directory with many files.
We can then develop a program to copy all 10,000 files from one directory to another.
If we use the shutil.copy() function to copy all files sequentially from the tmp/ directory to a new directory named tmp2/ it will take about 10 seconds.
We can speed this up by using a thread pool.
Each task will copy one file from the source directory into the destination directory.
We can then issue one task per file in the source directory.
The example below implements this with a thread pool of 100 workers.
# SuperFastPython.com
# copy files from one directory to another concurrently with threads
import os
import os.path
import shutil
import concurrent.futures
import time

# copy a file from source to destination
def copy_file(src_path, dest_dir):
    # copy source file to dest file
    dest_path = shutil.copy(src_path, dest_dir)
    # report progress
    print(f'.copied {src_path} to {dest_path}')

# copy files from src to dest
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the source and destination dirs
    src = 'tmp'
    dest = 'tmp2'
    # create the destination directory if needed
    os.makedirs(dest, exist_ok=True)
    # create full paths for all files we wish to copy
    files = [os.path.join(src, name) for name in os.listdir(src)]
    # create the thread pool
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        # submit all copy tasks
        _ = [exe.submit(copy_file, path, dest) for path in files]
    print('Done')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 3.7 seconds to complete.
Depending on the speed of your hardware it should be about 2x to 2.5x faster than copying the files sequentially.
Try varying the number of worker threads and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file copying in the tutorial:
Lesson 06: Concurrent File Reading
Hi, loading a single file from disk can be slow.
It can become painfully slow in situations where we may need to load thousands of files. The hope is that concurrency can be used to speed up file loading.
In a previous lesson, we saw how we can create many files on disk.
Re-run the program from a previous lesson to create a tmp/ directory with many files.
We can then develop a program to load all files.
In general, loading files is faster than saving them. Loading all 10,000 files sequentially should take about 2.5 seconds to complete.
We can develop a concurrent version that is more than 2x faster.
This is an advanced approach that uses a process pool where each worker in the process pool uses a thread pool. Each worker in the thread pool then loads a file.
This involves two parts.
The first part involves splitting the number of files to load into chunks. Each worker process will load one chunk of files. All chunks will be executed concurrently.
The second part involves the task that performs the loading. It will start a new thread pool with one worker thread per file to load. All files can then be loaded concurrently.
The complete example is listed below.
# SuperFastPython.com
# load many files concurrently with processes and threads in batch
import os
import os.path
import concurrent.futures
import time

# load a file and return the contents
def load_file(filepath):
    # open the file
    with open(filepath, 'r') as handle:
        # return the contents
        return handle.read()

# return the contents of many files
def load_files(filepaths):
    # create a thread pool
    with concurrent.futures.ThreadPoolExecutor(len(filepaths)) as exe:
        # load files
        futures = [exe.submit(load_file, name) for name in filepaths]
        # collect data
        data_list = [future.result() for future in futures]
        # return data and file paths
        return (data_list, filepaths)

# load all files in a directory into memory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the path to load files from
    path = 'tmp'
    # prepare all of the paths
    paths = [os.path.join(path, filepath) for filepath in os.listdir(path)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(paths) / n_workers)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(n_workers) as exe:
        futures = list()
        # split the load operations into chunks
        for i in range(0, len(paths), chunksize):
            # select a chunk of filenames
            filepaths = paths[i:(i + chunksize)]
            # submit the task
            future = exe.submit(load_files, filepaths)
            futures.append(future)
        # process all results
        for future in concurrent.futures.as_completed(futures):
            # retrieve the loaded data and file paths
            _, filepaths = future.result()
            for filepath in filepaths:
                # report progress
                print(f'.loaded {filepath}')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 1.4 seconds to complete.
Depending on the speed of your hardware it should be about 1.5x to 2x faster than loading the files sequentially.
Try varying the number of worker processes and worker threads and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file loading in the tutorial:
Lesson 07: Concurrent File Appending
Hi, appending to a file from multiple threads is not thread-safe and can result in overwritten data and file corruption.
We can explore how to append to a file from multiple threads.
One approach to safely append to a file from multiple threads is by using a mutual exclusion lock.
Python provides a mutual exclusion lock, also called a mutex, via the threading.Lock class.
A lock can be created and passed to each task executed in a new thread.
Then, before any thread attempts to write to the file it must obtain the lock. Only one thread can have the lock at any one time, and if writes to the file can only occur when the lock is held, then only one thread can write to the file at a time.
This will make writing to the file thread safe and will avoid any corruption of messages.
Tying this together, the complete example of writing to file from multiple threads using a lock is listed below.
# SuperFastPython.com
# write to a file from multiple threads using a shared lock
import time
import random
import threading
import concurrent.futures

# task that blocks for a moment and writes to a file
def task(name, lock, file_handle):
    # block for a moment
    value = random.random()
    time.sleep(value)
    # obtain the lock
    with lock:
        # write to a file
        file_handle.write(f'Task {name} blocked for {value} seconds.\n')

# write to a file from multiple threads
if __name__ == '__main__':
    # open a file
    with open('results.txt', 'a', encoding='utf-8') as handle:
        # create a lock to protect the file
        lock = threading.Lock()
        # create the thread pool
        with concurrent.futures.ThreadPoolExecutor(1000) as exe:
            # dispatch all tasks
            _ = [exe.submit(task, i, lock, handle) for i in range(10000)]
        # exiting the pool context manager waits for all tasks to complete
    print('Done.')
Run the program and note how long it takes to execute.
Review the created file with the name "results.txt" and confirm that it contains 10,000 lines of data without corruption.
Try varying the number of threads and the number of lines written by each thread.
Try the same program with and without the use of the lock and compare the results.txt file.
Let me know what you discover.
You can learn more about concurrent file appending in the tutorial:
Thank You
Hi, thank you for letting me help you learn more about concurrent File I/O.
If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.
If you enjoyed this course and want to know more, check out my new book:
This book distills only what you need to know to get started and be effective with concurrent File I/O, super fast.
The book is divided into 15 chapters spread across 3 parts. It contains everything you need to get started, then get really good at concurrent File I/O, all in one book.
It's exactly how I would teach you concurrent File I/O if we were sitting together, pair programming.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Did you enjoy this course?
Let me know in the comments below.
If you enjoyed this tutorial, you will love my book: Concurrent File I/O in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.