File I/O can be so much faster with the careful use of concurrency.
This includes using Python threads and processes to read, write, and copy files, as well as safely appending to a file from multiple threads.
This course provides you with a 7-day crash course in concurrency for File I/O.
You will get a taste of what is possible and hopefully, skills that you can bring to your next project.
Let’s get started.
Concurrent File I/O Overview
Hi, thanks for your interest in Concurrent File I/O.
You have a lot of fun ahead, including:
- Lesson 01: Faster File I/O With Concurrency (today)
- Lesson 02: One-off Background File I/O Tasks
- Lesson 03: Patterns for Concurrent File I/O
- Lesson 04: Concurrent File Writing
- Lesson 05: Concurrent File Copying
- Lesson 06: Concurrent File Reading
- Lesson 07: Concurrent File Appending
You can download a .zip of all the source code for this course here:
Your first lesson in this series is just below.
Until then, a quick question:
Why are you interested in Concurrent File I/O?
Let me know in the comments below.
Lesson 01: Faster File I/O With Concurrency
Hi, file I/O operations are inherently slow compared to working with data in main memory.
File I/O operations are essential in computer programming for tasks that involve data persistence, file manipulation, and data exchange with external systems.
This is nearly all Python programs that we write!
The performance of file I/O is constrained by the underlying storage hardware, resulting in significantly slower execution when reading from or writing to files.
There are several reasons why file I/O operations can be slow:
- Disk speed.
- File size.
- File fragmentation.
- Hardware limitations.
You can learn more about how to perform common file I/O tasks in Python in the tutorial:
The performance gap creates an opportunity for optimizing file I/O tasks with concurrency, which involves performing multiple file I/O operations simultaneously, freeing up the program to handle other tasks.
Concurrency is a programming paradigm that allows multiple operations to be performed out of sequence, such as asynchronously (at a later time) or simultaneously (in parallel).
Concurrent file I/O can allow a program to perform file I/O later in the background, while the program continues on with other tasks.
It may also reduce the burden of an I/O-bound program and perhaps allow it to transition to becoming CPU-bound or constrained by a resource other than disk access and transfer speeds.
Concurrency is important when performing file I/O for 5 main reasons:
- Improved performance
- Better resource utilization
- Parallelism and responsiveness
- Avoiding blocking and deadlocks
- Scalability
You can learn more about how to use Python concurrency in the tutorial:
We don’t have to adopt concurrency when performing file I/O, but not doing so can negatively impact the performance of our programs.
As a result, we may experience several consequences, such as:
- Reduced performance: Without utilizing concurrency, our file I/O operations may be slower and less efficient.
- Resource underutilization: Without concurrency, our program may not be making optimal use of system resources.
- Limited scalability: Without concurrency, our file I/O operations may have constrained scalability.
- Poor responsiveness: Lack of concurrency in file I/O can impact the responsiveness of our application.
We now know about the importance of concurrency for performing faster file I/O.
You can learn more about this in the tutorial:
Lesson 02: One-off Background File I/O Tasks
Hi, we can run one-off file I/O tasks in the background using a new thread or a new child process.
A one-off file operation is a single file I/O task performed on demand.
For example, we may need to load some data from a file, store state to a file, or append some data to a file.
The task can be dispatched and any error handling or notifications can be handled by the task itself.
Completing a one-off file I/O task in the background assumes some properties of the task, such as:
- Independence. The task does not depend upon another task and is not depended upon by another task, meaning the program is not waiting for the file IO operation to complete.
- Has Data. All data required by the task is provided to the task upfront, e.g. the data to save in the file and the name and path of the file where it is saved.
- Error Handling. The task should manage its own error handling, such as handling and reporting exceptions as the program does not explicitly depend upon the task.
- Isolated. There is only one or a very small number of tasks to perform, e.g. one-off. This means that we are not referring to batch file IO operations, like loading all files in a directory.
One-off file IO operations can be performed in the background of the main program.
This can be achieved using concurrency, such as a new thread.
The example below starts a new thread to save data to a file in the background.
We can use this as a template for background file I/O tasks in our own programs.
# SuperFastPython.com
# example of writing a file in the background with a thread
import threading

# task to write a file
def write_file(filename, data):
    # open a file for writing text
    with open(filename, 'w', encoding='utf-8') as handle:
        # write string data to the file
        handle.write(data)
    # report a message
    print(f'File saved: {filename}')

# protect the entry point
if __name__ == '__main__':
    # report a message
    print('Main is running')
    # save the file
    print('Saving file...')
    # create and configure the new thread
    thread = threading.Thread(target=write_file, args=('data.txt', 'Hello world'))
    # start running the new thread in the background
    thread.start()
    # report a message
    print('Main is done.')
Run the example.
Change the program to save more and different data and have the main program perform other tasks in the foreground.
Let me know what you come up with.
You can learn more about running file I/O tasks in the background in the tutorial:
Lesson 03: Patterns for Concurrent File I/O
Hi, we can use reusable programming patterns when speeding up file I/O with concurrency in Python.
We typically need to perform the same file I/O task on many files, for example:
- Delete all files in a directory.
- Copy all files from one directory to another.
- Save each in-memory dataset as a separate file.
These tasks are both independent and collectively slow to perform.
Naively these tasks are performed sequentially, one after the other.
If we perform the operations concurrently instead, the overall run time may be reduced by a factor related to the number of CPU cores or the number of concurrent tasks, depending on the limiting factors of the task itself and the hardware on which it is run.
There are 4 reusable programming patterns for concurrent file I/O we may consider:
- Pattern 1: Single Worker
- Pattern 2: Pools of Workers
- Pattern 3: Batched Tasks And Pools of Workers
- Pattern 4: Pools of Workers within Each Worker
Let’s take a closer look at each pattern in turn.
Pattern 1: Single Worker
The simplest pattern is to create a single concurrent worker on demand for a file IO operation.
This approach would only be helpful in those cases where a single file IO operation can be executed safely in the background while the program continues on with other tasks not dependent on the file IO operation.
Pattern 2: Pools of Workers
We might consider using a pool of workers to complete a suite of similar file I/O operations.
Pools of workers are provided in Python using the modern executor framework, including the ThreadPoolExecutor and the ProcessPoolExecutor.
A file I/O task can be defined as a pure function that is given the data it needs as arguments and returns any loaded data.
Using a pool of workers allows the program to issue tasks and have them queued until a worker in the pool is available to consume and execute the task.
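As a rough illustration, a minimal sketch of this pattern is shown below. The load_size() task and the tmp/ directory of files are assumptions for illustration, not code from the lessons.

# minimal sketch of Pattern 2: a pool of workers, one task per file
import os
import concurrent.futures

# hypothetical per-file task: load one file and return its size in characters
def load_size(filepath):
    with open(filepath, 'r') as handle:
        return filepath, len(handle.read())

if __name__ == '__main__':
    # assumed directory of files to process
    paths = [os.path.join('tmp', name) for name in os.listdir('tmp')]
    # tasks are queued and consumed by worker threads as they become free
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        for filepath, size in exe.map(load_size, paths):
            print(f'.loaded {filepath} ({size} characters)')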
Pattern 3: Batched Tasks And Pools of Workers
There may be efficiency gains to be had in grouping tasks into batches when executing them with a pool of workers.
Many file I/O tasks may require data in order to be completed or may result in data being loaded.
If the tasks are performed within a pool of workers, then there may be some overhead in transmitting data from the main program to the workers for each task, and in transmitting data back from the workers to the main program.
We may overcome some of this overhead by executing file I/O tasks in batches.
This might involve configuring the pool of workers itself to automatically collect groups of tasks into chunks for transmission to workers.
Alternatively, we may manually chunk the tasks, e.g. create batches of file names to load or delete, then issue each batch of tasks to a worker to process.
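For example, a minimal sketch of manual batching might look like the following; the load_batch() helper, the batch size, and the tmp/ directory are assumptions for illustration.

# minimal sketch of Pattern 3: batches of file tasks issued to a pool of workers
import os
import concurrent.futures

# hypothetical per-batch task: load every file in a batch and return the contents
def load_batch(filepaths):
    contents = list()
    for filepath in filepaths:
        with open(filepath, 'r') as handle:
            contents.append(handle.read())
    return contents

if __name__ == '__main__':
    # assumed directory of files and an assumed batch size
    paths = [os.path.join('tmp', name) for name in os.listdir('tmp')]
    batch_size = 1000
    # manually chunk the file paths into batches
    batches = [paths[i:i+batch_size] for i in range(0, len(paths), batch_size)]
    # one task per batch reduces the per-task transmission overhead
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        futures = [exe.submit(load_batch, batch) for batch in batches]
        for future in concurrent.futures.as_completed(futures):
            print(f'.loaded a batch of {len(future.result())} files')

Alternatively, the chunksize argument to ProcessPoolExecutor.map() (or multiprocessing.Pool.map()) asks the pool itself to group submitted tasks into chunks before sending them to the workers.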
Pattern 4: Pools of Workers within Each Worker
We can take these patterns one step further by having a pool of workers within each worker in a pool of workers.
Not all variations of this pattern make sense.
A good use of this pattern is to have a pool of worker processes, where each worker process in the pool manages its own pool of worker threads.
In this way, a batch of file IO tasks can be issued to each worker in the pool for parallel execution and each batch is executed concurrently using threads.
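A compact sketch of this structure might look like the following; the pool sizes, the batch size, and the tmp/ directory are assumptions, and Lesson 06 develops a complete version of this approach.

# minimal sketch of Pattern 4: a thread pool inside each process-pool worker
import os
import concurrent.futures

# thread-level task: load a single file
def load_file(filepath):
    with open(filepath, 'r') as handle:
        return handle.read()

# process-level task: load a batch of files using its own thread pool
def load_batch(filepaths):
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        return list(exe.map(load_file, filepaths))

if __name__ == '__main__':
    # assumed directory of files, split into assumed batches of 1,000 paths
    paths = [os.path.join('tmp', name) for name in os.listdir('tmp')]
    batches = [paths[i:i+1000] for i in range(0, len(paths), 1000)]
    # each worker process handles one batch and uses threads within the batch
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        for contents in exe.map(load_batch, batches):
            print(f'.loaded a batch of {len(contents)} files')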
We now know about patterns for concurrent file IO in Python.
It is critical that we perform experiments to discover which type of concurrency works best for our specific file I/O operations and systems.
You can learn more about concurrent file I/O patterns in the tutorial:
Lesson 04: Concurrent File Writing
Hi, saving a single file to disk can be slow.
It can become painfully slow in situations where we may need to save thousands of files. The hope is that concurrency can be used to speed up file saving.
Consider a CSV file with 10 values per line and 10,000 lines.
It can take about 200 seconds to save 10,000 files of this type sequentially, one by one.
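For reference, a sequential version of this workload might look like the following sketch; the data generation and file naming here are simplified assumptions, not the exact code used for the timing above.

# sketch of a sequential baseline: generate and save 10,000 CSV files one by one
import os

# generate one CSV file's worth of data: 10 values per line, 10,000 lines
def generate_data(identifier, n_values=10, n_lines=10000):
    lines = [','.join(str(identifier+i+j) for j in range(n_values)) for i in range(n_lines)]
    return '\n'.join(lines)

if __name__ == '__main__':
    # create a local directory to save files
    os.makedirs('tmp', exist_ok=True)
    # generate and save each file in turn
    for i in range(10000):
        with open(os.path.join('tmp', f'data-{i:04d}.csv'), 'w') as handle:
            handle.write(generate_data(i))
    print('Done')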
There are many ways to save files to disk concurrently.
For example, we could use a thread pool or a process pool and each task could save one file or a batch of files.
In this case, we will use a process pool via the ProcessPoolExecutor class and save one file per task.
We will use a pool of 8 worker processes. Each task will save one file with a unique filename and unique data.
All files will be saved in a tmp/ directory under the current directory where we run the program.
The complete example is listed below.
# SuperFastPython.com
# save many data files concurrently with a process pool
import os
import os.path
import concurrent.futures
import time

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a csv file with v=10 values per line and l=10k lines
def generate_file(identifier, n_values=10, n_lines=10000):
    # generate many lines of data
    data = list()
    for i in range(n_lines):
        # generate a list of numeric values as strings
        line = [str(identifier+i+j) for j in range(n_values)]
        # convert list to string with values separated by commas
        data.append(','.join(line))
    # convert list of lines to a string separated by new lines
    return '\n'.join(data)

# generate and save a file
def generate_and_save(path, identifier):
    # generate data
    data = generate_file(identifier)
    # create a unique filename
    filepath = os.path.join(path, f'data-{identifier:04d}.csv')
    # save data file to disk
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}', flush=True)

# save many data files in a directory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the local dir to save files
    path = 'tmp'
    # define the number of files to save
    n_files = 10000
    # create a local directory to save files
    os.makedirs(path, exist_ok=True)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        # submit tasks to generate files
        _ = [exe.submit(generate_and_save, path, i) for i in range(n_files)]
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 71 seconds to complete.
Depending on the speed of your hardware, it should be about 2.5x to 3x faster than saving the files sequentially.
Try varying the number of worker processes and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file saving in the tutorial:
Lesson 05: Concurrent File Copying
Hi, copying files is typically really slow.
We can explore how to speed up file copying using concurrency.
In the previous lesson, we saw how we can create many files on disk.
Re-run the program from the previous lesson to create a tmp/ directory with many files.
We can then develop a program to copy all 10,000 files from one directory to another.
If we use the shutil.copy() function to copy all files sequentially from the tmp/ directory to a new directory named tmp2/, it will take about 10 seconds.
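For reference, a sequential baseline might be as simple as the following sketch, assuming the tmp/ and tmp2/ directory names used in this lesson.

# sketch of a sequential baseline: copy all files one by one
import os
import shutil

if __name__ == '__main__':
    # source and destination directories for this lesson
    src = 'tmp'
    dest = 'tmp2'
    # create the destination directory if needed
    os.makedirs(dest, exist_ok=True)
    # copy each file from the source directory to the destination directory
    for name in os.listdir(src):
        shutil.copy(os.path.join(src, name), dest)
    print('Done')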
We can speed this up by using a thread pool.
Each task will copy one file from the source directory into the destination directory.
We can then issue one task per file in the source directory.
The example below implements this with a thread pool of 100 workers.
# SuperFastPython.com
# copy files from one directory to another concurrently with threads
import os
import os.path
import shutil
import concurrent.futures
import time

# copy a file from source to destination
def copy_file(src_path, dest_dir):
    # copy source file to dest file
    dest_path = shutil.copy(src_path, dest_dir)
    # report progress
    print(f'.copied {src_path} to {dest_path}')

# copy files from src to dest
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the source and destination dirs
    src = 'tmp'
    dest = 'tmp2'
    # create the destination directory if needed
    os.makedirs(dest, exist_ok=True)
    # create full paths for all files we wish to copy
    files = [os.path.join(src, name) for name in os.listdir(src)]
    # create the thread pool
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        # submit all copy tasks
        _ = [exe.submit(copy_file, path, dest) for path in files]
    print('Done')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 3.7 seconds to complete.
Depending on the speed of your hardware, it should be about 2x to 2.5x faster than copying the files sequentially.
Try varying the number of worker threads and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file copying in the tutorial:
Lesson 06: Concurrent File Reading
Hi, loading a single file from disk can be slow.
It can become painfully slow in situations where we may need to load thousands of files. The hope is that concurrency can be used to speed up file loading.
In a previous lesson, we saw how we can create many files on disk.
Re-run the program from a previous lesson to create a tmp/ directory with many files.
We can then develop a program to load all files.
Loading files is generally faster than saving them. Loading all 10,000 files sequentially should take about 2.5 seconds to complete.
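For reference, a sequential loading baseline might look like this sketch, assuming the tmp/ directory of files created in Lesson 04.

# sketch of a sequential baseline: load all files one by one
import os

if __name__ == '__main__':
    # directory of files created in Lesson 04
    path = 'tmp'
    # load the contents of every file in the directory into memory
    data = list()
    for name in os.listdir(path):
        with open(os.path.join(path, name), 'r') as handle:
            data.append(handle.read())
    print(f'Loaded {len(data)} files')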
We can develop a concurrent version that is more than 2x faster.
This is a more advanced approach that uses a process pool, where each worker process in the pool uses its own thread pool. Each worker thread then loads a single file.
This involves two parts.
The first part involves splitting the list of files to load into chunks. Each worker process will load one chunk of files. All chunks will be processed concurrently.
The second part involves the task that performs the loading. It will start a new thread pool with one worker thread per file to load. All files can then be loaded concurrently.
The complete example is listed below.
# SuperFastPython.com
# load many files concurrently with processes and threads in batch
import os
import os.path
import concurrent.futures
import time

# load a file and return the contents
def load_file(filepath):
    # open the file
    with open(filepath, 'r') as handle:
        # return the contents
        return handle.read()

# return the contents of many files
def load_files(filepaths):
    # create a thread pool
    with concurrent.futures.ThreadPoolExecutor(len(filepaths)) as exe:
        # load files
        futures = [exe.submit(load_file, name) for name in filepaths]
        # collect data
        data_list = [future.result() for future in futures]
        # return data and file paths
        return (data_list, filepaths)

# load all files in a directory into memory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the path of the files to load
    path = 'tmp'
    # prepare all of the paths
    paths = [os.path.join(path, filepath) for filepath in os.listdir(path)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(paths) / n_workers)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(n_workers) as exe:
        futures = list()
        # split the load operations into chunks
        for i in range(0, len(paths), chunksize):
            # select a chunk of filenames
            filepaths = paths[i:(i + chunksize)]
            # submit the task
            future = exe.submit(load_files, filepaths)
            futures.append(future)
        # process all results
        for future in concurrent.futures.as_completed(futures):
            # get the loaded data and the file paths
            _, filepaths = future.result()
            for filepath in filepaths:
                # report progress
                print(f'.loaded {filepath}')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')
Run the program and note how long it takes to execute.
On my system, it takes about 1.4 seconds to complete.
Depending on the speed of your hardware, it should be about 1.5x to 2x faster than loading the files sequentially.
Try varying the number of worker processes and worker threads and see if it makes a difference to the run time.
Let me know what you discover.
You can learn more about concurrent file loading in the tutorial:
Lesson 07: Concurrent File Appending
Hi, appending to a file from multiple threads is not thread-safe and can result in overwritten data and a corrupted file.
We can explore how to safely append to a file from multiple threads.
One approach to safely append to a file from multiple threads is by using a mutual exclusion lock.
Python provides a mutual exclusion lock, also called a mutex, via the threading.Lock class.
A lock can be created and passed to each task executed in a new thread.
Then, before any thread attempts to write to the file, it must acquire the lock. Only one thread can hold the lock at a time, and if writes to the file only occur while the lock is held, then only one thread can write to the file at a time.
This makes writing to the file thread-safe and avoids any corruption of the written messages.
Tying this together, the complete example of writing to file from multiple threads using a lock is listed below.
# SuperFastPython.com
# write to a file from multiple threads using a shared lock
import time
import random
import threading
import concurrent.futures

# task that blocks for a moment and writes to a file
def task(name, lock, file_handle):
    # block for a moment
    value = random.random()
    time.sleep(value)
    # obtain the lock
    with lock:
        # write to a file
        file_handle.write(f'Task {name} blocked for {value} of a second.\n')

# write to a file from multiple threads
if __name__ == '__main__':
    # open a file
    with open('results.txt', 'a', encoding='utf-8') as handle:
        # create a lock to protect the file
        lock = threading.Lock()
        # create the thread pool
        with concurrent.futures.ThreadPoolExecutor(1000) as exe:
            # dispatch all tasks
            _ = [exe.submit(task, i, lock, handle) for i in range(10000)]
        # wait for all tasks to complete
        print('Done.')
Run the program and note how long it takes to execute.
Review the created file with the name “results.txt” and confirm that it contains 10,000 lines of data without corruption.
Try varying the number of worker threads and the number of tasks that write lines to the file.
Try the same program with and without the use of the lock and compare the results.txt file.
Let me know what you discover.
You can learn more about concurrent file appending in the tutorial:
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Guides
Python File I/O APIs
- Built-in Functions
- os - Miscellaneous operating system interfaces
- os.path - Common pathname manipulations
- shutil - High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
File I/O in Asyncio
References
Thank You
Hi, thank you for letting me help you learn more about concurrent File I/O.
If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.
If you enjoyed this course and want to know more, check out my new book:
This book distills only what you need to know to get started and be effective with concurrent File I/O, super fast.
The book is divided into 15 chapters spread across 3 parts. It contains everything you need to get started, then get really good at concurrent File I/O, all in one book.
It’s exactly how I would teach you concurrent File I/O if we were sitting together, pair programming.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Did you enjoy this course?
Let me know in the comments below.
Photo by Angel Rodarte on Unsplash