Concurrent File I/O 7-Day Course

October 18, 2023

File I/O can be so much faster with the careful use of concurrency.

This includes using Python threads and processes to read, write, and copy files, as well as how to safely append to a file from multiple threads.

This series provides a 7-day crash course in concurrency for File I/O.

You will get a taste of what is possible and, hopefully, skills that you can bring to your next project.

Let's get started.

Concurrent File I/O Overview

Hi, thanks for your interest in Concurrent File I/O.

You have a lot of fun ahead, including:

You can download a .zip of all the source code for this course here:

Your first lesson in this series is just below.

Until then, a quick question:

Why are you interested in Concurrent File I/O?
Let me know in the comments below.

Lesson 01: Faster File I/O With Concurrency

Hi, file I/O operations are inherently slow compared to working with data in main memory.

File I/O operations are essential in computer programming for tasks that involve data persistence, file manipulation, and data exchange with external systems.

This describes nearly all of the Python programs that we write!

The performance of file I/O is constrained by the underlying hardware of the hard drive, resulting in significantly slower execution times when reading from or writing to files.

There are several reasons why file I/O operations can be slow:

You can learn more about how to perform common file I/O tasks in Python in the tutorial:

The performance gap creates an opportunity for optimizing file I/O tasks with concurrency, which involves performing multiple file I/O operations simultaneously, freeing up the program to handle other tasks.

Concurrency is a programming paradigm that allows multiple operations to be performed out of sequence, such as asynchronously (at a later time) or simultaneously (in parallel).

Concurrent file I/O can allow a program to perform file I/O later in the background, while the program continues on with other tasks.

It may also reduce the burden of an I/O-bound program and perhaps allow it to transition to becoming CPU-bound or constrained by a resource other than disk access and transfer speeds.

Concurrency is important when performing file I/O for 5 main reasons:

  1. Improved performance
  2. Better resource utilization
  3. Parallelism and responsiveness
  4. Avoiding blocking and deadlocks
  5. Scalability

You can learn more about how to use Python concurrency in the tutorial:

We don't have to adopt concurrency when performing file I/O, but not doing so can negatively impact the performance of our programs.

As such, we may experience several consequences, such as:

We now know about the importance of concurrency for performing faster file I/O.

You can learn more about this in the tutorial:

Lesson 02: One-off Background File I/O Tasks

Hi, we can run one-off file IO tasks in the background using a new thread or new child process.

A one-off file operation is an operation that is one file I/O task performed on demand.

For example, we may need to load some data from a file, store state to a file, or append some data to a file.

The task can be dispatched and any error handling or notifications can be handled by the task itself.

For a one-off file IO task to be completed in the background, it assumes some properties of the task, such as:

One-off file IO operations can be performed in the background of the main program.

This can be achieved using concurrency, such as a new thread.

The example below starts a new thread to save data to a file in the background.

We can use this as a template for background file I/O tasks in our own programs.

# SuperFastPython.com
# example of writing a file in the background with a thread
import threading

# task to write a file
def write_file(filename, data):
    # open a file for writing text
    with open(filename, 'w', encoding='utf-8') as handle:
        # write string data to the file
        handle.write(data)
    # report a message
    print(f'File saved: {filename}')

# protect the entry point
if __name__ == '__main__':
    # report a message
    print('Main is running')
    # save the file
    print('Saving file...')
    # create and configure the new thread
    thread = threading.Thread(target=write_file, args=('data.txt', 'Hello world'))
    # start running the new thread in the background
    thread.start()
    # report a message
    print('Main is done.')

Run the example.

Change the program to save more and different data and have the main program perform other tasks in the foreground.

Let me know what you come up with.

You can learn more about running file I/O tasks in the background in the tutorial:

Lesson 03: Patterns for Concurrent File I/O

Hi, we can use concurrent programming patterns when speeding up file IO using concurrency in Python.

We typically need to perform the same file I/O task on many files, for example:

These tasks are both independent and collectively slow to perform.

Naively these tasks are performed sequentially, one after the other.

If we can perform the operations using concurrency, the overall time may be reduced by a factor of the number of CPU cores or the number of concurrent tasks, depending on the limiting factors of the task itself and the hardware on which it is run.

There are 4 reusable programming patterns for concurrent file I/O we may consider:

  1. Single worker
  2. Pools of workers
  3. Batched tasks and pools of workers
  4. Pools of workers within each worker

Let's take a closer look at each pattern in turn.

Pattern 1: Single Worker

The simplest pattern is to create a single concurrent worker on demand for a file IO operation.

This approach would only be helpful in those cases where a single file IO operation can be executed safely in the background while the program continues on with other tasks not dependent on the file IO operation.

Pattern 2: Pools of Workers

We might consider using a pool of workers to complete a suite of similar file I/O operations.

Pools of workers are provided in Python using the modern executor framework, including the ThreadPoolExecutor and the ProcessPoolExecutor.

File IO tasks can be defined as a pure function that is given the data it needs as an argument and provides any loaded data via a return value.

Using a pool of workers allows the program to issue tasks and have them queued until a worker in the pool is available to consume and execute the task.

Pattern 3: Batched Tasks And Pools of Workers

There may be efficiency gains to be had in grouping tasks into batches when executing them with a pool of workers.

Many file I/O tasks may require data in order to be completed or may result in data being loaded.

If the tasks are performed within a pool of workers, then there may be some overhead in transmitting data from the main program to the workers for each task, and in transmitting data back from workers to the main program.

We may overcome some of this overhead by executing file I/O tasks in batches.

This might involve configuring the pool of workers itself to automatically collect groups of tasks into chunks for transmission to workers.

Alternatively, we may manually chunk the tasks, e.g. create batches of file names to load or delete, then issue each batch of tasks to a worker to process.

Pattern 4: Pools of Workers within Each Worker

We can take these patterns one step further by having a pool of workers within each worker in a pool of workers.

Not all variations of this pattern make sense.

A good use of this pattern is to have a pool of worker processes, where each worker process in the pool manages its own pool of worker threads.

In this way, a batch of file IO tasks can be issued to each worker in the pool for parallel execution and each batch is executed concurrently using threads.

We now know about patterns for concurrent file IO in Python.

It is critical that we perform experiments in order to discover what type of concurrency works well or best for your specific file I/O operation and system.

You can learn more about concurrent file I/O patterns in the tutorial:

Lesson 04: Concurrent File Writing

Hi, saving a single file to disk can be slow.

It can become painfully slow in situations where we may need to save thousands of files. The hope is that concurrency can be used to speed up file saving.

Consider a CSV file with 10 values per line and 10,000 lines.

It can take about 200 seconds to save 10,000 files of this type sequentially, one by one.

There are many ways to save files to disk concurrently.

For example, we could use a thread pool or a process pool and each task could save one file or a batch of files.

In this case, we will use a process pool via the ProcessPoolExecutor class and save one file per task.

We will use a pool of 8 worker processes. Each task will save one file with a unique filename and unique data.

All files will be saved in a tmp/ directory under the current working directory where the program is run.

The complete example is listed below.

# SuperFastPython.com
# save many data files concurrently with a process pool
import os
import os.path
import concurrent.futures
import time

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a csv file with 10 values per line and 10,000 lines
def generate_file(identifier, n_values=10, n_lines=10000):
    # generate many lines of data
    data = list()
    for i in range(n_lines):
        # generate a list of numeric values as strings
        line = [str(identifier+i+j) for j in range(n_values)]
        # convert list to string with values separated by commas
        data.append(','.join(line))
    # convert list of lines to a string separated by new lines
    return '\n'.join(data)

# generate and save a file
def generate_and_save(path, identifier):
    # generate data
    data = generate_file(identifier)
    # create a unique filename
    filepath = os.path.join(path, f'data-{identifier:04d}.csv')
    # save data file to disk
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}', flush=True)

# save many data files in a directory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the local dir to save files
    path = 'tmp'
    # define the number of files to save
    n_files = 10000
    # create a local directory to save files
    os.makedirs(path, exist_ok=True)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(8) as exe:
        # submit tasks to generate files
        _ = [exe.submit(generate_and_save, path, i) for i in range(n_files)]
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')

Run the program and note how long it takes to execute.

On my system, it takes about 71 seconds to complete.

Depending on the speed of your hardware it should be about 2.5x to 3x faster than saving the files sequentially.

Try varying the number of worker processes and see if it makes a difference to the run time.

Let me know what you discover.

You can learn more about concurrent file saving in the tutorial:

Lesson 05: Concurrent File Copying

Hi, copying files is typically really slow.

We can explore how to speed up file copying using concurrency.

In the previous lesson, we saw how we can create many files on disk.

Re-run the program from the previous lesson to create a tmp/ directory with many files.

We can then develop a program to copy all 10,000 files from one directory to another.

If we use the shutil.copy() function to copy all files sequentially from the tmp/ directory to a new directory named tmp2/ it will take about 10 seconds.

We can speed this up by using a thread pool.

Each task will copy one file from the source directory into the destination directory.

We can then issue one task per file in the source directory.

The example below implements this with a thread pool of 100 workers.

# SuperFastPython.com
# copy files from one directory to another concurrently with threads
import os
import os.path
import shutil
import concurrent.futures
import time

# copy a file from source to destination
def copy_file(src_path, dest_dir):
    # copy source file to dest file
    dest_path = shutil.copy(src_path, dest_dir)
    # report progress
    print(f'.copied {src_path} to {dest_path}')

# copy files from src to dest
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the source and destination dirs
    src = 'tmp'
    dest = 'tmp2'
    # create the destination directory if needed
    os.makedirs(dest, exist_ok=True)
    # create full paths for all files we wish to copy
    files = [os.path.join(src,name) for name in os.listdir(src)]
    # create the thread pool
    with concurrent.futures.ThreadPoolExecutor(100) as exe:
        # submit all copy tasks
        _ = [exe.submit(copy_file, path, dest) for path in files]
    print('Done')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')

Run the program and note how long it takes to execute.

On my system, it takes about 3.7 seconds to complete.

Depending on the speed of your hardware it should be about 2x to 2.5x faster than copying the files sequentially.

Try varying the number of worker threads and see if it makes a difference to the run time.

Let me know what you discover.

You can learn more about concurrent file copying in the tutorial:

Lesson 06: Concurrent File Reading

Hi, loading a single file from disk can be slow.

It can become painfully slow in situations where you may need to load thousands of files. The hope is that concurrency can be used to speed up file loading.

In a previous lesson, we saw how we can create many files on disk.

Re-run the program from a previous lesson to create a tmp/ directory with many files.

We can then develop a program to load all files.

Loading files is faster than saving in general. Loading all 10,000 files sequentially should take about 2.5 seconds to complete.

We can develop a concurrent version that is more than 2x faster.

This is an advanced approach that uses a process pool where each worker in the process pool uses a thread pool. Each worker in the thread pool then loads a file.

This involves two parts.

The first part involves splitting the number of files to load into chunks. Each worker process will load one chunk of files. All chunks will be executed concurrently.

The second part involves the task that performs the loading. It will start a new thread pool with one worker thread per file to load. All files can then be loaded concurrently.

The complete example is listed below.

# SuperFastPython.com
# load many files concurrently with processes and threads in batch
import os
import os.path
import concurrent.futures
import time

# load a file and return the contents
def load_file(filepath):
    # open the file
    with open(filepath, 'r') as handle:
        # return the contents
        return handle.read()

# return the contents of many files
def load_files(filepaths):
    # create a thread pool
    with concurrent.futures.ThreadPoolExecutor(len(filepaths)) as exe:
        # load files
        futures = [exe.submit(load_file, name) for name in filepaths]
        # collect data
        data_list = [future.result() for future in futures]
        # return data and file paths
        return (data_list, filepaths)

# load all files in a directory into memory
if __name__ == '__main__':
    # record the start time
    time_start = time.perf_counter()
    # define the path to load files
    path = 'tmp'
    # prepare all of the paths
    paths = [os.path.join(path, filepath) for filepath in os.listdir(path)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(paths) / n_workers)
    # create the process pool
    with concurrent.futures.ProcessPoolExecutor(n_workers) as exe:
        futures = list()
        # split the load operations into chunks
        for i in range(0, len(paths), chunksize):
            # select a chunk of filenames
            filepaths = paths[i:(i + chunksize)]
            # submit the task
            future = exe.submit(load_files, filepaths)
            futures.append(future)
        # process all results
        for future in concurrent.futures.as_completed(futures):
            # open the file and load the data
            _, filepaths = future.result()
            for filepath in filepaths:
                # report progress
                print(f'.loaded {filepath}')
    # calculate the duration
    time_duration = time.perf_counter() - time_start
    # report the duration
    print(f'Took {time_duration} seconds')

Run the program and note how long it takes to execute.

On my system, it takes about 1.4 seconds to complete.

Depending on the speed of your hardware it should be about 1.5x to 2x faster than loading the files sequentially.

Try varying the number of worker processes and worker threads and see if it makes a difference to the run time.

Let me know what you discover.

You can learn more about concurrent file loading in the tutorial:

Lesson 07: Concurrent File Appending

Hi, appending to a file from multiple threads is not thread-safe and can result in overwritten data and file corruption.

We can explore how to append to a file from multiple threads.

One approach to safely append to a file from multiple threads is by using a mutual exclusion lock.

Python provides a mutual exclusion lock, also called a mutex, via the threading.Lock class.

A lock can be created and passed to each task executed in a new thread.

Then, before any thread attempts to write to the file it must obtain the lock. Only one thread can have the lock at any one time, and if writes to the file can only occur when the lock is held, then only one thread can write to the file at a time.

This will make writing to the file thread safe and will avoid any corruption of messages.

Tying this together, the complete example of writing to file from multiple threads using a lock is listed below.

# SuperFastPython.com
# write to a file from multiple threads using a shared lock
import time
import random
import threading
import concurrent.futures

# task that blocks for a moment and write to a file
def task(name, lock, file_handle):
    # block for a moment
    value = random.random()
    time.sleep(value)
    # obtain the lock
    with lock:
        # write to a file
        file_handle.write(f'Task {name} blocked for {value} seconds.\n')

# write to a file from multiple threads
if __name__ == '__main__':
    # open a file
    with open('results.txt', 'a', encoding='utf-8') as handle:
        # create a lock to protect the file
        lock = threading.Lock()
        # create the thread pool
        with concurrent.futures.ThreadPoolExecutor(1000) as exe:
            # dispatch all tasks
            _ = [exe.submit(task, i, lock, handle) for i in range(10000)]
    # all tasks are completed once the pool context manager exits
    print('Done.')

Run the program and note how long it takes to execute.

Review the created file with the name "results.txt" and confirm that it contains 10,000 lines of data without corruption.

Try varying the number of threads and the number of lines written by each thread.

Try the same program with and without the use of the lock and compare the results.txt file.

Let me know what you discover.

You can learn more about concurrent file appending in the tutorial:

Thank You

Hi, thank you for letting me help you learn more about concurrent File I/O.

If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.

If you enjoyed this course and want to know more, check out my new book:

This book distills only what you need to know to get started and be effective with concurrent File I/O, super fast.

The book is divided into 15 chapters spread across 3 parts. It contains everything you need to get started, then get really good at concurrent File I/O, all in one book.

It's exactly how I would teach you concurrent File I/O if we were sitting together, pair programming.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Did you enjoy this course?
Let me know in the comments below.


