Processes are slower at transmitting data than threads.
The rationale is that all data transmitted between processes requires the use of inter-process communication, whereas threads can directly access shared memory.
We can design and run a controlled experiment to explicitly measure how much slower data transmission is between processes than between threads.
In this tutorial, you will discover the speed difference for data transmission when using processes vs threads.
Let’s get started.
Threads Share Data Faster Than Processes
Python offers both thread-based and process-based concurrency.
Threads are provided via the threading module. The threading.Thread class can execute a target function in another thread.
This can be achieved by creating an instance of the threading.Thread class and specifying the target function to execute via the target keyword argument. The thread can then be started by calling the start() method, and it will execute the target function in another thread.
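For example, a minimal sketch might look like the following (the task function and its message argument are illustrative, not part of any API):

# minimal sketch of running a function in a new thread
from threading import Thread

# target function executed in another thread
def task(message):
    print(message)

# create a thread configured to execute the target function
thread = Thread(target=task, args=('Hello from a new thread',))
# start the thread
thread.start()
# wait for the thread to finish
thread.join()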
Processes are provided via the multiprocessing module. The multiprocessing.Process class can execute a target function in another process.
The multiprocessing.Process class has much the same API as the threading.Thread class and can be used to execute a target function in a child process.
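For example, a minimal sketch of the process-based equivalent might look like the following (again, the task function is illustrative; note the entry-point guard required for process-based concurrency):

# minimal sketch of running a function in a child process
from multiprocessing import Process

# target function executed in another process
def task(message):
    print(message)

# entry point, protects against re-execution when spawning
if __name__ == '__main__':
    # create a process configured to execute the target function
    process = Process(target=task, args=('Hello from a child process',))
    # start the process
    process.start()
    # wait for the process to finish
    process.join()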
You can learn more about the similarities and differences between threads and processes in a separate tutorial.
It is generally held that transferring data between threads is faster than transferring data between processes.
Transferring data between threads is fast because threads share memory. In fact, threads don’t need to transfer data, they can just access the same shared memory.
Transferring data between processes is slow because data must be transmitted using inter-process communication. This means that Python objects must be serialized, transmitted, received, and deserialized. This adds a computationally expensive overhead.
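We can get a rough feel for the serialization part of this overhead in isolation with a small sketch that pickles and unpickles a large list (the object size here is illustrative; multiprocessing uses the pickle module internally by default):

# minimal sketch of the cost of serializing a large object
import pickle
from time import time

# create a large list of integers
data = [i for i in range(1000000)]
# time a serialize/deserialize round trip
time_start = time()
payload = pickle.dumps(data)
result = pickle.loads(payload)
time_end = time()
# report the duration of the round trip
print(f'Round trip took {time_end - time_start:.3f} seconds')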
But how much slower is this overhead of inter-process communication in Python?
We can design an experiment to explore this question.
How to Benchmark Data Transfers for Threads and Processes
How can we benchmark data transfers consistently between threads and between processes?
This is a challenging experiment to perform.
As noted, threads do not need to transfer data. Instead, they can access the same shared memory.
Processes do not share memory; instead, data must be transmitted explicitly, or implicitly via simulated shared memory.
We could benchmark data transfers for processes alone, but without a comparison to threads, we won’t know if the benchmark times are good or not.
One approach we can use is to share data explicitly between the units of computation using a shared queue.
One thread or process can produce a Python object and put it on the shared queue. Another thread or process can consume the Python object from the queue. This process can then be repeated many times and timed with threads and processes.
Using a queue to transmit data adds overhead to threads and processes. Specifically, it involves internal locking for thread and process safety as well as any internal objects created to represent the data on the queue and operations on internal data structures.
A queue for threads will be shared directly in memory, whereas a shared queue for processes will be backed by some inter-process communication mechanism, such as sockets and/or files and probably file-based locks. This introduces expected differences.
In fact, if the queue is hosted in a server process, then transmitted data may need to be serialized and deserialized more than once, depending on the details of the queue’s implementation.
We can also create the units of concurrency before the experiment so that the time of thread and process creation is excluded from the measurements. One way to do this is via a thread pool and a process pool. The workers can then be reused across each experimental run.
This too introduces overhead, from the internal objects managed within the pools and the functions called to issue tasks.
As stated, this is a challenging experiment to design.
Using the same architecture of a shared queue, producer and consumer tasks, and a pool of reusable workers allows the setup to be identical for transmitting data with threads and with processes. The only differences that remain are those specific to data and safety management, e.g. in-process vs. across-process.
Next, let’s develop the benchmark example of this architecture using processes.
Benchmark Sharing Data Between Processes
We can explore how long it takes to transmit data between processes in Python.
In this example, we will test the case of transmitting a list of one million integers from one process to another.
This will be achieved using a pool of worker processes and a queue that is created at the beginning of the experiment and reused for each experimental run. A producer task will create the data in one worker process and put it on the shared queue. A consumer task will retrieve the data in another worker process from the shared queue. This will then be repeated many times and timed.
Firstly, we can define a producer task that takes the shared queue as an argument, creates a large list of integers, and puts it on the queue.
# task to generate data and send to the consumer
def producer_task(queue):
    # generate data
    data = [i for i in range(1000000)]
    # send the data
    queue.put(data)
Next, we can define a consumer task that takes the shared queue as an argument and retrieves the shared object.
# task to consume data sent from producer
def consumer_task(queue):
    # retrieve the data
    data = queue.get()
We can define the experimental test task. This function takes the pool of workers, the shared queue, and the number of repeats of the experiment to perform.
For each repeat it issues the consumer task first, then the producer task, then waits for the consumer task to complete.
This ordering is important. It ensures that the consumer is already waiting for work before the producer starts, and that waiting for the consumer to complete can only finish once the data has been received.
We will use the multiprocessing.Pool class API which has a thread-equivalent API in the multiprocessing.pool.ThreadPool class.
# run a test and time how long it takes
def test(pool, queue, n_repeats):
    # repeat many times
    for i in range(n_repeats):
        # issue the consumer task
        consumer = pool.apply_async(consumer_task, args=(queue,))
        # issue the producer task
        producer = pool.apply_async(producer_task, args=(queue,))
        # wait for the consumer to get the data
        consumer.wait()
So far the code is generic for both threads and processes.
Next, we can develop the process-specific section that drives the example.
Firstly, we will set the spawn start method so that the benchmark is consistent and runs on all platforms.
...
# set the start method
set_start_method('spawn')
Next, we can create a shared pool of workers.
We will use the multiprocessing.Pool, although the ProcessPoolExecutor would work just as well.
...
# create the pool of workers
with Pool(2) as pool:
    # ...
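If you prefer the ProcessPoolExecutor, a sketch of the change might look like the following, assuming the same producer_task and consumer_task from above; note that submit() returns a Future, so the test function would wait via consumer.result() instead of consumer.wait():

...
# import the executor
from concurrent.futures import ProcessPoolExecutor
# create the pool of workers using an executor instead
with ProcessPoolExecutor(2) as executor:
    # issue tasks with submit() instead of apply_async()
    consumer = executor.submit(consumer_task, queue)
    producer = executor.submit(producer_task, queue)
    # wait on the returned Future instead of an AsyncResult
    consumer.result()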
You can learn more about the multiprocessing.Pool in a separate tutorial.
Next, we can create the queue to be shared with all workers.
We cannot share a queue directly with workers in the pool as this will result in an error.
Instead, we can host the queue in a server process and share proxy objects with worker-child processes for interacting with the queue.
This requires creating a manager and then using the manager to create the queue and return a proxy object that can be shared.
...
# create the manager
with Manager() as manager:
    # create the shared queue
    queue = manager.Queue()
You can learn more about multiprocessing queues in a separate tutorial.
You can learn more about sharing a queue with child processes via a manager in a separate tutorial.
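As an aside, another common pattern avoids the manager by passing a regular multiprocessing.Queue to each worker when the pool is created, via the initializer argument; the tasks would then read a module-level queue instead of taking it as an argument. A sketch of this alternative (the init_worker function is a hypothetical helper, not part of the multiprocessing API) might look like this:

...
# a regular multiprocessing queue, no manager required
from multiprocessing import Queue

# module-level reference to the shared queue, set in each worker
queue = None

# initializer run once in each new worker process
def init_worker(shared_queue):
    global queue
    queue = shared_queue

# create the queue and share it with workers at creation time
with Pool(2, initializer=init_worker, initargs=(Queue(),)) as pool:
    # tasks would then use the module-level queue directly
    # ...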
Next, we can perform the test.
This requires that we record the start time, run the test, then record the end time. We can then calculate and report the duration of the test and the duration per test run within the test.
We will perform the sharing action between child processes 1,000 times.
...
# record the start time
time_start = time()
# perform the test
n_repeats = 1000
test(pool, queue, n_repeats)
# record the end time
time_end = time()
# report the total time
duration = time_end - time_start
print(f'Total Time {duration} seconds')
# report estimated time per task
per_task = duration / n_repeats
print(f'About {per_task:.3} seconds per task')
Tying this together, the complete example is listed below.
# SuperFastPython.com
# example benchmark data transfer between child processes
from time import time
from multiprocessing import Pool
from multiprocessing import Manager
from multiprocessing import set_start_method

# task to generate data and send to the consumer
def producer_task(queue):
    # generate data
    data = [i for i in range(1000000)]
    # send the data
    queue.put(data)

# task to consume data sent from producer
def consumer_task(queue):
    # retrieve the data
    data = queue.get()

# run a test and time how long it takes
def test(pool, queue, n_repeats):
    # repeat many times
    for i in range(n_repeats):
        # issue the consumer task
        consumer = pool.apply_async(consumer_task, args=(queue,))
        # issue the producer task
        producer = pool.apply_async(producer_task, args=(queue,))
        # wait for the consumer to get the data
        consumer.wait()

# entry point
if __name__ == '__main__':
    # set the start method
    set_start_method('spawn')
    # create the pool of workers
    with Pool(2) as pool:
        # create the manager
        with Manager() as manager:
            # create the shared queue
            queue = manager.Queue()
            # record the start time
            time_start = time()
            # perform the test
            n_repeats = 1000
            test(pool, queue, n_repeats)
            # record the end time
            time_end = time()
            # report the total time
            duration = time_end - time_start
            print(f'Total Time {duration} seconds')
            # report estimated time per task
            per_task = duration / n_repeats
            print(f'About {per_task:.3} seconds per task')
Running the example first sets the spawn start method.
Next, the process pool is created with two workers.
The manager server process is created and is used to create the shared queue.
The experiment is run with 1,000 iterations using workers in the process pool and proxy objects for the shared queue.
In this case, the experiment of transferring data between child processes took about 161.917 seconds, which is about 2.7 minutes.
Dividing the total time by the number of transfers performed shows that it took about 162 milliseconds per transfer of the list of 1,000,000 integers. Recall, there are 1,000 milliseconds in one second.
Ouch, that does feel slow.
Recall that using the queue does impose a cost with internal objects and locks, as does the process pool.
Nevertheless, a large portion of those 162 milliseconds per transfer is spent on the serialization, transmission, and deserialization of the Python object.
Total Time 161.91740798950195 seconds
About 0.162 seconds per task
Next, let’s look at performing the same experiment using threads.
Benchmark Sharing Data Between Threads
We can update the example to use threads instead of child processes.
Firstly, we can use a multiprocessing.pool.ThreadPool instead of a multiprocessing.Pool. It has the same API and uses threads internally instead of processes.
...
# create the pool of workers
with ThreadPool(2) as pool:
    # ...
You can learn more about how to use the ThreadPool in a separate tutorial.
A manager is not needed in this case; instead, we can create a queue.Queue and share it directly with the worker threads.
...
# create the shared queue
queue = Queue()
You can learn more about thread-safe queues in a separate tutorial.
The rest of the example is the same and does not require modification.
The complete example with these changes is listed below.
# SuperFastPython.com
# example benchmark data transfer between threads
from time import time
from multiprocessing.pool import ThreadPool
from queue import Queue

# task to generate data and send to the consumer
def producer_task(queue):
    # generate data
    data = [i for i in range(1000000)]
    # send the data
    queue.put(data)

# task to consume data sent from producer
def consumer_task(queue):
    # retrieve the data
    data = queue.get()

# run a test and time how long it takes
def test(pool, queue, n_repeats):
    # repeat many times
    for i in range(n_repeats):
        # issue the consumer task
        consumer = pool.apply_async(consumer_task, args=(queue,))
        # issue the producer task
        producer = pool.apply_async(producer_task, args=(queue,))
        # wait for the consumer to get the data
        consumer.wait()

# entry point
if __name__ == '__main__':
    # create the pool of workers
    with ThreadPool(2) as pool:
        # create the shared queue
        queue = Queue()
        # record the start time
        time_start = time()
        # perform the test
        n_repeats = 1000
        test(pool, queue, n_repeats)
        # record the end time
        time_end = time()
        # report the total time
        duration = time_end - time_start
        print(f'Total Time {duration:.3} seconds')
        # report estimated time per task
        per_task = duration / n_repeats
        print(f'About {per_task:.3} seconds per task')
Running the example first creates the thread pool and the shared queue.
The experiment is then run with 1,000 iterations of transferring the 1,000,000 item list between threads in the thread pool.
The total time and the time per iteration are then reported.
In this case, we can see that the overall experiment took about 40.6 seconds to complete.
Dividing this figure by the number of iterations, 1,000, shows that the average transfer took about 40.6 milliseconds.
Total Time 40.6 seconds
About 0.0406 seconds per task
Next, let’s compare the speed of data transfers with processes vs threads.
Comparison of Data Transfer Speed With Processes vs Threads
Transferring data is slower between child processes than between threads.
This is the generally accepted belief, and it can be justified with simple logic: data passed between processes must be serialized and deserialized, whereas threads can access the same memory directly.
In this case, we can put actual numbers to the difference between data transfers with threads and processes.
The results show that, using the same architecture of workers and a shared queue, processes took about 161.917 seconds whereas threads took about 40.6 seconds to complete the experiment.
That is a difference of about 121.317 seconds (about 2 minutes) or 3.99x. This means that threads are nearly 4x faster at transferring data than processes, at least with the specific fixed overhead of using workers and a shared queue introduced in this experiment.
Looking at the time per transfer, we can see that processes took about 0.162 seconds per transfer, whereas threads took about 0.0406 seconds per transfer.
That is a difference of 0.121 seconds, or again a 3.99x speed difference.
Transferring data between processes should be kept to a minimum wherever possible, and these results highlight why: it is slow relative to transmitting data between threads.
Threads are well suited to I/O-bound tasks that often require transmitting data, e.g. loading data and sending it on for central processing, or collecting and storing results. Process-based concurrency is better suited to CPU-bound tasks, especially those that require little data transmission before or after the computation, ideally acquiring their own data as needed.
This was not a perfect comparison between threads and processes because of the use of pools of workers and queues, but the architecture of the experiment was consistent between processes and threads and it is a good first attempt.
If you have ideas on better-designed experiments or improvements over this experiment, please let me know in the comments below. I’m sure we can do better.
Further Reading
This section provides additional resources that you may find helpful.
Python Multiprocessing Books
- Python Multiprocessing Jump-Start, Jason Brownlee (my book!)
- Multiprocessing API Interview Questions
- Multiprocessing API Cheat Sheet
I would also recommend specific chapters in the books:
- Effective Python, Brett Slatkin, 2019.
- See: Chapter 7: Concurrency and Parallelism
- High Performance Python, Ian Ozsvald and Micha Gorelick, 2020.
- See: Chapter 9: The multiprocessing Module
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See: Chapter 14: Threads and Processes
Guides
- Python Multiprocessing: The Complete Guide
- Python Multiprocessing Pool: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
Takeaways
You now know the speed difference for data transmission when using processes vs threads.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Devon Janse van Rensburg on Unsplash