Processes are slower at transmitting data than threads.
The rationale is that all data transmitted between processes requires the use of inter-process communication, whereas threads can directly access shared memory.
We can design and run a controlled experiment to explicitly measure how much slower data transmission is between processes than between threads.
In this tutorial, you will discover the speed difference for data transmission when using processes vs threads.
Let’s get started.
Threads Share Data Faster Than Processes
Python offers both thread-based and process-based concurrency.
Threads are provided via the threading module. The threading.Thread class can execute a target function in another thread.
This can be achieved by creating an instance of the threading.Thread class and specifying the target function to execute via the target keyword argument. The thread can then be started by calling the start() method, and it will execute the target function in another thread.
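For example, a minimal sketch might look like the following (the task function and its message argument are illustrative, not part of any API):

# minimal sketch of running a function in a new thread
from threading import Thread

# target function executed in another thread
def task(message):
    print(message)

# create a thread configured to execute the target function
thread = Thread(target=task, args=('Hello from a new thread',))
# start the thread
thread.start()
# wait for the thread to finish
thread.join()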
Processes are provided via the multiprocessing module. The multiprocessing.Process class can execute a target function in another process.
The multiprocessing.Process class has much the same API as the threading.Thread class and can be used to execute a target function in a child process.
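For example, a minimal sketch of the process-based equivalent might look like the following (again, the task function is illustrative; note the entry-point guard required for process-based concurrency):

# minimal sketch of running a function in a child process
from multiprocessing import Process

# target function executed in another process
def task(message):
    print(message)

# entry point, protects against re-execution when spawning
if __name__ == '__main__':
    # create a process configured to execute the target function
    process = Process(target=task, args=('Hello from a child process',))
    # start the process
    process.start()
    # wait for the process to finish
    process.join()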
You can learn more about the similarities and differences between threads and processes in a separate tutorial.
It is generally held that transferring data between threads is faster than transferring data between processes.
Transferring data between threads is fast because threads share memory. In fact, threads don’t need to transfer data, they can just access the same shared memory.
Transferring data between processes is slow because data must be transmitted using inter-process communication. This means that Python objects must be serialized, transmitted, received, and deserialized. This adds a computationally expensive overhead.
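We can get a rough feel for the serialization part of this overhead in isolation with a small sketch that pickles and unpickles a large list (the object size here is illustrative; multiprocessing uses the pickle module internally by default):

# minimal sketch of the cost of serializing a large object
import pickle
from time import time

# create a large list of integers
data = [i for i in range(1000000)]
# time a serialize/deserialize round trip
time_start = time()
payload = pickle.dumps(data)
result = pickle.loads(payload)
time_end = time()
# report the duration of the round trip
print(f'Round trip took {time_end - time_start:.3f} seconds')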
But how much slower is this overhead of inter-process communication in Python?
We can design an experiment to explore this question.
How to Benchmark Data Transfers for Threads and Processes
How can we benchmark data transfers consistently between threads and between processes?
This is a challenging experiment to perform.
As noted, threads do not need to transfer data. Instead, they can access the same shared memory.
Processes do not share memory; instead, data must be transmitted explicitly, or implicitly via simulated shared memory.
We could benchmark data transfers for processes alone, but without a comparison to threads, we won’t know if the benchmark times are good or not.
One approach we can use is to share data explicitly between the units of computation using a shared queue.
One thread or process can produce a Python object and put it on the shared queue. Another thread or process can consume the Python object from the queue. This process can then be repeated many times and timed with threads and processes.
Using a queue to transmit data adds overhead to threads and processes. Specifically, it involves internal locking for thread and process safety as well as any internal objects created to represent the data on the queue and operations on internal data structures.
A queue for threads will be shared directly in memory, whereas a shared queue for processes will be backed by some inter-process communication mechanism, such as sockets and/or files and probably file-based locks. This introduces expected differences.
In fact, if the queue is hosted in a server process, then transmitted data may need to be serialized and deserialized more than once, depending on the details of the queue’s implementation.
We can also create the units of concurrency before the experiment so that the time of thread and process creation is excluded from the measurements. One way to do this is via a thread pool and a process pool. The workers can then be reused across each experimental run.
This too introduces overhead, from the internal objects managed within the pools and the functions called to issue tasks.
As stated, this is a challenging experiment to design.
Using the same architecture of a shared queue, producer and consumer tasks, and a pool of reusable workers allows the setup to be identical for transmitting data with threads and with processes. The only differences that remain are those specific to data and safety management, e.g. in-process vs. across-process.
Next, let’s develop the benchmark example of this architecture using processes.
Benchmark Sharing Data Between Processes
We can explore how long it takes to transmit data between processes in Python.
In this example, we will test the case of transmitting a list of one million integers from one process to another.
This will be achieved using a pool of worker processes and a queue that is created at the beginning of the experiment and reused for each experimental run. A producer task will create the data in one worker process and put it on the shared queue. A consumer task will retrieve the data in another worker process from the shared queue. This will then be repeated many times and timed.
Firstly, we can define a producer task that takes the shared queue as an argument, creates a large list of integers, and puts it on the queue.
# task to generate data and send to the consumer
def producer_task(queue):
    # generate data
    data = [i for i in range(1000000)]
    # send the data
    queue.put(data)
Next, we can define a consumer task that takes the shared queue as an argument and retrieves the shared object.
# task to consume data sent from producer
def consumer_task(queue):
    # retrieve the data
    data = queue.get()
We can define the experimental test task. This function takes the pool of workers, the shared queue, and the number of repeats of the experiment to perform.
For each repeat it issues the consumer task first, then the producer task, then waits for the consumer task to complete.
This ordering is important. It ensures that the consumer is already waiting for work before the producer starts, and that waiting for the consumer to complete can only finish once the data has been received.
We will use the multiprocessing.Pool class API which has a thread-equivalent API in the multiprocessing.pool.ThreadPool class.
# run a test and time how long it takes
def test(pool, queue, n_repeats):
    # repeat many times
    for i in range(n_repeats):
        # issue the consumer task
        consumer = pool.apply_async(consumer_task, args=(queue,))
        # issue the producer task
        producer = pool.apply_async(producer_task, args=(queue,))
        # wait for the consumer to get the data
        consumer.wait()
So far the code is generic for both threads and processes.
Next, we can develop the process-specific section that drives the example.
Firstly, we will set the spawn start method so that the benchmark is consistent and runs on all platforms.
...
# set the start method
set_start_method('spawn')
Next, we can create a shared pool of workers.
We will use the multiprocessing.Pool, although the ProcessPoolExecutor would work just as well.
...
# create the pool of workers
with Pool(2) as pool:
    # ...
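If you prefer the ProcessPoolExecutor, a sketch of the change might look like the following, assuming the same producer_task and consumer_task from above; note that submit() returns a Future, so the test function would wait via consumer.result() instead of consumer.wait():

...
# import the executor
from concurrent.futures import ProcessPoolExecutor
# create the pool of workers using an executor instead
with ProcessPoolExecutor(2) as executor:
    # issue tasks with submit() instead of apply_async()
    consumer = executor.submit(consumer_task, queue)
    producer = executor.submit(producer_task, queue)
    # wait on the returned Future instead of an AsyncResult
    consumer.result()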
You can learn more about the multiprocessing.Pool in a separate tutorial.
Next, we can create the queue to be shared with all workers.
We cannot share a queue directly with workers in the pool as this will result in an error.
Instead, we can host the queue in a server process and share proxy objects with worker-child processes for interacting with the queue.
This requires creating a manager and then using the manager to create the queue and return a proxy object that can be shared.
...
# create the manager
with Manager() as manager:
    # create the shared queue
    queue = manager.Queue()
You can learn more about multiprocessing queues in a separate tutorial.
You can learn more about sharing a queue with child processes via a manager in a separate tutorial.
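As an aside, another common pattern avoids the manager by passing a regular multiprocessing.Queue to each worker when the pool is created, via the initializer argument; the tasks would then read a module-level queue instead of taking it as an argument. A sketch of this alternative (the init_worker function is a hypothetical helper, not part of the multiprocessing API) might look like this:

...
# a regular multiprocessing queue, no manager required
from multiprocessing import Queue

# module-level reference to the shared queue, set in each worker
queue = None

# initializer run once in each new worker process
def init_worker(shared_queue):
    global queue
    queue = shared_queue

# create the queue and share it with workers at creation time
with Pool(2, initializer=init_worker, initargs=(Queue(),)) as pool:
    # tasks would then use the module-level queue directly
    # ...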
Next, we can perform the test.
This requires that we record the start time, run the test, then record the end time. We can then calculate and report the duration of the test and the duration per test run within the test.
We will perform the sharing action between child processes 1,000 times.
...
# record the start time
time_start = time()
# perform the test
n_repeats = 1000
test(pool, queue, n_repeats)
# record the end time
time_end = time()
# report the total time
duration = time_end - time_start
print(f'Total Time {duration} seconds')
# report estimated time per task
per_task = duration / n_repeats
print(f'About {per_task:.3} seconds per task')
Tying this together, the complete example is listed below.
# SuperFastPython.com
# example benchmark data transfer between child processes
from time import time
from multiprocessing import Pool
from multiprocessing import Manager
from multiprocessing import set_start_method

# task to generate data and send to the consumer
def producer_task(queue):
    # generate data
    data = [i for i in range(1000000)]
    # send the data
    queue.put(data)

# task to consume data sent from producer
def consumer_task(queue):
    # retrieve the data
    data = queue.get()

# run a test and time how long it takes
def test(pool, queue, n_repeats):
    # repeat many times
    for i in range(n_repeats):
        # issue the consumer task
        consumer = pool.apply_async(consumer_task, args=(queue,))
        # issue the producer task
        producer = pool.apply_async(producer_task, args=(queue,))
        # wait for the consumer to get the data
        consumer.wait()

# entry point
if __name__ == '__main__':
    # set the start method
    set_start_method('spawn')
    # create the pool of workers
    with Pool(2) as pool:
        # create the manager
        with Manager() as manager:
            # create the shared queue
            queue = manager.Queue()
            # record the start time
            time_start = time()
            # perform the test
            n_repeats = 1000
            test(pool, queue, n_repeats)
            # record the end time
            time_end = time()
            # report the total time
            duration = time_end - time_start
            print(f'Total Time {duration} seconds')
            # report estimated time per task
            per_task = duration / n_repeats
            print(f'About {per_task:.3} seconds per task')
Running the example first sets the spawn start method.
Next, the process pool is created with two workers.
The manager server process is created and is used to create the shared queue.
The experiment is run with 1,000 iterations using workers in the process pool and proxy objects for the shared queue.
In this case, the experiment of transferring data between child processes took about 161.917 seconds, which is about 2.7 minutes.
Dividing the total time by the number of transfers performed shows that it took about 162 milliseconds per transfer of the list of 1,000,000 integers. Recall, there are 1,000 milliseconds in one second.
Ouch, that does feel slow.
Recall that using the queue does impose a cost with internal objects and locks, as does the process pool.
Nevertheless, a large portion of those 162 milliseconds per transfer is spent on the serialization, transmission, and deserialization of the Python object.
Total Time 161.91740798950195 seconds
About 0.162 seconds per task
Next, let’s look at performing the same experiment using threads.
Benchmark Sharing Data Between Threads
We can update the example to use threads instead of child processes.
Firstly, we can use a multiprocessing.pool.ThreadPool instead of a multiprocessing.Pool. It has the same API and uses threads internally instead of processes.
...
# create the pool of workers
with ThreadPool(2) as pool:
    # ...
You can learn more about how to use the ThreadPool in a separate tutorial.
A manager is not needed in this case; instead, we can create a queue.Queue and share it directly with the worker threads.
...
# create the shared queue
queue = Queue()
You can learn more about thread-safe queues in a separate tutorial.
The rest of the example is the same and does not require modification.
The complete example with these changes is listed below.
# SuperFastPython.com
# example benchmark data transfer between threads
from time import time
from multiprocessing.pool import ThreadPool
from queue import Queue

# task to generate data and send to the consumer
def producer_task(queue):
    # generate data
    data = [i for i in range(1000000)]
    # send the data
    queue.put(data)

# task to consume data sent from producer
def consumer_task(queue):
    # retrieve the data
    data = queue.get()

# run a test and time how long it takes
def test(pool, queue, n_repeats):
    # repeat many times
    for i in range(n_repeats):
        # issue the consumer task
        consumer = pool.apply_async(consumer_task, args=(queue,))
        # issue the producer task
        producer = pool.apply_async(producer_task, args=(queue,))
        # wait for the consumer to get the data
        consumer.wait()

# entry point
if __name__ == '__main__':
    # create the pool of workers
    with ThreadPool(2) as pool:
        # create the shared queue
        queue = Queue()
        # record the start time
        time_start = time()
        # perform the test
        n_repeats = 1000
        test(pool, queue, n_repeats)
        # record the end time
        time_end = time()
        # report the total time
        duration = time_end - time_start
        print(f'Total Time {duration:.3} seconds')
        # report estimated time per task
        per_task = duration / n_repeats
        print(f'About {per_task:.3} seconds per task')
Running the example first creates the thread pool and the shared queue.
The experiment is then run with 1,000 iterations of transferring the 1,000,000 item list between threads in the thread pool.
The total time and the time per iteration are then reported.
In this case, we can see that the overall experiment took about 40.6 seconds to complete.
Dividing this figure by the number of iterations, 1,000, shows that the average transfer took about 40.6 milliseconds.
Total Time 40.6 seconds
About 0.0406 seconds per task
Next, let’s compare the speed of data transfers with processes vs threads.
Comparison of Data Transfer Speed With Processes vs Threads
Transferring data is slower between child processes than between threads.
This is the generally accepted belief, and it can be justified with simple logic: data passed between processes must be serialized and deserialized, whereas threads can access the same memory directly.
In this case, we can put actual numbers to the difference between data transfers with threads and processes.
The results show that, using the same architecture of workers and a shared queue, processes took about 161.917 seconds whereas threads took about 40.6 seconds to complete the experiment.
That is a difference of about 121.317 seconds (about 2 minutes) or 3.99x. This means that threads are nearly 4x faster at transferring data than processes, at least with the specific fixed overhead of using workers and a shared queue introduced in this experiment.
Looking at the time per transfer, we can see that processes took about 0.162 seconds per transfer, whereas threads took about 0.0406 seconds per transfer.
That is a difference of 0.121 seconds, or again a 3.99x speed difference.
Transferring data between processes should be kept to a minimum wherever possible, and these results highlight why: it is slow relative to transmitting data between threads.
Threads are well suited to I/O-bound tasks that often require transmitting data, e.g. loading data and sending it on for central processing, or collecting and storing results. Process-based concurrency is better suited to CPU-bound tasks, especially those that require little data transmission before or after the computation, ideally acquiring their own data as needed.
This was not a perfect comparison between threads and processes because of the use of pools of workers and queues, but the architecture of the experiment was consistent between processes and threads and it is a good first attempt.
If you have ideas on better-designed experiments or improvements over this experiment, please let me know in the comments below. I’m sure we can do better.
Further Reading
This section provides additional resources that you may find helpful.
Python Multiprocessing Books
- Python Multiprocessing Jump-Start, Jason Brownlee (my book!)
- Multiprocessing API Interview Questions
- Multiprocessing API Cheat Sheet
I would also recommend specific chapters in the books:
- Effective Python, Brett Slatkin, 2019.
- See: Chapter 7: Concurrency and Parallelism
- High Performance Python, Ian Ozsvald and Micha Gorelick, 2020.
- See: Chapter 9: The multiprocessing Module
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See: Chapter 14: Threads and Processes
Guides
- Python Multiprocessing: The Complete Guide
- Python Multiprocessing Pool: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
Takeaways
You now know the speed difference for data transmission when using processes vs threads.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Devon Janse van Rensburg on Unsplash