Last Updated on September 29, 2023
You can parallelize numpy programs using multiprocessing.
It is likely that using process-based concurrency via multiprocessing to parallelize a numpy program will result in worse overall performance.
In this tutorial, you will discover the impact of using process-based concurrency to parallelize numpy programs.
Let’s get started.
Should We Use Multiprocessing to Parallelize NumPy?
It is good practice to use process-based concurrency for CPU-bound tasks in Python.
Process-based concurrency is provided via the multiprocessing module, allowing functions to run in new processes via the Process class or a reusable pool of workers via the ProcessPoolExecutor or Pool classes.
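For example, running a function in a child process only requires creating, starting, and joining a Process; a minimal sketch, where the task() function and its argument are illustrative:

# run a function in a new child process via the Process class
from multiprocessing import Process

# hypothetical task function for demonstration
def task(value):
    print(f'processing {value}')

# protect the entry point
if __name__ == '__main__':
    # create and configure the child process
    process = Process(target=task, args=(42,))
    # start the child process
    process.start()
    # wait for the child process to finish
    process.join()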
You can learn more about using multiprocessing for concurrency in the guide:
CPU-bound tasks are those tasks that do not wait on input or output but rather run as fast as the CPU can execute instructions.
Examples of CPU-bound tasks include mathematical calculations, such as those we may perform on arrays with numpy.
Processes are preferred for CPU-bound tasks because they are able to achieve full parallelism. This is unlike threads in Python that are typically limited by the global interpreter lock (GIL).
As such, we may naively parallelize numpy tasks using multiprocessing.
Should we use process-based concurrency via the multiprocessing module to parallelize numpy programs in Python?
Performance of Adding Multiprocessing to NumPy Programs is Typically Worse
Using process-based concurrency to parallelize numpy programs typically results in worse performance.
The reason for this is that typically we want to parallelize operations on numpy arrays.
This means that numpy arrays need to be shared among processes, such as via a function argument or via a Queue.
This is slow because the numpy array data must be transmitted between processes using inter-process communication. It requires that a copy of the array be created and pickled, transmitted via a socket connection or file, then unpickled by the receiving process.
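As a rough illustration of this overhead, we can time the pickle round trip for a single array before any useful work is done on it; this is a minimal sketch, not a careful benchmark:

# sketch: time the pickle round trip for one numpy array
from pickle import dumps, loads
from time import time
from numpy import ones

# create a modestly sized array
data = ones((2000, 2000))
# time a single pickle/unpickle round trip
start = time()
copy = loads(dumps(data))
duration = time() - start
print(f'Round trip took {duration:.3f} seconds')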
This can be up to 4-times slower than sharing data between threads.
You can learn more about the slow speed of transferring data between processes in the tutorial:
As such, the computational cost of sharing numpy arrays between processes for the purpose of achieving a speedup via parallelism often overwhelms any speed benefit you might achieve.
You can see an example of this in the tutorial:
Using smaller arrays of data may not help.
At best, using multiprocessing to parallelize a numpy program on modestly sized arrays offers little or no speedup.
You can see an example of this in the tutorial:
There are workarounds for sharing numpy arrays between processes efficiently, such as using an array backed by a RawArray or SharedMemory, or using an array hosted in a Manager server process.
You can learn more about these workarounds in the tutorial:
Using these more efficient methods for sharing numpy arrays between processes can add significant complexity to a program.
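For example, hosting an array in shared memory means explicitly creating, wrapping, and releasing a SharedMemory block; a minimal single-process sketch, where the array size is illustrative:

# sketch: back a numpy array with a SharedMemory block
from multiprocessing.shared_memory import SharedMemory
import numpy

# create a shared memory block large enough for an n x n array of doubles
n = 100
shm = SharedMemory(create=True, size=n * n * 8)
# create a numpy array backed by the shared memory block
data = numpy.ndarray((n, n), dtype=numpy.float64, buffer=shm.buf)
data.fill(1.0)
# child processes could attach via SharedMemory(name=shm.name)
print(data.sum())
# release the view of the buffer, then close and destroy the block
del data
shm.close()
shm.unlink()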
One area where using multiprocessing may help is if each task that is parallelized to a separate process is independent and does not involve sharing numpy arrays.
For example, perhaps each task loads an array from a file or a database, performs mathematical operations upon it or uses it in modeling, then stores the result directly.
A task of this type might be a good candidate for parallelizing via multiprocessing.
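A minimal sketch of this pattern is below; the file names are hypothetical and numpy's load() and save() are used for illustration:

# sketch: independent per-task load/compute/save, no shared arrays
from multiprocessing.pool import Pool
from numpy import load, save

# hypothetical task: load an array, operate on it, store the result
def task(i):
    # load the input array from its own file
    data = load(f'input_{i}.npy')
    # perform a mathematical operation
    result = data.dot(data)
    # store the result directly
    save(f'output_{i}.npy', result)

# protect the entry point
if __name__ == '__main__':
    # issue each independent task to the process pool
    with Pool() as pool:
        pool.map(task, range(100))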
Generally, threads are a better candidate than processes for parallelizing numpy programs.
This is because most numpy functions release the global interpreter lock, allowing tasks that operate on numpy arrays to achieve full parallelism.
You can learn more about this in the tutorial:
You can see an example of this speed-up in the tutorial:
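Because threads share memory, a thread-based version of this kind of program avoids copying arrays entirely and only requires swapping the process pool for a thread pool; a minimal sketch, where the operation() function and array sizes are illustrative:

# sketch: parallelize numpy operations with threads instead of processes
from multiprocessing.pool import ThreadPool
from numpy import ones

# function for performing an operation on matrices
def operation(item1, item2):
    # matrix multiplication, releases the GIL inside numpy
    return item1.dot(item2)

# protect the entry point
if __name__ == '__main__':
    # prepare lists of matrices
    data1 = [ones((1000, 1000)) for _ in range(10)]
    data2 = [ones((1000, 1000)) for _ in range(10)]
    # threads share memory, so no arrays are pickled or copied
    with ThreadPool() as pool:
        results = pool.starmap(operation, zip(data1, data2))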
Now that we know that multiprocessing is probably a poor choice for parallelism with numpy, let’s look at a worked example.
Example: Multiprocessing Slower Due to Copying Arrays
We can explore an example where parallelizing a simple numpy program results in worse performance, even with modestly sized arrays.
In this example, we will create two lists of numpy arrays then apply a mathematical operation to each pair.
We will then explore different versions of this program, first with no multiprocessing (but with default BLAS multithreading), a parallelized version with multiprocessing, then various tuned versions of the multiprocessing example that attempt to achieve better performance.
These examples demonstrate that even though the program is simple and the arrays shared between processes are modestly sized, we cannot achieve better performance using process-based parallelism.
Single-Process Example (with BLAS Threads)
We can develop a simple numpy program that has a series of tasks that can be parallelized.
In this example, we will create two lists of arrays filled with dummy values. Each array will be 2,000 x 2,000 elements and each list will have 100 arrays.
We will then multiply each pair of arrays together using matrix multiplication to produce a result array. We will define a separate function to perform the multiplication on each pair of arrays, as this will help with parallelizing the task later.
Matrix multiplication is implemented using a multithreaded algorithm via a BLAS library, which is installed along with numpy on most systems.
You can learn more about numpy and BLAS in the tutorial:
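As a quick check, you can report which BLAS library your numpy installation was built against:

# report the BLAS/LAPACK libraries that numpy is linked against
import numpy
numpy.show_config()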
The complete example is listed below.
# example of matrix multiplication with default BLAS multithreading
from numpy import ones
from time import time

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# protect the entry point
if __name__ == '__main__':
    # record the start time
    start = time()
    # prepare lists of matrices
    n = 2000
    data1 = [ones((n,n)) for _ in range(100)]
    data2 = [ones((n,n)) for _ in range(100)]
    # apply math operation to items
    results = [operation(item1, item2) for item1,item2 in zip(data1,data2)]
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example first creates two lists of arrays.
The arrays are then multiplied together producing a list of result arrays.
The overall time taken to execute the program is then reported.
In this case, the program takes about 11.249 seconds to complete on my system.
It may take more or fewer seconds to complete on your system, depending on the speed of your hardware.
Took 11.249 seconds
Next, let’s look at how to parallelize the example using process-based concurrency.
Multiprocessing Example
We can update the previous example to execute each matrix multiplication task in parallel using multiprocessing.
One way we can achieve this is to create a pool of worker processes and issue each task to the pool to be completed as fast as possible.
This can be achieved using the multiprocessing.Pool class and its starmap() method.
If you are new to the Pool class, you can learn more in the tutorial:
The starmap() method is like the built-in map() function, except that it issues each task to the process pool and supports multiple arguments for each function call.
You can learn more about the Pool.starmap() method in the tutorial:
The complete example with this change is listed below.
# example of multiprocessing matrix multiplication with process pool
from numpy import ones
from time import time
from multiprocessing.pool import Pool

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# protect the entry point
if __name__ == '__main__':
    # record the start time
    start = time()
    # prepare lists of matrices
    n = 2000
    data1 = [ones((n,n)) for _ in range(100)]
    data2 = [ones((n,n)) for _ in range(100)]
    # create the process pool
    with Pool() as pool:
        # prepare arguments
        args = [(item1,item2) for item1,item2 in zip(data1,data2)]
        # issue tasks to the process pool
        results = pool.starmap(operation, args)
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example first creates two lists of arrays.
The Pool is then created with the default number of worker processes.
Each matrix multiplication task is then issued to the pool of processes and executed as fast as possible.
Each matrix multiplication is still executed using multiple threads via the BLAS library behind the scenes.
The overall time to complete the program is then reported.
In this case, on my system, it takes about 125.346 seconds.
This is about 114.097 seconds slower than the non-multiprocessing version, a slowdown of about 11x.
Took 125.346 seconds
The reason for the slowdown is that each 2,000 x 2,000 array has to be transmitted to a child process and the result array has to be transmitted back to the parent process. Each array of 64-bit floats occupies 2,000 x 2,000 x 8 bytes, about 32 megabytes, so across 100 tasks roughly 6.4 gigabytes of input data is pickled and copied to the child processes and about 3.2 gigabytes of results is copied back.
Next, let’s see if we can improve the performance of the program with some tweaks.
Tuned Multiprocessing Example
We can update the previous example and attempt to tune the performance.
It can be a good practice to use one worker process per physical CPU core for CPU-bound tasks.
We are executing a CPU-bound task of matrix multiplication, so we can configure the number of worker processes in the pool to equal the number of physical CPU cores. I have 4 cores, so the example will use 4 workers, but you should update this to match the number of physical cores in your system.
You can learn more about configuring the number of workers in the process pool in the tutorial:
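Note that os.cpu_count() reports logical CPUs, which may be double the number of physical cores on systems with simultaneous multithreading; a minimal sketch for checking your own system:

# report the number of logical CPUs available on this system
import os
print(os.cpu_count())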
It can also be a good practice to transmit tasks to worker processes in batches or groups.
The starmap() function does this by default, but we can tune the number of tasks in each group via the “chunksize” argument. Perhaps a good value is to divide the tasks into 4 groups, one for each worker process, e.g. 25 matrix multiplications per worker.
You can learn more about tuning the “chunksize” argument in the tutorial:
The complete example with these changes is listed below.
# example of tuned multiprocessing matrix multiplication with process pool
from numpy import ones
from time import time
from multiprocessing.pool import Pool

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# protect the entry point
if __name__ == '__main__':
    # record the start time
    start = time()
    # prepare lists of matrices
    n = 2000
    data1 = [ones((n,n)) for _ in range(100)]
    data2 = [ones((n,n)) for _ in range(100)]
    # create the process pool with one worker per physical core
    with Pool(4) as pool:
        # prepare arguments
        args = [(item1,item2) for item1,item2 in zip(data1,data2)]
        # issue tasks to the process pool in chunks
        results = pool.starmap(operation, args, chunksize=25)
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example creates the lists of arrays and the process pool with 4 workers.
The tasks are issued to the workers in 4 groups of 25 tasks.
The overall time is reported, in this case as 112.407 seconds.
This shows that applying these best practices did offer a speed benefit of about 12.939 seconds, but the program is still about 101 seconds slower than the non-multiprocessing version.
Took 112.407 seconds
Next, let’s explore whether turning off BLAS threads offers any benefits.
Further Tuned Multiprocessing Example
The matrix multiplication is implemented using a multithreaded algorithm via a BLAS library.
It is possible that the BLAS threads executing each matrix multiplication interact poorly with the worker processes, resulting in lower overall performance.
We can disable multithreading in the BLAS library so matrix multiplication is single-threaded and then compare performance.
You can learn more about configuring the number of threads in the BLAS library in the tutorial:
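Note that the OMP_NUM_THREADS environment variable must be set before numpy is imported. Alternatively, if the third-party threadpoolctl package is installed, BLAS threads can be limited at runtime; a minimal sketch:

# limit BLAS threads at runtime via the threadpoolctl package
from threadpoolctl import threadpool_limits
from numpy import ones

data = ones((2000, 2000))
# restrict all BLAS thread pools to one thread within this block
with threadpool_limits(limits=1):
    result = data.dot(data)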
The complete example, using the environment variable approach, is listed below.
# example of multiprocessing matrix multiplication with BLAS threads disabled
from os import environ
# limit BLAS to a single thread, must be set before importing numpy
environ['OMP_NUM_THREADS'] = '1'
from numpy import ones
from time import time
from multiprocessing.pool import Pool

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# protect the entry point
if __name__ == '__main__':
    # record the start time
    start = time()
    # prepare lists of matrices
    n = 2000
    data1 = [ones((n,n)) for _ in range(100)]
    data2 = [ones((n,n)) for _ in range(100)]
    # create the process pool with one worker per physical core
    with Pool(4) as pool:
        # prepare arguments
        args = [(item1,item2) for item1,item2 in zip(data1,data2)]
        # issue tasks to the process pool in chunks
        results = pool.starmap(operation, args, chunksize=25)
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example first disables multithreading in the BLAS library.
The lists of arrays are created, the pool of workers is created and all tasks are issued and executed in the pool.
The overall time taken is reported, which in this case was about 125.448 seconds.
This is roughly equivalent to the un-tuned multiprocessing example and worse than the tuned version that kept BLAS multithreading enabled with the same Pool configuration.
This suggests that the BLAS threads are probably not interfering with the worker processes and that interactions of this type are not the cause of the slowdown.
Took 125.448 seconds
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know about the impact of using process-based concurrency to parallelize numpy programs.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Samuel Pucher on Unsplash