Last Updated on September 29, 2023
You can combine BLAS threads and multiprocessing in a NumPy program.
Maximizing these types of parallelism can help you fully utilize your CPU cores for a given application and achieve a speed-up compared to not using any or only one level of parallelism. It can also be tricky: sometimes combining multiprocessing with NumPy BLAS threads results in no benefit, or even worse performance.
In this tutorial, you will discover how to combine and test multiprocessing and NumPy-based parallelism in order to achieve the best performance in Python.
Let’s get started.
Parallelism in Python and NumPy
Python supports parallelism using process-based concurrency via the multiprocessing module.
Work can be partitioned into subtasks and each subtask can be executed in a separate Python process in parallel.
Parallelism is supported by Python threads, but only under special circumstances, such as when performing blocking I/O. This is due to the Global Interpreter Lock (GIL), which prevents Python bytecode from executing in more than one thread at a time.
You can learn more about parallelism in Python using the multiprocessing module in the guide:
NumPy supports parallelism for some operations that use third-party libraries such as BLAS and LAPACK.
Examples include matrix multiplication via numpy.dot() and matrix decomposition like SVD via numpy.linalg.svd(). These functions call the BLAS/LAPACK library API, the implementation of which (such as OpenBLAS) is likely written in C or FORTRAN and uses threads internally.
As such, some NumPy functions support parallelism automatically.
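For example, the following minimal sketch shows two such calls. Whether they actually run in parallel depends on the BLAS/LAPACK implementation that NumPy is linked against.

# two NumPy calls that typically execute in parallel via BLAS/LAPACK
import numpy as np
# create square matrices of random floats
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
# matrix multiplication, calls down into BLAS
c = a.dot(b)
# singular value decomposition, calls down into LAPACK
u, s, vt = np.linalg.svd(a)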
You can learn more about BLAS threads in NumPy in the tutorial:
Which type of parallelism should we use in a given program?
Case Study: Parallelism With Matrix Data
Consider a situation where we load data as NumPy arrays and need to process each array in some way.
Assume that the data is numerical and that the operation is a NumPy function that calls down to BLAS and can leverage threads.
For example:
- Perhaps we need to load image data, video data, or numerical data as arrays.
- Perhaps each array needs a matrix operation applied, such as fitting a model or computing a decomposition.
We can simplify this to a contrived situation that involves creating a square matrix of random numbers and multiplying the matrix with another matrix.
- How should we structure the program to best take advantage of parallelism on one machine?
Put another way:
- How can we use all CPU cores in this use case?
I would argue that we cannot know. We must discover what works well or best for a specific program. It depends on the number of matrices, the size of the matrices, and the complexity of the operation to be performed.
There are perhaps four main approaches to test:
- No multiprocessing and no BLAS threads. This is the default or naive approach that assumes we have a single CPU core. Whatever we do should be better than this case, but we need to know what the worst case is so we can beat it.
- No multiprocessing with NumPy threads. This delegates all parallelism to NumPy and BLAS, which really means that we only get parallelism for the operation on the loaded data. This simple implementation should do better than the baseline above. We could attempt to tune it by setting the number of threads to match the number of physical rather than logical CPU cores, which can sometimes help.
- Multiprocessing with no BLAS threads. This turns off the parallelism in NumPy and focuses efforts on parallelism at a high level in the application, the task level.
- Multiprocessing with BLAS threads. This combines parallelism at the task level using multiprocessing and at the operation level using NumPy. The risk is that the threads may step on each other, fighting for execution time and thrashing the operating system with excessive context switching.
If you are faced with the situation described above (and are running on one multi-core machine), I’d recommend testing each of these general approaches to parallelism and discovering what works well or best on your system.
Next, let’s explore each of these cases with a simplified version of the case study.
Example of No Threads
The first case is the baseline.
This case involves not using task-level parallelism via the multiprocessing module and not using threads in BLAS.
We can configure the number of threads used by the BLAS library via an environment variable, such as OMP_NUM_THREADS if you have OpenBLAS installed (the most common implementation).
You can find out which BLAS library is installed by calling numpy.show_config(), then look up that library's API documentation to check which environment variable to set.
An environment variable can be set prior to running the Python program, or from within Python via the os.environ mapping prior to importing NumPy.
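For example, a quick check might look like the following sketch; the exact output of numpy.show_config() varies by NumPy version.

# check which BLAS library numpy is linked against
from os import environ
# set the thread count before importing numpy so BLAS picks it up
environ['OMP_NUM_THREADS'] = '1'
import numpy
# report build and configuration information, including the BLAS in use
numpy.show_config()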
Learn more about how to configure the number of BLAS threads in the tutorial:
The example below disables NumPy BLAS multithreading. It then creates two lists of 100 matrices. Each matrix is a square 2,000 x 2,000 element matrix of random floating point values.
An operation is then applied to each of the 100 pairs of matrices; in this case, a simple matrix multiplication.
This is a contrived test case, but we can see how creating random matrices could be replaced with loading datasets or images, and how the matrix multiplication could be replaced with some relevant NumPy operation on the loaded data that calls down into BLAS.
The complete example is listed below.
# example of loading matrices and multiplying them together (no threads)
from os import environ
# turn off threads in numpy (with openblas)
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# load data
data1 = [load_data() for _ in range(100)]
data2 = [load_data() for _ in range(100)]
# apply operation to items
results = [operation(item1, item2) for item1, item2 in zip(data1, data2)]
Running the example on my system takes about 33.1 seconds.
We could improve the benchmarking by running the example many times and taking the average time.
This could be further automated using a module like timeit. I leave this as an exercise for the reader.
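As a starting point, a minimal harness using timeit might look like the following sketch, which times only the matrix multiplication:

# repeat a matrix multiplication and report the best time
import timeit
# create the matrices in the setup so only the operation is timed
setup = 'from numpy.random import rand; a = rand(2000, 2000); b = rand(2000, 2000)'
best = min(timeit.repeat('a.dot(b)', setup=setup, repeat=3, number=1))
print(f'best of 3: {best:.3f} seconds')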
Here, the numbers are directional rather than definitive.
I don’t want to give hard numbers claiming “this” is better than “that” for all cases. I want to teach you how to benchmark your own use case and discover what works well or best for you.
Next, let’s re-run the example with BLAS threads.
Example of Default Number of BLAS Threads
We can update the example to use threads to perform the matrix multiplication via BLAS.
Most BLAS libraries use threads by default, so we can simply remove the explicit setting of the OMP_NUM_THREADS environment variable and use the default number of threads.
The complete example with this change is listed below.
# example of loading matrices and multiplying them together (default numpy threads)
from numpy.random import rand

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# load data
data1 = [load_data() for _ in range(100)]
data2 = [load_data() for _ in range(100)]
# apply operation to items
results = [operation(item1, item2) for item1, item2 in zip(data1, data2)]
Running the example takes about 15.9 seconds on my system.
This is interesting. I have 4 physical CPU cores (8 logical cores), so I might have expected the example to be about 4x faster, not 2x faster.
One likely reason is that generating the random matrices is a serial operation; only the matrix multiplications benefit from the BLAS threads.
Nevertheless, we beat the baseline, but can we do better?
Example of Tuned Number of NumPy (BLAS) Threads
When the BLAS library was installed, it was probably compiled for your specific hardware and operating system.
As such, it may be configured with a default number of threads based on the number of logical CPU cores in the system. Recall that each physical CPU core may present as 2 logical CPU cores if the CPU supports hyperthreading, which most CPUs do these days.
You can learn more about physical vs logical CPUs in the tutorial:
This means that on my system, the default number of BLAS threads may be set to 8.
Alternatively, the BLAS library may use an arbitrary number of default threads, such as 24.
We may see a benefit by explicitly setting the number of BLAS threads to be equal to the number of physical CPU cores.
I will use 4 in this case, but change this to match the number of physical cores in your system, or try different values and discover what works best.
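If you are unsure how many physical cores you have, the following sketch may help. Note that psutil is a third-party package that must be installed separately, and that os.cpu_count() reports logical cores.

# report the number of logical cores via the standard library
import os
print(os.cpu_count())
# report the number of physical cores via the third-party psutil package
import psutil
print(psutil.cpu_count(logical=False))

The complete example with this change is listed below.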
# example of loading matrices and multiplying them together (optimal numpy threads)
from os import environ
# set threads equal to number of physical cores (with openblas)
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# load data
data1 = [load_data() for _ in range(100)]
data2 = [load_data() for _ in range(100)]
# apply operation to items
results = [operation(item1, item2) for item1, item2 in zip(data1, data2)]
Running the example takes about 15.9 seconds on my system.
This is no different from the default, so no win here (in this specific case, on my specific hardware).
Did you see a better result?
Example of Multiprocessing Without BLAS Threads
We can implement task-level parallelism using multiprocessing.
Specifically, we can define a task, then issue it to a pool of workers to execute as soon as possible.
We can use the multiprocessing.Pool class to create a pool of worker processes, one worker per physical CPU core in the system. I will use 4 workers in this case, but you can tune it for the number of cores in your system.
You can learn more about the process pool in the tutorial:
The task can be defined as generating two square matrices of random numbers and calculating a matrix multiplication.
Parallelism in the NumPy matrix multiplication can be disabled by configuring BLAS to use a single thread.
Tying this together, the complete example is listed below.
# example with multiprocessing and no numpy threads
from os import environ
# turn off threads in numpy (with openblas)
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
from multiprocessing import Pool

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# function that defines a single task
def task():
    # load items
    item1 = load_data()
    item2 = load_data()
    # perform operation
    result = operation(item1, item2)
    return result

# entry point
if __name__ == '__main__':
    # create process pool
    with Pool(4) as pool:
        # issue tasks asynchronously
        async_results = [pool.apply_async(task) for _ in range(100)]
        # gather results, waiting for each task to complete
        results = [r.get() for r in async_results]
        # close the process pool
        pool.close()
        # wait for all tasks to complete
        pool.join()
Running the example takes about 11.5 seconds.
This is closer to the 4x speed-up I would expect with 4 physical CPU cores, e.g. 33.1 * 0.25 = 8.275 seconds. The roughly 3-second discrepancy (11.5 versus 8.3 seconds) is probably overhead in the Pool, such as transferring result matrices between processes, and elsewhere.
Perhaps in this scenario, or just in this contrived version of the use case, task-level parallelism is a better approach than operation-level parallelism, at least with the configurations tested so far.
Example of Multiprocessing And BLAS Threads
Finally, we can try to combine task and operation-level parallelism.
That is, process-based parallelism with the multiprocessing module and parallelism of the matrix operation in NumPy via threads in the BLAS library implementation.
This is tricky, as the threads will likely step on each other at the operating system level. Recall that each Python process has a main thread, and that each worker may start its own set of BLAS threads: 4 workers each using 8 BLAS threads means up to 32 threads competing for 8 logical cores.
Python developers often try to speed up NumPy programs by using multiprocessing with NumPy arrays, unaware of the BLAS-level parallelism provided automatically. As such, they may get equal or worse performance and have no idea why.
In this case, we will use the default number of BLAS threads and configure the multiprocessing pool to use 4 workers, one for each physical CPU core. After testing on my system, this seemed to give better performance than other configurations, e.g. 2 BLAS threads and 2 workers, and other configs.
Try a suite of configurations on your system based on the number of CPU cores and see what works.
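Because OMP_NUM_THREADS must be set before NumPy is first imported, one way to sweep configurations is to run the benchmark in a fresh process for each setting. The sketch below assumes a hypothetical benchmark.py script that accepts the number of pool workers as an argument.

# sweep combinations of pool workers and BLAS threads
# (benchmark.py is a hypothetical script for your own use case)
import itertools
import os
import subprocess
for workers, blas_threads in itertools.product([1, 2, 4], [1, 2, 4, 8]):
    # set the BLAS thread count in the child process environment
    env = dict(os.environ, OMP_NUM_THREADS=str(blas_threads))
    print(f'workers={workers}, blas_threads={blas_threads}')
    subprocess.run(['python', 'benchmark.py', str(workers)], env=env)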
Recall, setting BLAS to 1 thread and the pool to one worker is equivalent to the baseline example (no threads) above, although it will be slightly slower given the overhead of managing tasks by a worker in the pool.
The complete example is listed below.
# example with multiprocessing and numpy threads
from numpy.random import rand
from multiprocessing import Pool

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# function that defines a single task
def task():
    # load items
    item1 = load_data()
    item2 = load_data()
    # perform operation
    result = operation(item1, item2)
    return result

# entry point
if __name__ == '__main__':
    # create process pool
    with Pool(4) as pool:
        # issue tasks asynchronously
        async_results = [pool.apply_async(task) for _ in range(100)]
        # gather results, waiting for each task to complete
        results = [r.get() for r in async_results]
        # close the process pool
        pool.close()
        # wait for all tasks to complete
        pool.join()
In this case, the example took about 12.1 seconds to complete on my system.
Hmm.
Naively, I would have bet that a program structure like this, with some tuned configuration of BLAS threads and Pool workers, would be the best. This false expectation is why we experiment and test!
I would guess that there might be an optimal arrangement for task and operation level parallelism, but perhaps the tools of the Pool class and BLAS are too blunt to find it.
Comparison of Results
We have tested a number of instances of the case study with what we have termed operation-level parallelism and task-level parallelism.
The table below summarizes the cases tested and their wall clock time (reported by my IDE).
Multiprocessing | NumPy (BLAS) Threads | Time (sec)
---------------------------------------------------
No              | No                   | 33.1
No              | Yes (default)        | 15.9
No              | Yes (tuned)          | 15.9
Yes             | No                   | 11.5
Yes             | Yes (default)        | 12.1
In this case (the specific number and size of data) on my system (hardware and software), it seems that task-level parallelism via multiprocessing was faster overall.
So what should you do in your situation?
Test.
If you are using NumPy, take a close look at the functions to see if they are calling down into BLAS. If you’re not sure, check this tutorial:
If you’re still not sure, contrive a test with one BLAS thread and multiple BLAS threads and compare the performance.
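One way to run such a test without restarting Python per configuration is the third-party threadpoolctl package, which can limit BLAS threads at runtime. A minimal sketch:

# compare one BLAS thread versus many for a matrix multiplication
from time import perf_counter
import numpy as np
from threadpoolctl import threadpool_limits
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
for n in (1, 4):
    # limit the number of threads used by the BLAS library
    with threadpool_limits(limits=n, user_api='blas'):
        start = perf_counter()
        a.dot(b)
        print(f'{n} BLAS thread(s): {perf_counter() - start:.3f} seconds')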
Then, test your use case with different levels of parallelism, such as those listed above.
Use the code examples above as templates.
Don’t assume what approach will be fastest, devise experiments and discover what approach is fastest with your data on your hardware.
You can always go further too.
- Loading data could be parallel.
- Generating data can be made parallel.
- Alternative NumPy functions may call more or fewer BLAS and LAPACK routines and in turn achieve more or less parallelism.
- Issuing calls down to BLAS in batch might be possible in some cases, e.g. via the numpy.linalg.multi_dot() function (see the sketch after this list).
- Custom operations can be implemented in C/FORTRAN in third-party code to use native threads, then called from Python.
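For example, a minimal sketch of batching a chain of matrix products with numpy.linalg.multi_dot(), which evaluates the chain in an efficient order:

# multiply a chain of matrices in one call, letting numpy pick the order
import numpy as np
a = np.random.rand(2000, 100)
b = np.random.rand(100, 100)
c = np.random.rand(100, 2000)
result = np.linalg.multi_dot([a, b, c])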
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to combine and test multiprocessing and NumPy parallelism in order to achieve the best performance.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by logojackmowo Yao on Unsplash