Last Updated on September 29, 2023
You can parallelize numpy operations in Python using threads.
Many numpy operations are implemented using multithreaded algorithms, such as those that call the BLAS library. Additionally, most numpy functions release the GIL, allowing tasks that call numpy functions to be parallelized using threads.
Nevertheless, parallelizing numpy programs using threads does not always result in a speed-up. Worse still, attempting to parallelize numpy functions that implement a multithreaded algorithm can result in worse overall performance.
In this tutorial, you will discover that parallelizing tasks in numpy programs with threads does not always result in better performance.
Let’s get started.
Can Using Threads With NumPy Be Slower?
NumPy supports many linear algebra functions with vectors and arrays by calling into efficient third-party libraries such as BLAS and LAPACK.
BLAS and LAPACK are specifications for efficient vector, matrix, and linear algebra operations that are implemented using third-party libraries, such as MKL and OpenBLAS. Many of the operations in these libraries are multithreaded.
You can learn more about NumPy and BLAS in a separate tutorial.
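If you are unsure which BLAS library your numpy installation is linked against, you can typically check by calling the numpy.show_config() function. For example:

# report the build configuration of the installed numpy
import numpy
# print build and library details, including any linked BLAS/LAPACK
numpy.show_config()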
Additionally, numpy performs most math operations, from arithmetic to more advanced routines, using its own C functions. These functions release the global interpreter lock (GIL) in Python, allowing other Python threads to run.
The implication is that math operations on arrays in numpy can be executed in parallel using Python threads.
You can learn more about numpy releasing the GIL in a separate tutorial.
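For example, the following minimal sketch runs two numpy matrix multiplications at the same time using Python threads. Because the dot() method releases the GIL, both threads can execute in parallel:

# minimal sketch: run two numpy operations concurrently with threads
from threading import Thread
from numpy import ones

# task that performs a numpy math operation
def task():
    data = ones((2000, 2000))
    # dot() releases the GIL, so threads can overlap this work
    _ = data.dot(data)

# create, start, and wait for two threads executing numpy operations
threads = [Thread(target=task) for _ in range(2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()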
As such, it can be good practice to try to speed up a numpy program using threads. Nevertheless, threads do not always help.
It is possible to parallelize tasks in a numpy program and see no benefit. In some cases, it is even possible to see worse performance.
Using Threads Can Result in No Benefit or Worse Performance
Adding threads to a numpy program does not always improve performance.
Sometimes, parallelizing tasks in a numpy program can offer no benefit.
Additionally, in some cases, adding threads can result in significantly worse performance.
There are a number of reasons for this, such as:
- BLAS threads (math operation level) compete with Python threads (task level).
- BLAS math operations are not multithreaded for a specific use case.
- Multithreading a numpy function offers no benefit.
Let’s take a closer look at each case in turn.
BLAS Threads Compete with Python Threads
The most common cause is if threads are used to parallelize a math operation that is already multithreaded.
For example, a math operation like matrix multiplication with the dot() method is already executed using a multithreaded algorithm via the third-party BLAS library. If the program executes this multithreaded math operation many times and each call is parallelized as a task in a thread pool, then the threads at the operation level (BLAS threads) and the threads at the task level (Python threads) will compete for resources.
Each thread will be executing a CPU-bound task and there may be more CPU-bound tasks executing concurrently than the system can support, e.g. more than the number of physical or logical CPU cores. This means the operating system will context switch between the threads in an effort to run all tasks in parallel.
This context switching may add unwanted overhead, resulting in the program taking longer overall than if it were parallelized only at the math operation level (BLAS threads) or only at the task level (Python threads).
It may be possible to strike a balance between BLAS and Python threads within the same program and achieve better overall performance.
You can see an exploration of this in a separate tutorial.
BLAS Operation Not Multithreaded In Some Use Cases
Many linear algebra math functions are implemented using a multithreaded algorithm via an installed BLAS library.
Nevertheless, not all uses of these math functions can be multithreaded.
For example, we may perform a vector multiplication using the dot() function.
This function will call down to BLAS and will execute a multithreaded algorithm.
Nevertheless, the vector-to-vector multiplication itself may not be executed using multiple threads because of the sequential nature of the operation.
Alternatively, perhaps a different function is required to perform the desired math operation, such as calling the numpy.linalg.svd() function instead of calculating an SVD manually.
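For example, we can call the numpy.linalg.svd() function directly and let numpy dispatch to an efficient LAPACK routine, rather than composing the decomposition from lower-level operations:

# compute an SVD via the dedicated numpy function
from numpy import ones
from numpy.linalg import svd

# factorize a 100x50 matrix into U, singular values, and V^T
a = ones((100, 50))
u, s, vh = svd(a)
print(u.shape, s.shape, vh.shape)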
Multithreading a Numpy Function Offers no Benefit
Most numpy math functions release the GIL.
This means that we can parallelize tasks that execute numpy functions with threads.
Nevertheless, even though the design of the program appears appropriate, the implementation may result in no benefit.
An example is filling a numpy array via the fill() method.
This function releases the GIL. Therefore, if we had to fill many arrays, we might expect to execute these fill tasks in parallel using threads and achieve a speed-up.
This may not be the case in general, or it may not be the case with some specific ways the task is executed, such as the choice of method on the ThreadPool or the choice of configuration, such as the “chunksize” argument to the map() function.
Now that we are familiar with some ways in which we may not get a lift in performance when adding threads to numpy programs, let’s explore some worked examples.
Example of Adding Threads Results in Worse Performance
We can explore an example of multithreading a task that calls a multithreaded numpy function.
This can result in worse performance in some cases, such as when there are more overall CPU-bound threads than there are physical or logical CPU cores in the system.
In this case, we will explore parallelizing multiple matrix multiplication tasks.
BLAS-level Multithreaded Matrix Multiplication
We can develop a baseline case before we attempt to parallelize it with Python threads.
In this example, we will define two lists of 100 matrices and then multiply each pair of matrices. The matrix multiplication operation executed via the dot() method on the numpy array is implemented using BLAS and uses a multithreaded algorithm.
The complete example is listed below.
# example of multithreaded matrix multiplication
from numpy import ones
from time import time

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# prepare lists of matrices
n = 2000
data1 = [ones((n,n)) for _ in range(100)]
data2 = [ones((n,n)) for _ in range(100)]
# apply math operation to items
results = [operation(item1, item2) for item1, item2 in zip(data1, data2)]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
Running the example first creates two lists of 100 matrices, each of size 2,000 x 2,000 and initialized to ones.
Each pair of matrices is then multiplied by calling a custom function that in turn calls the dot() method. We use a custom function to perform the math operation so that it is easy to parallelize in the next section.
Finally, the overall time taken for the program to execute is reported.
In this case, it took about 10.705 seconds to execute on my system.
It may take more or fewer seconds to run, depending on the speed of your system.
This provides a baseline that we can compare to the parallelized version of the same program.
Took 10.705 seconds
Next, let’s update the program to parallelize the tasks with threads.
ThreadPool and BLAS-level Multithreaded Matrix Multiplication
We can update the previous example to parallelize the tasks using Python threads.
In this case, we can use the multiprocessing.pool.ThreadPool class that provides a pool of reusable worker threads. We can then issue the multiplication of each pair of matrices to the pool for execution using the starmap() function.
If you are new to the ThreadPool class, you can get started with a separate guide.
The complete example with this change is listed below.
# example of multithreaded matrix multiplication with thread pool
from numpy import ones
from time import time
from multiprocessing.pool import ThreadPool

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# prepare lists of matrices
n = 2000
data1 = [ones((n,n)) for _ in range(100)]
data2 = [ones((n,n)) for _ in range(100)]
# create thread pool
with ThreadPool() as pool:
    # prepare arguments
    args = [(item1, item2) for item1, item2 in zip(data1, data2)]
    # issue tasks to thread pool
    results = pool.starmap(operation, args)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
Running the example first creates the two lists of matrices, as before.
The ThreadPool is then created and the pairs of matrices are prepared as arguments for each task. The tasks are then issued to the ThreadPool via the starmap() function, and the call blocks until all matrix multiplications are complete.
Finally, the overall time for the program to complete is reported.
In this case, the program took about 17.354 seconds to run on my system.
This is about 6.649 seconds longer than the single-threaded version.
Took 17.354 seconds
The program took longer to complete because the worker threads in the ThreadPool, each attempting to execute a matrix multiplication as fast as possible, are competing with the BLAS threads attempting to perform each individual matrix multiplication operation as fast as possible.
I have 4 physical CPU cores in my system, or 8 logical CPU cores with hyperthreading. If BLAS uses 8 threads for each matrix multiplication by default and the ThreadPool has 8 worker threads by default, then there are potentially up to 64 CPU-bound threads (8 workers × 8 BLAS threads each) attempting to execute at the same time on 4 physical CPU cores. This results in significant overhead as the operating system context switches between all of the threads, attempting to complete them all as fast as possible.
A better solution would be to limit the overall number of threads running within the program, such as only using BLAS parallelism, only using the ThreadPool for parallelism, or finding a middle ground between both BLAS and Python threads that offers overall better performance.
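For example, here is a minimal sketch of the first option, reusing the program above but disabling BLAS threads via the OMP_NUM_THREADS environment variable before numpy is imported. Note that the exact variable honored may depend on the BLAS library installed on your system:

# sketch: disable BLAS threads so the ThreadPool is the only parallelism
from os import environ
# must be set before numpy is imported in order to take effect
environ['OMP_NUM_THREADS'] = '1'
from numpy import ones
from time import time
from multiprocessing.pool import ThreadPool

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication, now single-threaded at the BLAS level
    return item1.dot(item2)

# record the start time
start = time()
# prepare lists of matrices
n = 2000
data1 = [ones((n,n)) for _ in range(100)]
data2 = [ones((n,n)) for _ in range(100)]
# create thread pool, the only source of parallelism in this version
with ThreadPool() as pool:
    args = [(item1, item2) for item1, item2 in zip(data1, data2)]
    results = pool.starmap(operation, args)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')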
Note, we can tune the above example to achieve better performance, perhaps even performance equivalent to the BLAS-only example, by setting the “chunksize” argument of the starmap() function to 25.
For example:
...
# issue tasks to thread pool
results = pool.starmap(operation, args, chunksize=25)
Still, this does not result in better overall performance, only near equivalent performance to not using the ThreadPool at all.
You can learn more about adding threads to multithreaded BLAS functions and tuning for performance in a separate tutorial.
Next, let’s look at an example where BLAS threads may offer no benefit.
Example of BLAS Threads Offering No Benefit
We can explore the case of using a numpy function that calls a multithreaded BLAS function but receives no benefit from the additional threads.
In this case, we will explore the multiplication of one vector by another, e.g. a vector dot product via the dot() function.
We can confirm that the dot product of two vectors is not multithreaded by first executing the program with BLAS threads disabled and then executing the same operation with BLAS threads enabled.
Single-Threaded Vector Multiplication
We can establish a baseline by calculating a vector dot product without BLAS threads.
This can be achieved by configuring the installed BLAS library to use a single thread when performing BLAS operations.
For example:
...
from os import environ
environ['OMP_NUM_THREADS'] = '1'
You can learn more about configuring the number of threads used by BLAS in a separate tutorial.
We can then create two vectors with 1 billion elements each, multiply them together, and report the overall time the program takes to execute.
The complete example is listed below.
# example of single threaded vector multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones

# record the start time
start = time()
# size of arrays
n = 1000000000
# create vectors
data1 = ones(n)
data2 = ones(n)
# vector multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
Running the program first creates the two 1 billion element vectors, then multiplies them together.
The overall time taken for the program to execute is then reported.
In this case, the program took about 5.750 seconds to execute on my system.
Took 5.750 seconds
Next, we can execute the same program with BLAS threads enabled.
Multithreaded Vector Multiplication
We can update the previous example to allow BLAS to use threads for matrix and vector multiplication.
This can be achieved by removing the explicit configuration of the number of BLAS threads and allowing BLAS to use the default number of threads.
The complete example with this change is listed below.
# example of multithreaded vector multiplication
from time import time
from numpy import ones

# record the start time
start = time()
# size of arrays
n = 1000000000
# create vectors
data1 = ones(n)
data2 = ones(n)
# vector multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
Running the program first creates the two 1 billion element vectors, then multiplies them together.
The overall time taken for the program to execute is then reported.
In this case, the program took about 5.462 seconds to execute on my system.
The difference is about 0.288 seconds, which is probably within the margin of error.
Took 5.462 seconds
It is probably fair to say that although the dot() function is implemented using a multithreaded algorithm, multithreading offers no benefit in this case when multiplying vectors, probably because the arrays have a single dimension.
This highlights that we cannot rely on a multithreaded BLAS function to always offer a speedup. Perhaps in some cases, we can explore our own multithreaded design instead.
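For example, a manual design might split the two vectors into chunks, compute partial dot products in a ThreadPool, and sum the results. The sketch below illustrates the idea; whether it actually helps is workload-dependent, as a vector dot product is typically limited by memory bandwidth rather than compute:

# sketch: a manual multithreaded vector dot product with a thread pool
from multiprocessing.pool import ThreadPool
from numpy import ones

# compute the dot product of one chunk of each vector
def chunk_dot(chunk1, chunk2):
    return chunk1.dot(chunk2)

n = 100000000
workers = 4
data1 = ones(n)
data2 = ones(n)
# slice both vectors into equal chunks (numpy slices are views, not copies)
size = n // workers
args = [(data1[i*size:(i+1)*size], data2[i*size:(i+1)*size]) for i in range(workers)]
# compute partial dot products in parallel, then sum them
with ThreadPool(workers) as pool:
    result = sum(pool.starmap(chunk_dot, args))
print(result)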
Next, let’s explore multithreading a numpy function that releases the GIL and does not receive a performance benefit.
Example of Multithreaded Task Offering No Benefit
Most numpy functions release the GIL, allowing Python threads to be used to achieve full parallelism.
Nevertheless, parallelizing tasks that execute numpy functions with threads does not always offer a lift in performance.
In this case, we will demonstrate this by filling multiple arrays with a single value in parallel using threads with the fill() method.
The fill() method does release the GIL and can be parallelized in some circumstances, such as filling different parts of the same array in parallel.
You can see an example of this in a separate tutorial, and a minimal sketch of the approach follows.
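In this sketch, each task fills a different slice (a view) of one shared array:

# sketch: fill different sections of a single shared array in parallel
from multiprocessing.pool import ThreadPool
from numpy import ones

n = 50000000
workers = 4
data = ones(n)

# fill one contiguous section of the shared array with zeros
def fill_section(i):
    size = n // workers
    # slicing returns a view, so fill() writes into the shared array
    data[i*size:(i+1)*size].fill(0)

# issue one fill task per section to the thread pool
with ThreadPool(workers) as pool:
    pool.map(fill_section, range(workers))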
We may not get a benefit by filling multiple different arrays in parallel.
Single-Threaded NumPy Array Fill
Before we explore the parallel case, we can define a single-threaded program that fills many arrays.
The example below defines 100 matrices that are 5,000 x 5,000 elements in size. Each matrix is then filled with zero values sequentially and the overall time taken to complete the program is reported.
The complete example is listed below.
# example of single-threaded array filling
from numpy import ones
from time import time

# record the start time
start = time()
# size of arrays
n = 5000
# prepare data
data = [ones((n,n)) for _ in range(100)]
# perform operation on each array
[item.fill(0) for item in data]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
Running the program first creates a list of 100 arrays with the given size.
The list of arrays is then traversed and each is filled with zero values.
In this case, the program takes about 7.615 seconds to complete on my system.
This provides a baseline in performance to compare to the parallel case.
Took 7.615 seconds
Next, let’s update the example to fill the arrays in parallel using threads.
Multithreaded NumPy Array Fill
We can update the previous example to fill the arrays in parallel using threads.
We will create a ThreadPool with the default number of worker threads, then issue each fill() task to the pool for execution.
This requires defining a custom function that takes an array as an argument and calls the fill() method on it with the given value.
All tasks are issued to the pool at once via the map() function with the default “chunksize” argument.
The complete example with this change is listed below.
# example of multithreaded array filling
from numpy import ones
from time import time
from multiprocessing.pool import ThreadPool

# function for performing operation on an array
def operation(item):
    # apply array operation
    item.fill(0)

# record the start time
start = time()
# size of arrays
n = 5000
# prepare data
data = [ones((n,n)) for _ in range(100)]
# create thread pool
with ThreadPool() as pool:
    # issue tasks to thread pool
    _ = pool.map(operation, data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
Running the program first creates a list of 100 arrays with the given size.
The ThreadPool is then created and all tasks are issued to the pool. This call blocks until all tasks are completed.
Finally, the overall time taken to execute the program is reported.
In this case, the program takes about 7.810 seconds to complete on my system.
This is about 0.195 seconds slower (195 milliseconds) than the single-threaded version and is probably within the margin of measurement error.
Took 7.810 seconds
This example highlights that even though we are executing a numpy function that releases the GIL in parallel with threads, we do not see the expected speed-up.
Note, after some experimentation, tuning the “chunksize” argument of the map() function does not appear to offer better performance on this problem.
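For completeness, such a tuning attempt would look like the following, with the value 25 purely illustrative:

...
# issue tasks to thread pool with a tuned chunksize
_ = pool.map(operation, data, chunksize=25)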
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know that parallelizing tasks in numpy programs with threads does not always result in better overall performance.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Jan Böttinger on Unsplash