Last Updated on September 29, 2023
You can configure the number of threads used by BLAS via numpy with the threadpoolctl library.
In this tutorial, you will discover how to control the number of threads used by BLAS with the threadpoolctl library.
Let’s get started.
Need to Limit Number of BLAS Threads
NumPy is an array library in Python.
It makes use of third-party libraries to perform array functions efficiently.
One example is the BLAS and LAPACK specifications used by NumPy to execute vector, matrix, and linear algebra operations.
The benefit of BLAS libraries is that they make specific linear algebra operations more efficient by using advanced algorithms and multithreading.
Many BLAS and LAPACK functions are multithreaded, including matrix operations like matrix multiplication (dot product) and matrix decompositions like singular value decomposition (SVD).
We may need to configure the number of threads used by the installed BLAS library.
This may be for many reasons, such as:
- We want the BLAS calculation to be single-threaded.
- We want the BLAS calculation to use fewer threads.
- We want the BLAS calculation to use one thread per physical CPU core.
- We want the BLAS calculation to use one thread per logical CPU core.
These cases may come up if we are using concurrency elsewhere in our program, such as via Python threads, Python processes, or asyncio coroutines.
How can we configure the number of threads used by BLAS in NumPy?
How to Limit BLAS Threads With threadpoolctl
One way to configure the number of threads used by BLAS via numpy is via an environment variable.
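As a rough sketch of the environment variable approach, the snippet below sets the thread-count variables from within Python. Which variable takes effect depends on the BLAS library that NumPy was built against (OPENBLAS_NUM_THREADS for OpenBLAS, MKL_NUM_THREADS for Intel MKL, OMP_NUM_THREADS for OpenMP-based builds), so setting all three is an assumption made here for portability. The variables generally need to be set before NumPy is imported.

```python
# sketch: limit BLAS threads via environment variables
import os

# assumption: we do not know which BLAS library is installed, so set all three
for name in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS'):
    # environment variable values must be strings
    os.environ[name] = '1'

# import numpy only after this point so the BLAS library sees the settings
print({name: os.environ[name] for name in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS')})
```

The limitation of this approach is that the setting applies to the whole program, whereas threadpoolctl can change the limit for a single block of code.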
Another way to configure the number of threads used by BLAS via numpy is with the threadpoolctl library.
The threadpoolctl library is an open-source project that provides tools to configure the number of threads used in external libraries, such as BLAS and OpenMP.
Python helpers to limit the number of threads used in the threadpool-backed of common native libraries used for scientific computing and data science (e.g. BLAS and OpenMP).
— threadpoolctl GitHub Project
Firstly, we must install the threadpoolctl library using a Python package manager, such as pip.
For example:
    pip install threadpoolctl
The API provides methods to configure the number of threads used by BLAS within a scope, such as within a context manager.
This can be achieved via the threadpoolctl.threadpool_limits() function, which takes the maximum number of threads via the “limits” argument and the type of library to limit, in this case “blas”, via the “user_api” argument.
For example:
    ...
    # limit the number of blas threads
    with threadpool_limits(limits=4, user_api='blas'):
        # ...
This will limit the number of threads used by BLAS within the scope of the context manager.
This is helpful as we can change the number of threads to be equal to the number of physical cores in a block of code, or configure a block of code to not use BLAS multithreading at all, such as if we want to use a ThreadPool.
The threadpoolctl library works by configuring the number of BLAS threads directly at runtime via the public API of the BLAS library.
threadpoolctl only interacts with native libraries via their public runtime APIs.
— threadpoolctl GitHub Project
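One way to check that a limit is in effect is the threadpool_info() function, also provided by threadpoolctl, which reports the libraries it has detected along with their current thread counts. The sketch below is a minimal example; the helper name blas_thread_counts() is made up for illustration, and the reported values depend on the BLAS library installed on your system. The list may be empty if threadpoolctl does not recognise the library.

```python
# sketch: inspect the number of blas threads before, during and after a limit
import numpy  # importing numpy loads its BLAS library so it can be detected
from threadpoolctl import threadpool_info, threadpool_limits

# hypothetical helper: thread count of each detected BLAS library
def blas_thread_counts():
    return [lib['num_threads'] for lib in threadpool_info() if lib['user_api'] == 'blas']

# report the default number of threads
before = blas_thread_counts()
print('default:', before)
# report the number of threads inside the context manager
with threadpool_limits(limits=1, user_api='blas'):
    during = blas_thread_counts()
    print('limited:', during)
# report the number of threads after the context manager exits
after = blas_thread_counts()
print('restored:', after)
```

We would expect the limited count to be 1 and the restored count to match the default.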
Now that we know how to configure the number of threads used by BLAS via NumPy, let’s look at some worked examples.
Example of Matrix Multiplication with All BLAS Threads
Before we explore how to control the number of BLAS threads with the threadpoolctl library, let’s first develop an example that uses the default number of BLAS threads.
In this example, we will create two modestly large matrices and multiply them together.
Numpy matrix multiplication can be achieved using the dot() method on the arrays. The operation is performed using the installed BLAS library and is multithreaded.
Firstly, we will define two 8,000×8,000 matrices filled with ones.
    ...
    # size of arrays
    n = 8000
    # create two arrays filled with ones
    data1 = ones((n, n))
    data2 = ones((n, n))
We will then multiply one matrix by the other.
    ...
    # matrix multiplication
    result = data1.dot(data2)
We will time the whole operation and report the duration.
Tying this together, the complete example is listed below.
    # example of numpy matrix multiplication with blas threads
    from numpy import ones
    from time import time
    # size of arrays
    n = 8000
    # create two arrays filled with ones
    data1 = ones((n, n))
    data2 = ones((n, n))
    # record start time
    start = time()
    # matrix multiplication
    result = data1.dot(data2)
    # calculate duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example first creates two numpy arrays and then multiplies them together.
The operation is performed using the BLAS library with multiple threads.
In this case, on my system, the operation takes about 7.117 seconds.
We will use this as a baseline and compare it to the same example when controlling the number of threads used by BLAS with threadpoolctl.
Try repeating the operation a few times and average the duration. We do not need a precise measurement. Instead, we are interested in a directional change when controlling the number of threads via the threadpoolctl library.
    Took 7.117 seconds
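Repeating the operation and averaging the durations can be automated with a small helper. The sketch below is one possible approach, not part of the original example: the average_duration() function is a hypothetical name, it uses perf_counter() rather than time() because it is intended for measuring short intervals, and the matrices are smaller than 8,000×8,000 so that the repeats finish quickly.

```python
# sketch: repeat the matrix multiplication and average the durations
from statistics import mean
from time import perf_counter
from numpy import ones

# hypothetical helper: run a task several times and return the mean duration
def average_duration(task, repeats=3):
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        task()
        durations.append(perf_counter() - start)
    return mean(durations)

# smaller arrays than the tutorial, an assumption to keep the repeats fast
n = 1000
data1 = ones((n, n))
data2 = ones((n, n))
# time the matrix multiplication three times and report the average
avg = average_duration(lambda: data1.dot(data2))
print(f'Average of 3 runs: {avg:.3f} seconds')
```

Set n back to 8000 to compare the result with the durations reported in this tutorial.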
Next, let’s update the example to control the number of threads used by BLAS.
Example of Matrix Multiplication with Limited BLAS Threads
We can update the previous example to control the number of threads used by BLAS via the threadpoolctl library.
There are two common scenarios we can explore.
The first is to configure BLAS to use one thread per physical CPU core. This can offer better performance for some operations, such as matrix multiplication.
The other scenario is to turn off the multithreaded capability of BLAS. This is helpful in programs where we want to use Python multithreading for parallelism instead of parallelism at the operation level in BLAS.
Let’s take a closer look at each example.
Example of One Thread Per Physical CPU Core
In this example, we can configure BLAS to use one thread per physical CPU core in the system.
I have 4 physical CPU cores in my system, so we will configure the threadpool_limits() function to use 4 worker threads in BLAS when performing the matrix multiplication.
    ...
    # limit the number of blas threads
    with threadpool_limits(limits=4, user_api='blas'):
        # matrix multiplication
        result = data1.dot(data2)
Update the example to use the number of physical CPU cores in your system.
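If you are unsure how many cores you have, the sketch below shows one way to check. The os.cpu_count() function in the standard library reports logical cores. Reporting physical cores here relies on the third-party psutil library, which is an assumption and not something used elsewhere in this tutorial, so the code falls back to None if it is not installed.

```python
# sketch: report the number of logical and physical cpu cores
import os

# logical cores, may be None if it cannot be determined
logical = os.cpu_count()
# physical cores via psutil, if it happens to be installed
try:
    import psutil
    physical = psutil.cpu_count(logical=False)
except ImportError:
    physical = None
print(f'logical cores: {logical}, physical cores: {physical}')
```

The physical core count can then be passed as the “limits” argument in place of the hard-coded 4.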
Tying this together, the complete example with this change is listed below.
    # example of numpy matrix multiplication with limited blas threads
    from numpy import ones
    from time import time
    from threadpoolctl import threadpool_limits
    # size of arrays
    n = 8000
    # create two arrays filled with ones
    data1 = ones((n, n))
    data2 = ones((n, n))
    # record start time
    start = time()
    # limit the number of blas threads
    with threadpool_limits(limits=4, user_api='blas'):
        # matrix multiplication
        result = data1.dot(data2)
    # calculate duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example first creates the two matrices.
The number of threads is then configured to be equal to the number of physical CPU cores.
The matrix operation is then performed using BLAS.
In this case, the example takes about 4.863 seconds.
This is about 2.254 seconds faster than the same operation with the default number of BLAS threads, or a speedup of about 1.46x.
    Took 4.863 seconds
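The difference and speedup figures come from simple arithmetic on the two reported durations. The short sketch below reproduces the calculation using the numbers from my system, and you can substitute your own timings.

```python
# sketch: calculate the difference and speedup from two durations
baseline = 7.117  # seconds with the default number of blas threads
limited = 4.863   # seconds with one blas thread per physical core
# absolute difference in seconds
difference = baseline - limited
# speedup factor, a value above 1 means the limited version is faster
speedup = baseline / limited
print(f'{difference:.3f} seconds faster, {speedup:.2f}x speedup')
# prints: 2.254 seconds faster, 1.46x speedup
```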
Next, let’s look at how we can turn off BLAS multithreading in numpy.
Example of Single-Threaded BLAS Operations
We can update the example to disable BLAS threads using the threadpoolctl library.
This can be achieved via the threadpool_limits() function by setting the “limits” argument to 1.
This will use a single BLAS thread to perform the matrix multiplication.
For example:
    ...
    # limit the number of blas threads
    with threadpool_limits(limits=1, user_api='blas'):
        # matrix multiplication
        result = data1.dot(data2)
Tying this together, the complete example with this change is listed below.
    # example of numpy matrix multiplication with no blas threads
    from numpy import ones
    from time import time
    from threadpoolctl import threadpool_limits
    # size of arrays
    n = 8000
    # create two arrays filled with ones
    data1 = ones((n, n))
    data2 = ones((n, n))
    # record start time
    start = time()
    # limit the number of blas threads
    with threadpool_limits(limits=1, user_api='blas'):
        # matrix multiplication
        result = data1.dot(data2)
    # calculate duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example first creates the two matrices.
The number of threads is then configured to be one.
The matrix operation is then performed using BLAS.
In this case, the operation took about 16.165 seconds.
This is about 9.048 seconds slower than the default number of threads used by BLAS, or a slowdown of about 2.27x.
    Took 16.165 seconds
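A single matrix multiplication is slower with one BLAS thread, but the setting becomes useful when parallelism is moved up to the Python level. The sketch below is a hypothetical illustration of that idea: it runs several smaller multiplications in a ThreadPoolExecutor while BLAS is limited to one thread, assuming the limit set by the context manager also applies to calls made from the worker threads. The task() function, the matrix size, and the number of workers are all made up for the example.

```python
# sketch: single-threaded blas combined with a python thread pool
from concurrent.futures import ThreadPoolExecutor
from numpy import ones
from threadpoolctl import threadpool_limits

# hypothetical task: multiply a matrix of ones by itself
def task(n):
    data = ones((n, n))
    return data.dot(data)

# turn off blas multithreading while the thread pool is doing the work
with threadpool_limits(limits=1, user_api='blas'):
    # run four multiplications concurrently, one per worker thread
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(task, [200] * 4))
print(f'Completed {len(results)} multiplications')
```

Whether this is faster than letting BLAS use multiple threads for each multiplication in turn depends on the workload, so it is worth timing both.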
Takeaways
You now know how to control the number of threads used by BLAS via numpy with the threadpoolctl library.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.