Last Updated on September 29, 2023
You can configure the number of threads used by BLAS functions called by NumPy by setting the OMP_NUM_THREADS environment variable.
In this tutorial, you will discover how to configure the number of threads used by BLAS and LAPACK via NumPy.
Let’s get started.
Need to Configure the Number of Threads Used By BLAS
NumPy is an array library in Python.
It makes use of third-party libraries to perform array functions efficiently.
One example is the BLAS and LAPACK specifications used by NumPy to execute vector, matrix, and linear algebra operations.
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.
— BLAS (Basic Linear Algebra Subprograms)
The benefit of BLAS libraries is that they make specific linear algebra operations more efficient by using advanced algorithms and multithreading.
The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms. Those libraries may be provided by NumPy itself using C versions of a subset of their reference implementations but, when possible, highly optimized libraries that take advantage of specialized processor functionality are preferred.
— Linear algebra (numpy.linalg), NumPy API.
Many BLAS and LAPACK functions are multithreaded, including matrix operations like matrix multiplication (dot product) and matrix decompositions like singular value decomposition (SVD).
We may need to configure the number of threads used by the installed BLAS library.
This may be for many reasons, such as:
- We want the BLAS calculation to be single-threaded.
- We want the BLAS calculation to use fewer threads.
- We want the BLAS calculation to use one thread per physical CPU core.
- We want the BLAS calculation to use one thread per logical CPU core.
These cases may come up in our program if we are using concurrency elsewhere, such as via Python threads, Python processes, or Asyncio coroutines.
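As a rough illustration of that last point, when several worker processes each call BLAS, one approach is to divide the available cores among them so the total number of threads stays close to what the machine offers. The sketch below is only one way to choose the number, and the helper name threads_per_worker is hypothetical.

```python
from os import cpu_count

def threads_per_worker(n_workers, n_cores=None):
    """Divide the available cores roughly evenly among the workers."""
    # fall back to the logical core count reported by the OS (may be None)
    if n_cores is None:
        n_cores = cpu_count() or 1
    # always allow at least one BLAS thread per worker
    return max(1, n_cores // n_workers)

# e.g. 8 logical cores shared by 4 worker processes
print(threads_per_worker(4, n_cores=8))
```

The result would then be used as the thread count set in each worker's environment.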
How can we configure the number of threads used by BLAS?
How to Configure the Number of BLAS Threads
We can configure the number of threads used by BLAS via an environment variable.
The specific environment variable used to configure the number of threads depends on the specific BLAS library implementation that you have installed.
If you are not sure which BLAS library you have installed, there are a few ways to find out.
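One quick check, offered as a minimal sketch, is NumPy's own numpy.show_config() function, which prints the build configuration, including the BLAS and LAPACK libraries NumPy was built against. The exact output format varies between NumPy versions, and the import guard is only there so the snippet runs where NumPy is missing.

```python
# report the BLAS/LAPACK libraries that NumPy was built against
try:
    import numpy
except ImportError:
    numpy = None

if numpy is not None:
    # prints build details; look for entries such as openblas, mkl or accelerate
    numpy.show_config()
else:
    print('NumPy is not installed')
```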
The following lists some common BLAS libraries, along with related threading libraries such as OpenMP and NumExpr, and the environment variable that may be used to configure the number of threads each one uses.
- OpenMP: OMP_NUM_THREADS
- OpenBLAS: OPENBLAS_NUM_THREADS
- MKL: MKL_NUM_THREADS
- VecLIB/Accelerate: VECLIB_MAXIMUM_THREADS
- NumExpr: NUMEXPR_NUM_THREADS
The NumPy documentation mentions the OMP_NUM_THREADS environment variable. NumPy does not need to translate it: libraries such as OpenBLAS and MKL generally honor OMP_NUM_THREADS themselves when their own variable is not set.
Note that for performant linear algebra NumPy uses a BLAS backend such as OpenBLAS or MKL, which may use multiple threads that may be controlled by environment variables such as OMP_NUM_THREADS depending on what is used.
— Number of Threads used for Linear Algebra, Numpy API documentation.
The environment variable can be set prior to executing your Python script, for example by exporting it:
```shell
export OMP_NUM_THREADS=4
python ./program.py
```
Or on one line when executing the script:
```shell
OMP_NUM_THREADS=4 python ./program.py
```
It can also be set explicitly from within the Python program via the os.environ mapping.
For example:
```python
...
from os import environ
environ['OMP_NUM_THREADS'] = '4'
```
This must be executed before numpy is imported, i.e. before BLAS is loaded and configured for the program runtime.
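If you set the variable inside the program but still want a value exported in the shell to take precedence, one option is the setdefault() method of os.environ, which only assigns the value when the variable is not already defined. This is a small sketch of that idea, not something the NumPy documentation prescribes, and the default of '4' is an arbitrary choice.

```python
# set a default thread count without overriding a value from the shell
from os import environ

# only assigned if OMP_NUM_THREADS is not already in the environment
environ.setdefault('OMP_NUM_THREADS', '4')
print(environ['OMP_NUM_THREADS'])

# numpy must be imported after this point for the setting to take effect
```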
If you have many libraries installed and are having trouble setting the number of threads, try the following:
```python
# set the number of threads for many common libraries
from os import environ
N_THREADS = '4'
environ['OMP_NUM_THREADS'] = N_THREADS
environ['OPENBLAS_NUM_THREADS'] = N_THREADS
environ['MKL_NUM_THREADS'] = N_THREADS
environ['VECLIB_MAXIMUM_THREADS'] = N_THREADS
environ['NUMEXPR_NUM_THREADS'] = N_THREADS
```
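The same idea can be wrapped in a small function, which also guards against a common slip: environment values must be strings, so assigning an integer to os.environ raises a TypeError. The helper name set_blas_threads and the THREAD_VARS tuple are hypothetical names used for illustration.

```python
from os import environ

# environment variables read by common BLAS libraries and threading runtimes
THREAD_VARS = (
    'OMP_NUM_THREADS',
    'OPENBLAS_NUM_THREADS',
    'MKL_NUM_THREADS',
    'VECLIB_MAXIMUM_THREADS',
    'NUMEXPR_NUM_THREADS',
)

def set_blas_threads(n_threads):
    """Set every listed thread-count variable to n_threads."""
    # environment values must be strings, so convert the integer
    value = str(int(n_threads))
    for name in THREAD_VARS:
        environ[name] = value

# call this before importing numpy
set_blas_threads(4)
print({name: environ[name] for name in THREAD_VARS})
```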
Now that we know how to configure the number of threads used by the BLAS library, let’s explore how we may use it in practice, such as confirming our NumPy does indeed support multithreaded functions.
Configure Multithreaded BLAS Operations in NumPy (default)
As a first step, we can execute a multithreaded BLAS function from numpy using the default number of threads.
The default number of threads used by the installed BLAS library is determined when the library is installed.
Your BLAS library may use an arbitrary default number of threads. Alternatively, if the library was compiled as part of the installation, which is common, then it may automatically detect the number of logical CPU cores in your system and use that as the default.
We cannot easily check the default number of threads used by the installed BLAS library.
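One option outside NumPy is the third-party threadpoolctl package, which can report the thread pools of the native libraries loaded in the current process. It must be installed separately, so the sketch below treats it as optional and returns an empty list if either it or NumPy is missing. The helper name blas_thread_info is hypothetical.

```python
def blas_thread_info():
    """Return (library, num_threads) pairs for BLAS libraries loaded in this process."""
    try:
        import numpy  # noqa: F401  importing numpy loads its BLAS library
        from threadpoolctl import threadpool_info
    except ImportError:
        # numpy or threadpoolctl is not installed
        return []
    return [
        (pool.get('internal_api'), pool.get('num_threads'))
        for pool in threadpool_info()
        if pool.get('user_api') == 'blas'
    ]

print(blas_thread_info())
```

If no environment variable has been set, the reported thread count is the library's default on your system.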
Nevertheless, we can execute a multithreaded BLAS function from numpy and have it executed using the default number of threads.
The example below creates two 8,000 x 8,000 matrices of random numbers and multiplies them together using the dot() function. This function will call down into the installed BLAS library and is multithreaded.
This example will use the default number of threads configured for the installed BLAS library.
The complete example is listed below.
```python
# example of multithreaded matrix multiplication with default number of threads
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example first creates two matrices of random floating point values.
The matrices are then multiplied together.
The duration of the whole program is then reported. In this case, the program took about 8.003 seconds to complete.
```
Took 8.003 seconds
```
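Note that this timing includes creating the random matrices, which is not a BLAS operation and so may dilute the measured effect of BLAS threading. If you would rather time only the multiplication, a small helper built on time.perf_counter() can wrap any callable. The helper name time_call is hypothetical, and sum() is just a stand-in workload so the snippet runs without NumPy.

```python
from time import perf_counter

def time_call(func, *args, repeats=3):
    """Call func(*args) several times and return the best duration in seconds."""
    best = float('inf')
    for _ in range(repeats):
        start = perf_counter()
        func(*args)
        best = min(best, perf_counter() - start)
    return best

# stand-in workload; with NumPy you could pass data1.dot and data2 instead
duration = time_call(sum, range(1_000_000))
print(f'Took {duration:.3f} seconds')
```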
Next, let’s update the example so that it is single-threaded.
Configure Single-Threaded BLAS Operation in NumPy
We can update the above example so that the multithreaded BLAS operation is executed using a single thread.
This can be achieved by setting the environment variable that specifies the number of BLAS threads to use prior to loading the installed BLAS library.
In this case, we will specify the environment variable as part of the Python program, before numpy is imported.
We will set the environment variable recommended by the NumPy documentation, which is OMP_NUM_THREADS.
Update the example to use the environment variable for the BLAS library installed on your system. If you are not sure, try different environment variables (listed above) until you see a dramatic increase in the run time of the program.
The complete example updated to be single-threaded is listed below.
```python
# example of single-threaded matrix multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the updated example on my system took about 17.110 seconds.
You should see a dramatic increase in the runtime of the program compared to the default example in the previous section.
If you don’t, confirm the BLAS library that is installed on your system and seek out the API documentation in order to determine the environment variable used by the library to configure the number of threads.
```
Took 17.110 seconds
```
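Because the variable has to be set before NumPy is imported, comparing several settings within one Python process is awkward. One workaround, sketched here on the assumption that launching a fresh interpreter per setting is acceptable, is to run the program in a child process with a modified environment. The helper name run_with_env is hypothetical, and the child code below only echoes the variable it was started with; in practice you would run your benchmark code instead.

```python
import os
import subprocess
import sys

def run_with_env(code, **env_vars):
    """Run Python code in a child process with extra environment variables."""
    # copy the current environment and overlay the requested settings
    env = dict(os.environ, **env_vars)
    result = subprocess.run(
        [sys.executable, '-c', code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# stand-in child program that reports the setting it was started with
child = "import os; print(os.environ['OMP_NUM_THREADS'])"
for n in ('1', '4', '8'):
    print(n, run_with_env(child, OMP_NUM_THREADS=n))
```

The same loop could iterate over different variable names instead of different values if you are unsure which one your BLAS library reads.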
Configure BLAS Threads For Physical CPU Cores
Modern systems typically have multiple CPU cores.
Many modern CPUs support simultaneous multithreading (called hyperthreading on Intel processors), allowing each physical core to be treated like two logical CPU cores by the operating system, offering up to a 30% speedup for some operations.
We may choose to configure the BLAS library to use one thread per physical CPU core in our system. This can offer a speed-up for those operations that do not benefit from hyperthreading.
I have 4 physical CPU cores in my system, so the example below configures BLAS to use 4 threads for multithreaded operations. Update the example to match the number of physical cores in your system.
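As a rough way to find these numbers from Python, os.cpu_count() reports the number of logical CPU cores. The standard library has no direct call for physical cores; the third-party psutil package offers psutil.cpu_count(logical=False), which this sketch uses only if it happens to be installed. The helper name core_counts is hypothetical.

```python
from os import cpu_count

def core_counts():
    """Return (logical, physical) core counts; physical is None if unknown."""
    logical = cpu_count()
    try:
        import psutil  # third-party and optional
        physical = psutil.cpu_count(logical=False)
    except ImportError:
        physical = None
    return logical, physical

logical, physical = core_counts()
print(f'logical={logical} physical={physical}')
```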
The complete example with this change is listed below.
```python
# example of matrix multiplication with one thread per physical cpu core
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example takes about 6.304 seconds on my system.
This is faster than the default number of threads. It may suggest that, at least on my system, I may benefit from configuring BLAS to use one thread per physical CPU core when performing matrix multiplications, at least with matrices of this size.
Lots of qualifications, I know. Always test.
```
Took 6.304 seconds
```
Did you see a benefit on your system?
Let me know the results of the program on your system in the comments below.
Next, let’s repeat the test with one thread per logical CPU core.
Configure BLAS Threads For Logical CPU Cores
As we saw in the previous example, modern systems are multicore and modern cores use hyperthreading.
I have 4 physical cores in my system. With hyperthreading, the operating system sees this as 8 logical cores.
In this example, we can update the configuration to use one BLAS thread per logical CPU core and compare performance.
Update the number of threads to match the number of logical CPU cores in your system.
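Rather than hard-coding the value, you could derive it from os.cpu_count(), which returns the number of logical CPU cores, or None if it cannot be determined. A minimal sketch, to be placed before the numpy import:

```python
# configure one BLAS thread per logical CPU core
from os import cpu_count, environ

# cpu_count() may return None, so fall back to a single thread
n_logical = cpu_count() or 1
environ['OMP_NUM_THREADS'] = str(n_logical)
print(f"OMP_NUM_THREADS={environ['OMP_NUM_THREADS']}")

# import numpy only after the variable is set
```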
The complete example with this change is listed below.
```python
# example of matrix multiplication with one thread per logical cpu core
from os import environ
environ['OMP_NUM_THREADS'] = '8'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example takes about 8.016 seconds on my system.
This is about as fast as the default case. It suggests that perhaps the default configuration of the BLAS library on my system may be using one BLAS thread per logical CPU in my system. A sensible default.
```
Took 8.016 seconds
```
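To put the four timings side by side, the short snippet below computes the speedup of each configuration relative to the single-threaded run. The numbers are the ones reported in this tutorial; results on your system will differ.

```python
# timings in seconds reported in this tutorial
timings = {
    'default': 8.003,
    '1 thread': 17.110,
    '4 threads (physical cores)': 6.304,
    '8 threads (logical cores)': 8.016,
}

baseline = timings['1 thread']
for name, seconds in timings.items():
    # speedup relative to the single-threaded run
    print(f'{name}: {seconds:.3f}s, {baseline / seconds:.2f}x')
```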
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to configure the number of threads used by BLAS and LAPACK via NumPy.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.