Configure the Number of BLAS/LAPACK Threads for NumPy

Last Updated on September 29, 2023

You can configure the number of threads used by BLAS functions called by NumPy by setting the OMP_NUM_THREADS environment variable.

In this tutorial, you will discover how to configure the number of threads used by BLAS and LAPACK via NumPy.

Let’s get started.


Need to Configure the Number of Threads Used By BLAS

NumPy is an array library in Python.

It makes use of third-party libraries to perform array functions efficiently.

One example is the BLAS and LAPACK specifications used by NumPy to execute vector, matrix, and linear algebra operations.

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.
— BLAS (Basic Linear Algebra Subprograms)

The benefit of BLAS libraries is that they make specific linear algebra operations more efficient by using advanced algorithms and multithreading.

The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms. Those libraries may be provided by NumPy itself using C versions of a subset of their reference implementations but, when possible, highly optimized libraries that take advantage of specialized processor functionality are preferred.
— Linear algebra (numpy.linalg), NumPy API.

You can learn more about BLAS and LAPACK in the tutorial:

What is BLAS and LAPACK in NumPy

Many BLAS and LAPACK functions are multithreaded, including matrix operations like matrix multiplication (dot product) and matrix decompositions like singular value decomposition (SVD).

We may need to configure the number of threads used by the installed BLAS library.

This may be for many reasons, such as:

We want the BLAS calculation to be single-threaded.
We want the BLAS calculation to use fewer threads.
We want the BLAS calculation to use one thread per physical CPU core.
We want the BLAS calculation to use one thread per logical CPU core.

These cases may come up in our program if we are using concurrency elsewhere, such as via Python threads, Python processes, or Asyncio coroutines.
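To make the concurrency case concrete, here is a minimal sketch in which BLAS is limited to one thread and a Python thread pool supplies the parallelism instead. It assumes the installed BLAS library honors the OMP_NUM_THREADS environment variable. The multiply() helper, the pool size of 4, and the matrix size of 500 are arbitrary choices for illustration.

```python
# sketch: single-threaded BLAS combined with a Python thread pool
from os import environ
# assumption: the installed BLAS library honors OMP_NUM_THREADS
environ['OMP_NUM_THREADS'] = '1'
from concurrent.futures import ThreadPoolExecutor
from numpy.random import rand

def multiply(size):
    # hypothetical task: one independent matrix multiplication
    a = rand(size, size)
    b = rand(size, size)
    return a.dot(b)

# four worker threads, each issuing its own BLAS call
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(multiply, [500] * 8))
print(f'Completed {len(results)} multiplications')
```

Whether this is faster than letting BLAS use many threads for each call depends on the task and the system, so it is worth testing both.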

How can we configure the number of threads used by BLAS?


How to Configure the Number of BLAS Threads

We can configure the number of threads used by BLAS via an environment variable.

The specific environment variable used to configure the number of threads depends on the specific BLAS library implementation that you have installed.

If you are not sure which BLAS library you have installed, there are a few ways to find out. See this tutorial:

How to Check the BLAS Library Used by NumPy
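One quick check is to look at NumPy's build configuration. The sketch below captures the text printed by numpy.show_config() and searches it for a few library names. The list of names is an assumption about how each library appears in the output, so treat a match as a hint rather than proof.

```python
# sketch: look for known BLAS library names in the numpy build configuration
from contextlib import redirect_stdout
from io import StringIO
import numpy

# capture the text that show_config() prints
buffer = StringIO()
with redirect_stdout(buffer):
    numpy.show_config()
config_text = buffer.getvalue().lower()

# heuristic: these substrings are assumptions about how each library is named
known = ['openblas', 'mkl', 'accelerate', 'blis']
found = [name for name in known if name in config_text]
print(f'BLAS libraries mentioned: {found}')
```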

The following lists some common BLAS libraries, along with related threading runtimes such as OpenMP and NumExpr, and the environment variable that may be used to configure the number of threads each one uses.

OpenMP: OMP_NUM_THREADS
OpenBLAS: OPENBLAS_NUM_THREADS
MKL: MKL_NUM_THREADS
VecLIB/Accelerate: VECLIB_MAXIMUM_THREADS
NumExpr: NUMEXPR_NUM_THREADS

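The NumPy documentation mentions the OMP_NUM_THREADS environment variable. Many BLAS builds honor it, for example those that use OpenMP for threading, and libraries such as OpenBLAS and MKL also check it when their own variable is not set. Whether it has an effect depends on the specific BLAS library installed and how it was built.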

Note that for performant linear algebra NumPy uses a BLAS backend such as OpenBLAS or MKL, which may use multiple threads that may be controlled by environment variables such as OMP_NUM_THREADS depending on what is used.
— Number of Threads used for Linear Algebra, Numpy API documentation.

The environment variable can be set prior to executing your Python script, for example by exporting it:

...
export OMP_NUM_THREADS=4
python ./program.py

Or on one line when executing the script:

OMP_NUM_THREADS=4 python ./program.py
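The same idea can be sketched from Python when one program launches another. The example below starts a child Python process with OMP_NUM_THREADS added to its environment. The child is a one-line stand-in for program.py that only prints the value it received.

```python
# sketch: launch a child Python process with OMP_NUM_THREADS set
import os
import subprocess
import sys

# copy the current environment and add the thread setting
child_env = dict(os.environ, OMP_NUM_THREADS='4')
# stand-in for program.py: the child reports the value it received
child_code = 'import os; print(os.environ["OMP_NUM_THREADS"])'
completed = subprocess.run([sys.executable, '-c', child_code],
    env=child_env, capture_output=True, text=True, check=True)
print(completed.stdout.strip())
```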

It can also be set explicitly from within the Python program via the os.environ mapping.

For example:

...
from os import environ
environ['OMP_NUM_THREADS'] = '4'

This must be executed before numpy is imported, i.e. before BLAS is loaded and configured for the program runtime.
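Because the ordering is easy to get wrong, a small guard can help. The sketch below uses a hypothetical helper, set_blas_threads(), which is not part of NumPy. It sets the variable and reports whether numpy had already been imported at that point.

```python
# sketch: set the thread variable and report whether it was set in time
import sys
from os import environ

def set_blas_threads(n_threads, variable='OMP_NUM_THREADS'):
    # hypothetical helper, not part of numpy
    # if numpy is already imported, the setting may be ignored
    too_late = 'numpy' in sys.modules
    environ[variable] = str(n_threads)
    if too_late:
        print(f'Warning: numpy already imported, {variable} may have no effect')
    return not too_late

applied_in_time = set_blas_threads(4)
print(environ['OMP_NUM_THREADS'])
```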

If you have many libraries installed and are having trouble setting the number of threads, try the following:

# set the number of threads for many common libraries
from os import environ
N_THREADS = '4'
environ['OMP_NUM_THREADS'] = N_THREADS
environ['OPENBLAS_NUM_THREADS'] = N_THREADS
environ['MKL_NUM_THREADS'] = N_THREADS
environ['VECLIB_MAXIMUM_THREADS'] = N_THREADS
environ['NUMEXPR_NUM_THREADS'] = N_THREADS

Now that we know how to configure the number of threads used by the BLAS library, let’s explore how we may use it in practice, such as confirming our NumPy does indeed support multithreaded functions.


Configure Multithreaded BLAS Operations in NumPy (default)

As a first step, we can execute a multithreaded BLAS function from numpy using the default number of threads.

The default number of threads used by the installed BLAS library is determined when the library is installed.

Your BLAS library may use an arbitrary default number of threads. Alternatively, if the library was compiled as part of the installation, which is common, then it may automatically detect the number of logical CPU cores in your system and use that as the default.

NumPy does not provide a simple way to check the default number of threads used by the installed BLAS library, although third-party tools such as threadpoolctl can report it.

Nevertheless, we can execute a multithreaded BLAS function from numpy and have it executed using the default number of threads.
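If you do want to inspect the default, the third-party threadpoolctl package can often report it. The sketch below assumes threadpoolctl may or may not be installed and falls back to an empty list if it is missing. The values reported depend entirely on your installation.

```python
# sketch: report the thread pools of the native libraries loaded by numpy
import numpy  # import numpy first so its BLAS library is loaded into the process

try:
    # threadpoolctl is a separate package and may not be installed
    from threadpoolctl import threadpool_info
    pools = threadpool_info()
except ImportError:
    pools = []
    print('threadpoolctl is not installed')

for pool in pools:
    # each entry is a dict describing one detected library
    print(pool.get('user_api'), pool.get('internal_api'), pool.get('num_threads'))
```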

The example below creates two 8,000 x 8,000 matrices of random numbers and multiplies them together using the dot() function. This function will call down into the installed BLAS library and is multithreaded.

This example will use the default number of threads configured for the installed BLAS library.

The complete example is listed below.

# example of multithreaded matrix multiplication with default number of threads
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example first creates two matrices of random floating point values.

The matrices are then multiplied together.

The duration of the whole program is then reported. In this case, the program took about 8.003 seconds to complete.

Took 8.003 seconds
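Note that the reported duration covers the whole program, including creating the random matrices. If you want to isolate the BLAS call, one option is to time only the multiplication. The sketch below does this with a hypothetical helper, time_multiply(), and uses a smaller matrix size so that it runs quickly.

```python
# sketch: time only the matrix multiplication, not the array creation
from time import perf_counter
from numpy.random import rand

def time_multiply(n):
    # hypothetical helper: returns the seconds spent in dot() only
    data1 = rand(n, n)
    data2 = rand(n, n)
    start = perf_counter()
    data1.dot(data2)
    return perf_counter() - start

# a smaller size than the example above so the sketch runs quickly
duration = time_multiply(1000)
print(f'Multiplication took {duration:.3f} seconds')
```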

Next, let’s update the example so that it is single-threaded.


Configure Single-Threaded BLAS Operation in NumPy

We can update the above example so that the multithreaded BLAS operation is executed using a single thread.

This can be achieved by setting the environment variable that specifies the number of BLAS threads to use prior to loading the installed BLAS library.

In this case, we will specify the environment variable as part of the Python program, before numpy is imported.

We will set the environment variable recommended by the NumPy documentation, which is OMP_NUM_THREADS.

Update the example to use the environment variable for the BLAS library installed on your system. If you are not sure, try different environment variables (listed above) until you see a dramatic increase in the run time of the program.

The complete example updated to be single-threaded is listed below.

# example of single-threaded matrix multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the updated example on my system took about 17.110 seconds.

You should see a dramatic increase in the runtime of the program compared to the default example in the previous section.

If you don’t, confirm the BLAS library that is installed on your system and seek out the API documentation in order to determine the environment variable used by the library to configure the number of threads.

Took 17.110 seconds


Configure BLAS Threads For Physical CPU Cores

Modern systems typically have multiple CPU cores.

Many modern CPUs support simultaneous multithreading (called hyperthreading on Intel processors), allowing each physical core to be treated like two logical CPU cores by the operating system, offering up to a 30% speedup for some operations.

We may choose to configure the BLAS library to use one thread per physical CPU core in our system. This can offer a speed-up for those operations that do not benefit from hyperthreading.

I have 4 physical CPU cores in my system, so the example below configures BLAS to use 4 threads for multithreaded operations. Update the example to match the number of physical cores in your system.

If you’re not sure how many CPU cores you have in your system, see the tutorial:

Number of CPUs in Python
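As a quick check, the sketch below reports both counts. os.cpu_count() gives the number of logical cores. For physical cores it tries the third-party psutil package, and if that is not installed it falls back to the assumption of two logical cores per physical core, which does not hold on every system.

```python
# sketch: report logical and (if possible) physical CPU core counts
import os

# logical cores as seen by the operating system (cpu_count() may return None)
logical = os.cpu_count() or 1

try:
    # psutil is a third-party package and may not be installed
    import psutil
    physical = psutil.cpu_count(logical=False) or logical
except ImportError:
    # assumption: two logical cores per physical core, which is not always true
    physical = max(1, logical // 2)

print(f'logical={logical}, physical={physical}')
```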

The complete example with this change is listed below.

# example of matrix multiplication with one thread per physical cpu core
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example takes about 6.304 seconds on my system.

This is faster than the default number of threads. It may suggest that, at least on my system, I may benefit from configuring BLAS to use one thread per physical CPU core when performing matrix multiplications, at least with matrices of this size.

Lots of qualifications, I know. Always test.

Took 6.304 seconds

Did you see a benefit on your system?
Let me know the results of the program on your system in the comments below.

Next, let’s repeat the test with one thread per logical CPU core.


Configure BLAS Threads For Logical CPU Cores

As we saw in the previous example, modern systems are multicore and many modern cores support hyperthreading.

I have 4 physical cores in my system. With hyperthreading, the operating system sees these as 8 logical cores.

In this example, we can update the configuration to use one BLAS thread per logical CPU core and compare performance.

Update the number of threads to match the number of logical CPU cores in your system.

If you’re not sure how many logical CPU cores you have, see the tutorial:

Number of CPUs in Python

The complete example with this change is listed below.

# example of matrix multiplication with one thread per logical cpu core
from os import environ
environ['OMP_NUM_THREADS'] = '8'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example takes about 8.016 seconds on my system.

This is about as fast as the default case. It suggests that perhaps the default configuration of the BLAS library on my system may be using one BLAS thread per logical CPU in my system. A sensible default.

Took 8.016 seconds
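Rather than editing and re-running the script for each setting, the comparison can be automated. The variable must be set before numpy is imported, so the sketch below runs each trial in a fresh child process with a different OMP_NUM_THREADS value. It assumes the installed BLAS library honors that variable. The time_with_threads() helper is hypothetical and the matrix size is kept small so the sketch finishes quickly.

```python
# sketch: time the same multiplication under different thread settings
import os
import subprocess
import sys

# child program: times one matrix multiplication and prints the seconds
CHILD = '''
import sys
from time import perf_counter
from numpy.random import rand
n = int(sys.argv[1])
a, b = rand(n, n), rand(n, n)
start = perf_counter()
a.dot(b)
print(perf_counter() - start)
'''

def time_with_threads(n_threads, size):
    # hypothetical helper; assumes the BLAS library honors OMP_NUM_THREADS
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    done = subprocess.run([sys.executable, '-c', CHILD, str(size)],
        env=env, capture_output=True, text=True, check=True)
    return float(done.stdout.strip())

# 1, 4 and 8 mirror the thread counts used in this tutorial
timings = {k: time_with_threads(k, 1000) for k in (1, 4, 8)}
for k, seconds in timings.items():
    print(f'{k} threads: {seconds:.3f} seconds')
```

A single run per setting is noisy, so consider repeating each trial a few times and comparing the averages.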

Takeaways

You now know how to configure the number of threads used by BLAS and LAPACK via NumPy.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Photo by Steenium on Unsplash


About Jason Brownlee
