Limit BLAS Threads in Numpy with threadpoolctl

Last Updated on September 29, 2023

You can limit the number of threads that NumPy's BLAS library uses with the threadpoolctl library.

In this tutorial, you will discover how to control the number of threads used by BLAS with the threadpoolctl library.

Let’s get started.

Need to Limit Number of BLAS Threads

NumPy is an array library in Python.

It makes use of third-party libraries to perform array functions efficiently.

Examples include libraries that implement the BLAS and LAPACK specifications, which NumPy uses to execute vector, matrix, and linear algebra operations.

The benefit of BLAS libraries is that they make specific linear algebra operations more efficient by using advanced algorithms and multithreading.

You can learn more about BLAS and LAPACK for numpy in the tutorial:

What is BLAS and LAPACK in NumPy

Many BLAS and LAPACK functions are multithreaded, including matrix operations like matrix multiplication (dot product) and matrix decompositions like singular value decomposition (SVD).

We may need to configure the number of threads used by the installed BLAS library.

This may be for many reasons, such as:

We want the BLAS calculation to be single-threaded.
We want the BLAS calculation to use fewer threads.
We want the BLAS calculation to use one thread per physical CPU core.
We want the BLAS calculation to use one thread per logical CPU core.

These cases may come up if we are using concurrency elsewhere in our program, such as via Python threads, Python processes, or asyncio coroutines.

How can we configure the number of threads used by BLAS in NumPy?

How to Limit BLAS Threads With threadpoolctl

One way to configure the number of threads used by BLAS via numpy is via an environment variable.

You can learn more about this in the tutorial:

Configure the Number of NumPy BLAS Threads
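As a rough sketch of the environment variable approach, we can set the thread-count variables from within Python before NumPy is imported. The helper name limit_blas_threads_via_env() is hypothetical, and which variable is actually read depends on the BLAS library installed, so the sketch sets three widely used ones on the assumption that one of them applies.

```python
# sketch: limit blas threads with environment variables
import os

def limit_blas_threads_via_env(num_threads):
    # hypothetical helper: set the variables read by common BLAS backends
    value = str(num_threads)
    for name in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS'):
        os.environ[name] = value

# request a single blas thread
limit_blas_threads_via_env(1)
# numpy should only be imported after this point, because the variables
# are read when the blas library is first loaded
print(os.environ['OPENBLAS_NUM_THREADS'])
```

A limitation of this approach is that the setting is fixed for the life of the program, which is one reason to prefer a scoped limit.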

Another way to configure the number of threads used by BLAS via numpy is with the threadpoolctl library.

The threadpoolctl library is an open-source project that provides tools to configure the number of threads used in external libraries, such as BLAS and OpenMP.

Python helpers to limit the number of threads used in the threadpool-backed of common native libraries used for scientific computing and data science (e.g. BLAS and OpenMP).
— threadpoolctl GitHub Project

Firstly, we must install the threadpoolctl library using a Python package manager, such as pip.

For example:

pip install threadpoolctl

The API provides methods to configure the number of threads used by BLAS within a scope, such as within a context manager.

This can be achieved via the threadpoolctl.threadpool_limits() function. The maximum number of threads is specified via the “limits” argument, and the underlying library is selected by setting the “user_api” argument to “blas”.

For example:

...
# limit the number of blas threads
with threadpool_limits(limits=4, user_api='blas'):
    # ...

This will limit the number of threads used by BLAS within the scope of the context manager.
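To see the scoping in action, the sketch below reports the BLAS thread count before, inside, and after the context manager. It uses the threadpool_info() function from threadpoolctl, and the helper blas_thread_counts() is a hypothetical name introduced only for this illustration. The numbers reported will depend on your system and BLAS library.

```python
# sketch: check the blas thread count before, inside, and after the scope
import numpy  # imported so that its blas library is loaded
from threadpoolctl import threadpool_info, threadpool_limits

def blas_thread_counts():
    # hypothetical helper: one thread count per loaded blas library
    return [lib['num_threads'] for lib in threadpool_info()
            if lib['user_api'] == 'blas']

before = blas_thread_counts()
# limit the number of blas threads
with threadpool_limits(limits=1, user_api='blas'):
    inside = blas_thread_counts()
after = blas_thread_counts()
print(f'before={before} inside={inside} after={after}')
```

If a supported BLAS library is found, we would expect the count inside the block to be 1 and the count after the block to match the count before it.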

This is helpful as we can change the number of threads to be equal to the number of physical cores in a block of code, or configure a block of code to not use BLAS multithreading at all, such as if we want to use a ThreadPool.

The threadpoolctl library works by configuring the number of BLAS threads directly at runtime via the public API of the BLAS library.

threadpoolctl only interacts with native libraries via their public runtime APIs.
— threadpoolctl GitHub Project
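Because threadpoolctl talks to the libraries that are loaded at runtime, we can also ask it which libraries it has found. The sketch below lists them with threadpool_info(). The fields printed are the ones I understand each entry to contain, and the output varies by installation, so none is shown here.

```python
# sketch: list the thread pool libraries that threadpoolctl can see
import numpy  # imported so that its blas library is loaded
from threadpoolctl import threadpool_info

# one dict per supported native library loaded in this process
info = threadpool_info()
for lib in info:
    print(lib['user_api'], lib['internal_api'], lib['num_threads'], lib['filepath'])
```

If nothing is printed, then no supported library was detected, and limits set with threadpool_limits() would likely have no effect.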

You can learn more about the project here:

Thread-pool Controls (threadpoolctl), GitHub Project.

Now that we know how to configure the number of threads used by BLAS via NumPy, let’s look at some worked examples.

Example of Matrix Multiplication with All BLAS Threads

Before we explore how to control the number of BLAS threads with the threadpoolctl library, let’s first develop an example that uses the default number of BLAS threads.

In this example, we will create two modestly large matrices and multiply them together.

NumPy matrix multiplication can be achieved using the dot() method on the arrays. The operation is performed using the installed BLAS library and is multithreaded.

Firstly, we will define two 8,000×8,000 matrices filled with ones.

...
# size of arrays
n = 8000
# create two arrays filled with ones
data1 = ones((n, n))
data2 = ones((n, n))

We will then multiply one matrix by the other.

...
# matrix multiplication
result = data1.dot(data2)

We will time the whole operation and report the duration.

Tying this together, the complete example is listed below.

# example of numpy matrix multiplication with blas threads
from numpy import ones
from time import time
# size of arrays
n = 8000
# create two arrays filled with ones
data1 = ones((n, n))
data2 = ones((n, n))
# record start time
start = time()
# matrix multiplication
result = data1.dot(data2)
# calculate duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example first creates two numpy arrays and then multiplies them together.

The operation is performed using the BLAS library with multiple threads.

In this case, on my system, the operation takes about 7.117 seconds.

We will use this as a baseline and compare it to the same example when controlling the number of threads used by BLAS with threadpoolctl.

Try repeating the operation a few times and averaging the duration. We do not need a precise measurement; we are interested in the direction of the change when controlling the number of threads via the threadpoolctl library.

Took 7.117 seconds
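The suggestion to repeat the operation and average the duration can be sketched as follows. The function name time_matmul() and the number of repeats are my own choices for illustration, and the sketch uses a smaller matrix than the tutorial so that it finishes quickly. Increase n to 8000 to match the example above.

```python
# sketch: repeat the matrix multiplication and average the duration
from statistics import mean
from time import perf_counter
from numpy import ones

def time_matmul(n, repeats=3):
    # hypothetical helper: average duration of an n x n matrix multiplication
    data1 = ones((n, n))
    data2 = ones((n, n))
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        data1.dot(data2)
        durations.append(perf_counter() - start)
    return mean(durations)

# smaller than the tutorial's 8000 so the sketch runs quickly
average = time_matmul(1000)
print(f'Average {average:.3f} seconds')
```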

Next, let’s update the example to control the number of threads used by BLAS.

Example of Matrix Multiplication with Limited BLAS Threads

We can update the previous example to control the number of threads used by BLAS via the threadpoolctl library.

There are two common scenarios we can explore.

The first is to configure BLAS to use one thread per physical CPU core. This can offer better performance for some operations, such as matrix multiplication.

The other scenario is to turn off the multithreaded capability of BLAS. This is helpful in programs where we want to use Python multithreading for parallelism instead of parallelism at the operation level in BLAS.

Let’s take a closer look at each example.

Example of One Thread Per Physical CPU Core

In this example, we can configure BLAS to use one thread per physical CPU core in the system.

I have 4 physical CPU cores in my system, so we will configure the threadpool_limits() function to use 4 worker threads in BLAS when performing the matrix multiplication.

...
# limit the number of blas threads
with threadpool_limits(limits=4, user_api='blas'):
    # matrix multiplication
    result = data1.dot(data2)

Update the example to use the number of physical CPU cores in your system.

If you need help with determining the number of CPU cores in your system or more information on physical vs logical CPU cores, see the tutorial:

Number of CPUs in Python
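As a minimal sketch, the number of physical cores can be looked up in code rather than hard-coded. This assumes the third-party psutil library is installed. If it is not, the hypothetical helper physical_core_count() falls back to the logical core count from os.cpu_count(), which may be higher than the physical count on systems with hyperthreading.

```python
# sketch: look up the number of physical cpu cores
import os

def physical_core_count():
    # hypothetical helper: prefer psutil if it is installed
    try:
        import psutil
    except ImportError:
        psutil = None
    if psutil is not None:
        count = psutil.cpu_count(logical=False)
        if count:
            return count
    # fallback: logical cores, or 1 if the count cannot be determined
    return os.cpu_count() or 1

cores = physical_core_count()
print(f'Using {cores} cores')
```

The result could then be passed as the “limits” argument to threadpool_limits() in place of the fixed value of 4.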

Tying this together, the complete example with this change is listed below.

# example of numpy matrix multiplication with limited blas threads
from numpy import ones
from time import time
from threadpoolctl import threadpool_limits
# size of arrays
n = 8000
# create two arrays filled with ones
data1 = ones((n, n))
data2 = ones((n, n))
# record start time
start = time()
# limit the number of blas threads
with threadpool_limits(limits=4, user_api='blas'):
    # matrix multiplication
    result = data1.dot(data2)
# calculate duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example first creates the two matrices.

The number of threads is then configured to be equal to the number of physical CPU cores.

The matrix operation is then performed using BLAS.

In this case, the example takes about 4.863 seconds.

This is about 2.254 seconds faster than the same operation with the default number of BLAS threads, or a speedup of about 1.46x.

Took 4.863 seconds
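For reference, the difference and speedup figures come from simple arithmetic on the two timings reported on my system. Your own timings will differ, so substitute them for the values below.

```python
# sketch: calculate the time saved and the speedup from two timings
baseline = 7.117  # seconds with the default number of blas threads
limited = 4.863   # seconds with one blas thread per physical core
saved = baseline - limited
speedup = baseline / limited
print(f'Saved {saved:.3f} seconds, speedup {speedup:.2f}x')
# prints: Saved 2.254 seconds, speedup 1.46x
```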

Next, let’s look at how we can turn off BLAS multithreading in NumPy.

Example of Single-Threaded BLAS Operations

We can update the example to disable BLAS threads using the threadpoolctl library.

This can be achieved by calling the threadpool_limits() function with the “limits” argument set to 1.

This will use a single BLAS thread to perform the matrix multiplication.

For example:

...
# limit the number of blas threads
with threadpool_limits(limits=1, user_api='blas'):
    # matrix multiplication
    result = data1.dot(data2)

Tying this together, the complete example with this change is listed below.

# example of numpy matrix multiplication with no blas threads
from numpy import ones
from time import time
from threadpoolctl import threadpool_limits
# size of arrays
n = 8000
# create two arrays filled with ones
data1 = ones((n, n))
data2 = ones((n, n))
# record start time
start = time()
# limit the number of blas threads
with threadpool_limits(limits=1, user_api='blas'):
    # matrix multiplication
    result = data1.dot(data2)
# calculate duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example first creates the two matrices.

The number of threads is then configured to be one.

The matrix operation is then performed using BLAS.

In this case, the operation took about 16.165 seconds.

This is about 9.048 seconds slower than the same operation with the default number of BLAS threads, or a slowdown of about 2.27x.

Took 16.165 seconds
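A single operation is slower with one BLAS thread, but the setting becomes useful when we supply the parallelism ourselves. The sketch below runs many small matrix multiplications in a Python thread pool while BLAS is limited to one thread. The task() function, the matrix size, and the number of workers are all choices made for this illustration. It also assumes the limit applies to every thread in the process, which is how I understand it to behave for common BLAS libraries.

```python
# sketch: single-threaded blas combined with a python thread pool
from concurrent.futures import ThreadPoolExecutor
from numpy import ones
from threadpoolctl import threadpool_limits

def task(n):
    # hypothetical task: multiply an n x n matrix of ones by itself
    data = ones((n, n))
    return float(data.dot(data).sum())

# disable blas multithreading while the pool is running
with threadpool_limits(limits=1, user_api='blas'):
    # parallelism comes from the python worker threads instead
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(task, [200] * 8))
print(len(results), results[0])
```

Whether this is faster than letting BLAS manage the threads depends on the workload, so it is worth timing both arrangements.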

Takeaways

You now know how to control the number of threads used by BLAS via numpy with the threadpoolctl library.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Photo by Rikku Sama on Unsplash

About Jason Brownlee
