Some NumPy functions will execute in parallel using multithreading automatically and behind the scenes.

In this tutorial, you will discover which NumPy functions support parallelism via multithreading in Python.

Let’s get started.

## Which NumPy Functions Are Multithreaded?

NumPy supports multithreading by default.

Some NumPy functions make use of the BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) APIs. Open-source implementations of these libraries, such as OpenBLAS, will multithread some function calls automatically behind the scenes.

You can learn more about BLAS and LAPACK in NumPy in a separate tutorial.

Multithreading happens in the C or FORTRAN implementation of the function call. This is separate from Python, meaning the threads are not subjected to the Python global interpreter lock (GIL) and are able to run in parallel.

This can offer a dramatic speed-up of common function calls used in linear algebra.

The problem is, it is not clear which NumPy operators and functions call the BLAS and LAPACK APIs. In turn, it is not clear which NumPy functions are multithreaded.

**Which NumPy functions are multithreaded?**

## List of Multithreaded NumPy Functions

The NumPy API documentation does not provide a list of functions that are multithreaded.

Similarly, open-source implementations of the BLAS API do not list which functions make use of parallelism using threads, and which do not.

Instead, the only reliable way to know if a NumPy function has a multithreaded implementation is to run the function with and without BLAS thread support.

The NumPy API does give some help.

It provides a linear algebra API page that lists most, perhaps all, of the NumPy functions that explicitly call the BLAS and LAPACK APIs.

A subset of these functions supports parallelism using threads.

> The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms. Those libraries may be provided by NumPy itself using C versions of a subset of their reference implementations but, when possible, highly optimized libraries that take advantage of specialized processor functionality are preferred. Examples of such libraries are OpenBLAS, MKL (TM), and ATLAS. Because those libraries are multithreaded and processor dependent, environmental variables and external packages such as threadpoolctl may be needed to control the number of threads or specify the processor architecture.

— Linear algebra (numpy.linalg)
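As the quote notes, the third-party **threadpoolctl** package can inspect and limit BLAS threading at runtime. The sketch below assumes threadpoolctl is installed (e.g. via `pip install threadpoolctl`); the array sizes are illustrative:

```python
# sketch: inspecting and limiting BLAS threads with threadpoolctl
# (assumes the third-party threadpoolctl package is installed)
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# report the BLAS/LAPACK libraries NumPy is linked against
for lib in threadpool_info():
    print(lib.get('internal_api'), lib.get('num_threads'))

a = np.random.rand(500, 500)
b = np.random.rand(500, 500)

# perform a matrix multiplication restricted to one BLAS thread
with threadpool_limits(limits=1, user_api='blas'):
    c = a @ b
```

Unlike an environment variable, the context manager approach can switch the thread count on and off within a single running program.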

The list of NumPy functions that call BLAS and LAPACK APIs that may offer automatic parallelism via threads is as follows:

### Matrix and vector products

- numpy.dot()
- numpy.linalg.multi_dot()
- numpy.vdot()
- numpy.inner()
- numpy.outer()
- numpy.matmul()
- numpy.tensordot()
- numpy.einsum()
- numpy.einsum_path()
- numpy.linalg.matrix_power()
- numpy.kron()

### Decompositions

- numpy.linalg.cholesky()
- numpy.linalg.qr()
- numpy.linalg.svd()

### Matrix eigenvalues

- numpy.linalg.eig()
- numpy.linalg.eigh()
- numpy.linalg.eigvals()
- numpy.linalg.eigvalsh()

### Norms and other numbers

- numpy.linalg.norm()
- numpy.linalg.cond()
- numpy.linalg.det()
- numpy.linalg.matrix_rank()
- numpy.linalg.slogdet()
- numpy.trace()

### Solving equations and inverting matrices

- numpy.linalg.solve()
- numpy.linalg.tensorsolve()
- numpy.linalg.lstsq()
- numpy.linalg.inv()
- numpy.linalg.pinv()
- numpy.linalg.tensorinv()

Now that we know what functions call the BLAS and LAPACK APIs, we can test some of the more common functions to see if they support parallelism.

## Examples of Multithreaded NumPy Functions

We can determine if a NumPy function that calls the BLAS or LAPACK APIs supports parallelism via threads by running an experiment.

The implementation will choose multithreaded parallelism if the operation itself can be partitioned into subtasks, and if the data is large enough to benefit from the split, but not so large that it overwhelms main memory.

We can test a given operation with modestly sized NumPy arrays, first with BLAS multithreading disabled, then enabled. If the multithreaded run is faster, we know the operation supports parallelism via threads.

Many open-source BLAS implementations (e.g. OpenBLAS, which I have installed) allow the number of threads to be configured via the **OMP_NUM_THREADS** environment variable.

You may need to check which BLAS implementation you have installed, then locate the appropriate environment variable.

The environment variable can be set prior to running the Python program, or set explicitly in the Python program prior to importing NumPy.
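The latter approach might look like the following sketch; the array size and the timing with `perf_counter()` are illustrative choices, not part of the benchmark results reported below:

```python
# sketch: set OMP_NUM_THREADS before importing numpy, then time an operation
from os import environ
environ['OMP_NUM_THREADS'] = '1'  # must be set before numpy is imported

from time import perf_counter
from numpy.random import rand

# modestly sized arrays (illustrative size)
n = 2000
data1 = rand(n, n)
data2 = rand(n, n)

# time the matrix-matrix multiplication
start = perf_counter()
result = data1.dot(data2)
print(f'Took {perf_counter() - start:.3f} seconds')
```

Setting the variable after NumPy is imported has no effect, because the BLAS library reads it once when it is loaded.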

We will use the latter approach in the below examples that demonstrate functions that are known to support parallelism via multithreading.

If you discover more functions that support parallelism via multithreading, let me know in the comments below.

### Matrix-Matrix Multiplication

Matrix multiplication in NumPy supports multithreading.

This can be achieved via the **dot()** function on an array, the **numpy.dot()** function, and the “**@**” operator.

Preliminary testing suggests multithreading is only used for matrix-matrix multiplication. Tests with matrix-vector and vector-vector multiplication (e.g. via **vdot()**) suggest those operations are not parallelized.
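For two-dimensional arrays, the three spellings of matrix multiplication are equivalent, as this small sketch shows (the array sizes are illustrative):

```python
# the three equivalent spellings of matrix-matrix multiplication
import numpy as np

a = np.random.rand(100, 100)
b = np.random.rand(100, 100)

r1 = a.dot(b)      # ndarray method
r2 = np.dot(a, b)  # module-level function
r3 = a @ b         # matmul operator

# all three produce the same result for 2D arrays
assert np.allclose(r1, r2) and np.allclose(r1, r3)
```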

The example below shows matrix multiplication with BLAS multithreading disabled.

```python
# example of numpy matrix-matrix multiplication without threads
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix-matrix multiplication
result = data1.dot(data2)
```

Running the example multiplies the two matrices and takes about 17.1 seconds on my system.

The example below is updated to use 4 threads, one for each physical CPU core in my system.

Update to match the number of cores in your system.

```python
# example of numpy matrix-matrix multiplication with threads
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix-matrix multiplication
result = data1.dot(data2)
```

Running the example takes about 6.3 seconds on my system.

This highlights that matrix-matrix multiplication supports parallelism via threads.

### Eigendecomposition

Calculating an eigendecomposition in NumPy supports multithreading.

This can be achieved via the **numpy.linalg.eig()** function that in turn calls down to LAPACK.

> This is implemented using the _geev LAPACK routines which compute the eigenvalues and eigenvectors of general square arrays.

— numpy.linalg.eig
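A quick sketch that exercises **eig()** and checks the defining property A v = λ v (the seed and matrix size are arbitrary illustrative choices):

```python
# sketch: eigendecomposition of a small random matrix
import numpy as np
from numpy.linalg import eig

rng = np.random.default_rng(seed=1)
A = rng.random((50, 50))

# w holds the eigenvalues, the columns of v the eigenvectors
w, v = eig(A)

# check A @ v[:, i] == w[i] * v[:, i] for all i at once
# (v * w broadcasts each eigenvalue across its eigenvector column)
assert np.allclose(A @ v, v * w)
```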

The example below calculates an eigendecomposition for a square matrix of random values without threading.

```python
# example of numpy eigen decomposition without threads
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
from numpy.linalg import eig
# size of arrays
n = 3000
# create an array of random values
data = rand(n, n)
# decomposition
w, v = eig(data)
```

Running the example completes in about 13.9 seconds.

We can then update the example to use threads, limited to one thread per physical CPU core.

```python
# example of numpy eigen decomposition with threads
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand
from numpy.linalg import eig
# size of arrays
n = 3000
# create an array of random values
data = rand(n, n)
# decomposition
w, v = eig(data)
```

Running the updated example completes in about 11.6 seconds, offering a modest speed-up.

This highlights that calculating an eigendecomposition supports parallelism via threads.

### Singular Value Decomposition (SVD)

Calculating a singular value decomposition in NumPy supports multithreading.

This can be achieved by calling the **numpy.linalg.svd()** function.

Other similar decompositions are also multithreaded, including:

- QR decomposition via **numpy.linalg.qr()**
- Cholesky decomposition via **numpy.linalg.cholesky()**
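A brief sketch of both calls follows; the matrix construction is illustrative, and note that Cholesky requires a symmetric positive-definite input:

```python
# sketch: QR and Cholesky decompositions via LAPACK-backed NumPy calls
import numpy as np
from numpy.linalg import qr, cholesky

data = np.random.rand(200, 200)

# QR: factor a general matrix into orthogonal Q and upper-triangular R
q, r = qr(data)
assert np.allclose(q @ r, data)

# Cholesky: build a symmetric positive-definite matrix first
spd = data @ data.T + 200 * np.eye(200)
L = cholesky(spd)
assert np.allclose(L @ L.T, spd)
```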

The example below calculates an SVD for a square matrix of random values without threading.

```python
# example of numpy svd without threads
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
from numpy.linalg import svd
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# decomposition
u, s, vh = svd(data)
```

Running the example calculates the SVD in about 27.6 seconds.

We can then update the example to use threads, one for each physical CPU core in the system.

```python
# example of numpy svd with threads
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand
from numpy.linalg import svd
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# decomposition
u, s, vh = svd(data)
```

Running the example takes about 21.5 seconds, offering a modest speed-up.

This highlights that calculating an SVD in NumPy supports parallelism via threads.

## Further Reading

This section provides additional resources that you may find helpful.

**Books**

- Concurrent NumPy in Python, Jason Brownlee, 2023 (**my book**).

**Concurrent NumPy Guides**

- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes

**Concurrent NumPy Documentation**

- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API

**Python Concurrency APIs**

- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks

## Takeaways

You now know which NumPy functions support parallelism via multithreading in Python.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.
