Some NumPy functions will execute in parallel using multithreading automatically and behind the scenes.

In this tutorial, you will discover which NumPy functions support parallelism via multithreading in Python.

Let’s get started.

Table of Contents

## Which NumPy Functions Are Multithreaded?

NumPy supports multithreading by default.

Some NumPy functions make use of the BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) APIs. Open-source implementations of these libraries, like OpenBLAS will multi-thread some function calls automatically behind the scenes.

You can learn more about BLAS and LAPACK in numpy in the tutorial:

Multithreading happens in the C or FORTRAN implementation of the function call. This is separate from Python, meaning the threads are not subjected to the Python global interpreter lock (GIL) and are able to run in parallel.

This can offer a dramatic speed-up of common function calls used in linear algebra.

The problem is, it is not clear which NumPy operators and functions call the BLAS and LAPACK APIs. In turn, it is not clear which NumPy functions are multithreaded.

**Which NumPy functions are multithreaded?**

## List of Multithreaded NumPy Functions

The NumPy API documentation does not provide a list of functions that are multithreaded.

Similarly, open-source implementations of the BLAS API do not list which functions make use of parallelism using threads, and which do not.

Instead, the only reliable way to know if a NumPy function has a multithreaded implementation is to run the function with and without BLAS thread support.

The NumPy API does give some help.

It provides a linear algebra API page that lists most, perhaps all, of the NumPy functions that explicitly call the BLAS and LAPACK APIs.

A subset of these functions supports parallelism using threads.

The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms. Those libraries may be provided by NumPy itself using C versions of a subset of their reference implementations but, when possible, highly optimized libraries that take advantage of specialized processor functionality are preferred. Examples of such libraries are OpenBLAS, MKL (TM), and ATLAS. Because those libraries are multithreaded and processor dependent, environmental variables and external packages such as threadpoolctl may be needed to control the number of threads or specify the processor architecture.

— Linear algebra (numpy.linalg)

The list of NumPy functions that call BLAS and LAPACK APIs that may offer automatic parallelism via threads is as follows:

### Matrix and vector products

- numpy.dot()
- numpy.linalg.multi_dot(()
- numpy.vdot()
- numpy.inner()
- numpy.outer()
- numpy.matmul()
- numpy.tensordot()
- numpy.einsum(()
- numpy.einsum_path()
- numpy.linalg.matrix_power()
- numpy.kron()

### Decompositions

- numpy.linalg.cholesky()
- numpy.linalg.qr()
- numpy.linalg.svd()

### Matrix eigenvalues

- numpy.linalg.eig()
- numpy.linalg.eigh()
- numpy.linalg.eigvals()
- numpy.linalg.eigvalsh()

### Norms and other numbers

- numpy.linalg.norm()
- numpy.linalg.cond()
- numpy.linalg.det()
- numpy.linalg.matrix_rank()
- numpy.linalg.slogdet()
- numpy.trace()

### Solving equations and inverting matrices

- numpy.linalg.solve()
- numpy.linalg.tensorsolve()
- numpy.linalg.lstsq()
- numpy.linalg.inv()
- numpy.linalg.pinv()
- numpy.linalg.tensorinv()

Now that we know what functions call the BLAS and LAPACK APIs, we can test some of the more common functions to see if they support parallelism.

## Examples of Multithreaded NumPy Functions

We can determine if a NumPy function that calls the BLAS or LAPACK APIs supports parallelism via threads by running an experiment.

The implementation will choose to use multithreaded parallelism if the operation itself supports partitioning into subtasks, and if the size of the data is sufficiently large to benefit from subtasks, although not too large to overwhelm main memory.

We can test a given operation with modestly sized NumPy arrays firstly with BLAS multithreaded disabled, then enabled. If there is a speed difference, we know the operation does not support parallelism via threads.

Many open-source BLAS implementations (e.g. OpenBLAS that I have installed) allow the number of threads to be configured via the **OMP_NUM_THREADS** environment variable.

You may need to check which BLAS implementation you have installed, then locate the appropriate environment variable. For help with this, see the tutorial:

The environment variable can be set prior to running the Python program, or set explicitly in the Python program prior to importing NumPy.

We will use the latter approach in the below examples that demonstrate functions that are known to support parallelism via multithreading.

If you discover more functions that support parallelism via multithreading, let me know in the comments below.

### Matrix-Matrix Multiplication

Matrix multiplication in NumPy supports multithreading.

This can be achieved via the **dot()** function on an array, the **numpy.dot()** function, and the “**@**” operator.

Preliminary testing suggests multithreading is only supported for matrix-matrix multiplication. Tests with matrix-vector and vector-vector multiplication (e.g. via **vdot()**) do not appear to support parallelism.

The example below shows matrix multiplication with BLAS multithreading disabled.

1 2 3 4 5 6 7 8 9 10 11 |
# example of numpy matrix-matrix multiplication without threads from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand # size of arrays n = 8000 # create an array of random values data1 = rand(n, n) data2 = rand(n, n) # matrix-matrix multiplication result = data1.dot(data2) |

Running the example multiplies two matrices and takes about 17.1 seconds

The example below is updated to use 4 threads, one for each physical CPU core in my system.

Update to match the number of cores in your system.

1 2 3 4 5 6 7 8 9 10 11 |
# example of numpy matrix-matrix multiplication with threads from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand # size of arrays n = 8000 # create an array of random values data1 = rand(n, n) data2 = rand(n, n) # matrix-matrix multiplication result = data1.dot(data2) |

Running the example takes about 6.3 seconds on my system.

This highlights that matrix-matrix multiplication supports parallelism via threads.

### Eigendecomposition

Calculating an eigendecomposition in NumPy supports multithreading.

This can be achieved via the **numpy.linalg.eig()** function that in turn calls down to LAPACK.

This is implemented using the _geev LAPACK routines which compute the eigenvalues and eigenvectors of general square arrays.

— numpy.linalg.eig

The example below calculates an eigendecomposition for a square matrix of random values without threading.

1 2 3 4 5 6 7 8 9 10 11 |
# example of numpy eigen decomposition without threads from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from numpy.linalg import eig # size of arrays n = 3000 # create an array of random values data = rand(n, n) # decomposition w, v = eig(data) |

The example below completes in about 13.9 seconds.

We can then update the example to use threads, limited to one thread per physical CPU core.

1 2 3 4 5 6 7 8 9 10 11 |
# example of numpy eigen decomposition with threads from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from numpy.linalg import eig # size of arrays n = 3000 # create an array of random values data = rand(n, n) # decomposition w, v = eig(data) |

Running the updated example completes in about 11.6 seconds, offering a modest speed-up.

This highlights that calculating an eigendecomposition supports parallelism via threads.

### Singular Value Decomposition (SVD)

Calculating a singular value decomposition in NumPy supports multithreading.

This can be achieved by calling the **numpy.linalg.svd()** function.

Other similar decompositions are multithreaded, including:

- QR Decomposition via
**numpy.linalg.qr()** - Cholesky decomposition via the
**numpy.linalg.cholesky()**

The example below calculates a SVD for a square matrix of random values without threading.

1 2 3 4 5 6 7 8 9 10 11 |
# example of numpy svd without threads from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from numpy.linalg import svd # size of arrays n = 4000 # create an array of random values data = rand(n, n) # decomposition u, s, vh = svd(data) |

Running the example calculates the SVD in about 27.6 seconds.

We can then update the example to use threads, one for each physical CPU core in the system.

1 2 3 4 5 6 7 8 9 10 11 |
# example of numpy svd with threads from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from numpy.linalg import svd # size of arrays n = 4000 # create an array of random values data = rand(n, n) # decomposition u, s, vh = svd(data) |

Running the example takes about 21.5, offering a modest speed-up.

This highlights that calculating an SVD in NumPy supports parallelism via threads.

## Takeaways

You now know which NumPy functions support parallelism via multithreading in Python

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Photo by John Cameron on Unsplash

## Leave a Reply