NumPy Multithreaded Matrix Functions (up to 3x faster)

April 20, 2023 Concurrent NumPy

You can calculate matrix linear algebra functions in parallel with NumPy.

In this tutorial, you will discover how to execute multithreaded matrix linear algebra functions with NumPy.

Let's get started.

NumPy Linear Algebra Matrix Functions are Multithreaded

Linear algebra is a field of mathematics concerned with linear equations and their representation using vectors and matrices.

NumPy is a Python library for working with arrays of numbers. As such, it implements many linear algebra functions in the numpy.linalg module.

Many of the numpy.linalg functions for working with matrices call down into the BLAS and LAPACK libraries. These libraries provide efficient implementations of common matrix functions. Some of these efficient implementations are multithreaded.

The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms.

-- Linear algebra (numpy.linalg)

This means that many linear algebra functions in numpy will execute in parallel using multiple CPU cores by default.
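Which library provides those implementations depends on your install. As a quick check, you can inspect which BLAS/LAPACK build your NumPy installation links against via numpy.show_config() (the exact output varies by platform and install):

```python
import numpy

# print the BLAS/LAPACK build information for this NumPy install
# (the output names the backend, e.g. openblas, mkl, or accelerate)
numpy.show_config()
```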

Which NumPy Matrix Functions are Multithreaded?

The numpy.linalg documentation does not specify which functions are multithreaded.

Neither do the BLAS and LAPACK specifications, nor specific library implementations such as OpenBLAS.

Nevertheless, most of the linear algebra functions are multithreaded, including matrix multiplication (numpy.dot()), matrix power (numpy.linalg.matrix_power()), matrix determinant (numpy.linalg.det()), matrix inverse (numpy.linalg.inv()), and matrix pseudo-inverse (numpy.linalg.pinv()).

Now that we know that common linear algebra matrix functions are multithreaded, let's look at some worked examples.

Parallel Matrix Multiplication (Dot Product)

In this section, we will explore a parallel implementation of matrix multiplication.

In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first and the number of columns of the second matrix.

-- Matrix multiplication, Wikipedia.

We can implement matrix multiplication with NumPy via the numpy.dot() function, via the dot() method on NumPy array objects, or via the "@" operator.

Dot product of two arrays.

-- numpy.dot API
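As a small illustration (separate from the benchmark below), all three forms compute the same product, and the shapes follow the (m, n) x (n, p) -> (m, p) rule:

```python
from numpy import array, dot, allclose

a = array([[1., 2., 3.], [4., 5., 6.]])       # shape (2, 3)
b = array([[7., 8.], [9., 10.], [11., 12.]])  # shape (3, 2)
c1 = dot(a, b)  # function form
c2 = a.dot(b)   # method form
c3 = a @ b      # operator form
# all three forms produce the same (2, 2) matrix product
assert allclose(c1, c2) and allclose(c1, c3)
print(c1.shape)  # prints (2, 2)
```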

Let's take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Matrix Multiplication

In this example, we will configure BLAS to use a single thread.

We will create two square matrices with the dimensions 8,000 x 8,000 then multiply them together.

The complete example is listed below.

# example of single threaded matrix multiplication
from os import environ
# configure the number of BLAS threads (must be set before numpy is imported)
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 16.934 seconds on my system.

It may take more or fewer seconds, depending on the hardware in your system and the versions of Python, NumPy, and the BLAS library installed.

Took 16.934 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Matrix Multiplication

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.
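If you do not know your core count, one option is to query it from the operating system (a sketch; note that os.cpu_count() reports logical cores, which may be double the physical count on systems with hyper-threading):

```python
from os import cpu_count, environ

# set the BLAS thread count to the number of logical CPUs
# (this must run before numpy is first imported)
environ['OMP_NUM_THREADS'] = str(cpu_count())
print(environ['OMP_NUM_THREADS'])
```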

The complete example is listed below.

# example of multithreaded matrix multiplication
from os import environ
# configure the number of BLAS threads (must be set before numpy is imported)
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 5.706 seconds.

That is about 11.228 seconds faster or a speed-up of about 2.97x.

Took 5.706 seconds

Parallel Matrix Power (Exponent)

In this section, we will explore a parallel implementation of the matrix power.

That is, raising a matrix to a given exponent.

We can implement the matrix power with NumPy via the numpy.linalg.matrix_power() function.

Raise a square matrix to the (integer) power n.

-- numpy.linalg.matrix_power API.
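As a quick sanity check on a small matrix (separate from the benchmark below), matrix_power is equivalent to repeated matrix multiplication:

```python
from numpy import array, allclose
from numpy.linalg import matrix_power

# a small 2x2 matrix
a = array([[1., 1.], [0., 1.]])
# raising to the power 3 is the same as a @ a @ a
assert allclose(matrix_power(a, 3), a @ a @ a)
```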

Let's take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Matrix Power

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 4,000 x 4,000 then raise the matrix to the power 10.

The complete example is listed below.

# example of single threaded matrix power
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import matrix_power
# record the start time
start = time()
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# matrix power
result = matrix_power(data, 10)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 8.159 seconds on my system.

Took 8.159 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Matrix Power

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# example of multithreaded matrix power
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import matrix_power
# record the start time
start = time()
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# matrix power
result = matrix_power(data, 10)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 2.624 seconds.

That is about 5.535 seconds faster or a speed-up of about 3.11x.

Took 2.624 seconds

Parallel Matrix Determinant (Det)

In this section, we will explore a parallel implementation of the matrix determinant.

In mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. It characterizes some properties of the matrix and the linear map represented by the matrix.

-- Determinant, Wikipedia.

We can implement the matrix determinant in Python via the numpy.linalg.det() function.

Compute the determinant of an array.

-- numpy.linalg.det API
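For intuition, on a 2x2 matrix the determinant is simply ad - bc, and numpy.linalg.det() agrees:

```python
from numpy import array
from numpy.linalg import det

a = array([[1., 2.], [3., 4.]])
# determinant of [[a, b], [c, d]] is a*d - b*c = 1*4 - 2*3 = -2
result = det(a)
print(round(result, 6))  # prints -2.0
```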

Let's take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Matrix Determinant

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 10,000 x 10,000 and then calculate the determinant.

The complete example is listed below.

# example of single threaded matrix determinant
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones
from numpy.linalg import det
# record the start time
start = time()
# size of arrays
n = 10000
# create a data array
data = ones((n, n))
# matrix determinant
result = det(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 11.892 seconds to complete on my system.

Took 11.892 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Matrix Determinant

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# example of multithreaded matrix determinant
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy import ones
from numpy.linalg import det
# record the start time
start = time()
# size of arrays
n = 10000
# create a data array
data = ones((n, n))
# matrix determinant
result = det(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 4.526 seconds.

That is about 7.366 seconds faster or a speed-up of about 2.63x.

Took 4.526 seconds

Parallel Matrix Inverse

In this section, we will explore a parallel implementation of the matrix inverse.

In linear algebra, an n-by-n square matrix A is called invertible (also nonsingular or nondegenerate), if there exists an n-by-n square matrix B such that AB = BA = In, where In denotes the n-by-n identity matrix and the multiplication used is ordinary matrix multiplication.

-- Invertible matrix, Wikipedia.

We can implement the matrix inverse in Python via the numpy.linalg.inv() function.

Compute the (multiplicative) inverse of a matrix.

-- numpy.linalg.inv API
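As a small check (separate from the benchmark below), a matrix multiplied by its inverse yields the identity matrix:

```python
from numpy import array, allclose, eye
from numpy.linalg import inv

a = array([[4., 7.], [2., 6.]])
a_inv = inv(a)
# a @ inv(a) should equal the 2x2 identity matrix
assert allclose(a @ a_inv, eye(2))
```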

Let's take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Matrix Inverse

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 5,000 x 5,000 then calculate the matrix inverse.

The complete example is listed below.

# example of single threaded matrix inverse
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import inv
# record the start time
start = time()
# size of arrays
n = 5000
# create an array of random values
data = rand(n, n)
# matrix inverse
result = inv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 6.286 seconds to complete on my system.

Took 6.286 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Matrix Inverse

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# example of multithreaded matrix inverse
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import inv
# record the start time
start = time()
# size of arrays
n = 5000
# create an array of random values
data = rand(n, n)
# matrix inverse
result = inv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 2.435 seconds.

That is about 3.851 seconds faster or a speed-up of about 2.58x.

Took 2.435 seconds

Parallel Matrix Pseudo Inverse (Moore-Penrose Method)

In this section, we will explore a parallel implementation of the matrix pseudo-inverse.

In mathematics, and in particular linear algebra, the Moore–Penrose inverse A+ of a matrix A is the most widely known generalization of the inverse matrix.

-- Moore–Penrose inverse, Wikipedia.

We can implement the matrix pseudo inverse in Python via the numpy.linalg.pinv() function.

Compute the (Moore-Penrose) pseudo-inverse of a matrix. Calculate the generalized inverse of a matrix using its singular-value decomposition (SVD) and including all large singular values.

-- numpy.linalg.pinv API
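Unlike inv(), pinv() also handles non-square matrices. A quick check of the defining Moore-Penrose property on a small matrix:

```python
from numpy import array, allclose
from numpy.linalg import pinv

# a non-square (3 x 2) matrix, which has no ordinary inverse
a = array([[1., 2.], [3., 4.], [5., 6.]])
a_pinv = pinv(a)  # shape (2, 3)
# defining property of the Moore-Penrose inverse: a @ a_pinv @ a == a
assert allclose(a @ a_pinv @ a, a)
```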

Let's take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Matrix Pseudo Inverse

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the matrix pseudo inverse.

The complete example is listed below.

# example of single threaded matrix pseudo inverse
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import pinv
# record the start time
start = time()
# size of array
n = 3000
# create an array of random values
data = rand(n, n)
# matrix pseudo inverse
result = pinv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 12.950 seconds to complete on my system.

Took 12.950 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Matrix Pseudo Inverse

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# example of multithreaded matrix pseudo inverse
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import pinv
# record the start time
start = time()
# size of array
n = 3000
# create an array of random values
data = rand(n, n)
# matrix pseudo inverse
result = pinv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example took about 9.490 seconds.

That is about 3.460 seconds faster or a speed-up of about 1.36x.

Took 9.490 seconds

Takeaways

You now know how to execute multithreaded matrix linear algebra functions with NumPy.



If you enjoyed this tutorial, you will love my book: Concurrent NumPy in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.