Last Updated on September 29, 2023
You can execute matrix linear algebra functions in parallel with NumPy.
In this tutorial, you will discover how to calculate multithreaded matrix linear algebra functions with NumPy.
Let’s get started.
NumPy Linear Algebra Matrix Functions are Multithreaded
Linear algebra is a field of mathematics concerned with linear equations and their representation using vectors and matrices of numbers.
NumPy is a Python library for working with arrays of numbers. As such, it implements many linear algebra functions in the numpy.linalg module.
Many of the numpy.linalg functions for working with matrices call down into the BLAS and LAPACK libraries. These libraries provide efficient implementations of common matrix functions. Some of these efficient implementations are multithreaded.
The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms.
— Linear algebra (numpy.linalg)
You can learn more about the BLAS library in NumPy in the tutorial:
This means that many linear algebra functions in numpy will execute in parallel using multiple CPU cores by default.
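As the examples later in this tutorial do, you can control this default parallelism by setting an environment variable such as OMP_NUM_THREADS before NumPy is imported. A minimal sketch (the exact variable depends on the installed BLAS library; OMP_NUM_THREADS works for OpenBLAS built with OpenMP):

```python
# cap the number of BLAS threads (must be set before importing numpy)
from os import environ
environ['OMP_NUM_THREADS'] = '2'

import numpy as np

# this matrix multiplication will use at most 2 BLAS threads
data1 = np.random.rand(500, 500)
data2 = np.random.rand(500, 500)
result = data1.dot(data2)
print(result.shape)
```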
Which NumPy Matrix Functions are Multithreaded?
The numpy.linalg documentation does not specify which functions are multithreaded.
Nor do the BLAS and LAPACK specifications, or specific library implementations such as OpenBLAS.
Nevertheless, most of the linear algebra functions are multithreaded, including:
- Matrix and vector products: such as numpy.dot() and numpy.linalg.matrix_power()
- Norms and other numbers: such as numpy.linalg.det() and numpy.linalg.matrix_rank()
- Inverting matrices: such as numpy.linalg.inv() and numpy.linalg.pinv()
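As a quick illustration with a small, hand-picked matrix (values chosen here for easy checking), each of the functions listed above can be called directly; the same BLAS/LAPACK routines are used regardless of matrix size, although multithreading only pays off for large matrices:

```python
import numpy as np
from numpy.linalg import matrix_power, det, matrix_rank, inv, pinv

# a small diagonal matrix with an easy-to-check determinant and inverse
a = np.array([[2.0, 0.0], [0.0, 3.0]])

print(np.dot(a, a))        # matrix product
print(matrix_power(a, 3))  # repeated matrix product
print(det(a))              # determinant: 2*3 = 6.0
print(matrix_rank(a))      # rank: 2
print(inv(a))              # inverse: [[0.5, 0], [0, 1/3]]
print(pinv(a))             # pseudo-inverse (equals inv(a) here)
```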
You can learn more about the BLAS-supported linear algebra functions in the numpy.linalg module in the tutorial:
Now that we know that common linear algebra matrix functions are multithreaded, let’s look at some worked examples.
Parallel Matrix Multiplication (Dot Product)
In this section, we will explore a parallel implementation of matrix multiplication.
In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first and the number of columns of the second matrix.
— Matrix multiplication, Wikipedia.
We can implement matrix multiplication with NumPy via the numpy.dot() function or via the dot() method on NumPy array objects. We can also use the "@" operator.
Dot product of two arrays.
— numpy.dot API
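For example, the three forms produce the same result (small example matrices chosen here for illustration):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

c1 = np.dot(a, b)  # function form
c2 = a.dot(b)      # method form
c3 = a @ b         # operator form

# all three give the same matrix product
print(c1)  # [[19. 22.] [43. 50.]]
```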
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Multiplication
In this example, we will configure BLAS to use a single thread.
We will create two square matrices with the dimensions 8,000 x 8,000 then multiply them together.
The complete example is listed below.
```python
# example of single threaded matrix multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 16.934 seconds on my system.
It may take more or fewer seconds, depending on the hardware in your system and the versions of Python, NumPy, SciPy, and the BLAS library installed.
```
Took 16.934 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Multiplication
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 5.706 seconds.
That is about 11.228 seconds faster or a speed-up of about 2.97x.
```
Took 5.706 seconds
```
Parallel Matrix Power (Exponent)
In this section, we will explore a parallel implementation of the matrix power.
That is, raising a matrix to a given exponent.
We can implement the matrix power with NumPy via the numpy.linalg.matrix_power() function.
Raise a square matrix to the (integer) power n.
— numpy.linalg.matrix_power API.
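Matrix power is just repeated matrix multiplication. A small sketch (matrix values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import matrix_power

a = np.array([[1.0, 1.0], [0.0, 1.0]])

# raising to the power 3 is the same as a.dot(a).dot(a)
p = matrix_power(a, 3)
print(p)  # [[1. 3.] [0. 1.]]
```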
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Power
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 4,000 x 4,000 then raise the matrix to the power 10.
The complete example is listed below.
```python
# example of single threaded matrix power
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import matrix_power
# record the start time
start = time()
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# matrix power
result = matrix_power(data, 10)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 8.159 seconds on my system.
```
Took 8.159 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Power
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix power
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import matrix_power
# record the start time
start = time()
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# matrix power
result = matrix_power(data, 10)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 2.624 seconds.
That is about 5.535 seconds faster or a speed-up of about 3.11x.
```
Took 2.624 seconds
```
Parallel Matrix Determinant (Det)
In this section, we will explore a parallel implementation of the matrix determinant.
In mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. It characterizes some properties of the matrix and the linear map represented by the matrix.
— Determinant, Wikipedia.
We can compute the matrix determinant in Python via the numpy.linalg.det() function.
Compute the determinant of an array.
— numpy.linalg.det API
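For a 2x2 matrix [[a, b], [c, d]] the determinant is a*d - b*c, which makes a small example easy to verify by hand (values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import det

a = np.array([[3.0, 8.0], [4.0, 6.0]])

# determinant of a 2x2 matrix: 3*6 - 8*4 = -14
print(det(a))  # -14.0 (within floating point tolerance)
```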
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Determinant
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 10,000 x 10,000 and then calculate the determinant.
The complete example is listed below.
```python
# example of single threaded matrix determinant
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones
from numpy.linalg import det
# record the start time
start = time()
# size of arrays
n = 10000
# create a data array
data = ones((n, n))
# matrix determinant
result = det(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 11.892 seconds to complete on my system.
```
Took 11.892 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Determinant
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix determinant
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy import ones
from numpy.linalg import det
# record the start time
start = time()
# size of arrays
n = 10000
# create a data array
data = ones((n, n))
# matrix determinant
result = det(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 4.526 seconds.
That is about 7.366 seconds faster or a speed-up of about 2.63x.
```
Took 4.526 seconds
```
Parallel Matrix Inverse
In this section, we will explore a parallel implementation of the matrix inverse.
In linear algebra, an n-by-n square matrix A is called invertible (also nonsingular or nondegenerate), if there exists an n-by-n square matrix B such that AB = BA = In, where In denotes the n-by-n identity matrix and the multiplication used is ordinary matrix multiplication.
— Invertible matrix, Wikipedia.
We can compute the matrix inverse in Python via the numpy.linalg.inv() function.
Compute the (multiplicative) inverse of a matrix.
— numpy.linalg.inv API
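Following the definition above, multiplying a matrix by its inverse gives the identity matrix, which we can check on a small example (values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import inv

a = np.array([[4.0, 7.0], [2.0, 6.0]])
b = inv(a)

# a matrix times its inverse is the identity matrix (up to rounding)
print(np.allclose(a @ b, np.eye(2)))  # True
```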
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Inverse
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 5,000 x 5,000 then calculate the matrix inverse.
The complete example is listed below.
```python
# example of single threaded matrix inverse
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import inv
# record the start time
start = time()
# size of arrays
n = 5000
# create an array of random values
data = rand(n, n)
# matrix inverse
result = inv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 6.286 seconds to complete on my system.
```
Took 6.286 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Inverse
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix inverse
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import inv
# record the start time
start = time()
# size of arrays
n = 5000
# create an array of random values
data = rand(n, n)
# matrix inverse
result = inv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 2.435 seconds.
That is about 3.851 seconds faster or a speed-up of about 2.58x.
```
Took 2.435 seconds
```
Parallel Matrix Pseudo Inverse (Moore-Penrose Method)
In this section, we will explore a parallel implementation of the matrix pseudo-inverse.
In mathematics, and in particular linear algebra, the Moore–Penrose inverse A+ of a matrix A is the most widely known generalization of the inverse matrix.
— Moore–Penrose inverse, Wikipedia.
We can compute the matrix pseudo-inverse in Python via the numpy.linalg.pinv() function.
Compute the (Moore-Penrose) pseudo-inverse of a matrix. Calculate the generalized inverse of a matrix using its singular-value decomposition (SVD) and including all large singular values.
— numpy.linalg.pinv API
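Unlike inv(), pinv() also works for non-square matrices. A small sketch of the defining Moore-Penrose property A @ A+ @ A == A (matrix values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import pinv

# a non-square matrix, which has no ordinary inverse
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = pinv(a)  # shape (2, 3)

# the defining property of the pseudo-inverse
print(np.allclose(a @ b @ a, a))  # True
```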
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Pseudo Inverse
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the matrix pseudo inverse.
The complete example is listed below.
```python
# example of single threaded matrix pseudo inverse
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import pinv
# record the start time
start = time()
# size of array
n = 3000
# create an array of random values
data = rand(n, n)
# matrix pseudo inverse
result = pinv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 12.950 seconds to complete on my system.
```
Took 12.950 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Pseudo Inverse
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix pseudo inverse
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import pinv
# record the start time
start = time()
# size of array
n = 3000
# create an array of random values
data = rand(n, n)
# matrix pseudo inverse
result = pinv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 9.490 seconds.
That is about 3.460 seconds faster or a speed-up of about 1.36x.
```
Took 9.490 seconds
```
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to calculate multithreaded matrix linear algebra functions with NumPy.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Markus Winkler on Unsplash