Last Updated on September 29, 2023
You can execute matrix linear algebra functions in parallel with NumPy.
In this tutorial, you will discover how to calculate multithreaded matrix linear algebra functions with NumPy.
Let’s get started.
NumPy Linear Algebra Matrix Functions are Multithreaded
Linear algebra is a field of mathematics concerned with linear equations and their representation using vectors and matrices of numbers.
NumPy is a Python library for working with arrays of numbers. As such, it implements many linear algebra functions in the numpy.linalg module.
Many of the numpy.linalg functions for working with matrices call down into the BLAS and LAPACK libraries. These libraries provide efficient implementations of common matrix functions. Some of these efficient implementations are multithreaded.
The NumPy linear algebra functions rely on BLAS and LAPACK to provide efficient low level implementations of standard linear algebra algorithms.
— Linear algebra (numpy.linalg)
You can learn more about the BLAS library in NumPy in the tutorial:
This means that many linear algebra functions in numpy will execute in parallel using multiple CPU cores by default.
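As the examples later in this tutorial do, you can control this default parallelism by setting an environment variable such as OMP_NUM_THREADS before NumPy is imported. A minimal sketch (the exact variable depends on the installed BLAS library; OMP_NUM_THREADS works for OpenBLAS built with OpenMP):

```python
# cap the number of BLAS threads (must be set before importing numpy)
from os import environ
environ['OMP_NUM_THREADS'] = '2'

import numpy as np

# this matrix multiplication will use at most 2 BLAS threads
data1 = np.random.rand(500, 500)
data2 = np.random.rand(500, 500)
result = data1.dot(data2)
print(result.shape)
```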
Which NumPy Matrix Functions are Multithreaded?
The numpy.linalg documentation does not specify which functions are multithreaded.
Nor do the BLAS and LAPACK specifications, or specific library implementations such as OpenBLAS.
Nevertheless, most of the linear algebra functions are multithreaded, including:
- Matrix and vector products: such as numpy.dot() and numpy.linalg.matrix_power()
- Norms and other numbers: such as numpy.linalg.det() and numpy.linalg.matrix_rank()
- Inverting matrices: such as numpy.linalg.inv() and numpy.linalg.pinv()
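As a quick illustration with a small, hand-picked matrix (values chosen here for easy checking), each of the functions listed above can be called directly; the same BLAS/LAPACK routines are used regardless of matrix size, although multithreading only pays off for large matrices:

```python
import numpy as np
from numpy.linalg import matrix_power, det, matrix_rank, inv, pinv

# a small diagonal matrix with an easy-to-check determinant and inverse
a = np.array([[2.0, 0.0], [0.0, 3.0]])

print(np.dot(a, a))        # matrix product
print(matrix_power(a, 3))  # repeated matrix product
print(det(a))              # determinant: 2*3 = 6.0
print(matrix_rank(a))      # rank: 2
print(inv(a))              # inverse: [[0.5, 0], [0, 1/3]]
print(pinv(a))             # pseudo-inverse (equals inv(a) here)
```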
You can learn more about the BLAS-supported linear algebra functions in the numpy.linalg module in the tutorial:
Now that we know that common linear algebra matrix functions are multithreaded, let’s look at some worked examples.
Parallel Matrix Multiplication (Dot Product)
In this section, we will explore a parallel implementation of matrix multiplication.
In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first and the number of columns of the second matrix.
— Matrix multiplication, Wikipedia.
We can implement matrix multiplication with NumPy via the numpy.dot() function or via the dot() method on NumPy array objects. We can also use the "@" operator.
Dot product of two arrays.
— numpy.dot API
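For example, the three forms produce the same result (small example matrices chosen here for illustration):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

c1 = np.dot(a, b)  # function form
c2 = a.dot(b)      # method form
c3 = a @ b         # operator form

# all three give the same matrix product
print(c1)  # [[19. 22.] [43. 50.]]
```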
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Multiplication
In this example, we will configure BLAS to use a single thread.
We will create two square matrices with the dimensions 8,000 x 8,000 then multiply them together.
The complete example is listed below.
```python
# example of single threaded matrix multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 16.934 seconds on my system.
It may take more or fewer seconds, depending on the hardware in your system and the versions of Python, NumPy, SciPy, and the BLAS library installed.
```
Took 16.934 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Multiplication
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
# record the start time
start = time()
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 5.706 seconds.
That is about 11.228 seconds faster or a speed-up of about 2.97x.
```
Took 5.706 seconds
```
Parallel Matrix Power (Exponent)
In this section, we will explore a parallel implementation of the matrix power.
That is, raising a matrix to a given exponent.
We can implement the matrix power with NumPy via the numpy.linalg.matrix_power() function.
Raise a square matrix to the (integer) power n.
— numpy.linalg.matrix_power API.
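Matrix power is just repeated matrix multiplication. A small sketch (matrix values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import matrix_power

a = np.array([[1.0, 1.0], [0.0, 1.0]])

# raising to the power 3 is the same as a.dot(a).dot(a)
p = matrix_power(a, 3)
print(p)  # [[1. 3.] [0. 1.]]
```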
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Power
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 4,000 x 4,000 then raise the matrix to the power 10.
The complete example is listed below.
```python
# example of single threaded matrix power
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import matrix_power
# record the start time
start = time()
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# matrix power
result = matrix_power(data, 10)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 8.159 seconds on my system.
```
Took 8.159 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Power
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix power
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import matrix_power
# record the start time
start = time()
# size of arrays
n = 4000
# create an array of random values
data = rand(n, n)
# matrix power
result = matrix_power(data, 10)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 2.624 seconds.
That is about 5.535 seconds faster or a speed-up of about 3.11x.
```
Took 2.624 seconds
```
Parallel Matrix Determinant (Det)
In this section, we will explore a parallel implementation of the matrix determinant.
In mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. It characterizes some properties of the matrix and the linear map represented by the matrix.
— Determinant, Wikipedia.
We can compute the matrix determinant in Python via the numpy.linalg.det() function.
Compute the determinant of an array.
— numpy.linalg.det API
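For a 2x2 matrix [[a, b], [c, d]] the determinant is a*d - b*c, which makes a small example easy to verify by hand (values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import det

a = np.array([[3.0, 8.0], [4.0, 6.0]])

# determinant of a 2x2 matrix: 3*6 - 8*4 = -14
print(det(a))  # -14.0 (within floating point tolerance)
```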
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Determinant
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 10,000 x 10,000 and then calculate the determinant.
The complete example is listed below.
```python
# example of single threaded matrix determinant
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones
from numpy.linalg import det
# record the start time
start = time()
# size of arrays
n = 10000
# create a data array
data = ones((n, n))
# matrix determinant
result = det(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 11.892 seconds to complete on my system.
```
Took 11.892 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Determinant
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix determinant
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy import ones
from numpy.linalg import det
# record the start time
start = time()
# size of arrays
n = 10000
# create a data array
data = ones((n, n))
# matrix determinant
result = det(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 4.526 seconds.
That is about 7.366 seconds faster or a speed-up of about 2.63x.
```
Took 4.526 seconds
```
Parallel Matrix Inverse
In this section, we will explore a parallel implementation of the matrix inverse.
In linear algebra, an n-by-n square matrix A is called invertible (also nonsingular or nondegenerate), if there exists an n-by-n square matrix B such that AB = BA = In, where In denotes the n-by-n identity matrix and the multiplication used is ordinary matrix multiplication.
— Invertible matrix, Wikipedia.
We can compute the matrix inverse in Python via the numpy.linalg.inv() function.
Compute the (multiplicative) inverse of a matrix.
— numpy.linalg.inv API
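Following the definition above, multiplying a matrix by its inverse gives the identity matrix, which we can check on a small example (values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import inv

a = np.array([[4.0, 7.0], [2.0, 6.0]])
b = inv(a)

# a matrix times its inverse is the identity matrix (up to rounding)
print(np.allclose(a @ b, np.eye(2)))  # True
```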
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Inverse
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 5,000 x 5,000 then calculate the matrix inverse.
The complete example is listed below.
```python
# example of single threaded matrix inverse
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import inv
# record the start time
start = time()
# size of arrays
n = 5000
# create an array of random values
data = rand(n, n)
# matrix inverse
result = inv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 6.286 seconds to complete on my system.
```
Took 6.286 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Inverse
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix inverse
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import inv
# record the start time
start = time()
# size of arrays
n = 5000
# create an array of random values
data = rand(n, n)
# matrix inverse
result = inv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 2.435 seconds.
That is about 3.851 seconds faster or a speed-up of about 2.58x.
```
Took 2.435 seconds
```
Parallel Matrix Pseudo Inverse (Moore-Penrose Method)
In this section, we will explore a parallel implementation of the matrix pseudo-inverse.
In mathematics, and in particular linear algebra, the Moore–Penrose inverse A+ of a matrix A is the most widely known generalization of the inverse matrix.
— Moore–Penrose inverse, Wikipedia.
We can compute the matrix pseudo-inverse in Python via the numpy.linalg.pinv() function.
Compute the (Moore-Penrose) pseudo-inverse of a matrix. Calculate the generalized inverse of a matrix using its singular-value decomposition (SVD) and including all large singular values.
— numpy.linalg.pinv API
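Unlike inv(), pinv() also works for non-square matrices. A small sketch of the defining Moore-Penrose property A @ A+ @ A == A (matrix values chosen here for illustration):

```python
import numpy as np
from numpy.linalg import pinv

# a non-square matrix, which has no ordinary inverse
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = pinv(a)  # shape (2, 3)

# the defining property of the pseudo-inverse
print(np.allclose(a @ b @ a, a))  # True
```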
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Matrix Pseudo Inverse
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the matrix pseudo inverse.
The complete example is listed below.
```python
# example of single threaded matrix pseudo inverse
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand
from numpy.linalg import pinv
# record the start time
start = time()
# size of array
n = 3000
# create an array of random values
data = rand(n, n)
# matrix pseudo inverse
result = pinv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 12.950 seconds to complete on my system.
```
Took 12.950 seconds
```
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Matrix Pseudo Inverse
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores; therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system.
The complete example is listed below.
```python
# example of multithreaded matrix pseudo inverse
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy.random import rand
from numpy.linalg import pinv
# record the start time
start = time()
# size of array
n = 3000
# create an array of random values
data = rand(n, n)
# matrix pseudo inverse
result = pinv(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example took about 9.490 seconds.
That is about 3.460 seconds faster or a speed-up of about 1.36x.
```
Took 9.490 seconds
```
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to calculate multithreaded matrix linear algebra functions with NumPy.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Markus Winkler on Unsplash