Last Updated on September 29, 2023
You can calculate matrix decompositions in parallel with NumPy.
NumPy uses the BLAS library to calculate matrix decompositions, and implementations of the BLAS library installed with numpy, such as OpenBLAS and ATLAS will implement multithreaded versions of matrix decomposition algorithms. This means matrix decomposition functions in numpy are parallel by default.
In this tutorial, you will discover how to calculate parallel matrix decomposition algorithms with NumPy in Python.
Let’s get started.
What Are Matrix Decompositions
A matrix decomposition is a mathematical operation that splits a matrix into its constituent parts.
The products of the matrix may be required by other linear algebra operations, or make other mathematical operations tractable or more efficient to calculate.
In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of problems.
— Matrix decomposition, Wikipedia.
There are many types of matrix decompositions.
The differences may be due to differences in the approach, such as different algorithms, or differences in the result, such as different matrix products or operations appropriate for different matrix types.
In numerical analysis, different decompositions are used to implement efficient matrix algorithms.
— Matrix decomposition, Wikipedia.
Run loops using all CPUs, download your FREE book to learn how.
Numpy Matrix Decompositions are Multithreaded
The numpy and scipy libraries provide an implementation of many commonly used matrix decompositions.
For example, the following matrix decompositions are available:
- LU Decomposition via scipy.linalg.lu()
- Cholesky Decomposition via numpy.linalg.cholesky()
- QR Decomposition via numpy.linalg.qr()
- Singular Value Decomposition (SVD) via numpy.linalg.svd()
- Eigendecomposition via numpy.linalg.eig()
You can learn more about the linear algebra operations supported by numpy and scipy here:
These matrix decomposition algorithms are implemented efficiently in third-party libraries, such as BLAS, which stands for “Basic Linear Algebra Subprograms”.
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.
— BLAS (Basic Linear Algebra Subprograms)
The BLAS implementation is installed on your system. Common examples include OpenBLAS, ATLAS, and MKL.
You can learn more about BLAS and LAPACK in numpy in the tutorial:
These libraries typically provide multithreaded implementation matrix decompositions, allowing the algorithms to take full advantage of all CPU cores in your system.
As such, executing matrix decompositions in Python via Numpy or Scipy will be multithreaded by default.
By default, the BLAS library installed on your system will use all logical CPU cores. For example, if you have 4 physical CPU cores with hyperthreading, then your operating system will see and use 8 cores. BLAS will in turn create 8 threads to use when executing parallel algorithms like matrix decompositions.
You may get better performance in some cases by configuring the number of threads used by BLAS to equal the number of physical CPU cores directly. This can be achieved by setting an environment variable prior to running your Python script or prior to importing numpy in your program.
For example, if you have OpenBLAS installed (which is the most common) you can set the OMP_NUM_THREADS environment variable in your program prior to importing numpy.
For example:
1 2 3 |
... from os import environ environ['OMP_NUM_THREADS'] = '4' |
This environment variable can also be set to 1 to turn off multithreading for these operations.
For example:
1 2 3 |
... from os import environ environ['OMP_NUM_THREADS'] = '1' |
This is helpful, allowing us to compare the performance of a single-threaded vs a multithreaded version of a matrix decomposition and confirm that indeed it is executing a parallel version of the algorithm on our system.
You can learn more about configuring the number of BLAS threads used in numpy in the tutorial:
Now that we know how to execute matrix decomposition in parallel using scipy and numpy, let’s look at some worked examples.
Parallel LU Decomposition
In this section, we will explore a parallel implementation of the LU decomposition.
In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix as the product of a lower triangular matrix and an upper triangular matrix (see matrix decomposition).
— LU decomposition, Wikipedia.
We can implement the LU decomposition in Python via the scipy.linalg.lu() function.
Compute pivoted LU decomposition of a matrix.
— scipy.linalg.lu API
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded LU Decomposition
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 8,000 x 8,000 then calculate the LU decomposition.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# single threaded lu matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from scipy.linalg import lu from time import time # record start time start = time() # size of matrix to create n = 8000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition p, l, u = lu(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 7.633 seconds on my system.
It may take more or fewer seconds, depending on the hardware in your system and the versions of Python, Numpy, Scipy, and the BLAS library installed.
1 |
Took 7.633 seconds |
Next, we can run the same example with a multithreaded algorithm.
Multithreaded LU Decomposition
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# multithreaded lu matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from scipy.linalg import lu from time import time # record start time start = time() # size of matrix to create n = 8000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition p, l, u = lu(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 3.755 seconds.
That is about 3.878 seconds faster or a speed-up of about 2.03x.
1 |
Took 3.755 seconds |
Free Concurrent NumPy Course
Get FREE access to my 7-day email course on concurrent NumPy.
Discover how to configure the number of BLAS threads, how to execute NumPy tasks faster with thread pools, and how to share arrays super fast.
Parallel Cholesky Decomposition
In this section, we will explore a parallel implementation of the Cholesky decomposition.
In linear algebra, the Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for efficient numerical solutions, e.g., Monte Carlo simulations.
— Cholesky decomposition, Wikipedia.
We can implement the Cholesky decomposition in Python via the numpy.linalg.cholesky() function.
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Cholesky Decomposition
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 8,000 x 8,000, multiply the matrix by a transpose of itself to ensure it is positive definite, then calculate the Cholesky decomposition.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# single threaded cholesky matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from numpy.linalg import cholesky from time import time # record start time start = time() # size of matrix to create n = 8000 # create a square matrix of random floats matrix = rand(n, n) # ensure it is positive definite matrix = matrix.dot(matrix.T) # perform decomposition l = cholesky(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 12.954 seconds on my system.
1 |
Took 12.954 seconds |
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Cholesky Decomposition
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# multithreaded cholesky matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from numpy.linalg import cholesky from time import time # record start time start = time() # size of matrix to create n = 8000 # create a square matrix of random floats matrix = rand(n, n) # ensure it is positive definite matrix = matrix.dot(matrix.T) # perform decomposition l = cholesky(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 3.755 seconds.
That is about 7.303 seconds faster or a speed-up of about 2.29x.
1 |
Took 5.651 seconds |
Overwhelmed by the python concurrency APIs?
Find relief, download my FREE Python Concurrency Mind Maps
Parallel QR Decomposition
In this section, we will explore a parallel implementation of the QR decomposition.
In linear algebra, a QR decomposition, also known as a QR factorization or QU factorization, is a decomposition of a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R. QR decomposition is often used to solve the linear least squares problem and is the basis for a particular eigenvalue algorithm, the QR algorithm.
— QR decomposition, Wikipedia.
We can implement the QR decomposition in Python via the numpy.linalg.qr() function.
Compute the qr factorization of a matrix. Factor the matrix a as qr, where q is orthonormal and r is upper-triangular.
— numpy.linalg.qr API.
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded QR Decomposition
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 8,000 x 8,000 then calculate the QR decomposition.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# single threaded qr matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from numpy.linalg import qr from time import time # record start time start = time() # size of matrix to create n = 8000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition q, r = qr(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 38.042 seconds to complete on my system.
1 |
Took 38.042 seconds |
Next, we can run the same example with a multithreaded algorithm.
Multithreaded QR Decomposition
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# multithreaded qr matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from numpy.linalg import qr from time import time # record start time start = time() # size of matrix to create n = 8000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition q, r = qr(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 23.272 seconds.
That is about 14.770 seconds faster or a speed-up of about 1.63x.
1 |
Took 23.272 seconds |
Parallel Singular Value Decomposition (SVD)
In this section, we will explore a parallel implementation of the SVD.
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any m x n matrix.
— Singular value decomposition, Wikipedia.
We can implement the SVD decomposition in Python via the numpy.linalg.svd() function.
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Singular Value Decomposition
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the SVD.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# single threaded qr matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from numpy.linalg import svd from time import time # record start time start = time() # size of matrix to create n = 3000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition u, s, vh = svd(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 11.794 seconds to complete on my system.
1 |
Took 11.794 seconds |
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Singular Value Decomposition (SVD)
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# multithreaded qr matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from numpy.linalg import svd from time import time # record start time start = time() # size of matrix to create n = 3000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition u, s, vh = svd(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 9.373 seconds.
That is about 2.421 seconds faster or a speed-up of about 1.26x.
1 |
Took 9.373 seconds |
Parallel Eigendecomposition
In this section, we will explore a parallel implementation of the Eigendecomposition.
In linear algebra, eigendecomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Only diagonalizable matrices can be factorized in this way. When the matrix being factorized is a normal or real symmetric matrix, the decomposition is called “spectral decomposition”, derived from the spectral theorem.
— Eigendecomposition of a matrix, Wikipedia.
We can implement the Eigendecomposition in Python via the numpy.linalg.eig() function.
Compute the eigenvalues and right eigenvectors of a square array.
— numpy.linalg.eig API
Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.
Single Threaded Eigendecomposition
In this example, we will configure BLAS to use a single thread.
We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the SVD.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# single threaded qr matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from numpy.linalg import eig from time import time # record start time start = time() # size of matrix to create n = 3000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition w, v = eig(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 13.911 seconds to complete on my system.
1 |
Took 13.911 seconds |
Next, we can run the same example with a multithreaded algorithm.
Multithreaded Eigendecomposition
We can update the above example to use one thread per physical CPU core in the system.
I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.
The complete example is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# multithreaded qr matrix decomposition from os import environ environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from numpy.linalg import eig from time import time # record start time start = time() # size of matrix to create n = 3000 # create a square matrix of random floats matrix = rand(n, n) # perform decomposition w, v = eig(matrix) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |
Running the example took about 11.799 seconds.
That is about 2.112 seconds faster or a speed-up of about 1.18x.
1 |
Took 11.799 seconds |
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPi Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to calculate parallel matrix decomposition algorithms with Numpy in Python.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Do you have any questions?