Numpy Parallel Matrix Decompositions

Last Updated on September 29, 2023

You can calculate matrix decompositions in parallel with NumPy.

NumPy uses the BLAS library to calculate matrix decompositions, and implementations of the BLAS library installed with numpy, such as OpenBLAS and ATLAS will implement multithreaded versions of matrix decomposition algorithms. This means matrix decomposition functions in numpy are parallel by default.

In this tutorial, you will discover how to calculate parallel matrix decomposition algorithms with NumPy in Python.

Let’s get started.

What Are Matrix Decompositions

A matrix decomposition is a mathematical operation that splits a matrix into its constituent parts.

The products of the matrix may be required by other linear algebra operations, or make other mathematical operations tractable or more efficient to calculate.

In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of problems.
— Matrix decomposition, Wikipedia.

There are many types of matrix decompositions.

The differences may be due to differences in the approach, such as different algorithms, or differences in the result, such as different matrix products or operations appropriate for different matrix types.

In numerical analysis, different decompositions are used to implement efficient matrix algorithms.
— Matrix decomposition, Wikipedia.

Run loops using all CPUs, download your FREE book to learn how.

Numpy Matrix Decompositions are Multithreaded

The numpy and scipy libraries provide an implementation of many commonly used matrix decompositions.

For example, the following matrix decompositions are available:

LU Decomposition via scipy.linalg.lu()
Cholesky Decomposition via numpy.linalg.cholesky()
QR Decomposition via numpy.linalg.qr()
Singular Value Decomposition (SVD) via numpy.linalg.svd()
Eigendecomposition via numpy.linalg.eig()

You can learn more about the linear algebra operations supported by numpy and scipy here:

These matrix decomposition algorithms are implemented efficiently in third-party libraries, such as BLAS, which stands for “Basic Linear Algebra Subprograms”.

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.
— BLAS (Basic Linear Algebra Subprograms)

The BLAS implementation is installed on your system. Common examples include OpenBLAS, ATLAS, and MKL.

You can learn more about BLAS and LAPACK in numpy in the tutorial:

What is BLAS and LAPACK in NumPy

These libraries typically provide multithreaded implementation matrix decompositions, allowing the algorithms to take full advantage of all CPU cores in your system.

As such, executing matrix decompositions in Python via Numpy or Scipy will be multithreaded by default.

By default, the BLAS library installed on your system will use all logical CPU cores. For example, if you have 4 physical CPU cores with hyperthreading, then your operating system will see and use 8 cores. BLAS will in turn create 8 threads to use when executing parallel algorithms like matrix decompositions.

You may get better performance in some cases by configuring the number of threads used by BLAS to equal the number of physical CPU cores directly. This can be achieved by setting an environment variable prior to running your Python script or prior to importing numpy in your program.

For example, if you have OpenBLAS installed (which is the most common) you can set the OMP_NUM_THREADS environment variable in your program prior to importing numpy.

For example:

...

from os import environ

environ['OMP_NUM_THREADS'] = '4'

This environment variable can also be set to 1 to turn off multithreading for these operations.

For example:

...

from os import environ

environ['OMP_NUM_THREADS'] = '1'

This is helpful, allowing us to compare the performance of a single-threaded vs a multithreaded version of a matrix decomposition and confirm that indeed it is executing a parallel version of the algorithm on our system.

You can learn more about configuring the number of BLAS threads used in numpy in the tutorial:

Configure the Number of BLAS/LAPACK Threads for NumPy

Now that we know how to execute matrix decomposition in parallel using scipy and numpy, let’s look at some worked examples.

Start Now: Free Concurrent NumPy Crash Course

Parallel LU Decomposition

In this section, we will explore a parallel implementation of the LU decomposition.

In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix as the product of a lower triangular matrix and an upper triangular matrix (see matrix decomposition).
— LU decomposition, Wikipedia.

We can implement the LU decomposition in Python via the scipy.linalg.lu() function.

Compute pivoted LU decomposition of a matrix.
— scipy.linalg.lu API

Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded LU Decomposition

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 8,000 x 8,000 then calculate the LU decomposition.

The complete example is listed below.

# single threaded lu matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '1'

from numpy.random import rand

from scipy.linalg import lu

from time import time

# record start time

start = time()

# size of matrix to create

n = 8000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

p, l, u = lu(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 7.633 seconds on my system.

It may take more or fewer seconds, depending on the hardware in your system and the versions of Python, Numpy, Scipy, and the BLAS library installed.

1	Took 7.633 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded LU Decomposition

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# multithreaded lu matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '4'

from numpy.random import rand

from scipy.linalg import lu

from time import time

# record start time

start = time()

# size of matrix to create

n = 8000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

p, l, u = lu(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 3.755 seconds.

That is about 3.878 seconds faster or a speed-up of about 2.03x.

1	Took 3.755 seconds

Free Concurrent NumPy Course

Get FREE access to my 7-day email course on concurrent NumPy.

Discover how to configure the number of BLAS threads, how to execute NumPy tasks faster with thread pools, and how to share arrays super fast.

Learn more

Parallel Cholesky Decomposition

In this section, we will explore a parallel implementation of the Cholesky decomposition.

In linear algebra, the Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for efficient numerical solutions, e.g., Monte Carlo simulations.
— Cholesky decomposition, Wikipedia.

We can implement the Cholesky decomposition in Python via the numpy.linalg.cholesky() function.

Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Cholesky Decomposition

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 8,000 x 8,000, multiply the matrix by a transpose of itself to ensure it is positive definite, then calculate the Cholesky decomposition.

The complete example is listed below.

# single threaded cholesky matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '1'

from numpy.random import rand

from numpy.linalg import cholesky

from time import time

# record start time

start = time()

# size of matrix to create

n = 8000

# create a square matrix of random floats

matrix = rand(n, n)

# ensure it is positive definite

matrix = matrix.dot(matrix.T)

# perform decomposition

l = cholesky(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 12.954 seconds on my system.

1	Took 12.954 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Cholesky Decomposition

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# multithreaded cholesky matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '4'

from numpy.random import rand

from numpy.linalg import cholesky

from time import time

# record start time

start = time()

# size of matrix to create

n = 8000

# create a square matrix of random floats

matrix = rand(n, n)

# ensure it is positive definite

matrix = matrix.dot(matrix.T)

# perform decomposition

l = cholesky(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 3.755 seconds.

That is about 7.303 seconds faster or a speed-up of about 2.29x.

1	Took 5.651 seconds

Overwhelmed by the python concurrency APIs?
Find relief, download my FREE Python Concurrency Mind Maps

Parallel QR Decomposition

In this section, we will explore a parallel implementation of the QR decomposition.

In linear algebra, a QR decomposition, also known as a QR factorization or QU factorization, is a decomposition of a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R. QR decomposition is often used to solve the linear least squares problem and is the basis for a particular eigenvalue algorithm, the QR algorithm.
— QR decomposition, Wikipedia.

We can implement the QR decomposition in Python via the numpy.linalg.qr() function.

Compute the qr factorization of a matrix. Factor the matrix a as qr, where q is orthonormal and r is upper-triangular.
— numpy.linalg.qr API.

Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded QR Decomposition

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 8,000 x 8,000 then calculate the QR decomposition.

The complete example is listed below.

# single threaded qr matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '1'

from numpy.random import rand

from numpy.linalg import qr

from time import time

# record start time

start = time()

# size of matrix to create

n = 8000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

q, r = qr(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 38.042 seconds to complete on my system.

1	Took 38.042 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded QR Decomposition

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# multithreaded qr matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '4'

from numpy.random import rand

from numpy.linalg import qr

from time import time

# record start time

start = time()

# size of matrix to create

n = 8000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

q, r = qr(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 23.272 seconds.

That is about 14.770 seconds faster or a speed-up of about 1.63x.

1	Took 23.272 seconds

Loving The Tutorials?

Why not take the next step? Get the book.

Learn more

Parallel Singular Value Decomposition (SVD)

In this section, we will explore a parallel implementation of the SVD.

In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any m x n matrix.
— Singular value decomposition, Wikipedia.

We can implement the SVD decomposition in Python via the numpy.linalg.svd() function.

Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Singular Value Decomposition

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the SVD.

The complete example is listed below.

# single threaded qr matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '1'

from numpy.random import rand

from numpy.linalg import svd

from time import time

# record start time

start = time()

# size of matrix to create

n = 3000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

u, s, vh = svd(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 11.794 seconds to complete on my system.

1	Took 11.794 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Singular Value Decomposition (SVD)

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# multithreaded qr matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '4'

from numpy.random import rand

from numpy.linalg import svd

from time import time

# record start time

start = time()

# size of matrix to create

n = 3000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

u, s, vh = svd(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 9.373 seconds.

That is about 2.421 seconds faster or a speed-up of about 1.26x.

1	Took 9.373 seconds

Parallel Eigendecomposition

In this section, we will explore a parallel implementation of the Eigendecomposition.

In linear algebra, eigendecomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Only diagonalizable matrices can be factorized in this way. When the matrix being factorized is a normal or real symmetric matrix, the decomposition is called “spectral decomposition”, derived from the spectral theorem.
— Eigendecomposition of a matrix, Wikipedia.

We can implement the Eigendecomposition in Python via the numpy.linalg.eig() function.

Compute the eigenvalues and right eigenvectors of a square array.
— numpy.linalg.eig API

Let’s take a closer look at a single-threaded and multithreaded version of the algorithm.

Single Threaded Eigendecomposition

In this example, we will configure BLAS to use a single thread.

We will create a single square matrix with the dimensions 3,000 x 3,000 then calculate the SVD.

The complete example is listed below.

# single threaded qr matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '1'

from numpy.random import rand

from numpy.linalg import eig

from time import time

# record start time

start = time()

# size of matrix to create

n = 3000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

w, v = eig(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 13.911 seconds to complete on my system.

1	Took 13.911 seconds

Next, we can run the same example with a multithreaded algorithm.

Multithreaded Eigendecomposition

We can update the above example to use one thread per physical CPU core in the system.

I have 4 cores, therefore, OpenBLAS is configured to use 4 threads. Set the number of threads to match the number of cores in your system accordingly.

The complete example is listed below.

# multithreaded qr matrix decomposition

from os import environ

environ['OMP_NUM_THREADS'] = '4'

from numpy.random import rand

from numpy.linalg import eig

from time import time

# record start time

start = time()

# size of matrix to create

n = 3000

# create a square matrix of random floats

matrix = rand(n, n)

# perform decomposition

w, v = eig(matrix)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example took about 11.799 seconds.

That is about 2.112 seconds faster or a speed-up of about 1.18x.

1	Took 11.799 seconds

Takeaways

You now know how to calculate parallel matrix decomposition algorithms with Numpy in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Photo by 戸山神奈 on Unsplash

Numpy Parallel Matrix Decompositions

What Are Matrix Decompositions

Numpy Matrix Decompositions are Multithreaded

Parallel LU Decomposition

Single Threaded LU Decomposition

Multithreaded LU Decomposition

Parallel Cholesky Decomposition

Single Threaded Cholesky Decomposition

Multithreaded Cholesky Decomposition

Parallel QR Decomposition

Single Threaded QR Decomposition

Multithreaded QR Decomposition

Parallel Singular Value Decomposition (SVD)

Single Threaded Singular Value Decomposition

Multithreaded Singular Value Decomposition (SVD)

Parallel Eigendecomposition

Single Threaded Eigendecomposition

Multithreaded Eigendecomposition

Further Reading

Takeaways

Related Tutorials:

Parallel Loops in Python

Loving the Tutorials?

Get The Book:

Don't Dabble!

Learn All Of Python Concurrency

No more idle CPUs

Learn Concurrent NumPy Fast
(without the frustration)

Additional menu

What Are Matrix Decompositions

Numpy Matrix Decompositions are Multithreaded

Parallel LU Decomposition

Single Threaded LU Decomposition

Multithreaded LU Decomposition

Parallel Cholesky Decomposition

Single Threaded Cholesky Decomposition

Multithreaded Cholesky Decomposition

Parallel QR Decomposition

Single Threaded QR Decomposition

Multithreaded QR Decomposition

Parallel Singular Value Decomposition (SVD)

Single Threaded Singular Value Decomposition

Multithreaded Singular Value Decomposition (SVD)

Parallel Eigendecomposition

Single Threaded Eigendecomposition

Multithreaded Eigendecomposition

Further Reading

Takeaways

Share this:

Related Tutorials:

About Jason Brownlee

Parallel Loops in Python

Reader Interactions

Leave a Reply Cancel reply

Footer

Learn Concurrent NumPy Fast (without the frustration)

Learn Concurrent NumPy Fast
(without the frustration)