Parallel Numpy Matrix Math Functions

Last Updated on September 29, 2023

You can calculate mathematical functions on matrices in numpy in parallel using Python threads.

This can be achieved because most numpy math functions release the global interpreter lock, allowing Python threads that call numpy math functions to run in parallel.

In this tutorial, you will discover how to calculate mathematical numpy functions on matrices in parallel using threads.

Let’s get started.

Table of Contents

Matrix Math Functions Are Single-Threaded

Numpy is an array library for Python.

It makes use of BLAS and LAPACK to implement many linear algebra functions for vectors and matrices efficiently. Some of these operations, such as matrix multiplication are multithreaded.

Nevertheless, many element-wise matrix mathematical functions in numpy are not multithreaded.

Element-wise functions are those operations applied to each item or cell in a matrix to produce an output matrix of the same size.

elementwise: Obtained by operating on one element (of a matrix etc) at a time.
— elementwise, Wiktionary.

Examples of element-wise math functions include numpy functions such as:

Exponential function with numpy.exp()
Natural logarithm with numpy.log()
Square root with numpy.sqrt()
Square function numpy.square()

None of these operations are multithreaded by default.

How can we perform math functions on matrices in parallel?

Run loops using all CPUs, download your FREE book to learn how.

How to Execute Matrix Mathematical Functions in Parallel

Element-wise matrix math functions can be made multithreaded.

Numpy will release the Global Interpreter Lock (GIL) when performing most math functions, including matrix math functions.

This means that we can make element-wise operations parallel using Python threads. This is great as if we had to resort to process-based concurrency, then most gains achieved from parallelism would be lost in inter-processes communication for the matrices involved. This is not a problem with threads as they can share memory within the same process.

You can learn more about numpy releasing the GIL in the tutorial:

NumPy vs the Global Interpreter Lock (GIL)

The Python standard library provides the multiprocessing.pool.ThreadPool class that can be used to create and manage a pool of worker threads.

A ThreadPool instance can be created and configured with one worker per physical CPU in the system.

For example:

...

# create the thread pool

with ThreadPool(4) as pool:

# ...

You can learn more about configuring the number of workers in the ThreadPool in the tutorial:

ThreadPool Configure The Number of Worker Threads

We can then define a function that implements the math function we wish to perform that takes matrices as arguments. The task function can then operate on a portion of the larger matrix and perform the slow element-wise operation, such as calculating the log of a portion of a matrix and placing the result in a second matrix.

For example, we can use a function like numpy.log() for calculating the natural logarithm that takes the input matrix as an argument and a target matrix for the result via the “out” argument. We can specify the coordinates of the sub-matrix in each as an argument to our task function.

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = log(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

We can then divide our large input matrix into sections and issue each as a task to the ThreadPool to have the element-wise operations performed on each piece in parallel.

This can be challenging to make generic as it depends on the number of threads and the dimensions of the matrix.

If we have 4 physical CPU cores, then ideally we want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.

For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, using the indices into the larger matrices directly.

For example:

...

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

Tasks can then be issued via the async_apply() method on the ThreadPool that returns an AsyncResult object, which we do not need in this case.

For example:

...

# issue task

_ = pool.apply_async(task, args=(coords, data1, data2, result))

You can learn more about issuing tasks to the ThreadPool via the apply_async() method in the tutorial:

How to Use ThreadPool apply_async() in Python

Now that we know how to execute matrix mathematical functions in parallel with threads, let’s look at some worked examples.

Start Now: Free Concurrent NumPy Crash Course

Multithreaded Matrix Exponential

The exponential function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix numpy.exp()

We can create a matrix and then calculate the exponential of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The numpy.ones() function can create the input matrix, initialized with the value 1.0.

...

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

We can then calculate the exponential of the matrix using the numpy.exp() function.

...

# matrix exp

result = exp(data)

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix exp

from time import time

from numpy import ones

from numpy import exp

# record start time

start = time()

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

# matrix exp

result = exp(data)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example creates a matrix then calculates the exponential of the matrix.

The program takes about 24.026 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

The precision of the estimated run time can be improved by running the same program many times and averaging the result. This is left as an exercise as we are interested in directional improvements in performance more than specific benchmark speeds.

1	Took 24.026 seconds

Next, let’s look at a multithreaded version of the program.

Multithreaded Matrix numpy.exp()

We can update the calculation of the exponential of the matrix to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.exp() function on the sub-array defined by the provided coordinates and ignores the return value.

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = exp(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

...

# create square matrices

data = ones((n, n))

result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

...

# create the thread pool

with ThreadPool(4) as pool:

# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution.

...

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

And that’s it.

Tying this together, the complete example of a multithreaded matrix exponential is listed below.

# multithreaded matrix exp

from time import time

from multiprocessing.pool import ThreadPool

from numpy import ones

from numpy import exp

from numpy import empty

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = exp(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time

start = time()

# size of square arrays

n = 50000

# create square matrices

data1 = ones((n, n))

data2 = ones((n, n))

result = empty((n, n))

# create the thread pool

with ThreadPool(4) as pool:

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data1, data2, result))

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example is completed in about 12.050 seconds.

On my system, this is about 11.976 seconds faster than the single-threaded version or a 1.99x speedup.

1	Took 12.050 seconds

Free Concurrent NumPy Course

Get FREE access to my 7-day email course on concurrent NumPy.

Discover how to configure the number of BLAS threads, how to execute NumPy tasks faster with thread pools, and how to share arrays super fast.

Learn more

Multithreaded Matrix Logarithm

The natural logarithm function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix numpy.log()

We can create a matrix and then calculate the natural logarithm of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The numpy.ones() function can create the input matrix, initialized with the value 1.0.

...

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

We can then calculate the natural logarithm of the matrix using the numpy.log() function.

...

# matrix log

result = log(data)

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix log

from time import time

from numpy import ones

from numpy import log

# record start time

start = time()

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

# matrix log

result = log(data)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example creates a matrix and then calculates the natural logarithm of the matrix.

The program takes about 23.094 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1	Took 23.094 seconds

Next, let’s look at a multithreaded version of the program.

Multithreaded Matrix numpy.log()

We can update the calculation of the natural logarithm of the matrix to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.log() function on the sub-array defined by the provided coordinates and ignores the return value.

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = log(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

...

# create square matrices

data = ones((n, n))

result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

...

# create the thread pool

with ThreadPool(4) as pool:

# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution.

...

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

And that’s it.

Tying this together, the complete example of a multithreaded matrix natural logarithm is listed below.

# multithreaded matrix log

from time import time

from multiprocessing.pool import ThreadPool

from numpy import ones

from numpy import log

from numpy import empty

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = log(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time

start = time()

# size of square arrays

n = 50000

# create square matrices

data = ones((n, n))

result = empty((n, n))

# create the thread pool

with ThreadPool(4) as pool:

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example is completed in about 11.060 seconds.

On my system, this is about 12.034 seconds faster than the single-threaded version or a 2.09x speedup.

1	Took 11.060 seconds

Overwhelmed by the python concurrency APIs?
Find relief, download my FREE Python Concurrency Mind Maps

Multithreaded Matrix Square Root

The square root function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix numpy.sqrt()

We can create a matrix and then calculate the square root of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform the element-wise operation.

The numpy.ones() function can create the input matrix, initialized with the value 1.0.

...

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

We can then calculate the square root of the matrix using the numpy.sqrt() function.

...

# matrix sqrt

result = sqrt(data)

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix sqrt

from time import time

from numpy import ones

from numpy import sqrt

# record start time

start = time()

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

# matrix sqrt

result = sqrt(data)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example creates a matrix and then calculates the square root of the matrix.

The program takes about 13.787 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1	Took 13.787 seconds

Next, let’s look at a multithreaded version of the program.

Multithreaded Matrix numpy.sqrt()

We can update the calculation of the square root of the matrix to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.sqrt() function on the sub-array defined by the provided coordinates and ignores the return value.

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = sqrt(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

...

# create square matrices

data = ones((n, n))

result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

...

# create the thread pool

with ThreadPool(4) as pool:

# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution.

...

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

And that’s it.

Tying this together, the complete example of a multithreaded matrix square root is listed below.

# multithreaded matrix sqrt

from time import time

from multiprocessing.pool import ThreadPool

from numpy import ones

from numpy import sqrt

from numpy import empty

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = sqrt(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time

start = time()

# size of square arrays

n = 50000

# create square matrices

data = ones((n, n))

result = empty((n, n))

# create the thread pool

with ThreadPool(4) as pool:

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example is completed in about 8.999 seconds.

On my system, this is about 4.788 seconds faster than the single-threaded version or a 1.53x speedup.

I suspect we see less of a speed-up of this function compared to other functions like exp() and log() is because the sqrt() function may be using a more efficient algorithm or an implementation that makes use of modern CPU instructions.

1	Took 8.999 seconds

Loving The Tutorials?

Why not take the next step? Get the book.

Learn more

Multithreaded Matrix Square

The square function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix numpy.square()

We can create a matrix and then calculate the square of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform the element-wise operation.

The numpy.ones() function can create the input matrix, initialized with the value 1.0.

...

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

We can then calculate the square of the matrix using the numpy.square() function.

...

# matrix square

result = square(data)

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix square

from time import time

from numpy import ones

from numpy import square

# record start time

start = time()

# size of arrays

n = 50000

# create square matrix

data = ones((n, n))

# matrix square

result = square(data)

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example creates a matrix and then calculates the square of the matrix.

The program takes about 13.279 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1	Took 13.279 seconds

Next, let’s look at a multithreaded version of the program.

Multithreaded Matrix numpy.square()

We can update the calculation of the square of the matrix to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.square() function on the sub-array defined by the provided coordinates and ignores the return value.

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = square(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

...

# create square matrices

data = ones((n, n))

result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

...

# create the thread pool

with ThreadPool(4) as pool:

# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution.

...

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

And that’s it.

Tying this together, the complete example of a multithreaded matrix square is listed below.

# multithreaded matrix square

from time import time

from multiprocessing.pool import ThreadPool

from numpy import ones

from numpy import square

from numpy import empty

# apply matrix math operation on a portion of a matrix

def task(coords, data, result):

# unpack array indexes

i1, i2, i3, i4 = coords

_ = square(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time

start = time()

# size of square arrays

n = 50000

# create square matrices

data = ones((n, n))

result = empty((n, n))

# create the thread pool

with ThreadPool(4) as pool:

# split each dimension (divisor of matrix dimension)

split = round(n/2)

# issue tasks

for x in range(0, n, split):

for y in range(0, n, split):

# determine matrix coordinates

coords = (x, x+split, y, y+split)

# issue task

_ = pool.apply_async(task, args=(coords, data, result))

# close the pool

pool.close()

# wait for tasks to complete

pool.join()

# calculate and report duration

duration = time() - start

print(f'Took {duration:.3f} seconds')

Running the example is completed in about 8.919 seconds.

On my system, this is about 4.360 seconds faster than the single-threaded version or a 1.49x speedup.

As with the square root, I suspect we see less of a speed-up of this function compared to other functions like exp() and log() because the square() function may be using a more efficient algorithm or an implementation that makes use of modern CPU instructions.

1	Took 8.919 seconds

Takeaways

You now know how to calculate mathematical numpy functions on matrices in parallel using threads.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Photo by Gary Doughty on Unsplash

Parallel Numpy Matrix Math Functions

Matrix Math Functions Are Single-Threaded

How to Execute Matrix Mathematical Functions in Parallel

Multithreaded Matrix Exponential

Single-Threaded Matrix numpy.exp()

Multithreaded Matrix numpy.exp()

Multithreaded Matrix Logarithm

Single-Threaded Matrix numpy.log()

Multithreaded Matrix numpy.log()

Multithreaded Matrix Square Root

Single-Threaded Matrix numpy.sqrt()

Multithreaded Matrix numpy.sqrt()

Multithreaded Matrix Square

Single-Threaded Matrix numpy.square()

Multithreaded Matrix numpy.square()

Further Reading

Takeaways

Related Tutorials:

Parallel Loops in Python

Loving the Tutorials?

Get The Book:

Don't Dabble!

Learn All Of Python Concurrency

No more idle CPUs

Learn Concurrent NumPy Fast
(without the frustration)

Additional menu

Matrix Math Functions Are Single-Threaded

How to Execute Matrix Mathematical Functions in Parallel

Multithreaded Matrix Exponential

Single-Threaded Matrix numpy.exp()

Multithreaded Matrix numpy.exp()

Multithreaded Matrix Logarithm

Single-Threaded Matrix numpy.log()

Multithreaded Matrix numpy.log()

Multithreaded Matrix Square Root

Single-Threaded Matrix numpy.sqrt()

Multithreaded Matrix numpy.sqrt()

Multithreaded Matrix Square

Single-Threaded Matrix numpy.square()

Multithreaded Matrix numpy.square()

Further Reading

Takeaways

Share this:

Related Tutorials:

About Jason Brownlee

Parallel Loops in Python

Reader Interactions

Leave a Reply Cancel reply

Footer

Learn Concurrent NumPy Fast (without the frustration)

Learn Concurrent NumPy Fast
(without the frustration)