You can **calculate mathematical functions on matrices in numpy in parallel** using Python threads.

This can be achieved because most numpy math functions release the global interpreter lock, allowing Python threads that call numpy math functions to run in parallel.

In this tutorial, you will discover how to calculate mathematical numpy functions on matrices in parallel using threads.

Let’s get started.

Table of Contents

## Matrix Math Functions Are Single-Threaded

Numpy is an array library for Python.

It makes use of BLAS and LAPACK to implement many linear algebra functions for vectors and matrices efficiently. Some of these operations, such as matrix multiplication are multithreaded.

Nevertheless, many element-wise matrix mathematical functions in numpy are not multithreaded.

Element-wise functions are those operations applied to each item or cell in a matrix to produce an output matrix of the same size.

elementwise: Obtained by operating on one element (of a matrix etc) at a time.

— elementwise, Wiktionary.

Examples of element-wise math functions include numpy functions such as:

- Exponential function with
**numpy.exp()** - Natural logarithm with
**numpy.log()** - Square root with
**numpy.sqrt()** - Square function
**numpy.square()**

None of these operations are multithreaded by default.

**How can we perform math functions on matrices in parallel?**

## How to Execute Matrix Mathematical Functions in Parallel

Element-wise matrix math functions can be made multithreaded.

Numpy will release the Global Interpreter Lock (GIL) when performing most math functions, including matrix math functions.

This means that we can make element-wise operations parallel using Python threads. This is great as if we had to resort to process-based concurrency, then most gains achieved from parallelism would be lost in inter-processes communication for the matrices involved. This is not a problem with threads as they can share memory within the same process.

You can learn more about numpy releasing the GIL in the tutorial:

The Python standard library provides the **multiprocessing.pool.ThreadPool** class that can be used to create and manage a pool of worker threads.

A **ThreadPool** instance can be created and configured with one worker per physical CPU in the system.

For example:

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

You can learn more about configuring the number of workers in the **ThreadPool** in the tutorial:

We can then define a function that implements the math function we wish to perform that takes matrices as arguments. The task function can then operate on a portion of the larger matrix and perform the slow element-wise operation, such as calculating the log of a portion of a matrix and placing the result in a second matrix.

For example, we can use a function like **numpy.log()** for calculating the natural logarithm that takes the input matrix as an argument and a target matrix for the result via the “out” argument. We can specify the coordinates of the sub-matrix in each as an argument to our task function.

1 2 3 4 5 |
# apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = log(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can then divide our large input matrix into sections and issue each as a task to the **ThreadPool** to have the element-wise operations performed on each piece in parallel.

This can be challenging to make generic as it depends on the number of threads and the dimensions of the matrix.

If we have 4 physical CPU cores, then ideally we want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.

For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, using the indices into the larger matrices directly.

For example:

1 2 3 4 5 6 7 8 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) |

Tasks can then be issued via the **async_apply()** method on the **ThreadPool** that returns an **AsyncResult** object, which we do not need in this case.

For example:

1 2 3 |
... # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) |

You can learn more about issuing tasks to the **ThreadPool** via the **apply_async()** method in the tutorial:

Now that we know how to execute matrix mathematical functions in parallel with threads, let’s look at some worked examples.

## Multithreaded Matrix Exponential

The exponential function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

### Single-Threaded Matrix numpy.exp()

We can create a matrix and then calculate the exponential of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The **numpy.ones()** function can create the input matrix, initialized with the value 1.0.

1 2 3 4 5 |
... # size of arrays n = 50000 # create square matrix data = ones((n, n)) |

We can then calculate the exponential of the matrix using the **numpy.exp()** function.

1 2 3 |
... # matrix exp result = exp(data) |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix exp from time import time from numpy import ones from numpy import exp # record start time start = time() # size of arrays n = 50000 # create square matrix data = ones((n, n)) # matrix exp result = exp(data) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates a matrix then calculates the exponential of the matrix.

The program takes about 24.026 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

The precision of the estimated run time can be improved by running the same program many times and averaging the result. This is left as an exercise as we are interested in directional improvements in performance more than specific benchmark speeds.

1 |
Took 24.026 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix numpy.exp()

We can update the calculation of the exponential of the matrix to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.exp()** function on the sub-array defined by the provided coordinates and ignores the return value.

1 2 3 4 5 |
# apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = exp(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

1 2 3 4 |
... # create square matrices data = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the **ThreadPool** to use 4 workers. Update the **ThreadPool** configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the **task()** function for parallel execution.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) |

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of a multithreaded matrix exponential is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# multithreaded matrix exp from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import exp from numpy import empty # apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = exp(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 12.050 seconds.

On my system, this is about 11.976 seconds faster than the single-threaded version or a 1.99x speedup.

1 |
Took 12.050 seconds |

## Multithreaded Matrix Logarithm

The natural logarithm function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

### Single-Threaded Matrix numpy.log()

We can create a matrix and then calculate the natural logarithm of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The **numpy.ones()** function can create the input matrix, initialized with the value 1.0.

1 2 3 4 5 |
... # size of arrays n = 50000 # create square matrix data = ones((n, n)) |

We can then calculate the natural logarithm of the matrix using the **numpy.log()** function.

1 2 3 |
... # matrix log result = log(data) |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix log from time import time from numpy import ones from numpy import log # record start time start = time() # size of arrays n = 50000 # create square matrix data = ones((n, n)) # matrix log result = log(data) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates a matrix and then calculates the natural logarithm of the matrix.

The program takes about 23.094 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1 |
Took 23.094 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix numpy.log()

We can update the calculation of the natural logarithm of the matrix to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.log()** function on the sub-array defined by the provided coordinates and ignores the return value.

1 2 3 4 5 |
# apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = log(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

1 2 3 4 |
... # create square matrices data = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the **task()** function for parallel execution.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) |

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of a multithreaded matrix natural logarithm is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
# multithreaded matrix log from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import log from numpy import empty # apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = log(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 11.060 seconds.

On my system, this is about 12.034 seconds faster than the single-threaded version or a 2.09x speedup.

1 |
Took 11.060 seconds |

**Overwheled by the python concurrency APIs?**

Find relief, download my FREE Python Concurrency Mind Maps

## Multithreaded Matrix Square Root

The square root function can be calculated on a matrix in parallel using threads.

We will first explore how to execute the function using a single thread, then use multiple threads and compare the performance.

### Single-Threaded Matrix numpy.sqrt()

We can create a matrix and then calculate the square root of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform the element-wise operation.

The **numpy.ones()** function can create the input matrix, initialized with the value 1.0.

1 2 3 4 5 |
... # size of arrays n = 50000 # create square matrix data = ones((n, n)) |

We can then calculate the square root of the matrix using the **numpy.sqrt()** function.

1 2 3 |
... # matrix sqrt result = sqrt(data) |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix sqrt from time import time from numpy import ones from numpy import sqrt # record start time start = time() # size of arrays n = 50000 # create square matrix data = ones((n, n)) # matrix sqrt result = sqrt(data) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates a matrix and then calculates the square root of the matrix.

The program takes about 13.787 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1 |
Took 13.787 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix numpy.sqrt()

We can update the calculation of the square root of the matrix to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.sqrt()** function on the sub-array defined by the provided coordinates and ignores the return value.

1 2 3 4 5 |
# apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = sqrt(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

Next, we can create our input and output arrays as before. The output array is empty as we will populate it with results from each sub-task.

1 2 3 4 |
... # create square matrices data = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

Update the **ThreadPool** configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the **task()** function for parallel execution.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) |

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of a multithreaded matrix square root is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
# multithreaded matrix sqrt from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import sqrt from numpy import empty # apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = sqrt(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 8.999 seconds.

On my system, this is about 4.788 seconds faster than the single-threaded version or a 1.53x speedup.

I suspect we see less of a speed-up of this function compared to other functions like **exp()** and **log()** is because the **sqrt()** function may be using a more efficient algorithm or an implementation that makes use of modern CPU instructions.

1 |
Took 8.999 seconds |

## Multithreaded Matrix Square

The square function can be calculated on a matrix in parallel using threads.

### Single-Threaded Matrix numpy.square()

We can create a matrix and then calculate the square of the matrix.

We will use a modestly sized matrix of 50,000 x 50,000 to ensure it takes a non-trivial amount of time to perform the element-wise operation.

The **numpy.ones()** function can create the input matrix, initialized with the value 1.0.

1 2 3 4 5 |
... # size of arrays n = 50000 # create square matrix data = ones((n, n)) |

We can then calculate the square of the matrix using the **numpy.square()** function.

1 2 3 |
... # matrix square result = square(data) |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix square from time import time from numpy import ones from numpy import square # record start time start = time() # size of arrays n = 50000 # create square matrix data = ones((n, n)) # matrix square result = square(data) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates a matrix and then calculates the square of the matrix.

The program takes about 13.279 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1 |
Took 13.279 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix numpy.square()

We can update the calculation of the square of the matrix to be completed in parallel using threads.

**ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.square()** function on the sub-array defined by the provided coordinates and ignores the return value.

1 2 3 4 5 |
# apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = square(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

1 2 3 4 |
... # create square matrices data = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

Update the **ThreadPool** configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) |

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of a multithreaded matrix square is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
# multithreaded matrix square from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import square from numpy import empty # apply matrix math operation on a portion of a matrix def task(coords, data, result): # unpack array indexes i1, i2, i3, i4 = coords _ = square(data[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 8.919 seconds.

On my system, this is about 4.360 seconds faster than the single-threaded version or a 1.49x speedup.

As with the square root, I suspect we see less of a speed-up of this function compared to other functions like **exp()** and **log()** because the **square()** function may be using a more efficient algorithm or an implementation that makes use of modern CPU instructions.

1 |
Took 8.919 seconds |

## Takeaways

You now know how to calculate mathematical numpy functions on matrices in parallel using threads.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Photo by Gary Doughty on Unsplash

## Leave a Reply