You can perform **element-wise matrix math functions in parallel** in numpy using Python threads.

This can offer a 1.3x speed improvement over the single-threaded version of operations like element-wise matrix addition, subtraction, division, and multiplication.

In this tutorial, you will discover how to perform multithreaded element-wise matrix math functions with numpy.

Let’s get started.

Table of Contents

## Element-Wise Matrix Operations Are Single Threaded

Numpy is an array library for Python.

It makes use of BLAS and LAPACK to implement many linear algebra functions for vectors and matrices efficiently. Some of these operations, such as matrix multiplication are multithreaded.

Nevertheless, many element-wise matrix functions in numpy are not multithreaded.

Element-wise functions are those operations applied to each item or cell in a matrix to produce an output matrix of the same size.

elementwise: Obtained by operating on one element (of a matrix etc) at a time.

— elementwise, Wiktionary.

For example, adding a square 2×2 matrix to another 2×2 matrix results in a third 2×2 matrix where each element in the new matrix is the element-wise addition of elements in the same position in the first and second matrices.

1 2 |
[1, 1] + [1, 1] = [2, 2] [1, 1] [1, 1] [2, 2] |

Examples of element-wise math functions include arithmetic functions such as:

- Matrix addition
- Matrix subtraction
- Matrix multiplication (Hadamard product)
- Matrix division

It also includes many other operations that may involve one or more input matrices, such as square root, power, and so on.

None of these operations are multithreaded by default.

**How can we perform element-wise matrix math functions in parallel?**

## Multithreaded Element-Wise Matrix Operations

Element-wise matrix math functions can be made multithreaded.

Numpy will release the Global Interpreter Lock (GIL) when performing most math functions, including matrix math functions.

This means that we can make element-wise operations parallel using Python threads. This is great because if we had to resort to process-based concurrency, then most gains achieved from parallelism would be lost in inter-process communication for the matrices involved. This is not a problem with threads as they can share memory within the same process.

You can learn more about numpy and the GIL in the tutorial:

The Python standard library provides the **multiprocessing.pool.ThreadPool** class that can be used to create and manage a pool of worker threads.

A **ThreadPool** instance can be created and configured with one worker per physical CPU in the system.

For example:

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

You can learn more about configuring the number of workers in the **ThreadPool** in the tutorial:

We can then define a function that implements the math function we wish to perform that takes matrices as arguments. The task function can then operate on a portion of the larger matrix and perform the slow element-wise operation, such as adding two parts of the arrays and putting the results in a third array.

For example, we can use a function like **numpy.add()** for matrix addition, **numpy.multiply()** for element-wise multiplication, and so on.

These element-wise functions take the first matrix, the second matrix, and a target matrix. We can specify the coordinates of the sub-matrix in each.

1 2 3 4 5 |
# add a section of two matrices into a third matrix def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = numpy.add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can then divide our large matrices into sections and issue each as a task to the thread pool to have the element-wise operations performed in parallel.

This can be challenging to make generic as it depends on the number of threads and the dimensions of the matrix.

If we have 4 physical CPU cores, then ideally we want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.

For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, using the indices into the larger matrices directly.

For example:

1 2 3 4 5 6 7 8 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) |

Tasks can then be issued via the **async_apply()** method on the **ThreadPool** that returns an **AsyncResult** object, which we do not need in this case.

For example:

1 2 3 |
... # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) |

You can learn more about issuing tasks to the **ThreadPool** via the **apply_async()** method in the tutorial:

Now that we know how to execute element-wise matrix mathematical functions in parallel with threads, let’s look at some worked examples.

## Multithreaded Element-Wise Matrix Addition

Element-wise matrix addition can be executed in parallel using threads.

We will first explore how to execute matrix addition using a single thread, then using multiple threads and compare the performance.

### Single-Threaded Matrix Addition

We can create two matrices of numbers and then add them together.

We will use a modestly sized matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The **numpy.ones()** function can be used to create two matrices of the given size, initialized with the value 1.0.

1 2 3 4 5 6 |
... # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) |

We can then add the matrices together to produce a third matrix with the “+” operator.

1 2 3 |
... # matrix addition result = data1 + data2 |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix addition from time import time from numpy import ones # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) # matrix addition result = data1 + data2 # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates two matrices and then adds them together.

The program takes about 27.385 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

The precision of the estimated run time can be improved by running the same program many times and averaging the result. This is left as an exercise as we are interested in directional improvements in performance more than specific benchmark speeds.

1 |
Took 27.385 seconds |

### Single-Threaded Matrix Addition with add()

We can also perform addition using the **numpy.add()** function.

This can be achieved by creating a result matrix via the **numpy.empty()** function and then calling **numpy.add()** with the input and output matrices.

For example:

1 2 3 4 |
... result = empty((n, n)) # matrix addition _ = add(data1, data2, out=result) |

This is functionally the same as using the “**+**” operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# single threaded matrix addition with add() from time import time from numpy import ones from numpy import add from numpy import empty # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # matrix addition _ = add(data1, data2, out=result) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the program takes about the same amount of time as using the “+” operator.

In this case, it took about 26.973 seconds to complete.

1 |
Took 26.973 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix Addition

We can update the element-wise matrix addition to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix simultaneously, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.add()** function on the sub-array defined by the provided coordinates and ignores the return value.

1 2 3 4 5 |
# add a section of two matrices into a third matrix def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data.

Next, we can create our input and output arrays as before.

1 2 3 4 5 |
... # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the **task()** function for parallel execution of the **numpy.add()** function.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) |

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of multithreaded matrix addition is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# multithreaded matrix addition from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import add from numpy import empty # add a section of two matrices into a third matrix def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 19.492 seconds.

On my system, this is about 7.481 seconds faster than the single-threaded version or a 1.38x speedup.

1 |
Took 19.492 seconds |

## Multithreaded Element-Wise Matrix Subtraction

Element-wise matrix subtraction can be executed in parallel using threads.

We will first explore how to execute matrix subtraction using a single thread, then use multiple threads and compare the performance.

### Single-Threaded Matrix Subtraction

We can create two matrices of numbers and then subtract one from the other.

We will use a modestly sized matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The **numpy.ones()** function can be used to create two matrices of the given size, initialized with the value 1.0.

1 2 3 4 5 6 |
... # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) |

We can then subtract one array from the other to produce a third matrix with the “-” operator.

1 2 3 |
... # matrix subtraction result = data1 - data2 |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix subtraction from time import time from numpy import ones # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) # matrix subtraction result = data1 - data2 # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates two matrices and then subtracts one from the other.

The program takes about 28.648 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1 |
Took 28.648 seconds |

### Single-Threaded Matrix Subtraction with subtract()

We can also perform subtraction using the **numpy.subtract()** function.

This can be achieved by creating a result matrix via the numpy.empty() function and then calling **numpy.subtract()** with the input and output matrices.

For example:

1 2 3 4 |
... result = empty((n, n)) # matrix subtraction _ = subtract(data1, data2, out=result) |

This is functionally the same as using the “-” operator but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# single threaded matrix subtraction with subtract() from time import time from numpy import ones from numpy import subtract from numpy import empty # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # matrix subtraction _ = subtract(data1, data2, out=result) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the program takes about the same amount of time as using the “-” operator.

In this case, it took about 25.861 seconds to complete.

1 |
Took 25.861 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix Subtraction

We can update the element-wise matrix subtraction to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.subtract()** function on the sub-array defined by the provided coordinates and ignores the return value,

1 2 3 4 5 |
# subtract a section of one array from another and place it into a third array def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = subtract(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.

Next, we can create our input and output arrays as before.

1 2 3 4 5 |
... # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the **ThreadPool** to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the **numpy.subtract()** function.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) |

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of multithreaded matrix subtraction is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# multithreaded matrix subtraction from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import subtract from numpy import empty # subtract a section of one array from another and place into a third array def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = subtract(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 19.334 seconds.

On my system, this is about 6.527 seconds faster than the single-threaded version or a 1.34x speedup.

1 |
Took 19.334 seconds |

**Overwheled by the python concurrency APIs?**

Find relief, download my FREE Python Concurrency Mind Maps

## Multithreaded Element-Wise Matrix Multiplication (Hadamard product)

Element-wise matrix multiplication, also called the Hadamard product, can be executed in parallel using threads.

Note, element-wise matrix multiplication is different from “matrix multiplication“, also called the “matrix product”. The matrix product is implemented in numpy via the “**@**” operator and the **numpy.dot()** function and is already multithreaded via the underlying BLAS library.

You can learn more about multithreaded matrix multiplication in the tutorial:

We will first explore how to execute matrix multiplication using a single thread, then use multiple threads and compare the performance.

### Single-Threaded Matrix Multiplication

We can create two matrices of numbers and then multiply them together, element-wise.

We will use a modestly sized matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The **numpy.ones()** function can be used to create two matrices of the given size, initialized with the value 1.0.

1 2 3 4 5 6 |
... # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) |

We can then multiply the matrices to produce a third matrix with the “*” operator.

1 2 3 |
... # matrix multiplication result = data1 * data2 |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded element-wise matrix multiplication from time import time from numpy import ones # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) # matrix multiplication result = data1 * data2 # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates two matrices then multiplies them together.

The program takes about 28.607 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

The precision of the estimated run time can be improved by running the same program many times and averaging the result. This is left as an exercise as we are interested in directional improvements in performance more than specific benchmark speeds.

1 |
Took 28.607 seconds |

### Single-Threaded Matrix Multiplication with multiply()

We can also perform multiplication using the **numpy.multiply()** function.

This can be achieved by creating a result matrix via the **numpy.empty()** function and then calling **numpy.multiply()** with the input and output matrices.

For example:

1 2 3 4 |
... result = empty((n, n)) # matrix multiplication _ = multiply(data1, data2, out=result) |

This is functionally the same as using the “*****” operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# single threaded element-wise matrix multiplication with multiply() from time import time from numpy import ones from numpy import multiply from numpy import empty # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # matrix multiplication _ = multiply(data1, data2, out=result) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the program takes about the same amount of time as using the “*” operator.

In this case, it took about 27.399 seconds to complete.

1 |
Took 27.399 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix Multiplication

We can update the element-wise matrix multiplication to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.multiply()** function on the sub-array defined by the provided coordinates and ignores the return value,

1 2 3 4 5 |
# multiply a section of one array from another and place it into a third array def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = multiply(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.

Next, we can create our input and output arrays as before.

1 2 3 4 5 |
... # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the **task()** function for parallel execution of the **numpy.multiply()** function.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) |

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of multithreaded matrix multiplication is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# multithreaded element-wise matrix multiplication from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import multiply from numpy import empty # multiply a section of one array from another and place into a third array def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = multiply(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 20.455 seconds.

On my system, this is about 6.944 seconds faster than the single-threaded version or a 1.34x speedup.

1 |
Took 20.455 seconds |

## Multithreaded Element-Wise Matrix Division

Element-wise matrix division can be executed in parallel using threads.

We will first explore how to execute matrix division using a single thread, then use multiple threads and compare the performance.

### Single-Threaded Matrix Division

We can create two matrices of numbers and then divide one by the other.

**numpy.ones()** function can be used to create two matrices of the given size, initialized with the value 1.0.

1 2 3 4 5 6 |
... # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) |

We can then divide one array by the other to produce a third matrix with the “/” operator.

1 2 3 |
... # matrix subtraction result = data1 / data2 |

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# single threaded matrix division from time import time from numpy import ones # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) # matrix division result = data1 / data2 # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example creates two matrices and then divides one by the other.

The program takes about 27.986 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

1 |
Took 27.986 seconds |

### Single-Threaded Matrix Division with divide()

We can also perform division using the **numpy.divide()** function.

This can be achieved by creating a result matrix via the **numpy.empty()** function and then calling **numpy.divide()** with the input and output matrices.

For example:

1 2 3 4 |
... result = empty((n, n)) # matrix division _ = divide(data1, data2, out=result) |

This is functionally the same as using the “/” operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# single threaded matrix division with divide() from time import time from numpy import ones from numpy import divide from numpy import empty # record start time start = time() # size of arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # matrix division _ = divide(data1, data2, out=result) # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the program takes about the same amount of time as using the “/” operator.

In this case, it took about 27.829 seconds to complete.

1 |
Took 27.829 seconds |

Next, let’s look at a multithreaded version of the program.

### Multithreaded Matrix Division

We can update the element-wise matrix division to be completed in parallel using threads.

We will use a **ThreadPool** and issue tasks to operate on sub-sections of the input and output matrix simultaneously, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the **numpy.divide()** function on the sub-array defined by the provided coordinates and ignores the return value,

1 2 3 4 5 |
# divide a section of one array by another and place it into a third array def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = divide(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) |

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.

Next, we can create our input and output arrays as before.

1 2 3 4 5 |
... # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) |

We can then create a thread pool with one worker per physical CPU core.

1 2 3 4 |
... # create the thread pool with ThreadPool(4) as pool: # ... |

The square arrays are then divided into quadrants (4 parts) and each is issued to the **task()** function for parallel execution of the **numpy.subtract()** function.

1 2 3 4 5 6 7 8 9 10 |
... # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) |

1 2 3 4 5 |
... # close the pool pool.close() # wait for tasks to complete pool.join() |

And that’s it.

Tying this together, the complete example of multithreaded matrix division is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
# multithreaded matrix division from time import time from multiprocessing.pool import ThreadPool from numpy import ones from numpy import divide from numpy import empty # divide a section of one array by another and place it into a third array def task(coords, data1, data2, result): # unpack array indexes i1, i2, i3, i4 = coords _ = divide(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4]) # record start time start = time() # size of square arrays n = 50000 # create square matrices data1 = ones((n, n)) data2 = ones((n, n)) result = empty((n, n)) # create the thread pool with ThreadPool(4) as pool: # split each dimension (divisor of matrix dimension) split = round(n/2) # issue tasks for x in range(0, n, split): for y in range(0, n, split): # determine matrix coordinates coords = (x, x+split, y, y+split) # issue task _ = pool.apply_async(task, args=(coords, data1, data2, result)) # close the pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example is completed in about 20.802 seconds.

On my system, this is about 7.027 seconds faster than the single-threaded version or a 1.34x speedup.

1 |
Took 20.802 seconds |

## Further Reading

This section provides additional resources that you may find helpful.

**Books**

- Concurrent NumPy in Python, Jason Brownlee, 2023 (
**my book**).

**Concurrent NumPy Guides**

- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes

**Concurrent NumPy Documentation**

- Parallel Programming with numpy and scipy, SciPi Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API

**NumPy API**

**Python Concurrency APIs**

- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks

## Takeaways

You now know how to perform multithreaded element-wise matrix math functions with numpy.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

## Do you have any questions?