Last Updated on September 29, 2023
You can perform element-wise matrix math functions in parallel in numpy using Python threads.
This can offer a 1.3x speed improvement over the single-threaded version of operations like element-wise matrix addition, subtraction, division, and multiplication.
In this tutorial, you will discover how to perform multithreaded element-wise matrix math functions with numpy.
Let’s get started.
Element-Wise Matrix Operations Are Single Threaded
Numpy is an array library for Python.
It makes use of BLAS and LAPACK to implement many linear algebra functions for vectors and matrices efficiently. Some of these operations, such as matrix multiplication, are multithreaded.
Nevertheless, many element-wise matrix functions in numpy are not multithreaded.
Element-wise functions are those operations applied to each item or cell in a matrix to produce an output matrix of the same size.
elementwise: Obtained by operating on one element (of a matrix etc) at a time.
— elementwise, Wiktionary.
For example, adding a square 2×2 matrix to another 2×2 matrix results in a third 2×2 matrix where each element in the new matrix is the element-wise addition of elements in the same position in the first and second matrices.
```
[1, 1] + [1, 1] = [2, 2]
[1, 1]   [1, 1]   [2, 2]
```
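The same 2×2 addition can be reproduced directly in numpy; a minimal sketch:

```python
# demonstrate element-wise addition of two 2x2 matrices
from numpy import ones

# create two 2x2 matrices initialized to one
a = ones((2, 2))
b = ones((2, 2))
# element-wise addition produces a 2x2 matrix of twos
c = a + b
print(c)
```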
Examples of element-wise math functions include arithmetic functions such as:
- Matrix addition
- Matrix subtraction
- Matrix multiplication (Hadamard product)
- Matrix division
Element-wise operations also include many other functions that take one or more input matrices, such as square root, power, and so on.
None of these operations are multithreaded by default.
How can we perform element-wise matrix math functions in parallel?
Multithreaded Element-Wise Matrix Operations
Element-wise matrix math functions can be made multithreaded.
Numpy will release the Global Interpreter Lock (GIL) when performing most math functions, including matrix math functions.
This means that we can make element-wise operations parallel using Python threads. This is great because if we had to resort to process-based concurrency, then most gains achieved from parallelism would be lost in inter-process communication for the matrices involved. This is not a problem with threads as they can share memory within the same process.
You can learn more about numpy and the GIL in a separate tutorial.
The Python standard library provides the multiprocessing.pool.ThreadPool class that can be used to create and manage a pool of worker threads.
A ThreadPool instance can be created and configured with one worker per physical CPU in the system.
For example:
```python
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
```
You can learn more about configuring the number of workers in the ThreadPool in a separate tutorial.
We can then define a function that implements the math function we wish to perform that takes matrices as arguments. The task function can then operate on a portion of the larger matrix and perform the slow element-wise operation, such as adding two parts of the arrays and putting the results in a third array.
For example, we can use a function like numpy.add() for matrix addition, numpy.multiply() for element-wise multiplication, and so on.
These element-wise functions take the first matrix, the second matrix, and a target matrix. We can specify the coordinates of the sub-matrix in each.
```python
# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = numpy.add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])
```
We can then divide our large matrices into sections and issue each as a task to the thread pool to have the element-wise operations performed in parallel.
This can be challenging to make generic as it depends on the number of threads and the dimensions of the matrix.
If we have 4 physical CPU cores, then ideally we want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.
For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, using the indices into the larger matrices directly.
For example:
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
```
Tasks can then be issued via the apply_async() method on the ThreadPool, which returns an AsyncResult object that we do not need in this case.
For example:
```python
...
# issue task
_ = pool.apply_async(task, args=(coords, data1, data2, result))
```
You can learn more about issuing tasks to the ThreadPool via the apply_async() method in a separate tutorial.
Now that we know how to execute element-wise matrix mathematical functions in parallel with threads, let’s look at some worked examples.
Multithreaded Element-Wise Matrix Addition
Element-wise matrix addition can be executed in parallel using threads.
We will first explore how to execute matrix addition using a single thread, then using multiple threads and compare the performance.
Single-Threaded Matrix Addition
We can create two matrices of numbers and then add them together.
We will use a large matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations. Note that a single 50,000×50,000 matrix of float64 values occupies roughly 20 gigabytes of memory, so reduce n if your system has less RAM.
The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.
```python
...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
```
We can then add the matrices together to produce a third matrix with the “+” operator.
```python
...
# matrix addition
result = data1 + data2
```
The whole program can then be timed, to give a general idea of how long it takes to execute.
Tying this together, the complete example is listed below.
```python
# single threaded matrix addition
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix addition
result = data1 + data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example creates two matrices and then adds them together.
The program takes about 27.385 seconds to complete on my system.
It may run faster or slower on your system, depending on the speed of your hardware.
The precision of the estimated run time can be improved by running the same program many times and averaging the result. This is left as an exercise as we are interested in directional improvements in performance more than specific benchmark speeds.
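Repeated timing of this kind can be sketched with the standard library's timeit module; a minimal example using a small matrix (so it runs quickly) and averaging three runs:

```python
# estimate run time more precisely by repeating the benchmark
from timeit import repeat
from numpy import ones

# benchmark a small matrix addition (small n so the sketch runs quickly)
n = 1000
data1 = ones((n, n))
data2 = ones((n, n))
# time the operation once per run, for three runs
durations = repeat(lambda: data1 + data2, number=1, repeat=3)
print(f'Mean of {len(durations)} runs: {sum(durations)/len(durations):.6f} seconds')
```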
```
Took 27.385 seconds
```
Single-Threaded Matrix Addition with add()
We can also perform addition using the numpy.add() function.
This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.add() with the input and output matrices.
For example:
```python
...
result = empty((n, n))
# matrix addition
_ = add(data1, data2, out=result)
```
This is functionally the same as using the “+” operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.
Tying this together, the complete example is listed below.
```python
# single threaded matrix addition with add()
from time import time
from numpy import ones
from numpy import add
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix addition
_ = add(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the program takes about the same amount of time as using the “+” operator.
In this case, it took about 26.973 seconds to complete.
```
Took 26.973 seconds
```
Next, let’s look at a multithreaded version of the program.
Multithreaded Matrix Addition
We can update the element-wise matrix addition to be completed in parallel using threads.
We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix simultaneously, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices, the two input arrays, and the target array. It then calls the numpy.add() function on the sub-array defined by the provided coordinates and ignores the return value.
```python
# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])
```
We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data.
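One caveat when discarding the AsyncResult: exceptions raised inside a task are silently swallowed unless the result object is queried. A hedged sketch (the failing_task() function is a made-up example, not from the matrix code) that keeps the handles and surfaces any error via get():

```python
# surface exceptions from ThreadPool tasks by keeping the AsyncResult handles
from multiprocessing.pool import ThreadPool

# a task that always fails, to show that errors only surface via the AsyncResult
def failing_task():
    raise ValueError('boom')

outcomes = []
with ThreadPool(2) as pool:
    # keep each AsyncResult rather than discarding it
    results = [pool.apply_async(failing_task) for _ in range(2)]
    pool.close()
    pool.join()
# calling get() re-raises any exception from the worker thread
for r in results:
    try:
        r.get()
        outcomes.append('ok')
    except ValueError as e:
        outcomes.append(str(e))
print(outcomes)
```

In the matrix examples below, discarding the result is safe because the tasks are simple and any bug would be caught during development, but keeping the handles is a cheap safeguard.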
Next, we can create our input and output arrays as before.
```python
...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
```
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.
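The CPU count can be queried programmatically rather than hard-coded; a sketch using the standard library, noting that os.cpu_count() reports logical CPUs, which on systems with hyperthreading may be double the number of physical cores (the halving below is a heuristic, not a guarantee):

```python
# query the number of CPUs available in order to size the thread pool
import os

# os.cpu_count() reports logical CPUs, which may be double the number
# of physical cores on systems with hyperthreading
logical = os.cpu_count() or 1
# heuristic: assume two hardware threads per physical core
physical_estimate = max(1, logical // 2)
print(f'logical CPUs: {logical}, estimated physical cores: {physical_estimate}')
```

The estimate can then be passed to ThreadPool() in place of the literal 4.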
```python
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
```
The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.add() function.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))
```
The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.
```python
...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()
```
And that’s it.
Tying this together, the complete example of multithreaded matrix addition is listed below.
```python
# multithreaded matrix addition
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import add
from numpy import empty

# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example completes in about 19.492 seconds.
On my system, this is about 7.481 seconds faster than the single-threaded version, a 1.38x speedup.
```
Took 19.492 seconds
```
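A sanity check with a small matrix can confirm that the tiled, threaded addition produces the same result as direct addition; a minimal sketch (using n=100 here purely so the check runs quickly):

```python
# verify the tiled, threaded addition matches the direct result on a small matrix
from multiprocessing.pool import ThreadPool
from numpy import ones, empty, add, allclose

# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    i1, i2, i3, i4 = coords
    add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

n = 100
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
with ThreadPool(4) as pool:
    split = round(n/2)
    for x in range(0, n, split):
        for y in range(0, n, split):
            coords = (x, x+split, y, y+split)
            pool.apply_async(task, args=(coords, data1, data2, result))
    pool.close()
    pool.join()
# every element should equal 2.0, matching data1 + data2
print(allclose(result, data1 + data2))
```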
Multithreaded Element-Wise Matrix Subtraction
Element-wise matrix subtraction can be executed in parallel using threads.
We will first explore how to execute matrix subtraction using a single thread, then use multiple threads and compare the performance.
Single-Threaded Matrix Subtraction
We can create two matrices of numbers and then subtract one from the other.
We will use a large matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.
The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.
```python
...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
```
We can then subtract one array from the other to produce a third matrix with the “-” operator.
```python
...
# matrix subtraction
result = data1 - data2
```
The whole program can then be timed, to give a general idea of how long it takes to execute.
Tying this together, the complete example is listed below.
```python
# single threaded matrix subtraction
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix subtraction
result = data1 - data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example creates two matrices and then subtracts one from the other.
The program takes about 28.648 seconds to complete on my system.
It may run faster or slower on your system, depending on the speed of your hardware.
```
Took 28.648 seconds
```
Single-Threaded Matrix Subtraction with subtract()
We can also perform subtraction using the numpy.subtract() function.
This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.subtract() with the input and output matrices.
For example:
```python
...
result = empty((n, n))
# matrix subtraction
_ = subtract(data1, data2, out=result)
```
This is functionally the same as using the “-” operator but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.
Tying this together, the complete example is listed below.
```python
# single threaded matrix subtraction with subtract()
from time import time
from numpy import ones
from numpy import subtract
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix subtraction
_ = subtract(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the program takes about the same amount of time as using the “-” operator.
In this case, it took about 25.861 seconds to complete.
```
Took 25.861 seconds
```
Next, let’s look at a multithreaded version of the program.
Multithreaded Matrix Subtraction
We can update the element-wise matrix subtraction to be completed in parallel using threads.
We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices, the two input arrays, and the target array. It then calls the numpy.subtract() function on the sub-array defined by the provided coordinates and ignores the return value.
```python
# subtract a section of one array from another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = subtract(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])
```
We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.
Next, we can create our input and output arrays as before.
```python
...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
```
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.
```python
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
```
The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.subtract() function.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))
```
The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.
```python
...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()
```
And that’s it.
Tying this together, the complete example of multithreaded matrix subtraction is listed below.
```python
# multithreaded matrix subtraction
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import subtract
from numpy import empty

# subtract a section of one array from another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = subtract(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example completes in about 19.334 seconds.
On my system, this is about 6.527 seconds faster than the single-threaded version, a 1.34x speedup.
```
Took 19.334 seconds
```
Multithreaded Element-Wise Matrix Multiplication (Hadamard product)
Element-wise matrix multiplication, also called the Hadamard product, can be executed in parallel using threads.
Note, element-wise matrix multiplication is different from “matrix multiplication”, also called the “matrix product”. The matrix product is implemented in numpy via the “@” operator and the numpy.dot() function and is already multithreaded via the underlying BLAS library.
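The distinction is easy to see on a small example; a minimal sketch contrasting the two operations:

```python
# contrast element-wise multiplication with the matrix product
from numpy import array

a = array([[1, 2], [3, 4]])
b = array([[5, 6], [7, 8]])
# element-wise (Hadamard) product: multiply matching positions
hadamard = a * b
# matrix product: dot products of rows with columns (BLAS-backed)
product = a @ b
print(hadamard)
print(product)
```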
You can learn more about multithreaded matrix multiplication in a separate tutorial.
We will first explore how to execute matrix multiplication using a single thread, then use multiple threads and compare the performance.
Single-Threaded Matrix Multiplication
We can create two matrices of numbers and then multiply them together, element-wise.
We will use a large matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.
The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.
```python
...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
```
We can then multiply the matrices to produce a third matrix with the “*” operator.
```python
...
# matrix multiplication
result = data1 * data2
```
The whole program can then be timed, to give a general idea of how long it takes to execute.
Tying this together, the complete example is listed below.
```python
# single threaded element-wise matrix multiplication
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix multiplication
result = data1 * data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example creates two matrices then multiplies them together.
The program takes about 28.607 seconds to complete on my system.
It may run faster or slower on your system, depending on the speed of your hardware.
```
Took 28.607 seconds
```
Single-Threaded Matrix Multiplication with multiply()
We can also perform multiplication using the numpy.multiply() function.
This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.multiply() with the input and output matrices.
For example:
```python
...
result = empty((n, n))
# matrix multiplication
_ = multiply(data1, data2, out=result)
```
This is functionally the same as using the “*” operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.
Tying this together, the complete example is listed below.
```python
# single threaded element-wise matrix multiplication with multiply()
from time import time
from numpy import ones
from numpy import multiply
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix multiplication
_ = multiply(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the program takes about the same amount of time as using the “*” operator.
In this case, it took about 27.399 seconds to complete.
```
Took 27.399 seconds
```
Next, let’s look at a multithreaded version of the program.
Multithreaded Matrix Multiplication
We can update the element-wise matrix multiplication to be completed in parallel using threads.
We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices, the two input arrays, and the target array. It then calls the numpy.multiply() function on the sub-array defined by the provided coordinates and ignores the return value.
```python
# multiply a section of one array by another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = multiply(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])
```
We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.
Next, we can create our input and output arrays as before.
```python
...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
```
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.
```python
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
```
The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.multiply() function.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))
```
The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.
```python
...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()
```
And that’s it.
Tying this together, the complete example of multithreaded matrix multiplication is listed below.
```python
# multithreaded element-wise matrix multiplication
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import multiply
from numpy import empty

# multiply a section of one array by another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = multiply(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example completes in about 20.455 seconds.
On my system, this is about 6.944 seconds faster than the single-threaded version, a 1.34x speedup.
```
Took 20.455 seconds
```
Multithreaded Element-Wise Matrix Division
Element-wise matrix division can be executed in parallel using threads.
We will first explore how to execute matrix division using a single thread, then use multiple threads and compare the performance.
Single-Threaded Matrix Division
We can create two matrices of numbers and then divide one by the other.
We will use a large matrix of 50,000×50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.
The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.
```python
...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
```
We can then divide one array by the other to produce a third matrix with the “/” operator.
```python
...
# matrix division
result = data1 / data2
```
The whole program can then be timed, to give a general idea of how long it takes to execute.
Tying this together, the complete example is listed below.
```python
# single threaded matrix division
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix division
result = data1 / data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example creates two matrices and then divides one by the other.
The program takes about 27.986 seconds to complete on my system.
It may run faster or slower on your system, depending on the speed of your hardware.
```
Took 27.986 seconds
```
Single-Threaded Matrix Division with divide()
We can also perform division using the numpy.divide() function.
This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.divide() with the input and output matrices.
For example:
```python
...
result = empty((n, n))
# matrix division
_ = divide(data1, data2, out=result)
```
This is functionally the same as using the “/” operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.
Tying this together, the complete example is listed below.
```python
# single threaded matrix division with divide()
from time import time
from numpy import ones
from numpy import divide
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix division
_ = divide(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the program takes about the same amount of time as using the “/” operator.
In this case, it took about 27.829 seconds to complete.
```
Took 27.829 seconds
```
Next, let’s look at a multithreaded version of the program.
Multithreaded Matrix Division
We can update the element-wise matrix division to be completed in parallel using threads.
We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix simultaneously, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices, the two input arrays, and the target array. It then calls the numpy.divide() function on the sub-array defined by the provided coordinates and ignores the return value.
```python
# divide a section of one array by another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = divide(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])
```
We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data between processes.
Next, we can create our input and output arrays as before.
```python
...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
```
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.
```python
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
```
The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.divide() function.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))
```
The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.
```python
...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()
```
And that’s it.
Tying this together, the complete example of multithreaded matrix division is listed below.
```python
# multithreaded matrix division
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import divide
from numpy import empty

# divide a section of one array by another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = divide(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')
```
Running the example completes in about 20.802 seconds.
On my system, this is about 7.027 seconds faster than the single-threaded version, a 1.34x speedup.
```
Took 20.802 seconds
```
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to perform multithreaded element-wise matrix math functions with numpy.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.