NumPy Multithreaded Element-Wise Matrix Arithmetic

May 21, 2023 Concurrent NumPy

You can perform element-wise matrix math functions in parallel in numpy using Python threads.

This can offer a 1.3x speed improvement over the single-threaded version of operations like element-wise matrix addition, subtraction, division, and multiplication.

In this tutorial, you will discover how to perform multithreaded element-wise matrix math functions with numpy.

Let's get started.

Element-Wise Matrix Operations Are Single Threaded

Numpy is an array library for Python.

It makes use of BLAS and LAPACK to implement many linear algebra functions for vectors and matrices efficiently. Some of these operations, such as matrix multiplication are multithreaded.

Nevertheless, many element-wise matrix functions in numpy are not multithreaded.

Element-wise functions are those operations applied to each item or cell in a matrix to produce an output matrix of the same size.

elementwise: Obtained by operating on one element (of a matrix etc) at a time.
-- elementwise, Wiktionary.

For example, adding a square 2x2 matrix to another 2x2 matrix results in a third 2x2 matrix where each element in the new matrix is the element-wise addition of elements in the same position in the first and second matrices.

[1, 1] + [1, 1] = [2, 2]
[1, 1]   [1, 1]   [2, 2]

Examples of element-wise math functions include arithmetic functions such as:

Matrix addition
Matrix subtraction
Matrix multiplication (Hadamard product)
Matrix division

It also includes many other operations that may involve one or more input matrices, such as square root, power, and so on.

None of these operations are multithreaded by default.

How can we perform element-wise matrix math functions in parallel?

Multithreaded Element-Wise Matrix Operations

Element-wise matrix math functions can be made multithreaded.

Numpy will release the Global Interpreter Lock (GIL) when performing most math functions, including matrix math functions.

This means that we can make element-wise operations parallel using Python threads. This is great because if we had to resort to process-based concurrency, then most gains achieved from parallelism would be lost in inter-process communication for the matrices involved. This is not a problem with threads as they can share memory within the same process.

You can learn more about numpy and the GIL in the tutorial:

NumPy vs the Global Interpreter Lock (GIL)

The Python standard library provides the multiprocessing.pool.ThreadPool class that can be used to create and manage a pool of worker threads.

A ThreadPool instance can be created and configured with one worker per physical CPU in the system.

For example:

...
# create the thread pool
with ThreadPool(4) as pool:
	# ...

You can learn more about configuring the number of workers in the ThreadPool in the tutorial:

ThreadPool Configure The Number of Worker Threads

We can then define a function that implements the math function we wish to perform that takes matrices as arguments. The task function can then operate on a portion of the larger matrix and perform the slow element-wise operation, such as adding two parts of the arrays and putting the results in a third array.

For example, we can use a function like numpy.add() for matrix addition, numpy.multiply() for element-wise multiplication, and so on.

These element-wise functions take the first matrix, the second matrix, and a target matrix. We can specify the coordinates of the sub-matrix in each.

# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = numpy.add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

We can then divide our large matrices into sections and issue each as a task to the thread pool to have the element-wise operations performed in parallel.

This can be challenging to make generic as it depends on the number of threads and the dimensions of the matrix.

If we have 4 physical CPU cores, then ideally we want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.

For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, using the indices into the larger matrices directly.

For example:

...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)

Tasks can then be issued via the async_apply() method on the ThreadPool that returns an AsyncResult object, which we do not need in this case.

For example:

...
# issue task
_ = pool.apply_async(task, args=(coords, data1, data2, result))

You can learn more about issuing tasks to the ThreadPool via the apply_async() method in the tutorial:

How to Use ThreadPool apply_async() in Python

Now that we know how to execute element-wise matrix mathematical functions in parallel with threads, let's look at some worked examples.

Multithreaded Element-Wise Matrix Addition

Element-wise matrix addition can be executed in parallel using threads.

We will first explore how to execute matrix addition using a single thread, then using multiple threads and compare the performance.

Single-Threaded Matrix Addition

We can create two matrices of numbers and then add them together.

We will use a modestly sized matrix of 50,000x50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.

...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))

We can then add the matrices together to produce a third matrix with the "+" operator.

...
# matrix addition
result = data1 + data2

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix addition
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix addition
result = data1 + data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example creates two matrices and then adds them together.

The program takes about 27.385 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

The precision of the estimated run time can be improved by running the same program many times and averaging the result. This is left as an exercise as we are interested in directional improvements in performance more than specific benchmark speeds.

Took 27.385 seconds

Single-Threaded Matrix Addition with add()

We can also perform addition using the numpy.add() function.

This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.add() with the input and output matrices.

For example:

...
result = empty((n, n))
# matrix addition
_ = add(data1, data2, out=result)

This is functionally the same as using the "+" operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

# single threaded matrix addition with add()
from time import time
from numpy import ones
from numpy import add
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix addition
_ = add(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program takes about the same amount of time as using the "+" operator.

In this case, it took about 26.973 seconds to complete.

Took 26.973 seconds

Next, let's look at a multithreaded version of the program.

Multithreaded Matrix Addition

We can update the element-wise matrix addition to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix simultaneously, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.add() function on the sub-array defined by the provided coordinates and ignores the return value.

# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

We can ignore the return value because there is a single copy of the result matrix in memory and all of the task functions are operating on this single array in parallel. This is because threads have a shared memory model, unlike processes that must transmit copies of data.

Next, we can create our input and output arrays as before.

...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

...
# create the thread pool
with ThreadPool(4) as pool:
	# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.add() function.

...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()

And that's it.

Tying this together, the complete example of multithreaded matrix addition is listed below.

# multithreaded matrix addition
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import add
from numpy import empty

# add a section of two matrices into a third matrix
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = add(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example is completed in about 19.492 seconds.

On my system, this is about 7.481 seconds faster than the single-threaded version or a 1.38x speedup.

Took 19.492 seconds

Multithreaded Element-Wise Matrix Subtraction

Element-wise matrix subtraction can be executed in parallel using threads.

We will first explore how to execute matrix subtraction using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix Subtraction

We can create two matrices of numbers and then subtract one from the other.

We will use a modestly sized matrix of 50,000x50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.

...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))

We can then subtract one array from the other to produce a third matrix with the "-" operator.

...
# matrix subtraction
result = data1 - data2

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix subtraction
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix subtraction
result = data1 - data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example creates two matrices and then subtracts one from the other.

The program takes about 28.648 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

Took 28.648 seconds

Single-Threaded Matrix Subtraction with subtract()

We can also perform subtraction using the numpy.subtract() function.

This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.subtract() with the input and output matrices.

For example:

...
result = empty((n, n))
# matrix subtraction
_ = subtract(data1, data2, out=result)

This is functionally the same as using the "-" operator but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

# single threaded matrix subtraction with subtract()
from time import time
from numpy import ones
from numpy import subtract
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix subtraction
_ = subtract(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program takes about the same amount of time as using the "-" operator.

In this case, it took about 25.861 seconds to complete.

Took 25.861 seconds

Next, let's look at a multithreaded version of the program.

Multithreaded Matrix Subtraction

We can update the element-wise matrix subtraction to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.subtract() function on the sub-array defined by the provided coordinates and ignores the return value,

# subtract a section of one array from another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = subtract(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

Next, we can create our input and output arrays as before.

...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.

...
# create the thread pool
with ThreadPool(4) as pool:
	# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.subtract() function.

...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()

And that's it.

Tying this together, the complete example of multithreaded matrix subtraction is listed below.

# multithreaded matrix subtraction
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import subtract
from numpy import empty

# subtract a section of one array from another and place into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = subtract(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example is completed in about 19.334 seconds.

On my system, this is about 6.527 seconds faster than the single-threaded version or a 1.34x speedup.

Took 19.334 seconds

Multithreaded Element-Wise Matrix Multiplication (Hadamard product)

Element-wise matrix multiplication, also called the Hadamard product, can be executed in parallel using threads.

Note, element-wise matrix multiplication is different from "matrix multiplication", also called the "matrix product". The matrix product is implemented in numpy via the "@" operator and the numpy.dot() function and is already multithreaded via the underlying BLAS library.

You can learn more about multithreaded matrix multiplication in the tutorial:

Numpy Multithreaded Matrix Multiplication (up to 5x faster)

We will first explore how to execute matrix multiplication using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix Multiplication

We can create two matrices of numbers and then multiply them together, element-wise.

We will use a modestly sized matrix of 50,000x50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.

...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))

We can then multiply the matrices to produce a third matrix with the "*" operator.

...
# matrix multiplication
result = data1 * data2

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded element-wise matrix multiplication
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix multiplication
result = data1 * data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example creates two matrices then multiplies them together.

The program takes about 28.607 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

Took 28.607 seconds

Single-Threaded Matrix Multiplication with multiply()

We can also perform multiplication using the numpy.multiply() function.

This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.multiply() with the input and output matrices.

For example:

...
result = empty((n, n))
# matrix multiplication
_ = multiply(data1, data2, out=result)

This is functionally the same as using the "*" operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

# single threaded element-wise matrix multiplication with multiply()
from time import time
from numpy import ones
from numpy import multiply
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix multiplication
_ = multiply(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program takes about the same amount of time as using the "*" operator.

In this case, it took about 27.399 seconds to complete.

Took 27.399 seconds

Next, let's look at a multithreaded version of the program.

Multithreaded Matrix Multiplication

We can update the element-wise matrix multiplication to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix at the same time, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.multiply() function on the sub-array defined by the provided coordinates and ignores the return value,

# multiply a section of one array from another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = multiply(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

Next, we can create our input and output arrays as before.

...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

...
# create the thread pool
with ThreadPool(4) as pool:
	# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.multiply() function.

...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()

And that's it.

Tying this together, the complete example of multithreaded matrix multiplication is listed below.

# multithreaded element-wise matrix multiplication
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import multiply
from numpy import empty

# multiply a section of one array from another and place into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = multiply(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example is completed in about 20.455 seconds.

On my system, this is about 6.944 seconds faster than the single-threaded version or a 1.34x speedup.

Took 20.455 seconds

Multithreaded Element-Wise Matrix Division

Element-wise matrix division can be executed in parallel using threads.

We will first explore how to execute matrix division using a single thread, then use multiple threads and compare the performance.

Single-Threaded Matrix Division

We can create two matrices of numbers and then divide one by the other.

We will use a modestly sized matrix of 50,000x50,000 to ensure it takes a non-trivial amount of time to perform element-wise operations.

The numpy.ones() function can be used to create two matrices of the given size, initialized with the value 1.0.

...
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))

We can then divide one array by the other to produce a third matrix with the "/" operator.

...
# matrix subtraction
result = data1 / data2

The whole program can then be timed, to give a general idea of how long it takes to execute.

Tying this together, the complete example is listed below.

# single threaded matrix division
from time import time
from numpy import ones
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
# matrix division
result = data1 / data2
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example creates two matrices and then divides one by the other.

The program takes about 27.986 seconds to complete on my system.

It may run faster or slower on your system, depending on the speed of your hardware.

Took 27.986 seconds

Single-Threaded Matrix Division with divide()

We can also perform division using the numpy.divide() function.

This can be achieved by creating a result matrix via the numpy.empty() function and then calling numpy.divide() with the input and output matrices.

For example:

...
result = empty((n, n))
# matrix division
_ = divide(data1, data2, out=result)

This is functionally the same as using the "/" operator, but allows us to have the result placed into a predefined target matrix easily, a feature we will need when it comes to multithreading the example.

Tying this together, the complete example is listed below.

# single threaded matrix division with divide()
from time import time
from numpy import ones
from numpy import divide
from numpy import empty
# record start time
start = time()
# size of arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# matrix division
_ = divide(data1, data2, out=result)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program takes about the same amount of time as using the "/" operator.

In this case, it took about 27.829 seconds to complete.

Took 27.829 seconds

Next, let's look at a multithreaded version of the program.

Multithreaded Matrix Division

We can update the element-wise matrix division to be completed in parallel using threads.

We will use a ThreadPool and issue tasks to operate on sub-sections of the input and output matrix simultaneously, as described in the strategy outlined above.

Firstly, we can define a task function that takes a tuple of array indices, the input array, and the target array. It then calls the numpy.divide() function on the sub-array defined by the provided coordinates and ignores the return value,

# divide a section of one array by another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = divide(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

Next, we can create our input and output arrays as before.

...
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))

We can then create a thread pool with one worker per physical CPU core.

...
# create the thread pool
with ThreadPool(4) as pool:
	# ...

The square arrays are then divided into quadrants (4 parts) and each is issued to the task() function for parallel execution of the numpy.subtract() function.

...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(task, args=(coords, data1, data2, result))

The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.

...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()

And that's it.

Tying this together, the complete example of multithreaded matrix division is listed below.

# multithreaded matrix division
from time import time
from multiprocessing.pool import ThreadPool
from numpy import ones
from numpy import divide
from numpy import empty

# divide a section of one array by another and place it into a third array
def task(coords, data1, data2, result):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    _ = divide(data1[i1:i2,i3:i4], data2[i1:i2,i3:i4], out=result[i1:i2,i3:i4])

# record start time
start = time()
# size of square arrays
n = 50000
# create square matrices
data1 = ones((n, n))
data2 = ones((n, n))
result = empty((n, n))
# create the thread pool
with ThreadPool(4) as pool:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = pool.apply_async(task, args=(coords, data1, data2, result))
    # close the pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example is completed in about 20.802 seconds.

On my system, this is about 7.027 seconds faster than the single-threaded version or a 1.34x speedup.

Took 20.802 seconds

Takeaways

You now know how to perform multithreaded element-wise matrix math functions with numpy.

If you enjoyed this tutorial, you will love my book: Concurrent NumPy in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.