Using Threads With Numpy Can Result in Worse Performance

June 22, 2023 Concurrent NumPy

You can parallelize numpy operations in Python using threads.

Many numpy operations are implemented using multithreaded algorithms, such as those that call the BLAS library. Additionally, most numpy functions release the GIL, allowing tasks that call numpy functions to be parallelized using threads.

Nevertheless, parallelizing numpy programs using threads does not always result in a speed-up. Worse still, attempting to parallelize numpy functions that implement a multithreaded algorithm can result in worse overall performance.

In this tutorial, you will discover that parallelizing tasks in numpy programs with threads does not always result in better performance.

Let's get started.

Can Using Threads With NumPy Be Slower?

NumPy supports many linear algebra functions with vectors and arrays by calling into efficient third-party libraries such as BLAS and LAPACK.

BLAS and LAPACK are specifications for efficient vector, matrix, and linear algebra operations that are implemented using third-party libraries, such as MKL and OpenBLAS. Many of the operations in these libraries are multithreaded.


Additionally, Numpy performs most math operations like arithmetic and more advanced operations using its own C functions. These functions release the global interpreter lock (GIL) in Python, allowing other Python threads to run.

The implication is that math operations on arrays in numpy can be executed in parallel using Python threads.
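
As a minimal illustration of this idea, the following sketch runs two numpy reductions in separate Python threads. Because numpy releases the GIL during each sum, the two computations can overlap (the task() helper and the array sizes here are illustrative choices, not part of any numpy API):

```python
# sketch: two threads executing numpy math concurrently
from threading import Thread
from numpy import ones

# shared list to collect one result per thread
results = [None, None]

def task(i, data):
    # numpy releases the GIL during the sum, so threads can overlap
    results[i] = data.sum()

# prepare two arrays of one million ones each
data1 = ones(1_000_000)
data2 = ones(1_000_000)
# start both threads, then wait for them to finish
threads = [Thread(target=task, args=(0, data1)), Thread(target=task, args=(1, data2))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```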


As such, it is generally good practice to try to speed up a numpy program using threads. Nevertheless, a speed-up is not guaranteed.

It is possible to parallelize tasks in a numpy program and see no benefit. In some cases, it is even possible to see worse performance.

Using Threads Can Result in No Benefit or Worse Performance

Adding threads to a numpy program does not always improve performance.

Sometimes, parallelizing tasks in a numpy program can offer no benefit.

Additionally, in some cases, adding threads can result in significantly worse performance.

There are a number of reasons for this, such as:

  1. BLAS threads (math operation level) compete with Python threads (task level).
  2. BLAS math operations are not multithreaded for a specific use case.
  3. Multithreading a numpy function offers no benefit.

Let's take a closer look at each case in turn.

BLAS Threads Compete with Python Threads

The most common cause is using Python threads to parallelize a math operation that is already multithreaded.

For example, a math operation like matrix multiplication via the dot() method is already executed using a multithreaded algorithm in the third-party BLAS library. If the program needs to execute this multithreaded math operation many times and each call is parallelized as a task in a thread pool, then the threads at the operation level (BLAS threads) and the threads at the task level (Python threads) will compete for resources.

Each thread will be executing a CPU-bound task and there may be more CPU-bound tasks executing concurrently than the system can support, e.g. more than the number of physical or logical CPU cores. This means the operating system will context switch between the threads in an effort to run all tasks in parallel.

This can add unwanted overhead that may result in the overall program taking longer to complete than if it had been parallelized only at the math operation level (BLAS threads) or only at the task level (Python threads).

It may be possible to strike a balance between BLAS and Python threads within the same program and achieve better overall performance.


BLAS Operation Not Multithreaded In Some Use Cases

Many linear algebra math functions are implemented using a multithreaded algorithm via an installed BLAS library.

Nevertheless, not all uses of these math functions can be multithreaded.

For example, we may perform a vector multiplication using the dot() function.

This function will call down to BLAS and will execute a multithreaded algorithm.

Nevertheless, the nature of a vector-to-vector multiplication may mean it cannot be executed using multiple threads, given the largely sequential nature of the operation.

Alternatively, perhaps a different function is required to perform the desired math operation in a way that can be multithreaded, such as calling the numpy.linalg.svd() function instead of calculating an SVD manually.
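
For example, a single call to numpy.linalg.svd() delegates the entire decomposition to optimized LAPACK code, which may be multithreaded depending on the installed library. A quick sketch (the matrix size here is arbitrary):

```python
# sketch: delegate the whole SVD to the optimized LAPACK routine
from numpy import ones
from numpy.linalg import svd

# a rank-one 100 x 50 matrix of ones
a = ones((100, 50))
# economy-size SVD via a single library call
u, s, vt = svd(a, full_matrices=False)
# report the shapes of the factors
print(u.shape, s.shape, vt.shape)
```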

Multithreading a Numpy Function Offers no Benefit

Most numpy math functions release the GIL.

This means that we can parallelize tasks that execute numpy functions with threads.

Nevertheless, even though the design of the program appears appropriate, the implementation may result in no benefit.

An example is filling a numpy array via the fill() method.

This method releases the GIL. Therefore, if we need to fill many arrays, we might expect that we can execute these array fill tasks in parallel using threads and achieve a speed-up.

This may not be the case in general, or it may not be the case for some specific ways the task is executed, such as the choice of method on the ThreadPool or the choice of configuration, such as the "chunksize" argument to the map() function.

Now that we are familiar with some of the ways we may not get a lift in performance when adding threads to numpy programs, let's explore some worked examples.

Example of Adding Threads Results in Worse Performance

We can explore an example of multithreading a task that calls a multithreaded numpy function.

This can result in worse performance in some cases, such as when there are more overall CPU-bound threads than there are physical or logical CPU cores in the system.

In this case, we will explore parallelizing multiple matrix multiplication tasks.

BLAS-level Multithreaded Matrix Multiplication

We can develop a baseline case before we attempt to parallelize it with Python threads.

In this example, we will define two lists of 100 matrices and then multiply each pair of matrices. The matrix multiplication operation executed via the dot() method on the numpy array is implemented using BLAS and uses a multithreaded algorithm.

The complete example is listed below.

# example of multithreaded matrix multiplication
from numpy import ones
from time import time

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# prepare lists of matrices
n = 2000
data1 = [ones((n,n)) for _ in range(100)]
data2 = [ones((n,n)) for _ in range(100)]
# apply math operation to items
results = [operation(item1, item2) for item1,item2 in zip(data1,data2)]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example first creates two lists of 100 matrices, each with the size 2,000 x 2,000 and initialized with ones.

Each pair of matrices is then multiplied by calling a custom function that in turn calls the dot() method. We use a custom function to perform the math operation so that it is easy to parallelize in the next section.

Finally, the overall time taken for the program to execute is reported.

In this case, it took about 10.705 seconds to execute on my system.

It may take more or fewer seconds to run, depending on the speed of your system.

This provides a baseline that we can compare to the parallelized version of the same program.

Took 10.705 seconds

Next, let's update the program to parallelize the tasks with threads.

ThreadPool and BLAS-level Multithreaded Matrix Multiplication

We can update the previous example to parallelize the tasks using Python threads.

In this case, we can use the multiprocessing.pool.ThreadPool class that provides a pool of reusable worker threads. We can then issue the multiplication of each pair of matrices to the pool for execution using the starmap() function.


The complete example with this change is listed below.

# example of multithreaded matrix multiplication with thread pool
from numpy import ones
from time import time
from multiprocessing.pool import ThreadPool

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# prepare lists of matrices
n = 2000
data1 = [ones((n,n)) for _ in range(100)]
data2 = [ones((n,n)) for _ in range(100)]
# create thread pool
with ThreadPool() as pool:
    # prepare arguments
    args = [(item1,item2) for item1,item2 in zip(data1,data2)]
    # issue tasks to thread pool
    results = pool.starmap(operation, args)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example first creates the two lists of matrices, as before.

The ThreadPool is then created and the pairs of matrices are prepared as arguments for each task. The tasks are then issued to the ThreadPool via the starmap() function, and the call blocks until all of the matrix multiplication tasks are complete.

Finally, the overall time for the program to complete is reported.

In this case, the program took about 17.354 seconds to run on my system.

This is about 6.649 seconds longer than the single-threaded version.

Took 17.354 seconds

The reason the program took longer to complete is that the worker threads in the ThreadPool, each attempting to execute a matrix multiplication as fast as possible, are competing with the threads in BLAS that are attempting to perform each individual matrix multiplication as fast as possible.

I have 4 physical CPU cores in my system, which provides 8 logical CPU cores with hyperthreading. If BLAS uses 8 threads for each matrix multiplication by default and the ThreadPool has 8 worker threads by default, then there can be far more CPU-bound threads attempting to execute at the same time than the 8 logical cores can support. This results in significant overhead as the operating system context switches between the threads, attempting to complete them all as fast as possible.

A better solution would be to limit the overall number of threads running within the program, such as only using BLAS parallelism, only using the ThreadPool for parallelism, or finding a middle ground between both BLAS and Python threads that offers overall better performance.
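
For example, one simple middle ground is to disable BLAS threading entirely and rely on the ThreadPool alone. The sketch below assumes the installed BLAS honors the OMP_NUM_THREADS environment variable (as OpenBLAS and MKL typically do); the matrix sizes and task count are scaled down for illustration:

```python
# sketch: task-level threads only, with BLAS threading disabled
from os import environ
environ['OMP_NUM_THREADS'] = '1'  # must be set before importing numpy
from multiprocessing.pool import ThreadPool
from numpy import ones

# function for performing the operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# prepare small lists of matrices for a quick illustration
n = 200
data1 = [ones((n, n)) for _ in range(10)]
data2 = [ones((n, n)) for _ in range(10)]
# issue all tasks to the thread pool
with ThreadPool() as pool:
    results = pool.starmap(operation, zip(data1, data2))
print(len(results))
```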

Note, we can tune the above example to achieve better performance, perhaps even performance equivalent to the BLAS-only example, by setting the "chunksize" argument of the starmap() function to 25.

For example:

...
# issue tasks to thread pool
results = pool.starmap(operation, args, chunksize=25)

Still, this does not result in better overall performance, only near equivalent performance to not using the ThreadPool at all.


Next, let's look at an example where BLAS threads may offer no benefit.

Example of BLAS Threads Offering No Benefit

We can explore the case where calling a numpy function that is backed by a multithreaded BLAS library offers no benefit.

In this case, we will explore the multiplication of one vector by another, e.g. a vector dot product via the dot() function.

We can confirm that the dot product of two vectors is not multithreaded by first executing the program with BLAS threads disabled and then executing the same operation with BLAS threads enabled.

Single-Threaded Vector Multiplication

We can establish a baseline by calculating a vector dot product without BLAS threads.

This can be achieved by configuring the installed BLAS library to use a single thread when performing BLAS operations that would otherwise be multithreaded.

For example:

...
from os import environ
environ['OMP_NUM_THREADS'] = '1'


We can then create two vectors with 1 billion elements and multiply them together and report the overall time the program takes to execute.

The complete example is listed below.

# example of single threaded vector multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones
# record the start time
start = time()
# size of arrays
n = 1000000000
# create vectors
data1 = ones(n)
data2 = ones(n)
# vector multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program first creates the two 1 billion element vectors and then multiplies them together.

The overall time taken for the program to execute is then reported.

In this case, the program took about 5.750 seconds to execute on my system.

Took 5.750 seconds

Next, we can execute the same program with BLAS threads enabled.

Multithreaded Vector Multiplication

We can update the previous example to allow BLAS to use threads for matrix and vector multiplication.

This can be achieved by removing the explicit configuration of the number of BLAS threads and allowing BLAS to use the default number of threads.

The complete example with this change is listed below.

# example of multithreaded vector multiplication
from time import time
from numpy import ones
# record the start time
start = time()
# size of arrays
n = 1000000000
# create vectors
data1 = ones(n)
data2 = ones(n)
# vector multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program first creates the two 1 billion element vectors and then multiplies them together.

The overall time taken for the program to execute is then reported.

In this case, the program took about 5.462 seconds to execute on my system.

The difference is about 0.288 seconds, which is probably within the margin of error.

Took 5.462 seconds

It is probably fair to say that although the dot() function is implemented using a multithreaded algorithm, it does not offer a benefit in this case when multiplying vectors, probably because of the one-dimensional nature of the arrays.

This highlights that we cannot rely on a multithreaded BLAS function to always offer a speedup. Perhaps in some cases, we can explore our own multithreaded design instead.
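
As a sketch of such a design (illustrative only: the chunk_dot() helper and the choice of four workers are assumptions, and any benefit will depend on the memory bandwidth of the system), we can split the vectors into chunks, compute partial dot products in a ThreadPool, and sum the partial results:

```python
# sketch: manually splitting a vector dot product across Python threads
from multiprocessing.pool import ThreadPool
from numpy import ones

# helper that computes the dot product of one pair of chunks
def chunk_dot(pair):
    a, b = pair
    return a.dot(b)

# prepare the two vectors
n = 1_000_000
data1 = ones(n)
data2 = ones(n)
# split both vectors into equal-sized, non-overlapping chunks
workers = 4
size = n // workers
chunks = [(data1[i*size:(i+1)*size], data2[i*size:(i+1)*size]) for i in range(workers)]
# compute the partial dot products in parallel
with ThreadPool(workers) as pool:
    partials = pool.map(chunk_dot, chunks)
# combine the partial results
result = sum(partials)
print(result)
```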

Next, let's explore multithreading a numpy function that releases the GIL and does not receive a performance benefit.

Example of Multithreaded Task Offering No Benefit

Most numpy functions release the GIL, allowing Python threads to be used to achieve full parallelism.

Nevertheless, parallelizing tasks that execute numpy functions with threads does not always offer a lift in performance.

In this case, we will demonstrate this by filling multiple arrays with a single value in parallel using threads with the fill() method.

The fill() method does release the GIL and can be parallelized in some circumstances, such as filling different parts of the same array in parallel.

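
For contrast, here is a minimal sketch of the case that can benefit: worker threads zero-filling non-overlapping slices of the same array (the fill_row() helper is hypothetical):

```python
# sketch: threads filling different slices of the same array
from multiprocessing.pool import ThreadPool
from numpy import ones

# one shared array of ones
data = ones((4, 1_000_000))

# each task zero-fills one row; the slices do not overlap
def fill_row(i):
    data[i, :].fill(0)

# fill all four rows in parallel
with ThreadPool(4) as pool:
    pool.map(fill_row, range(4))
print(data.sum())
```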

We may not get a benefit by filling multiple different arrays in parallel.

Single-Threaded NumPy Array Fill

Before we explore the parallel case, we can define a single-threaded program that fills many arrays.

The example below defines 100 matrices that are 5,000 x 5,000 elements in size. Each matrix is then filled with zero values sequentially and the overall time taken to complete the program is reported.

The complete example is listed below.

# example of single-threaded array filling
from numpy import ones
from time import time
# record the start time
start = time()
# size of arrays
n = 5000
# prepare data
data = [ones((n,n)) for _ in range(100)]
# fill each array with zeros
[item.fill(0) for item in data]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program first creates a list of 100 arrays with the given size.

The list of arrays is then traversed and each is filled with zero values.

In this case, the program takes about 7.615 seconds to complete on my system.

This provides a baseline in performance to compare to the parallel case.

Took 7.615 seconds

Next, let's update the example to fill the arrays in parallel using threads.

Multithreaded NumPy Array Fill

We can update the previous example to fill the arrays in parallel using threads.

We will create a ThreadPool with the default number of worker threads, then issue each fill() task to the pool to be executed.

This requires defining a custom function that takes an array as an argument and calls the fill() method on it with the given value.

All tasks are issued to the pool at once via the map() function with the default "chunksize" argument.

The complete example with this change is listed below.

# example of multithreaded array filling
from numpy import ones
from time import time
from multiprocessing.pool import ThreadPool

# function for performing operation on matrices
def operation(item):
    # apply array operation
    item.fill(0)

# record the start time
start = time()
# prepare data
n = 5000
# prepare data
data = [ones((n,n)) for _ in range(100)]
# create thread pool
with ThreadPool() as pool:
    # issue tasks to thread pool
    _ = pool.map(operation, data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the program first creates a list of 100 arrays with the given size.

The ThreadPool is then created and all tasks are issued to the pool. This call blocks until all tasks are completed.

Finally, the overall time taken to execute the program is reported.

In this case, the program takes about 7.810 seconds to complete on my system.

This is about 0.195 seconds slower (195 milliseconds) than the single-threaded version and is probably within the margin of measurement error.

Took 7.810 seconds

This example highlights that even though we are executing a numpy function that releases the GIL in parallel with threads, we do not see the expected speed-up.

Note, after some experimentation, tuning the "chunksize" argument of the map() function does not appear to offer better performance on this problem.

Takeaways

You now know that parallelizing tasks in numpy programs with threads does not always result in better overall performance.



If you enjoyed this tutorial, you will love my book: Concurrent NumPy in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.