Speed-Up NumPy With Threads in Python (up to 3.41x faster)

May 7, 2023 Concurrent NumPy

You can combine BLAS threads with threading in NumPy programs.

Maximizing these types of parallelism can help you fully utilize your CPU cores for a given application and achieve a speed-up compared to not using any or only one level of parallelism. It can also be tricky as sometimes combining threading with NumPy BLAS threads can result in no benefit or even worse performance.

In this tutorial, you will discover how to combine and test threading and numpy parallelism in order to achieve the best performance.

Let's get started.

Parallelism in Python and NumPy

Python supports concurrency using threads via the threading module.

Work can be partitioned into subtasks and each subtask can be executed in a separate Python thread in parallel. This only applies if the tasks release the global interpreter lock (GIL), otherwise, the threads will not execute in parallel.

Importantly, the GIL is released by numpy when calling into C functions, such as performing operations on numpy arrays.

Therefore, Python threads can be used to execute most numpy functions in parallel.

Parallelism is supported by Python processes, but the overhead of having to transmit numpy arrays between processes often results in worse performance than the single process version of the same operations.

You can learn more about parallelism in Python using the threading module in the guide:

Python Threading: The Complete Guide

NumPy supports parallelism for some operations that use third-party libraries such as BLAS and LAPACK.

Examples include matrix multiplication via numpy.dot() and matrix decomposition like SVD via numpy.linalg.svd(). These functions call the BLAS and LAPACK library APIs, the implementation of which (such as OpenBLAS) is likely implemented in C or FORTRAN and uses threads.

As such, some NumPy functions support parallelism automatically.

You can learn more about BLAS in numpy programs in the tutorial:

What is BLAS and LAPACK in NumPy

This raises a question: which type of parallelism should we use in a given program?

Should we rely on numpy parallelism or should we split our operation into subtasks and execute them with Python threads?

Case Study Parallelism With Matrix Data

Consider a situation where we load data as numpy arrays and need to process each array in some way.

Assume that the data is numerical and that the operation is a NumPy function that calls down to BLAS and can leverage threads.

For example:

Perhaps we need to load image data, video data, or numerical data as arrays.
Perhaps each array needs a matrix operation applied, such as fit a model or compute a decomposition.

We can simplify this to a contrived situation that involves creating a square matrix of random numbers and multiplying the matrix with another matrix.

How should we structure the program to best take advantage of parallelism on one machine?

Put another way:

How can we use all CPU cores in this use case?

I would argue that we cannot know. We must discover what works well or best for a specific program. It depends on the number of matrices, the size of the matrices, and the complexity of the operation to be performed.

There are perhaps 4 main approaches to test, they are:

No threading and no BLAS threads. This is the default or naive approach that assumes we have a single CPU core. Whatever we do should be better than this case, but we need to know what the worst case is so we can beat it.
No threading and use BLAS threads. This delegates all parallelism to BLAS, which really means that we only get parallelism for the operation on the loaded data. This is a naive implementation and will do better than the above baseline. We could attempt to tweak it by specifying the number of threads to create to match the number of physical rather than logical CPU cores. This can help sometimes.
Threading with no BLAS threads. This turns off the parallelism in BLAS and focuses efforts on parallelism at a high level in the application, the task level.
Threading with BLAS threads. This combines parallelism at the task level using threading and at the operation level using numpy with BLAS. The risk is that the threads may step on each other, fighting for execution time and threading the operating system with excessive context switching.

If you are faced with the situation described above (and are running on one multi-core machine), I'd recommend testing each of these general approaches to parallelism and discovering what works well or best on your system.

Next, let's explore each of these cases with a simplified version of the case study.

Example of No Threads

The first case is the baseline.

This case involves not using task-level parallelism via the threading module and not using threads in BLAS.

We can configure the number of threads used by the BLAS library implementation via an environment variable, such as the OMP_NUM_THREADS if you have OpenBLAS installed (the most common).

You can find what BLAS library is installed by calling numpy.show_config() then look up the API documentation for that library to check the environment variable to set.

An environment variable can be set prior to running the Python program, or from within Python via the os.environ() function prior to importing numpy.

You can learn more about configuring the number of threads used by BLAS in the tutorial:

Configure the Number of NumPy BLAS Threads

The example below disables NumPy BLAS multithreading. It then creates two lists of 100 matrices. Each matrix is a square 2,000 x 2,000 matrix of random floating point values.

An operation is then applied to each pair of 100 matrices, in this case, a simple matrix multiplication.

This is a contrived test case, but we can see how creating random matrices could be replaced with loading datasets or images, and how the matrix multiplication operation could be applied with some relevant operation required on the loaded data supported by NumPy that calls down into BLAS.

The complete example is listed below.

# example of loading matrices and multiplying them together (no threads)
from os import environ
# turn off threads in numpy (with openblas)
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy.random import rand

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# load data
data1 = [load_data() for _ in range(100)]
data2 = [load_data() for _ in range(100)]
# apply operation to items
results = [operation(item1, item2) for item1,item2 in zip(data1,data2)]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example on my system takes about 32.903 seconds.

Took 32.903 seconds

We could improve the benchmarking by running the example many times and taking the average time.

This could be further automated using a module like timeit. I leave this as an exercise for the reader.

Here, the numbers are directional rather than definitive.

I don't want to give hard numbers for "this" is better than "that" for all cases. I want to teach you how you can benchmark your own use case and discover what works well/best for you.

Next, let's re-run the example with BLAS threads.

Example of Default Number of NumPy (BLAS) Threads

We can update the example to use threads to perform the matrix multiplication via BLAS.

Most BLAS libraries use threads by default, therefore we can simply remove the explicit setting of the OMP_NUM_THREADS environment variable and use the default number of threads.

The complete example with this change is listed below.

# example of loading matrices and multiplying them together (default numpy threads)
from numpy.random import rand
from time import time

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# load data
data1 = [load_data() for _ in range(100)]
data2 = [load_data() for _ in range(100)]
# apply operation to items
results = [operation(item1, item2) for item1,item2 in zip(data1,data2)]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example takes about 15.427 seconds on my system.

That is about 17.476 faster than the single-threaded version or a 2.13x speedup.

Took 15.427 seconds

This is interesting. I have 4 physical CPU cores (8 logical cores), and might have thought that the example would be 4x faster, not 2x faster.

Generating random matrices is slow, perhaps that's it.

Nevertheless, we beat the baseline, but can we do better?

Example of Tuned Number of NumPy (BLAS) Threads

When installing the BLAS library it is probably compiled for the specific hardware and operating system.

As such it may configure a default number of threads based on the number of logical CPU cores in the system. Recall each physical CPU core may be considered 2 logical CPU cores if we have support for hyperthreading, which most CPUs do these days.

You can learn more about physical vs logical CPUs in the tutorial:

Number of CPUs in Python

This means in my system the default number of BLAS threads is probably set to 8. Alternatively, the BLAS library may use an arbitrary number of default threads, such as 24.

We may see a benefit by explicitly setting the number of threads supported by BLAS to be equal to the number of physical CPU cores.

I will use 4 in this case but change this to match the number of physical cores in your system, or try different values and discover what works best.

# example of loading matrices and multiplying them together (optimal numpy threads)
from os import environ
# set threads equal to number of physical cores (with openblas)
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand
from time import time

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# record the start time
start = time()
# load data
data1 = [load_data() for _ in range(100)]
data2 = [load_data() for _ in range(100)]
# apply operation to items
results = [operation(item1, item2) for item1,item2 in zip(data1,data2)]
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example takes about 15.739 seconds on my system.

This is no different from the default, so no win here (in this specific case on my specific hardware).

Took 15.739 seconds

Did you see a better result?

Example of Threading Without BLAS Threads

We can implement-task level parallelism using threading.

Specifically, we can define a task, then issue it to a pool of workers to execute as soon as possible.

We can use the multiprocessing.pool.ThreadPool class to create a pool of worker threads, one worker per physical CPU core in the system. I will use 4 workers in this case, but you can tune it for the number of cores in your system.

You can learn more about the thread pool in the tutorial:

Python ThreadPool: The Complete Guide

The task can be defined as generating two square matrices of random numbers and calculating a matrix multiplication.

Parallelism in the numpy matrix multiplication can be disabled by configuring BLAS to use a single thread.

Tying this together, the complete example is listed below.

# example with threading and no numpy threads
from os import environ
# turn off threads in numpy (with openblas)
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
from multiprocessing.pool import ThreadPool
from time import time

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# function that defines a single task
def task():
    # load items
    item1 = load_data()
    item2 = load_data()
    # perform operation
    result = operation(item1, item2)
    return result

# record the start time
start = time()
# create thread pool
with ThreadPool(4) as pool:
    # issue tasks and gather results
    results = [pool.apply_async(task) for _ in range(100)]
    # close the thread pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example takes about 9.640 seconds.

That is about 5.787 seconds faster than using BLAS multithreading or a speedup of about 1.60x. It is about 23.263 seconds faster than the single-threaded version or about a 3.41x speedup.

Took 9.640 seconds

This is closer to the 4x speed-up over the single-threaded version I would expect with 4 physical CPU cores. For example, 33.1 * 0.25 = 8.275. The discrepancy is probably overhead in the ThreadPool and related.

Perhaps in this scenario, or just in this contrived version of the use case, task-level parallelism is a better approach than operation-level parallelism, at least with the configurations tested so far.

Example of Threading And BLAS Threads

Finally, we can try to combine task and operation-level parallelism.

That is, parallelism with Python threads and parallelism of the matrix operation in NumPy via threads in the BLAS library implementation.

This is tricky as the threads, at the operating system level, will likely step on each other.

In this case, we will use the default number of BLAS threads and configure the ThreadPool to use 2 workers. After testing on my system, this seemed to give better performance than other configurations, e.g. 2 BLAS threads and 2 workers, and other configs.

Try a suite of configurations on your system based on the number of CPU cores and see what works.

Recall, setting BLAS to 1 thread and the pool to one worker is equivalent to the baseline example (no threads) performed above, although will be slightly slower given the overhead of managing tasks by a worker in the pool.

The complete example is listed below.

# example with thread pool and numpy threads
from numpy.random import rand
from multiprocessing.pool import ThreadPool
from time import time

# function for creating a test matrix
def load_data(n=2000):
    # square matrix of random floats
    return rand(n, n)

# function for performing operation on matrices
def operation(item1, item2):
    # matrix multiplication
    return item1.dot(item2)

# function that defines a single task
def task():
    # load items
    item1 = load_data()
    item2 = load_data()
    # perform operation
    result = operation(item1, item2)
    return result

# record the start time
start = time()
# create thread pool
with ThreadPool(2) as pool:
    # issue tasks and gather results
    results = [pool.apply_async(task) for _ in range(100)]
    # close the thread pool
    pool.close()
    # wait for tasks to complete
    pool.join()
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

In this case, the example took about 11.199 seconds to complete on my system.

This is approximately equivalent to only using the ThreadPool and no BLAS threads.

Took 11.199 seconds

Comparison of Results

We have tested a number of instances of the case study with what we have termed operation-level parallelism and task-level parallelism.

The table below summarizes the cases tested and their wall clock time, as well as speed-up, compared to the single-threaded version.

Threading | Numpy/BLAS Threads | Time (sec) | Speed-up
--------------------------------------------------------------
No        | No                 | 32.903     | -
No        | Yes (default)      | 15.427     | 2.13x
No        | Yes (tailored)     | 15.739     | 2.09x
Yes       | No                 | 9.640      | 3.41x
Yes       | Yes (default)      | 11.199     | 2.94x

In this case (the specific number and size of data) on my system (hardware and software), it seems that task-level parallelism via threading was faster overall.

That is, turn off BLAS threads and use a ThreadPool to parallelize the tasks themselves rather than task functions.

So what should you do in your situation?

Test.

If you are using NumPy, take a close look at the functions to see if they are calling down into BLAS. If you're not sure, check this tutorial:

Which NumPy Functions Are Multithreaded

If you're still not sure, contrive a test with one BLAS thread and multiple BLAS threads and compare the performance.

Then, test your use case with different levels of parallelism, such as those listed above.

Use the code examples above as templates.

Don't assume what approach will be fastest, devise experiments and discover what approach is fastest with your data on your hardware.

You can always go further too.

Loading data could be parallel.
Generating data can be made parallel.
Using alternative numpy functions can use more or less of BLAS and LAPACK functions and in turn achieve more or less parallelism.
Issuing calls down to BLAS in batch might be possible in some cases, e.g. see the numpy.linalg.multi_dot() function.
Custom operations can be defined down in C/FORTRAN in third-party code to use native threads then called from Python.

Takeaways

You now know how to combine and test threading and numpy parallelism in order to achieve the best performance.

If you enjoyed this tutorial, you will love my book: Concurrent NumPy in Python. It covers everything you need to master the topic with hands-on examples and clear explanations.