You can **combine BLAS threads with threading** in NumPy programs.

Maximizing these types of parallelism can help you fully utilize your CPU cores for a given application and achieve a speed-up compared to not using any or only one level of parallelism. It can also be tricky as sometimes combining threading with NumPy BLAS threads can result in no benefit or even worse performance.

In this tutorial, you will discover how to combine and test threading and numpy parallelism in order to achieve the best performance.

Let’s get started.

Table of Contents

## Parallelism in Python and NumPy

Python supports concurrency using threads via the **threading** module.

Work can be partitioned into subtasks and each subtask can be executed in a separate Python thread in parallel. This only applies if the tasks release the global interpreter lock (GIL), otherwise, the threads will not execute in parallel.

Importantly, the GIL is released by numpy when calling into C functions, such as performing operations on numpy arrays.

Therefore, Python threads can be used to execute most numpy functions in parallel.

Parallelism is supported by Python processes, but the overhead of having to transmit numpy arrays between processes often results in worse performance than the single process version of the same operations.

You can learn more about parallelism in Python using the threading module in the guide:

NumPy supports parallelism for some operations that use third-party libraries such as BLAS and LAPACK.

Examples include matrix multiplication via **numpy.dot()** and matrix decomposition like SVD via **numpy.linalg.svd()**. These functions call the BLAS and LAPACK library APIs, the implementation of which (such as OpenBLAS) is likely implemented in C or FORTRAN and uses threads.

As such, some NumPy functions support parallelism automatically.

You can learn more about BLAS in numpy programs in the tutorial:

This raises a question: which type of parallelism should we use in a given program?

Should we rely on numpy parallelism or should we split our operation into subtasks and execute them with Python threads?

## Case Study Parallelism With Matrix Data

Consider a situation where we load data as numpy arrays and need to process each array in some way.

Assume that the data is numerical and that the operation is a NumPy function that calls down to BLAS and can leverage threads.

For example:

- Perhaps we need to load image data, video data, or numerical data as arrays.
- Perhaps each array needs a matrix operation applied, such as fit a model or compute a decomposition.

We can simplify this to a contrived situation that involves creating a square matrix of random numbers and multiplying the matrix with another matrix.

**How should we structure the program to best take advantage of parallelism on one machine?**

Put another way:

**How can we use all CPU cores in this use case?**

I would argue that we cannot know. We must discover what works well or best for a specific program. It depends on the number of matrices, the size of the matrices, and the complexity of the operation to be performed.

There are perhaps 4 main approaches to test, they are:

**No threading and no BLAS threads**. This is the default or naive approach that assumes we have a single CPU core. Whatever we do should be better than this case, but we need to know what the worst case is so we can beat it.**No threading and use BLAS threads**. This delegates all parallelism to BLAS, which really means that we only get parallelism for the operation on the loaded data. This is a naive implementation and will do better than the above baseline. We could attempt to tweak it by specifying the number of threads to create to match the number of physical rather than logical CPU cores. This can help sometimes.**Threading with no BLAS threads**. This turns off the parallelism in BLAS and focuses efforts on parallelism at a high level in the application, the task level.**Threading with BLAS threads**. This combines parallelism at the task level using threading and at the operation level using numpy with BLAS. The risk is that the threads may step on each other, fighting for execution time and threading the operating system with excessive context switching.

If you are faced with the situation described above (and are running on one multi-core machine), I’d recommend testing each of these general approaches to parallelism and discovering what works well or best on your system.

Next, let’s explore each of these cases with a simplified version of the case study.

## Example of No Threads

The first case is the baseline.

This case involves not using task-level parallelism via the threading module and not using threads in BLAS.

We can configure the number of threads used by the BLAS library implementation via an environment variable, such as the **OMP_NUM_THREADS** if you have OpenBLAS installed (the most common).

You can find what BLAS library is installed by calling **numpy.show_config()** then look up the API documentation for that library to check the environment variable to set.

An environment variable can be set prior to running the Python program, or from within Python via the **os.environ()** function prior to importing numpy.

You can learn more about configuring the number of threads used by BLAS in the tutorial:

The example below disables NumPy BLAS multithreading. It then creates two lists of 100 matrices. Each matrix is a square 2,000 x 2,000 matrix of random floating point values.

An operation is then applied to each pair of 100 matrices, in this case, a simple matrix multiplication.

This is a contrived test case, but we can see how creating random matrices could be replaced with loading datasets or images, and how the matrix multiplication operation could be applied with some relevant operation required on the loaded data supported by NumPy that calls down into BLAS.

The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
# example of loading matrices and multiplying them together (no threads) from os import environ # turn off threads in numpy (with openblas) environ['OMP_NUM_THREADS'] = '1' from time import time from numpy.random import rand # function for creating a test matrix def load_data(n=2000): # square matrix of random floats return rand(n, n) # function for performing operation on matrices def operation(item1, item2): # matrix multiplication return item1.dot(item2) # record the start time start = time() # load data data1 = [load_data() for _ in range(100)] data2 = [load_data() for _ in range(100)] # apply operation to items results = [operation(item1, item2) for item1,item2 in zip(data1,data2)] # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example on my system takes about 32.903 seconds.

1 |
Took 32.903 seconds |

We could improve the benchmarking by running the example many times and taking the average time.

This could be further automated using a module like timeit. I leave this as an exercise for the reader.

Here, the numbers are directional rather than definitive.

I don’t want to give hard numbers for “this” is better than “that” for all cases. I want to teach you how you can benchmark your own use case and discover what works well/best for you.

Next, let’s re-run the example with BLAS threads.

## Example of Default Number of NumPy (BLAS) Threads

We can update the example to use threads to perform the matrix multiplication via BLAS.

Most BLAS libraries use threads by default, therefore we can simply remove the explicit setting of the **OMP_NUM_THREADS** environment variable and use the default number of threads.

The complete example with this change is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
# example of loading matrices and multiplying them together (default numpy threads) from numpy.random import rand from time import time # function for creating a test matrix def load_data(n=2000): # square matrix of random floats return rand(n, n) # function for performing operation on matrices def operation(item1, item2): # matrix multiplication return item1.dot(item2) # record the start time start = time() # load data data1 = [load_data() for _ in range(100)] data2 = [load_data() for _ in range(100)] # apply operation to items results = [operation(item1, item2) for item1,item2 in zip(data1,data2)] # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example takes about 15.427 seconds on my system.

That is about 17.476 faster than the single-threaded version or a 2.13x speedup.

1 |
Took 15.427 seconds |

This is interesting. I have 4 physical CPU cores (8 logical cores), and might have thought that the example would be 4x faster, not 2x faster.

Generating random matrices is slow, perhaps that’s it.

Nevertheless, we beat the baseline, but can we do better?

### Example of Tuned Number of NumPy (BLAS) Threads

When installing the BLAS library it is probably compiled for the specific hardware and operating system.

As such it may configure a default number of threads based on the number of logical CPU cores in the system. Recall each physical CPU core may be considered 2 logical CPU cores if we have support for hyperthreading, which most CPUs do these days.

You can learn more about physical vs logical CPUs in the tutorial:

This means in my system the default number of BLAS threads is probably set to 8. Alternatively, the BLAS library may use an arbitrary number of default threads, such as 24.

We may see a benefit by explicitly setting the number of threads supported by BLAS to be equal to the number of physical CPU cores.

I will use 4 in this case but change this to match the number of physical cores in your system, or try different values and discover what works best.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
# example of loading matrices and multiplying them together (optimal numpy threads) from os import environ # set threads equal to number of physical cores (with openblas) environ['OMP_NUM_THREADS'] = '4' from numpy.random import rand from time import time # function for creating a test matrix def load_data(n=2000): # square matrix of random floats return rand(n, n) # function for performing operation on matrices def operation(item1, item2): # matrix multiplication return item1.dot(item2) # record the start time start = time() # load data data1 = [load_data() for _ in range(100)] data2 = [load_data() for _ in range(100)] # apply operation to items results = [operation(item1, item2) for item1,item2 in zip(data1,data2)] # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example takes about 15.739 seconds on my system.

This is no different from the default, so no win here (in this specific case on my specific hardware).

1 |
Took 15.739 seconds |

Did you see a better result?

**Overwheled by the python concurrency APIs?**

Find relief, download my FREE Python Concurrency Mind Maps

## Example of Threading Without BLAS Threads

We can implement-task level parallelism using threading.

Specifically, we can define a task, then issue it to a pool of workers to execute as soon as possible.

We can use the **multiprocessing.pool.ThreadPool** class to create a pool of worker threads, one worker per physical CPU core in the system. I will use 4 workers in this case, but you can tune it for the number of cores in your system.

You can learn more about the thread pool in the tutorial:

The task can be defined as generating two square matrices of random numbers and calculating a matrix multiplication.

Parallelism in the numpy matrix multiplication can be disabled by configuring BLAS to use a single thread.

Tying this together, the complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
# example with threading and no numpy threads from os import environ # turn off threads in numpy (with openblas) environ['OMP_NUM_THREADS'] = '1' from numpy.random import rand from multiprocessing.pool import ThreadPool from time import time # function for creating a test matrix def load_data(n=2000): # square matrix of random floats return rand(n, n) # function for performing operation on matrices def operation(item1, item2): # matrix multiplication return item1.dot(item2) # function that defines a single task def task(): # load items item1 = load_data() item2 = load_data() # perform operation result = operation(item1, item2) return result # record the start time start = time() # create thread pool with ThreadPool(4) as pool: # issue tasks and gather results results = [pool.apply_async(task) for _ in range(100)] # close the thread pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

Running the example takes about 9.640 seconds.

That is about 5.787 seconds faster than using BLAS multithreading or a speedup of about 1.60x. It is about 23.263 seconds faster than the single-threaded version or about a 3.41x speedup.

1 |
Took 9.640 seconds |

This is closer to the 4x speed-up over the single-threaded version I would expect with 4 physical CPU cores. For example, 33.1 * 0.25 = 8.275. The discrepancy is probably overhead in the **ThreadPool** and related.

Perhaps in this scenario, or just in this contrived version of the use case, task-level parallelism is a better approach than operation-level parallelism, at least with the configurations tested so far.

## Example of Threading And BLAS Threads

Finally, we can try to combine task and operation-level parallelism.

That is, parallelism with Python threads and parallelism of the matrix operation in NumPy via threads in the BLAS library implementation.

This is tricky as the threads, at the operating system level, will likely step on each other.

In this case, we will use the default number of BLAS threads and configure the ThreadPool to use 2 workers. After testing on my system, this seemed to give better performance than other configurations, e.g. 2 BLAS threads and 2 workers, and other configs.

Try a suite of configurations on your system based on the number of CPU cores and see what works.

Recall, setting BLAS to 1 thread and the pool to one worker is equivalent to the baseline example (no threads) performed above, although will be slightly slower given the overhead of managing tasks by a worker in the pool.

The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
# example with thread pool and numpy threads from numpy.random import rand from multiprocessing.pool import ThreadPool from time import time # function for creating a test matrix def load_data(n=2000): # square matrix of random floats return rand(n, n) # function for performing operation on matrices def operation(item1, item2): # matrix multiplication return item1.dot(item2) # function that defines a single task def task(): # load items item1 = load_data() item2 = load_data() # perform operation result = operation(item1, item2) return result # record the start time start = time() # create thread pool with ThreadPool(2) as pool: # issue tasks and gather results results = [pool.apply_async(task) for _ in range(100)] # close the thread pool pool.close() # wait for tasks to complete pool.join() # calculate and report duration duration = time() - start print(f'Took {duration:.3f} seconds') |

In this case, the example took about 11.199 seconds to complete on my system.

This is approximately equivalent to only using the ThreadPool and no BLAS threads.

1 |
Took 11.199 seconds |

## Comparison of Results

We have tested a number of instances of the case study with what we have termed operation-level parallelism and task-level parallelism.

The table below summarizes the cases tested and their wall clock time, as well as speed-up, compared to the single-threaded version.

1 2 3 4 5 6 7 |
Threading | Numpy/BLAS Threads | Time (sec) | Speed-up -------------------------------------------------------------- No | No | 32.903 | - No | Yes (default) | 15.427 | 2.13x No | Yes (tailored) | 15.739 | 2.09x Yes | No | 9.640 | 3.41x Yes | Yes (default) | 11.199 | 2.94x |

In this case (the specific number and size of data) on my system (hardware and software), it seems that task-level parallelism via threading was faster overall.

That is, turn off BLAS threads and use a **ThreadPool** to parallelize the tasks themselves rather than task functions.

So what should you do in your situation?

Test.

If you are using NumPy, take a close look at the functions to see if they are calling down into BLAS. If you’re not sure, check this tutorial:

If you’re still not sure, contrive a test with one BLAS thread and multiple BLAS threads and compare the performance.

Then, test your use case with different levels of parallelism, such as those listed above.

Use the code examples above as templates.

Don’t assume what approach will be fastest, devise experiments and discover what approach is fastest with your data on your hardware.

You can always go further too.

- Loading data could be parallel.
- Generating data can be made parallel.
- Using alternative numpy functions can use more or less of BLAS and LAPACK functions and in turn achieve more or less parallelism.
- Issuing calls down to BLAS in batch might be possible in some cases, e.g. see the
**numpy.linalg.multi_dot()**function. - Custom operations can be defined down in C/FORTRAN in third-party code to use native threads then called from Python.

## Takeaways

You now know how to combine and test threading and numpy parallelism in order to achieve the best performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

## Leave a Reply