You can **add parallelism to a numpy program** to improve its performance.

Nevertheless, there is a danger in naively adding parallelism to a program that uses numpy. This is for many reasons, such as some numpy math functions that are already multithreaded, not knowing how to configure multithreaded numpy functions, using the wrong type of parallelism, and **missing opportunities to improve the performance** of numpy math functions.

In this tutorial, you will discover the cost of naively adding parallelism to numpy programs in Python.

Let’s get started.

Table of Contents

## Cost of Implementing Parallelism Naively with Numpy

Numpy is a popular library for efficiently working with arrays of data.

In addition to working with arrays of data directly, it is a fundamental building block for many other scientific computing libraries.

As such, many modern Python programs make use of numpy directly or indirectly.

It is important that we make good or best use of our available CPU cores when using numpy so that our numpy programs run as fast as possible.

Parallelism allows us to make effective use of available hardware by executing two or more tasks simultaneously.

Numpy supports parallelisms automatically behind the scenes. This is achieved using efficient third-party libraries to implement common mathematical operations.

You can learn more about this in the tutorial:

The built-in parallelisms in numpy can conflict with parallelisms in a Python application that may use threads or processes. Nevertheless, a careful combination of built-in parallelism and the addition of threads or processes can often result in better overall performance.

As such, neglecting parallelism when using numpy can result in worse-than-expected performance. Similarly, naively implementing parallelism can result in dramatically worse performance.

There are five problems that you may suffer if you do not carefully consider parallelism when using numpy.

They are:

- Cripple math operations that are already multithreaded.
- Miss opportunities to tune inherently multithreaded math operations.
- Choose the wrong type of parallelism when developing your numpy programs.
- Miss opportunities to improve the performance of a math operation.
- Achieve much worse performance when sharing numpy array data within your application.

Let’s take a look at each concern in turn.

Run your loops using all CPUs, download my FREE book to learn how.

## Cost #1: Cripple Already Parallel Operations

Many math operations in numpy are already multithreaded.

Adding concurrency with Python threads or processes to math operations that are already multithreaded can result in significantly worse performance.

NumPy supports many linear algebra functions with vectors and arrays by calling into efficient third-party libraries such as BLAS and LAPACK.

BLAS and LAPACK are specifications for efficient vector, matrix, and linear algebra operations that are implemented using third-party libraries, such as MKL and OpenBLAS. Many of the operations in these libraries are multithreaded.

You can learn more about NumPy and BLAS in the tutorial:

Examples of multithreaded numpy functions include matrix multiplication via **numpy.dot()** and matrix decomposition like SVD via **numpy.linalg.svd()**. These functions call the BLAS library API, the implementation of which (such as OpenBLAS) is likely implemented in C or FORTRAN and uses threads.

As such, some NumPy functions support parallelism automatically.

You can learn more about which numpy functions are multithreaded in the tutorial:

We may naively use thread-based concurrency via the **threading** module or process-based concurrency via the **multiprocessing** module to speed up our numpy programs. If the operation we are attempting to parallelize is already multithreaded, such as dot(), then we will likely achieve worse performance by our efforts rather than better performance.

For example, consider the case where we have many pairs of matrices and we want to perform a math operation on each pair, such as multiplying one matrix by another.

1 2 3 4 5 |
... # iterate over pairs of matrices for matrix_a, matrix_b in matrices: # matrix multiplication result = matrix.dot(matrix_b) |

We might want to speed up this overall task by executing each matrix multiplication in a separate thread.

This means we will be performing task-level parallelism where multiple matrix multiplications will be occurring in parallel, and we will be performing operation-level parallelism, where each matrix multiplication is implemented using a multithreaded algorithm.

This would likely result in worse performance than allowing parallelism only in the task or only in the operation.

This is because the threads at each level will step on each other.

If we attempt to multithread multiple instances of a multithreaded operation, then the added threads will compete for resources and cause the overall task to take longer.

The operating system, which manages the underlying threads, will context switch between each thread, which adds overhead, which could otherwise be avoided.

## Cost #2: Miss Opportunity to Tune Performance

As we have seen, many numpy math operations are inherently parallel.

This is because they are implemented by a third-party library using a multithreaded algorithm.

The third-party library will use a default number of threads to perform the multithreaded math operation.

This is good in most cases, but not always.

In some cases, we can achieve better performance for a multithreaded math operation by configuring the number of threads used by the third-party library.

This may be for many reasons, such as:

- The default number of threads is greater than the number of physical CPU cores available in the system.
- Our program is already multi-threaded.
- The specific operation performs better when the number of threads is set to be one less than the number of physical CPU cores.

For example, we can see better performance in some cases when the number of threads used in multithreaded matrix multiplication is configured to use one minus the number of physical CPU cores.

You can see an example of this in the tutorial:

This requires firstly that the specific numpy mathematical operation we are using is multithreaded.

You can learn more about which numpy functions are multithreaded in the tutorial:

It then requires testing different numbers of threads used by the numpy multithreaded math operation.

You can learn more about how to configure the number of numpy BLAS/LAPACK threads in the tutorial:

Not knowing about the multithreaded math operations or how to tune may mean that you miss the opportunity for better performance in your application.

**Free Concurrent NumPy Course**

Get FREE access to my 7-day email course on concurrent NumPy.

Discover how to configure the number of BLAS threads, how to execute NumPy tasks faster with thread pools, and how to share arrays super fast.

## Cost #3: Choose the Wrong Type of Parallelism

When using numpy in our program, we may decide to speed up tasks by executing them concurrently.

We may seek to do this using threads via the **threading** module, processes via the **multiprocessing** module, or coroutines via the **asyncio** module.

In fact, because we know the tasks we want to execute in parallel are CPU-intensive, we may choose process-based concurrency. This is because Python processes achieve full parallelism, whereas Python threads are limited by the global interpreter lock (GIL) and are better suited to IO-bound tasks.

You can learn more about this general recommendation in the guide:

However, this general recommendation is not appropriate when using numpy.

Numpy performs most math operations like matrix arithmetic and more advanced operations using its own C functions. These functions release the global interpreter lock (GIL) in Python, allowing other Python threads to run.

The implication is that math operations on arrays in numpy can be executed in parallel using Python threads.

You can learn more about numpy releasing the GIL in the tutorial:

The benefit of using threads for parallelisms is that threads have shared memory. We are able to share numpy arrays directly between threads easily and at no cost.

This is unlike process-based concurrency where sharing numpy arrays is computationally expensive or requires complex workarounds.

You can learn more about this in the tutorial:

As such, naively choosing processes and multiprocessing to execute numpy tasks will add unnecessary complexity and computational cost.

You can see an example of this in the tutorial:

Instead, with knowledge of how numpy and the GIL interact, we can choose threads upfront and save a lot of time.

A great example of the speed-up we can achieve with threads is in calculating element-wise math operations in parallel.

You can see an example of this in the tutorial:

**Overwheled by the python concurrency APIs?**

Find relief, download my FREE Python Concurrency Mind Maps

## Cost #4: Miss Opportunity to Improve Performance

As we have seen, many numpy math functions are implemented using multithreaded algorithms.

We may naively think that this is enough.

Nevertheless, in some cases, we might achieve better performance by disabling the multithreaded algorithm math operation and parallelizing an application at the task level.

For example, we may need to perform a matrix multiplication between many pairs of matrices.

Because the matrix multiplication math operation is implemented using a multithreaded algorithm, we don’t need to consider any further parallelism.

Instead, with testing, we may discover that executing multiple matrix multiplication tasks in parallel results in better overall performance.

We can see an example of this in the tutorial:

Naively assuming that multithreaded numpy math operations are sufficient may mean that you miss opportunities for further improving performance.

## Cost #5: Worse Performance When Sharing Data

As we have seen, if we didn’t know that numpy releases the global interpreter lock, we may choose to use process-based concurrency via the **multiprocessing** module to speed up numpy programs.

A problem with this choice is that sharing numpy data between processes is slow because of inter-process communication.

You can learn more about this in the tutorial:

For example, we might choose to share a numpy array between processes by passing it as an argument, via a **Queue** or **Pipe**.

In these cases, a copy of the numpy array will be created which will be pickled by the sending process and unpickled by the receiving process.

You can see an example of this in the tutorial:

Knowing that sharing numpy arrays via inter-process communication requires some prior knowledge of how processes work in Python.

Nevertheless, sometimes we may want or need to use multiprocessing in our numpy programs.

As such, we may require more efficient ways of sharing numpy arrays between processes. Methods that avoid copying and serializing numpy arrays.

Examples include:

- Inheriting numpy arrays.
- Sharing numpy arrays backed by a RawArray or SharedMemory buffer.
- Hosting an array in a server process and sharing proxy objects.

You can learn more about these alternate and more efficient ways of sharing numpy arrays between processes in the tutorial:

## Further Reading

This section provides additional resources that you may find helpful.

**Books**

- Concurrent NumPy in Python, Jason Brownlee, 2023 (
**my book**).

**Concurrent NumPy Guides**

- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes

**Concurrent NumPy Documentation**

- Parallel Programming with numpy and scipy, SciPi Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API

**NumPy API**

**Python Concurrency APIs**

- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks

## Takeaways

You now know about the cost of naively adding parallelism to numpy programs.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Photo by Belinda Fewings on Unsplash

## Do you have any questions?