Last Updated on September 29, 2023
You can add parallelism to a numpy program to improve its performance.
Nevertheless, there is a danger in naively adding parallelism to a program that uses numpy. There are many reasons for this, such as crippling numpy math functions that are already multithreaded, not knowing how to configure those multithreaded functions, choosing the wrong type of parallelism, and missing opportunities to improve the performance of numpy math operations.
In this tutorial, you will discover the cost of naively adding parallelism to numpy programs in Python.
Let’s get started.
Cost of Implementing Parallelism Naively with Numpy
Numpy is a popular library for efficiently working with arrays of data.
In addition to working with arrays of data directly, it is a fundamental building block for many other scientific computing libraries.
As such, many modern Python programs make use of numpy directly or indirectly.
It is important that we make the best use of our available CPU cores when using numpy so that our numpy programs run as fast as possible.
Parallelism allows us to make effective use of available hardware by executing two or more tasks simultaneously.
Numpy supports parallelism automatically behind the scenes. This is achieved by using efficient third-party libraries to implement common mathematical operations.
You can learn more about this in the tutorial:
The built-in parallelism in numpy can conflict with parallelism in a Python application that uses threads or processes. Nevertheless, a careful combination of the built-in parallelism and added threads or processes can often result in better overall performance.
As such, neglecting parallelism when using numpy can result in worse-than-expected performance. Similarly, naively implementing parallelism can result in dramatically worse performance.
There are five problems you may encounter if you do not carefully consider parallelism when using numpy.
They are:
- Cripple math operations that are already multithreaded.
- Miss opportunities to tune inherently multithreaded math operations.
- Choose the wrong type of parallelism when developing your numpy programs.
- Miss opportunities to improve the performance of a math operation.
- Achieve much worse performance when sharing numpy array data within your application.
Let’s take a look at each concern in turn.
Cost #1: Cripple Already Parallel Operations
Many math operations in numpy are already multithreaded.
Adding concurrency with Python threads or processes to math operations that are already multithreaded can result in significantly worse performance.
NumPy supports many linear algebra functions with vectors and arrays by calling into efficient third-party libraries such as BLAS and LAPACK.
BLAS and LAPACK are specifications for efficient vector, matrix, and linear algebra operations that are implemented using third-party libraries, such as MKL and OpenBLAS. Many of the operations in these libraries are multithreaded.
You can learn more about NumPy and BLAS in the tutorial:
Examples of multithreaded numpy functions include matrix multiplication via numpy.dot() and matrix decomposition like SVD via numpy.linalg.svd(). These functions call the BLAS library API, the implementation of which (such as OpenBLAS) is likely implemented in C or FORTRAN and uses threads.
As such, some NumPy functions support parallelism automatically.
You can learn more about which numpy functions are multithreaded in the tutorial:
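As a quick check, numpy can report which BLAS/LAPACK libraries it is linked against (the exact output depends on how numpy was installed):

```python
# report the BLAS/LAPACK libraries this numpy installation was built with
import numpy as np
np.show_config()
```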
We may naively use thread-based concurrency via the threading module or process-based concurrency via the multiprocessing module to speed up our numpy programs. If the operation we are attempting to parallelize is already multithreaded, such as dot(), then our efforts will likely make performance worse rather than better.
For example, consider the case where we have many pairs of matrices and we want to perform a math operation on each pair, such as multiplying one matrix by another.
```python
...
# iterate over pairs of matrices
for matrix_a, matrix_b in matrices:
    # matrix multiplication
    result = matrix_a.dot(matrix_b)
```
We might want to speed up this overall task by executing each matrix multiplication in a separate thread.
This means we would be performing task-level parallelism, where multiple matrix multiplications occur in parallel, on top of operation-level parallelism, where each matrix multiplication is itself implemented using a multithreaded algorithm.
This would likely result in worse performance than allowing parallelism only in the task or only in the operation.
This is because the threads at each level will contend with one another.
If we run multiple instances of a multithreaded operation in our own threads, the added threads will compete for the same CPU cores, causing the overall task to take longer.
The operating system, which manages the underlying threads, must context switch between them, adding overhead that could otherwise be avoided.
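For example, a minimal sketch of this problematic pattern might look like the following, assuming a numpy installation backed by a multithreaded BLAS such as OpenBLAS:

```python
# a sketch of nested parallelism: task-level Python threads on top of
# operation-level BLAS threads (assumes a multithreaded BLAS install)
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# many pairs of matrices to multiply
matrices = [(np.random.rand(500, 500), np.random.rand(500, 500))
            for _ in range(20)]

def multiply(pair):
    matrix_a, matrix_b = pair
    # dot() is already multithreaded via BLAS, so calling it from many
    # Python threads at once makes the two levels compete for CPU cores
    return matrix_a.dot(matrix_b)

# task-level parallelism stacked on top of operation-level parallelism
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(multiply, matrices))
```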
Cost #2: Miss Opportunity to Tune Performance
As we have seen, many numpy math operations are inherently parallel.
This is because they are implemented by a third-party library using a multithreaded algorithm.
The third-party library will use a default number of threads to perform the multithreaded math operation.
This is good in most cases, but not always.
In some cases, we can achieve better performance for a multithreaded math operation by configuring the number of threads used by the third-party library.
This may be for many reasons, such as:
- The default number of threads is greater than the number of physical CPU cores available in the system.
- Our program is already multi-threaded.
- The specific operation performs better when the number of threads is set to be one less than the number of physical CPU cores.
For example, in some cases, we can see better performance when multithreaded matrix multiplication is configured to use one less than the number of physical CPU cores.
You can see an example of this in the tutorial:
This first requires that the specific numpy mathematical operation we are using is multithreaded.
You can learn more about which numpy functions are multithreaded in the tutorial:
It then requires testing different numbers of threads used by the numpy multithreaded math operation.
You can learn more about how to configure the number of numpy BLAS/LAPACK threads in the tutorial:
Not knowing about the multithreaded math operations or how to tune them may mean that you miss the opportunity for better performance in your application.
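For example, a simple way to test different thread counts is the third-party threadpoolctl package (an assumption here: your numpy is backed by a BLAS library that threadpoolctl can control); alternatively, an environment variable such as OMP_NUM_THREADS can be set before importing numpy. A minimal sketch:

```python
# a minimal sketch of tuning BLAS thread counts with the third-party
# threadpoolctl package (pip install threadpoolctl)
import time
import numpy as np
from threadpoolctl import threadpool_limits

data_a = np.random.rand(2000, 2000)
data_b = np.random.rand(2000, 2000)

# time the same multithreaded operation under different thread limits
for n_threads in [1, 2, 4, 8]:
    with threadpool_limits(limits=n_threads, user_api='blas'):
        start = time.perf_counter()
        data_a.dot(data_b)
        duration = time.perf_counter() - start
    print(f'{n_threads} threads: {duration:.3f} seconds')
```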
Cost #3: Choose the Wrong Type of Parallelism
When using numpy in our program, we may decide to speed up tasks by executing them concurrently.
We may seek to do this using threads via the threading module, processes via the multiprocessing module, or coroutines via the asyncio module.
In fact, because we know the tasks we want to execute in parallel are CPU-intensive, we may choose process-based concurrency. This is because Python processes achieve full parallelism, whereas Python threads are limited by the global interpreter lock (GIL) and are better suited to IO-bound tasks.
You can learn more about this general recommendation in the guide:
However, this general recommendation is not appropriate when using numpy.
Numpy performs most math operations like matrix arithmetic and more advanced operations using its own C functions. These functions release the global interpreter lock (GIL) in Python, allowing other Python threads to run.
The implication is that math operations on arrays in numpy can be executed in parallel using Python threads.
You can learn more about numpy releasing the GIL in the tutorial:
The benefit of using threads for parallelism is that threads share memory. We are able to share numpy arrays directly between threads, easily and without copying.
This is unlike process-based concurrency where sharing numpy arrays is computationally expensive or requires complex workarounds.
You can learn more about this in the tutorial:
As such, naively choosing processes and multiprocessing to execute numpy tasks will add unnecessary complexity and computational cost.
You can see an example of this in the tutorial:
Instead, with knowledge of how numpy and the GIL interact, we can choose threads upfront and save a lot of time.
A great example of the speed-up we can achieve with threads is in calculating element-wise math operations in parallel.
You can see an example of this in the tutorial:
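To illustrate the idea, here is a minimal sketch (the chunking scheme is illustrative) of Python threads performing an element-wise operation on one shared numpy array in place, with no copying:

```python
# a sketch of thread-based parallelism on a shared numpy array:
# each thread squares a slice of the same array in place, no copies
from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.random.rand(10_000_000)

def square_slice(array, start, stop):
    # operate on a view of the shared array; numpy releases the GIL
    # during the element-wise math, so threads can run in parallel
    np.square(array[start:stop], out=array[start:stop])

n_workers = 4
chunk = len(data) // n_workers
with ThreadPoolExecutor(max_workers=n_workers) as executor:
    for i in range(n_workers):
        stop = len(data) if i == n_workers - 1 else (i + 1) * chunk
        executor.submit(square_slice, data, i * chunk, stop)
# the pool waits for all threads; data was updated in place throughout
```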
Cost #4: Miss Opportunity to Improve Performance
As we have seen, many numpy math functions are implemented using multithreaded algorithms.
We may naively think that this is enough.
Nevertheless, in some cases, we might achieve better performance by disabling the multithreaded algorithm within each math operation and instead parallelizing the application at the task level.
For example, we may need to perform a matrix multiplication between many pairs of matrices.
Because the matrix multiplication operation is implemented using a multithreaded algorithm, we may assume that no further parallelism needs to be considered.
Instead, with testing, we may discover that executing multiple matrix multiplication tasks in parallel results in better overall performance.
We can see an example of this in the tutorial:
Naively assuming that multithreaded numpy math operations are sufficient may mean that you miss opportunities for further improving performance.
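As a sketch of this idea (again assuming the third-party threadpoolctl package), we might restrict each operation to a single BLAS thread and parallelize across the pairs of matrices instead:

```python
# a sketch: disable operation-level BLAS threads and parallelize at
# the task level across matrix pairs instead (assumes threadpoolctl)
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from threadpoolctl import threadpool_limits

matrices = [(np.random.rand(500, 500), np.random.rand(500, 500))
            for _ in range(20)]

def multiply(pair):
    matrix_a, matrix_b = pair
    return matrix_a.dot(matrix_b)

# each dot() now uses a single BLAS thread...
with threadpool_limits(limits=1, user_api='blas'):
    # ...while one task runs per matrix pair in parallel
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(multiply, matrices))
```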
Cost #5: Worse Performance When Sharing Data
As we have seen, if we do not know that numpy releases the global interpreter lock, we may choose process-based concurrency via the multiprocessing module to speed up numpy programs.
A problem with this choice is that sharing numpy data between processes is slow because of inter-process communication.
You can learn more about this in the tutorial:
For example, we might choose to share a numpy array between processes by passing it as an argument to the child process, or via a Queue or Pipe.
In these cases, the numpy array will be pickled by the sending process and unpickled by the receiving process, creating a full copy of the array.
You can see an example of this in the tutorial:
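A minimal sketch of this slow path might look like the following, where the array sent through the Queue is pickled by the parent and unpickled as a copy in the child:

```python
# a sketch of the slow path: a numpy array sent via a Queue is pickled
# by the sending process and unpickled as a full copy by the receiver
from multiprocessing import Process, Queue
import numpy as np

def worker(queue):
    array = queue.get()  # a full copy of the array is unpickled here
    print(array.sum())

if __name__ == '__main__':
    data = np.random.rand(1_000_000)
    queue = Queue()
    process = Process(target=worker, args=(queue,))
    process.start()
    queue.put(data)  # the array is pickled and copied via IPC
    process.join()
```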
Knowing that sharing numpy arrays via inter-process communication is slow requires some prior knowledge of how processes work in Python.
Nevertheless, sometimes we may want or need to use multiprocessing in our numpy programs.
As such, we may require more efficient ways of sharing numpy arrays between processes: methods that avoid copying and serializing the arrays.
Examples include:
- Inheriting numpy arrays.
- Sharing numpy arrays backed by a RawArray or SharedMemory buffer.
- Hosting an array in a server process and sharing proxy objects.
You can learn more about these alternate and more efficient ways of sharing numpy arrays between processes in the tutorial:
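For example, here is a minimal sketch of the SharedMemory approach (Python 3.8+), where the parent and child processes operate on the same underlying buffer without copying or pickling the array data:

```python
# a sketch: back a numpy array with a SharedMemory buffer so that a
# child process can modify it in place, without copying or pickling
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory
import numpy as np

def worker(name, shape, dtype):
    shm = SharedMemory(name=name)  # attach to the existing block
    array = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    array += 1.0  # update the shared data in place
    shm.close()

if __name__ == '__main__':
    shm = SharedMemory(create=True, size=100 * 8)  # 100 float64 values
    data = np.ndarray((100,), dtype=np.float64, buffer=shm.buf)
    data[:] = 0.0
    child = Process(target=worker, args=(shm.name, data.shape, data.dtype))
    child.start()
    child.join()
    print(data[:5])  # reflects the child's in-place update
    shm.close()
    shm.unlink()
```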
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know about the cost of naively adding parallelism to numpy programs.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Belinda Fewings on Unsplash