Parallelism is an important consideration when using numpy.

Numpy is perhaps the most common Python library for working with arrays of numbers. It is widely used directly, such as for mathematical and linear algebra operations, and it is the basis for many other scientific Python libraries.

Parallelism allows Python programs to make the best use of available hardware by performing tasks simultaneously. Numpy supports parallelism by default behind the scenes for many linear algebra operations. Additionally, the performance of many numpy operations can be dramatically improved by the careful addition of parallelism.

Unfortunately, the use of the wrong type of parallelism or the naive sharing of numpy arrays between Python processes can result in dramatically worse performance.

As such, parallelism must be considered carefully before being added to a numpy program.

In this tutorial, you will discover the importance of parallelism when using numpy.

Let’s get started.

## 5 Reasons We Need to Worry About Parallelism for NumPy

Numpy is a popular library for efficiently working with arrays of data.

It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

— NumPy documentation

In addition to working with arrays of data directly, it is a fundamental building block for many other scientific computing libraries.

This includes libraries such as:

- SciPy for scientific computing.
- Pandas for working with data.
- Matplotlib for plotting data.
- Scikit-learn for modeling.
- And many more.

As such, many modern Python programs make use of numpy directly or indirectly.

It is important that we make good use of our available CPU cores when using numpy so that our programs run as fast as possible.

Parallelism allows us to make effective use of available hardware by executing two or more tasks simultaneously.

Numpy supports parallelism automatically behind the scenes. This is achieved using efficient third-party libraries to implement common mathematical operations.

The built-in parallelism in numpy can conflict with parallelism in a Python application that may use threads or processes. Nevertheless, a careful combination of built-in parallelism and the addition of threads or processes can often result in better overall performance.

As such, neglecting parallelism when using numpy can result in worse-than-expected performance. Similarly, naively implementing parallelism can result in dramatically worse performance.

There are 5 reasons why we need to carefully consider parallelism when using numpy in our Python programs.

They are:

- Load/Save NumPy Arrays Data Concurrently
- Calculate Math Functions Efficiently in Parallel
- Initialize Arrays Efficiently in Parallel
- Perform Element-Wise Math on Arrays in Parallel
- Share NumPy Array Data Efficiently Between Processes

Let’s take a look at each concern in more detail.

## Load/Save NumPy Arrays Data Concurrently

A common task when using numpy is loading data into arrays.

Specifically, we may have data stored in files, such as CSV files or images, or in a database as tables.

We need to load the data into memory and represent it as numpy arrays so that we can perform mathematical operations upon it or use it with a library that requires data in numpy array format.

Naively, we can load one data file at a time, such as one CSV file, one image, or one table at a time.

This can be painfully slow if we have thousands or millions of files to load.

We can use parallelism to load data into numpy arrays simultaneously. For example, we might be able to load tens of files into memory concurrently from the same hard disk or database source, or thousands of files concurrently from remote servers.

This can offer a dramatic speed-up compared to loading files one by one.

Similarly, we may need to save numpy array data to file for later use.

Each array may be a large vector or array of data and may need to be saved to an individual file.

The save operations can be performed in parallel, offering a dramatic speed-up.
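As a rough illustration of both directions, the sketch below saves and then loads a batch of arrays using a thread pool. The helper names (`save_array`, `load_array`, `roundtrip`), the file naming scheme, and the worker count of 8 are assumptions made for this example, not a recommended configuration; any real speed-up depends on the storage device and file sizes.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import numpy as np


def save_array(path, array):
    # Write one array to one .npy file.
    np.save(path, array)
    return path


def load_array(path):
    # Read one .npy file fully into memory.
    return np.load(path)


def roundtrip(arrays, directory, workers=8):
    # Hypothetical helper: save every array, then load them all back,
    # issuing the file operations from a pool of worker threads.
    paths = [Path(directory) / f"array_{i}.npy" for i in range(len(arrays))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(save_array, paths, arrays))
        return list(pool.map(load_array, paths))


arrays = [np.full((100, 100), i, dtype=np.float64) for i in range(20)]
with tempfile.TemporaryDirectory() as tmp:
    loaded = roundtrip(arrays, tmp)
print(len(loaded))
```

Threads are used here because file I/O is typically blocking, so one thread can wait on the disk while another runs.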

## Calculate Math Functions Efficiently in Parallel

Many math operations on numpy arrays are implemented using efficient algorithms that make use of parallelism.

An example is a suite of linear algebra operations, such as:

- Matrix multiplication.
- Matrix decompositions.
- Solving linear systems.

These functions call down into third-party libraries like BLAS and LAPACK that implement the algorithms efficiently in the C or FORTRAN programming languages.

These libraries use custom algorithms that exploit modern CPU instructions and threads to complete the calculation using all available CPU cores.

Numpy typically installs a BLAS and LAPACK library automatically and configures it for multithreading and to make the best use of your CPU hardware automatically.

Nevertheless, if a BLAS library is not installed or not configured correctly, you may not be taking full advantage of your hardware.

If a BLAS library is installed and is being used for many math functions, you can often achieve better performance by configuring the number of threads to match the number of physical CPU cores available.
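A minimal sketch of checking the installed library and limiting its threads is shown below. It assumes a machine with 4 physical cores (a made-up figure) and a BLAS build that reads the `OMP_NUM_THREADS` environment variable, which is commonly the case for OpenBLAS and MKL; other libraries may use a different variable.

```python
import os

# Assumption: 4 physical cores; replace with the count for your machine.
# The variable must be set before numpy is first imported to take effect.
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np

# Print the build configuration, including any BLAS/LAPACK library found.
np.show_config()

# Matrix multiplication is dispatched to BLAS when one is available.
a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
c = a @ b
print(c.shape)
```

Timing the multiplication with different thread counts is the most reliable way to find the best setting for a given system.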

Finally, if you try to combine BLAS multithreading with Python threads or multiprocessing, it can often result in dramatically better or worse performance. Naively combining different types of concurrency is a recipe for disaster.

## Initialize Arrays Efficiently in Parallel

Often before using a numpy array, we must initialize it.

This may mean seeding the array with random values, such as when modeling. Alternatively, it may mean seeding the array with an initial value, such as zeros or ones.

This is an operation applied to each item in the array, called an element-wise operation. These operations are performed sequentially, one element at a time. This can be painfully slow, especially as the size of the array increases to millions or billions of items.

Importantly, numpy will perform operations on arrays using an underlying C library that releases the global interpreter lock or GIL. This allows Python threads to run in parallel.

The implication is that we can fill numpy arrays with initial values in parallel using Python threads, offering a dramatic speed-up.
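One way this might look is sketched below: the array is split into contiguous slices along its first axis and each slice is filled by a separate thread. The function name `fill_parallel` and the default of 4 workers are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fill_parallel(array, value, workers=4):
    # Hypothetical helper: split the first axis into roughly equal slices.
    bounds = np.linspace(0, array.shape[0], workers + 1, dtype=int)
    # Basic slicing returns views, so filling a chunk writes into `array`.
    chunks = [array[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda chunk: chunk.fill(value), chunks))
    return array


data = fill_parallel(np.empty(10_000_000), 1.0)
print(data[:3], data[-3:])
```

Because each thread writes to a different slice, no locking is needed.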

Populating numpy arrays with random initial values can also be performed in parallel using Python threads, offering a dramatic speed-up.
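A sketch of one possible approach follows. It gives each thread its own random generator, seeded from a shared `SeedSequence`, and writes directly into a slice of the output array. The function name `random_parallel`, the seed value, and the worker count are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def random_parallel(n, workers=4, seed=1):
    # Hypothetical helper: fill a 1-D array with uniform values in [0, 1).
    out = np.empty(n)
    bounds = np.linspace(0, n, workers + 1, dtype=int)
    # Spawn one independent child seed per worker to avoid shared state.
    child_seeds = np.random.SeedSequence(seed).spawn(workers)

    def task(args):
        child_seed, start, stop = args
        rng = np.random.default_rng(child_seed)
        # Write the random values straight into this thread's slice.
        rng.random(out=out[start:stop])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, zip(child_seeds, bounds[:-1], bounds[1:])))
    return out


values = random_parallel(1_000_000)
print(values.min() >= 0.0, values.max() < 1.0)
```

Using a separate generator per thread keeps the result reproducible for a fixed seed and worker count.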

Not knowing that Python threads offer full parallelism when initializing numpy arrays can cause problems.

For example, naively using multiprocessing instead of threading to initialize a numpy array with random values can offer dramatically worse performance.

## Perform Element-Wise Math on Arrays in Parallel

Many math functions on arrays in numpy are not parallel by default.

An example of this is math functions that apply to each element in the array.

These are called element-wise math functions and are very common.

Examples include arithmetic operations, such as addition, subtraction, multiplication, and division. They also include more advanced operations such as square root, power, logarithm, exponential, and more.

Because these operations are executed on each item in the array, the operations are completed sequentially and get slower as the size of the array is increased. These operations can be painfully slow when an array has millions or billions of items.

Because the math operations are element-wise, the operations can easily be made parallel. In fact, they are a type of “embarrassingly parallel” operation because they are so easy to parallelize.

Importantly, the numpy array performs most math operations like arithmetic and more advanced operations using its own C functions. These functions release the global interpreter lock (GIL) in Python, allowing other Python threads to run.

The implication is that math operations on arrays in numpy can be executed in parallel using Python threads.

In practice, this means a large array can be split into chunks, with the element-wise operation applied to each chunk in a separate Python thread.
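As a hedged sketch of the idea, the example below computes a square root over a large array by giving each thread its own slice of the input and output. The name `sqrt_parallel` and the worker count are assumptions for illustration; the benefit in practice depends on array size and the operation.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def sqrt_parallel(data, workers=4):
    # Hypothetical helper: element-wise square root, one slice per thread.
    result = np.empty_like(data)
    bounds = np.linspace(0, data.shape[0], workers + 1, dtype=int)

    def task(span):
        start, stop = span
        # `out=` writes into the shared result without a temporary array.
        np.sqrt(data[start:stop], out=result[start:stop])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, zip(bounds[:-1], bounds[1:])))
    return result


data = np.arange(1_000_000, dtype=np.float64)
result = sqrt_parallel(data)
print(result[:4])
```

The same pattern applies to other element-wise functions that accept an `out` argument, such as `np.add` or `np.exp`.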

## Share NumPy Array Data Efficiently Between Processes

We may develop a program where we are using multiprocessing for process-based concurrency.

In this program, we may need to share numpy arrays between processes.

This may be for many reasons, such as loading or saving arrays in a child process separate from the main process that is operating on the arrays, using the arrays in child processes for modeling, or performing different mathematical operations on the same arrays in child processes.

Sharing numpy arrays between processes naively can be very slow and inefficient.

This is because when numpy arrays are shared between processes, a copy of the array is made and pickled by the sending process, then unpickled by the receiving process. This is called inter-process communication.

This copying and serializing/deserializing of the arrays happens automatically behind the scenes.
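To get a feel for this cost, the small sketch below pickles and unpickles an array by hand, which approximates what happens when an array is sent to another process. The array size is an arbitrary choice for the example.

```python
import pickle

import numpy as np

data = np.ones((1000, 1000))

# Serialize the array, as the sending process would.
payload = pickle.dumps(data)

# Deserialize it, as the receiving process would.
restored = pickle.loads(payload)

# The payload carries every byte of the array plus a small header.
print(data.nbytes, len(payload))

# The restored array is an independent copy, not a view of the original.
print(np.shares_memory(data, restored))
```

Each transfer therefore pays for at least one full copy of the array data, in addition to the time spent moving the bytes between processes.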

Naively choosing a method to share numpy arrays between processes can be devastating to performance.

Sharing arrays this way adds significant computational overhead that can often negate any performance benefit gained by operating on the arrays in parallel using processes.

As such, there is a suite of approaches that can be used to share numpy arrays between processes, each with different capabilities and benefits, such as:

- Sharing a numpy array via a function argument.
- Sharing a numpy array via an inherited global variable.
- Sharing a numpy array via a queue.
- Sharing a numpy array via a pipe.
- Sharing a **RawArray**-backed numpy array.
- Sharing a **SharedMemory**-backed numpy array.
- Sharing a numpy array hosted in a manager process.
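As one illustration from the list above, the following sketch shares a **SharedMemory**-backed array with a child process. The function names `double_in_place` and `main`, and the array size, are assumptions for this example; it is a minimal outline rather than a complete pattern for production use.

```python
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

import numpy as np


def double_in_place(name, shape, dtype):
    # Attach to the existing block by name; the array data is not pickled.
    shm = SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr *= 2
    del arr  # drop the view before closing the buffer
    shm.close()


def main():
    shape, dtype = (1_000_000,), np.float64
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = SharedMemory(create=True, size=nbytes)
    try:
        data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        data[:] = 1.0
        # Only the block name, shape, and dtype are sent to the child.
        child = Process(target=double_in_place, args=(shm.name, shape, dtype))
        child.start()
        child.join()
        total = float(data.sum())
        del data
    finally:
        shm.close()
        shm.unlink()  # release the block once no process needs it
    return total


if __name__ == "__main__":
    print(main())
```

The child modifies the same memory the parent reads, so the parent sees the doubled values without any array being copied between the processes.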

## Further Reading

This section provides additional resources that you may find helpful.

**Books**

- Concurrent NumPy in Python, Jason Brownlee, 2023 (**my book**).

**Concurrent NumPy Guides**

- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes

**Concurrent NumPy Documentation**

- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API

**Python Concurrency APIs**

- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks

## Takeaways

You now know about the importance of parallelism when using numpy.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Photo by Frankie Lopez on Unsplash
