NumPy Supports Multithreaded Parallelism

April 16, 2023 Concurrent NumPy

Some NumPy functions run in parallel and use multiple threads, by default.

Parts of NumPy are built on top of a standard API for linear algebra operations called BLAS. BLAS libraries use multiple threads to speed up some operations by default.

As such, your installation of NumPy almost certainly already supports parallelism.

In this tutorial, you will discover multithreaded parallelism in NumPy.

Let's get started.

NumPy Supports Parallelism By Default

NumPy is a Python library for fast array operations.

It is built on top of compiled libraries that provide its speed of execution, and it offers an extensive suite of array operations.

When you installed the NumPy library, it was probably installed with support for parallelism. That is, NumPy will perform many operations in parallel using threads, when possible.

NumPy executes many linear algebra operations using a library that implements a standardized interface called BLAS, which stands for "Basic Linear Algebra Subprograms".

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.
-- BLAS (Basic Linear Algebra Subprograms)

There are a number of implementations of the BLAS API, implemented in C and FORTRAN for speed.

You can learn more about BLAS and LAPACK support in numpy in the tutorial:

What is BLAS and LAPACK in NumPy

Common implementations of the library include open-source implementations like OpenBLAS and ATLAS (Automatically Tuned Linear Algebra Software), and proprietary implementations such as MKL (Math Kernel Library).

You can learn more about third-party libraries that implement BLAS in the tutorial:

NumPy BLAS/LAPACK Libraries

You probably have OpenBLAS installed as your implementation of BLAS for use with NumPy. It's the preferred and therefore most common BLAS library at the time of writing.

OpenBLAS is an open source BLAS library forked from the GotoBLAS2-1.13 BSD version. Since Mr. Kazushige Goto left TACC, GotoBLAS is no longer being maintained. Thus, we created this project to continue developing OpenBLAS/GotoBLAS.
-- What is OpenBLAS? Why did you create this project?

A BLAS library will typically detect the number of CPU cores that are in your system and are available to the operating system. It will then use up to this many threads when an operation is executed that can be sped up using multithreading.

An example of a multithreaded operation is the numpy.dot() function that calculates the dot product of two arrays or matrices (e.g. matrix multiplication).

... many architectures now have a BLAS that also takes advantage of a multicore machine. If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.
-- Parallel Programming with numpy and scipy

As such, your NumPy installation probably supports parallelism by default via multithreading. In turn, SciPy, Pandas, and any other libraries that use NumPy for these operations will also benefit from this parallelism by default.

How to Check Which BLAS Library is Installed

You may be interested to know which BLAS library is installed.

You probably have OpenBLAS installed, but perhaps you have ATLAS, MKL, or some other BLAS implementation.

Knowing which library is installed may help if you are looking to consult the API documentation to get more out of the library, such as customizing the number of threads used for parallelism.

You can report the details of the BLAS library that is installed and is being used by NumPy via the numpy.show_config() function.

Calling this function will report the details of the BLAS implementation to stdout.

For example:

# report the details of the blas library
import numpy
numpy.show_config()

Running this program on my system reports that I am using the OpenBLAS library.

What BLAS library do you have installed? Post your output in the comments below.

blas_armpl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_armpl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

You can learn more about checking which BLAS/LAPACK library is installed in the tutorial:

Check BLAS Library Installed For NumPy
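If you want to check the report programmatically rather than by eye, a small helper can scan the text for familiar library names. The sketch below is only an illustration: detect_blas() and KNOWN_BLAS are hypothetical names, not part of NumPy, and the helper assumes the report layout shown above (other NumPy versions may lay the report out differently). It is run here on an excerpt of the output listed above.

```python
# Hypothetical helper (not part of NumPy): scan show_config()-style text
# for the names of well-known BLAS libraries.
KNOWN_BLAS = ("openblas", "mkl", "atlas", "blis", "accelerate")


def detect_blas(config_text):
    """Return known BLAS names that appear on 'libraries = [...]' lines."""
    found = []
    for line in config_text.splitlines():
        line = line.strip().lower()
        # ASSUMPTION: linked libraries are listed on lines starting with 'libraries'
        if not line.startswith("libraries"):
            continue
        for name in KNOWN_BLAS:
            if name in line and name not in found:
                found.append(name)
    return found


# excerpt of the report shown above
sample = """\
blas_mkl_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
"""
print(detect_blas(sample))  # → ['openblas']
```

To use it on your own system, copy the text printed by numpy.show_config() into a string and pass it to the helper.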

Now that we know how to check which BLAS library is installed, let's look at how we might configure it.

How to Set The Number of Threads

By default, your BLAS library will typically choose the number of threads to use based on the number of CPU cores in your system.

If you have 4 CPU cores, it will use all 4 CPU cores when it decides that an operation will benefit from multithreading.

If you have 4 CPU cores with hyperthreading, then it will "see" 8 CPU cores and use 8 threads for parallel operations.

This may not be desirable, as sometimes we can achieve better performance by having one thread per physical CPU core instead of one thread per logical CPU core.
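To see the number of cores that a library is likely to "see", you can ask Python for the logical CPU count. The sketch below uses os.cpu_count() from the standard library, which reports logical CPUs. The standard library does not report physical cores, so the second value is only a guess that assumes two hardware threads per physical core, which may not hold on your machine.

```python
# report logical CPUs and a rough guess at the number of physical cores
import os

# logical CPUs as reported by the operating system (fall back to 1 if unknown)
logical = os.cpu_count() or 1

# ASSUMPTION: hyperthreading with two hardware threads per physical core
physical_guess = max(1, logical // 2)

print(f"logical CPUs: {logical}")
print(f"physical cores (guess): {physical_guess}")
```

If the guess looks wrong for your hardware, check the CPU specification instead and use that number.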

The number of threads used by the installed BLAS library can be configured via the OMP_NUM_THREADS environment variable, which is an OpenMP setting honored by several BLAS libraries rather than a feature of NumPy itself.

NumPy itself is normally intentionally limited to a single thread during function calls, however it does support multiple Python threads running at the same time. Note that for performant linear algebra NumPy uses a BLAS backend such as OpenBLAS or MKL, which may use multiple threads that may be controlled by environment variables such as OMP_NUM_THREADS depending on what is used.
-- Number of Threads used for Linear Algebra

This environment variable is supported by several BLAS libraries, such as OpenBLAS and MKL.

Each library may also have its own variable name, which you may need to look up in its documentation.

If you are using multiple underlying libraries, there may be some precedence among the environment variables as each one tries to set the number of threads.

Common variable names include:

OpenMP: OMP_NUM_THREADS
OpenBLAS: OPENBLAS_NUM_THREADS
MKL: MKL_NUM_THREADS
VecLIB/Accelerate: VECLIB_MAXIMUM_THREADS
NumExpr: NUMEXPR_NUM_THREADS

The environment variable can be set prior to executing your Python script, for example by exporting it:

...
export OMP_NUM_THREADS=4
python ./program.py

Or on one line when executing the script:

OMP_NUM_THREADS=4 python ./program.py
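The same idea can be sketched from Python by launching the script as a child process with a modified environment. This is a minimal illustration using the standard subprocess module. The one-line child program is a stand-in for ./program.py and simply prints the value it receives, so you can confirm the variable reached the child.

```python
# launch a child Python process with OMP_NUM_THREADS set in its environment
import os
import subprocess
import sys

# copy the current environment and add the thread setting
child_env = dict(os.environ, OMP_NUM_THREADS="4")

# stand-in for ./program.py: report the value seen by the child process
child_code = "import os; print(os.environ['OMP_NUM_THREADS'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
seen = result.stdout.strip()
print(seen)  # → 4
```

Because the variable is set only for the child, the parent process and its own libraries are left unchanged.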

It can also be set explicitly from within the Python program via the os.environ mapping, as long as this is done before NumPy is imported.

For example:

...
from os import environ
environ['OMP_NUM_THREADS'] = '4'

If you have many libraries installed and are having trouble setting the number of threads, try the following:

# set the number of threads for many common libraries
from os import environ
N_THREADS = '4'
environ['OMP_NUM_THREADS'] = N_THREADS
environ['OPENBLAS_NUM_THREADS'] = N_THREADS
environ['MKL_NUM_THREADS'] = N_THREADS
environ['VECLIB_MAXIMUM_THREADS'] = N_THREADS
environ['NUMEXPR_NUM_THREADS'] = N_THREADS
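Because these variables are typically read when the underlying library is loaded, the order of statements matters. The sketch below wraps the snippet above in a function that warns if NumPy has already been imported. The names set_thread_limit() and THREAD_VARS are hypothetical and chosen for illustration.

```python
# set thread-count variables and warn if numpy was imported too early
import os
import sys
import warnings

# the environment variables listed above
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def set_thread_limit(n_threads):
    """Set every known thread-count variable to n_threads."""
    # ASSUMPTION: the library reads these variables when it is first loaded
    if "numpy" in sys.modules:
        warnings.warn("numpy is already imported; the new limit may be ignored")
    for name in THREAD_VARS:
        os.environ[name] = str(n_threads)


set_thread_limit(4)
print({name: os.environ[name] for name in THREAD_VARS})
```

Call the function at the very top of your program, before any import that pulls in NumPy.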

You can learn more about setting the number of threads used by BLAS/LAPACK in the tutorial:

Configure the Number of BLAS/LAPACK Threads for NumPy

Now that we know how to configure the number of threads used by the BLAS library, let's explore how we may use it in practice, such as confirming our NumPy installation does indeed support parallelism.

How to Confirm NumPy is Multithreaded

We now know that NumPy supports parallelism by default.

We also know how to set the number of threads used by NumPy (technically by the BLAS and LAPACK libraries that sit behind NumPy).

In this section, we will explore setting the number of threads, first to confirm that NumPy supports parallelism and then to improve performance by fine-tuning the configuration.

We will first calculate a matrix multiplication with non-trivial arrays so that BLAS will choose to use threads.

We will then run the same example with multithreading turned off and show that it is slower. This will prove that by default NumPy is using threads for relevant operations and that this is faster than not using threads.

Matrix Multiplication With Threads

Matrix multiplication is computationally expensive and underlies many linear algebra operations.

In this example, we will create two arrays of random numbers and multiply them together.

We will create two square arrays that are 8,000 numbers by 8,000 numbers. We will then multiply them together using the numpy.dot() function.

This will use the default number of threads for the BLAS library in our system.

# example of calculating a matrix multiplication (with threads)
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)

Running the example took about 8.1 seconds on my system.

How long did it take on your system?
Let me know in the comments below.
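The listing above does not include any timing code, so here is one way you might measure it yourself. This is a sketch, not necessarily how the time reported above was taken: it times only the call to dot() using time.perf_counter() and uses a smaller array so that it finishes quickly. The function name time_matmul() is hypothetical.

```python
# time a matrix multiplication using the default number of BLAS threads
from time import perf_counter
from numpy.random import rand


def time_matmul(n):
    """Multiply two n x n random arrays; return (seconds, result shape)."""
    data1 = rand(n, n)
    data2 = rand(n, n)
    # time only the multiplication, not the array creation
    start = perf_counter()
    result = data1.dot(data2)
    duration = perf_counter() - start
    return duration, result.shape


# smaller than the 8,000 used above so the sketch runs quickly
duration, shape = time_matmul(1000)
print(f"{shape[0]}x{shape[1]} multiplication took {duration:.3f} seconds")
```

Change 1000 to 8000 to repeat the experiment at the size used in this tutorial, and run it a few times, as individual timings can vary.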

Next, let's do the same thing, but without threads.

Matrix Multiplication Without Threads

We can disable threads by setting the number of threads to 1.

This will mean that no additional helper threads will be created and used by the BLAS library.

We can achieve this by setting the OMP_NUM_THREADS environment variable to the value '1' prior to running our program, such as in the shell, or within the program prior to importing the NumPy library.

The updated example with this change is listed below.

# example of calculating a matrix multiplication (without threads)
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)

Running the updated example took 17.0 seconds on my system.

That is about 2.1x slower.

This shows two things.

1. It shows that the 'OMP_NUM_THREADS' environment variable is effective at configuring the number of threads used by BLAS on my system (and probably yours).
2. It shows that by default, NumPy is using threads where possible to speed up array operations on my system (and probably yours).

Was there a difference on your system? How much?

If not, try another environment variable, such as those listed above.

Next, let's see if we can make things even faster.

Matrix Multiplication With Optimal Number of Threads

By default, the installation of OpenBLAS will ask the operating system how many CPU cores it has and use that as a default number of threads.

Or, it may just use a hard-coded number of threads, perhaps only if it cannot determine the number of CPU cores.

This default is probably not ideal if your system has CPU cores with hyperthreading, as many modern CPUs do.

When performing CPU-intensive operations, such as those operations performed by BLAS, we should probably have one thread per physical CPU core.

I have 4 CPU cores with hyperthreading. The operating system reports 8 cores and this is probably what is used by OpenBLAS.

You can learn more about determining the number of CPUs seen by Python and the operating system in the tutorial:

Number of CPUs in Python

We can set the number of threads used by BLAS to the number of physical cores and may see a speed-up in many CPU-bound operations.

The updated version of the program with this change is listed below. Change 4 to the number of physical cores in your system, or use trial and error to find a good config.

# example of matrix multiplication (with optimal threads)
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)

Running the example takes 6.2 seconds.

This is nearly 2 whole seconds faster than using the default determined (or hardcoded) by OpenBLAS.
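To put the three timings on the same footing, we can express each as a speed-up over the single-threaded run. The short calculation below uses only the times reported in this tutorial. The dictionary labels are my own.

```python
# compare the reported timings as speed-ups over the single-threaded run
timings = {
    "default threads": 8.1,
    "1 thread": 17.0,
    "4 threads": 6.2,
}

# the single-threaded run is the baseline
baseline = timings["1 thread"]
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# → default threads: 2.10x
# → 1 thread: 1.00x
# → 4 threads: 2.74x
```

These figures come from single runs on one machine, so treat them as indicative rather than exact.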

I can now use this in all my NumPy-based programs and expect a modest speed-up.

Did you see any difference on your system?
Let me know in the comments below.
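Finding a good setting by trial and error can be automated. Because the thread count must be set before NumPy is imported, the sketch below runs each trial in a fresh child process with a different OMP_NUM_THREADS value. It is only an illustration: time_with_threads() and CHILD_CODE are hypothetical names, the array is smaller than 8,000 so the trials run quickly, and other variables already set in your environment (such as OPENBLAS_NUM_THREADS) may take precedence.

```python
# try several thread counts, each in a fresh child process
import os
import subprocess
import sys

# child program: time one matrix multiplication and print the seconds taken
CHILD_CODE = """
from time import perf_counter
from numpy.random import rand
n = 1000
data1, data2 = rand(n, n), rand(n, n)
start = perf_counter()
data1.dot(data2)
print(perf_counter() - start)
"""


def time_with_threads(n_threads):
    """Run the child with OMP_NUM_THREADS set; return seconds or None on failure."""
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    done = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
    )
    if done.returncode != 0:
        return None
    return float(done.stdout.strip())


# thread counts to try; adjust these to suit your system
results = {k: time_with_threads(k) for k in (1, 2, 4)}
for k, seconds in results.items():
    status = "failed" if seconds is None else f"{seconds:.3f} seconds"
    print(f"{k} threads: {status}")
```

Pick the thread count with the lowest time, ideally after repeating each trial a few times.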

What About the BLAS Threads and the GIL?

Only one Python thread can run at a time in the reference Python interpreter called CPython.

This is due to a lock in the Python interpreter called the global interpreter lock or GIL.

This lock is used by all threads in Python to ensure that access to the internal state of the interpreter is thread-safe. The lock is released at times, such as during a blocking call or when a C library call explicitly releases it because it does not touch the interpreter's internal state.

This lock is not used by the threads that third-party libraries create internally, such as the BLAS libraries used by NumPy, and therefore does not limit them.

As such, threads used in BLAS do run in parallel, unlike Python threads.

As evidence, see the example in the previous section.
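For contrast, the sketch below runs a CPU-bound pure-Python function twice in sequence and then in two Python threads. On a standard CPython build with the GIL, the threaded version is typically no faster, although I make no claim about the exact times you will see, and builds of Python without the GIL may behave differently. The function count_down() is a made-up workload.

```python
# compare sequential vs threaded execution of a pure-Python CPU-bound task
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter


def count_down(n):
    """Made-up CPU-bound workload written in pure Python."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000

# run the task twice, one call after the other
start = perf_counter()
serial = [count_down(N), count_down(N)]
serial_time = perf_counter() - start

# run the task twice using two Python threads
start = perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = list(pool.map(count_down, [N, N]))
threaded_time = perf_counter() - start

print(f"sequential: {serial_time:.3f} seconds")
print(f"two threads: {threaded_time:.3f} seconds")
```

The BLAS threads behind numpy.dot() are not subject to this limit, which is why the multithreaded matrix multiplication was faster.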

Takeaways

You now know about parallelism in NumPy.
