Last Updated on September 29, 2023
Some NumPy functions run in parallel and use multiple threads by default.
Parts of NumPy are built on top of a standard API for linear algebra operations called BLAS, and BLAS libraries make use of multiple threads to speed up some operations by default.
As such, your installation of NumPy almost certainly already supports parallelism.
In this tutorial, you will discover multithreaded parallelism in NumPy.
Let’s get started.
NumPy Supports Parallelism By Default
NumPy is a Python library for fast array operations.
It is built on top of compiled libraries that provide its speed of execution, and it offers an extensive suite of array operations.
When you installed the NumPy library, it was probably installed with support for parallelism. That is, NumPy will perform many operations in parallel using threads, when possible.
NumPy executes array operations using a library with a standardized interface called BLAS, which stands for “Basic Linear Algebra Subprograms”.
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LAPACK for example.
— BLAS (Basic Linear Algebra Subprograms)
There are a number of implementations of the BLAS API, implemented in C and FORTRAN for speed.
Common implementations of the library include open-source implementations like OpenBLAS and ATLAS (Automatically Tuned Linear Algebra Software), and proprietary implementations such as MKL (Math Kernel Library).
You probably have OpenBLAS installed as your implementation of BLAS for use with NumPy. It’s the preferred and therefore most common BLAS library at the time of writing.
OpenBLAS is an open source BLAS library forked from the GotoBLAS2-1.13 BSD version. Since Mr. Kazushige Goto left TACC, GotoBLAS is no longer being maintained. Thus, we created this project to continue developing OpenBLAS/GotoBLAS.
— What is OpenBLAS? Why did you create this project?
When a BLAS library is installed, it will detect the number of CPU cores in your system that are available to the operating system. It will then use this many threads, by default, when an operation is executed that can be sped up using multithreading.
An example of a multithreaded operation is the numpy.dot() function that calculates the dot product of two arrays or matrices (e.g. matrix multiplication).
… many architectures now have a BLAS that also takes advantage of a multicore machine. If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.
— Parallel Programming with numpy and scipy
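For example, here is a tiny sketch of numpy.dot() on two small matrices. The arrays are small here only for readability; BLAS libraries typically only engage multiple threads for larger arrays.

# matrix multiplication via numpy.dot(), executed by the installed blas library
import numpy
a = numpy.array([[1.0, 2.0], [3.0, 4.0]])
b = numpy.array([[5.0, 6.0], [7.0, 8.0]])
# compute the matrix product of a and b
c = numpy.dot(a, b)
print(c)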
As such, your NumPy installation probably supports parallelism by default via multithreading, meaning, in turn, that SciPy, Pandas, and any other libraries that use NumPy will also support parallelism by default.
How to Check Which BLAS Library is Installed
You may be interested to know which BLAS library is installed.
You probably have OpenBLAS installed, but perhaps you have ATLAS, MKL, or some other BLAS implementation.
Knowing which library is installed may help if you are looking to consult the API documentation to get more out of the library, such as customizing the number of threads used for parallelism.
You can report the details of the BLAS library that is installed and being used by NumPy via the numpy.show_config() function.
Calling this function will report the details of the BLAS implementation to stdout.
For example:
# report the details of the blas library
import numpy
numpy.show_config()
Running this program on my system reports that I am using the OpenBLAS library.
What BLAS library do you have installed? Post your output in the comments below.
blas_armpl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_armpl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/opt/openblas/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
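As an aside, if you have the third-party threadpoolctl package installed (e.g. via pip install threadpoolctl), you can also inspect the native thread pools that NumPy's BLAS library exposes at runtime. This is a minimal sketch, assuming threadpoolctl is available; it is not part of NumPy itself.

# inspect the blas thread pools used by numpy (requires the third-party threadpoolctl package)
import numpy
from threadpoolctl import threadpool_info
# report each native thread pool, including its library and thread count
for pool in threadpool_info():
    print(pool)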
Now that we know how to check which BLAS library is installed, let’s look at how we might configure it.
How to Set The Number of Threads
When your BLAS library is installed, it will set the default number of threads based on the number of CPU cores in your system.
If you have 4 CPU cores, it will use all 4 CPU cores when it decides that an operation will benefit from multithreading.
If you have 4 CPU cores with hyperthreading, then it will “see” 8 CPU cores and use 8 threads for parallel operations.
This may not be desirable, as sometimes we can achieve better performance by having one thread per physical CPU core instead of one thread per logical CPU core.
The number of threads used by the installed BLAS library can typically be configured via the OMP_NUM_THREADS environment variable.
NumPy itself is normally intentionally limited to a single thread during function calls, however it does support multiple Python threads running at the same time. Note that for performant linear algebra NumPy uses a BLAS backend such as OpenBLAS or MKL, which may use multiple threads that may be controlled by environment variables such as OMP_NUM_THREADS depending on what is used.
— Number of Threads used for Linear Algebra
This environment variable is supported by some BLAS libraries, such as:
- OpenBLAS: Setting the number of threads using environment variables
- MKL: Setting the Number of Threads Using an OpenMP* Environment Variable
Each library will have its own custom variable name which you may need to look up in the API documentation.
If you are using multiple underlying libraries, there may be some hierarchy between the environment variables as they jostle to set the number of threads.
For example:
- OpenMP: OMP_NUM_THREADS
- OpenBLAS: OPENBLAS_NUM_THREADS
- MKL: MKL_NUM_THREADS
- VecLIB/Accelerate: VECLIB_MAXIMUM_THREADS
- NumExpr: NUMEXPR_NUM_THREADS
The environment variable can be set prior to executing your Python script, for example by exporting it:
export OMP_NUM_THREADS=4
python ./program.py
Or on one line when executing the script:
OMP_NUM_THREADS=4 python ./program.py
It can also be set explicitly from within the Python program via the os.environ mapping, as long as this is done before the NumPy library is imported.
For example:
# set the number of blas threads (before importing numpy)
from os import environ
environ['OMP_NUM_THREADS'] = '4'
If you have many libraries installed and are having trouble setting the number of threads, try the following:
# set the number of threads for many common libraries
from os import environ
N_THREADS = '4'
environ['OMP_NUM_THREADS'] = N_THREADS
environ['OPENBLAS_NUM_THREADS'] = N_THREADS
environ['MKL_NUM_THREADS'] = N_THREADS
environ['VECLIB_MAXIMUM_THREADS'] = N_THREADS
environ['NUMEXPR_NUM_THREADS'] = N_THREADS
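Alternatively, if the third-party threadpoolctl package is available, you can limit the number of BLAS threads at runtime, scoped to a block of code, rather than via environment variables. A minimal sketch, assuming threadpoolctl is installed:

# limit blas threads for a block of code (requires the third-party threadpoolctl package)
from threadpoolctl import threadpool_limits
from numpy.random import rand

# create two modest arrays of random values
data1 = rand(2000, 2000)
data2 = rand(2000, 2000)
# temporarily restrict all blas libraries to a single thread
with threadpool_limits(limits=1, user_api='blas'):
    result = data1.dot(data2)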
Now that we know how to configure the number of threads used by the BLAS library, let’s explore how we may use it in practice, such as confirming our NumPy installation does indeed support parallelism.
How to Confirm NumPy is Multithreaded
We now know that NumPy supports parallelism by default.
We also know how to set the number of threads used by NumPy (technically, by the BLAS and LAPACK libraries that sit behind NumPy).
In this section, we will set the number of threads, first to confirm that NumPy supports parallelism, and then to improve performance by fine-tuning the configuration.
We will first calculate a matrix multiplication with non-trivial arrays so that BLAS will choose to use threads.
We will then run the same example with multithreading turned off and show that it is slower. This will prove that by default NumPy is using threads for relevant operations and that this is faster than not using threads.
Matrix Multiplication With Threads
Matrix multiplication is computationally expensive and underlies many linear algebra operations.
In this example, we will create two arrays of random numbers and multiply them together.
We will create two square arrays, each 8,000 by 8,000 elements. We will then multiply them together using the numpy.dot() function.
This will use the default number of threads for the BLAS library in our system.
# example of calculating a matrix multiplication (with threads)
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
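If you want to measure the execution time yourself, one option is to wrap the multiplication with time.perf_counter() from the standard library. This is a sketch of one way to benchmark it, not part of the original listing:

# time the matrix multiplication using the standard library
from time import perf_counter
from numpy.random import rand

n = 8000
data1 = rand(n, n)
data2 = rand(n, n)
# record the duration of the multiplication only
start = perf_counter()
result = data1.dot(data2)
print(f'Took {perf_counter() - start:.1f} seconds')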
Running the example took about 8.1 seconds on my system.
How long did it take on your system?
Let me know in the comments below.
Next, let’s do the same thing, but without threads.
Matrix Multiplication Without Threads
We can disable threads by setting the number of threads to 1.
This will mean that no additional helper threads will be created and used by the BLAS library.
We can achieve this by setting the OMP_NUM_THREADS environment variable to the value ‘1’ prior to running our program, such as via an environment variable in the shell, or from within the program prior to importing the NumPy library.
The updated example with this change is listed below.
# example of calculating a matrix multiplication (without threads)
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
Running the updated example took 17.0 seconds on my system.
That is more than 2x slower.
This proves two things.
1. It proves that the ‘OMP_NUM_THREADS’ environment variable is effective at configuring the number of threads used by BLAS on my system (and probably yours).
2. It proves that by default, NumPy is using threads where possible to speed up array operations on my system (and probably yours).
Was there a difference on your system? How much?
If not, try another environment variable, such as those listed above.
Next, let’s see if we can make things even faster.
Matrix Multiplication With Optimal Number of Threads
By default, the installation of OpenBLAS will ask the operating system how many CPU cores it has and use that as a default number of threads.
Alternatively, it may fall back to a hard-coded number of threads, perhaps when it cannot determine the number of CPU cores.
This default is probably suboptimal if your system has CPU cores with hyperthreading, as most modern CPUs do.
When performing CPU-intensive operations, such as those operations performed by BLAS, we should probably have one thread per physical CPU core.
I have 4 CPU cores with hyperthreading. The operating system reports 8 logical cores, and this is probably what is used by OpenBLAS.
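One way to check is with os.cpu_count(), which reports logical cores. The sketch below also uses the third-party psutil package (an assumption here, not used elsewhere in this tutorial) to report physical cores directly:

# compare logical and physical cpu core counts
import os
print(os.cpu_count())  # logical cores, e.g. 8 with hyperthreading
# psutil is a third-party package (pip install psutil)
import psutil
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 4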
We can set the number of threads used by BLAS to the number of physical cores and see a speed-up in most CPU-bound operations.
The updated version of the program with this change is listed below. Change 4 to the number of cores in your system, or use trial and error to find a good config.
# example of matrix multiplication (with optimal threads)
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from numpy.random import rand
# size of arrays
n = 8000
# create an array of random values
data1 = rand(n, n)
data2 = rand(n, n)
# matrix multiplication
result = data1.dot(data2)
Running the example takes 6.2 seconds.
This is nearly 2 whole seconds faster than using the default determined (or hardcoded) by OpenBLAS.
I can now use this in all my numpy-based programs and expect a modest speed-up.
Did you see any difference on your system?
Let me know in the comments below.
What About the BLAS Threads and the GIL?
Only one Python thread can run at a time in the reference Python interpreter called CPython.
This is due to a lock in the Python interpreter called the global interpreter lock or GIL.
This lock is used by all threads in Python to ensure that access to the internal state of the interpreter is thread-safe. The lock is sometimes released, such as during a blocking call or a call into a C library, when the interpreter knows the call cannot touch its internal state.
This lock does not apply to, and therefore does not impact, threads created within third-party libraries, such as the BLAS libraries used by NumPy.
As such, threads used in BLAS do run in parallel, unlike Python threads.
As evidence, see the example in the previous section.
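To see this for yourself, here is a minimal sketch of my own (not from the original tutorial) that runs two matrix multiplications in separate Python threads. Because NumPy releases the GIL while the BLAS call executes, the two tasks can genuinely overlap.

# sketch: run two matrix multiplications in separate python threads
# numpy releases the GIL during the blas call, so the work can overlap
from concurrent.futures import ThreadPoolExecutor
from numpy.random import rand

n = 4000
data1 = rand(n, n)
data2 = rand(n, n)

# each task performs a full matrix multiplication
def task():
    return data1.dot(data2)

# submit both tasks to python threads and wait for the results
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(task) for _ in range(2)]
    results = [future.result() for future in futures]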
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know about parallelism in NumPy.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Markus Winkler on Unsplash