NumPy can be so much faster with the careful use of concurrency.
This includes built-in multithreaded algorithms via BLAS threads, the use of thread pools to execute NumPy functions in parallel, and efficient shared memory methods for sharing arrays between Python processes.
This course provides you with a 7-day crash course in concurrency for NumPy.
You will get a taste of what is possible and hopefully, skills that you can bring to your next project.
Let’s get started.
Concurrent NumPy Overview
Hi, thanks for your interest in Concurrent NumPy.
This course is divided into 7 lessons. You can complete them at your own pace, but it might be fun to bookmark this page and complete one per day.
You have a lot of fun ahead, including:
- Lesson 01: Faster NumPy With Concurrency (today)
- Lesson 02: Configure NumPy BLAS Threads
- Lesson 03: Multithreaded NumPy Functions
- Lesson 04: NumPy Releases the GIL
- Lesson 05: Faster NumPy Tasks with Thread Pool
- Lesson 06: Combine BLAS and Python Threads
- Lesson 07: Share NumPy array with ctypes RawArray
You can download a .zip of all the source code for this course here:
Your first lesson is just below.
Before you dive in, a quick question:
Why are you interested in Concurrent NumPy?
Let me know in the comments.
Lesson 01: Faster NumPy With Concurrency
Concurrency in NumPy was not an afterthought.
NumPy is how we represent arrays of numbers in Python.
It is so widely used that it is the center of an ecosystem with many libraries built on top of it such as SciPy, Pandas, scikit-learn, Matplotlib, TensorFlow, and so much more.
NumPy programs can be made much faster by adding concurrency.
There are four key aspects to NumPy concurrency:
- Many NumPy algorithms are already multithreaded.
- Many NumPy functions release the GIL.
- Many NumPy tasks can be accelerated with thread pools.
- NumPy arrays can be shared efficiently between processes.
Let’s consider each of these aspects in turn.
1. Faster NumPy With BLAS Threads
BLAS and LAPACK are general specifications for linear algebra math functions.
Many free or open-source libraries implement these specs, such as OpenBLAS, MKL, ATLAS, and more.
We don’t need to know much about these libraries other than that they are really fast!
They are implemented in C and FORTRAN under the covers and use platform-specific tricks and CPU instructions to make code faster.
They also implement functions using multithreaded algorithms.
Knowing about BLAS and how to configure the number of BLAS threads that NumPy will use in our programs can offer a big speed-up.
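As a quick taste, the sketch below caps the number of BLAS threads for a single operation. It assumes the third-party threadpoolctl package is installed (pip install threadpoolctl); we will see the simpler environment-variable approach in Lesson 02.
# SuperFastPython.com
# sketch: cap blas threads for one operation via threadpoolctl
from threadpoolctl import threadpool_limits
import numpy
# create two modest arrays of random values
data1 = numpy.random.rand(2000, 2000)
data2 = numpy.random.rand(2000, 2000)
# limit the blas library to 2 threads for this block only
with threadpool_limits(limits=2, user_api='blas'):
    result = data1.dot(data2)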
2. NumPy Functions Release The GIL
The Python interpreter is implemented using a Global Interpreter Lock or GIL.
This keeps the Python interpreter thread-safe, but it prevents more than one Python thread from running at a time while the lock is held.
Importantly, NumPy will release the GIL.
This means that we can execute multiple Python threads concurrently and achieve full parallelism.
3. Faster NumPy With Thread Pools
Because most NumPy functions release the GIL, we can parallelize NumPy tasks with Python threads.
Many NumPy array operations are applied element-wise.
This means the same mathematical operation is applied to each cell of an array independently.
We can divide these element-wise operations into sub-tasks and execute them in parallel using a pool of worker threads, such as with the multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor classes.
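For example, here is a minimal sketch that applies numpy.sqrt to the two halves of a large array in separate worker threads; the array size and the two-way split are arbitrary choices for illustration.
# SuperFastPython.com
# sketch: apply an element-wise operation to array halves in parallel
from concurrent.futures import ThreadPoolExecutor
import numpy
# create a large array of random values
data = numpy.random.rand(50000000)
# split the array into two halves
halves = [data[:len(data)//2], data[len(data)//2:]]
# apply sqrt to each half in a separate worker thread
with ThreadPoolExecutor(2) as exe:
    results = list(exe.map(numpy.sqrt, halves))
# stitch the partial results back together
result = numpy.concatenate(results)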
4. Faster NumPy By Efficient Array Sharing
Sharing data between Python processes is typically very slow.
It is slow because data needs to be shared using inter-process communication methods, where data is pickled by the sender, transmitted, and unpickled by the receiver.
Luckily, there are many different ways to share NumPy arrays between Python processes. At least 9 methods by my count.
Some methods do not involve making a copy of the array as part of sharing it and can be very fast, such as sharing via process inheritance.
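Another such method, sketched below, backs a NumPy array with a multiprocessing shared memory block (Python 3.8+) so that no pickling is needed; Lesson 07 walks through a different method based on RawArray.
# SuperFastPython.com
# sketch: back a numpy array with a shared memory block (python 3.8+)
from multiprocessing import shared_memory
import numpy
# allocate a shared memory block large enough for 1 million doubles
shm = shared_memory.SharedMemory(create=True, size=1000000 * 8)
# wrap the block in a numpy array, no copy is made
data = numpy.ndarray((1000000,), dtype=numpy.double, buffer=shm.buf)
data.fill(1.0)
# child processes can attach via shared_memory.SharedMemory(name=shm.name)
# drop the numpy view before releasing the block
del data
shm.close()
shm.unlink()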
We now know some of the key aspects of how we can develop faster NumPy programs with concurrency.
In the following lessons, we will look at worked examples.
Lesson 02: Configure NumPy BLAS Threads
NumPy makes use of third-party libraries to perform array functions efficiently.
One example is the BLAS specification used by NumPy to execute vector, matrix, and linear algebra operations.
A BLAS library is installed automatically when we install NumPy. A common example is the OpenBLAS library.
We can check which BLAS library is installed on our system using the numpy.show_config() function.
For example:
# SuperFastPython.com
# report the details of the blas library
import numpy
numpy.show_config()
Run the example.
Let me know in the comments which BLAS library you have installed.
Many algorithms available in the BLAS library are multithreaded.
This means that many of the NumPy functions we use will be multithreaded and executed in parallel automatically. We will look at some examples in the next lesson.
We can configure the number of threads used by BLAS NumPy functions. By default, BLAS may use one thread per physical CPU core or one thread per logical CPU core.
Tuning the number of threads in our program can result in better performance in many cases.
We can set the number of threads used by BLAS via the OMP_NUM_THREADS environment variable.
This variable can be set via the os.environ mapping at the beginning of our program, before NumPy is imported.
For example:
...
# set the number of numpy blas threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
You can learn more about configuring the number of BLAS threads in the tutorial:
Lesson 03: Multithreaded NumPy Functions
Many NumPy functions are already multithreaded.
Some commonly used examples include:
- Matrix multiplication with the dot() function.
- Matrix solvers such as the numpy.linalg.solve() function.
- Matrix decompositions such as the numpy.linalg.svd() function.
You can see a list of all NumPy functions that are multithreaded here:
In the previous lesson, we learned about BLAS libraries and how to configure the number of BLAS threads via the OMP_NUM_THREADS environment variable.
In this lesson, we are going to perform matrix multiplication and benchmark how long it takes with different numbers of BLAS threads.
The complete program is listed below.
# SuperFastPython.com
# example of multithreaded matrix multiplication
import os
os.environ['OMP_NUM_THREADS'] = '2'
import time
import numpy
# record the start time
start = time.perf_counter()
# create an array of random values
data1 = numpy.random.rand(8000, 8000)
data2 = numpy.random.rand(8000, 8000)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time.perf_counter() - start
print(f'Took {duration:.3f} seconds')
Try running the example.
Then re-run the same example again with a different number of BLAS threads from 1 to the number of physical or logical CPU cores in your system.
Let me know in the comments the number of threads that give the best result on your system.
In some cases, we can get a 5x speed-up for matrix multiplication with a well-configured number of BLAS threads.
You can learn more about multithreaded matrix multiplication in the tutorial:
Lesson 04: NumPy Releases the GIL
The internals of the Python interpreter are not thread-safe.
As such, the Python interpreter makes use of a Global Interpreter Lock, or GIL for short, to make instructions executed by the Python interpreter (called Python bytecodes) thread-safe.
The effect of the GIL is that whenever a thread within a Python program wants to run, it must acquire the lock before executing. This is not a problem for most Python programs that have a single thread of execution, called the main thread.
It can become a problem in multithreaded Python programs.
The GIL does not affect the BLAS threads we saw in the previous lesson, because BLAS threads run outside of Python, in an external library such as OpenBLAS.
The GIL will affect our normal Python programs, meaning only one thread can run at a time.
Unless the GIL is released.
Importantly, NumPy will release the GIL when performing most operations on arrays.
This includes many operations, such as:
- Methods on the ndarray objects, like sum()
- Arithmetic operators such as +, -, *, / and more.
- Math functions such as power(), sqrt(), and more.
The API documentation suggests that generally, all operations on arrays will release the GIL, except those that are operating upon arrays of Python objects.
This means that we can execute NumPy functions in multiple Python threads and achieve full parallelism.
This can offer a significant speed-up, especially in programs where we need to perform the same mathematical operation on many arrays.
This also impacts the many open-source libraries that make use of NumPy arrays, such as SciPy, scikit-learn, pandas, and more.
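To see this for yourself, here is a minimal sketch that times two numpy.sqrt calls run back to back, then times the same two calls issued from two Python threads. The array size is an arbitrary choice and timings will vary by system.
# SuperFastPython.com
# sketch: the same numpy work, sequential vs two python threads
from concurrent.futures import ThreadPoolExecutor
import time
import numpy
# create a large array of random values
data = numpy.random.rand(50000000)
# sequential: two calls, one after the other
start = time.perf_counter()
_ = numpy.sqrt(data)
_ = numpy.sqrt(data)
print(f'Sequential: {time.perf_counter() - start:.3f} seconds')
# threaded: the same two calls in parallel, possible because sqrt releases the GIL
start = time.perf_counter()
with ThreadPoolExecutor(2) as exe:
    _ = list(exe.map(numpy.sqrt, [data, data]))
print(f'Threaded: {time.perf_counter() - start:.3f} seconds')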
You can learn more about NumPy releasing the GIL in the tutorial:
Lesson 05: Faster NumPy Tasks with Thread Pool
We can use a Python thread pool to perform operations on a NumPy array in parallel.
This can offer a dramatic speedup.
We learned in the previous lesson that most NumPy functions will release the GIL, allowing multiple Python threads that are calling NumPy functions to achieve full parallelism.
This capability can be used to multithread many element-wise operations on NumPy arrays, such as:
- Filling, initializing, and populating NumPy arrays.
- Array arithmetic on NumPy arrays, such as addition and subtraction.
- Math functions on arrays such as log() and sqrt().
This can be achieved by splitting up the array into chunks and issuing each chunk to a different worker thread.
Each worker thread can then perform the operation on its chunk in parallel, allowing the overall task to complete 2x or 3x faster.
Consider the case of initializing a large NumPy array, such as filling it with the value 1.0.
We can use the numpy.ones() function directly.
For example:
# SuperFastPython.com
# initialize a numpy array slowly
import time
import numpy
# record the start time
time_start = time.perf_counter()
# create a new matrix and fill with 1
data = numpy.ones((50000,50000))
# calculate the duration
time_duration = time.perf_counter() - time_start
# report progress
print(f'Took {time_duration:.3f} seconds')
Run the example.
Record how long it takes.
Now we can initialize the same array in parallel using a thread pool with 4 worker threads.
# SuperFastPython.com
# initialize a numpy array in parallel with threads
import concurrent.futures
import time
import numpy

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# record the start time
time_start = time.perf_counter()
# create an empty array
n = 50000
data = numpy.empty((n,n))
# create the thread pool
with concurrent.futures.ThreadPoolExecutor(4) as exe:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = exe.submit(fill_subarray, coords, data, 1)
# calculate the duration
time_duration = time.perf_counter() - time_start
# report the duration
print(f'Took {time_duration:.3f} seconds')
Run the example.
Compare the run time to the single-threaded version above.
Vary the number of worker threads and number of chunks the array is divided into and see what gives the best result.
Let me know in the comments what you discover.
You can learn more about multithreaded array filling in the tutorial:
Lesson 06: Combine BLAS and Python Threads
We can combine BLAS threads with Python threads in NumPy programs.
Combining these two levels of parallelism can help us fully utilize our CPU cores for a given application and achieve a speed-up compared to using only one level of parallelism, or none at all.
It can be tricky, as combining Python threads with NumPy BLAS threads can sometimes result in no benefit, or even worse performance.
Consider a program that performs some operation on pairs of NumPy arrays, but must perform the same task on many pairs of arrays.
The operation we wish to perform can be parallelized with BLAS threads, e.g. matrix multiplication.
Additionally, we can parallelize the tasks themselves, e.g. with worker threads, each handling a different pair of arrays.
What configuration of BLAS threads and worker threads should we use?
Some examples include:
- No worker threads and no BLAS threads.
- No worker threads and a number of BLAS threads.
- A number of worker threads and no BLAS threads.
- A number of worker threads and a number of BLAS threads.
We cannot know which configuration will work best for a given application on a given system.
Instead, it is a good idea to test a suite of configurations in order to discover what works best.
The example below executes 100 tasks using worker threads where each task is completed using a BLAS function that is multithreaded.
# SuperFastPython.com
# example that combines a thread pool and blas threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
import concurrent.futures
import time
import numpy

# function that defines a single task
def task():
    # simulate loading arrays
    item1 = numpy.random.rand(2000, 2000)
    item2 = numpy.random.rand(2000, 2000)
    # perform multithreaded operation
    result = item1.dot(item2)
    # return the result
    return result

# record the start time
time_start = time.perf_counter()
# create thread pool
with concurrent.futures.ThreadPoolExecutor(2) as exe:
    # issue tasks and gather results
    results = [exe.submit(task) for _ in range(100)]
# calculate the duration
time_duration = time.perf_counter() - time_start
# report the duration
print(f'Took {time_duration:.3f} seconds')
Run the example.
Vary the number of BLAS threads and worker threads and discover a configuration that gives the best results.
Let me know in the comments what you discover.
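If you would rather search configurations in a single run, here is a sketch that grids over BLAS threads and worker threads; it assumes the third-party threadpoolctl package is installed, since OMP_NUM_THREADS cannot be changed reliably after NumPy is imported.
# SuperFastPython.com
# sketch: grid over blas threads and worker threads in one process
from threadpoolctl import threadpool_limits
import concurrent.futures
import time
import numpy

# function that defines a single task
def task():
    # simulate loading arrays
    item1 = numpy.random.rand(2000, 2000)
    item2 = numpy.random.rand(2000, 2000)
    # perform multithreaded operation
    return item1.dot(item2)

# try each combination of blas threads and worker threads
for blas_threads in [1, 2, 4]:
    for workers in [1, 2, 4]:
        with threadpool_limits(limits=blas_threads, user_api='blas'):
            start = time.perf_counter()
            with concurrent.futures.ThreadPoolExecutor(workers) as exe:
                futures = [exe.submit(task) for _ in range(20)]
                _ = [f.result() for f in futures]
            duration = time.perf_counter() - start
        print(f'blas={blas_threads} workers={workers}: {duration:.3f} seconds')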
You can learn more about combining BLAS threads and Python threads in the tutorial:
Lesson 07: Share NumPy array with ctypes RawArray
Python offers process-based concurrency via the multiprocessing module.
Process-based concurrency is appropriate for those tasks that are CPU-bound.
Consider the situation where we need to share NumPy arrays between processes. This may be for many reasons, such as when data is loaded as an array in one process and analyzed in different ways by different child processes.
Sharing Python objects and data between processes is slow.
This is because any data shared between processes, like NumPy arrays, must be transmitted using inter-process communication (IPC), requiring that the data first be pickled by the sender and then unpickled by the receiver.
One approach to sharing NumPy arrays between processes is to use shared memory.
Python processes do not have shared memory. Instead, processes must simulate shared memory.
We can create a block of shared memory and use it as a buffer to back a NumPy array.
This will allow us to share the NumPy array directly and freely between processes.
It is also fast as each process is able to read and write directly on the same array of data in main memory.
This can be achieved by first creating a ctypes RawArray large enough to hold our NumPy array.
...
# create the shared array
array = RawArray('d', n)
We can then create a new NumPy array that uses the ctype RawArray as a buffer via the frombuffer() function.
...
# create a new numpy array backed by the raw array
data = frombuffer(array, dtype=double, count=len(array))
And that’s it.
A complete example of creating a RawArray-backed NumPy array and sharing it between Python processes is listed below.
# SuperFastPython.com
# share numpy array between processes via a shared rawarray
import multiprocessing
import numpy

# task executed in a child process
def task(array):
    # create a new numpy array backed by the raw array
    data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
    # check the contents
    print(f'Child {data[:10]}')
    # increment the data
    data[:] += 1
    # confirm change
    print(f'Child {data[:10]}')

# protect the entry point
if __name__ == '__main__':
    # define the size of the numpy array
    n = 10000000
    # create the shared array
    array = multiprocessing.RawArray('d', n)
    # create a new numpy array backed by the raw array
    data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
    # populate the array
    data.fill(1.0)
    # confirm contents of the new array
    print(data[:10], len(data))
    # create a child process
    child = multiprocessing.Process(target=task, args=(array,))
    # start the child process
    child.start()
    # wait for the child process to complete
    child.join()
    # check some data in the shared array
    print(data[:10])
Run the example.
Vary the size of the array and the operations performed on the array in each process.
Let me know what you discover.
You can learn more about developing a RawArray-backed NumPy array in the tutorial:
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Thank You
Thank you for letting me help you learn more about concurrent NumPy.
If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.
If you enjoyed this course and want to know more, check out my new book:
This book distills only what you need to know to get started and be effective with concurrent NumPy, super fast.
The book is divided into 32 chapters spread across 4 parts covering a suite of topics such as BLAS threads, thread pools, and inter-process array sharing.
Everything you need to get started, then get really good at concurrent NumPy, all in one book.
It’s exactly how I would teach you concurrent NumPy if we were sitting together, pair programming.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Did you enjoy this course?
Let me know in the comments below.