Concurrent NumPy 7-Day Course

October 4, 2023 Concurrent NumPy

NumPy can be so much faster with the careful use of concurrency.

This includes built-in multithreaded algorithms via BLAS threads, the use of thread pools to execute NumPy functions in parallel, and efficient shared memory methods for sharing arrays between Python processes.

This course provides you with a 7-day crash course in concurrency for NumPy.

You will get a taste of what is possible and hopefully, skills that you can bring to your next project.

Let's get started.

Concurrent NumPy Overview

Hi, thanks for your interest in Concurrent NumPy.

This course is divided into 7 lessons. You can complete them at your own pace, but it might be fun to bookmark this page and complete one per day.

You have a lot of fun ahead, including:

Lesson 01: Faster NumPy With Concurrency (today)
Lesson 02: Configure NumPy BLAS Threads
Lesson 03: Multithreaded NumPy Functions
Lesson 04: NumPy Releases the GIL
Lesson 05: Faster NumPy Tasks with Thread Pool
Lesson 06: Combine BLAS and Python Threads
Lesson 07: Share NumPy array with ctype RawArray

You can download a .zip of all the source code for this course here:

Download the source code for this course

Your first lesson is just below.

Before you dive in, a quick question:

Why are you interested in Concurrent NumPy?
Let me know in the comments.

Lesson 01: Faster NumPy With Concurrency

Concurrency in NumPy was not an afterthought.

NumPy is how we represent arrays of numbers in Python.

It is so widely used that it is the center of an ecosystem with many libraries built on top of it such as SciPy, Pandas, scikit-learn, Matplotlib, TensorFlow, and so much more.

NumPy programs can be made much faster by adding concurrency.

There are 4 key aspects to NumPy concurrency:

Many NumPy algorithms are already multithreaded.
Many NumPy functions release the GIL.
Many NumPy tasks can be accelerated with thread pools.
NumPy arrays can be shared efficiently between processes.

Let's consider each of these aspects in turn.

1. Faster NumPy With BLAS Threads

BLAS and LAPACK are general specifications for linear algebra math functions.

Many free or open-source libraries implement these specs, such as OpenBLAS, MKL, ATLAS, and more.

We don't need to know much about these libraries other than they are really fast!

They are implemented in C and FORTRAN under the covers and use platform-specific tricks and CPU instructions to make code faster.

They also implement functions using multithreaded algorithms.

Knowing about BLAS and how to configure the number of BLAS threads that NumPy will use in our programs can offer a big speed-up.

2. NumPy Functions Release The GIL

The reference Python interpreter (CPython) uses a Global Interpreter Lock, or GIL.

This ensures that the Python interpreter is thread-safe, but prevents more than one Python thread from running at a time, while the lock is held.

Importantly, NumPy will release the GIL while performing most operations on arrays.

This means that we can execute multiple Python threads concurrently and achieve full parallelism.

3. Faster NumPy With Thread Pools

Because most NumPy functions release the GIL, we can parallelize NumPy tasks with Python threads.

Many NumPy array operations are applied element-wise.

This means the same mathematical operation is applied to each cell of an array independently.

We can divide these element-wise operations into sub-tasks and execute them in parallel using a pool of worker threads, such as with the ThreadPool or ThreadPoolExecutor class.

4. Faster NumPy By Efficient Array Sharing

Sharing data between Python processes is typically very slow.

It is slow because data needs to be shared using inter-process communication methods, where data is pickled by the sender, transmitted, and unpickled by the receiver.
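To get a feel for this cost, the sketch below pickles and unpickles an array within a single process, which is roughly the work done by the sender and the receiver. The array size is an arbitrary assumption, and the time to transmit the data between processes is not included.

```python
# sketch: measure the pickle round-trip cost for a numpy array
import pickle
import time
import numpy

# an arbitrary array of about 80 MB (10 million float64 values)
data = numpy.ones(10_000_000)
# time serialization followed by deserialization
start = time.perf_counter()
payload = pickle.dumps(data)
copy = pickle.loads(payload)
duration = time.perf_counter() - start
# report the payload size and the round-trip time
print(f'Pickled {len(payload) / 1e6:.1f} MB in {duration:.3f} seconds')
```

Try a few array sizes and note how the round-trip time grows with the size of the array.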

Luckily, there are many different ways to share NumPy arrays between Python processes. At least 9 methods by my count.

Some methods, such as process inheritance, can be very fast even though they involve making a copy of the array as part of sharing it.

We now know some of the key aspects of how we can develop faster NumPy programs with concurrency.

In the following lessons, we will look at worked examples.

Lesson 02: Configure NumPy BLAS Threads

NumPy makes use of third-party libraries to perform array functions efficiently.

One example is the BLAS specification used by NumPy to execute vector, matrix, and linear algebra operations.

A BLAS library is installed automatically when we install NumPy. A common example is the OpenBLAS library.

We can check which BLAS library is installed on our system using the numpy.show_config() function.

For example:

# SuperFastPython.com
# report the details of the blas library
import numpy
numpy.show_config()

Run the example.

Let me know in the comments which BLAS library you have installed.

Many algorithms available in the BLAS library are multithreaded.

This means that many of the NumPy functions we use will be multithreaded and executed in parallel automatically. We will look at some examples in the next lesson.

We can configure the number of threads used by BLAS NumPy functions. By default, BLAS may use one thread per physical CPU core or one thread per logical CPU core.

Tuning the number of threads in our program can result in better performance in many cases.

We can set the number of threads used by BLAS via the OMP_NUM_THREADS environment variable.

This variable can be set via the os.environ mapping at the beginning of our program, before NumPy is imported.

For example:

...
# set the number of numpy blas threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
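Which environment variable takes effect can depend on the BLAS library and how it was built. The sketch below sets three commonly used variables at once as a precaution. The helper name set_blas_threads() is hypothetical and is only used for illustration.

```python
# sketch: set the common blas thread variables before importing numpy
import os

# hypothetical helper, the name is an assumption for illustration
def set_blas_threads(n):
    # different blas builds read different variables, so set the common ones
    for name in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS'):
        os.environ[name] = str(n)

# configure the threads first, then import numpy
set_blas_threads(4)
import numpy
# report the configured value and the number of logical cpu cores
print(os.environ['OMP_NUM_THREADS'], os.cpu_count())
```

The os.cpu_count() function reports logical CPU cores, which can be a helpful upper limit when choosing the number of threads to try.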

You can learn more about configuring the number of BLAS threads in the tutorial:

Configure the Number of BLAS/LAPACK Threads for NumPy

Lesson 03: Multithreaded NumPy Functions

Many NumPy functions are already multithreaded.

Some commonly used examples include:

Matrix multiplication with the dot() function.
Matrix solvers such as the numpy.linalg.solve() function.
Matrix decompositions such as the numpy.linalg.svd() function.
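The three functions listed above can be tried together on small arrays. This is a minimal sketch; the array size and the random seed are arbitrary assumptions, and whether multiple threads are actually used depends on the installed BLAS library.

```python
# sketch: call three blas/lapack backed functions on small arrays
import numpy

# fixed seed so the run is repeatable
rng = numpy.random.default_rng(1)
a = rng.random((200, 200))
b = rng.random((200, 200))
# matrix multiplication
product = a.dot(b)
# solve the linear system a.dot(x) = b
x = numpy.linalg.solve(a, b)
# singular value decomposition of a
u, s, vt = numpy.linalg.svd(a)
# check that each result is consistent with its inputs
print(numpy.allclose(a.dot(x), b, atol=1e-6))
print(numpy.allclose((u * s).dot(vt), a, atol=1e-6))
```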

You can see a list of all NumPy functions that are multithreaded here:

Which NumPy Functions Are Multithreaded

In the previous lesson, we learned about BLAS libraries and how to configure the number of BLAS threads via the OMP_NUM_THREADS environment variable.

In this lesson, we are going to perform matrix multiplication and benchmark how long it takes with different numbers of BLAS threads.

The complete program is listed below.

# SuperFastPython.com
# example of multithreaded matrix multiplication
import os
os.environ['OMP_NUM_THREADS'] = '2'
import time
import numpy
# record the start time
start = time.perf_counter()
# create an array of random values
data1 = numpy.random.rand(8000, 8000)
data2 = numpy.random.rand(8000, 8000)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time.perf_counter() - start
print(f'Took {duration:.3f} seconds')

Try running the example.

Then re-run the same example again with a different number of BLAS threads from 1 to the number of physical or logical CPU cores in your system.
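Re-running by hand can be automated. The sketch below is one possible approach, not the only one: because the environment variable needs to be set before NumPy is imported, each thread count is run in a fresh Python process. The smaller matrix size, the list of thread counts, and the benchmark() helper are assumptions to keep the runs short, and only the multiplication itself is timed.

```python
# sketch: benchmark matrix multiplication with different blas thread counts
import os
import subprocess
import sys

# program run in a fresh interpreter for each thread count
# (2000 x 2000 is an arbitrary smaller size to keep the runs short)
PROGRAM = '''
import time
import numpy
data1 = numpy.random.rand(2000, 2000)
data2 = numpy.random.rand(2000, 2000)
start = time.perf_counter()
result = data1.dot(data2)
print(time.perf_counter() - start)
'''

# run the program with the given number of blas threads, return seconds
def benchmark(n_threads):
    # copy the current environment and override the thread count
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    proc = subprocess.run([sys.executable, '-c', PROGRAM], env=env,
        capture_output=True, text=True, check=True)
    return float(proc.stdout.strip())

# try a few thread counts, up to the number of logical cpu cores
counts = [n for n in (1, 2, 4, 8) if n <= (os.cpu_count() or 1)]
for n_threads in counts:
    print(f'{n_threads} threads: {benchmark(n_threads):.3f} seconds')
```

Timings vary from run to run, so it can help to repeat each configuration a few times before drawing conclusions.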

Let me know in the comments the number of threads that give the best result on your system.

In some cases, we can get a 5x speed-up for matrix multiplication with a well-configured number of BLAS threads.

You can learn more about multithreaded matrix multiplication in the tutorial:

Numpy Multithreaded Matrix Multiplication (up to 5x faster)

Lesson 04: NumPy Releases the GIL

The internals of the Python interpreter are not thread-safe.

As such, the Python interpreter makes use of a Global Interpreter Lock, or GIL for short, to make instructions executed by the Python interpreter (called Python bytecodes) thread-safe.

The effect of the GIL is that whenever a thread within a Python program wants to run, it must acquire the lock before executing. This is not a problem for most Python programs that have a single thread of execution, called the main thread.

It can become a problem in multithreaded Python programs.

The GIL does not affect BLAS threads that we saw in the previous lesson, because the BLAS threads are run outside of Python, down in an external library like OpenBLAS.

The GIL will affect our normal Python programs, meaning only one thread can run at a time.

Unless the GIL is released.

Importantly, NumPy will release the GIL when performing most operations on arrays.

This includes many operations, such as:

Methods on the ndarray objects, like sum()
Arithmetic operators such as +, -, *, / and more.
Math functions such as power(), sqrt(), and more.

The API documentation suggests that generally, all operations on arrays will release the GIL, except those that are operating upon arrays of Python objects.

This means that we can execute NumPy functions in multiple Python threads and achieve full parallelism.

This can offer a significant speed-up, especially in programs where we need to perform the same mathematical operation on many arrays.

This also impacts the many open-source libraries that make use of NumPy arrays, such as SciPy, scikit-learn, pandas, and more.
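As a rough illustration, the sketch below applies the same math function to several arrays, with one Python thread per array. The number and size of the arrays are arbitrary assumptions.

```python
# sketch: run the same numpy function on many arrays using python threads
import threading
import time
import numpy

# task that applies a math function to one array, writing in place
def task(data):
    numpy.sqrt(data, out=data)

# arbitrary workload: 4 arrays of 5 million values each
arrays = [numpy.full(5_000_000, 4.0) for _ in range(4)]
start = time.perf_counter()
# create and start one thread per array
threads = [threading.Thread(target=task, args=(a,)) for a in arrays]
for thread in threads:
    thread.start()
# wait for all threads to finish
for thread in threads:
    thread.join()
duration = time.perf_counter() - start
print(f'Threads took {duration:.3f} seconds')
```

For comparison, call task() on each array in a simple loop and time it in the same way. Any speed-up will depend on the size of the arrays and the system.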

You can learn more about NumPy releasing the GIL in the tutorial:

NumPy vs the Global Interpreter Lock (GIL)

Lesson 05: Faster NumPy Tasks with Thread Pool

We can use a Python thread pool to perform operations on a NumPy array in parallel.

This can offer a dramatic speedup.

We learned in the previous lesson that most NumPy functions will release the GIL, allowing multiple Python threads that are calling NumPy functions to achieve full parallelism.

This capability can be used to multithread many element-wise operations on NumPy arrays, such as:

Filling, initializing, and populating NumPy arrays.
Array arithmetic on NumPy arrays, such as addition and subtraction.
Math functions on arrays such as log() and sqrt().

This can be achieved by splitting up the array into chunks and issuing each chunk to a different worker thread.

Each worker thread can then perform the operation on the chunk in parallel, allowing the overall duration of the task to be reduced by 2x or 3x.
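The chunking idea can be sketched for a math function as follows. This is a minimal sketch, assuming a one-dimensional array and 4 chunks; the sqrt_chunk() helper is hypothetical. The chunks are views of the larger arrays, so each worker writes its result directly into the output array.

```python
# sketch: apply an element-wise math function in chunks with a thread pool
import concurrent.futures
import numpy

# apply sqrt to one chunk, writing into the matching chunk of the output
def sqrt_chunk(source, target):
    numpy.sqrt(source, out=target)

# arbitrary input array and a pre-allocated output array
data = numpy.full(8_000_000, 9.0)
result = numpy.empty_like(data)
# split both arrays into the same number of chunks (views, not copies)
n_chunks = 4
source_chunks = numpy.array_split(data, n_chunks)
target_chunks = numpy.array_split(result, n_chunks)
# issue one task per chunk
with concurrent.futures.ThreadPoolExecutor(n_chunks) as exe:
    futures = [exe.submit(sqrt_chunk, s, t)
        for s, t in zip(source_chunks, target_chunks)]
    # calling result() waits for the task and re-raises any exception
    for future in futures:
        future.result()
# report a few values from the output array
print(result[:5])
```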

Consider the case of initializing a large NumPy array, such as with 1.0 values.

We can use the numpy.ones() function directly.

For example:

# SuperFastPython.com
# initialize a numpy array slowly
import time
import numpy
# record the start time
time_start = time.perf_counter()
# create a new matrix and fill with 1 (about 20 GB as float64, reduce the size if memory is limited)
data = numpy.ones((50000,50000))
# calculate the duration
time_duration = time.perf_counter() - time_start
# report progress
print(f'Took {time_duration:.3f} seconds')

Run the example.

Record how long it takes.

Now we can initialize the same array in parallel using a thread pool with 4 worker threads.

# SuperFastPython.com
# initialize a numpy array in parallel with threads
import concurrent.futures
import time
import numpy

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# record the start time
time_start = time.perf_counter()
# create an empty array
n = 50000
data = numpy.empty((n,n))
# create the thread pool
with concurrent.futures.ThreadPoolExecutor(4) as exe:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = exe.submit(fill_subarray, coords, data, 1)
# calculate the duration
time_duration = time.perf_counter() - time_start
# report the duration
print(f'Took {time_duration:.3f} seconds')

Run the example.

Compare the run time to the single-threaded version above.

Vary the number of worker threads and number of chunks the array is divided into and see what gives the best result.
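One way to run this experiment is to wrap the fill in a function that takes the number of workers and the number of splits per dimension as arguments. The sketch below assumes a much smaller array so that it runs quickly; the names parallel_fill() and fill_block() are hypothetical.

```python
# sketch: fill an array in parallel with configurable workers and splits
import concurrent.futures
import time
import numpy

# fill one block of the larger array with a value
def fill_block(data, rows, cols, value):
    data[rows, cols].fill(value)

# create an n x n array and fill it using a thread pool
def parallel_fill(n, value, n_workers, n_splits):
    data = numpy.empty((n, n))
    # block size per dimension, rounded up so the whole array is covered
    size = -(-n // n_splits)
    with concurrent.futures.ThreadPoolExecutor(n_workers) as exe:
        for x in range(0, n, size):
            for y in range(0, n, size):
                # slices past the end of the array are clipped by numpy
                exe.submit(fill_block, data,
                    slice(x, x + size), slice(y, y + size), value)
    # the pool has finished all tasks by the time the with block exits
    return data

# arbitrary small grid of configurations to compare
for n_workers in (1, 2, 4):
    for n_splits in (1, 2, 4):
        start = time.perf_counter()
        data = parallel_fill(4000, 1.0, n_workers, n_splits)
        duration = time.perf_counter() - start
        print(f'workers={n_workers} splits={n_splits}: {duration:.3f} seconds')
```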

Let me know in the comments what you discover.

You can learn more about multithreaded array filling in the tutorial:

ThreadPoolExecutor Fill NumPy Array

Lesson 06: Combine BLAS and Python Threads

We can combine BLAS threads with Python threads in NumPy programs.

Maximizing these types of parallelism can help us fully utilize our CPU cores for a given application and achieve a speed-up compared to not using any or only one level of parallelism.

It can be tricky as sometimes combining threading with NumPy BLAS threads can result in no benefit or even worse performance.

Consider a program that performs some operation on pairs of NumPy arrays, but must perform the same task on many pairs of arrays.

The operation we wish to perform can be parallelized with BLAS threads, e.g. matrix multiplication.

Additionally, we can parallelize the tasks themselves, e.g. worker threads handling multiple pairs of arrays on which to perform the operation.

What configuration of BLAS threads and worker threads should we use?

Some examples include:

No worker threads and no BLAS threads.
No worker threads and a number of BLAS threads.
A number of worker threads and no BLAS threads.
A number of worker threads and a number of BLAS threads.

We cannot know which configuration will work best for a given application on a given system.

Instead, it is a good idea to test a suite of configurations in order to discover what works best.
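A suite of configurations could be tested with something like the sketch below. It is only one possible approach: each configuration runs in a fresh Python process so that the BLAS thread count is set before NumPy is imported. The task size, the number of tasks, the grid of values, and the run_config() helper are all assumptions chosen to keep the runs short.

```python
# sketch: time combinations of blas threads and worker threads
import os
import subprocess
import sys

# program run in a fresh interpreter, worker count passed as an argument
PROGRAM = '''
import sys
import time
import concurrent.futures
import numpy

def task():
    item1 = numpy.random.rand(500, 500)
    item2 = numpy.random.rand(500, 500)
    return item1.dot(item2)

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(int(sys.argv[1])) as exe:
    futures = [exe.submit(task) for _ in range(20)]
print(time.perf_counter() - start)
'''

# run one configuration and return the duration in seconds
def run_config(blas_threads, worker_threads):
    # copy the current environment and override the blas thread count
    env = dict(os.environ, OMP_NUM_THREADS=str(blas_threads))
    proc = subprocess.run(
        [sys.executable, '-c', PROGRAM, str(worker_threads)],
        env=env, capture_output=True, text=True, check=True)
    return float(proc.stdout.strip())

# arbitrary small grid, where 1 thread stands in for "no" threads
for blas_threads in (1, 2):
    for worker_threads in (1, 2):
        duration = run_config(blas_threads, worker_threads)
        print(f'blas={blas_threads} workers={worker_threads}: {duration:.3f} seconds')
```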

The example below executes 100 tasks using worker threads where each task is completed using a BLAS function that is multithreaded.

# SuperFastPython.com
# example that combines a thread pool and blas threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
import concurrent.futures
import time
import numpy

# function that defines a single task
def task():
    # simulate loading arrays
    item1 = numpy.random.rand(2000, 2000)
    item2 = numpy.random.rand(2000, 2000)
    # perform multithreaded operation
    result = item1.dot(item2)
    # return the result
    return result

# record the start time
time_start = time.perf_counter()
# create thread pool
with concurrent.futures.ThreadPoolExecutor(2) as exe:
    # issue tasks (the list holds Future objects, not arrays)
    results = [exe.submit(task) for _ in range(100)]
# calculate the duration
time_duration = time.perf_counter() - time_start
# report the duration
print(f'Took {time_duration:.3f} seconds')

Run the example.

Vary the number of BLAS threads and worker threads and discover a configuration that gives the best results.

Let me know in the comments what you discover.
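Note that submit() returns a Future object for each task rather than the array itself. If we need the arrays, we can call result() on each future. The sketch below shows this with smaller, arbitrarily sized arrays.

```python
# sketch: collect the arrays returned by tasks in a thread pool
import concurrent.futures
import numpy

# task that returns an array (arbitrary small size)
def task():
    item1 = numpy.random.rand(200, 200)
    item2 = numpy.random.rand(200, 200)
    return item1.dot(item2)

with concurrent.futures.ThreadPoolExecutor(2) as exe:
    # submit() returns Future objects, not arrays
    futures = [exe.submit(task) for _ in range(10)]
    # result() blocks until the task is done and returns its array
    results = [future.result() for future in futures]
# report the number of arrays and the shape of the first
print(len(results), results[0].shape)
```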

You can learn more about combining BLAS threads and Python threads in the tutorial:

Speed-Up NumPy With Threads in Python (up to 3.41x faster)

Lesson 07: Share NumPy array with ctype RawArray

Python offers process-based concurrency via the multiprocessing module.

Process-based concurrency is appropriate for those tasks that are CPU-bound.

Consider the situation where we need to share NumPy arrays between processes. This may be for many reasons, such as when data is loaded as an array in one process and analyzed differently in different subprocesses.

Sharing Python objects and data between processes is slow.

This is because any data, like NumPy arrays, shared between processes must be transmitted using inter-process communication (IPC), which requires that the data first be pickled by the sender and then unpickled by the receiver.

One approach to sharing NumPy arrays between processes is to use shared memory.

Python processes do not have shared memory. Instead, processes must simulate shared memory.

We can create a block of shared memory and use it as a buffer to back a NumPy array.

This will allow us to share the NumPy array directly and freely between processes.

It is also fast as each process is able to read and write directly on the same array of data in main memory.

This can be achieved by first creating a ctypes RawArray large enough to hold our NumPy array.

...
# create the shared array
array = RawArray('d', n)

We can then create a new NumPy array that uses the ctypes RawArray as a buffer via the frombuffer() function.

...
# create a new numpy array backed by the raw array
data = frombuffer(array, dtype=double, count=len(array))

And that's it.
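The two steps above can be checked within a single process before any child processes are involved. The sketch below, which uses an arbitrary tiny array, suggests that the NumPy array and the RawArray share the same memory: a write through one is visible through the other.

```python
# sketch: confirm a numpy array and its rawarray share the same memory
import multiprocessing
import numpy

# create a shared array of 6 doubles (arbitrary small size)
n = 6
array = multiprocessing.RawArray('d', n)
# wrap the shared array without copying it
data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
# write through the numpy array
data.fill(1.0)
data[0] = 5.0
# the change is visible through the raw array
print(array[0], array[1])
# the same buffer can also be viewed as a 2d matrix
matrix = data.reshape(2, 3)
matrix[1, :] = 2.0
print(list(array))
```

The RawArray is one-dimensional, so a reshape like this is one way to work with matrices that are backed by shared memory.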

A complete example of creating a RawArray-backed NumPy array and sharing it between Python processes is listed below.

# SuperFastPython.com
# share numpy array between processes via a shared rawarray
import multiprocessing
import numpy

# task executed in a child process
def task(array):
    # create a new numpy array backed by the raw array
    data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
    # check the contents
    print(f'Child {data[:10]}')
    # increment the data
    data[:] += 1
    # confirm change
    print(f'Child {data[:10]}')

# protect the entry point
if __name__ == '__main__':
    # define the size of the numpy array
    n = 10000000
    # create the shared array
    array = multiprocessing.RawArray('d', n)
    # create a new numpy array backed by the raw array
    data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
    # populate the array
    data.fill(1.0)
    # confirm contents of the new array
    print(data[:10], len(data))
    # create a child process
    child = multiprocessing.Process(target=task, args=(array,))
    # start the child process
    child.start()
    # wait for the child process to complete
    child.join()
    # check some data in the shared array
    print(data[:10])

Run the example.

Vary the size of the array and the operations performed on the array in each process.

Let me know what you discover.

You can learn more about developing a RawArray-backed NumPy array in the tutorial:

Share a Numpy Array Between Processes Backed By RawArray

Thank You

Thank you for letting me help you learn more about concurrent NumPy.

If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.

If you enjoyed this course and want to know more, check out my new book:

Concurrent NumPy In Python

This book distills only what you need to know to get started and be effective with concurrent NumPy, super fast.

The book is divided into 32 chapters spread across 4 parts covering a suite of topics such as BLAS threads, thread pools, and inter-process array sharing.

Everything you need to get started, then get really good at concurrent NumPy, all in one book.

It's exactly how I would teach you concurrent NumPy if we were sitting together, pair programming.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Did you enjoy this course?
Let me know in the comments below.
