NumPy can be so much faster with the careful use of concurrency.
This includes built-in multithreaded algorithms via BLAS threads, the use of thread pools to execute NumPy functions in parallel, and efficient shared memory methods for sharing arrays between Python processes.
This course provides you with a 7-day crash course in concurrency for NumPy.
You will get a taste of what is possible and hopefully, skills that you can bring to your next project.
Let’s get started.
Concurrent NumPy Overview
Hi, thanks for your interest in Concurrent NumPy.
This course is divided into 7 lessons. You can complete them at your own pace, but it might be fun to bookmark this page and complete one per day.
You have a lot of fun ahead, including:
- Lesson 01: Faster NumPy With Concurrency (today)
- Lesson 02: Configure NumPy BLAS Threads
- Lesson 03: Multithreaded NumPy Functions
- Lesson 04: NumPy Releases the GIL
- Lesson 05: Faster NumPy Tasks with Thread Pool
- Lesson 06: Combine BLAS and Python Threads
- Lesson 07: Share NumPy array with ctypes RawArray
You can download a .zip of all the source code for this course here:
Your first lesson is just below.
Before you dive in, a quick question:
Why are you interested in Concurrent NumPy?
Let me know in the comments.
Lesson 01: Faster NumPy With Concurrency
Concurrency in NumPy was not an afterthought.
NumPy is how we represent arrays of numbers in Python.
It is so widely used that it is the center of an ecosystem with many libraries built on top of it such as SciPy, Pandas, scikit-learn, Matplotlib, TensorFlow, and so much more.
NumPy programs can be made much faster by adding concurrency.
There are four key aspects to NumPy concurrency:
- Many NumPy algorithms are already multithreaded.
- Many NumPy functions release the GIL.
- Many NumPy tasks can be accelerated with thread pools.
- NumPy arrays can be shared efficiently between processes.
Let’s consider each of these aspects in turn.
1. Faster NumPy With BLAS Threads
BLAS and LAPACK are general specifications for linear algebra math functions.
Many free or open-source libraries implement these specs, such as OpenBLAS, MKL, ATLAS, and more.
We don’t need to know much about these libraries other than that they are really fast!
They are implemented in C and FORTRAN under the covers and use platform-specific tricks and CPU instructions to make code faster.
They also implement functions using multithreaded algorithms.
Knowing about BLAS and how to configure the number of BLAS threads that NumPy will use in our programs can offer a big speed-up.
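As a quick taste, the sketch below caps the number of BLAS threads for a single operation. It assumes the third-party threadpoolctl package is installed (pip install threadpoolctl); we will see the simpler environment-variable approach in Lesson 02.
# SuperFastPython.com
# sketch: cap blas threads for one operation via threadpoolctl
from threadpoolctl import threadpool_limits
import numpy
# create two modest arrays of random values
data1 = numpy.random.rand(2000, 2000)
data2 = numpy.random.rand(2000, 2000)
# limit the blas library to 2 threads for this block only
with threadpool_limits(limits=2, user_api='blas'):
    result = data1.dot(data2)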
2. NumPy Functions Release The GIL
The Python interpreter is implemented using a Global Interpreter Lock or GIL.
This keeps the Python interpreter thread-safe, but it prevents more than one Python thread from running at a time while the lock is held.
Importantly, NumPy will release the GIL.
This means that we can execute multiple Python threads concurrently and achieve full parallelism.
3. Faster NumPy With Thread Pools
Because most NumPy functions release the GIL, we can parallelize NumPy tasks with Python threads.
Many NumPy array operations are applied element-wise.
This means the same mathematical operation is applied to each cell of an array independently.
We can divide these element-wise operations into sub-tasks and execute them in parallel using a pool of worker threads, such as with the multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor classes.
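For example, here is a minimal sketch that applies numpy.sqrt to the two halves of a large array in separate worker threads; the array size and the two-way split are arbitrary choices for illustration.
# SuperFastPython.com
# sketch: apply an element-wise operation to array halves in parallel
from concurrent.futures import ThreadPoolExecutor
import numpy
# create a large array of random values
data = numpy.random.rand(50000000)
# split the array into two halves
halves = [data[:len(data)//2], data[len(data)//2:]]
# apply sqrt to each half in a separate worker thread
with ThreadPoolExecutor(2) as exe:
    results = list(exe.map(numpy.sqrt, halves))
# stitch the partial results back together
result = numpy.concatenate(results)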
4. Faster NumPy By Efficient Array Sharing
Sharing data between Python processes is typically very slow.
It is slow because data needs to be shared using inter-process communication methods, where data is pickled by the sender, transmitted, and unpickled by the receiver.
Luckily, there are many different ways to share NumPy arrays between Python processes. At least 9 methods by my count.
Some methods do not involve making a copy of the array as part of sharing it and can be very fast, such as sharing via process inheritance.
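Another such method, sketched below, backs a NumPy array with a multiprocessing shared memory block (Python 3.8+) so that no pickling is needed; Lesson 07 walks through a different method based on RawArray.
# SuperFastPython.com
# sketch: back a numpy array with a shared memory block (python 3.8+)
from multiprocessing import shared_memory
import numpy
# allocate a shared memory block large enough for 1 million doubles
shm = shared_memory.SharedMemory(create=True, size=1000000 * 8)
# wrap the block in a numpy array, no copy is made
data = numpy.ndarray((1000000,), dtype=numpy.double, buffer=shm.buf)
data.fill(1.0)
# child processes can attach via shared_memory.SharedMemory(name=shm.name)
# drop the numpy view before releasing the block
del data
shm.close()
shm.unlink()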
We now know some of the key aspects of how we can develop faster NumPy programs with concurrency.
In the following lessons, we will look at worked examples.
Lesson 02: Configure NumPy BLAS Threads
NumPy makes use of third-party libraries to perform array functions efficiently.
One example is the BLAS specification used by NumPy to execute vector, matrix, and linear algebra operations.
A BLAS library is installed automatically when we install NumPy. A common example is the OpenBLAS library.
We can check which BLAS library is installed on our system using the numpy.show_config() function.
For example:
# SuperFastPython.com
# report the details of the blas library
import numpy
numpy.show_config()
Run the example.
Let me know in the comments which BLAS library you have installed.
Many algorithms available in the BLAS library are multithreaded.
This means that many of the NumPy functions we use will be multithreaded and executed in parallel automatically. We will look at some examples in the next lesson.
We can configure the number of threads used by BLAS NumPy functions. By default, BLAS may use one thread per physical CPU core or one thread per logical CPU core.
Tuning the number of threads in our program can result in better performance in many cases.
We can set the number of threads used by BLAS via the OMP_NUM_THREADS environment variable.
This variable can be set via the os.environ mapping at the beginning of our program, before NumPy is imported.
For example:
...
# set the number of numpy blas threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
You can learn more about configuring the number of BLAS threads in the tutorial:
Lesson 03: Multithreaded NumPy Functions
Many NumPy functions are already multithreaded.
Some commonly used examples include:
- Matrix multiplication with the dot() function.
- Matrix solvers such as the numpy.linalg.solve() function.
- Matrix decompositions such as the numpy.linalg.svd() function.
You can see a list of all NumPy functions that are multithreaded here:
In the previous lesson, we learned about BLAS libraries and how to configure the number of BLAS threads via the OMP_NUM_THREADS environment variable.
In this lesson, we are going to perform matrix multiplication and benchmark how long it takes with different numbers of BLAS threads.
The complete program is listed below.
# SuperFastPython.com
# example of multithreaded matrix multiplication
import os
os.environ['OMP_NUM_THREADS'] = '2'
import time
import numpy
# record the start time
start = time.perf_counter()
# create an array of random values
data1 = numpy.random.rand(8000, 8000)
data2 = numpy.random.rand(8000, 8000)
# matrix multiplication
result = data1.dot(data2)
# calculate and report duration
duration = time.perf_counter() - start
print(f'Took {duration:.3f} seconds')
Try running the example.
Then re-run the same example again with a different number of BLAS threads from 1 to the number of physical or logical CPU cores in your system.
Let me know in the comments the number of threads that give the best result on your system.
In some cases, we can get a 5x speed-up for matrix multiplication with a well-configured number of BLAS threads.
You can learn more about multithreaded matrix multiplication in the tutorial:
Lesson 04: NumPy Releases the GIL
The internals of the Python interpreter are not thread-safe.
As such, the Python interpreter makes use of a Global Interpreter Lock, or GIL for short, to make instructions executed by the Python interpreter (called Python bytecodes) thread-safe.
The effect of the GIL is that whenever a thread within a Python program wants to run, it must acquire the lock before executing. This is not a problem for most Python programs that have a single thread of execution, called the main thread.
It can become a problem in multithreaded Python programs.
The GIL does not affect the BLAS threads we saw in the previous lesson, because BLAS threads run outside of Python, in an external library such as OpenBLAS.
The GIL will affect our normal Python programs, meaning only one thread can run at a time.
Unless the GIL is released.
Importantly, NumPy will release the GIL when performing most operations on arrays.
This includes many operations, such as:
- Methods on the ndarray objects, like sum()
- Arithmetic operators such as +, -, *, / and more.
- Math functions such as power(), sqrt(), and more.
The API documentation suggests that generally, all operations on arrays will release the GIL, except those that are operating upon arrays of Python objects.
This means that we can execute NumPy functions in multiple Python threads and achieve full parallelism.
This can offer a significant speed-up, especially in programs where we need to perform the same mathematical operation on many arrays.
This also impacts the many open-source libraries that make use of NumPy arrays, such as SciPy, scikit-learn, pandas, and more.
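To see this for yourself, here is a minimal sketch that times two numpy.sqrt calls run back to back, then times the same two calls issued from two Python threads. The array size is an arbitrary choice and timings will vary by system.
# SuperFastPython.com
# sketch: the same numpy work, sequential vs two python threads
from concurrent.futures import ThreadPoolExecutor
import time
import numpy
# create a large array of random values
data = numpy.random.rand(50000000)
# sequential: two calls, one after the other
start = time.perf_counter()
_ = numpy.sqrt(data)
_ = numpy.sqrt(data)
print(f'Sequential: {time.perf_counter() - start:.3f} seconds')
# threaded: the same two calls in parallel, possible because sqrt releases the GIL
start = time.perf_counter()
with ThreadPoolExecutor(2) as exe:
    _ = list(exe.map(numpy.sqrt, [data, data]))
print(f'Threaded: {time.perf_counter() - start:.3f} seconds')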
You can learn more about NumPy releasing the GIL in the tutorial:
Lesson 05: Faster NumPy Tasks with Thread Pool
We can use a Python thread pool to perform operations on a NumPy array in parallel.
This can offer a dramatic speedup.
We learned in the previous lesson that most NumPy functions will release the GIL, allowing multiple Python threads that are calling NumPy functions to achieve full parallelism.
This capability can be used to multithread many element-wise operations on NumPy arrays, such as:
- Filling, initializing, and populating NumPy arrays.
- Array arithmetic on NumPy arrays, such as addition and subtraction.
- Math functions on arrays such as log() and sqrt().
This can be achieved by splitting up the array into chunks and issuing each chunk to a different worker thread.
Each worker thread can then perform the operation on its chunk in parallel, allowing the overall task to complete 2x or 3x faster.
Consider the case of initializing a large NumPy array, such as filling it with the value 1.0.
We can use the numpy.ones() function directly.
For example:
# SuperFastPython.com
# initialize a numpy array slowly
import time
import numpy
# record the start time
time_start = time.perf_counter()
# create a new matrix and fill with 1
data = numpy.ones((50000,50000))
# calculate the duration
time_duration = time.perf_counter() - time_start
# report progress
print(f'Took {time_duration:.3f} seconds')
Run the example.
Record how long it takes.
Now we can initialize the same array in parallel using a thread pool with 4 worker threads.
# SuperFastPython.com
# initialize a numpy array in parallel with threads
import concurrent.futures
import time
import numpy

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# record the start time
time_start = time.perf_counter()
# create an empty array
n = 50000
data = numpy.empty((n,n))
# create the thread pool
with concurrent.futures.ThreadPoolExecutor(4) as exe:
    # split each dimension (divisor of matrix dimension)
    split = round(n/2)
    # issue tasks
    for x in range(0, n, split):
        for y in range(0, n, split):
            # determine matrix coordinates
            coords = (x, x+split, y, y+split)
            # issue task
            _ = exe.submit(fill_subarray, coords, data, 1)
# calculate the duration
time_duration = time.perf_counter() - time_start
# report the duration
print(f'Took {time_duration:.3f} seconds')
Run the example.
Compare the run time to the single-threaded version above.
Vary the number of worker threads and number of chunks the array is divided into and see what gives the best result.
Let me know in the comments what you discover.
You can learn more about multithreaded array filling in the tutorial:
Lesson 06: Combine BLAS and Python Threads
We can combine BLAS threads with Python threads in NumPy programs.
Combining these two levels of parallelism can help us fully utilize our CPU cores for a given application and achieve a speed-up compared to using only one level of parallelism, or none at all.
It can be tricky, as combining Python threads with NumPy BLAS threads can sometimes result in no benefit, or even worse performance.
Consider a program that performs some operation on pairs of NumPy arrays, but must perform the same task on many pairs of arrays.
The operation we wish to perform can be parallelized with BLAS threads, e.g. matrix multiplication.
Additionally, we can parallelize the tasks themselves, e.g. with worker threads, each handling a different pair of arrays.
What configuration of BLAS threads and worker threads should we use?
Some examples include:
- No worker threads and no BLAS threads.
- No worker threads and a number of BLAS threads.
- A number of worker threads and no BLAS threads.
- A number of worker threads and a number of BLAS threads.
We cannot know which configuration will work best for a given application on a given system.
Instead, it is a good idea to test a suite of configurations in order to discover what works best.
The example below executes 100 tasks using worker threads where each task is completed using a BLAS function that is multithreaded.
# SuperFastPython.com
# example that combines a thread pool and blas threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
import concurrent.futures
import time
import numpy

# function that defines a single task
def task():
    # simulate loading arrays
    item1 = numpy.random.rand(2000, 2000)
    item2 = numpy.random.rand(2000, 2000)
    # perform multithreaded operation
    result = item1.dot(item2)
    # return the result
    return result

# record the start time
time_start = time.perf_counter()
# create thread pool
with concurrent.futures.ThreadPoolExecutor(2) as exe:
    # issue tasks and gather results
    results = [exe.submit(task) for _ in range(100)]
# calculate the duration
time_duration = time.perf_counter() - time_start
# report the duration
print(f'Took {time_duration:.3f} seconds')
Run the example.
Vary the number of BLAS threads and worker threads and discover a configuration that gives the best results.
Let me know in the comments what you discover.
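If you would rather search configurations in a single run, here is a sketch that grids over BLAS threads and worker threads; it assumes the third-party threadpoolctl package is installed, since OMP_NUM_THREADS cannot be changed reliably after NumPy is imported.
# SuperFastPython.com
# sketch: grid over blas threads and worker threads in one process
from threadpoolctl import threadpool_limits
import concurrent.futures
import time
import numpy

# function that defines a single task
def task():
    # simulate loading arrays
    item1 = numpy.random.rand(2000, 2000)
    item2 = numpy.random.rand(2000, 2000)
    # perform multithreaded operation
    return item1.dot(item2)

# try each combination of blas threads and worker threads
for blas_threads in [1, 2, 4]:
    for workers in [1, 2, 4]:
        with threadpool_limits(limits=blas_threads, user_api='blas'):
            start = time.perf_counter()
            with concurrent.futures.ThreadPoolExecutor(workers) as exe:
                futures = [exe.submit(task) for _ in range(20)]
                _ = [f.result() for f in futures]
            duration = time.perf_counter() - start
        print(f'blas={blas_threads} workers={workers}: {duration:.3f} seconds')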
You can learn more about combining BLAS threads and Python threads in the tutorial:
Lesson 07: Share NumPy array with ctypes RawArray
Python offers process-based concurrency via the multiprocessing module.
Process-based concurrency is appropriate for those tasks that are CPU-bound.
Consider the situation where we need to share NumPy arrays between processes. This may be for many reasons, such as when data is loaded as an array in one process and analyzed in different ways by different child processes.
Sharing Python objects and data between processes is slow.
This is because any data shared between processes, like NumPy arrays, must be transmitted using inter-process communication (IPC), requiring that the data first be pickled by the sender and then unpickled by the receiver.
One approach to sharing NumPy arrays between processes is to use shared memory.
Python processes do not have shared memory. Instead, processes must simulate shared memory.
We can create a block of shared memory and use it as a buffer to back a NumPy array.
This will allow us to share the NumPy array directly and freely between processes.
It is also fast as each process is able to read and write directly on the same array of data in main memory.
This can be achieved by first creating a ctypes RawArray large enough to hold our NumPy array.
...
# create the shared array
array = RawArray('d', n)
We can then create a new NumPy array that uses the ctype RawArray as a buffer via the frombuffer() function.
...
# create a new numpy array backed by the raw array
data = frombuffer(array, dtype=double, count=len(array))
And that’s it.
A complete example of creating a RawArray-backed NumPy array and sharing it between Python processes is listed below.
# SuperFastPython.com
# share numpy array between processes via a shared rawarray
import multiprocessing
import numpy

# task executed in a child process
def task(array):
    # create a new numpy array backed by the raw array
    data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
    # check the contents
    print(f'Child {data[:10]}')
    # increment the data
    data[:] += 1
    # confirm change
    print(f'Child {data[:10]}')

# protect the entry point
if __name__ == '__main__':
    # define the size of the numpy array
    n = 10000000
    # create the shared array
    array = multiprocessing.RawArray('d', n)
    # create a new numpy array backed by the raw array
    data = numpy.frombuffer(array, dtype=numpy.double, count=len(array))
    # populate the array
    data.fill(1.0)
    # confirm contents of the new array
    print(data[:10], len(data))
    # create a child process
    child = multiprocessing.Process(target=task, args=(array,))
    # start the child process
    child.start()
    # wait for the child process to complete
    child.join()
    # check some data in the shared array
    print(data[:10])
Run the example.
Vary the size of the array and the operations performed on the array in each process.
Let me know what you discover.
You can learn more about developing a RawArray-backed NumPy array in the tutorial:
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Thank You
Thank you for letting me help you learn more about concurrent NumPy.
If you ever have any questions about this course or Python concurrency in general, please reach out. Just ask in the comments below.
If you enjoyed this course and want to know more, check out my new book:
This book distills only what you need to know to get started and be effective with concurrent NumPy, super fast.
The book is divided into 32 chapters spread across 4 parts covering a suite of topics such as BLAS threads, thread pools, and inter-process array sharing.
Everything you need to get started, then get really good at concurrent NumPy, all in one book.
It’s exactly how I would teach you concurrent NumPy if we were sitting together, pair programming.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Did you enjoy this course?
Let me know in the comments below.