Parallel NumPy Vector Math with Threads

Last Updated on September 29, 2023

You can use threads to apply a math function to each item in a numpy vector.

Many numpy math functions are implemented in C and release the Global Interpreter Lock (GIL) while they run. This means they can be executed in parallel using Python threads, offering some speed-up.

In this tutorial, you will discover how to parallelize numpy math functions on vectors using threads.

Let’s get started.

Need to Apply Function to NumPy Vector in Parallel

Consider the situation where you have a vector of floating point values represented as a numpy array.

We then need to apply a mathematical operation to each item in the vector to produce a new or transformed vector.

The mathematical operation could be something complex, but it may also be something simple such as a numpy.log(), numpy.sin(), or a numpy.exp().

For example:

...
# apply math function to vector to create transformed vector
result = numpy.exp(data)
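As a minimal, self-contained sketch of the same call on a tiny vector (the three sample values are made up purely for illustration):

```python
# apply numpy.exp() element-wise to a small example vector
import numpy

# hypothetical sample data, chosen only for illustration
data = numpy.array([0.0, 1.0, 2.0])
# the function is applied to every item and returns a new array
result = numpy.exp(data)
# the result has the same shape as the input
print(result.shape)
```

The input vector is left unchanged, and the transformed values are returned in a new array of the same size.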

If we have a very large vector, such as 10 million, 100 million or billions of items, can we apply the math function on the vector in parallel?

How to Do NumPy Vector Math in Parallel with Threads

NumPy math functions like numpy.exp() are typically implemented in C code and called from Python.

You can see a long list of them here:

Numpy Mathematical functions

Similarly, we may call a few of these functions in a sequence in order to perform a custom transform.

These element-wise operations are not multithreaded internally, unlike other numpy functions that call down into a BLAS or LAPACK library, such as numpy.dot() and numpy.linalg.svd(), which may use multiple threads depending on the library that is installed.

As such, if we want to perform math operations on a large vector in parallel, we must implement it ourselves.

There are many ways we can implement this.

In this tutorial, let’s look at a less common approach that few Python developers will reach for when parallelizing code: threading.

The reference Python interpreter (CPython) uses a global interpreter lock that allows only one thread to execute Python bytecode at a time. As a result, threads cannot be used for parallelism in Python in most cases.

One of the exceptions is when Python calls a C-library and explicitly releases the global interpreter lock (GIL), allowing other threads to run.

Most numpy functions are implemented in C and will release this lock.

This means that we can use threads to execute numpy math functions on arrays in parallel.

The exceptions are few but important: while a thread is waiting for IO (for you to type something, say, or for something to come in the network) python releases the GIL so other threads can run. And, more importantly for us, while numpy is doing an array operation, python also releases the GIL.
— Parallel Programming with numpy and scipy
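To make this concrete, here is a minimal sketch that runs numpy.exp() on two vectors from two threads. The helper name task() and the vector size are assumptions made for illustration. The sketch only checks that the threaded results are correct; how much the two calls actually overlap depends on your NumPy version, the array size, and your hardware.

```python
# run numpy.exp() on two vectors at the same time using two threads
from threading import Thread
from numpy import exp, allclose
from numpy.random import rand

# hypothetical helper: compute exp() of a vector and store it under a key
def task(key, vector, results):
    results[key] = exp(vector)

# two independent vectors of random floats
vectors = [rand(1_000_000) for _ in range(2)]
results = {}
# one thread per vector
threads = [Thread(target=task, args=(i, v, results)) for i, v in enumerate(vectors)]
for thread in threads:
    thread.start()
# wait for both threads to finish
for thread in threads:
    thread.join()
# confirm the threaded results match a plain sequential call
matches = all(allclose(results[i], exp(v)) for i, v in enumerate(vectors))
print(matches)
```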

You can learn more about numpy releasing the GIL in the tutorial:

NumPy vs the Global Interpreter Lock (GIL)

Python provides the multiprocessing.pool.ThreadPool class that we can use to create a pool of worker threads to execute arbitrary tasks, such as applying a numpy function to an array.

For example, we create a ThreadPool and specify the number of workers to create, then call the map() method in order to apply a numpy math function to an iterable of vectors.

...
# create the thread pool
with ThreadPool(4) as pool:
    # apply math function to a list of vectors
    result_list = pool.map(exp, vectors)
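A self-contained version of this pattern might look like the sketch below. The list of four small random vectors is an assumption made so that the snippet runs on its own.

```python
# apply numpy.exp() to a list of vectors using a thread pool
from multiprocessing.pool import ThreadPool
from numpy import exp
from numpy.random import rand

# hypothetical input: four independent vectors of 1,000 items each
vectors = [rand(1000) for _ in range(4)]
# create the thread pool with four workers
with ThreadPool(4) as pool:
    # map() returns the results in the same order as the inputs
    result_list = pool.map(exp, vectors)
# one result array per input vector
print(len(result_list), result_list[0].shape)
```

Because map() preserves the order of the inputs, the results can later be joined back together without any extra bookkeeping.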

If you are new to the ThreadPool class, you can learn more in the guide:

Python ThreadPool: The Complete Guide

If we have a very large vector with millions or billions of items, we can split the vector into smaller pieces, apply the math function to each in parallel, then stitch the results back together.

The numpy.split() function can be used to split a large vector into equal-sized sub-vectors (the length of the vector must divide evenly by the number of sub-vectors) and the numpy.concatenate() function can combine the results back into one large vector again.

For example:

...
# split the array into chunks
chunks = split(data, n_workers)
...
# convert list back to one array
result = concatenate(result_list)
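The split and concatenate round trip can be checked on a small vector. The sketch below also shows numpy.array_split(), which is not used in this tutorial but is one option when the vector length does not divide evenly by the number of chunks.

```python
# split a vector into chunks and stitch it back together
from numpy import arange, split, array_split, concatenate, array_equal

data = arange(10, dtype=float)
# split() needs an even division: 10 items into 2 chunks of 5
chunks = split(data, 2)
# concatenate() restores the original order
restored = concatenate(chunks)
print(array_equal(restored, data))

# split() raises a ValueError when the division is uneven (10 items, 4 chunks)
try:
    split(data, 4)
    uneven_ok = True
except ValueError:
    uneven_ok = False

# array_split() allows uneven chunks, here with sizes 3, 3, 2 and 2
uneven = array_split(data, 4)
print([len(c) for c in uneven])
```

In the worked examples that follow, the vector length is a multiple of the number of workers, so split() is sufficient.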

Now that we know how to apply numpy math operations to vectors in parallel using threads, let’s look at some worked examples.

Example of Vector Operation (sequential)

Before we look at parallelizing math operations on a vector, let’s look at the sequential version.

In this example, we will first define a 1,000,000,000 (one billion) item vector of random floats via the numpy.random.rand() function, then apply the numpy.exp() math operation to each item in the vector to produce a result vector.
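A vector of this size needs a lot of memory. As a rough estimate, assuming the 8-byte float64 values that numpy.random.rand() returns, the arithmetic can be sketched as follows:

```python
# estimate memory needed for the vectors, assuming float64 (8 bytes per item)
import numpy

n = 1_000_000_000
bytes_per_item = numpy.dtype(numpy.float64).itemsize
# size of one vector in gigabytes (10^9 bytes)
vector_gb = n * bytes_per_item / 1e9
# the input and the result are both held in memory at the same time
total_gb = 2 * vector_gb
print(f'{vector_gb:.0f} GB per vector, {total_gb:.0f} GB for input plus result')
```

If your system has less memory than this, consider reducing n before running the examples.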

We will time the entire program and report the duration in seconds.

The complete example is listed below.

# example of vector operation
from numpy.random import rand
from numpy import exp
from time import time
# record start time
start = time()
# size of the vector
n = 1000000000
# create vector of random floats
data = rand(n)
# apply math operation to each item
result = exp(data)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example on my system takes about 14.459 seconds.

It may run faster or slower on your system, depending on the speed of your hardware and version of Python and NumPy.

Took 14.459 seconds

Next, let’s explore applying the math function in parallel.

Example of Parallel Vector Operation with ThreadPool

We can update the example to apply the math operation to the vector in parallel using threads.

In doing so, we need to think about the number of workers to use, which defines the number of subtasks or subvectors we will split our large vector into.

Modern systems have multiple CPU cores. For example, I have 4 physical CPU cores.

Operating systems such as macOS, Linux, and Windows report twice as many logical CPU cores as physical CPU cores when the CPU supports hyperthreading (simultaneous multithreading). Hyperthreading allows each physical core to function like two logical cores and offers up to a 30% benefit in some use cases. I have 4 physical CPU cores, so my operating system and Python see 8 logical CPU cores available.

Therefore, it is good to explore using at least two configurations when parallelizing code:

One worker per physical CPU core.
One worker per logical CPU core.

Let’s explore both cases.

One Worker Per Physical CPU

We can create a ThreadPool with one worker per physical CPU core.

I have 4 CPU cores, so this example will use 4 workers. Update it to match the number of cores in your system.

...
# define the number of workers and splits
n_workers = 4
# create the thread pool
with ThreadPool(n_workers) as pool:
    # ...

We can then split the one billion item vector into sub-vectors, issue the tasks to the worker threads, then combine the results back into one large vector.

...
# split the array into chunks
chunks = split(data, n_workers)
# issue tasks with chunking
result_list = pool.map(exp, chunks)
# convert list back to one array
result = concatenate(result_list)

Tying this together, the complete example is listed below.

# example of partitioned vector operation using a thread pool
from multiprocessing.pool import ThreadPool
from numpy.random import rand
from numpy import exp
from numpy import concatenate
from numpy import split
from time import time
# record start time
start = time()
# size of the vector
n = 1000000000
# create vector of random floats
data = rand(n)
# define the number of workers and splits
n_workers = 4
# create the thread pool
with ThreadPool(n_workers) as pool:
    # split the array into chunks
    chunks = split(data, n_workers)
    # issue tasks to the thread pool
    result_list = pool.map(exp, chunks)
    # convert list back to one array
    result = concatenate(result_list)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

Running the example takes about 12.716 seconds on my system.

That is about 1.743 seconds faster than the sequential version or a speed-up of about 1.14x.

Not great, but better than nothing.

Took 12.716 seconds
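Part of the measured time is work that stays sequential in this example: creating the data with rand() and copying the chunks back into one array with concatenate(). One possible variation, which is not part of this tutorial's examples, is to pre-allocate the result array and have each thread write directly into its own slice using the out argument of numpy.exp(). The sketch below uses a smaller vector so that it runs quickly, and the names task() and bounds are illustrative. Whether it is faster on your system needs to be measured.

```python
# variation: each thread writes exp() results into a slice of a shared array
from multiprocessing.pool import ThreadPool
from numpy import exp, empty_like, allclose
from numpy.random import rand

n = 1_000_000  # smaller than the tutorial's vector so the sketch runs quickly
n_workers = 4
data = rand(n)
# pre-allocate the result so no concatenate() step is needed
result = empty_like(data)

# hypothetical helper: compute exp() for one slice, writing in place
def task(bounds):
    start, stop = bounds
    exp(data[start:stop], out=result[start:stop])

# slice boundaries for equal-sized chunks (n divides evenly by n_workers)
size = n // n_workers
bounds = [(i * size, (i + 1) * size) for i in range(n_workers)]
with ThreadPool(n_workers) as pool:
    pool.map(task, bounds)
# confirm the result matches a plain sequential call
print(allclose(result, exp(data)))
```

Each slice is a view onto the same underlying array, and no two threads write to the same items, so the threads do not need a lock.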

Next, let’s look at configuring the example with one worker per logical CPU core, to see if hyperthreading offers an additional benefit.

One Worker Per Logical CPU (hyperthreading)

We can configure the thread pool with one worker per logical CPU core.

I have 8 logical CPU cores and will therefore create that many thread pool workers, and that many tasks to issue to the thread pool. Update to match the number of CPU cores in your system.

...
# define the number of workers and splits
n_workers = 8

If you’re not sure how many logical CPU cores you have available, you can find out with a function call. See the tutorial:

Number of CPUs in Python
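As a minimal sketch, the standard library's os.cpu_count() function reports the number of logical CPU cores. It can return None if the count cannot be determined, so the fallback to 1 below is an assumption made for safety.

```python
# report the number of logical CPU cores visible to Python
from os import cpu_count

# cpu_count() may return None, so fall back to a single worker
n_workers = cpu_count() or 1
print(n_workers)
```

The standard library does not report the number of physical cores. A third-party package such as psutil can, via psutil.cpu_count(logical=False), if you have it installed.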

The complete example with this change is listed below.

# example of partitioned vector operation using a thread pool
from multiprocessing.pool import ThreadPool
from numpy.random import rand
from numpy import exp
from numpy import concatenate
from numpy import split
from time import time
# record start time
start = time()
# size of the vector
n = 1000000000
# create vector of random floats
data = rand(n)
# define the number of workers and splits
n_workers = 8
# create the thread pool
with ThreadPool(n_workers) as pool:
    # split the array into chunks
    chunks = split(data, n_workers)
    # issue tasks to the thread pool
    result_list = pool.map(exp, chunks)
    # convert list back to one array
    result = concatenate(result_list)
# calculate and report duration
duration = time() - start
print(f'Took {duration:.3f} seconds')

The example took about 12.234 seconds to complete on my system.

That is about 2.225 seconds faster than the sequential version or a 1.18x speed-up.

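This may suggest a small benefit of hyperthreading in this case (this specific function, data size, and my system), although a difference of under half a second between the two configurations could also be run-to-run variation.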

Always test in order to discover the best configuration for a given example.

Took 12.234 seconds
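One way to test several configurations is to time the same operation with different numbers of workers and compare each against a sequential baseline. The sketch below is an illustration under a few assumptions: it uses a much smaller vector than the examples above, times only the math operation rather than the whole program, and the helper name time_parallel() is hypothetical. Timings will differ on every system and between runs.

```python
# compare the chunked thread pool version across several worker counts
from multiprocessing.pool import ThreadPool
from time import perf_counter
from numpy import exp, split, concatenate
from numpy.random import rand

n = 8_000_000  # deliberately small; divisible by every worker count below
data = rand(n)

# time the sequential version (only the exp() call, not data creation)
start = perf_counter()
exp(data)
baseline = perf_counter() - start

# hypothetical helper: time the chunked version for a given worker count
def time_parallel(n_workers):
    start = perf_counter()
    with ThreadPool(n_workers) as pool:
        result = concatenate(pool.map(exp, split(data, n_workers)))
    return perf_counter() - start, result

durations = {}
for n_workers in [1, 2, 4, 8]:
    duration, result = time_parallel(n_workers)
    durations[n_workers] = duration
    # speed-up is the sequential time divided by the parallel time
    print(f'{n_workers} workers: {duration:.3f}s, speed-up {baseline / duration:.2f}x')
```

The speed-up figures reported in this tutorial were calculated the same way, for example 14.459 / 12.234 is about 1.18x.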

Takeaways

You now know how to parallelize numpy math functions on vectors using threads.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Photo by henry perks on Unsplash
