Last Updated on September 29, 2023
You can use threads to apply a math function to each item in a numpy vector.
Most numpy math functions execute C-code and release the Global Interpreter Lock (GIL). This means they can be executed in parallel using Python threads, offering some speed-up.
In this tutorial, you will discover how to parallelize numpy math functions on vectors using threads.
Let’s get started.
Need to Apply Function to NumPy Vector in Parallel
Consider the situation where you have a vector of floating point values represented as a numpy array.
We then need to apply a mathematical operation to each item in the vector to produce a new or transformed vector.
The mathematical operation could be something complex, but it may also be something simple such as a numpy.log(), numpy.sin(), or a numpy.exp().
For example:
    ...
    # apply math function to vector to create transformed vector
    result = numpy.exp(data)
If we have a very large vector, such as 10 million, 100 million or billions of items, can we apply the math function on the vector in parallel?
How to Do NumPy Vector Math in Parallel with Threads
NumPy math functions like numpy.exp() are typically implemented in C code and called from Python.
You can see a long list of them here:
Similarly, we may call a few of these functions in a sequence in order to perform a custom transform.
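As a minimal sketch of such a custom transform, the snippet below chains a few numpy math functions. The function name custom_transform() and the particular formula are hypothetical, chosen only for illustration.

```python
# sketch of a custom transform built from a sequence of numpy math functions
import numpy as np

def custom_transform(data):
    # hypothetical formula for illustration: log(1 + exp(sin(x)))
    return np.log1p(np.exp(np.sin(data)))

# small vector of random floats in [0, 1)
data = np.random.rand(10)
# apply the custom transform to every item in the vector
result = custom_transform(data)
print(result.shape)
```

Each call in the chain produces a full intermediate array, so a transform like this does several passes over the data.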
By default, these operations are not implemented to support parallelism, unlike other numpy functions that call down into the BLAS or LAPACK library, such as numpy.dot() and numpy.linalg.svd().
As such, if we want to perform math operations on a large vector in parallel, we must implement it ourselves.
There are many ways we can implement this.
In this tutorial, let’s look at a less common approach that few Python developers will reach for when parallelizing code: threading.
The reference CPython interpreter is not thread-safe, so it uses a global interpreter lock (GIL) that allows only one thread to execute Python bytecode at a time. In turn, threads cannot be used for parallelism in Python, in most cases.
One of the exceptions is when Python calls a C-library and explicitly releases the global interpreter lock (GIL), allowing other threads to run.
Most numpy functions are implemented in C and will release this lock.
This means that we can use threads to execute numpy math functions on arrays in parallel.
The exceptions are few but important: while a thread is waiting for IO (for you to type something, say, or for something to come in the network) python releases the GIL so other threads can run. And, more importantly for us, while numpy is doing an array operation, python also releases the GIL.
— Parallel Programming with numpy and scipy
You can learn more about numpy releasing the GIL in the tutorial:
Python provides the multiprocessing.pool.ThreadPool class that we can use to create a pool of worker threads to execute arbitrary tasks, such as applying a numpy function to an array.
For example, we create a ThreadPool and specify the number of workers to create, then call the map() method in order to apply a numpy math function to an iterable of vectors.
    ...
    # create the thread pool
    with ThreadPool(4) as pool:
        # apply math function to a list of vectors
        result_list = pool.map(exp, vectors)
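For reference, a small self-contained version of this pattern might look like the following. The list of four small random vectors is an assumption made only so the snippet can run on its own.

```python
# sketch: apply numpy.exp to a list of vectors with a thread pool
from multiprocessing.pool import ThreadPool
from numpy.random import rand
from numpy import exp

# hypothetical input: four small vectors of random floats
vectors = [rand(1000) for _ in range(4)]
# create the thread pool
with ThreadPool(4) as pool:
    # each task applies exp() to one vector
    result_list = pool.map(exp, vectors)
# map() returns one result vector per input vector, in the same order
print(len(result_list), result_list[0].shape)
```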
If you are new to the ThreadPool class, you can learn more in the guide:
If we have a very large vector with millions or billions of items, we can split the vector into smaller pieces, apply the math function to each in parallel, then stitch the results back together.
The numpy.split() function can be used to split a large vector into sub-vectors and the numpy.concatenate() function can combine the results back into one large vector again.
For example:
    ...
    # split the array into chunks
    chunks = split(data, n_workers)
    ...
    # convert list back to one array
    result = concatenate(result_list)
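One detail worth knowing: numpy.split() raises a ValueError when the vector cannot be divided into equal-sized pieces, whereas numpy.array_split() allows uneven chunks. The sketch below swaps in array_split(), which the examples in this tutorial do not use, and assumes a small vector whose length is not a multiple of the worker count. It then checks that the stitched result matches the sequential one.

```python
# sketch: split, map, and concatenate with uneven chunk sizes
from multiprocessing.pool import ThreadPool
from numpy.random import rand
from numpy import exp, concatenate, array_split, allclose

n_workers = 4
# assumed size for illustration, deliberately not divisible by n_workers
data = rand(1000003)
# array_split() tolerates chunks of unequal length
chunks = array_split(data, n_workers)
with ThreadPool(n_workers) as pool:
    # apply exp() to each chunk in a worker thread
    result_list = pool.map(exp, chunks)
# stitch the chunks back into one vector
result = concatenate(result_list)
# compare against the sequential result
print(allclose(result, exp(data)))
```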
Now that we know how to apply numpy math operations to vectors in parallel using threads, let’s look at some worked examples.
Example of Vector Operation (sequential)
Before we look at parallelizing math operations on a vector, let’s look at the sequential version.
In this example, we will first define a 1,000,000,000 (one billion) item vector of random floats via the numpy.random.rand() function, then apply the numpy.exp() math operation to each item in the vector to produce a result vector.
We will time the entire program and report the duration in seconds.
The complete example is listed below.
    # example of vector operation
    from numpy.random import rand
    from numpy import exp
    from time import time
    # record start time
    start = time()
    # size of the vector
    n = 1000000000
    # create vector of random floats
    data = rand(n)
    # apply math operation to each item
    result = exp(data)
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example on my system takes about 14.459 seconds.
It may run faster or slower on your system, depending on the speed of your hardware and version of Python and NumPy.
    Took 14.459 seconds
Next, let’s explore applying the math function in parallel.
Example of Parallel Vector Operation with ThreadPool
We can update the example to apply the math operation to the vector in parallel using threads.
In doing so, we need to think about the number of workers to use, which defines the number of subtasks or subvectors we will split our large vector into.
Modern systems have multiple CPU cores. For example, I have 4 physical CPU cores.
On CPUs that support hyperthreading, operating systems like macOS, Linux, and Windows will see double the number of physical CPU cores. This is because hyperthreading allows each physical CPU core to function like two logical CPU cores, and it offers up to a 30% benefit in some use cases. I have 4 physical CPU cores, so my operating system and Python see 8 logical CPU cores available.
Therefore, it is good to explore using at least two configurations when parallelizing code:
- One worker per physical CPU core.
- One worker per logical CPU core.
Let’s explore both cases.
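The two worker counts can be derived in code. The sketch below uses os.cpu_count() from the standard library, which reports logical cores. The physical count is only an estimate that assumes two logical cores per physical core, which will not hold on every system.

```python
# sketch: derive candidate worker counts from the CPU count
from os import cpu_count

# logical CPU cores seen by the operating system (cpu_count() may return None)
n_logical = cpu_count() or 1
# ASSUMPTION: hyperthreading is enabled, giving two logical cores per physical core
n_physical = max(1, n_logical // 2)
print(f'logical={n_logical}, physical (assumed)={n_physical}')
```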
One Worker Per Physical CPU
We can create a ThreadPool with one worker per physical CPU core.
I have 4 CPU cores, so this example will use 4 workers. Update it to match the number of cores in your system.
    ...
    # define the number of workers and splits
    n_workers = 4
    # create the thread pool
    with ThreadPool(n_workers) as pool:
        # ...
We can then split the one billion item vector into sub-vectors, issue the tasks to the worker threads, then combine the results back into one large vector.
    ...
    # split the array into chunks
    chunks = split(data, n_workers)
    # issue tasks with chunking
    result_list = pool.map(exp, chunks)
    # convert list back to one array
    result = concatenate(result_list)
Tying this together, the complete example is listed below.
    # example of partitioned vector operation using a thread pool
    from multiprocessing.pool import ThreadPool
    from numpy.random import rand
    from numpy import exp
    from numpy import concatenate
    from numpy import split
    from time import time
    # record start time
    start = time()
    # size of the vector
    n = 1000000000
    # create vector of random floats
    data = rand(n)
    # define the number of workers and splits
    n_workers = 4
    # create the thread pool
    with ThreadPool(n_workers) as pool:
        # split the array into chunks
        chunks = split(data, n_workers)
        # issue tasks to the thread pool
        result_list = pool.map(exp, chunks)
        # convert list back to one array
        result = concatenate(result_list)
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
Running the example takes about 12.716 seconds on my system.
That is about 1.743 seconds faster than the sequential version or a speed-up of about 1.14x.
Not great, but better than nothing.
    Took 12.716 seconds
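A variation you might try, which is not benchmarked in this tutorial, is to have each task write its result directly into a preallocated output array through the out argument that NumPy ufuncs such as exp() accept. This avoids the final concatenate() copy. The sketch below is one possible arrangement: the task() helper and the small vector size are assumptions for illustration, and array_split() is used because it returns views into the original arrays.

```python
# sketch: write chunk results in place instead of concatenating
from multiprocessing.pool import ThreadPool
from numpy.random import rand
from numpy import exp, empty_like, array_split, allclose

n_workers = 4
# assumed small size so the sketch runs quickly
data = rand(1000000)
# preallocate the result vector
result = empty_like(data)
# array_split() returns views, so writing to out_chunks fills result in place
in_chunks = array_split(data, n_workers)
out_chunks = array_split(result, n_workers)

# hypothetical helper: apply exp() to one chunk, storing into its output view
def task(chunk, out_chunk):
    exp(chunk, out=out_chunk)

with ThreadPool(n_workers) as pool:
    # issue one task per (input chunk, output chunk) pair
    pool.starmap(task, zip(in_chunks, out_chunks))
# confirm the in-place result matches the sequential result
print(allclose(result, exp(data)))
```

Whether this is faster depends on the function, the data size, and the system, so it should be timed rather than assumed.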
Next, let’s look at configuring the example with one worker per logical CPU core, to see if hyperthreading offers an additional benefit.
One Worker Per Logical CPU (hyperthreading)
We can configure the thread pool with one worker per logical CPU core.
I have 8 logical CPU cores and will therefore create that many thread pool workers, and that many tasks to issue to the thread pool. Update to match the number of CPU cores in your system.
    ...
    # define the number of workers and splits
    n_workers = 8
If you’re not sure how many logical CPU cores you have available, you can find out with a function call. See the tutorial:
The complete example with this change is listed below.
    # example of partitioned vector operation using a thread pool
    from multiprocessing.pool import ThreadPool
    from numpy.random import rand
    from numpy import exp
    from numpy import concatenate
    from numpy import split
    from time import time
    # record start time
    start = time()
    # size of the vector
    n = 1000000000
    # create vector of random floats
    data = rand(n)
    # define the number of workers and splits
    n_workers = 8
    # create the thread pool
    with ThreadPool(n_workers) as pool:
        # split the array into chunks
        chunks = split(data, n_workers)
        # issue tasks to the thread pool
        result_list = pool.map(exp, chunks)
        # convert list back to one array
        result = concatenate(result_list)
    # calculate and report duration
    duration = time() - start
    print(f'Took {duration:.3f} seconds')
The example took about 12.234 seconds to complete on my system.
That is about 2.225 seconds faster than the sequential version or a 1.18x speed-up.
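The time saved and speed-up figures quoted in this tutorial follow from the three reported durations. The short calculation below reproduces them; the variable names are arbitrary.

```python
# reproduce the time saved and speed-up from the reported durations
sequential = 14.459
# reported durations in seconds, keyed by number of workers
timings = {4: 12.716, 8: 12.234}
for n_workers, duration in timings.items():
    # absolute time saved compared to the sequential version
    saved = sequential - duration
    # speed-up factor is sequential time divided by parallel time
    speedup = sequential / duration
    print(f'{n_workers} workers: {saved:.3f} seconds faster, {speedup:.2f}x speed-up')
```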
This may suggest a benefit of hyperthreading in this case (this specific function, data size, and my system).
Always test in order to discover the best configuration for a given example.
    Took 12.234 seconds
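One way to test several configurations is to time the same operation across a range of worker counts. The sketch below is a rough harness rather than a rigorous benchmark. The timed_parallel_exp() helper, the candidate worker counts, and the small vector size are assumptions, and it uses perf_counter() and array_split() where the examples above use time() and split().

```python
# sketch: time the parallel vector operation for several worker counts
from multiprocessing.pool import ThreadPool
from time import perf_counter
from numpy.random import rand
from numpy import exp, concatenate, array_split

# hypothetical helper: run the partitioned exp() and return the duration
def timed_parallel_exp(data, n_workers):
    start = perf_counter()
    with ThreadPool(n_workers) as pool:
        # split, apply exp() in worker threads, then stitch back together
        chunks = array_split(data, n_workers)
        result = concatenate(pool.map(exp, chunks))
    return perf_counter() - start

# assumed small size so the sketch runs quickly; use your real size in practice
data = rand(2000000)
durations = {}
for n_workers in [1, 2, 4, 8]:
    durations[n_workers] = timed_parallel_exp(data, n_workers)
    print(f'{n_workers} workers: {durations[n_workers]:.3f} seconds')
```

Repeating each measurement a few times and averaging gives a more stable comparison, since a single run can vary.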
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to parallelize numpy math functions on vectors using threads.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.