Last Updated on September 29, 2023
You can fill a NumPy array in parallel using Python threads.
NumPy will release the global interpreter lock (GIL) when calling a fill function, allowing Python worker threads in the ThreadPoolExecutor to run in parallel and populate different sub-arrays of a large shared array.
This can offer up to a 3x speed-up when filling NumPy arrays, depending on the number of CPU cores available and the size of the array that is being initialized.
In this tutorial, you will discover how to speed up the filling of NumPy arrays using parallelism with the ThreadPoolExecutor.
Let’s get started.
Need to Fill Large NumPy Array
It is common to create a NumPy array and fill it with an initial value.
For example, we may create an array and need to initialize it with zeros, ones, or a specific value.
If the array is very large, e.g. millions of values, initializing it may take a long time, e.g. many seconds.
How can we fill large NumPy arrays faster using the ThreadPoolExecutor?
How to Use ThreadPoolExecutor to Fill Large NumPy Array
We can fill a NumPy array faster using multiple concurrent threads in the ThreadPoolExecutor.
Most NumPy functions are implemented in C. As such, they release the global interpreter lock (GIL), allowing other Python threads to run in parallel.
This means that we can use multiple Python threads to fill the same NumPy array in parallel.
You can learn more about NumPy releasing the GIL in the tutorial:
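As a minimal sketch of this idea (the array size and the use of two plain `threading.Thread` workers are illustrative, not part of the tutorial's code), two threads can each fill one half of the same shared array at the same time:

```python
from threading import Thread
from numpy import empty

# create a shared array (uninitialized)
n = 1000
data = empty((n, n))

# each thread fills a different half of the same array
t1 = Thread(target=lambda: data[:n//2, :].fill(1))
t2 = Thread(target=lambda: data[n//2:, :].fill(1))
t1.start(); t2.start()
t1.join(); t2.join()

# the whole array is now populated
print(data.sum())  # 1000000.0
```

Because `ndarray.fill()` releases the GIL, both threads can make progress at the same time.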
This can be achieved by dividing the NumPy array into sections and having each thread fill a different section.
Firstly, we can create a ThreadPoolExecutor with one worker thread per CPU core in our system.
For example, if we have 4 CPU cores, then we can create a ThreadPoolExecutor with 4 worker threads:
```python
...
# create the thread pool
with ThreadPoolExecutor(4) as pool:
    # ...
```
You can learn more about configuring the number of workers for the ThreadPoolExecutor in the tutorial:
We can then define a task function that takes the shared NumPy array as an argument as well as the coordinates to fill and the value with which to fill them.
For example:
```python
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
```
We can then divide our large NumPy array into sections and issue each as a task to the thread pool to have it populated in parallel.
This can be challenging to make generic as it depends on the number of threads and the dimensions of the array.
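One hedged sketch of a more general approach is a helper that yields coordinate tuples for an arbitrary grid of sub-arrays (`split_coords` is a hypothetical helper for illustration, not part of the tutorial's code; it assumes a square n x n matrix):

```python
# sketch: yield (i1, i2, i3, i4) coordinate tuples that divide
# an n x n matrix into roughly grid x grid sub-arrays
def split_coords(n, grid):
    step = n // grid
    for x in range(0, n, step):
        for y in range(0, n, step):
            # clamp the end indices so we never run past the array
            yield (x, min(x + step, n), y, min(y + step, n))

# e.g. a 100x100 matrix in a 2x2 grid gives 4 quadrants
coords = list(split_coords(100, 2))
print(len(coords))  # 4
```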
Let’s take a simple case: a two-dimensional NumPy array (a matrix) and 4 physical CPU cores.
In this case, we would ideally want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.
For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, indexing into the larger matrix directly. These indices can be gathered into a tuple representing the coordinates of the region each task will populate.
For example:
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
```
We can then issue each set of coordinates to our fill_subarray() function using the submit() method of the ThreadPoolExecutor.
For example:
```python
...
# issue task
_ = pool.submit(fill_subarray, coords, data, 1)
```
You can learn more about submitting asynchronous tasks to the ThreadPoolExecutor in the tutorial:
And that’s it.
Note, if you prefer to use the ThreadPool to fill the array instead of the ThreadPoolExecutor, see the tutorial:
Now that we know how to fill a large array in parallel using threads, let’s look at some worked examples.
Example of Filling a NumPy Array (slow)
Before we develop an example of multithreaded array filling, let’s confirm that single-threaded array filling is slow.
We can explore a case of creating and filling an array using a built-in function.
Examples of built-in functions that create and populate an array include numpy.ones() and numpy.zeros().
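For example (a small illustrative snippet; numpy.full() is included to show filling with an arbitrary value):

```python
from numpy import ones, zeros, full

a = ones((3, 3))        # 3x3 matrix of 1.0
b = zeros((3, 3))       # 3x3 matrix of 0.0
c = full((3, 3), 99.0)  # 3x3 matrix of 99.0
print(a.sum(), b.sum(), c.sum())  # 9.0 0.0 891.0
```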
We will use the numpy.ones() function to create an array of a given size, in this case, a matrix with the dimensions 50,000×50,000.
The function below creates an array using the numpy.ones() function with the given size and populates it with ones, then returns how long the task took in seconds.
```python
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration
```
We can then repeat this task a number of times and report the average time it takes to complete in seconds.
This is important because a single run of the task may report an unstable time, running longer or shorter depending on the system. Averaging over multiple runs gives a fairer idea of the expected performance of the task.
The experiment() function below executes the task 3 times and returns the average time taken in seconds.
```python
# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats
```
Finally, we can run the experiment and report the average time taken.
```python
...
# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Tying this together, the complete example is listed below.
```python
# single-threaded example of filling an array with ones
from time import time
from numpy import ones

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Running the example creates a 50,000 x 50,000 array and populates it with ones. This is repeated 3 times and the average time taken is reported in seconds.
On my system, it took about 6.212 seconds to complete with Python 3.11.4 and NumPy 1.25.0.
This example may execute in more or less time depending on the capabilities of your system.
We can imagine how the performance of filling an array may get worse as the size of the array is increased further.
```
> took 6.131 seconds, sum=2500000000
> took 6.207 seconds, sum=2500000000
> took 6.299 seconds, sum=2500000000
Average duration: 6.212 seconds
```
Next, let’s explore how we can perform the same operation faster with parallelism.
Example of ThreadPoolExecutor Filling a NumPy Array (fast)
We can fill the same array faster using threads.
We will use a ThreadPoolExecutor and issue tasks to operate on sub-sections of the provided array simultaneously, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices and the shared array. It then calls the fill() function on the sub-array defined by the provided coordinates.
```python
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
```
Python threads can share memory directly. This means that we can share the same array between all tasks executed in worker threads and they will all operate on the same structure in memory, not copies.
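A quick sketch of this property (the array size and worker function are illustrative): a change made by a worker thread is visible in the main thread, because both reference the same array in memory rather than a copy:

```python
from concurrent.futures import ThreadPoolExecutor
from numpy import zeros

# shared array, referenced by both the main and worker threads
data = zeros(10)

# worker writes into the shared array in place
def worker(arr):
    arr[:5] = 1

with ThreadPoolExecutor(1) as pool:
    pool.submit(worker, data).result()

# the main thread sees the worker's writes
print(data.sum())  # 5.0
```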
Next, we can create our shared array.
```python
...
# create an empty array
data = empty((n,n))
```
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPoolExecutor to use 4 workers. Update the ThreadPoolExecutor configuration accordingly to match the number of physical CPU cores in your system.
```python
...
# create the thread pool
with ThreadPoolExecutor(4) as pool:
    # ...
```
Our array is square, so it can be divided into quadrants (4 parts), each issued to the fill_subarray() function for parallel filling.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.submit(fill_subarray, coords, data, 1)
```
The main thread will then exit the context manager for the ThreadPoolExecutor, and block until all tasks in the thread pool are completed and the worker threads have been terminated.
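If you need the fully populated array before leaving the with-block, you can also keep the returned Future objects and wait on them explicitly; a small sketch under that assumption (the array size is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, wait
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    i1, i2, i3, i4 = coords
    data[i1:i2, i3:i4].fill(value)

n = 100
data = empty((n, n))

with ThreadPoolExecutor(4) as pool:
    split = n // 2
    # collect the futures for all issued tasks
    futures = [pool.submit(fill_subarray, (x, x+split, y, y+split), data, 1)
               for x in range(0, n, split)
               for y in range(0, n, split)]
    # block here, before the context manager exits
    wait(futures)
    # the array is fully populated at this point
    print(data.sum())  # 10000.0
```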
And that’s it.
The updated task() function with these changes is listed below.
```python
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration
```
Tying this together, the complete example is listed below.
```python
# multithreaded array filling with thread pool and one worker per physical cpu
from time import time
from concurrent.futures import ThreadPoolExecutor
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Running the example creates the empty array, then creates the ThreadPoolExecutor.
The large array is divided into quadrants and the indices are issued to the fill task via the thread pool. Each task fills its portion of the shared array in parallel, then the thread pool is closed and released.
This is repeated three times and the average time taken is reported.
In this case, it took about 2.546 seconds on my system.
Compared to the single-threaded version above, this is about 3.666 seconds faster, or a speed-up of about 2.44x.
```
> took 2.537 seconds, sum=2500000000
> took 2.586 seconds, sum=2500000000
> took 2.514 seconds, sum=2500000000
Average duration: 2.546 seconds
```
Next, let’s see if we can do better.
Example of ThreadPoolExecutor Filling a NumPy Array (faster)
We do not have to split the large array into quadrants or use one worker thread per physical CPU core.
Many modern CPUs support hyperthreading, which can offer up to a 30% performance benefit on some tasks. This may help when filling arrays with the same value, given the repetitive nature of the task function.
In this case, we will update the ThreadPoolExecutor to have one worker per logical CPU core, e.g. double the number of physical CPUs.
I have 4 physical CPU cores, which is seen as 8 logical CPU cores by the system. Update the configuration to match your system accordingly.
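Rather than hard-coding the worker count, you can size the pool from the number of logical CPUs reported by the standard library; a sketch under the assumption that os.cpu_count() reflects your hyperthreaded core count (it returns logical cores and may return None on some platforms):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# number of logical CPU cores (physical cores x hardware threads)
n_workers = os.cpu_count() or 1

# create the thread pool with one worker per logical core
with ThreadPoolExecutor(n_workers) as pool:
    print(f'pool has {n_workers} workers')
```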
```python
...
# create the thread pool
with ThreadPoolExecutor(8) as pool:
    # ...
```
Next, we can split the array into more than four pieces.
In this case, we will split each dimension into 4, giving 4 * 4 or 16 sub-arrays to fill.
You can experiment with other divisions of the array into sub-arrays.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/4)
```
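One way to experiment is to time a range of candidate divisors and compare; a hedged sketch (the matrix size is kept small here so it runs quickly, and the divisors chosen all divide n evenly):

```python
from time import time
from concurrent.futures import ThreadPoolExecutor
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    i1, i2, i3, i4 = coords
    data[i1:i2, i3:i4].fill(value)

n = 2000
for divisor in [1, 2, 4, 8]:
    data = empty((n, n))
    start = time()
    with ThreadPoolExecutor(8) as pool:
        # divisor sub-ranges per dimension, divisor^2 tasks in total
        split = n // divisor
        for x in range(0, n, split):
            for y in range(0, n, split):
                pool.submit(fill_subarray, (x, x+split, y, y+split), data, 1)
    # report the duration for this division of the array
    print(f'divisor={divisor}: {time() - start:.3f} seconds')
```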
And that’s it.
The updated task() function with this change is listed below.
```python
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration
```
Tying this together, the complete example is listed below.
```python
# multithreaded array filling with thread pool and one worker per logical cpu
from time import time
from concurrent.futures import ThreadPoolExecutor
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Running the example creates the empty array, creates the ThreadPoolExecutor, then divides the shared array into 16 parts to be filled in parallel.
In this case, the example took about 2.024 seconds to complete on average.
This is about 4.188 seconds faster than the single-threaded version or a speed-up of about 3.07x.
It is also faster than the one worker per physical CPU example developed above. Specifically, it is 0.522 seconds (522 milliseconds) faster or a speed-up of about 1.26x.
This highlights that it is important to continue testing different configurations, even after parallelism offers a significant speed-up.
```
> took 2.022 seconds, sum=2500000000
> took 1.969 seconds, sum=2500000000
> took 2.080 seconds, sum=2500000000
Average duration: 2.024 seconds
```
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to speed up the filling of NumPy arrays using parallelism with the ThreadPoolExecutor.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Nicolas Picard on Unsplash