Last Updated on September 29, 2023
You can fill a Numpy array in parallel using Python threads.
Numpy will release the global interpreter lock (GIL) when calling a fill function, allowing Python threads to run in parallel and populate different sub-arrays of a large shared array. This can offer up to a 3x speed-up, depending on the number of CPU cores available and the size of the array that is being initialized.
In this tutorial, you will discover how to speed up the filling of Numpy arrays using parallelism.
Let’s get started.
Need to Use Parallelism to Fill Numpy Arrays Faster
It is common to create a Numpy array and fill it with an initial value.
For example, we may create a vector or a matrix and need to initialize it with zeros, ones, or a specific value.
There are many ways to do this in Numpy.
For example, functions like numpy.ones() and numpy.zeros() will create an array of a given size and fill it with values.
Another approach is to create an uninitialized array with numpy.empty() and call the fill() method to populate it with a given value.
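As a quick sketch (using a small 3×3 array for illustration), we can confirm that these two approaches produce the same result:

```python
# demonstrate equivalent ways of creating a filled array
from numpy import ones, empty, array_equal

n = 3
# create and fill in one step
a = ones((n, n))
# create uninitialized, then populate via the fill() method
b = empty((n, n))
b.fill(1)
# confirm both approaches produce the same array
print(array_equal(a, b))  # True
```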
Filling a large array can be slow, taking many seconds. The larger the array, the longer it can take to fill.
How can we fill large Numpy arrays faster using parallelism?
How to Fill NumPy Arrays Faster Using Parallelism
We can fill a Numpy array faster using multiple concurrent threads.
Most numpy functions are executed as C code. As such, they release the global interpreter lock (GIL), allowing other Python threads to run in parallel.
This means that we can use multiple Python threads to fill the same numpy array in parallel.
You can learn more about numpy releasing the GIL in the tutorial:
This can be achieved by dividing the numpy array into sections and having each thread fill a different section.
An easy way to implement this is to use a thread pool.
The Python standard library provides the multiprocessing.pool.ThreadPool class that can be used to create and manage a pool of worker threads.
A ThreadPool instance can be created and configured with one worker per physical CPU in the system.
For example, if you have 4 CPU cores a thread pool with 4 workers can be created as follows:
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
You can learn more about configuring the number of workers in the ThreadPool in the tutorial:
We can then define a task function that takes the shared numpy array as an argument as well as the coordinates to fill and the value with which to fill them.
For example:
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
We can then divide our large numpy array into sections and issue each as a task to the thread pool to have it populated in parallel.
This can be challenging to make generic as it depends on the number of threads and the dimensions of the array.
Let’s take a simple case of a two-dimensional numpy array, e.g. a matrix and 4 physical CPU cores.
In this case, we would ideally want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.
For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, indexing into the larger matrix directly. These indices can be gathered into a tuple to represent the coordinates of the region each task will populate in the larger array.
For example:
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
Tasks can then be issued via the apply_async() method on the ThreadPool object, which returns an AsyncResult object that we do not need in this case.
For example:
...
# issue task
_ = pool.apply_async(fill_subarray, args=(coords, data, 1))
You can learn more about issuing tasks to the ThreadPool via the apply_async() method in the tutorial:
And that’s it.
Now that we know how to fill a large array in parallel using threads, let’s look at some worked examples.
Single Threaded Fill Numpy Array (slow)
Before we develop an example of multithreaded array filling, let’s confirm that single-threaded array filling is slow.
In this section, we will develop a number of examples of filling an array with a fixed value.
Example With numpy.ones()
We can explore a case of creating and filling an array using a built-in function.
Examples of built-in functions that create and populate an array include numpy.ones() and numpy.zeros().
We will use the numpy.ones() function to create an array of a given size, in this case a matrix with the dimensions 50,000×50,000.
The function below creates an array using the numpy.ones() function with the given size and populates it with ones, then returns how long the task took in seconds.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
We can then repeat this task a number of times and report the average time it takes to complete in seconds.
This is important because a single run of the task may report an unstable time, taking longer or shorter depending on the system. Averaging over multiple runs gives a fairer idea of the expected performance of the task.
The experiment() function below executes the task 3 times and returns the average time taken in seconds.
# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats
Finally, we can run the experiment and report the average time taken.
...
# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Tying this together, the complete example is listed below.
# single-threaded example of filling an array with ones
from time import time
from numpy import ones

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates a 50,000 x 50,000 array and populates it with ones. This is repeated 3 times and the average time taken is reported in seconds.
On my system, it took about 5.871 seconds to complete.
This example may execute in more or less time depending on the capabilities of your system.
We can imagine how the performance of filling an array may get worse as the size of the array is increased further.
Took 5.871 seconds
Next, let’s look at an alternate way of filling an array.
Example with ndarray.fill()
We can create an empty uninitialized array then fill it by calling the fill() method on the array.
For example:
...
# create an empty array
data = empty((n,n))
# fill with 1s
data.fill(1)
This is functionally equivalent to calling the numpy.ones() function. Nevertheless, let’s confirm that it takes about the same amount of time to execute.
The updated task() function with this change is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data.fill(1)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# single-threaded example of creating an array and filling it with fill()
from time import time
from numpy import empty

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data.fill(1)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates an empty 50,000 x 50,000 array and fills it with one value. This is repeated 3 times and the average time taken is reported in seconds.
In this case, the example took about 5.877 seconds on my system.
This is nearly identical to the time taken to call the numpy.ones() function, as we might expect.
Took 5.877 seconds
Next, let’s explore the performance of assigning the values of the array manually.
Example with Mass Assignment
We can create an uninitialized array and then assign the contents of the array manually.
For example:
...
# create an empty array
data = empty((n,n))
# fill with 1s
data[:] = 1
This is equivalent to calling the fill() method on the array.
The task() function updated with this change is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data[:] = 1
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# single-threaded example of creating an empty array and filling it with assignment
from time import time
from numpy import empty

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data[:] = 1
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates an empty 50,000 x 50,000 array and assigns the contents one value. This is repeated 3 times and the average time taken is reported in seconds.
In this case, the example took about 5.834 seconds on my system.
This is roughly equivalent to the time taken to create and fill the array via numpy.ones() and via numpy.empty() and the fill() method.
Took 5.834 seconds
We can see that all of the tested approaches to populate an empty array take roughly the same time.
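As a rough sanity check, we can time the three approaches side by side. The sketch below uses a smaller 2,000×2,000 array so it runs quickly; the absolute numbers will differ from the 50,000×50,000 case, but the relative parity should hold:

```python
# compare the three single-threaded fill approaches on a small array
from time import time
from numpy import ones, empty

def time_it(fn, repeats=3):
    # average the duration of fn over several repeats
    durations = []
    for _ in range(repeats):
        start = time()
        fn()
        durations.append(time() - start)
    return sum(durations) / repeats

def assign():
    # mass assignment needs a statement, so wrap it in a helper
    data = empty((n, n))
    data[:] = 1

n = 2000
print(f'ones():     {time_it(lambda: ones((n, n))):.4f} seconds')
print(f'fill():     {time_it(lambda: empty((n, n)).fill(1)):.4f} seconds')
print(f'assignment: {time_it(assign):.4f} seconds')
```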
Next, let’s explore how we can perform the same operation faster with parallelism.
Multithreaded Fill Numpy Array (fast)
We can fill the same array faster using threads.
We will use a ThreadPool and issue tasks to operate on sub-sections of the provided array simultaneously, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices and the shared array. It then calls the fill() function on the sub-array defined by the provided coordinates.
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
Python threads can share memory directly. This means that we can share the same array between all tasks executed in worker threads and they will all operate on the same structure in memory, not copies.
Next, we can create our shared array.
...
# create an empty array
data = empty((n,n))
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
Our array is square, so it can be divided into quadrants (4 parts), with each quadrant issued to the fill_subarray() function for parallel filling.
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.
...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()
And that’s it.
The updated task() function with these changes is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# multithreaded array filling with thread pool and one worker per physical cpu
from time import time
from multiprocessing.pool import ThreadPool
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates the empty array, then creates the thread pool. The large array is divided into quadrants and the indices are issued to the fill task via the thread pool. Each task fills its portion of the shared array in parallel, then the thread pool is closed and released.
This is repeated three times and the average time taken is reported.
In this case, it took about 2.308 seconds on my system.
Compared to the single-threaded version that calls the fill() method on the array, this is about 3.569 seconds faster, a speed-up of about 2.55x.
Took 2.308 seconds
Next, let’s see if we can do better.
Tuned Multithreaded Fill Numpy Array (faster)
We do not have to split the large array into quadrants or use one worker thread per physical CPU core.
Modern CPUs often support hyperthreading, which can offer up to a 30% performance benefit on some tasks. This may help when filling arrays with the same value, given the repetitive nature of the task function.
In this case, we will update the ThreadPool to have one worker per logical CPU core, e.g. double the number of physical CPUs. I have 4 physical CPU cores, which is seen as 8 logical CPU cores by the system. Update the configuration to match your system accordingly.
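If you prefer not to hard-code the worker count, the standard library can report the number of logical CPUs. A small sketch (falling back to 4 workers when the count is unavailable, an arbitrary choice):

```python
# size the thread pool to the number of logical CPUs
from os import cpu_count
from multiprocessing.pool import ThreadPool

# cpu_count() reports logical CPUs (including hyperthreads), or None if unknown
n_workers = cpu_count() or 4
print(f'Using {n_workers} workers')
# create a pool with one worker per logical CPU
with ThreadPool(n_workers) as pool:
    # issue tasks to the pool here
    pass
```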
...
# create the thread pool
with ThreadPool(8) as pool:
    # ...
Next, we can split the array into more than four pieces.
In this case, we will split each dimension into 4, giving 4×4 or 16 sub-arrays.
You can experiment with other divisions of the array into sub-arrays.
...
# split each dimension (divisor of matrix dimension)
split = round(n/4)
And that’s it.
The updated task() function with this change is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# multithreaded array filling with thread pool and one worker per logical cpu
from time import time
from multiprocessing.pool import ThreadPool
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates the empty array, creates the ThreadPool, then divides the shared array into 16 parts to be filled in parallel.
In this case, the example took about 1.880 seconds to complete on average.
This is about 3.997 seconds faster than the single threaded version or a speed-up of about 3.13x.
It is also faster than the one worker per physical CPU example developed above. Specifically, it is 0.428 seconds (428 milliseconds) faster or a speed-up of about 1.23x.
Took 1.880 seconds
This highlights that it is important to continue testing different configurations, even after parallelism offers a significant speed-up.
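One way to keep testing configurations is to loop over candidate worker counts and split factors, timing each combination. A minimal sketch, using a small 2,000×2,000 array and an arbitrary grid of candidates (the best configuration will vary by system):

```python
# benchmark different worker counts and split factors
from time import time
from multiprocessing.pool import ThreadPool
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    i1, i2, i3, i4 = coords
    data[i1:i2,i3:i4].fill(value)

def fill_parallel(n, workers, divisions):
    # fill an n x n array using the given pool size and per-dimension splits
    data = empty((n, n))
    split = round(n / divisions)
    start = time()
    with ThreadPool(workers) as pool:
        for x in range(0, n, split):
            for y in range(0, n, split):
                coords = (x, x+split, y, y+split)
                pool.apply_async(fill_subarray, args=(coords, data, 1))
        pool.close()
        pool.join()
    return time() - start

# try each combination of workers and splits and report the duration
n = 2000
for workers in (2, 4, 8):
    for divisions in (2, 4):
        duration = fill_parallel(n, workers, divisions)
        print(f'{workers} workers, {divisions}x{divisions} splits: {duration:.4f}s')
```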
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to speed-up the filling of numpy arrays using parallelism.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Jannic Böhme on Unsplash