Last Updated on September 29, 2023
You can fill a Numpy array in parallel using Python threads.
Numpy will release the global interpreter lock (GIL) when calling a fill function, allowing Python threads to run in parallel and populate different sub-arrays of a large shared array. This can offer up to a 3x speed-up, depending on the number of CPU cores available and the size of the array that is being initialized.
In this tutorial, you will discover how to speed up the filling of Numpy arrays using parallelism.
Let’s get started.
Need to Use Parallelism to Fill Numpy Arrays Faster
It is common to create a Numpy array and fill it with an initial value.
For example, we may create a vector or a matrix and need to initialize it with zeros, ones, or a specific value.
There are many ways to do this in Numpy.
For example, functions like numpy.ones() and numpy.zeros() will create an array of a given size and fill it with values.
Another approach is to create an uninitialized array with numpy.empty() and call the fill() method to populate it with a given value.
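As a quick sketch (using a small 3×3 array for illustration), we can confirm that these two approaches produce the same result:

```python
# demonstrate equivalent ways of creating a filled array
from numpy import ones, empty, array_equal

n = 3
# create and fill in one step
a = ones((n, n))
# create uninitialized, then populate via the fill() method
b = empty((n, n))
b.fill(1)
# confirm both approaches produce the same array
print(array_equal(a, b))  # True
```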
Filling a large array can be slow, taking many seconds. The larger the array, the longer it can take to fill.
How can we fill large Numpy arrays faster using parallelism?
How to Fill NumPy Arrays Faster Using Parallelism
We can fill a Numpy array faster using multiple concurrent threads.
Most numpy functions are executed as C code. As such, they release the global interpreter lock (GIL), allowing other Python threads to run in parallel.
This means that we can use multiple Python threads to fill the same numpy array in parallel.
You can learn more about numpy releasing the GIL in the tutorial:
This can be achieved by dividing the numpy array into sections and having each thread fill a different section.
An easy way to implement this is to use a thread pool.
The Python standard library provides the multiprocessing.pool.ThreadPool class that can be used to create and manage a pool of worker threads.
A ThreadPool instance can be created and configured with one worker per physical CPU in the system.
For example, if you have 4 CPU cores a thread pool with 4 workers can be created as follows:
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
You can learn more about configuring the number of workers in the ThreadPool in the tutorial:
We can then define a task function that takes the shared numpy array as an argument as well as the coordinates to fill and the value with which to fill them.
For example:
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
We can then divide our large numpy array into sections and issue each as a task to the thread pool to have it populated in parallel.
This can be challenging to make generic as it depends on the number of threads and the dimensions of the array.
Let’s take a simple case of a two-dimensional numpy array, e.g. a matrix and 4 physical CPU cores.
In this case, we would ideally want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.
For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, indexing into the larger matrix directly. These indices can be gathered into a tuple to represent the coordinates of the region each task will populate in the larger array.
For example:
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
Tasks can then be issued via the apply_async() method on the ThreadPool object, which returns an AsyncResult object that we do not need in this case.
For example:
...
# issue task
_ = pool.apply_async(fill_subarray, args=(coords, data, 1))
You can learn more about issuing tasks to the ThreadPool via the apply_async() method in the tutorial:
And that’s it.
Now that we know how to fill a large array in parallel using threads, let’s look at some worked examples.
Single Threaded Fill Numpy Array (slow)
Before we develop an example of multithreaded array filling, let’s confirm that single-threaded array filling is slow.
In this section, we will develop a number of examples of filling an array with a fixed value.
Example With numpy.ones()
We can explore a case of creating and filling an array using a built-in function.
Examples of built-in functions that create and populate an array include numpy.ones() and numpy.zeros().
We will use the numpy.ones() function to create an array of a given size, in this case a matrix with the dimensions 50,000×50,000.
The function below creates an array using the numpy.ones() function with the given size and populates it with ones, then returns how long the task took in seconds.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
We can then repeat this task a number of times and report the average time it takes to complete in seconds.
This is important because a single run of the task may report an unstable time, taking longer or shorter depending on the system. Averaging over multiple runs gives a fairer idea of the expected performance of the task.
The experiment() function below executes the task 3 times and returns the average time taken in seconds.
# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats
Finally, we can run the experiment and report the average time taken.
...
# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Tying this together, the complete example is listed below.
# single-threaded example of filling an array with ones
from time import time
from numpy import ones

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates a 50,000 x 50,000 array and populates it with ones. This is repeated 3 times and the average time taken is reported in seconds.
On my system, it took about 5.871 seconds to complete.
This example may execute in more or less time depending on the capabilities of your system.
We can imagine how the performance of filling an array may get worse as the size of the array is increased further.
Took 5.871 seconds
Next, let’s look at an alternate way of filling an array.
Example with ndarray.fill()
We can create an empty uninitialized array then fill it by calling the fill() method on the array.
For example:
...
# create an empty array
data = empty((n,n))
# fill with 1s
data.fill(1)
This is functionally equivalent to calling the numpy.ones() function. Nevertheless, let’s confirm that it takes about the same amount of time to execute.
The updated task() function with this change is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data.fill(1)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# single-threaded example of creating an array and filling it with fill()
from time import time
from numpy import empty

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data.fill(1)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates an empty 50,000 x 50,000 array and fills it with one value. This is repeated 3 times and the average time taken is reported in seconds.
In this case, the example took about 5.877 seconds on my system.
This is nearly identical to the time taken to call the numpy.ones() function, as we might expect.
Took 5.877 seconds
Next, let’s explore the performance of assigning the values of the array manually.
Example with Mass Assignment
We can create an uninitialized array and then assign the contents of the array manually.
For example:
...
# create an empty array
data = empty((n,n))
# fill with 1s
data[:] = 1
This is equivalent to calling the fill() method on the array.
The task() function updated with this change is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data[:] = 1
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# single-threaded example of creating an empty array and filling it with assignment
from time import time
from numpy import empty

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # fill with 1s
    data[:] = 1
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates an empty 50,000 x 50,000 array and assigns the contents one value. This is repeated 3 times and the average time taken is reported in seconds.
In this case, the example took about 5.834 seconds on my system.
This is roughly equivalent to the time taken to create and fill the array via numpy.ones() and via numpy.empty() and the fill() method.
Took 5.834 seconds
We can see that all of the tested approaches to populate an empty array take roughly the same time.
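As a rough sanity check, we can time the three approaches side by side. The sketch below uses a smaller 2,000×2,000 array so it runs quickly; the absolute numbers will differ from the 50,000×50,000 case, but the relative parity should hold:

```python
# compare the three single-threaded fill approaches on a small array
from time import time
from numpy import ones, empty

def time_it(fn, repeats=3):
    # average the duration of fn over several repeats
    durations = []
    for _ in range(repeats):
        start = time()
        fn()
        durations.append(time() - start)
    return sum(durations) / repeats

def assign():
    # mass assignment needs a statement, so wrap it in a helper
    data = empty((n, n))
    data[:] = 1

n = 2000
print(f'ones():     {time_it(lambda: ones((n, n))):.4f} seconds')
print(f'fill():     {time_it(lambda: empty((n, n)).fill(1)):.4f} seconds')
print(f'assignment: {time_it(assign):.4f} seconds')
```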
Next, let’s explore how we can perform the same operation faster with parallelism.
Multithreaded Fill Numpy Array (fast)
We can fill the same array faster using threads.
We will use a ThreadPool and issue tasks to operate on sub-sections of the provided array simultaneously, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices and the shared array. It then calls the fill() function on the sub-array defined by the provided coordinates.
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
Python threads can share memory directly. This means that we can share the same array between all tasks executed in worker threads and they will all operate on the same structure in memory, not copies.
Next, we can create our shared array.
...
# create an empty array
data = empty((n,n))
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPool to use 4 workers. Update the ThreadPool configuration accordingly to match the number of physical CPU cores in your system.
...
# create the thread pool
with ThreadPool(4) as pool:
    # ...
Our array is square, so it can be divided into quadrants (4 parts), with each quadrant issued to the fill_subarray() function for parallel filling.
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
The main thread will then close the pool to signal no further tasks will be issued to the pool, then block and wait for all tasks to be completed.
...
# close the pool
pool.close()
# wait for tasks to complete
pool.join()
And that’s it.
The updated task() function with these changes is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# multithreaded array filling with thread pool and one worker per physical cpu
from time import time
from multiprocessing.pool import ThreadPool
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates the empty array, then creates the thread pool. The large array is divided into quadrants and the indices are issued to the fill task via the thread pool. Each task fills its portion of the shared array in parallel, then the thread pool is closed and released.
This is repeated three times and the average time taken is reported.
In this case, it took about 2.308 seconds on my system.
Compared to the single-threaded version that calls the fill() method on the array, this is about 3.569 seconds faster, a speed-up of about 2.55x.
Took 2.308 seconds
Next, let’s see if we can do better.
Tuned Multithreaded Fill Numpy Array (faster)
We do not have to split the large array into quadrants or use one worker thread per physical CPU core.
Modern CPUs often support hyperthreading, which can offer up to a 30% performance benefit on some tasks. This may help when filling arrays with the same value, given the repetitive nature of the task function.
In this case, we will update the ThreadPool to have one worker per logical CPU core, e.g. double the number of physical CPUs. I have 4 physical CPU cores, which is seen as 8 logical CPU cores by the system. Update the configuration to match your system accordingly.
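If you prefer not to hard-code the worker count, the standard library can report the number of logical CPUs. A small sketch (falling back to 4 workers when the count is unavailable, an arbitrary choice):

```python
# size the thread pool to the number of logical CPUs
from os import cpu_count
from multiprocessing.pool import ThreadPool

# cpu_count() reports logical CPUs (including hyperthreads), or None if unknown
n_workers = cpu_count() or 4
print(f'Using {n_workers} workers')
# create a pool with one worker per logical CPU
with ThreadPool(n_workers) as pool:
    # issue tasks to the pool here
    pass
```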
...
# create the thread pool
with ThreadPool(8) as pool:
    # ...
Next, we can split the array into more than four pieces.
In this case, we will split each dimension into 4, giving 4×4 or 16 sub-arrays.
You can experiment with other divisions of the array into sub-arrays.
...
# split each dimension (divisor of matrix dimension)
split = round(n/4)
And that’s it.
The updated task() function with this change is listed below.
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
Tying this together, the complete example is listed below.
# multithreaded array filling with thread pool and one worker per logical cpu
from time import time
from multiprocessing.pool import ThreadPool
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPool(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.apply_async(fill_subarray, args=(coords, data, 1))
        # close the pool
        pool.close()
        # wait for tasks to complete
        pool.join()
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
Running the example creates the empty array, creates the ThreadPool, then divides the shared array into 16 parts to be filled in parallel.
In this case, the example took about 1.880 seconds to complete on average.
This is about 3.997 seconds faster than the single threaded version or a speed-up of about 3.13x.
It is also faster than the one worker per physical CPU example developed above. Specifically, it is 0.428 seconds (428 milliseconds) faster or a speed-up of about 1.23x.
Took 1.880 seconds
This highlights that it is important to continue testing different configurations, even after parallelism offers a significant speed-up.
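One way to keep testing configurations is to loop over candidate worker counts and split factors, timing each combination. A minimal sketch, using a small 2,000×2,000 array and an arbitrary grid of candidates (the best configuration will vary by system):

```python
# benchmark different worker counts and split factors
from time import time
from multiprocessing.pool import ThreadPool
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    i1, i2, i3, i4 = coords
    data[i1:i2,i3:i4].fill(value)

def fill_parallel(n, workers, divisions):
    # fill an n x n array using the given pool size and per-dimension splits
    data = empty((n, n))
    split = round(n / divisions)
    start = time()
    with ThreadPool(workers) as pool:
        for x in range(0, n, split):
            for y in range(0, n, split):
                coords = (x, x+split, y, y+split)
                pool.apply_async(fill_subarray, args=(coords, data, 1))
        pool.close()
        pool.join()
    return time() - start

# try each combination of workers and splits and report the duration
n = 2000
for workers in (2, 4, 8):
    for divisions in (2, 4):
        duration = fill_parallel(n, workers, divisions)
        print(f'{workers} workers, {divisions}x{divisions} splits: {duration:.4f}s')
```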
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
NumPy APIs
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to speed-up the filling of numpy arrays using parallelism.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Jannic Böhme on Unsplash