Last Updated on September 29, 2023
You can fill a NumPy array in parallel using Python threads.
NumPy will release the global interpreter lock (GIL) when calling a fill function, allowing Python worker threads in the ThreadPoolExecutor to run in parallel and populate different sub-arrays of a large shared array.
This can offer up to a 3x speed-up when filling NumPy arrays, depending on the number of CPU cores available and the size of the array that is being initialized.
In this tutorial, you will discover how to speed up the filling of NumPy arrays using parallelism with the ThreadPoolExecutor.
Let’s get started.
Need to Fill Large NumPy Array
It is common to create a NumPy array and fill it with an initial value.
For example, we may create an array and need to initialize it with zeros, ones, or a specific value.
If the array is very large, e.g. millions of values, initializing it may take a long time, e.g. many seconds.
How can we fill large NumPy arrays faster using the ThreadPoolExecutor?
How to Use ThreadPoolExecutor to Fill Large NumPy Array
We can fill a NumPy array faster using multiple concurrent threads in the ThreadPoolExecutor.
Most NumPy functions are implemented in C. As such, they release the global interpreter lock (GIL), allowing other Python threads to run in parallel.
This means that we can use multiple Python threads to fill the same NumPy array in parallel.
You can learn more about NumPy releasing the GIL in the tutorial:
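As a minimal sketch of this idea (the array size and the use of two plain `threading.Thread` workers are illustrative, not part of the tutorial's code), two threads can each fill one half of the same shared array at the same time:

```python
from threading import Thread
from numpy import empty

# create a shared array (uninitialized)
n = 1000
data = empty((n, n))

# each thread fills a different half of the same array
t1 = Thread(target=lambda: data[:n//2, :].fill(1))
t2 = Thread(target=lambda: data[n//2:, :].fill(1))
t1.start(); t2.start()
t1.join(); t2.join()

# the whole array is now populated
print(data.sum())  # 1000000.0
```

Because `ndarray.fill()` releases the GIL, both threads can make progress at the same time.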
This can be achieved by dividing the NumPy array into sections and having each thread fill a different section.
Firstly, we can create a ThreadPoolExecutor with one worker thread per CPU core in our system.
For example, if we have 4 CPU cores, then we can create a ThreadPoolExecutor with 4 worker threads:
```python
...
# create the thread pool
with ThreadPoolExecutor(4) as pool:
    # ...
```
You can learn more about configuring the number of workers for the ThreadPoolExecutor in the tutorial:
We can then define a task function that takes the shared NumPy array as an argument as well as the coordinates to fill and the value with which to fill them.
For example:
```python
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
```
We can then divide our large NumPy array into sections and issue each as a task to the thread pool to have it populated in parallel.
This can be challenging to make generic as it depends on the number of threads and the dimensions of the array.
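One hedged sketch of a more general approach is a helper that yields coordinate tuples for an arbitrary grid of sub-arrays (`split_coords` is a hypothetical helper for illustration, not part of the tutorial's code; it assumes a square n x n matrix):

```python
# sketch: yield (i1, i2, i3, i4) coordinate tuples that divide
# an n x n matrix into roughly grid x grid sub-arrays
def split_coords(n, grid):
    step = n // grid
    for x in range(0, n, step):
        for y in range(0, n, step):
            # clamp the end indices so we never run past the array
            yield (x, min(x + step, n), y, min(y + step, n))

# e.g. a 100x100 matrix in a 2x2 grid gives 4 quadrants
coords = list(split_coords(100, 2))
print(len(coords))  # 4
```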
Let’s take a simple case: a two-dimensional NumPy array (a matrix) and 4 physical CPU cores.
In this case, we would ideally want to split our matrix into 4 tasks to be completed in parallel. This is probably the most efficient use of resources.
For a square matrix, we can find the midpoint of each dimension, then enumerate the four quadrants using for-loops, indexing into the larger matrix directly. These indices can be gathered into a tuple representing the coordinates of the region each task will populate.
For example:
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
```
We can then issue each set of coordinates to our fill_subarray() function using the submit() method of the ThreadPoolExecutor.
For example:
```python
...
# issue task
_ = pool.submit(fill_subarray, coords, data, 1)
```
You can learn more about submitting asynchronous tasks to the ThreadPoolExecutor in the tutorial:
And that’s it.
Note, if you prefer to use the ThreadPool to fill the array instead of the ThreadPoolExecutor, see the tutorial:
Now that we know how to fill a large array in parallel using threads, let’s look at some worked examples.
Example of Filling a NumPy Array (slow)
Before we develop an example of multithreaded array filling, let’s confirm that single-threaded array filling is slow.
We can explore a case of creating and filling an array using a built-in function.
Examples of built-in functions that create and populate an array include numpy.ones() and numpy.zeros().
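For example (a small illustrative snippet; numpy.full() is included to show filling with an arbitrary value):

```python
from numpy import ones, zeros, full

a = ones((3, 3))        # 3x3 matrix of 1.0
b = zeros((3, 3))       # 3x3 matrix of 0.0
c = full((3, 3), 99.0)  # 3x3 matrix of 99.0
print(a.sum(), b.sum(), c.sum())  # 9.0 0.0 891.0
```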
We will use the numpy.ones() function to create an array of a given size, in this case, a matrix with the dimensions 50,000×50,000.
The function below creates an array using the numpy.ones() function with the given size and populates it with ones, then returns how long the task took in seconds.
```python
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration
```
We can then repeat this task a number of times and report the average time it takes to complete in seconds.
This is important because a single run of the task may report an unstable time, running longer or shorter depending on the system. Averaging over multiple runs gives a fairer idea of the expected performance of the task.
The experiment() function below executes the task 3 times and returns the average time taken in seconds.
```python
# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats
```
Finally, we can run the experiment and report the average time taken.
```python
...
# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Tying this together, the complete example is listed below.
```python
# single-threaded example of filling an array with ones
from time import time
from numpy import ones

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1
    data = ones((n,n))
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Running the example creates a 50,000 x 50,000 array and populates it with ones. This is repeated 3 times and the average time taken is reported in seconds.
On my system, it took about 6.212 seconds to complete with Python 3.11.4 and NumPy 1.25.0.
This example may execute in more or less time depending on the capabilities of your system.
We can imagine how the performance of filling an array may get worse as the size of the array is increased further.
```
> took 6.131 seconds, sum=2500000000
> took 6.207 seconds, sum=2500000000
> took 6.299 seconds, sum=2500000000
Average duration: 6.212 seconds
```
Next, let’s explore how we can perform the same operation faster with parallelism.
Example of ThreadPoolExecutor Filling a NumPy Array (fast)
We can fill the same array faster using threads.
We will use a ThreadPoolExecutor and issue tasks to operate on sub-sections of the provided array simultaneously, as described in the strategy outlined above.
Firstly, we can define a task function that takes a tuple of array indices and the shared array. It then calls the fill() function on the sub-array defined by the provided coordinates.
```python
# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)
```
Python threads can share memory directly. This means that we can share the same array between all tasks executed in worker threads and they will all operate on the same structure in memory, not copies.
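A quick sketch of this property (the array size and worker function are illustrative): a change made by a worker thread is visible in the main thread, because both reference the same array in memory rather than a copy:

```python
from concurrent.futures import ThreadPoolExecutor
from numpy import zeros

# shared array, referenced by both the main and worker threads
data = zeros(10)

# worker writes into the shared array in place
def worker(arr):
    arr[:5] = 1

with ThreadPoolExecutor(1) as pool:
    pool.submit(worker, data).result()

# the main thread sees the worker's writes
print(data.sum())  # 5.0
```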
Next, we can create our shared array.
```python
...
# create an empty array
data = empty((n,n))
```
We can then create a thread pool with one worker per physical CPU core.
I have 4 physical CPU cores in my system, so I will configure the ThreadPoolExecutor to use 4 workers. Update the ThreadPoolExecutor configuration accordingly to match the number of physical CPU cores in your system.
```python
...
# create the thread pool
with ThreadPoolExecutor(4) as pool:
    # ...
```
Our array is square, so it can be divided into quadrants (4 parts), each issued to the fill_subarray() function for parallel filling.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/2)
# issue tasks
for x in range(0, n, split):
    for y in range(0, n, split):
        # determine matrix coordinates
        coords = (x, x+split, y, y+split)
        # issue task
        _ = pool.submit(fill_subarray, coords, data, 1)
```
The main thread will then exit the context manager for the ThreadPoolExecutor, and block until all tasks in the thread pool are completed and the worker threads have been terminated.
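If you need the fully populated array before leaving the with-block, you can also keep the returned Future objects and wait on them explicitly; a small sketch under that assumption (the array size is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, wait
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    i1, i2, i3, i4 = coords
    data[i1:i2, i3:i4].fill(value)

n = 100
data = empty((n, n))

with ThreadPoolExecutor(4) as pool:
    split = n // 2
    # collect the futures for all issued tasks
    futures = [pool.submit(fill_subarray, (x, x+split, y, y+split), data, 1)
               for x in range(0, n, split)
               for y in range(0, n, split)]
    # block here, before the context manager exits
    wait(futures)
    # the array is fully populated at this point
    print(data.sum())  # 10000.0
```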
And that’s it.
The updated task() function with these changes is listed below.
```python
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration
```
Tying this together, the complete example is listed below.
```python
# multithreaded array filling with thread pool and one worker per physical cpu
from time import time
from concurrent.futures import ThreadPoolExecutor
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(4) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/2)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Running the example creates the empty array, then creates the ThreadPoolExecutor.
The large array is divided into quadrants and the indices are issued to the fill task via the thread pool. Each task fills its portion of the shared array in parallel, then the thread pool is closed and released.
This is repeated three times and the average time taken is reported.
In this case, it took about 2.546 seconds on my system.
Compared to the single-threaded version above, this is about 3.666 seconds faster, or a speed-up of about 2.44x.
```
> took 2.537 seconds, sum=2500000000
> took 2.586 seconds, sum=2500000000
> took 2.514 seconds, sum=2500000000
Average duration: 2.546 seconds
```
Next, let’s see if we can do better.
Example of ThreadPoolExecutor Filling a NumPy Array (faster)
We do not have to split the large array into quadrants or use one worker thread per physical CPU core.
Many modern CPUs support hyperthreading, which can offer up to a 30% performance benefit on some tasks. This may help when filling arrays with the same value, given the repetitive nature of the task function.
In this case, we will update the ThreadPoolExecutor to have one worker per logical CPU core, e.g. double the number of physical CPUs.
I have 4 physical CPU cores, which is seen as 8 logical CPU cores by the system. Update the configuration to match your system accordingly.
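Rather than hard-coding the worker count, you can size the pool from the number of logical CPUs reported by the standard library; a sketch under the assumption that os.cpu_count() reflects your hyperthreaded core count (it returns logical cores and may return None on some platforms):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# number of logical CPU cores (physical cores x hardware threads)
n_workers = os.cpu_count() or 1

# create the thread pool with one worker per logical core
with ThreadPoolExecutor(n_workers) as pool:
    print(f'pool has {n_workers} workers')
```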
```python
...
# create the thread pool
with ThreadPoolExecutor(8) as pool:
    # ...
```
Next, we can split the array into more than four pieces.
In this case, we will split each dimension into 4, giving 4 * 4 or 16 sub-arrays to fill.
You can experiment with other divisions of the array into sub-arrays.
```python
...
# split each dimension (divisor of matrix dimension)
split = round(n/4)
```
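One way to experiment is to time a range of candidate divisors and compare; a hedged sketch (the matrix size is kept small here so it runs quickly, and the divisors chosen all divide n evenly):

```python
from time import time
from concurrent.futures import ThreadPoolExecutor
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    i1, i2, i3, i4 = coords
    data[i1:i2, i3:i4].fill(value)

n = 2000
for divisor in [1, 2, 4, 8]:
    data = empty((n, n))
    start = time()
    with ThreadPoolExecutor(8) as pool:
        # divisor sub-ranges per dimension, divisor^2 tasks in total
        split = n // divisor
        for x in range(0, n, split):
            for y in range(0, n, split):
                pool.submit(fill_subarray, (x, x+split, y, y+split), data, 1)
    # report the duration for this division of the array
    print(f'divisor={divisor}: {time() - start:.3f} seconds')
```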
And that’s it.
The updated task() function with this change is listed below.
```python
# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration
```
Tying this together, the complete example is listed below.
```python
# multithreaded array filling with thread pool and one worker per logical cpu
from time import time
from concurrent.futures import ThreadPoolExecutor
from numpy import empty

# fill a portion of a larger array with a value
def fill_subarray(coords, data, value):
    # unpack array indexes
    i1, i2, i3, i4 = coords
    # populate subarray
    data[i1:i2,i3:i4].fill(value)

# task function
def task(n=50000):
    # record the start time
    start = time()
    # create an empty array
    data = empty((n,n))
    # create the thread pool
    with ThreadPoolExecutor(8) as pool:
        # split each dimension (divisor of matrix dimension)
        split = round(n/4)
        # issue tasks
        for x in range(0, n, split):
            for y in range(0, n, split):
                # determine matrix coordinates
                coords = (x, x+split, y, y+split)
                # issue task
                _ = pool.submit(fill_subarray, coords, data, 1)
    # calculate and report duration
    duration = time() - start
    # report progress
    print(f'> took {duration:.3f} seconds, sum={data.sum():.0f}')
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
avg = experiment()
print(f'Average duration: {avg:.3f} seconds')
```
Running the example creates the empty array, creates the ThreadPoolExecutor, then divides the shared array into 16 parts to be filled in parallel.
In this case, the example took about 2.024 seconds to complete on average.
This is about 4.188 seconds faster than the single-threaded version or a speed-up of about 3.07x.
It is also faster than the one worker per physical CPU example developed above. Specifically, it is 0.522 seconds (522 milliseconds) faster or a speed-up of about 1.26x.
This highlights that it is important to continue testing different configurations, even after parallelism offers a significant speed-up.
```
> took 2.022 seconds, sum=2500000000
> took 1.969 seconds, sum=2500000000
> took 2.080 seconds, sum=2500000000
Average duration: 2.024 seconds
```
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent NumPy in Python, Jason Brownlee (my book!)
Guides
- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes
Documentation
- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API
Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
Takeaways
You now know how to speed up the filling of NumPy arrays using parallelism with the ThreadPoolExecutor.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Nicolas Picard on Unsplash