You can set the chunksize argument to the map() method of the ThreadPoolExecutor to any value, because it has no effect.
Batches of tasks issued to the ThreadPoolExecutor via the map() method are not chunked, meaning that the chunksize argument is ignored.
In this tutorial, you will discover how to configure the chunksize argument to map() in the ThreadPoolExecutor.
Let’s get started.
Need to Configure ThreadPoolExecutor map() chunksize
The ThreadPoolExecutor class extends the Executor class.
The Executor class provides a method called map() for issuing tasks in batch.
The map() function takes the name of the target task function and an iterable of arguments, one for each task to be issued, then returns an iterable of the results from each task, once all tasks are done.
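For example, here is a minimal sketch of issuing a batch of tasks with map() on the ThreadPoolExecutor (the double() task is made up for illustration):

# example of issuing a batch of tasks via map()
from concurrent.futures import ThreadPoolExecutor

# hypothetical task that doubles the provided value
def double(value):
    return value * 2

# create a thread pool with 4 workers
with ThreadPoolExecutor(4) as tp:
    # issue 10 tasks and iterate results in issued order
    for result in tp.map(double, range(10)):
        print(result)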
You can learn more about the map() method on the ThreadPoolExecutor in the tutorial:
The map() method provides a “chunksize” argument.
Similarly, the multiprocessing.pool.ThreadPool class also provides a map() method and a chunksize argument.
Configuring the chunksize argument of the map() method on the ThreadPool can have a large impact on performance.
You can learn more about how to configure the chunksize argument to the map() method on the ThreadPool in the tutorial:
Given that chunksize is an important argument on map() how can we use it with the ThreadPoolExecutor?
How can we best configure the “chunksize” argument for the map() method on the ThreadPoolExecutor class?
How to Configure ThreadPoolExecutor map() chunksize Argument
In this section, we will explore how to configure the “chunksize” argument to the map() method on the ThreadPoolExecutor.
What is Chunksize?
The “chunksize” argument controls how tasks issued in a batch to a thread or process pool are grouped together into chunks and sent to workers in the pool.
this method chops iterables into a number of chunks which it submits to the pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer. For very long iterables, using a large value for chunksize can significantly improve performance compared to the default size of 1.
— concurrent.futures — Launching parallel tasks
For example, if an Executor had 5 workers and 100 tasks were issued, then a chunksize of 20 could be set so that each worker receives one chunk of 20 tasks, e.g. 100 / 20 = 5 chunks, one per worker.
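A minimal sketch of that scenario is below, using a ProcessPoolExecutor because that is where chunksize is honored (the square() task is hypothetical):

# example of issuing chunked tasks to a process pool
from concurrent.futures import ProcessPoolExecutor

# hypothetical task that squares the provided value
def square(value):
    return value * value

# protect the entry point, required when using process pools
if __name__ == '__main__':
    # 5 workers, 100 tasks, transmitted in chunks of 20 tasks
    with ProcessPoolExecutor(5) as pp:
        results = list(pp.map(square, range(100), chunksize=20))
        print(results[:5])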
What is the Default Value for chunksize?
The default value of the “chunksize” argument in the Executor class is one.
map(func, *iterables, timeout=None, chunksize=1)
How to Configure Chunksize
We cannot configure the “chunksize” argument to the map() method for the ThreadPoolExecutor.
The documentation states that the “chunksize” argument to map() on the ThreadPoolExecutor has no effect and that it is only used by the ProcessPoolExecutor.
With ThreadPoolExecutor, chunksize has no effect.
— concurrent.futures — Launching parallel tasks
How Does ThreadPoolExecutor Not Use chunksize?
If we review the source code for the Executor, we can see that its implementation of the map() method does not make use of the “chunksize” argument.
If we check the source code of the ProcessPoolExecutor, we can see that it overrides the map() method and makes use of the “chunksize” argument.
Finally, if we review the source code for the ThreadPoolExecutor, we can see that it does not override the map() method, and therefore makes use of the implementation of the map() method in the Executor class that does not use the “chunksize” argument.
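We can also confirm this at runtime without reading the source. Because the ThreadPoolExecutor does not override map(), its class attribute is the very same function object as on the Executor base class, whereas the ProcessPoolExecutor's is not. A quick sanity check (assuming CPython's concurrent.futures):

# confirm which classes override the map() method
from concurrent.futures import Executor, ThreadPoolExecutor, ProcessPoolExecutor

# ThreadPoolExecutor inherits map() unchanged from Executor
print(ThreadPoolExecutor.map is Executor.map)   # True
# ProcessPoolExecutor overrides map() to make use of chunksize
print(ProcessPoolExecutor.map is Executor.map)  # False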
Why Does ThreadPoolExecutor Not Use chunksize?
The ThreadPoolExecutor does not make use of the “chunksize” argument in the map() method by design.
We can infer that making use of a “chunksize” argument in map() does not offer a performance benefit when using threads, but does offer a benefit when using processes.
This may be because of the computational cost involved in serializing data arguments sent to and received from child processes in Python, and no such cost when using threads.
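As a rough illustration of that cost (process pools transfer arguments to workers using pickle; the payload and repeat count here are arbitrary):

# rough timing of the pickle round trip a process pool pays per transfer
import pickle
from timeit import timeit

# arbitrary payload standing in for task arguments
data = list(range(1000))
# time 1,000 serialize/deserialize round trips
cost = timeit(lambda: pickle.loads(pickle.dumps(data)), number=1000)
print(f'1,000 pickle round trips took {cost:.3f} seconds')

Threads in the same process share memory, so no such serialization is needed.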
We do not see any mention of a chunksize argument in the PEP that introduces the concurrent.futures functionality, suggesting it was added later.
Indeed, we can see from the CPython issue tracker that it was added later, in 2011, with comments noting that chunksize is not needed by thread pools.
I guess we could make the ThreadPoolExecutor API accept the chunksize argument just so the APIs match, but then not actually use it (and document that it’s not used, of course). That would make it easier to switch between the two APIs, at least.
— concurrent.futures.ProcessPoolExecutor.map() doesn’t batch function arguments by chunks
Now that we know that the ThreadPoolExecutor map() method does not use the chunksize argument, we can look at a worked example.
ThreadPoolExecutor Effect of Different chunksize Values
We can explore the effect (or lack of effect) of the chunksize argument on the map() method in the ThreadPoolExecutor.
In this example, we will define a task that blocks for a random amount of time. We will then issue a batch of 1,000 of these tasks to 100 workers using different chunksize values and compare the total time taken to execute all tasks.
This will show that regardless of the chunksize argument value, the overall execution time of all tasks does not meaningfully change, showing empirically that the chunksize argument is not used in the ThreadPoolExecutor.
Firstly, we can define a task to execute in the thread pool.
The task takes an integer argument value, a requirement of the map() method. It then generates a random value, blocks between 0 and 0.2 seconds to simulate effort, then returns the provided argument multiplied by the generated value.
The task() function below implements this.
# task executed in the thread pool
def task(value):
    # generate a random value in [0,1]
    rand_val = random()
    # block for a fraction of 0.2 seconds
    sleep(0.2 * rand_val)
    # return a unique value
    return value * rand_val
Next, we can define a function to benchmark the time taken to execute many tasks in the ThreadPoolExecutor with a given “chunksize” argument.
The function takes a given chunksize, creates a thread pool with 100 workers, then issues 1,000 tasks to the thread pool via the map() method using the given chunksize argument.
The use_threadpool() function below implements this.
# issue tasks with thread pool
def use_threadpool(chunksize_value):
    # create the thread pool
    with ThreadPoolExecutor(100) as tp:
        # issue batch of tasks
        for result in tp.map(task, range(1000), chunksize=chunksize_value):
            pass
Next, we can benchmark different chunksize values.
Firstly, we will use the use_threadpool() function once with an arbitrary chunksize value, to ensure all classes are loaded.
...
# warm things up
use_threadpool(None)
Next, we can define the different chunksize values to benchmark.
In this case, we test None, 1 (one task per chunk, the default at the time of writing), 2, 5, and 10. A chunksize of 10 is the largest value we can use that would still engage all workers (if chunksize were used), because 1,000 tasks divided into chunks of 10 gives 100 chunks, one for each of the 100 workers.
We then iterate over each chunksize and evaluate the time taken to execute all tasks with the chunksize. Each value is evaluated over 10 trials, and the time is averaged. This is to give a more robust estimate of performance.
# warm things up
use_threadpool(None)
# chunksizes to trial
chunk_sizes = [None, 1, 2, 5, 10]
# evaluate each chunksize
for size in chunk_sizes:
    # benchmark the thread pool
    n_trials = 10
    results = list()
    for trial in range(n_trials):
        # record start time
        start_time = time()
        # perform task
        use_threadpool(size)
        # calculate duration
        duration = time() - start_time
        results.append(duration)
        print(f'> Size={size}, trial {trial} took {duration} seconds')
    # report average result
    average = sum(results) / n_trials
    print(f'Size={size} Benchmark: {average:.3f} seconds')
Tying this together, the complete example is listed below.
# SuperFastPython.com
# benchmark different chunksize values for the threadpoolexecutor
from concurrent.futures import ThreadPoolExecutor
from time import sleep
from time import time
from random import random

# task executed in the thread pool
def task(value):
    # generate a random value in [0,1]
    rand_val = random()
    # block for a fraction of 0.2 seconds
    sleep(0.2 * rand_val)
    # return a unique value
    return value * rand_val

# issue tasks with thread pool
def use_threadpool(chunksize_value):
    # create the thread pool
    with ThreadPoolExecutor(100) as tp:
        # issue batch of tasks
        for result in tp.map(task, range(1000), chunksize=chunksize_value):
            pass

# warm things up
use_threadpool(None)
# chunksizes to trial
chunk_sizes = [None, 1, 2, 5, 10]
# evaluate each chunksize
for size in chunk_sizes:
    # benchmark the thread pool
    n_trials = 10
    results = list()
    for trial in range(n_trials):
        # record start time
        start_time = time()
        # perform task
        use_threadpool(size)
        # calculate duration
        duration = time() - start_time
        results.append(duration)
        print(f'> Size={size}, trial {trial} took {duration} seconds')
    # report average result
    average = sum(results) / n_trials
    print(f'Size={size} Benchmark: {average:.3f} seconds')
Running the example first performs the warm-up run with an arbitrary chunksize value.
Next, each chunksize is evaluated with 10 trials each, then the average is reported.
We can see that each trial takes about 1.1 to 1.2 seconds to complete.
A complete listing of results is provided below.
> Size=None, trial 0 took 1.1506719589233398 seconds
> Size=None, trial 1 took 1.11806321144104 seconds
> Size=None, trial 2 took 1.1720118522644043 seconds
> Size=None, trial 3 took 1.1320080757141113 seconds
> Size=None, trial 4 took 1.1261382102966309 seconds
> Size=None, trial 5 took 1.1563951969146729 seconds
> Size=None, trial 6 took 1.1678581237792969 seconds
> Size=None, trial 7 took 1.141017198562622 seconds
> Size=None, trial 8 took 1.1404149532318115 seconds
> Size=None, trial 9 took 1.166010856628418 seconds
Size=None Benchmark: 1.147 seconds
> Size=1, trial 0 took 1.147402048110962 seconds
> Size=1, trial 1 took 1.1525521278381348 seconds
> Size=1, trial 2 took 1.1430439949035645 seconds
> Size=1, trial 3 took 1.1227641105651855 seconds
> Size=1, trial 4 took 1.1598842144012451 seconds
> Size=1, trial 5 took 1.1597020626068115 seconds
> Size=1, trial 6 took 1.136141061782837 seconds
> Size=1, trial 7 took 1.1442019939422607 seconds
> Size=1, trial 8 took 1.1641101837158203 seconds
> Size=1, trial 9 took 1.1760389804840088 seconds
Size=1 Benchmark: 1.151 seconds
> Size=2, trial 0 took 1.1664016246795654 seconds
> Size=2, trial 1 took 1.1545510292053223 seconds
> Size=2, trial 2 took 1.0985357761383057 seconds
> Size=2, trial 3 took 1.1387717723846436 seconds
> Size=2, trial 4 took 1.1320688724517822 seconds
> Size=2, trial 5 took 1.14510178565979 seconds
> Size=2, trial 6 took 1.1600267887115479 seconds
> Size=2, trial 7 took 1.1403017044067383 seconds
> Size=2, trial 8 took 1.2002391815185547 seconds
> Size=2, trial 9 took 1.1112680435180664 seconds
Size=2 Benchmark: 1.145 seconds
> Size=5, trial 0 took 1.1697099208831787 seconds
> Size=5, trial 1 took 1.1489500999450684 seconds
> Size=5, trial 2 took 1.1547861099243164 seconds
> Size=5, trial 3 took 1.1116399765014648 seconds
> Size=5, trial 4 took 1.1498167514801025 seconds
> Size=5, trial 5 took 1.1647369861602783 seconds
> Size=5, trial 6 took 1.1305930614471436 seconds
> Size=5, trial 7 took 1.1448819637298584 seconds
> Size=5, trial 8 took 1.13112211227417 seconds
> Size=5, trial 9 took 1.1304850578308105 seconds
Size=5 Benchmark: 1.144 seconds
> Size=10, trial 0 took 1.1805949211120605 seconds
> Size=10, trial 1 took 1.1461248397827148 seconds
> Size=10, trial 2 took 1.1465327739715576 seconds
> Size=10, trial 3 took 1.137136697769165 seconds
> Size=10, trial 4 took 1.1699650287628174 seconds
> Size=10, trial 5 took 1.0957980155944824 seconds
> Size=10, trial 6 took 1.1895999908447266 seconds
> Size=10, trial 7 took 1.120788812637329 seconds
> Size=10, trial 8 took 1.1164579391479492 seconds
> Size=10, trial 9 took 1.126460075378418 seconds
Size=10 Benchmark: 1.143 seconds
We can summarize the results as follows:
Size=None Benchmark: 1.147 seconds
Size=1 Benchmark: 1.151 seconds
Size=2 Benchmark: 1.145 seconds
Size=5 Benchmark: 1.144 seconds
Size=10 Benchmark: 1.143 seconds
We can see no meaningful difference between the time taken with different chunksize argument values to the map() method.
This matches our expectations because the chunksize argument is not used by the map() method in the ThreadPoolExecutor.
Why aren’t all trials and all averages identical?
The reason why all trials and all averages are not identical is because of randomness on the machine executing the program. Other programs are running, including programs within the operating system that we don’t have control over. These take resources away from the system and result in very minor differences (fractions of a second) between each trial.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ThreadPoolExecutor Jump-Start, Jason Brownlee, (my book!)
- Concurrent Futures API Interview Questions
- ThreadPoolExecutor Class API Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ThreadPoolExecutor: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
- Python Threading: The Complete Guide
- Python ThreadPool: The Complete Guide
Takeaways
You now know how to configure the chunksize argument to map() in the ThreadPoolExecutor.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Anton Savinov on Unsplash