Last Updated on September 12, 2022
You can make your program slower by using the ThreadPoolExecutor in Python.
In this tutorial, you will discover the anti-pattern for using the ThreadPoolExecutor and how to avoid it on your projects.
Let’s get started.
ThreadPoolExecutor Can Be Slower Than a For Loop
The ThreadPoolExecutor in Python provides a pool of reusable threads for executing ad hoc tasks.
You can submit tasks to the thread pool by calling the submit() function and passing in the name of the function you wish to execute on another thread.
Calling the submit() function will return a Future object that allows you to check on the status of the task and get the result from the task once it completes.
You can also submit tasks by calling the map() function and specifying the name of the function to execute and the iterable of items to which your function will be applied.
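The two submission styles described above can be sketched in a few lines. This is a minimal illustration (the task function and arguments are made up for the demo):

```python
from concurrent.futures import ThreadPoolExecutor

# a trivial task, used only for illustration
def task(value):
    return value * 10

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules a single call and returns a Future immediately
    future = executor.submit(task, 5)
    # result() blocks until the task has completed
    print(future.result())  # 50
    # map() applies the function to each item, yielding results in order
    print(list(executor.map(task, [1, 2, 3])))  # [10, 20, 30]
```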
The ThreadPoolExecutor is designed to speed up your program by executing tasks concurrently.
Nevertheless, in some use cases, using the ThreadPoolExecutor can make your program slower. Sometimes dramatically slower than performing the same task in a for loop.
How can the ThreadPoolExecutor make your program slower?
ThreadPoolExecutor Can Be Slower for CPU-Bound Tasks
Using the ThreadPoolExecutor for a CPU-bound task can be slower than not using it.
This is because Python threads are constrained by the Global Interpreter Lock, or GIL.
The GIL is a mechanism in the reference Python interpreter (CPython) that uses a lock to ensure that only one thread can execute Python bytecode at a time within a Python process.
This means that although we may have multiple threads in the thread pool, only one thread can execute at a time.
This is fine when the tasks executed by the thread pool are blocking, such as IO-bound tasks that might read from a file or internet connection.
This is a problem when the tasks executed by the thread pool are CPU-bound, meaning that the speed of their execution is determined by the speed of the CPU. These tasks do not block and therefore run as fast as the CPU allows. Because of the GIL, the threads executing these tasks run one at a time, interleaved by context switching.
Context switching is an operating system mechanism that allows more than one thread of execution to share one CPU, e.g. it changes the “context” in which the CPU executes instructions. In a context switch, the operating system stores the state of the currently executing thread so that it can be resumed later, then restores the state of another thread and allows it to run.
This is a problem for CPU-bound tasks because context switching is a relatively expensive operation. Having many threads running at the same time, with the same priority and the same type of work, will likely force the operating system to context switch between them often, introducing unnecessary computational overhead.
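One small, verifiable detail behind this: CPython asks the running thread to release the GIL at a fixed interval so that other threads get a chance to run. A sketch showing how to inspect that interval:

```python
import sys

# CPython asks the running thread to release the GIL at this interval
# (in seconds), allowing the operating system to context switch to
# another thread; the default is typically 0.005, i.e. 5 milliseconds
print(sys.getswitchinterval())
```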
The result is the overall task will be slower when executing it with the ThreadPoolExecutor compared to executing it directly in a for loop.
Given that executing CPU-bound tasks with the ThreadPoolExecutor will likely result in worse performance, we might refer to this usage as an anti-pattern. That is a pattern of usage that can be easily identified and should be avoided, e.g. a bad solution to the problem of concurrency in Python.
- Using the ThreadPoolExecutor for CPU-bound Tasks is an Anti-pattern.
The ProcessPoolExecutor should probably be used instead for CPU-bound tasks. This is because it uses processes instead of threads, and as such, it is not constrained by the GIL.
Now that we know why using a ThreadPoolExecutor can be slower than a for loop in some cases, let’s look at a worked example.
Example of ThreadPoolExecutor Being Slower Than a For Loop
Let’s look at an example where using a ThreadPoolExecutor can be slower than a for loop.
CPU-Bound Tasks in a For Loop
First, let’s define a simple CPU-bound task to execute many times.
In this case, we can square a number. That is, given a numeric input, return the squared value.
```python
# perform some math operation
def operation(value):
    return value**2
```
Next, let’s perform this operation many times, such as one million (1,000,000) times, and report a message when we are done. That is, we will square the numbers from zero to 999,999.
We can use a list comprehension, a pythonic alternative to a for loop.
```python
...
# perform a math operation many times
values = [operation(i) for i in range(1000000)]
print('done')
```
This could just as easily be written as a for loop directly; for example:
```python
...
# perform a math operation many times
values = list()
for i in range(1000000):
    values.append(operation(i))
print('done')
```
Or using the built-in map() function, which might also be considered pythonic; for example:
```python
...
# perform a math operation many times
values = list(map(operation, range(1000000)))
print('done')
```
We’ll stick with the list comprehension. The complete example is listed below.
```python
# SuperFastPython.com
# example of performing a simple math task many times in a for loop

# perform some math operation
def operation(value):
    return value**2

# perform a math operation many times
values = [operation(i) for i in range(1000000)]
print('done')
```
The code runs fast, completing in about 335 milliseconds on my system.
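The listing above prints only a done message; to reproduce the timing yourself, one common approach is to wrap the loop with time.perf_counter(), a high-resolution monotonic clock. A sketch:

```python
import time

# perform some math operation
def operation(value):
    return value**2

# time the loop with a high-resolution monotonic clock
start = time.perf_counter()
values = [operation(i) for i in range(1000000)]
duration = time.perf_counter() - start
print(f'done in {duration:.3f} seconds')
```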
How long does it take to run on your system?
Let me know in the comments below.
Next, let’s make the task concurrent using the ThreadPoolExecutor.
CPU-Bound Tasks in ThreadPoolExecutor
We can update the code from the previous example to use the ThreadPoolExecutor.
As described previously, this is an anti-pattern, so we would expect this example to run slower than the for loop version because of the overhead of context switching.
First, we can create a thread pool with some number of threads, in this case 4. We can then use the map() function on the ThreadPoolExecutor to submit the tasks into the thread pool. Each task will be to square a number, with numbers from 0 to 999,999 sent into the pool for execution.
```python
...
# perform a math operation many times
with ThreadPoolExecutor(4) as executor:
    results = executor.map(operation, range(1000000))
```
The complete example is listed below.
```python
# SuperFastPython.com
# example of performing a simple math task concurrently
from concurrent.futures import ThreadPoolExecutor

# perform some math operation
def operation(value):
    return value**2

# perform a math operation many times
with ThreadPoolExecutor(4) as executor:
    results = executor.map(operation, range(1000000))
print('done')
```
Running the example squares all of the numbers as before, but it is dramatically slower.
On my system, it takes about 22.9 seconds to complete compared to the 0.335 seconds taken with the for loop.
That is nearly two orders of magnitude slower, or more precisely 68.3x slower.
Ouch!
Again, the reason for the slowdown is that the task is CPU-bound and the ThreadPoolExecutor uses threads that are subject to the GIL. Only one thread can execute at a time, and the operating system must context switch between them, adding a large amount of overhead to the overall task.
Common Questions
This section answers some commonly asked questions about this example.
What If We Use More Threads in the ThreadPoolExecutor?
Using more threads will not improve the performance for the same reason that using 4 threads does not speed up the execution of the tasks.
The GIL ensures only one thread executes instructions at a time and the operating system will context switch between the tasks, adding significant overhead to the overall task.
What If We Use One Thread in the ThreadPoolExecutor?
The ThreadPoolExecutor will still be slower than the for loop even if the thread pool has only one thread.
The reason is the additional overhead within the thread pool: each task is wrapped in internal objects and passes through additional function calls as it moves through the pool for execution.
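We can check this claim directly by timing a single-worker pool against the plain loop. This is a sketch; the exact ratio will vary by machine, and it uses a smaller number of iterations to keep the demonstration quick:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# perform some math operation
def operation(value):
    return value**2

n = 100000  # fewer iterations than the main example, to keep the demo quick

# time the plain for loop
start = time.perf_counter()
loop_result = [operation(i) for i in range(n)]
loop_time = time.perf_counter() - start

# time a thread pool with a single worker thread
start = time.perf_counter()
with ThreadPoolExecutor(1) as executor:
    pool_result = list(executor.map(operation, range(n)))
pool_time = time.perf_counter() - start

print(f'for loop: {loop_time:.3f}s, one-thread pool: {pool_time:.3f}s')
```

Both versions compute the same results; only the per-task overhead differs.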
This is the reason why even using a ProcessPoolExecutor for such a simple task would likely not provide a speed up, at least for this example.
Will the Worker Threads in the ThreadPoolExecutor Run on Different Cores?
Probably not.
The operating system determines what code will run on each CPU core in your system.
Because only one thread can execute at a time in this example, it is very likely that a single CPU core would be used.
Will Using a ProcessPoolExecutor Speed Up This Example?
Probably not.
The operation is very simple and we are only performing it one million times.
If we push these tasks into a ProcessPoolExecutor using map(), we first have to tune the chunksize argument, which controls how the submitted tasks are grouped into batches that are serialized and transmitted to the worker processes in the pool.
Then executing the tasks in the process pool will add overhead, firstly for the additional wrapping of the tasks using internal objects and the additional function calls that need to be made, and secondly for the inter-process communication needed to share the tasks and their results between processes.
Using a ProcessPoolExecutor for the above example would be lucky to achieve comparable performance to the for loop, highlighting that even a CPU-bound task can run slower with the appropriate concurrency tool when each unit of work is too small to offset the overhead.
For example, the listing below shows how we can adapt the example to use the ProcessPoolExecutor.
```python
# SuperFastPython.com
# example of performing a simple math task concurrently
from concurrent.futures import ProcessPoolExecutor

# perform some math operation
def operation(value):
    return value**2

if __name__ == '__main__':
    # perform a math operation many times
    with ProcessPoolExecutor(4) as executor:
        results = executor.map(operation, range(1000000), chunksize=10000)
    print('done')
```
Running the example shows performance slightly worse than the for loop, completing in about 381 milliseconds on my system, compared to about 335 with the for loop.
The story changes, and the ProcessPoolExecutor begins to show a real reduction in running time, if we increase the number of tasks from one million to ten million.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ThreadPoolExecutor Jump-Start, Jason Brownlee, (my book!)
- Concurrent Futures API Interview Questions
- ThreadPoolExecutor Class API Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ThreadPoolExecutor: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
- Python Threading: The Complete Guide
- Python ThreadPool: The Complete Guide
Takeaways
You now know how ThreadPoolExecutor can make your programs slower and how to avoid it.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Simon Connellan on Unsplash