It is important to follow best practices when using the ProcessPoolExecutor in Python.
Best practices allow you to side-step the most common errors and bugs when using processes to execute ad hoc tasks in your programs.
In this tutorial you will discover the best practices when using Python process pools.
Let’s get started.
ProcessPoolExecutor Best Practices
The ProcessPoolExecutor is a flexible and powerful process pool for executing ad hoc tasks in an asynchronous manner.
Once you know how the ProcessPoolExecutor works, it is important to review some best practices to consider when bringing process pools into your Python programs.
To keep things simple, there are five best practices when using the ProcessPoolExecutor. They are:
- Use the Context Manager
- Use map() for Asynchronous For-Loops
- Use submit() with as_completed()
- Use Independent Functions as Tasks
- Use for CPU-Bound Tasks (probably)
Let’s get started with the first practice, which is to use the context manager.
Use the Context Manager
Use the context manager when using process pools, and handle all task dispatching to the process pool and result processing within the manager.
For example:
```python
...
# create a process pool via the context manager
with ProcessPoolExecutor(4) as executor:
    # ...
```
Remember to configure your process pool when creating it in the context manager, specifically by setting the number of processes to use in the pool.
Using the context manager avoids the situation where you have explicitly instantiated the process pool and forget to shut it down manually by calling shutdown().
It is also less code and better grouped than managing instantiation and shutdown manually, for example:
```python
...
# create a process pool manually
executor = ProcessPoolExecutor(4)
# ...
executor.shutdown()
```
Don’t use the context manager when you need to dispatch tasks and get results over a broader context (e.g. multiple functions), or when you need more control over the shutdown of the pool.
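Putting this together, the example below is a minimal complete sketch of the context manager pattern; the task() function and the pool size of four workers are placeholders for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

# a placeholder task function for illustration
def task(value):
    return value * 2

# protect the entry point, required when starting new processes
if __name__ == '__main__':
    # create the process pool via the context manager
    with ProcessPoolExecutor(4) as executor:
        # dispatch a task and block until the result is available
        future = executor.submit(task, 21)
        print(future.result())
    # the pool is shut down automatically on exit from the block
```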
Use map() for Asynchronous For-Loops
If you have a for-loop that applies a function to each item in a list, then use the map() function to dispatch the tasks asynchronously.
For example, you may have a for-loop over a list that calls task() for each item:
```python
...
# apply a function to each item in an iterable
for item in mylist:
    result = task(item)
    # do something...
```
Or, you may already be using the built-in map() function:
```python
...
# apply a function to each item in an iterable
for result in map(task, mylist):
    # do something...
```
Both of these cases can be made asynchronous using the map() function on the process pool:

```python
...
# apply a function to each item in an iterable asynchronously
for result in executor.map(task, mylist):
    # do something...
```
Probably do not use the map() function if your target task function has side effects.
Do not use the map() function if your target task function takes no arguments. If it takes more than one argument, only use map() when each argument can be supplied from a parallel iterable (map() can take multiple iterables).
Do not use the map() function if you need control over exception handling for each task, or if you would like to get results from tasks in the order that tasks are completed.
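As a concrete sketch of the asynchronous for-loop, the example below applies a placeholder two-argument task() function using map() with parallel iterables, one iterable per argument:

```python
from concurrent.futures import ProcessPoolExecutor

# a placeholder task function that takes two arguments
def task(a, b):
    return a + b

if __name__ == '__main__':
    with ProcessPoolExecutor(4) as executor:
        # map() takes one iterable per task argument
        for result in executor.map(task, [1, 2, 3], [10, 20, 30]):
            # results are yielded in the order tasks were submitted
            print(result)
```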
Use submit() with as_completed()
If you would like to process results in the order that tasks are completed, rather than the order that tasks are submitted, then use submit() and as_completed().
The submit() method on the process pool pushes a task into the pool for execution and returns immediately with a Future object for the task. The as_completed() module function takes an iterable of Future objects, such as a list, and yields each Future as its task completes.
For example:
```python
...
# submit all tasks and get future objects
futures = [executor.submit(myfunc, item) for item in mylist]
# process results from tasks in order of task completion
for future in as_completed(futures):
    # get the result
    result = future.result()
    # do something...
```
Do not use the submit() and as_completed() combination if you need to process the results in the order that the tasks were submitted to the process pool; instead, use map(), or use submit() and iterate the Future objects directly.
Do not use the submit() and as_completed() combination if you need the results from all tasks before continuing; instead, you may be better off using the wait() module function.
Do not use the submit() and as_completed() combination for a simple asynchronous for-loop; instead, you may be better off using map().
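As a sketch of the pattern, the example below submits a placeholder myfunc() task for each item and handles any exception on a per-task basis as results complete:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# a placeholder task function that may fail
def myfunc(item):
    if item < 0:
        raise ValueError(f'negative item: {item}')
    return item * item

if __name__ == '__main__':
    mylist = [3, -1, 4, 1, 5]
    with ProcessPoolExecutor(4) as executor:
        # submit all tasks and get future objects
        futures = [executor.submit(myfunc, item) for item in mylist]
        # process results in the order that tasks complete
        for future in as_completed(futures):
            try:
                # result() re-raises any exception raised in the task
                print(future.result())
            except ValueError as e:
                print(f'task failed: {e}')
```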
Use Independent Functions as Tasks
Use the ProcessPoolExecutor if your tasks are independent.
This means that each task is not dependent on other tasks that could execute at the same time. It may also mean that tasks do not depend on any data other than that provided via function arguments.
The ProcessPoolExecutor is ideal for tasks that do not change any data, e.g. have no side effects, so-called pure functions.
Process pools can be organized into data flows and pipelines for linear dependence between tasks, perhaps with one process pool per task type.
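For example, here is a minimal sketch of a two-stage pipeline with one process pool per task type; the stage1() and stage2() functions are placeholder pure functions:

```python
from concurrent.futures import ProcessPoolExecutor

# placeholder pure functions, one per pipeline stage
def stage1(x):
    return x * 2

def stage2(x):
    return x + 1

if __name__ == '__main__':
    data = [1, 2, 3, 4]
    # one process pool per task type, chained into a pipeline
    with ProcessPoolExecutor(2) as pool1, ProcessPoolExecutor(2) as pool2:
        # results from the first stage feed the second stage
        intermediate = pool1.map(stage1, data)
        for result in pool2.map(stage2, intermediate):
            print(result)
```

Note that intermediate results pass through the parent process between stages, so each item is serialized on the way into and out of each pool.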
The process pool is not designed for tasks that require coordination; consider using the Process class with coordination patterns like the Barrier and Semaphore.
Process pools are not designed for tasks that require synchronization; consider using the Process class with locking patterns like the Lock and RLock via a Manager.
Use for CPU-Bound Tasks (probably)
The ProcessPoolExecutor can be used for IO-bound tasks and CPU-bound tasks.
Nevertheless, it is probably best suited for CPU-bound tasks, whereas the ThreadPoolExecutor is probably best suited for IO-bound tasks.
CPU-bound tasks are those tasks that involve direct computation, e.g. executing instructions on data in the CPU. They are bound by the speed of execution of the CPU, hence the name CPU-bound.
This is unlike IO-bound tasks that must wait on external resources such as reading or writing to or from network connections and files.
Examples of common CPU-bound tasks that may be well suited to the ProcessPoolExecutor include:
- Media manipulation, e.g. resizing images, clipping audio and video, etc.
- Media encoding or decoding, e.g. encoding audio and video.
- Mathematical calculation, e.g. fractals, numerical approximation, estimation, etc.
- Model training, e.g. machine learning and artificial intelligence.
- Search, e.g. searching for a keyword in a large number of documents.
- Text processing, e.g. calculating statistics on a corpus of documents.
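For instance, here is a sketch of a CPU-bound mathematical calculation: counting the primes below a limit with a naive, placeholder count_primes() function executed across the pool:

```python
import math
from concurrent.futures import ProcessPoolExecutor

# a naive CPU-bound placeholder task: count the primes below n
def count_primes(n):
    count = 0
    for i in range(2, n):
        if all(i % d for d in range(2, int(math.sqrt(i)) + 1)):
            count += 1
    return count

if __name__ == '__main__':
    limits = [10000, 20000, 30000, 40000]
    # one worker process per CPU by default
    with ProcessPoolExecutor() as executor:
        for n, count in zip(limits, executor.map(count_primes, limits)):
            print(f'{count} primes below {n}')
```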
The ProcessPoolExecutor can be used for IO-bound tasks, but it is probably a poorer fit than threads and the ThreadPoolExecutor.
This is for two reasons:
- You can have more threads than processes.
- IO-bound tasks are often data intensive.
The number of processes you can create and manage is often quite limited, such as tens of processes or fewer than one hundred.
Whereas, when using threads, you can have hundreds or even thousands of threads within one process. This is helpful for IO operations that may need to access or manage a large number of connections or resources concurrently.
This can be pushed to tens of thousands of connections or resources or even higher when using AsyncIO.
IO-bound tasks typically involve reading or writing a lot of data.
This may be data read from or written to remote connections, databases, servers, files, external devices, and so on.
As such, if the data needs to be shared between processes, such as in a pipeline, it may require that the data be serialized (called pickling, Python’s built-in serialization process) in order to pass from process to process. This can be slow and very memory intensive, especially for large amounts of data.
This is not the case when using threads that can share and access the same resource in memory without data serialization.
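To get a rough feel for this serialization cost, you can measure a pickle round trip on a large payload directly; the payload below is arbitrary:

```python
import pickle
import time

# a large payload, like data that might be passed between processes
data = list(range(1000000))

# time a round trip through pickle serialization
start = time.perf_counter()
payload = pickle.dumps(data)
restored = pickle.loads(payload)
duration = time.perf_counter() - start

print(f'pickled and unpickled {len(payload)} bytes in {duration:.3f} seconds')
```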
Further Reading
This section provides additional resources that you may find helpful.
Books
- ProcessPoolExecutor Jump-Start, Jason Brownlee (my book!)
- Concurrent Futures API Interview Questions
- ProcessPoolExecutor PDF Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ProcessPoolExecutor: The Complete Guide
- Python ThreadPoolExecutor: The Complete Guide
- Python Multiprocessing: The Complete Guide
- Python Pool: The Complete Guide
References
- Thread (computing), Wikipedia.
- Process (computing), Wikipedia.
- Thread Pool, Wikipedia.
- Futures and promises, Wikipedia.
Takeaways
You now know the best practices when using the ProcessPoolExecutor in Python.
Do you have any questions about the best practices?
Ask your question in the comments below and I will do my best to answer.