Last Updated on September 12, 2022
It is important to follow best practices when using the multiprocessing.Pool in Python.
Best practices allow you to side-step the most common errors and bugs when using processes to execute ad hoc tasks in your programs.
In this tutorial you will discover the best practices when using process pools in Python.
Let’s get started.
Multiprocessing Pool Best Practices
The multiprocessing pool is a flexible and powerful process pool for executing ad hoc tasks in a synchronous or asynchronous manner.
Once you know how the multiprocessing pool works, it is important to review some best practices to consider when bringing process pools into your Python programs.
To keep things simple, there are six best practices when using the multiprocessing pool:
- Use the Context Manager
- Use map() for Parallel For-Loops
- Use imap_unordered() For Responsive Code
- Use map_async() to Issue Tasks Asynchronously
- Use Independent Functions as Tasks
- Use for CPU-Bound Tasks (probably)
Let’s get started with the first practice, which is to use the context manager.
Use the Context Manager
Use the context manager when using the multiprocessing pool to ensure the pool is always closed correctly.
For example:
...
# create a process pool via the context manager
with Pool(4) as pool:
    # ...
Remember to configure your multiprocessing pool when creating it in the context manager, specifically by setting the number of child process workers to use in the pool.
Using the context manager avoids the situation where you have explicitly instantiated the process pool and forget to shut it down manually by calling close() or terminate().
It is also less code and better grouped than managing instantiation and shutdown manually, for example:
...
# create a process pool manually
pool = Pool(4)
# ...
pool.close()
Don’t use the context manager when you need to dispatch tasks and get results over a broader context (e.g. multiple functions) and/or when you have more control over the shutdown of the pool.
You can learn more about how to use the multiprocessing pool context manager in the tutorial:
Use map() for Parallel For-Loops
If you have a for-loop that applies a function to each item in a list or iterable, then use the map() function to dispatch all tasks and process results once all tasks are completed.
For example, you may have a for-loop over a list that calls task() for each item:
...
# apply a function to each item in an iterable
for item in mylist:
    result = task(item)
    # do something...
Or, you may already be using the built-in map() function:
...
# apply a function to each item in an iterable
for result in map(task, mylist):
    # do something...
Both of these cases can be made parallel using the map() function on the process pool:
...
# apply a function to each item in an iterable in parallel
for result in pool.map(task, mylist):
    # do something...
Probably do not use the map() function if your target task function has side effects.
Do not use the map() function if your target task function has no arguments or more than one argument. If you have multiple arguments, you can use the starmap() function instead.
Do not use the map() function if you need control over exception handling for each task, or if you would like to get results from tasks in the order that tasks are completed.
Do not use the map() function if you have many tasks (e.g. hundreds or thousands), as all tasks will be dispatched at once. Instead, consider the lazier imap() function.
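As a rough sketch of both cases, the example below uses map() for a single-argument task and starmap() for a two-argument task; both task functions are hypothetical:

from multiprocessing import Pool

# hypothetical single-argument task
def task(item):
    return item * 2

# hypothetical two-argument task
def scale(item, factor):
    return item * factor

if __name__ == '__main__':
    with Pool() as pool:
        # parallel for-loop: apply task() to each item
        for result in pool.map(task, [1, 2, 3, 4, 5]):
            print(result)
        # starmap() unpacks each tuple into multiple arguments
        for result in pool.starmap(scale, [(1, 10), (2, 10), (3, 10)]):
            print(result)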
You can learn more about the parallel version of map() with the multiprocessing pool in the tutorial:
Use imap_unordered() For Responsive Code
If you would like to process results in the order that tasks are completed, rather than the order that tasks are submitted, then use the imap_unordered() function.
Unlike the Pool.map() function, the Pool.imap_unordered() function will iterate the provided iterable one item at a time and issue tasks to the process pool.
Unlike the Pool.imap() function, the Pool.imap_unordered() function will yield return values in the order that tasks are completed, not the order that tasks were issued to the process pool.
This allows the caller to process results from issued tasks as they become available, making the program more responsive.
For example:
...
# apply function to each item in the iterable in parallel
for result in pool.imap_unordered(task, items):
    # ...
Do not use the imap_unordered() function if you need to process the results in the order that the tasks were submitted to the process pool; instead, use the map() function.
Do not use the imap_unordered() function if you need results from all tasks before continuing on in the program; instead, you may be better off using map_async() and the AsyncResult.wait() function.
Do not use the imap_unordered() function for a simple parallel for-loop; instead, you may be better off using map().
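For a fuller sketch, the example below uses a hypothetical task with a variable delay so that results arrive out of submission order:

from multiprocessing import Pool
from time import sleep

# hypothetical task that takes a variable amount of time
def task(value):
    sleep(value % 3)
    return value

if __name__ == '__main__':
    with Pool(4) as pool:
        # results are yielded as tasks finish, not in submission order
        for result in pool.imap_unordered(task, range(10)):
            print(result)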
You can learn more about the imap_unordered() function in the tutorial:
Use map_async() to Issue Tasks Asynchronously
If you need to issue many tasks asynchronously, e.g. fire-and-forget, then use the map_async() function.
The map_async() function does not block while the function is applied to each item in the iterable; instead, it returns an AsyncResult object from which the results may be accessed.
Because map_async() does not block, it allows the caller to continue and retrieve the result when needed.
The caller can choose to call the wait() function on the returned AsyncResult object in order to wait for all of the issued tasks to complete, or call the get() function to wait for the tasks to complete and access the iterable of return values.
For example:
...
# apply the function
result = pool.map_async(task, items)
# wait for all tasks to complete
result.wait()
Do not use the map_async() function if you want to issue the tasks and then process the results once all tasks are complete. You would be better off using the map() function.
Do not use the map_async() function if you want to issue tasks one-by-one in a lazy manner in order to conserve memory, instead use the imap() function.
Do not use the map_async() function if you wish to issue tasks that take multiple arguments, instead use the starmap_async() function.
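Tying this together, a minimal sketch with a hypothetical task function might look as follows; here get() is used to block for and retrieve all return values:

from multiprocessing import Pool

# hypothetical task function
def task(value):
    return value * 10

if __name__ == '__main__':
    with Pool(4) as pool:
        # issue all tasks without blocking
        async_result = pool.map_async(task, range(5))
        # the caller is free to perform other work here...
        # get() blocks until all tasks complete, then returns the results
        results = async_result.get()
        print(results)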
You can learn more about the map_async() function in the tutorial:
Use Independent Functions as Tasks
Use the multiprocessing pool if your tasks are independent.
This means that each task is not dependent on other tasks that could execute at the same time. It also may mean tasks that are not dependent on any data other than data provided via function arguments to the task.
The multiprocessing pool is ideal for tasks that do not change any data, e.g. have no side effects, so-called pure functions.
The multiprocessing pool can be organized into data flows and pipelines for linear dependence between tasks, perhaps with one multiprocessing pool per task type.
The multiprocessing pool is not designed for tasks that require coordination; instead, consider using the multiprocessing.Process class and coordination patterns like the Barrier and Semaphore.
Process pools are not designed for tasks that require synchronization; instead, consider using the multiprocessing.Process class and locking patterns like Lock and RLock via a Manager.
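As an illustration, the sketch below uses a hypothetical pure function: it depends only on its arguments and changes no shared state, making it a good fit for the pool:

from multiprocessing import Pool

# a hypothetical pure task: depends only on its argument,
# modifies no shared state
def mean(record):
    return sum(record) / len(record)

if __name__ == '__main__':
    data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    with Pool() as pool:
        # each task is independent, so tasks can run in any order
        results = pool.map(mean, data)
        print(results)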
Use for CPU-Bound Tasks (probably)
The multiprocessing pool can be used for IO-bound tasks and CPU-bound tasks.
Nevertheless, it is probably best suited for CPU-bound tasks, whereas the multiprocessing.pool.ThreadPool or ThreadPoolExecutor are probably best suited for IO-bound tasks.
CPU-bound tasks are those tasks that involve direct computation, e.g. executing instructions on data in the CPU. They are bound by the speed of execution of the CPU, hence the name CPU-bound.
This is unlike IO-bound tasks that must wait on external resources such as reading or writing to or from network connections and files.
Examples of common CPU-bound tasks that may be well suited to the multiprocessing.Pool include (see the sketch after this list):
- Media manipulation, e.g. resizing images, clipping audio and video, etc.
- Media encoding or decoding, e.g. encoding audio and video.
- Mathematical calculation, e.g. fractals, numerical approximation, estimation, etc.
- Model training, e.g. machine learning and artificial intelligence.
- Search, e.g. searching for a keyword in a large number of documents.
- Text processing, e.g. calculating statistics on a corpus of documents.
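For example, a crude numerical calculation (a hypothetical CPU-bound task, sketched below under those assumptions) could be spread across CPU cores like this:

from multiprocessing import Pool
import math

# hypothetical CPU-bound task: a tight numerical loop,
# bound by the speed of the CPU rather than by IO
def summed_roots(n):
    total = 0.0
    for i in range(1, n):
        total += math.sqrt(i)
    return total

if __name__ == '__main__':
    with Pool() as pool:
        # each task occupies a worker process, and hence a CPU core
        results = pool.map(summed_roots, [1_000_000] * 4)
        print(results)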
The multiprocessing.Pool can be used for IO-bound tasks, but it is probably a poorer fit compared to using threads and the multiprocessing.pool.ThreadPool.
This is because of two reasons:
- You can have more threads than processes.
- IO-bound tasks are often data intensive.
The number of processes you can create and manage is often quite limited, typically tens of processes, or fewer than 100.
Whereas, when you are using threads, you can have hundreds or even thousands of threads within one process. This is helpful for IO operations that may need to access or manage a large number of connections or resources concurrently.
This can be pushed to tens of thousands of connections or resources or even higher when using AsyncIO.
IO-bound tasks typically involve reading or writing a lot of data.
This may be data read or written from or to remote connections, databases, servers, files, external devices, and so on.
As such, if the data needs to be shared between processes, such as in a pipeline, it may require that the data be serialized (pickled, using Python's built-in serialization process) in order to pass from process to process. This can be slow and very memory intensive, especially for large amounts of data.
This is not the case when using threads, which can share and access the same resources in memory without data serialization.
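A rough way to see this serialization cost is to time a pickle round trip on a large object directly; the 100 MB payload size below is an arbitrary assumption:

import pickle
from time import perf_counter

# a large in-memory payload, like data passed between processes
payload = b'x' * (100 * 1024 * 1024)  # 100 megabytes

# time a pickle round trip, a cost paid each time such data
# crosses a process boundary
start = perf_counter()
restored = pickle.loads(pickle.dumps(payload))
duration = perf_counter() - start
print(f'pickle round trip took {duration:.3f} seconds')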
Further Reading
This section provides additional resources that you may find helpful.
Books
- Multiprocessing Pool Jump-Start, Jason Brownlee (my book!)
- Multiprocessing API Interview Questions
- Pool Class API Cheat Sheet
I would also recommend specific chapters from these books:
- Effective Python, Brett Slatkin, 2019.
- See: Chapter 7: Concurrency and Parallelism
- High Performance Python, Ian Ozsvald and Micha Gorelick, 2020.
- See: Chapter 9: The multiprocessing Module
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See: Chapter 14: Threads and Processes
Guides
- Python Multiprocessing Pool: The Complete Guide
- Python ThreadPool: The Complete Guide
- Python Multiprocessing: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
Takeaways
You now know the best practices when using the multiprocessing.Pool in Python.
Do you have any questions about the best practices?
Ask your question in the comments below and I will do my best to answer.