Last Updated on November 23, 2023
The Python ThreadPoolExecutor provides a pool of reusable worker threads.
The ThreadPoolExecutor class is part of the Python standard library. It offers easy-to-use pools of worker threads via the modern executor design pattern. It is ideal for making loops of I/O-bound tasks concurrent and for issuing tasks asynchronously.
This book-length guide provides a detailed and comprehensive walkthrough of the Python ThreadPoolExecutor API.
Some tips:
- You may want to bookmark this guide and read it over a few sittings.
- You can download a zip of all code used in this guide.
- You can get help, ask a question in the comments or email me.
- You can jump to the topics that interest you via the table of contents (below).
Let’s dive in.
Python Threads and the Need for Thread Pools
So, what are threads and why do we care about thread pools?
What Are Python Threads
A thread refers to a thread of execution in a computer program.
Every Python program is a process with one thread called the main thread used to execute your program instructions. Each process is in fact one instance of the Python interpreter that executes Python instructions (Python bytecode), which is a slightly lower level than the code you type into your Python program.
Sometimes, we may need to create additional threads within our Python process to execute tasks concurrently.
Python provides real native (system-level) threads via the threading.Thread class.
A task can be run in a new thread by creating an instance of the Thread class and specifying the function to run in the new thread via the target argument.
...
# create and configure a new thread to run a function
thread = Thread(target=task)
Once the thread is created, it must be started by calling the start() function.
...
# start the task in a new thread
thread.start()
We can then wait for the task to complete by joining the thread; for example:
...
# wait for the task to complete
thread.join()
We can demonstrate this with a complete example with a task that sleeps for a moment and prints a message.
The complete example of executing a target task function in a separate thread is listed below.
# SuperFastPython.com
# example of executing a target task function in a separate thread
from time import sleep
from threading import Thread

# a simple task that blocks for a moment and prints a message
def task():
    # block for a moment
    sleep(1)
    # display a message
    print('This is coming from another thread')

# create and configure a new thread to run a function
thread = Thread(target=task)
# start the task in a new thread
thread.start()
# display a message
print('Waiting for the new thread to finish...')
# wait for the task to complete
thread.join()
Running the example creates the thread object to run the task() function.
The thread is started and the task() function is executed in another thread. The task sleeps for a moment; meanwhile, in the main thread, a message is printed that we are waiting around and the main thread joins the new thread.
Finally, the new thread finishes sleeping, prints a message, and closes. The main thread then carries on and also closes as there are no more instructions to execute.
Waiting for the new thread to finish...
This is coming from another thread
This is useful for running one-off ad hoc tasks in a separate thread, although it becomes cumbersome when you have many tasks to run.
Each thread that is created requires the allocation of resources (e.g. memory for the thread’s stack space). The computational costs of setting up threads can become expensive if we are creating and destroying many threads over and over for ad hoc tasks.
Instead, we would prefer to keep worker threads around for reuse if we expect to run many ad hoc tasks throughout our program.
This can be achieved using a thread pool.
What Are Thread Pools
A thread pool is a programming pattern for automatically managing a pool of worker threads.
The pool is responsible for a fixed number of threads.
- It controls when the threads are created, such as just-in-time when they are needed.
- It also controls what threads should do when they are not being used, such as making them wait without consuming computational resources.
Each thread in the pool is called a worker or a worker thread. Each worker is agnostic to the type of task that is executed, allowing the user of the thread pool to execute a suite of similar (homogeneous) or dissimilar (heterogeneous) tasks in terms of the function called, function arguments, task duration, and more.
Worker threads are designed to be re-used once the task is completed and provide protection against the unexpected failure of the task, such as raising an exception, without impacting the worker thread itself.
This is unlike a single thread that is configured for the single execution of one specific task.
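For example, here is a minimal sketch of this protection (using the ThreadPoolExecutor class introduced below): a task that raises an exception does not break the worker, and the same worker goes on to execute the next task.

from concurrent.futures import ThreadPoolExecutor

# a task that fails by raising an exception
def failing_task():
    raise ValueError('something went wrong')

# a task that succeeds
def good_task():
    return 'OK'

# a pool with a single worker thread
with ThreadPoolExecutor(max_workers=1) as executor:
    # the worker executes the failing task; the exception is captured
    bad_future = executor.submit(failing_task)
    # the same worker is reused for the next task
    good_future = executor.submit(good_task)
    print(bad_future.exception()) # something went wrong
    print(good_future.result())   # OK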
The pool may provide some facility to configure the worker threads, such as running an initialization function and naming each worker thread using a specific naming convention.
Thread pools can provide a generic interface for executing ad hoc tasks with a variable number of arguments, but do not require that we choose a thread to run the task, start the thread, or wait for the task to complete.
It can be significantly more efficient to use a thread pool instead of manually starting, managing, and closing threads, especially with a large number of tasks.
Python provides a thread pool via the ThreadPoolExecutor class.
ThreadPoolExecutor for Thread Pools in Python
The ThreadPoolExecutor Python class is used to create and manage thread pools and is provided in the concurrent.futures module.
The concurrent.futures module was introduced in Python 3.2, was written by Brian Quinlan, and provides both thread pools and process pools, although we will focus our attention on thread pools in this guide.
If you’re interested, you can access the Python source code for the ThreadPoolExecutor class directly via thread.py. It may be interesting to dig into how the class works internally, perhaps after you are familiar with how it works from the outside.
The ThreadPoolExecutor extends the Executor class and will return Future objects when it is called.
- Executor: Parent class for the ThreadPoolExecutor that defines basic lifecycle operations for the pool.
- Future: Object returned when submitting tasks to the thread pool that may complete later.
Let’s take a closer look at Executors, Futures, and the lifecycle of using the ThreadPoolExecutor class.
What Are Executors
The ThreadPoolExecutor class extends the abstract Executor class.
The Executor class defines three methods used to control our thread pool; they are: submit(), map(), and shutdown().
- submit(): Dispatch a function to be executed and return a future object.
- map(): Apply a function to an iterable of elements.
- shutdown(): Shut down the executor.
The Executor is started when the class is created and must be shut down explicitly by calling shutdown(), which will release any resources held by the Executor. We can also have the pool shut down automatically (e.g. via a context manager), but we will look at that a little later.
The submit() and map() functions are used to submit tasks to the Executor for asynchronous execution.
The map() function operates just like the built-in map() function and is used to apply a function to each element in an iterable object, like a list. Unlike the built-in map() function, each application of the function to an element will happen asynchronously instead of sequentially.
The submit() function takes a function, as well as any arguments, and will execute it asynchronously, although the call returns immediately and provides a Future object.
We will take a closer look at each of these three functions in a moment. Firstly, what is a Future?
What Are Futures
A future is an object that represents a delayed result for an asynchronous task.
It is also sometimes called a promise or a delay. It provides a context for the result of a task that may or may not be executing and a way of getting a result once it is available.
In Python, the Future object is returned from an Executor, such as a ThreadPoolExecutor when calling the submit() function to dispatch a task to be executed asynchronously.
In general, we do not create Future objects; we only receive them and we may need to call functions on them.
There is always one Future object for each task sent into the ThreadPoolExecutor via a call to submit().
The Future object provides a number of helpful functions for inspecting the status of the task such as: cancelled(), running(), and done() to determine if the task was cancelled, is currently running, or has finished execution.
- cancelled(): Returns True if the task was cancelled before being executed.
- running(): Returns True if the task is currently running.
- done(): Returns True if the task has completed or was cancelled.
A running task cannot be cancelled and a done task could have been cancelled.
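As a quick sketch, assuming a hypothetical my_task() function already submitted to an executor, the status of a task can be checked via its Future object:

...
# submit the task and get a future object
future = executor.submit(my_task)
# inspect the status of the task
if future.cancelled():
    print('Task was cancelled')
elif future.running():
    print('Task is currently running')
elif future.done():
    print('Task has finished')
else:
    print('Task is queued and waiting to run')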
A Future object also provides access to the result of the task via the result() function. If an exception was raised while executing the task, it will be re-raised when calling the result() function or can be accessed via the exception() function.
- result(): Access the result from running the task.
- exception(): Access any exception raised while running the task.
Both the result() and exception() functions allow a timeout to be specified as an argument, which is the number of seconds to wait for a return value if the task is not yet complete. If the timeout expires, then a TimeoutError will be raised.
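For example, a minimal sketch of waiting a limited time for a result; note that the TimeoutError raised here is the one importable from concurrent.futures:

...
from concurrent.futures import TimeoutError
try:
    # wait up to 5 seconds for the task to complete
    value = future.result(timeout=5)
except TimeoutError:
    print('Task did not complete within the timeout')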
Finally, we may want to have the thread pool automatically call a function once the task is completed.
This can be achieved by attaching a callback to the Future object for the task via the add_done_callback() function.
- add_done_callback(): Add a callback function to the task to be executed by the thread pool once the task is completed.
We can add more than one callback to each task and they will be executed in the order they were added. If the task has already completed before we add the callback, then the callback is executed immediately.
Any exceptions raised in the callback function will not impact the task or thread pool.
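Putting those pieces together, a small sketch of registering a callback, assuming a custom task() function and an existing executor:

...
# callback executed by the thread pool when the task completes
def report_done(future):
    print(f'Task completed with result: {future.result()}')

# submit the task, then attach the callback to its future
future = executor.submit(task)
future.add_done_callback(report_done)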
We will take a closer look at the Future object in a later section.
Now that we are familiar with the functionality of a ThreadPoolExecutor provided by the Executor class and of Future objects returned by calling submit(), let’s take a closer look at the lifecycle of the ThreadPoolExecutor class.
Lifecycle of the ThreadPoolExecutor
The ThreadPoolExecutor provides a pool of generic worker threads.
The ThreadPoolExecutor was designed to be easy and straightforward to use.
If multithreading was like the transmission for changing gears in a car, then using threading.Thread is a manual transmission (e.g. hard to learn and use), whereas concurrent.futures.ThreadPoolExecutor is an automatic transmission (e.g. easy to learn and use).
- threading.Thread: Manual threading in Python.
- concurrent.futures.ThreadPoolExecutor: Automatic or “just work” mode for threading in Python.
There are four main steps in the lifecycle of using the ThreadPoolExecutor class; they are: create, submit, wait, and shut down.
- 1. Create: Create the thread pool by calling the constructor ThreadPoolExecutor().
- 2. Submit: Submit tasks and get futures by calling submit() or map().
- 3. Wait: Wait and get results as tasks complete (optional).
- 4. Shut down: Shut down the thread pool by calling shutdown().
The following figure helps to picture the lifecycle of the ThreadPoolExecutor class.
Let’s take a closer look at each lifecycle step in turn.
Step 1. Create the Thread Pool
First, a ThreadPoolExecutor instance must be created.
When an instance of a ThreadPoolExecutor is created, it can be configured with the fixed number of threads in the pool, a prefix used when naming each thread in the pool, and a function to call when initializing each worker thread, along with any arguments for that function.
By default, the pool is created with one thread for each logical CPU in your system plus four, capped at 32 threads. This is good for most purposes.
- Default Total Threads = min(32, (Total CPUs) + 4)
For example, if you have 4 CPUs, each with hyperthreading (most modern CPUs have this), then Python will see 8 CPUs and will allocate (8 + 4) or 12 threads to the pool by default.
...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()
It is a good idea to test your application in order to determine the number of threads that results in the best performance, anywhere from a few threads to hundreds of threads.
It is typically not a good idea to have thousands of threads as it may start to impact the amount of available RAM and results in a large amount of switching between threads, which may result in worse performance.
We’ll discuss tuning the number of threads for your pool later on.
You can specify the number of threads to create in the pool via the max_workers argument; for example:
...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)
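The remaining constructor arguments can be provided at the same time. For example, a sketch that names worker threads with a custom prefix and runs a per-thread initialization function (the worker_init() function and its argument are hypothetical):

...
# function executed once by each new worker thread
def worker_init(log_level):
    print(f'Initializing worker, log level: {log_level}')

# create a configured thread pool
executor = ThreadPoolExecutor(max_workers=10,
    thread_name_prefix='Downloader',
    initializer=worker_init,
    initargs=('DEBUG',))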
Step 2. Submit Tasks to the Thread Pool
Once the thread pool has been created, you can submit tasks for asynchronous execution.
As discussed, there are two main approaches for submitting tasks defined on the Executor parent class. They are: map() and submit().
Step 2a. Submit Tasks With map()
The map() function is an asynchronous version of the built-in map() function for applying a function to each element in an iterable, like a list.
You can call the map() function on the pool and pass it the name of your function and the iterable.
You are most likely to use map() when converting a for loop so that each iteration is executed as a task in the thread pool.
...
# perform all tasks in parallel
results = executor.map(my_task, my_items) # does not block
Where “my_task” is your function that you wish to execute and “my_items” is your iterable of objects, each to be executed by your “my_task” function.
The tasks will be queued up in the thread pool and executed by worker threads in the pool as they become available.
The map() function will return an iterable immediately. This iterable can be used to access the results from the target task function as they become available, in the order that the tasks were submitted (i.e. the order of the iterable you provided).
...
# iterate over results as they become available
for result in executor.map(my_task, my_items):
    print(result)
You can also set a timeout when calling map() via the “timeout” argument in seconds if you wish to impose a limit on how long you’re willing to wait for each task to complete as you’re iterating, after which a TimeoutError will be raised.
...
# perform all tasks in parallel
# iterate over results as they become available
for result in executor.map(my_task, my_items, timeout=5):
    # wait for task to complete or timeout expires
    print(result)
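If a timeout does occur, it can be handled explicitly; a sketch, assuming my_task and my_items as before (TimeoutError is importable from concurrent.futures):

...
from concurrent.futures import TimeoutError
try:
    # iterate results, raising TimeoutError if results take too long
    for result in executor.map(my_task, my_items, timeout=5):
        print(result)
except TimeoutError:
    print('Not all tasks completed within the timeout')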
Step 2b. Submit Tasks With submit()
The submit() function submits one task to the thread pool for execution.
The function takes the name of the function to call and all arguments to the function, then returns a Future object immediately.
The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.
...
# submit a task with arguments and get a future object
future = executor.submit(my_task, arg1, arg2) # does not block
Where “my_task” is the function you wish to execute and “arg1” and “arg2” are the first and second arguments to pass to the “my_task” function.
You can use the submit() function to submit tasks that do not take any arguments; for example:
...
# submit a task with no arguments and get a future object
future = executor.submit(my_task) # does not block
You can access the result of the task via the result() function on the returned Future object. This call will block until the task is completed.
...
# get the result from a future
result = future.result() # blocks
You can also set a timeout when calling result() via the “timeout” argument in seconds if you wish to impose a limit on how long you’re willing to wait for each task to complete, after which a TimeoutError will be raised.
...
# wait for task to complete or timeout expires
result = future.result(timeout=5) # blocks
Step 3. Wait for Tasks to Complete (Optional)
The concurrent.futures module provides two module utility functions for waiting for tasks via their Future objects.
Recall that Future objects are only created when we call submit() to push tasks into the thread pool.
These wait functions are optional to use, as you can wait for results directly after calling map() or submit() or wait for all tasks in the thread pool to finish.
These two module functions are wait() for waiting for Future objects to complete and as_completed() for getting Future objects as their tasks complete.
- wait(): Wait on one or more Future objects until they are completed.
- as_completed(): Returns Future objects from a collection as they complete their execution.
You can use both functions with Future objects created by one or more thread pools; they are not specific to any given thread pool in your application. This is helpful if you want to perform waiting operations across multiple thread pools that are executing different types of tasks.
Both functions are useful with the idiom of dispatching multiple tasks into the thread pool via submit() in a list comprehension; for example:
...
# dispatch tasks into the thread pool and create a list of futures
futures = [executor.submit(my_task, my_data) for my_data in my_datalist]
Here, my_task is our custom target task function, “my_data” is one element of data passed as an argument to “my_task“, and “my_datalist” is our source of my_data objects.
We can then pass the Future objects to wait() or as_completed().
Creating a list of Future objects in this way is not required, just a common pattern when converting for loops into tasks submitted to a thread pool.
Step 3a. Wait for Futures to Complete
The wait() function can take one or more Future objects and will return when a specified action occurs, such as all tasks completing, one task completing, or one task raising an exception.
The function returns two sets of Future objects: the first contains the futures that met the condition set via the “return_when” argument, and the second contains the futures for tasks that did not meet the condition. These are called the “done” and the “not_done” sets of futures.
This is useful for waiting on a large batch of work and stopping as soon as the first result is available.
This can be achieved via the FIRST_COMPLETED constant passed to the “return_when” argument.
...
# wait until we get the first result
done, not_done = wait(futures, return_when=concurrent.futures.FIRST_COMPLETED)
Alternatively, we can wait for all tasks to complete via the ALL_COMPLETED constant.
This can be helpful if you are using submit() to dispatch tasks and are looking for an easy way to wait for all work to be completed.
...
# wait for all tasks to complete
done, not_done = wait(futures, return_when=concurrent.futures.ALL_COMPLETED)
There is also an option to wait for the first exception via the FIRST_EXCEPTION constant.
...
# wait for the first exception
done, not_done = wait(futures, return_when=concurrent.futures.FIRST_EXCEPTION)
Step 3b. Wait for Futures as Completed
The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for all tasks to be completed.
The as_completed() function will return Future objects for tasks as they are completed in the thread pool.
We can call the function and provide it a list of Future objects created by calling submit() and it will return Future objects as they are completed in whatever order.
It is common to use the as_completed() function in a loop over the list of Future objects created when calling submit; for example:
...
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
    # get the result for the next completed task
    result = future.result() # blocks
Note: this is different from iterating over the results from calling map() in two ways. Firstly, map() returns an iterator over task results, not over Future objects. Secondly, map() returns results in the order that the tasks were submitted, not in the order that they are completed.
Step 4. Shutdown the Thread Pool
Once all tasks are completed, we can close down the thread pool, which will release each thread and any resources it may hold (e.g. the thread stack space).
...
# shutdown the thread pool
executor.shutdown() # blocks
The shutdown() function will wait for all tasks in the thread pool to complete before returning by default.
This behavior can be changed by setting the “wait” argument to False when calling shutdown(), in which case the function will return immediately. The resources used by the thread pool will not be released until all current and queued tasks have completed.
...
# shutdown the thread pool
executor.shutdown(wait=False) # does not block
We can also instruct the pool to cancel all queued tasks to prevent their execution. This can be achieved by setting the “cancel_futures” argument to True. By default, queued tasks are not cancelled when calling shutdown().
...
# cancel all queued tasks
executor.shutdown(cancel_futures=True) # blocks
If we forget to close the thread pool, the thread pool will be closed automatically when we exit the main thread. If we forget to close the pool and there are still tasks executing, the main thread will not exit until all tasks in the pool and all queued tasks have executed.
ThreadPoolExecutor Context Manager
A preferred way to work with the ThreadPoolExecutor class is to use a context manager.
This matches the preferred way to work with other resources, such as files and sockets.
Using the ThreadPoolExecutor with a context manager involves using the “with” keyword to create a block in which you can use the thread pool to execute tasks and get results.
Once the block has completed, the thread pool is automatically shut down. Internally, the context manager will call the shutdown() function with the default arguments, waiting for all queued and executing tasks to complete before returning and carrying on.
Below is a code snippet to demonstrate creating a thread pool using the context manager.
...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as pool:
    # submit tasks and get results
    # ...
    # automatically shutdown the thread pool...
# the pool is shutdown at this point
This is a very handy idiom if you are converting a for loop to be multithreaded.
It is less useful if you want the thread pool to operate in the background while you perform other work in the main thread of your program, or if you wish to reuse the thread pool multiple times throughout your program.
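In those cases, a sketch of the manual alternative is to create the pool once, reuse it for multiple batches of work, and shut it down explicitly when the program is finished with it (my_task is a placeholder):

...
# create the thread pool once, without a context manager
executor = ThreadPoolExecutor(max_workers=10)
# use the pool for a first batch of tasks
futures1 = [executor.submit(my_task, i) for i in range(10)]
# ...perform other work in the main thread...
# reuse the same pool for a later batch of tasks
futures2 = [executor.submit(my_task, i) for i in range(10)]
# explicitly shut down the pool when it is no longer needed
executor.shutdown()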
Now that we are familiar with how to use the ThreadPoolExecutor, let’s look at some worked examples.
ThreadPoolExecutor Example
In this section, we will look at a more complete example of using the ThreadPoolExecutor.
Perhaps the most common use case for the ThreadPoolExecutor is to download files from the internet concurrently.
It’s a useful problem because there are many ways we can download files. We will use this problem as the basis to explore the different patterns with the ThreadPoolExecutor for downloading files concurrently.
First, let’s develop a serial (non-concurrent) version of the program.
Download Files Serially
Consider the situation where we might want a local copy of some of the Python API documentation on concurrency for later viewing.
Perhaps we are taking a flight and won’t have internet access and will need to refer to the documentation in HTML format as it appears on the docs.python.org website. It’s a contrived scenario; Python is installed with docs and we also have the pydoc command, but go with me here.
We may want to download local copies of the following ten URLs that cover the extent of the Python concurrency APIs.
We can define these URLs as a list of strings for processing in our program.
# python concurrency API docs
URLS = ['https://docs.python.org/3/library/concurrency.html',
        'https://docs.python.org/3/library/concurrent.html',
        'https://docs.python.org/3/library/concurrent.futures.html',
        'https://docs.python.org/3/library/threading.html',
        'https://docs.python.org/3/library/multiprocessing.html',
        'https://docs.python.org/3/library/multiprocessing.shared_memory.html',
        'https://docs.python.org/3/library/subprocess.html',
        'https://docs.python.org/3/library/queue.html',
        'https://docs.python.org/3/library/sched.html',
        'https://docs.python.org/3/library/contextvars.html']
URLs are reasonably easy to download in Python.
First, we can attempt to open a connection to the server using the urllib.request.urlopen() function and specify the URL and a reasonable timeout in seconds.
This will give us a connection object, on which we can call the read() function to read the contents of the file. Using the context manager for the connection ensures it will be closed automatically, even if an exception is raised.
The download_url() function below implements this, taking a URL as a parameter and returning the contents of the file or None if the file cannot be downloaded for whatever reason. We will set a lengthy timeout of 3 seconds in case our internet connection is flaky for some reason.
# download a url and return the raw data, or None on error
def download_url(url):
    try:
        # open a connection to the server
        with urlopen(url, timeout=3) as connection:
            # read the contents of the html doc
            return connection.read()
    except:
        # bad url, socket timeout, http forbidden, etc.
        return None
Once we have the data for a URL, we can save it as a local file.
First, we need to retrieve the filename of the file specified in the URL. There are a few ways to do this, but the os.path.basename() function is a common approach when working with paths. We can then use the os.path.join() function to construct an output path for saving the file, using a directory we specify and the filename.
We can then use the open() built-in function to open the file in write-binary mode and save the contents of the file, again using the context manager to ensure the file is closed once we are finished.
The save_file() function below implements this, taking the URL that was downloaded, the contents of the file that was downloaded, and the local output path where we wish to save downloaded files. It returns the output path that was used to save the file, in case we want to report progress to the user.
# save data to a local file
def save_file(url, data, path):
    # get the name of the file from the url
    filename = basename(url)
    # construct a local path for saving the file
    outpath = join(path, filename)
    # save to file
    with open(outpath, 'wb') as file:
        file.write(data)
    return outpath
Next, we can call the download_url() function for each URL in our list, then call save_file() to save each downloaded file.
The download_and_save() function below implements this, reporting progress along the way, and handling the case of URLs that cannot be downloaded.
# download and save a url as a local file
def download_and_save(url, path):
    # download the url
    data = download_url(url)
    # check for no data
    if data is None:
        print(f'>Error downloading {url}')
        return
    # save the data to a local file
    outpath = save_file(url, data, path)
    # report progress
    print(f'>Saved {url} to {outpath}')
Finally, we need a function to drive the process.
First, the local output location where we will be saving files needs to be created, if it does not exist. We can achieve this using the os.makedirs() function.
We can iterate over a list of URLs and call our download_and_save() function for each.
The download_docs() function below implements this.
# download a list of URLs to local files
def download_docs(urls, path):
    # create the local directory, if needed
    makedirs(path, exist_ok=True)
    # download each url and save as a local file
    for url in urls:
        download_and_save(url, path)
And that’s it.
We can then call our download_docs() with our list of URLs and an output directory. In this case, we will use a ‘docs/‘ subdirectory of our current working directory (where the Python script is located) as the output directory.
Tying this together, the complete example of downloading files serially is listed below.
# SuperFastPython.com
# download document files and save to local files serially
from os import makedirs
from os.path import basename
from os.path import join
from urllib.request import urlopen

# download a url and return the raw data, or None on error
def download_url(url):
    try:
        # open a connection to the server
        with urlopen(url, timeout=3) as connection:
            # read the contents of the html doc
            return connection.read()
    except:
        # bad url, socket timeout, http forbidden, etc.
        return None

# save data to a local file
def save_file(url, data, path):
    # get the name of the file from the url
    filename = basename(url)
    # construct a local path for saving the file
    outpath = join(path, filename)
    # save to file
    with open(outpath, 'wb') as file:
        file.write(data)
    return outpath

# download and save a url as a local file
def download_and_save(url, path):
    # download the url
    data = download_url(url)
    # check for no data
    if data is None:
        print(f'>Error downloading {url}')
        return
    # save the data to a local file
    outpath = save_file(url, data, path)
    # report progress
    print(f'>Saved {url} to {outpath}')

# download a list of URLs to local files
def download_docs(urls, path):
    # create the local directory, if needed
    makedirs(path, exist_ok=True)
    # download each url and save as a local file
    for url in urls:
        download_and_save(url, path)

# python concurrency API docs
URLS = ['https://docs.python.org/3/library/concurrency.html',
        'https://docs.python.org/3/library/concurrent.html',
        'https://docs.python.org/3/library/concurrent.futures.html',
        'https://docs.python.org/3/library/threading.html',
        'https://docs.python.org/3/library/multiprocessing.html',
        'https://docs.python.org/3/library/multiprocessing.shared_memory.html',
        'https://docs.python.org/3/library/subprocess.html',
        'https://docs.python.org/3/library/queue.html',
        'https://docs.python.org/3/library/sched.html',
        'https://docs.python.org/3/library/contextvars.html']
# local path for saving the files
PATH = 'docs'
# download all docs
download_docs(URLS, PATH)
Running the example iterates over the list of URLs and downloads each in turn.
Each file is then saved to a local file in the specified directory.
The process takes about 700 milliseconds to about one second (1,000 milliseconds) on my system.
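If you want to measure this yourself, one approach (not part of the original listing) is to wrap the call to download_docs() with time.perf_counter(); a sketch:

...
from time import perf_counter
# record the start time
start = perf_counter()
# download all docs
download_docs(URLS, PATH)
# report the overall duration
print(f'Took {perf_counter() - start:.3f} seconds')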
Try running it a few times; how long does it take on your system?
Let me know in the comments.
>Saved https://docs.python.org/3/library/concurrency.html to docs/concurrency.html
>Saved https://docs.python.org/3/library/concurrent.html to docs/concurrent.html
>Saved https://docs.python.org/3/library/concurrent.futures.html to docs/concurrent.futures.html
>Saved https://docs.python.org/3/library/threading.html to docs/threading.html
>Saved https://docs.python.org/3/library/multiprocessing.html to docs/multiprocessing.html
>Saved https://docs.python.org/3/library/multiprocessing.shared_memory.html to docs/multiprocessing.shared_memory.html
>Saved https://docs.python.org/3/library/subprocess.html to docs/subprocess.html
>Saved https://docs.python.org/3/library/queue.html to docs/queue.html
>Saved https://docs.python.org/3/library/sched.html to docs/sched.html
>Saved https://docs.python.org/3/library/contextvars.html to docs/contextvars.html
Next, we can look at making the program concurrent using a thread pool.
Download Files Concurrently With submit()
Let’s look at updating our program to make use of the ThreadPoolExecutor to download files concurrently.
A first thought might be to use map() as we just want to make a for-loop concurrent.
Unfortunately, the download_and_save() function that we call on each iteration of the loop takes two arguments, and only one of them (the URL) varies from call to call; the output path is fixed.
An alternate approach is to use submit() to call download_and_save() in a separate thread for each URL in the provided list.
We can do this by first configuring a thread pool with the number of threads equal to the number of URLs in the list. We’ll use the context manager for the thread pool so that it will be closed automatically for us when we finish.
We can then call the submit() function for each URL using a list comprehension. We don’t even need the Future objects returned from calling submit(), as there is no result we’re waiting for.
...
# create the thread pool
n_threads = len(urls)
with ThreadPoolExecutor(n_threads) as executor:
    # download each url and save as a local file
    _ = [executor.submit(download_and_save, url, path) for url in urls]
Once each thread has completed, the context manager will close the thread pool for us and we’re done.
We don’t even need to add an explicit call to wait, although we could if we wanted to make the code more readable; for example:
...
# create the thread pool
n_threads = len(urls)
with ThreadPoolExecutor(n_threads) as executor:
    # download each url and save as a local file
    futures = [executor.submit(download_and_save, url, path) for url in urls]
    # wait for all download tasks to complete
    _, _ = wait(futures)
But, adding this wait is not needed.
The updated version of our download_docs() function that downloads and saves the files concurrently is listed below.
# download a list of URLs to local files
def download_docs(urls, path):
    # create the local directory, if needed
    makedirs(path, exist_ok=True)
    # create the thread pool
    n_threads = len(urls)
    with ThreadPoolExecutor(n_threads) as executor:
        # download each url and save as a local file
        _ = [executor.submit(download_and_save, url, path) for url in urls]
Tying this together, the complete example is listed below.
# SuperFastPython.com
# download document files and save to local files concurrently
from os import makedirs
from os.path import basename
from os.path import join
from urllib.request import urlopen
from concurrent.futures import ThreadPoolExecutor

# download a url and return the raw data, or None on error
def download_url(url):
    try:
        # open a connection to the server
        with urlopen(url, timeout=3) as connection:
            # read the contents of the html doc
            return connection.read()
    except:
        # bad url, socket timeout, http forbidden, etc.
        return None

# save data to a local file
def save_file(url, data, path):
    # get the name of the file from the url
    filename = basename(url)
    # construct a local path for saving the file
    outpath = join(path, filename)
    # save to file
    with open(outpath, 'wb') as file:
        file.write(data)
    return outpath

# download and save a url as a local file
def download_and_save(url, path):
    # download the url
    data = download_url(url)
    # check for no data
    if data is None:
        print(f'>Error downloading {url}')
        return
    # save the data to a local file
    outpath = save_file(url, data, path)
    # report progress
    print(f'>Saved {url} to {outpath}')

# download a list of URLs to local files
def download_docs(urls, path):
    # create the local directory, if needed
    makedirs(path, exist_ok=True)
    # create the thread pool
    n_threads = len(urls)
    with ThreadPoolExecutor(n_threads) as executor:
        # download each url and save as a local file
        _ = [executor.submit(download_and_save, url, path) for url in urls]

# python concurrency API docs
URLS = ['https://docs.python.org/3/library/concurrency.html',
        'https://docs.python.org/3/library/concurrent.html',
        'https://docs.python.org/3/library/concurrent.futures.html',
        'https://docs.python.org/3/library/threading.html',
        'https://docs.python.org/3/library/multiprocessing.html',
        'https://docs.python.org/3/library/multiprocessing.shared_memory.html',
        'https://docs.python.org/3/library/subprocess.html',
        'https://docs.python.org/3/library/queue.html',
        'https://docs.python.org/3/library/sched.html',
        'https://docs.python.org/3/library/contextvars.html']
# local path for saving the files
PATH = 'docs'
# download all docs
download_docs(URLS, PATH)
Running the example downloads and saves the files as before.
This time, the operation is complete in a fraction of a second. About 300 milliseconds in my case, which is less than half the time it took to download all files serially in the previous example.
How long did it take to download all files on your system?
>Saved https://docs.python.org/3/library/concurrent.html to docs/concurrent.html
>Saved https://docs.python.org/3/library/multiprocessing.shared_memory.html to docs/multiprocessing.shared_memory.html
>Saved https://docs.python.org/3/library/concurrency.html to docs/concurrency.html
>Saved https://docs.python.org/3/library/sched.html to docs/sched.html
>Saved https://docs.python.org/3/library/contextvars.html to docs/contextvars.html
>Saved https://docs.python.org/3/library/queue.html to docs/queue.html
>Saved https://docs.python.org/3/library/concurrent.futures.html to docs/concurrent.futures.html
>Saved https://docs.python.org/3/library/threading.html to docs/threading.html
>Saved https://docs.python.org/3/library/subprocess.html to docs/subprocess.html
>Saved https://docs.python.org/3/library/multiprocessing.html to docs/multiprocessing.html
This is one approach to making the program concurrent, but let’s look at some alternatives.
Download Files Concurrently With submit() and as_completed()
Perhaps we want to report the progress of downloads as they are completed.
The thread pool allows us to do this by storing the Future objects returned from calls to submit() and then calling the as_completed() module function on the collection of Future objects.
Also, consider that we are doing two things in the task. The first is downloading from a remote server, an I/O-bound operation that we can perform concurrently. The second is saving the contents of the file to the local hard drive, another I/O-bound operation that does not benefit from concurrency, as most hard drives can only write one file at a time.
Therefore, perhaps a better design is to make only the file downloading part of the program concurrent and keep the file saving part serial.
This will require more changes to the program.
We can call the download_url() function for each URL and this can be our concurrent task submitted to the thread pool.
When we call result() on each Future object, it will give us the data that was downloaded, but we won’t know what URL the data was downloaded from. The Future object won’t know.
Therefore, we can update the download_url() to return both the data that was downloaded and the URL that was provided as an argument.
The updated version of the download_url() function that returns a tuple of data and the input URL is listed below.
# download a url and return the raw data, or None on error
def download_url(url):
    try:
        # open a connection to the server
        with urlopen(url, timeout=3) as connection:
            # read the contents of the html doc
            return (connection.read(), url)
    except:
        # bad url, socket timeout, http forbidden, etc.
        return (None, url)
We can then submit a call to this function for each URL to the thread pool, giving us a Future object for each.
...
# download each url and save as a local file
futures = [executor.submit(download_url, url) for url in urls]
So far, so good.
Now, we want to save local files and report progress as the files are downloaded.
This requires that we cannibalize the download_and_save() function and move its logic into the download_docs() function used to drive the program.
We can iterate over the futures via the as_completed() function that will return Future objects in the order that the downloads are completed, not the order that we dispatched them into the thread pool.
We can then retrieve the data and URL from the Future object.
...
# process each result as it is available
for future in as_completed(futures):
    # get the downloaded url data
    data, url = future.result()
We can check if the download was unsuccessful and report an error, otherwise save the file and report progress as per normal. A direct copy-paste from the download_and_save() function.
...
# check for no data
if data is None:
    print(f'>Error downloading {url}')
    continue
# save the data to a local file
outpath = save_file(url, data, path)
# report progress
print(f'>Saved {url} to {outpath}')
The updated version of our download_docs() function that will only download files concurrently then save the files serially as the files are downloaded is listed below.
# download a list of URLs to local files
def download_docs(urls, path):
    # create the local directory, if needed
    makedirs(path, exist_ok=True)
    # create the thread pool
    n_threads = len(urls)
    with ThreadPoolExecutor(n_threads) as executor:
        # download each url and save as a local file
        futures = [executor.submit(download_url, url) for url in urls]
        # process each result as it is available
        for future in as_completed(futures):
            # get the downloaded url data
            data, url = future.result()
            # check for no data
            if data is None:
                print(f'>Error downloading {url}')
                continue
            # save the data to a local file
            outpath = save_file(url, data, path)
            # report progress
            print(f'>Saved {url} to {outpath}')
Tying this together, the complete example is listed below.
# SuperFastPython.com
# download document files concurrently and save the files locally serially
from os import makedirs
from os.path import basename
from os.path import join
from urllib.request import urlopen
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

# download a url and return the raw data, or None on error
def download_url(url):
    try:
        # open a connection to the server
        with urlopen(url, timeout=3) as connection:
            # read the contents of the html doc
            return (connection.read(), url)
    except:
        # bad url, socket timeout, http forbidden, etc.
        return (None, url)

# save data to a local file
def save_file(url, data, path):
    # get the name of the file from the url
    filename = basename(url)
    # construct a local path for saving the file
    outpath = join(path, filename)
    # save to file
    with open(outpath, 'wb') as file:
        file.write(data)
    return outpath

# download a list of URLs to local files
def download_docs(urls, path):
    # create the local directory, if needed
    makedirs(path, exist_ok=True)
    # create the thread pool
    n_threads = len(urls)
    with ThreadPoolExecutor(n_threads) as executor:
        # download each url and save as a local file
        futures = [executor.submit(download_url, url) for url in urls]
        # process each result as it is available
        for future in as_completed(futures):
            # get the downloaded url data
            data, url = future.result()
            # check for no data
            if data is None:
                print(f'>Error downloading {url}')
                continue
            # save the data to a local file
            outpath = save_file(url, data, path)
            # report progress
            print(f'>Saved {url} to {outpath}')

# python concurrency API docs
URLS = ['https://docs.python.org/3/library/concurrency.html',
        'https://docs.python.org/3/library/concurrent.html',
        'https://docs.python.org/3/library/concurrent.futures.html',
        'https://docs.python.org/3/library/threading.html',
        'https://docs.python.org/3/library/multiprocessing.html',
        'https://docs.python.org/3/library/multiprocessing.shared_memory.html',
        'https://docs.python.org/3/library/subprocess.html',
        'https://docs.python.org/3/library/queue.html',
        'https://docs.python.org/3/library/sched.html',
        'https://docs.python.org/3/library/contextvars.html']
# local path for saving the files
PATH = 'docs'
# download all docs
download_docs(URLS, PATH)
Running the program, the files are downloaded and saved as before, perhaps a few milliseconds faster.
Looking at the output of the program, we can see that the order of saved files is different.
Smaller files like “sched.html” that were dispatched nearly last were downloaded sooner (i.e. fewer bytes to download) and in turn were saved to local files sooner.
This confirms that we are indeed processing downloads in their order of task completion and not the order the tasks were submitted.
>Saved https://docs.python.org/3/library/concurrent.html to docs/concurrent.html
>Saved https://docs.python.org/3/library/sched.html to docs/sched.html
>Saved https://docs.python.org/3/library/concurrency.html to docs/concurrency.html
>Saved https://docs.python.org/3/library/contextvars.html to docs/contextvars.html
>Saved https://docs.python.org/3/library/queue.html to docs/queue.html
>Saved https://docs.python.org/3/library/multiprocessing.shared_memory.html to docs/multiprocessing.shared_memory.html
>Saved https://docs.python.org/3/library/threading.html to docs/threading.html
>Saved https://docs.python.org/3/library/concurrent.futures.html to docs/concurrent.futures.html
>Saved https://docs.python.org/3/library/subprocess.html to docs/subprocess.html
>Saved https://docs.python.org/3/library/multiprocessing.html to docs/multiprocessing.html
Now that we have seen some examples, let’s look at some common usage patterns when using the ThreadPoolExecutor.
ThreadPoolExecutor Usage Patterns
The ThreadPoolExecutor provides a lot of flexibility for executing concurrent tasks in Python.
Nevertheless, there are a handful of common usage patterns that will fit most program scenarios.
This section lists the common usage patterns with worked examples that you can copy-and-paste into your own project and adapt as needed.
The patterns we will look at are as follows:
- Map and Wait Pattern
- Submit and Use as Completed Pattern
- Submit and Use Sequentially Pattern
- Submit and Use Callback Pattern
- Submit and Wait for All Pattern
- Submit and Wait for First Pattern
We will use a contrived task in each example that will sleep for a random amount of time less than one second. You can easily replace this example task with your own task in each pattern.
Also, recall that each Python program has one thread by default called the main thread where we do our work. We will create the thread pool in the main thread in each example and may reference actions in the main thread in some of the patterns, as opposed to actions in threads in the thread pool.
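For example, a small sketch (not one of the patterns below) that reports which thread executes a task versus the main thread, using threading.current_thread():

# report the thread that runs a task versus the main thread
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor

# task that reports the name of its worker thread
def task():
    return f'task ran in: {current_thread().name}'

# run one task in the pool and report both thread names
with ThreadPoolExecutor(2) as executor:
    future = executor.submit(task)
    print(future.result())
print(f'main thread is: {current_thread().name}')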
Map and Wait Pattern
Perhaps the most common pattern when using the ThreadPoolExecutor is to convert a for loop that executes a function on each item in a collection to use threads.
It assumes that the function has no side effects, meaning it does not access any data outside of the function and does not change the data provided to it. It takes data and produces a result.
These types of for loops can be written explicitly in Python; for example:
...
# apply a function to each element in a collection
for item in mylist:
    result = task(item)
A better practice is to use the built-in map() function that applies the function to each item in the iterable for you.
...
# apply the function to each element in the collection
results = map(task, mylist)
This does not apply the task() function to each item until we iterate the results, so-called lazy evaluation:
...
# iterate the results from map
for result in results:
    print(result)
Therefore, it is common to see this operation consolidated to the following:
...
# iterate the results from map
for result in map(task, mylist):
    print(result)
We can perform this same operation using the thread pool, except each application of the function to an item in the list is a task that is executed asynchronously. For example:
...
# iterate the results from map
for result in executor.map(task, mylist):
    print(result)
Although the tasks are executed concurrently, the results are iterated in the order of the iterable provided to the map() function, the same as the built-in map() function.
In this way, we can think of the thread pool version of map() as a concurrent version of the built-in map() function; it is ideal if you are looking to update your for loop to use threads.
The example below demonstrates using the map and wait pattern with a task that will sleep a random amount of time less than one second and return the provided value.
# SuperFastPython.com
# example of the map and wait pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    return name

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # execute tasks concurrently and process results in order
    for result in executor.map(task, range(10)):
        # retrieve the result
        print(result)
Running the example, we can see that the results are reported in the order that the tasks were created and sent into the thread pool.
0
1
2
3
4
5
6
7
8
9
The map() function supports target functions that take more than one argument by providing more than one iterable as arguments to the call to map().
For example, we can define a target function for map that takes two arguments, then provide two iterables of the same length to the call to map.
The complete example is listed below.
# SuperFastPython.com
# example of calling map with two iterables
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(value1, value2):
    # sleep for less than a second
    sleep(random())
    return (value1, value2)

# start the thread pool
with ThreadPoolExecutor() as executor:
    # submit all tasks
    for result in executor.map(task, ['1', '2', '3'], ['a', 'b', 'c']):
        print(result)
Running the example executes the tasks as expected, providing two arguments to map and reporting a result that combines both arguments.
('1', 'a')
('2', 'b')
('3', 'c')
A call to the map function will issue all tasks to the thread pool immediately, even if you do not iterate the iterable of results.
This is unlike the built-in map() function that is lazy and does not compute each call until you ask for the result during iteration.
The example below confirms this by issuing all tasks with a map and not iterating the results.
# SuperFastPython.com
# example of calling map and not iterating the results
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(value):
    # sleep for less than a second
    sleep(random())
    print(f'Done: {value}')
    return value

# start the thread pool
with ThreadPoolExecutor() as executor:
    # submit all tasks
    executor.map(task, range(5))
print('All done!')
Running the example, we can see that the tasks are sent into the thread pool and executed without having to explicitly pass over the iterable of results that was returned.
The use of the context manager ensured that the thread pool did not shut down until all tasks were complete.
Done: 0
Done: 2
Done: 1
Done: 3
Done: 4
All done!
Submit and Use as Completed
Perhaps the second most common pattern when using the ThreadPoolExecutor is to submit tasks and use the results as they become available.
This can be achieved using the submit() function to push tasks into the thread pool, which returns Future objects, then calling the module function as_completed() on the list of Future objects; it will return each Future object as its task is completed.
The example below demonstrates this pattern, submitting the tasks in order from 0 to 9 and showing results in the order that they were completed.
# SuperFastPython.com
# example of the submit and use as completed pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    return name

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # submit tasks and collect futures
    futures = [executor.submit(task, i) for i in range(10)]
    # process task results as they are available
    for future in as_completed(futures):
        # retrieve the result
        print(future.result())
Running the example, we can see that the results are retrieved and printed in the order that the tasks completed, not the order that the tasks were submitted to the thread pool.
5
9
6
1
0
7
3
8
4
2
Submit and Use Sequentially
We may require the results from tasks in the order that the tasks were submitted.
This may be because the tasks have a natural ordering.
We can implement this pattern by calling submit() for each task to get a list of Future objects, then iterating over the Future objects in the order that the tasks were submitted and retrieving the results.
The main difference from the “as completed” pattern is that we enumerate the list of futures directly, instead of calling the as_completed() function.
...
# process task results in the order they were submitted
for future in futures:
    # retrieve the result
    print(future.result())
The example below demonstrates this pattern, submitting the tasks in order from 0 to 9 and showing the results in the order that they were submitted.
# SuperFastPython.com
# example of the submit and use sequentially pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    return name

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # submit tasks and collect futures
    futures = [executor.submit(task, i) for i in range(10)]
    # process task results in the order they were submitted
    for future in futures:
        # retrieve the result
        print(future.result())
Running the example, we can see that the results are retrieved and printed in the order that the tasks were submitted, not the order that the tasks were completed.
0
1
2
3
4
5
6
7
8
9
Submit and Use Callback
We may not want to explicitly process the results once they are available; instead, we want to call a function on the result.
Instead of doing this manually, such as in the as completed pattern above, we can have the thread pool call the function for us with the result automatically.
This can be achieved by setting a callback on each Future object by calling the add_done_callback() function and passing the name of the function.
The thread pool will then call the callback function as each task completes, passing in Future objects for the task.
The example below demonstrates this pattern, registering a custom callback function to be applied to each task as it is completed.
# SuperFastPython.com
# example of the submit and use a callback pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    return name

# custom callback function called on tasks when they complete
def custom_callback(fut):
    # retrieve the result
    print(fut.result())

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # submit tasks and collect futures
    futures = [executor.submit(task, i) for i in range(10)]
    # register the callback on all tasks
    for future in futures:
        future.add_done_callback(custom_callback)
    # wait for tasks to complete...
Running the example, we can see that results are retrieved and printed in the order that the tasks completed, not the order in which they were submitted.
8
0
7
1
4
6
5
3
2
9
We can register multiple callbacks on each Future object; it is not limited to a single callback.
The callback functions are called in the order in which they were registered on each Future object.
The following example demonstrates having two callbacks on each Future.
# SuperFastPython.com
# example of the submit and use multiple callbacks for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    return name

# custom callback function called on tasks when they complete
def custom_callback1(fut):
    # retrieve the result
    print(f'Callback 1: {fut.result()}')

# custom callback function called on tasks when they complete
def custom_callback2(fut):
    # retrieve the result
    print(f'Callback 2: {fut.result()}')

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # submit tasks and collect futures
    futures = [executor.submit(task, i) for i in range(10)]
    # register the callbacks on all tasks
    for future in futures:
        future.add_done_callback(custom_callback1)
        future.add_done_callback(custom_callback2)
    # wait for tasks to complete...
Running the example, we can see that results are reported in the order that tasks were completed and that the two callback functions are called for each task in the order that we registered them with each Future object.
Callback 1: 3
Callback 2: 3
Callback 1: 9
Callback 2: 9
Callback 1: 7
Callback 2: 7
Callback 1: 2
Callback 2: 2
Callback 1: 0
Callback 2: 0
Callback 1: 5
Callback 2: 5
Callback 1: 1
Callback 2: 1
Callback 1: 8
Callback 2: 8
Callback 1: 4
Callback 2: 4
Callback 1: 6
Callback 2: 6
Submit and Wait for All
It is common to submit all tasks and then wait for all tasks in the thread pool to complete.
This pattern may be useful when tasks do not return a result directly, such as when each task stores its result directly in a resource like a file.
There are two ways that we can wait for tasks to complete: by calling the wait() module function or by calling shutdown().
The most likely case is you want to explicitly wait for a set or subset of tasks in the thread pool to complete.
You can achieve this by passing the list of tasks to the wait() function, which, by default, will wait for all tasks to complete.
...
# wait for all tasks to complete
wait(futures)
We can explicitly specify to wait for all tasks by setting the “return_when” argument to the ALL_COMPLETED constant; for example:
...
# wait for all tasks to complete
wait(futures, return_when=ALL_COMPLETED)
The example below demonstrates this pattern. Note that we are intentionally ignoring the return from calling wait() as we have no need to inspect it in this case.
# SuperFastPython.com
# example of the submit and wait for all pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    # display the result
    print(name)

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # submit tasks and collect futures
    futures = [executor.submit(task, i) for i in range(10)]
    # wait for all tasks to complete
    wait(futures)
    print('All tasks are done!')
Running the example, we can see that results are handled by each task as the tasks complete. Importantly, we can see that the main thread waits until all tasks are completed before carrying on and printing a message.
3
9
0
8
4
6
2
1
5
7
All tasks are done!
An alternative approach would be to shut down the thread pool and wait for all executing and queued tasks to complete before moving on.
This might be preferred when we don’t have a list of Future objects or when we only intend to use the thread pool once for a set of tasks.
We can implement this pattern using the context manager; for example:
# SuperFastPython.com
# example of the submit and wait for all with shutdown pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    # display the result
    print(name)

# start the thread pool
with ThreadPoolExecutor(10) as executor:
    # submit tasks and collect futures
    futures = [executor.submit(task, i) for i in range(10)]
    # wait for all tasks to complete
print('All tasks are done!')
Running the example, we can see that the main thread does not move on and print the message until all tasks are completed, after the thread pool has been automatically shut down by the context manager.
1
2
8
4
5
3
9
0
7
6
All tasks are done!
The context manager automatic shutdown pattern might be confusing to developers not used to how thread pools work, hence the comment at the end of the context manager block in the previous example.
We can achieve the same effect without the context manager, via an explicit call to shutdown().
...
# wait for all tasks to complete and close the pool
executor.shutdown()
Recall that the shutdown() function will wait for all tasks to complete by default and will not cancel any queued tasks, but we can make this explicit by setting the “wait” argument to True and the “cancel_futures” argument to False; for example:
...
# wait for all tasks to complete and close the pool
executor.shutdown(wait=True, cancel_futures=False)
The example below demonstrates the pattern of waiting for all tasks in the thread pool to complete by calling shutdown() before moving on.
# SuperFastPython.com
# example of the submit and wait for all with shutdown pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    # display the result
    print(name)

# start the thread pool
executor = ThreadPoolExecutor(10)
# submit tasks and collect futures
futures = [executor.submit(task, i) for i in range(10)]
# wait for all tasks to complete
executor.shutdown()
print('All tasks are done!')
Running the example, we can see that all tasks report their result as they complete and that the main thread does not move on until all tasks have completed and the thread pool has been shut down.
3
5
2
6
8
9
7
1
4
0
All tasks are done!
Submit and Wait for First
It is common to issue many tasks and only be concerned with the first result returned.
That is, not the result of the first task, but a result from any task that happens to be the first to complete its execution.
This may be the case if you are trying to access the same resource from multiple locations, like a file or some data.
This pattern can be achieved using the wait() module function and setting the “return_when” argument to the FIRST_COMPLETED constant.
...
# wait until any task completes
done, not_done = wait(futures, return_when=FIRST_COMPLETED)
We must also manage the thread pool manually by constructing it and calling shutdown() manually so that we can continue on with the execution of the main thread without waiting for all of the other tasks to complete.
The example below demonstrates this pattern and will stop waiting as soon as the first task is completed.
# SuperFastPython.com
# example of the submit and wait for first pattern for the ThreadPoolExecutor
from time import sleep
from random import random
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait
from concurrent.futures import FIRST_COMPLETED

# custom task that will sleep for a variable amount of time
def task(name):
    # sleep for less than a second
    sleep(random())
    return name

# start the thread pool
executor = ThreadPoolExecutor(10)
# submit tasks and collect futures
futures = [executor.submit(task, i) for i in range(10)]
# wait until any task completes
done, not_done = wait(futures, return_when=FIRST_COMPLETED)
# get the result from the first task to complete
print(done.pop().result())
# shutdown without waiting
executor.shutdown(wait=False, cancel_futures=True)
Running the example will wait for any of the tasks to complete, then retrieve the result of the first completed task and shut down the thread pool.
Importantly, the tasks will continue to execute in the thread pool in the background and the main thread will not close until all tasks have completed.
9
Now that we have seen some common usage patterns for the ThreadPoolExecutor, let’s look at how we might customize the configuration of the thread pool.
How to Configure ThreadPoolExecutor
We can customize the configuration of the thread pool when constructing a ThreadPoolExecutor instance.
There are three aspects of the thread pool we may wish to customize for our application; they are the number of worker threads, the names of threads in the pool, and the initialization of each thread in the pool.
Let’s take a closer look at each in turn.
Configure the Number of Threads
The number of threads in the thread pool can be configured by the “max_workers” argument.
It takes a positive integer and defaults to the number of CPUs in your system plus four.
- Total Number Worker Threads = (CPUs in Your System) + 4
For example, if you had 2 physical CPUs in your system and each CPU has hyperthreading (common in modern CPUs) then you would have 2 physical and 4 logical CPUs. Python would see 4 CPUs. The default number of worker threads on your system would then be (4 + 4) or 8.
If this number comes out to be more than 32 (e.g. 16 physical cores, 32 logical cores, plus four), the default will clip the upper bound to 32 threads.
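To make the default concrete, the snippet below computes the documented formula directly using os.cpu_count(); this is a minimal sketch of the rule itself, rather than a query of the pool.

# sketch: compute the default number of worker threads per the documented formula
import os

# number of logical CPUs reported by the operating system (may be None on rare systems)
logical_cpus = os.cpu_count() or 1
# default is logical CPUs plus four, clipped to an upper bound of 32
default_workers = min(32, logical_cpus + 4)
print(default_workers)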
It is common to have more threads than CPUs (physical or logical) in your system.
The reason is that threads are used for IO-bound tasks, not CPU-bound tasks. That is, threads are used for tasks that wait for relatively slow resources to respond, like hard drives, DVD drives, printers, network connections, and much more. We will discuss the best application of threads in a later section.
Therefore, it is not uncommon to have tens, hundreds, and even thousands of threads in your application, depending on your specific needs. It is unusual to have more than one or a few thousand threads. If you require this many threads, then alternative solutions may be preferred, such as AsyncIO. We will discuss Threads vs. AsyncIO in a later section.
First, let’s check how many threads are created for thread pools on your system.
Looking at the source code for the ThreadPoolExecutor, we can see that the number of worker threads chosen by default is stored in the _max_workers property, which we can access and report after a thread pool is created.
Note: “_max_workers” is a protected member and may change in the future.
The example below reports the number of default threads in a thread pool on your system.
# SuperFastPython.com
# report the default number of worker threads on your system
from concurrent.futures import ThreadPoolExecutor

# create a thread pool with the default number of worker threads
pool = ThreadPoolExecutor()
# report the number of worker threads chosen by default
print(pool._max_workers)
Running the example reports the number of worker threads used by default on your system.
I have four physical CPU cores and eight logical cores; therefore the default is 8 + 4 or 12 threads.
12
How many worker threads are allocated by default on your system?
Let me know in the comments below.
We can specify the number of worker threads directly, and this is a good idea in most applications.
The example below demonstrates how to configure 500 worker threads.
# SuperFastPython.com
# configure and report the number of worker threads
from concurrent.futures import ThreadPoolExecutor

# create a thread pool with a large number of worker threads
pool = ThreadPoolExecutor(500)
# report the number of worker threads
print(pool._max_workers)
Running the example configures the thread pool to use 500 threads and confirms that it will create 500 threads.
500
How Many Threads Should You Use?
If you have hundreds of tasks, you should probably set the number of threads to be equal to the number of tasks.
If you have thousands of tasks, you should probably cap the number of threads at hundreds or 1,000.
If your application is intended to be executed multiple times in the future, you can test different numbers of threads and compare overall execution time, then choose a number of threads that gives approximately the best performance. You may want to mock the task in these tests with a random sleep operation.
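As a rough sketch of such a test, the example below times the same batch of mock IO-bound tasks (simulated with a sleep) against a few candidate pool sizes; the task duration and candidate sizes are arbitrary placeholders that you would replace with values drawn from your application.

# sketch: compare overall execution time for different numbers of worker threads
from time import sleep, perf_counter
from concurrent.futures import ThreadPoolExecutor

# mock IO-bound task that blocks for a tenth of a second
def task(identifier):
    sleep(0.1)

# try a range of candidate pool sizes against 100 tasks
for n_threads in [10, 50, 100]:
    start = perf_counter()
    with ThreadPoolExecutor(n_threads) as executor:
        # issue all tasks; the context manager waits for them to complete
        executor.map(task, range(100))
    duration = perf_counter() - start
    print(f'{n_threads} threads: {duration:.2f} seconds')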
Configure Thread Names
Each thread in Python has a name.
The main thread has the name “MainThread“. You can access the main thread via a call to the main_thread() function in the threading module and then access the name member. For example:
# access the name of the main thread
from threading import main_thread

# access the main thread
thread = main_thread()
# report the thread name
print(thread.name)
Running the example accesses the main thread and reports its name.
MainThread
Names are unique by default.
This can be helpful when debugging a program with multiple threads. Log messages can report the thread that is performing a specific step, or a debugger can be used to trace a thread with a specific name.
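For example, the standard logging module can include the name of the calling thread in every message via the %(threadName)s format key; a minimal sketch:

# sketch: include the thread name in log messages via the logging module
import logging
from threading import Thread

# configure logging to report the name of the calling thread
logging.basicConfig(level=logging.INFO, format='%(threadName)s: %(message)s')

# simple task that logs a message
def task():
    logging.info('Task is running')

# run the task in a named thread
thread = Thread(target=task, name='Worker-1')
thread.start()
thread.join()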
When creating threads in the thread pool, each thread has the name “ThreadPoolExecutor-%d_%d” where the first %d indicates the thread pool number and the second %d indicates the thread number, both in the order that thread pools and threads are created.
We can see this if we access the threads directly inside the pool after allocating some work so that all threads are created.
We can enumerate all threads in a Python program (process) via the enumerate() function in the threading module, then report the name for each.
The example below creates a thread pool with the default number of threads, allocates work to the pool to ensure the threads are created, then reports the names of all threads in the program.
# SuperFastPython.com
# report the default name of threads in the thread pool
import threading
from concurrent.futures import ThreadPoolExecutor

# a mock task that does nothing
def task(name):
    pass

# create a thread pool
executor = ThreadPoolExecutor()
# execute tasks
executor.map(task, range(10))
# report all thread names
for thread in threading.enumerate():
    print(thread.name)
# shutdown the thread pool
executor.shutdown()
Running the example reports the names of all threads in the system, showing first the name of the main thread and then the names of the four threads in the pool.
In this case, only four threads were created because the tasks were executed so quickly. Recall that worker threads are reused after they finish executing their tasks. This ability to reuse workers is a major benefit of using thread pools.
MainThread
ThreadPoolExecutor-0_0
ThreadPoolExecutor-0_1
ThreadPoolExecutor-0_2
ThreadPoolExecutor-0_3
The “ThreadPoolExecutor-%d” is a prefix for all threads in the thread pool and we can customize it with a name that may be meaningful in the application for the types of tasks executed by the pool.
The thread name prefix can be set via the “thread_name_prefix” argument when constructing the thread pool.
The example below sets the prefix to be “TaskPool“, which is prepended to the name of each thread created in the pool.
# SuperFastPython.com
# set a custom thread name prefix for all threads in the pool
import threading
from concurrent.futures import ThreadPoolExecutor

# a mock task that does nothing
def task(name):
    pass

# create a thread pool with a custom name prefix
executor = ThreadPoolExecutor(thread_name_prefix='TaskPool')
# execute tasks
executor.map(task, range(10))
# report all thread names
for thread in threading.enumerate():
    print(thread.name)
# shutdown the thread pool
executor.shutdown()
Running the example reports the name of the main thread as before, but in this case, the names of threads in the thread pool with the custom thread name prefix.
MainThread
TaskPool_0
TaskPool_1
TaskPool_2
TaskPool_3
Configure the Initializer
Worker threads can call a function before they start processing tasks.
This is called an initializer function and can be specified via the “initializer” argument when creating a thread pool. If the initializer function takes arguments, they can be passed in via the “initargs” argument to the thread pool, which is a tuple of arguments to pass to the initializer function.
By default, there is no initializer function.
We might choose to set an initializer function for worker threads if we would like each thread to set up resources specific to the thread.
Examples might include a thread-specific log file or a thread-specific connection to a remote resource like a server or database. The resource would then be available to all tasks executed by the thread, rather than being created and discarded or opened and closed for each task.
These thread-specific resources can then be stored somewhere the worker thread can reference them, such as a global variable or a thread-local variable. Care must be taken to correctly close these resources once you are finished with the thread pool.
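As a minimal sketch of this idea, the example below uses a threading.local() object to store a per-thread resource created in the initializer; the "resource" here is just a string standing in for something like a connection or a log file handle.

# sketch: store a per-thread resource created in the initializer in a thread-local variable
from threading import local, current_thread
from concurrent.futures import ThreadPoolExecutor

# thread-local storage shared via a global variable
thread_local = local()

# initializer that creates a resource specific to each worker thread
def initializer_worker():
    # a stand-in for a real resource, e.g. a connection or log file
    thread_local.resource = f'resource-for-{current_thread().name}'

# task that uses the resource belonging to the thread that executes it
def task(identifier):
    return f'task {identifier} used {thread_local.resource}'

# each worker runs the initializer once before processing tasks
with ThreadPoolExecutor(max_workers=2, initializer=initializer_worker) as executor:
    for result in executor.map(task, range(4)):
        print(result)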
The example below will create a thread pool with two threads and use a custom initialization function. In this case, the function does nothing other than print the worker thread name. We then complete ten tasks with the thread pool.
# SuperFastPython.com
# example of a custom worker thread initialization function
from time import sleep
from random import random
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor

# function for initializing the worker thread
def initializer_worker():
    # get the unique name for this thread
    name = current_thread().name
    # report the name of this worker thread
    print(f'Initializing worker thread {name}')

# a mock task that sleeps for a random amount of time less than one second
def task(identifier):
    sleep(random())
    # return the unique identifier
    return identifier

# create a thread pool
with ThreadPoolExecutor(max_workers=2, initializer=initializer_worker) as executor:
    # execute tasks
    for result in executor.map(task, range(10)):
        print(result)
Running the example, we can see that the two threads are initialized before running any tasks, then all ten tasks are completed successfully.
Initializing worker thread ThreadPoolExecutor-0_0
Initializing worker thread ThreadPoolExecutor-0_1
0
1
2
3
4
5
6
7
8
9
Now that we are familiar with how to configure the thread pools, let’s learn more about how to check and manipulate tasks via Future objects.
How to Use Future Objects in Detail
Future objects are created when we call submit() to send tasks into the ThreadPoolExecutor to be executed asynchronously.
Future objects provide the capability to check the status of a task (e.g. is it running?) and to control the execution of the task (e.g. cancel).
In this section, we will look at some examples of checking and manipulating Future objects created by our thread pool.
Specifically, we will look at the following:
- How to Check the Status of Futures
- How to Get Results From Futures
- How to Cancel Futures
- How to Add a Callback to Futures
- How to Get Exceptions From Futures
First, let’s take a closer look at the lifecycle of a future object.
Life-Cycle of a Future Object
A Future object is created when we call submit() for a task on a ThreadPoolExecutor.
While the task is executing, the Future object has the status “running“.
When the task completes, it has the status “done” and if the target function returns a value, it can be retrieved.
Before a task is running, it sits in a queue of tasks, waiting for a worker thread to take it and start running it. In this "pre-running" state, the task can be cancelled, after which it has the "cancelled" state. A task in the "running" state cannot be cancelled.
A “cancelled” task is always also in the “done” state.
While a task is running, it can raise an uncaught exception, causing the execution of the task to stop. The exception will be stored and can be retrieved directly or will be re-raised if the result is attempted to be retrieved.
The figure below summarizes the lifecycle of a Future object.
Now that we are familiar with the lifecycle of a Future object, let's look at how we might check and manipulate it.
How to Check the Status of Futures
There are two normal states of a Future object that we might want to check: running and done.
Each has its own function that returns True if the Future object is in that state and False otherwise; for example:
- running(): Returns True if the task is currently running.
- done(): Returns True if the task has completed or was cancelled.
We can develop simple examples to demonstrate how to check the status of a Future object.
In this example, we can start a task and then check that it’s running and not done, wait for it to complete, then check that it is done and not running.
# SuperFastPython.com
# check the status of a Future object for a task executed by a thread pool
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# mock task that will sleep for a moment
def work():
    sleep(0.5)

# create a thread pool
with ThreadPoolExecutor() as executor:
    # start one task
    future = executor.submit(work)
    # confirm that the task is running
    running = future.running()
    done = future.done()
    print(f'Future running={running}, done={done}')
    # wait for the task to complete
    wait([future])
    # confirm that the task is done
    running = future.running()
    done = future.done()
    print(f'Future running={running}, done={done}')
Running the example, we can see that immediately after the task is submitted, it is marked as running, and that after the task is completed, we can confirm that it is done.
Future running=True, done=False
Future running=False, done=True
How to Get Results From Futures
When a task is completed, we can retrieve the result from the task by calling the result() function on the Future.
This returns the value returned by our task function, or None if the function did not return a value.
The function will block until the task completes and a result can be retrieved. If the task has already been completed, it will return a result immediately.
The example below demonstrates how to retrieve a result from a Future object.
# SuperFastPython.com
# get the result from a completed future task
from time import sleep
from concurrent.futures import ThreadPoolExecutor

# mock task that will sleep for a moment
def work():
    sleep(1)
    return "all done"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # start one task
    future = executor.submit(work)
    # get the result from the task, wait for the task to complete
    result = future.result()
    print(f'Got Result: {result}')
Running the example submits the task then attempts to retrieve the result, blocking until the result is available, then reports the result that was received.
Got Result: all done
We can also set a timeout for how long we wish to wait for a result in seconds.
If the timeout elapses before we get a result, a TimeoutError is raised.
The example below demonstrates the timeout, showing how to give up waiting before the task has completed.
# SuperFastPython.com
# set a timeout when getting results from a future
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError

# mock task that will sleep for a moment
def work():
    sleep(1)
    return "all done"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # start one task
    future = executor.submit(work)
    # get the result from the task, waiting at most half a second
    try:
        result = future.result(timeout=0.5)
        print(f'Got Result: {result}')
    except TimeoutError:
        print('Gave up waiting for a result')
Running the example shows that we gave up waiting for a result after half a second.
Gave up waiting for a result
How to Cancel Futures
We can also cancel a task that has not yet started running.
Recall that when we put tasks into the pool with submit() or map() that the tasks are added to an internal queue of work from which worker threads can remove the tasks and execute them.
While a task is in the queue and before it has been started, we can cancel it by calling cancel() on the Future object associated with the task. The cancel() function will return True if the task was cancelled, False otherwise.
Let’s demonstrate this with a worked example.
We can create a thread pool with one thread, then start a long running task, then submit a second task, request that it is cancelled, then confirm that it was indeed cancelled.
# SuperFastPython.com
# example of cancelling a task via its future
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# mock task that will sleep for a moment
def work(sleep_time):
    sleep(sleep_time)

# create a thread pool
with ThreadPoolExecutor(1) as executor:
    # start a long running task
    future1 = executor.submit(work, 2)
    running = future1.running()
    print(f'First task running={running}')
    # start a second task
    future2 = executor.submit(work, 0.1)
    running = future2.running()
    print(f'Second task running={running}')
    # cancel the second task
    was_cancelled = future2.cancel()
    print(f'Second task was cancelled: {was_cancelled}')
    # wait for the second task to finish, just in case
    wait([future2])
    # confirm it was cancelled
    running = future2.running()
    cancelled = future2.cancelled()
    done = future2.done()
    print(f'Second task running={running}, cancelled={cancelled}, done={done}')
    # wait for the long running task to finish
    wait([future1])
Running the example, we can see that the first task is started and is running normally.
The second task is scheduled and is not yet running because the thread pool is occupied with the first task. We then cancel the second task and confirm that it is indeed not running; it was cancelled and is done.
First task running=True
Second task running=False
Second task was cancelled: True
Second task running=False, cancelled=True, done=True
Cancel a Running Future
Now, let’s try to cancel a task that has already completed running.
# SuperFastPython.com
# example of trying to cancel a running task via its future
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# mock task that will sleep for a moment
def work(sleep_time):
    sleep(sleep_time)

# create a thread pool
with ThreadPoolExecutor(1) as executor:
    # start a long running task
    future = executor.submit(work, 2)
    running = future.running()
    print(f'Task running={running}')
    # try to cancel the task
    was_cancelled = future.cancel()
    print(f'Task was cancelled: {was_cancelled}')
    # wait for the task to finish
    wait([future])
    # check if it was cancelled
    running = future.running()
    cancelled = future.cancelled()
    done = future.done()
    print(f'Task running={running}, cancelled={cancelled}, done={done}')
Running the example, we can see that the task was started as per normal.
We then tried to cancel the task, but this was not successful, as we expected, because the task was already running.
We then wait for the task to complete and check its status. We can see that the task is no longer running, was not cancelled, as we expect, and is marked as done.
The attempt to cancel a running task simply has no effect; the task runs to completion regardless.
Task running=True
Task was cancelled: False
Task running=False, cancelled=False, done=True
Cancel a Done Future
Consider what would happen if we tried to cancel a task that was already done.
We might expect that cancelling a task that is already done has no effect, and this happens to be the case.
This can be demonstrated with a short example.
We start and run a task as per normal, then wait for it to complete and report its status. We then attempt to cancel the task.
# SuperFastPython.com
# example of trying to cancel a done task via its future
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# mock task that will sleep for a moment
def work(sleep_time):
    sleep(sleep_time)

# create a thread pool
with ThreadPoolExecutor(1) as executor:
    # start a long running task
    future = executor.submit(work, 2)
    running = future.running()
    # wait for the task to finish
    wait([future])
    # check the status
    running = future.running()
    cancelled = future.cancelled()
    done = future.done()
    print(f'Task running={running}, cancelled={cancelled}, done={done}')
    # try to cancel the task
    was_cancelled = future.cancel()
    print(f'Task was cancelled: {was_cancelled}')
    # check if it was cancelled
    running = future.running()
    cancelled = future.cancelled()
    done = future.done()
    print(f'Task running={running}, cancelled={cancelled}, done={done}')
Running the example confirms that the task runs and is marked done, as per normal.
The attempt to cancel the task fails and checking the status after the attempt to cancel confirms that the task was not impacted by the attempt.
Task running=False, cancelled=False, done=True
Task was cancelled: False
Task running=False, cancelled=False, done=True
How to Add a Callback to Futures
We have already seen above how to add a callback to a Future. Nevertheless, let’s look at some more examples for completeness including some edge cases.
We can register one or more callback functions on a Future object by calling the add_done_callback() function and specifying the name of the function to call.
The callback functions will be called with the Future object as an argument immediately after the completion of the task. If more than one callback function is registered, they will be called in the order they were registered, and any exceptions raised within each callback function will be caught, logged, and ignored.
The callback will be called by the worker thread that executed the task.
The example below demonstrates how to add a callback function to a Future object.
# SuperFastPython.com
# add a callback option to a future object
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# callback function to call when a task is completed
def custom_callback(future):
    print('Custom callback was called')

# mock task that will sleep for a moment
def work():
    sleep(1)
    print('Task is done')

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute the task
    future = executor.submit(work)
    # add the custom callback
    future.add_done_callback(custom_callback)
    # wait for the task to complete
    wait([future])
Running the example, we can see that the task is completed first, then the callback is executed as we expected.
Task is done
Custom callback was called
Common Error When Using Future Callbacks
A common error is to forget to add the Future object as an argument to the custom callback.
For example:
# callback function to call when a task is completed
def custom_callback():
    print('Custom callback was called')
If you register this function and try to run the code, you will get a TypeError as follows:
Task is done
exception calling callback for <Future at 0x104482b20 state=finished returned NoneType>
...
TypeError: custom_callback() takes 0 positional arguments but 1 was given
The message in the TypeError makes it clear how to fix the issue: add a single argument to the function for the future object, even if you don’t intend on using it in your callback.
Callbacks Execute When Cancelling a Future
We can also see the effect of callbacks on Future objects for tasks that are cancelled.
The effect does not appear to be documented in the API, but we might expect the callback to always be executed, whether the task runs normally or is cancelled. And this happens to be the case.
The example below demonstrates this.
First, a thread pool is created with a single thread. A long running task is issued that occupies the entire pool, then we send in a second task, add a callback to the second task, cancel it, and wait for all tasks to finish.
# SuperFastPython.com
# example of a callback for a cancelled task via the future object
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# callback function to call when a task is completed
def custom_callback(future):
    print('Custom callback was called')

# mock task that will sleep for a moment
def work(sleep_time):
    sleep(sleep_time)

# create a thread pool
with ThreadPoolExecutor(1) as executor:
    # start a long running task
    future1 = executor.submit(work, 2)
    running = future1.running()
    print(f'First task running={running}')
    # start a second task
    future2 = executor.submit(work, 0.1)
    running = future2.running()
    print(f'Second task running={running}')
    # add the custom callback
    future2.add_done_callback(custom_callback)
    # cancel the second task
    was_cancelled = future2.cancel()
    print(f'Second task was cancelled: {was_cancelled}')
    # explicitly wait for all tasks to complete
    wait([future1, future2])
Running the example, we can see that the first task is started as we expect.
The second task is scheduled but does not get a chance to run before we cancel it. The callback is run immediately after we cancel the task, then we report in the main thread that indeed the task was cancelled correctly.
First task running=True
Second task running=False
Custom callback was called
Second task was cancelled: True
How to Get Exceptions From Futures
A task may raise an exception during execution.
If we can anticipate the exception, we can wrap parts of our task function in a try-except block and handle the exception within the task.
If an unexpected exception occurs within our task, the task will stop executing.
We cannot know based on the task status whether an exception was raised, but we can check for an exception directly.
We can then access the exception via the exception() function. Alternatively, the exception will be re-raised if we call the result() function to try to get the result.
We can demonstrate this with an example.
The example below will raise an Exception within the task; it will not be caught within the task but instead will be caught by the thread pool for us to access later.
# SuperFastPython.com
# example of handling an exception raised within a task
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# mock task that will sleep for a moment
def work():
    sleep(1)
    raise Exception('This is Fake!')
    return "never gets here"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute our task
    future = executor.submit(work)
    # wait for the task to complete
    wait([future])
    # check the status of the task after it has completed
    running = future.running()
    cancelled = future.cancelled()
    done = future.done()
    print(f'Task running={running}, cancelled={cancelled}, done={done}')
    # get the exception
    exception = future.exception()
    print(f'Exception={exception}')
    # get the result from the task
    try:
        result = future.result()
    except Exception:
        print('Unable to get the result')
Running the example starts the task normally, which sleeps for one second.
The task then raises an exception that is caught by the thread pool. The thread pool stores the exception and the task is completed.
We can see that after the task is completed, it is marked as not running, not cancelled, and done.
We then access the exception from the task, which matches the exception we intentionally raise.
Attempting to access the result via the result() function fails and we catch the same exception raised in the task.
Task running=False, cancelled=False, done=True
Exception=This is Fake!
Unable to get the result
Callbacks Are Still Called if a Task Raises an Exception
We might wonder if we register a callback function with a Future whether it will still execute if the task raises an exception.
As we might expect, the callback is executed even if the task raises an exception.
We can test this by updating the previous example to register a callback function before the task fails with an exception.
# SuperFastPython.com
# example of handling an exception raised within a task that has a callback
from time import sleep
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait

# callback function to call when a task is completed
def custom_callback(future):
    print('Custom callback was called')

# mock task that will sleep for a moment
def work():
    sleep(1)
    raise Exception('This is Fake!')
    return "never gets here"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute our task
    future = executor.submit(work)
    # add the custom callback
    future.add_done_callback(custom_callback)
    # wait for the task to complete
    wait([future])
    # check the status of the task after it has completed
    running = future.running()
    cancelled = future.cancelled()
    done = future.done()
    print(f'Task running={running}, cancelled={cancelled}, done={done}')
    # get the exception
    exception = future.exception()
    print(f'Exception={exception}')
    # get the result from the task
    try:
        result = future.result()
    except Exception:
        print('Unable to get the result')
Running the example starts the task as before, but this time registers a callback function.
When the task fails with an exception, the callback is called immediately. The main thread then reports the status of the failed task and the details of the exception.
Custom callback was called
Task running=False, cancelled=False, done=True
Exception=This is Fake!
Unable to get the result
When to Use the ThreadPoolExecutor
The ThreadPoolExecutor is powerful and flexible, although it is not suited for all situations where you need to run a background task.
In this section, we will look at some general cases where it is a good fit, and where it isn’t, then we’ll look at broad classes of tasks and why they are or are not appropriate for the ThreadPoolExecutor.
Use ThreadPoolExecutor When…
- Your tasks can be defined by a pure function that has no state or side effects.
- Your task can fit within a single Python function, likely making it simple and easy to understand.
- You need to perform the same task many times, e.g. homogeneous tasks.
- You need to apply the same function to each object in a collection in a for-loop.
Thread pools work best when applying the same pure function on a set of different data (e.g. homogeneous tasks, heterogeneous data). This makes code easier to read and debug. This is not a rule, just a gentle suggestion.
Use Multiple ThreadPoolExecutor When…
- You need to perform groups of different types of tasks; one thread pool could be used for each task type.
- You need to perform a pipeline of tasks or operations; one thread pool can be used for each step.
Thread pools can operate on tasks of different types (e.g. heterogeneous tasks), although it may make the organization of your program and debugging easier if a separate thread pool is responsible for each task type. This is not a rule, just a gentle suggestion.
Don’t Use ThreadPoolExecutor When…
- You have a single task; consider using the Thread class with the target argument.
- You have long-running tasks, such as monitoring or scheduling; consider extending the Thread class.
- Your task functions require state; consider extending the Thread class.
- Your tasks require coordination; consider using a Thread and patterns like a Barrier or Semaphore.
- Your tasks require synchronization; consider using a Thread and Locks.
- You require a thread trigger on an event; consider using the Thread class.
The sweet spot for thread pools is in dispatching many similar tasks, the results of which may be used later in the program. Tasks that don’t fit neatly into this summary are probably not a good fit for thread pools. This is not a rule, just a gentle suggestion.
Do you know any other good or bad cases for using a ThreadPoolExecutor?
Let me know in the comments below.
Use Threads for IO-Bound Tasks
You should use threads for IO-bound tasks.
An IO-bound task is a type of task that involves reading from or writing to a device, file, or socket connection.
The operations involve input and output (IO), and the speed of these operations is bound by the device, hard drive, or network connection. This is why these tasks are referred to as IO-bound.
CPUs are really fast. A modern CPU running at 4GHz can execute around 4 billion instructions per second, and you likely have more than one CPU in your system.
Doing IO is very slow compared to the speed of CPUs.
Interacting with devices, reading and writing files and socket connections involves calling instructions in your operating system (the kernel), which will wait for the operation to complete. If this operation is the main focus for your CPU, such as executing in the main thread of your Python program, then your CPU is going to wait many milliseconds or even many seconds doing nothing.
That is potentially billions of operations prevented from executing.
We can free up the CPU from IO-bound operations by performing them on another thread of execution. The thread starts the operation, hands the waiting off to the operating system (the kernel), and frees the CPU to execute instructions in another application thread.
There’s more to it under the covers, but this is the gist.
Therefore, the tasks we execute with a ThreadPoolExecutor should be tasks that involve IO operations.
Examples include:
- Reading or writing a file from the hard drive.
- Reading or writing to standard output, input or error (stdin, stdout, stderr).
- Printing a document.
- Downloading or uploading a file.
- Querying a server.
- Querying a database.
- Taking a photo or recording a video.
- And so much more.
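As a concrete illustration of an IO-bound workload, the sketch below downloads several pages concurrently using urllib.request; the URLs are placeholders, and a real program would add error handling for failed requests.

# sketch: an IO-bound workload, downloading several pages concurrently
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# download one URL and report how many bytes were received
def download(url):
    with urlopen(url, timeout=10) as connection:
        data = connection.read()
    return f'{url}: {len(data)} bytes'

# placeholder addresses; substitute URLs relevant to your application
urls = ['https://example.com/', 'https://example.org/', 'https://example.net/']
# while one thread waits on the network, the others make progress
with ThreadPoolExecutor(len(urls)) as executor:
    for result in executor.map(download, urls):
        print(result)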
If your task is not IO-bound, perhaps threads and using a thread pool is not appropriate.
Don’t Use Threads for CPU-Bound Tasks
You should probably not use threads for CPU-bound tasks.
A CPU-bound task is a type of task that involves performing a computation and does not involve IO.
The operations only involve data in main memory (RAM) or cache (CPU cache) and performing computations on or with that data. As such, the limit on these operations is the speed of the CPU. This is why we call them CPU-bound tasks.
Examples include:
- Calculating points in a fractal.
- Estimating Pi.
- Factoring primes.
- Parsing HTML, JSON, etc. documents.
- Processing text.
- Running simulations.
CPUs are very fast, and we often have more than one CPU. We would like to perform our tasks and make full use of multiple CPU cores in modern hardware.
Using threads and thread pools via the ThreadPoolExecutor class in Python is probably not a path toward achieving this end.
This is because of a technical decision in the way that the Python interpreter was implemented. The implementation prevents two threads from executing Python bytecode at the same time inside the interpreter, enforced with a master lock that only one thread can hold at a time. This is called the Global Interpreter Lock, or GIL.
The GIL is not evil, nor is it frustrating; it is a design decision in the Python interpreter that we must be aware of and consider in the design of our applications.
I said that you “probably” should not use threads for CPU-bound tasks.
You can and are free to do so, but your code will not benefit from concurrency because of the GIL. It will likely perform worse because of the additional overhead of context switching (the CPU jumping from one thread of execution to another) introduced by using threads.
Additionally, the GIL is a design decision that affects the reference implementation of Python, which you download from Python.org. If you use a different implementation of the Python interpreter (such as PyPy, IronPython, Jython, and perhaps others), then you may not be subject to the GIL and can use threads for CPU bound tasks directly.
Python provides the multiprocessing module for multi-core task execution, as well as a sibling of the ThreadPoolExecutor that uses processes, called the ProcessPoolExecutor, which can be used for the concurrent execution of CPU-bound tasks.
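For completeness, here is a minimal sketch of the same executor API backed by processes; each worker process has its own interpreter and its own GIL, so CPU-bound tasks can run in parallel. The task and pool size are arbitrary choices for illustration.

# sketch: the same executor API with processes instead of threads
from concurrent.futures import ProcessPoolExecutor

# CPU-bound task, a naive sum with no IO
def task(n):
    return sum(range(n))

# process pools must be created under the main-module guard
if __name__ == '__main__':
    with ProcessPoolExecutor(4) as executor:
        # each task runs in its own process, in parallel on multiple cores
        for result in executor.map(task, [10_000_000] * 4):
            print(result)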
ThreadPoolExecutor Exception Handling
Exception handling is an important consideration when using threads.
Code will raise an exception when something unexpected happens and the exception should be dealt with by your application explicitly, even if it means logging it and moving on.
Python threads are well suited for use with IO-bound tasks, and operations within these tasks often raise exceptions, such as if a server cannot be reached, if the network goes down, if a file cannot be found, and so on.
There are three points you may need to consider exception handling when using the ThreadPoolExecutor; they are:
- Exception Handling During Thread Initialization
- Exception Handling During Task Execution
- Exception Handling During Task Completion Callbacks
Let’s take a closer look at each point in turn.
Exception Handling During Thread Initialization
You can specify a custom initialization function when configuring your ThreadPoolExecutor.
This can be set via the “initializer” argument to specify the function name and “initargs” to specify a tuple of arguments to the function.
Each thread started by the thread pool will call your initialization function before it starts executing tasks.
If your initialization function raises an exception, it will break your thread pool.
All current tasks and any future tasks executed by the thread pool will not run and will raise a BrokenThreadPool exception.
We can demonstrate this with an example of a contrived initializer function that raises an exception.
# SuperFastPython.com
# example of an exception in a thread pool initializer function
from time import sleep
from random import random
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor

# function for initializing the worker thread
def initializer_worker():
    # raise an exception
    raise Exception('Something bad happened!')

# a mock task that sleeps for a random amount of time less than one second
def task(identifier):
    sleep(random())
    # return the unique identifier
    return identifier

# create a thread pool
with ThreadPoolExecutor(max_workers=2, initializer=initializer_worker) as executor:
    # execute tasks
    for result in executor.map(task, range(10)):
        print(result)
Running the example fails with an exception, as we expected.
The thread pool is created as per normal, but as soon as we try to execute tasks, new worker threads are created, the custom worker thread initialization function is called, and raises an exception.
Multiple threads attempted to start, and in turn, multiple threads failed with an Exception. Finally, the thread pool itself logged a message that the pool is broken and cannot be used any longer.
Exception in initializer:
Traceback (most recent call last):
...
raise Exception('Something bad happened!')
Exception: Something bad happened!
Exception in initializer:
Traceback (most recent call last):
...
raise Exception('Something bad happened!')
Exception: Something bad happened!
Traceback (most recent call last):
...
concurrent.futures.thread.BrokenThreadPool: A thread initializer failed, the thread pool is not usable anymore
This highlights that if you use a custom initializer function, you must carefully consider the exceptions that may be raised and perhaps handle them; otherwise, you risk breaking all tasks that depend on the thread pool.
Exception Handling During Task Execution
An exception may occur while executing your task.
This will cause the task to stop executing, but will not break the thread pool. Instead, the exception will be caught by the thread pool and will be available via the Future object associated with the task via the exception() function.
Alternatively, the exception will be re-raised if you call result() on the Future in order to get the result. This will impact calls to both submit() and map() when adding tasks to the thread pool.
It means that you have two options for handling exceptions in tasks; they are:
- 1. Handle exceptions within the task function.
- 2. Handle exceptions when getting results from tasks.
Handle Exception Within the Task
Handling the exception within the task means that you need some mechanism to let the recipient of the result know that something unexpected happened.
This could be via the return value from the function, e.g. None.
Alternatively, you can re-raise an exception and have the recipient handle it directly. A third option might be to use some broader state or global state, perhaps passed by reference into the call to the function.
The example below defines a work task that will raise an exception, but will catch the exception and return a result indicating a failure case.
# SuperFastPython.com
# example of handling an exception raised within a task
from time import sleep
from concurrent.futures import ThreadPoolExecutor

# mock task that will sleep for a moment
def work():
    sleep(1)
    try:
        raise Exception('Something bad happened!')
    except Exception:
        return 'Unable to get the result'
    return "never gets here"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute our task
    future = executor.submit(work)
    # get the result from the task
    result = future.result()
    print(result)
Running the example starts the thread pool as per normal, issues the task, then blocks waiting for the result.
The task raises an exception and the result received is an error message.
This approach is reasonably clean for the recipient code and would be appropriate for tasks issued by both submit() and map(). It may require special handling of a custom return value for the failure case.
Unable to get the result
Handle Exception by the Recipient of the Task Result
An alternative to handling the exception in the task is to leave the responsibility to the recipient of the result.
This may feel like a more natural solution, as it matches the synchronous version of the same operation, e.g. if we were performing the function call in a for-loop.
It means that the recipient must be aware of the types of errors that the task may raise and handle them explicitly.
The example below defines a simple task that raises an Exception, which is then handled by the recipient when attempting to get the result from the function call.
# SuperFastPython.com
# example of handling an exception raised within a task
from time import sleep
from concurrent.futures import ThreadPoolExecutor

# mock task that will sleep for a moment
def work():
    sleep(1)
    raise Exception('Something bad happened!')
    return "never gets here"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute our task
    future = executor.submit(work)
    # get the result from the task
    try:
        result = future.result()
    except Exception:
        print('Unable to get the result')
Running the example creates the thread pool and submits the work as per normal. The task fails with an error, the thread pool catches the exception, stores it, then re-raises it when we call the result() function in the Future.
The recipient of the result accepts the exception and catches it, reporting a failure case.
Unable to get the result
We can also check for the exception directly via a call to the exception() function on the Future object. This function blocks until the task is done and takes an optional timeout, just like a call to result().
If an exception never occurs and the task is cancelled or completes successfully, then exception() will return a value of None.
We can demonstrate the explicit checking for an exceptional case in the task in the example below.
# SuperFastPython.com
# example of handling an exception raised within a task
from time import sleep
from concurrent.futures import ThreadPoolExecutor

# mock task that will sleep for a moment
def work():
    sleep(1)
    raise Exception('Something bad happened!')
    return "never gets here"

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute our task
    future = executor.submit(work)
    # check for an exception from the task
    exception = future.exception()
    # handle exceptional case
    if exception:
        print(exception)
    else:
        result = future.result()
        print(result)
Running the example creates and submits the work per normal.
The recipient checks for the exceptional case, which blocks until an exception is raised or the task is completed. An exception is received and is handled by reporting it.
Something bad happened!
We cannot check the exception() function of the Future object for each task, as map() does not provide access to Future objects.
Worse still, the approach of handling the exception in the recipient cannot be used when using map() to submit tasks, unless you wrap the entire iteration.
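To make the limitation concrete, the sketch below wraps the whole map() iteration in a try-except; note that once the exception propagates, the iteration cannot be resumed, so results from any remaining tasks are lost. This is why submit() with per-Future handling is often preferred when failures are expected.

# sketch: wrapping the entire map() iteration to catch a task exception
from time import sleep
from concurrent.futures import ThreadPoolExecutor

# mock task that fails for one input
def work(value):
    sleep(0.1)
    if value == 2:
        raise Exception('Something bad happened!')
    return value

with ThreadPoolExecutor() as executor:
    try:
        # the exception is re-raised here when the failing task's result is yielded
        for result in executor.map(work, range(5)):
            print(result)
    except Exception as e:
        print(f'Stopped iterating: {e}')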
We can demonstrate this by submitting one task with map() that happens to raise an Exception.
The complete example is listed below.
# SuperFastPython.com
# example of handling an exception raised within a task
from time import sleep
from concurrent.futures import ThreadPoolExecutor

# mock task that will sleep for a moment
def work(value):
    sleep(1)
    raise Exception('Something bad happened!')
    return f'Never gets here {value}'

# create a thread pool
with ThreadPoolExecutor() as executor:
    # execute our task
    for result in executor.map(work, [1]):
        print(result)
Running the example submits the single task (a bad use for map()) and waits for the first result.
The task raises an exception and the main thread exits, as we expected.
Traceback (most recent call last):
...
raise Exception('Something bad happened!')
Exception: Something bad happened!
This highlights that if map() is used to submit tasks to the thread pool, then the tasks should handle their own exceptions or be simple enough that exceptions are not expected.
Exception Handling in Callbacks
A final case we must consider for exception handling when using the ThreadPoolExecutor is in callback functions.
When issuing tasks to the thread pool with a call to submit(), we receive a Future object in return on which we can register callback functions to call when the task completes via the add_done_callback() function.
This allows one or more callback functions to be registered that will be executed in the order in which they are registered.
These callbacks are always called, even if the task is cancelled or fails itself with an exception.
A callback can fail with an exception and it will not impact other callback functions that have been registered or the task.
The exception is caught by the thread pool, logged as an error message, and the procedure moves on. In a sense, callbacks are able to fail silently.
We can demonstrate this with a worked example with multiple callback functions, the first of which will raise an exception.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
# SuperFastPython.com # add callbacks to a future, one of which raises an exception from time import sleep from concurrent.futures import ThreadPoolExecutor from concurrent.futures import wait # callback function to call when a task is completed def custom_callback1(future): raise Exception('Something bad happened!') # never gets here print('Callback 1 called.') # callback function to call when a task is completed def custom_callback2(future): print('Callback 2 called.') # mock task that will sleep for a moment def work(): sleep(1) return 'Task is done' # create a thread pool with ThreadPoolExecutor() as executor: # execute the task future = executor.submit(work) # add the custom callbacks future.add_done_callback(custom_callback1) future.add_done_callback(custom_callback2) # wait for the task to complete and get the result result = future.result() # wait for callbacks to finish sleep(0.1) print(result) |
Running the example starts the thread pool as per normal and executes the task.
When the task completes, the first callback is called, which fails with an exception. The exception is logged and reported on the console (the default behavior for logging).
The thread pool is not broken and carries on.
The second callback is called successfully, and finally, the main thread gets the result of the task.
1 2 3 4 5 6 7 |
exception calling callback for <Future at 0x101d76730 state=finished returned str> Traceback (most recent call last): ... raise Exception('Something bad happened!') Exception: Something bad happened! Callback 2 called. Task is done |
This highlights that if a callback may raise an exception, the exception must be handled explicitly within the callback if you wish for the failure to have any effect beyond a log message.
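For example, a callback that should not fail silently can trap and report its own exception; a minimal sketch, where risky_operation() is a hypothetical stand-in for whatever the callback does:

# callback that handles its own exception rather than failing silently
def safe_callback(future):
    try:
        # hypothetical work performed on the task result
        risky_operation(future.result())
    except Exception as e:
        # handle or record the failure explicitly
        print(f'Callback failed: {e}')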
How Does ThreadPoolExecutor Work Internally
It is important to pause for a moment and look at how the ThreadPoolExecutor works internally.
The internal workings of the class impact how we use the thread pool and the behavior we can expect, specifically around cancelling tasks.
Without this knowledge, some of the behavior of the thread pool may appear confusing from the outside.
You can see the source code for the ThreadPoolExecutor and the base class here:
There is a lot we could learn about how the thread pool works internally, but we will limit ourselves to the most critical aspects.
There are two aspects that you need to consider about the internal operation of the ThreadPoolExecutor class: how tasks are sent into the pool and how worker threads are created.
Tasks Are Added to an Internal Queue
Tasks are sent into the thread pool by adding them to an internal queue.
Recall that a queue is a data structure where items are added to one end and retrieved from the other in a first-in, first-out (FIFO) manner by default.
The queue is a queue.SimpleQueue object, a thread-safe queue implementation. This means we can add work to the pool from any thread and the queue of work will not become corrupted by concurrent put() and get() operations.
The use of a task queue explains the distinction between tasks that have been scheduled but are not yet running, and why only those tasks can be cancelled.
Recall that the thread pool has a fixed number of worker threads. The number of tasks on the queue may exceed the number of worker threads, or the number of currently available threads, in which case tasks may sit in a scheduled state for some time, allowing them to be cancelled either directly or en masse when shutting down the pool.
A task is wrapped in an internal object called a _WorkItem. This captures the details such as the function to call, the arguments, the associated Future object, and handling of exceptions if they occur during task execution.
This explains how an exception within a task does not bring down the entire thread pool, but can be checked for and accessed after the task has completed.
When a _WorkItem object is retrieved from the queue by a worker thread, it checks whether the task has been cancelled before executing it. If so, the worker returns immediately and does not execute the content of the task.
This explains internally how cancellation is implemented by the thread pool and why we cannot cancel a running task.
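The pattern looks roughly like the following sketch, paraphrased from the CPython source (simplified, and details vary between Python versions):

# simplified sketch of the _WorkItem pattern used internally by CPython
class _WorkItem:
    def __init__(self, future, fn, args, kwargs):
        self.future = future
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def run(self):
        # skip execution if the task was cancelled while queued
        if not self.future.set_running_or_notify_cancel():
            return
        try:
            result = self.fn(*self.args, **self.kwargs)
        except BaseException as exc:
            # store the exception on the future instead of raising it
            self.future.set_exception(exc)
        else:
            # store the result on the future for the caller to retrieve
            self.future.set_result(result)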
Worker Threads Are Created as Needed
Worker threads are not created when the thread pool is created.
Instead, worker threads are created on demand or just-in-time.
Each time a task is added to the internal queue, the thread pool will check if the number of active threads is less than the upper limit of threads supported by the thread pool. If so, an additional thread is created to handle the new work.
Once a thread has completed a task, it will wait on the queue for new work to arrive. When new work arrives, one of the threads waiting on the queue will retrieve the unit of work and start executing it.
These two points show how the pool will only ever create new threads until the limit is reached and how threads will be reused, waiting for new tasks without consuming computational resources.
It also shows that the thread pool will not release worker threads, e.g. after a fixed number of units of work or a period of inactivity. Perhaps this would be a nice addition to the API in the future.
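We can observe the lazy creation of worker threads by checking the number of active threads before and after submitting a task; a minimal sketch (exact counts may vary if your environment starts extra threads):

import threading
from time import sleep
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(4) as executor:
    # no worker threads exist yet, only the main thread
    print(f'Threads after creating the pool: {threading.active_count()}')
    # submitting one task triggers the creation of one worker
    executor.submit(sleep, 1)
    print(f'Threads after one submit: {threading.active_count()}')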
Now that we understand how work is injected into the thread pool and how the pool manages threads, let’s look at some best practices to consider when using the ThreadPoolExecutor.
ThreadPoolExecutor Best Practices
Now that we know how the ThreadPoolExecutor works and how to use it, let’s review some best practices to consider when bringing thread pools into our Python programs.
To keep things simple, there are five best practices; they are:
- 1. Use the Context Manager
- 2. Use map() for Asynchronous For-Loops
- 3. Use submit() with as_completed()
- 4. Use Independent Functions as Tasks
- 5. Use for IO-Bound Tasks (probably)
Use the Context Manager
Use the context manager when using thread pools and handle all task dispatching to the thread pool and processing results within the manager.
For example:
1 2 3 4 |
... # create a thread pool via the context manager with ThreadPoolExecutor(10) as executor: # ... |
Remember to configure your thread pool when creating it in the context manager, specifically by setting the number of threads to use in the pool.
Using the context manager avoids the situation where you have explicitly instantiated the thread pool and forget to shut it down manually by calling shutdown().
It is also less code and better grouped than managing instantiation and shutdown manually; for example:
1 2 3 4 5 |
... # create a thread pool manually executor = ThreadPoolExecutor(10) # ... executor.shutdown() |
Don’t use the context manager when you need to dispatch tasks and get results over a broader context (e.g. multiple functions) and/or when you need more control over the shutdown of the pool.
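If you do manage the pool manually over a broader context, a try-finally block ensures the pool is shut down even if an error occurs; a minimal sketch:

...
# create a thread pool manually
executor = ThreadPoolExecutor(10)
try:
    # dispatch tasks and process results, perhaps across many functions
    ...
finally:
    # always shut down the pool, even on error
    executor.shutdown()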
Use map() for Asynchronous For-Loops
If you have a for-loop that applies a function to each item in a list, then use the map() function to dispatch the tasks asynchronously.
For example, you may have a for-loop over a list that calls myfunc() for each item:
1 2 3 4 5 |
... # apply a function to each item in an iterable for item in mylist: result = myfunc(item) # do something... |
Or, you may already be using the built-in map function:
1 2 3 4 |
... # apply a function to each item in an iterable for result in map(myfunc, mylist): # do something... |
Both of these cases can be made asynchronous using the map() function on the thread pool.
1 2 3 4 |
... # apply a function to each item in an iterable asynchronously for result in executor.map(myfunc, mylist): # do something... |
Do not use the map() function if your target task function has side effects.
Do not use the map() function if your target task function takes no arguments. If it takes more than one argument, you can pass multiple iterables to map(), one per argument, just like the built-in map() function (see the sketch below), or use submit() instead.
Do not use the map() function if you need control over exception handling for each task, or if you would like to get results to tasks in the order that tasks are completed.
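As noted above, if your target function takes multiple arguments, the executor’s map() accepts multiple iterables, one per argument, just like the built-in map(); a sketch using the same placeholder names as the snippets above:

...
# apply a two-argument function to pairs of items from two iterables
for result in executor.map(myfunc, mylist1, mylist2):
    # do something...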
Use submit() with as_completed()
If you would like to process results in the order that tasks are completed, rather than the order that tasks are submitted, then use submit() and as_completed().
The submit() function is on the thread pool object and is used to push tasks into the pool for execution; it returns immediately with a Future object for the task. The as_completed() function is a module function that takes an iterable of Future objects, like a list, and yields Future objects as their tasks are completed.
For example:
1 2 3 4 5 6 7 8 |
... # submit all tasks and get future objects futures = [executor.submit(myfunc, item) for item in mylist] # process results from tasks in order of task completion for future in as_completed(futures): # get the result result = future.result() # do something... |
Do not use the submit() and as_completed() combination if you need to process the results in the order that the tasks were submitted to the thread pool.
Do not use the submit() and as_completed() combination if you need results from all tasks to continue; you may be better off using the wait() module function.
Do not use the submit() and as_completed() combination for a simple asynchronous for-loop; you may be better off using map().
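For the case where you need all results before continuing, a sketch of the wait() module function mentioned above:

...
from concurrent.futures import wait
# submit all tasks and get Future objects
futures = [executor.submit(myfunc, item) for item in mylist]
# block until every task is done
done, not_done = wait(futures)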
Use Independent Functions as Tasks
Use the ThreadPoolExecutor if your tasks are independent.
This means that each task is not dependent on other tasks that could execute at the same time. It also may mean tasks that are not dependent on any data other than data provided via function arguments to the task.
The ThreadPoolExecutor is ideal for tasks that do not change any data, e.g. have no side effects, so-called pure functions.
Thread pools can be organized into data flows and pipelines for linear dependence between tasks, perhaps with one thread pool per task type.
The thread pool is not designed for tasks that require coordination; you should consider using the Thread class and coordination patterns like the Barrier and Semaphore.
Thread pools are not designed for tasks that require synchronization; you should consider using the Thread class and locking patterns like Lock and RLock.
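For example, tasks that update shared state need a lock, which is a hint that explicit threads and synchronization primitives may be a better fit; a minimal sketch of the locking pattern with the Thread class:

from threading import Thread, Lock

lock = Lock()
counter = 0

# task that updates shared state under a lock
def task():
    global counter
    with lock:
        counter += 1

# create, start, and join the threads
threads = [Thread(target=task) for _ in range(10)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter)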
Use for IO-Bound Tasks (probably)
Use ThreadPoolExecutor for IO-bound tasks only.
These are tasks that may involve interacting with an external device such as a peripheral (e.g. a camera or a printer), a storage device (e.g. a hard drive or SSD), or another computer (e.g. socket communication).
Threads and thread pools like the ThreadPoolExecutor are probably not appropriate for CPU-bound tasks, like computation on data in memory.
This is because of a design decision within the Python interpreter that makes use of a master lock, called the Global Interpreter Lock (GIL), that prevents more than one Python instruction from executing at the same time.
This design decision was made within the reference implementation of the Python interpreter (from Python.org), but may not impact other interpreters (such as PyPy, IronPython, and Jython).
Common Errors When Using ThreadPoolExecutor
There are a number of common errors when using the ThreadPoolExecutor.
These errors are typically made because of bugs introduced by copy-and-pasting code, or from a slight misunderstanding in how the ThreadPoolExecutor works.
We will take a closer look at some of the more common errors made when using the ThreadPoolExecutor.
Do you have an error using the ThreadPoolExecutor?
Let me know in the comments so I can recommend a fix and add the case to this section.
Using a Function Call in submit()
A common error is to call your function when using the submit() function.
For example:
1 2 3 |
... # submit the task future = executor.submit(task()) |
A complete example with this error is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# SuperFastPython.com # example of calling submit with a function call from time import sleep from random import random from concurrent.futures import ThreadPoolExecutor # custom task that will sleep for a variable amount of time def task(): # sleep for less than a second sleep(random()) return 'all done' # start the thread pool with ThreadPoolExecutor() as executor: # submit the task future = executor.submit(task()) # get the result result = future.result() print(result) |
Running this example will fail with an error.
1 2 3 4 |
Traceback (most recent call last): ... result = self.fn(*self.args, **self.kwargs) TypeError: 'str' object is not callable |
You can fix the error by updating the call to submit() to take the name of your function and any arguments, instead of calling the function in the call to submit.
For example:
1 2 3 |
... # submit the task future = executor.submit(task) |
Using a Function Call in map()
A common error is to call your function when using the map() function.
For example:
1 2 3 4 |
... # submit all tasks for result in executor.map(task(), range(5)): print(result) |
A complete example with this error is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# SuperFastPython.com # example of calling map with a function call from time import sleep from random import random from concurrent.futures import ThreadPoolExecutor # custom task that will sleep for a variable amount of time def task(value): # sleep for less than a second sleep(random()) return value # start the thread pool with ThreadPoolExecutor() as executor: # submit all tasks for result in executor.map(task(), range(5)): print(result) |
Running the example results in a TypeError.
1 2 3 |
Traceback (most recent call last): ... TypeError: task() missing 1 required positional argument: 'value' |
This error can be fixed by changing the call to map() to pass the name of the target task function instead of a call to the function.
1 2 3 4 |
... # submit all tasks for result in executor.map(task, range(5)): print(result) |
Incorrect Function Signature for map()
Another common error when using map() is to provide no second argument to the map() function, i.e. the iterable of arguments for the task.
For example:
1 2 3 4 |
... # submit all tasks for result in executor.map(task): print(result) |
A complete example with this error is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# SuperFastPython.com # example of calling map without an iterable from time import sleep from random import random from concurrent.futures import ThreadPoolExecutor # custom task that will sleep for a variable amount of time def task(value): # sleep for less than a second sleep(random()) return value # start the thread pool with ThreadPoolExecutor() as executor: # submit all tasks for result in executor.map(task): print(result) |
Running the example does not issue any tasks to the thread pool as there was no iterable for the map() function to iterate over.
In this case, no output is displayed.
The fix involves providing an iterable in the call to map() along with your function name.
1 2 3 4 |
... # submit all tasks for result in executor.map(task, range(5)): print(result) |
Incorrect Function Signature for Future Callbacks
Another common error is to forget to include the Future in the signature for the callback function registered with a Future object.
For example:
1 2 3 4 |
... # callback function to call when a task is completed def custom_callback(): print('Custom callback was called') |
A complete example with this error is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
# SuperFastPython.com # example of the wrong signature on the callback function from time import sleep from concurrent.futures import ThreadPoolExecutor # callback function to call when a task is completed def custom_callback(): print('Custom callback was called') # mock task that will sleep for a moment def work(): sleep(1) return 'Task is done' # create a thread pool with ThreadPoolExecutor() as executor: # execute the task future = executor.submit(work) # add the custom callback future.add_done_callback(custom_callback) # get the result result = future.result() print(result) |
Running this example will result in an error when the callback is called by the thread pool.
1 2 3 4 5 |
Task is done exception calling callback for <Future at 0x10a05f190 state=finished returned str> Traceback (most recent call last): ... TypeError: custom_callback() takes 0 positional arguments but 1 was given |
Fixing this error involves updating the signature of your callback function to include the Future object.
1 2 3 4 |
... # callback function to call when a task is completed def custom_callback(future): print('Custom callback was called') |
Common Questions When Using the ThreadPoolExecutor
This section answers common questions asked by developers when using the ThreadPoolExecutor.
Do you have a question about the ThreadPoolExecutor?
Ask your question in the comments below and I will do my best to answer it and perhaps add it to this list of questions.
How Do You Stop a Running Task?
A task in the ThreadPoolExecutor can be cancelled before it has started running.
In this case, the task must have been sent into the pool by calling submit(), which returned a Future object. You can then call the cancel() function on the Future object.
If the task is already running it cannot be canceled, stopped, or terminated by the thread pool.
Instead, you must add this functionality to your task.
One approach might be to use a thread-safe flag, like a threading.Event, that, if set, indicates that all tasks must stop running as soon as they can. You can then update your target task function or functions to check the state of this flag frequently.
It may require that you change the structure of your task.
For example, if your task reads data from a file or a socket, you may need to change the read operation to be performed in blocks of data in a loop so that each iteration of the loop you can check the status of the flag.
The example below provides a template you can use to add an event flag to your target task function, checking it for a stop condition in order to shut down all currently running tasks.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
# SuperFastPython.com # example of stopping running tasks using an event from time import sleep from threading import Event from concurrent.futures import ThreadPoolExecutor # mock target task function def work(event): # pretend read data for a long time for _ in range(100): # pretend to read some data sleep(1) # check the status of the flag if event.is_set(): # shut down this task now print("Not done, asked to stop") return return "All done!" # create an event to shut down all running tasks event = Event() # create a thread pool executor = ThreadPoolExecutor(5) # execute all of our tasks futures = [executor.submit(work, event) for _ in range(50)] # wait a moment print('Tasks are running...') sleep(2) # cancel all scheduled tasks print('Cancelling all scheduled tasks...') for future in futures: future.cancel() # stop all currently running tasks print('Trigger all running tasks to stop...') event.set() # shutdown the thread pool and wait for all tasks to complete print('Shutting down...') executor.shutdown() |
Running the example first creates a thread pool with 5 worker threads and schedules 50 tasks.
An event object is created and passed to each task, where it is checked on each iteration of the loop to see if it has been set and, if so, the task bails out.
The first 5 tasks start executing for a few seconds, then we decide to shut everything down.
First, we cancel all scheduled tasks that are not yet running, so that if they are taken off the queue by a worker thread, they will not be executed.
We then set the event to trigger all running tasks to stop.
The thread pool is then shut down and we wait for all running threads to complete their execution.
The five running threads check the status of the event in their next loop iteration and bail out, printing a message.
1 2 3 4 5 6 7 8 9 |
Tasks are running... Cancelling all scheduled tasks... Trigger all running tasks to stop... Shutting down... Not done, asked to stop Not done, asked to stop Not done, asked to stop Not done, asked to stop Not done, asked to stop |
How Do You Wait for All Tasks to Complete?
There are a few ways to wait for all tasks to complete in a ThreadPoolExecutor.
Firstly, if you have a Future object for every task in the thread pool because you called submit(), then you can provide the collection of Future objects to the wait() module function. By default, this function will return when all of the provided Future objects have completed.
1 2 3 |
... # wait for all tasks to complete wait(futures) |
Alternatively, you can enumerate the list of Future objects and attempt to get the result from each. This iteration will complete when all results are available, meaning that all tasks have completed.
1 2 3 4 5 |
... # wait for all tasks to complete by getting all results for future in futures: result = future.result() # all tasks are complete |
Another approach is to shut down the thread pool. We can set “cancel_futures” to True (available in Python 3.9 and later), which will cancel all scheduled tasks and wait for all currently running tasks to complete.
1 2 3 |
... # shutdown the pool, cancels scheduled tasks, returns when running tasks complete executor.shutdown(wait=True, cancel_futures=True) |
You can also shut down the pool and not cancel the scheduled tasks, yet still wait for all tasks to complete. This will ensure all running and scheduled tasks are completed before the function returns. This is the default behavior of the shutdown() function, but it is a good idea to specify it explicitly.
1 2 3 |
... # shutdown the pool, returns after all scheduled and running tasks complete executor.shutdown(wait=True, cancel_futures=False) |
How Do You Dynamically Change the Number of Threads?
You cannot dynamically increase or decrease the number of threads in a ThreadPoolExecutor.
The number of threads is fixed when the ThreadPoolExecutor is configured in the call to the object constructor. For example:
1 2 3 |
... # configure a thread pool executor = ThreadPoolExecutor(20) |
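If you really do need a different number of worker threads partway through a program, one workaround is to shut down the existing pool and create a new one with the desired size; a sketch:

...
# wait for current tasks to finish and release the workers
executor.shutdown()
# create a new pool with a different number of workers
executor = ThreadPoolExecutor(50)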
How Do You Log From a Task?
Your target task functions are executed by worker threads in the thread pool and you may be concerned whether logging from those task functions is thread safe.
That is, will the log become corrupted if two threads attempt to log a message at the same time? The answer is no, the log will not become corrupted.
The Python logging functionality is thread-safe by default.
For example, see this quote from the logging module API documentation:
The logging module is intended to be thread-safe without any special work needing to be done by its clients. It achieves this though using threading locks; there is one lock to serialize access to the module’s shared data, and each handler also creates a lock to serialize access to its underlying I/O.
— Logging facility for Python, Thread Safety.
Therefore, you can log from your target task functions directly.
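For example, a minimal sketch of tasks logging directly via the standard logging module:

import logging
from concurrent.futures import ThreadPoolExecutor

# configure the root logger once in the main thread
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# task that logs a message; safe to call from many threads at once
def task(value):
    logger.info(f'Task {value} is running')
    return value

with ThreadPoolExecutor(4) as executor:
    list(executor.map(task, range(8)))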
How Do You Unit Test Tasks and Thread Pools?
You can unit test your target task functions directly, perhaps mocking any external resources required.
You can unit test your usage of the thread pool with mock tasks that do not interact with external resources.
Unit testing of tasks and the thread pool itself must be considered as part of your design. It may require that the connection to the IO resource be configurable so that it can be mocked, and that the target task function called by your thread pool be configurable so that it can be mocked.
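For example, if the dispatch logic takes the task function as a parameter, a test can pass in a mock task instead of the real one. A minimal sketch, where run_all() is a hypothetical helper and not from the examples above:

import unittest
from concurrent.futures import ThreadPoolExecutor

# hypothetical dispatch helper that takes the task function as an argument
def run_all(task, items):
    with ThreadPoolExecutor() as executor:
        return list(executor.map(task, items))

class TestRunAll(unittest.TestCase):
    def test_run_all_with_mock_task(self):
        # mock task that avoids any real IO
        mock_task = lambda x: x * 2
        self.assertEqual(run_all(mock_task, [1, 2, 3]), [2, 4, 6])

if __name__ == '__main__':
    unittest.main()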
How Do You Compare Serial to Parallel Performance?
You can compare the performance of your program with and without the thread pool.
This can be a useful exercise to confirm that making use of the ThreadPoolExecutor in your program has resulted in a speed-up.
Perhaps the simplest approach is to manually record the start and end time of your code and subtract the end from the start time to report the total execution time. Then record the time with and without the use of the thread pool.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
# SuperFastPython.com # example of recording the execution time of a program import time # record the start time start_time = time.time() # do work with or without a thread pool # .... time.sleep(3) # record the end time end_time = time.time() # report execution time total_time = end_time - start_time print(f'Execution time: {total_time:.1f} seconds.') |
Using an average program execution time might give a more stable program timing than a one-off run.
You can record the execution time 3 or more times for your program without the thread pool then calculate the average as the sum of times divided by the total runs. Then repeat this exercise to calculate the average time with the thread pool.
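A sketch of averaging three timed runs, where run_program() is a placeholder for your own code:

import time

# placeholder for the code being benchmarked
def run_program():
    time.sleep(1)

# time several runs and report the average
times = []
for _ in range(3):
    start_time = time.time()
    run_program()
    times.append(time.time() - start_time)
average_time = sum(times) / len(times)
print(f'Average execution time: {average_time:.1f} seconds.')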
This would probably only be appropriate if the running time of your program is minutes rather than hours.
You can then compare the serial vs. parallel version by calculating the speed up multiple as:
- Speed-Up Multiple = Serial Time / Parallel Time
For example, if the serial run of a program took 15 minutes (900 seconds) and the parallel version with a ThreadPoolExecutor took 5 minutes (300 seconds), then the speed-up multiple would be calculated as:
- Speed-Up Multiple = Serial Time / Parallel Time
- Speed-Up Multiple = 900 / 300
- Speed-Up Multiple = 3
That is, the parallel version of the program with the ThreadPoolExecutor is 3 times faster or 3x faster.
You can multiply the speed-up multiple by 100 to give a speed-up percentage:
- Speed-Up Percentage = Speed-Up Multiple * 100
In this example, the parallel version has a speed-up percentage of 300% relative to the serial version.
How Do You Set chunksize in map()?
The map() function on the ThreadPoolExecutor takes a parameter called “chunksize” which defaults to 1.
The chunksize parameter is not used by the ThreadPoolExecutor; it is only used by the ProcessPoolExecutor, therefore you can safely ignore it.
Setting this parameter does nothing when using the ThreadPoolExecutor.
How Do You Submit a Follow-up Task?
Some tasks require that a second task be executed that makes use of the result from the first task in some way.
We might call this the need to execute a follow-up task for each task that is submitted, which might be conditional on the result in some way.
There are a few ways to submit a follow-up task.
One approach would be to submit the follow-up task as we are processing the results from the first task.
For example, we can process the result from each of the first tasks as they complete, then manually call submit() for each follow-up task when needed and store the new future object in a second list for later use.
We can make this example of submitting follow-up tasks concrete with a full example.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
# SuperFastPython.com # example of submitting follow-up tasks from time import sleep from random import random from concurrent.futures import ThreadPoolExecutor from concurrent.futures import as_completed # mock task that works for a moment def task1(): value = random() sleep(value) print(f'Task 1: {value}') return value # mock task that works for a moment def task2(value1): value2 = random() sleep(value2) print(f'Task 2: value1={value1}, value2={value2}') return value2 # start the thread pool with ThreadPoolExecutor(5) as executor: # send in the first tasks futures1 = [executor.submit(task1) for _ in range(10)] # process results in the order they are completed futures2 = list() for future1 in as_completed(futures1): # get the result result = future1.result() # check if we should trigger a follow-up task if result > 0.5: future2 = executor.submit(task2, result) futures2.append(future2) # wait for all follow-up tasks to complete (the context manager exit waits) |
Running the example starts a thread pool with 5 worker threads and submits 10 tasks.
We then process the results for the tasks as they are completed. If a result from the first round of tasks requires a follow-up task, we submit the follow-up task and keep track of the Future object in a second list.
These follow-up tasks are submitted as needed, rather than waiting until all first round tasks are completed, which is a nice benefit of using the as_completed() function with a list of Future objects.
We can see that in this case, five first round tasks resulted in follow-up tasks.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Task 1: 0.021868594663798424 Task 1: 0.07220684891621587 Task 1: 0.1889059597524675 Task 1: 0.4044025009726221 Task 1: 0.5377728619737125 Task 1: 0.5627604576510364 Task 1: 0.19590409149609522 Task 1: 0.8350495785309672 Task 2: value1=0.8350495785309672, value2=0.21472292885893007 Task 2: value1=0.5377728619737125, value2=0.6180101068687799 Task 1: 0.9916368220002719 Task 1: 0.6688307514083958 Task 2: value1=0.6688307514083958, value2=0.2691774622597396 Task 2: value1=0.5627604576510364, value2=0.859736361909423 Task 2: value1=0.9916368220002719, value2=0.642060404763057 |
You might like to use a separate thread pool for follow-up tasks, to keep things separate.
I would not recommend submitting new tasks from within a task.
This would require access to the thread pool either as a global variable or by being passed in and would break the idea of tasks being pure functions that don’t have side effects, a good practice when using thread pools.
How Do You Store Local State for Each Thread?
You can use thread local variables for worker threads in the ThreadPoolExecutor.
A common pattern would be to use a custom initializer function for each worker thread to set up a thread local variable specific to each worker thread.
These thread local variables can then be used by each worker thread within each task, requiring that the task be aware of the thread local mechanism.
We can demonstrate this with a worked example.
First, we can define a custom initializer function that takes a thread local context and sets up a custom variable named “key” with a unique value between 0.0 and 1.0 for each worker thread.
1 2 3 4 5 6 |
# function for initializing the worker thread def initializer_worker(local): # generate a unique value for the worker thread local.key = random() # store the unique worker key in a thread local variable print(f'Initializing worker thread {local.key}') |
We can then define our target task function to take the same thread local context and to access the thread local variable for the worker thread and make use of it.
1 2 3 4 5 6 7 |
# a mock task that sleeps for a random amount of time less than one second def task(local): # access the unique key for the worker thread mykey = local.key # make use of it sleep(mykey) return f'Worker using {mykey}' |
We can then configure our new ThreadPoolExecutor instance to use the initializer with the required local argument.
1 2 3 4 5 |
... # get the local context local = threading.local() # create a thread pool executor = ThreadPoolExecutor(max_workers=2, initializer=initializer_worker, initargs=(local,)) |
Then dispatch tasks into the thread pool with the same thread local context.
1 2 3 |
... # dispatch tasks futures = [executor.submit(task, local) for _ in range(10)] |
Tying this together, the complete example of using a ThreadPoolExecutor with thread local storage is listed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
# SuperFastPython.com # example of thread local storage for worker threads from time import sleep from random import random import threading from concurrent.futures import ThreadPoolExecutor # function for initializing the worker thread def initializer_worker(local): # generate a unique value for the worker thread local.key = random() # store the unique worker key in a thread local variable print(f'Initializing worker thread {local.key}') # a mock task that sleeps for a random amount of time less than one second def task(local): # access the unique key for the worker thread mykey = local.key # make use of it sleep(mykey) return f'Worker using {mykey}' # get the local context local = threading.local() # create a thread pool executor = ThreadPoolExecutor(max_workers=2, initializer=initializer_worker, initargs=(local,)) # dispatch tasks futures = [executor.submit(task, local) for _ in range(10)] # wait for all threads to complete for future in futures: result = future.result() print(result) # shutdown the thread pool executor.shutdown() print('done') |
Running the example first configures the thread pool to use our custom initializer function, which sets up a thread local variable for each worker thread with a unique value, in this case two threads each with a value between 0 and 1.
Each worker thread then works on tasks from the queue, all ten of them, each using the specific value of the thread local variable set up for that thread in the initializer function.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
Initializing worker thread 0.9360961457279074 Initializing worker thread 0.9075641843481475 Worker using 0.9360961457279074 Worker using 0.9075641843481475 Worker using 0.9075641843481475 Worker using 0.9360961457279074 Worker using 0.9075641843481475 Worker using 0.9360961457279074 Worker using 0.9075641843481475 Worker using 0.9360961457279074 Worker using 0.9075641843481475 Worker using 0.9360961457279074 done |
How Do You Show Progress of All Tasks?
There are many ways to show progress for tasks that are being executed by the ThreadPoolExecutor.
Perhaps the simplest is to use a callback function that updates a progress indicator. This can be achieved by defining the progress indicator function and registering it with the Future object for each task via the add_done_callback() function.
The simplest progress indicator is to print a dot to the screen for each task that completes.
The example below demonstrates this simple progress indicator.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
# SuperFastPython.com # example of a simple progress indicator from time import sleep from random import random from concurrent.futures import ThreadPoolExecutor # simple progress indicator callback function def progress_indicator(future): print('.', end='', flush=True) # mock task that works for a moment def task(name): sleep(random()) # start the thread pool with ThreadPoolExecutor(2) as executor: # send in the tasks futures = [executor.submit(task, i) for i in range(20)] # register the progress indicator callback for future in futures: future.add_done_callback(progress_indicator) # wait for all tasks to complete print('\nDone!') |
Running the example starts a thread pool with two worker threads and dispatches 20 tasks.
A progress indicator callback function is registered with each Future object that prints one dot as each task is completed, ensuring that the standard output is flushed with each call to print() and that no new line is printed.
This ensures that we see each dot immediately, regardless of which thread prints it, and that all dots appear on one line.
Each task will work for a variable amount of time less than one second and a dot is printed once the task is completed.
1 2 |
.................... Done! |
A more elaborate progress indicator must know the total number of tasks and use a thread-safe counter to report the number of tasks completed out of the total number of tasks to be completed, as in the sketch below.
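A sketch of such a counter-based progress callback, protected by a lock (the structure is a suggestion, not from the original example):

...
from threading import Lock

total_tasks = 20
completed = 0
counter_lock = Lock()

# progress callback that reports tasks completed out of the total
def progress_indicator(future):
    global completed
    with counter_lock:
        completed += 1
        print(f'{completed}/{total_tasks} tasks completed')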
Do We Need to Have a Check for __main__?
You do not need to have a check for __main__ when using the ThreadPoolExecutor.
You do need a check for __main__ when using the Process version of the pool, called a ProcessPoolExecutor. This may be the source of confusion.
How Do You Get a Future Object for Tasks Added With map()?
When you call map(), it does create a Future object for each task.
Internally, submit() is called for each item in the iterable provided to the call to map().
Nevertheless, there is no clean way to access the Future object for tasks sent into the thread pool via map().
Each task on the ThreadPoolExecutor object’s internal work queue is an instance of a _WorkItem that does have a reference to the Future object for the task. You can access the ThreadPoolExecutor object’s internal queue, but you have no safe way of enumerating items in the queue without removing them.
If you need a Future object for a task, call submit().
Can I Call shutdown() From Within the Context Manager?
You can call shutdown() within the context manager, but there are not many use cases.
It does not cause an error that I can see.
You may want to call shutdown() explicitly if you wish to cancel all scheduled tasks and you don’t have access to the Future objects, and you wish to do other clean-up before waiting for all running tasks to stop.
It would be strange if you found yourself in this situation.
Nevertheless, here is an example of calling shutdown() from within the context manager.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
# SuperFastPython.com # example of shutting down within a context manager from time import sleep from concurrent.futures import ThreadPoolExecutor # mock task that works for a moment def task(name): sleep(2) print(f'Done: {name}') # start the thread pool with ThreadPoolExecutor(1) as executor: # send some tasks into the thread pool print('Sending in tasks...') futures = [executor.submit(task, i) for i in range(10)] # explicitly shutdown within the context manager print('Shutting down...') executor.shutdown(wait=False, cancel_futures=True) # shutdown called again here when context manager exited print('Waiting...') print('Doing other things...') |
Running the example starts a thread pool with one worker thread and then sends in ten tasks to execute.
We then explicitly call shutdown on the thread pool and cancel all scheduled tasks without waiting.
We then exit the context manager and wait for all tasks to complete. This second shutdown works as expected, waiting for the one running task to complete before returning.
1 2 3 4 5 |
Sending in tasks... Shutting down... Waiting... Done: 0 Doing other things... |
Common Objections to Using ThreadPoolExecutor
The ThreadPoolExecutor may not be the best solution for all multithreading problems in your program.
That being said, there may also be some misunderstandings that are preventing you from making full and best use of the capabilities of the ThreadPoolExecutor in your program.
In this section, we review some of the common objections seen by developers when considering using the ThreadPoolExecutor in their code.
What About the Global Interpreter Lock (GIL)?
The Global Interpreter Lock, or GIL for short, is a design decision within the reference Python interpreter.
It refers to the fact that the implementation of the Python interpreter makes use of a master lock that prevents more than one Python instruction from executing at the same time.
This prevents more than one thread of execution at a time within each Python process, that is, within each instance of the Python interpreter.
The implementation of the GIL means that Python threads may be concurrent, but cannot run in parallel. Recall that concurrent means that more than one task can be in progress at the same time; parallel means more than one task actually executing at the same time. Parallel tasks are concurrent, concurrent tasks may or may not execute in parallel.
It is the reason behind the heuristic that Python threads should only be used for IO-bound tasks, and not CPU-bound tasks, as IO-bound tasks will wait in the operating system kernel for remote resources to respond (not executing Python instructions), allowing other Python threads to run and execute Python instructions.
Put another way, the GIL does not mean we cannot use threads in Python, only that some use cases for Python threads are viable or appropriate.
This design decision was made within the reference implementation of the Python interpreter (from Python.org), but may not impact other interpreters (such as PyPy, IronPython, and Jython) that allow multiple Python instructions to be executed concurrently and in parallel.
You can learn more about this topic here:
Are Python Threads “Real Threads”?
Yes.
Python makes use of real system-level threads, also called kernel-level threads, a capability provided by modern operating systems like Windows, Linux, and MacOS.
Python threads are not software-level threads, sometimes called user-level threads or green threads.
Aren’t Python Threads Buggy?
No.
Python threads are not buggy.
Python threading is a first class capability of the Python platform and has been for a very long time.
Isn’t Python a Bad Choice for Concurrency?
Developers love Python for many reasons, most commonly because it is easy to use and fast for development.
Python is commonly used for glue code and one-off scripts, but more and more for large-scale software systems.
If you are using Python and you need concurrency, then you work with what you have. The question is moot.
If you need concurrency and you have not chosen a language, perhaps another language would be more appropriate, or perhaps not. Consider the full scope of functional and non-functional requirements (or user needs, wants, and desires) for your project and the capabilities of different development platforms.
Why Not Always Use ProcessPoolExecutor Instead?
The ProcessPoolExecutor supports pools of processes, unlike the ThreadPoolExecutor that supports pools of threads.
Threads and Processes are quite different and choosing one over the other is intentional.
A Python program is a process that has a main thread. You can create many additional threads in a Python process. You can also fork or spawn many Python processes, each of which will have one thread, and may spawn additional threads.
More broadly, threads are lightweight and can share memory (data and variables) within a process, whereas processes are heavyweight and require more overhead and impose more limits on sharing memory (data and variables).
Typically, processes are used for CPU-bound tasks and threads are used for IO-bound tasks, and this is a good heuristic, but this does not have to be the case.
Perhaps ProcessPoolExecutor is a better fit for your specific problem. Perhaps try it and see.
You can learn more about this topic here:
Why Not Use threading.Thread instead?
The ThreadPoolExecutor is like the “auto mode” for Python threading.
If you have a more sophisticated use case, you may need to use the threading.Thread class directly.
This may be because you require more synchronization between threads with locking mechanisms, and/or more coordination between threads with barriers and semaphores.
It may be that you have a simpler use case, such as a single task, in which case perhaps a thread pool would be too heavy a solution.
That being said, if you find yourself using the Thread class with the target keyword for pure functions (functions that don’t have side effects), perhaps you would be better suited to using the ThreadPoolExecutor.
You can learn more about this topic here:
Why Not Use AsyncIO?
AsyncIO can be an alternative to using a ThreadPoolExecutor.
AsyncIO is designed to support large numbers of IO operations, perhaps thousands to tens of thousands, all within a single Thread.
It requires an alternative programming paradigm based on coroutines and an event loop (asynchronous programming), which can be challenging for beginners.
Nevertheless, it may be a better alternative to using a thread pool for many applications.
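For comparison, a minimal sketch of the equivalent pattern in asyncio, issuing many IO-style tasks within a single thread:

import asyncio

# coroutine where a non-blocking sleep stands in for an IO operation
async def task(value):
    await asyncio.sleep(1)
    return value

async def main():
    # run all tasks concurrently and gather the results
    results = await asyncio.gather(*[task(i) for i in range(10)])
    print(results)

asyncio.run(main())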
You can learn more about this topic here:
Further Reading
This section provides additional resources that you may find helpful.
Books
- ThreadPoolExecutor Jump-Start, Jason Brownlee, (my book!)
- Concurrent Futures API Interview Questions
- ThreadPoolExecutor Class API Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ThreadPoolExecutor: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
- Python Threading: The Complete Guide
- Python ThreadPool: The Complete Guide
APIs
References
Conclusions
This is a large guide, and you have discovered in great detail how the ThreadPoolExecutor works and how to best use it on your project.
Did you find this guide useful?
I’d love to know, please share a kind word in the comments below.
Have you used the ThreadPoolExecutor on a project?
I’d love to hear about it, please let me know in the comments.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Matt Keanny says
Thanks so much for the clear explanations and examples. I was considering learning asyncio for io-bound tasks. This article made it clear to me that the ThreadPoolExecutor is an easier fit. A really great guide to ThreadPoolExecutor.
Jason Brownlee says
Thanks Matt, I’m happy it helps!
Let me know how you go with your project and reach out if you have any concurrency questions.
Adam C says
Jason, I’m working with ProcessPoolExecutor instead of ThreadPool, but this guide has been a total lifesaver! This beats the Python documentation by a long shot. Thank you so much!
Jason Brownlee says
Thanks Adam, I’m happy to hear that it helps!
If you’re working with the ProcessPoolExecutor, see this guide instead:
https://superfastpython.com/processpoolexecutor-in-python/
Vick says
The most useful guide about ThreadPoolExecutor. Thank You!
Jason Brownlee says
Thanks Vick, I’m happy it helped!
Iain Samuel McLean Elder says
Thanks, Jason. This is the missing manual for Python concurrency! Your description of the common patterns has given me useful vocabulary for improvements to the design of my code.
Jason Brownlee says
Thank you, I’m very happy to hear that the examples and patterns help!
Olivier Truffinet says
Thanks a lot for this amazing guide!
Jason Brownlee says
You’re welcome Olivier.
Abdulhamit Akaslan says
That’s a great guide. Thanks for your hard work. The only thing I can say is that the code editor doesn’t look good and the code is hard to read. It would be nicer with a better code editor. I mean the code parts.
Jason Brownlee says
Thanks for your support.
Appreciate the suggestion.
Neeraj says
Great explanations. I wanted to buy paperback copies of these books but it looks like only the Kindle version is available on Amazon India.
Jason Brownlee says
Thanks!
Sorry to hear that you're having trouble getting the books from Amazon India.
I configured the paperbacks of the jump-start guides to be available worldwide. The Interview questions books are only available in kindle format at this stage.
Perhaps there is a limitation with Amazon India and print on demand. Or perhaps it takes some time for the print-on-demand books to become available on Amazon India. You could try getting the paperback through a different Amazon store, e.g. Amazon US or Amazon Australia.
Alternately, you can get the PDF and EPUB versions of all books from here:
https://superfastpython.com/products/
David M says
This is an excellent guide – much appreciated.
The one thing where I wish you had more was on the “thread local storage” example, near the end. I don’t understand how thread local storage works, even after reading the documentation and the source code! I implemented a version of your example code, and it worked perfectly, but I sure wish I understood why it worked!
Jason Brownlee says
Thanks David!
Perhaps this focused tutorial on thread local will help:
https://superfastpython.com/thread-local-data/
And this for how it is implemented in the OS:
https://en.wikipedia.org/wiki/Thread-local_storage
mahendra says
Thanks Jason, really appreciate the way you have explained all the concepts in simple to understand language.
Jason Brownlee says
You’re very welcome!
Chen says
Hey Jason!
great article, helped me a lot.
We encountered an issue with the ThreadPoolExecutor when running on Linux.
We have an app that receives and sends messages from IBM MQ (using the pymqi library).
The main thread reads from a queue, and for each incoming message submits a task that handles that message and replies back to another queue.
It works great on Windows, really fast. On Linux we encountered a weird behavior when the worker thread tries to send a message – it looks like it freezes sometimes and the send function can take up to 1 minute instead of several milliseconds.
any ideas what can cause this ?
Jason Brownlee says
Thank you, I’m happy it helped!
First step is to separate it from MQ and confirm the logic is working with mock data, e.g. that it’s not the thread pool.
It might be the specifics on the linux box or the MQ lib on linux, that would be my best guess.
Also confirm you’re running Python 3.11+.
Let me know how you go.
Tyler West says
Jason,
I love your whole site. I’m just getting started with Python concurrency due to specific needs at work. This has been amazing.
Now for the ask… I’m having some struggle understanding how to do a specific set of tasks using both ThreadPoolExecutor and queues. And maybe concurrent.futures is not the right approach. I want to use a thread pool to specifically limit the number of threads (multi-user shared Docker image) and make execution as efficient as possible.
I am reading data from a file one line at a time since I don’t know how large a given file may be. For each line I am using regex to identify a specific piece of data (a MAC address) and then converting it to an IPv6 address. This is all in the main thread. In that loop of reading and processing the data I want to immediately put the derived IPv6 address in a queue. In essence I want the thread workers to start getting from the queue as they arrive and executing their tasks.
The thread worker task, in a nutshell, is to get the IPv6 address from the queue and use requests to check the state of a node. The result (or exception) needs to be recorded and returned. I don’t care about order. That result will then be appended to a list or written to a file or pickled or whatever.
If I am to use concurrent.futures I assume that the executor.submit() approach is the appropriate path since I don’t care about order, but am struggling with the iterator because I don’t really have one. I don’t want to use the for loop with the readline() in the submit as I don’t want the processing of the line in the worker thread, just the network IO. I think passing the queue as the argument is the right approach but I have also seen an initializer function used that sets it up as a global. I believe I can use as_completed(future.result()) to get my requests results back into my main thread and begin storing them as they are received.
It’s really the iterator and the list comprehension with the executor.submit that is throwing me. I’m continuing to toy with it but would appreciate some guidance.
Jason Brownlee says
Thank you kindly Tyler!
Consider reading chunks of the file, if possible. Then split into lines. Repeat.
Sounds like threads are a great fit, e.g. open a socket connection.
Yes, issue one submit() call for each IP address. Have each task push its result onto a queue and have another task that reads from the queue and appends results to wherever they need to go. This way the main thread can focus on one task, which is reading lines from the input file.
Does that help?
Shoot me an email if you have more questions: Jason@SuperFastPython.com
Tyler West says
Thank you so much Jason. This definitely helps get me started.
Jason Brownlee says
You’re very welcome!
Abhinas Regmi says
The best resource for concurrency. I read the docs but didn’t understand them fully, but this beats the docs any day of the week. Loved reading this.
Jason Brownlee says
Thank you kindly, I’m happy it helped!
DevId says
Hi Jason,
great article!
I am using concurrency to scrape web pages.
Which of the following is the best approach to correctly save data into a list?
Passing a list to the ‘download_url’ function (datalist1) or append data to a list (datalist2) when each future is completed?
datalist1 = []
datalist2 = []
with ThreadPoolExecutor(n_threads) as executor:
futures = [executor.submit(download_url, url, datalist2) for url in urls]
# process each result as it is available
for future in as_completed(futures):
# get the downloaded url data
data, url = future.result()
# code
datalist1.append(url)
Jason Brownlee says
Perhaps try both approaches and discover what works best for your program and use case?