Last Updated on September 12, 2022
The ThreadPoolExecutor class in Python can be used to check the status of multiple websites concurrently.
Using threads to check website status can dramatically speed-up the process compared to checking the status of each website sequentially, one by one.
In this tutorial, you will discover how to concurrently check the status of websites on the internet using thread pools in Python.
After completing this tutorial, you will know:
- How to check the status of websites on the internet sequentially in Python and how slow it can be.
- How to use the ThreadPoolExecutor to manage a pool of worker threads.
- How to update code to check multiple websites at the same time and dramatically accelerate the process.
Let’s dive in.
Check the Status of Websites One-by-One (slowly)
Checking the status of websites on the internet is a common task.
For example:
- You may need to check if a suite of company websites are accessible and take action if they are not.
- You may need to check if websites start generating errors, such as under heavy load.
- You may need to check if your internet connection is working correctly.
There are professional services we may use for some of these tasks. Nevertheless, sometimes we need a quick script to check on the status of a number of websites.
There are many ways to check on the status of a website in Python via HTTP.
Perhaps the most common is to open a connection to the server using the urlopen() function, then call the getcode() method on the HTTPResponse object that is returned.
For example, we can use the context manager to open the HTTP connection and get the status code, which ensures that the connection is closed automatically when we exit the block, either normally or via an exception.
...
# open a connection to the server
with urlopen(url) as connection:
    # get the response code, e.g. 200
    code = connection.getcode()
    # ...
Also, given that we may be calling this script many times under possibly intermittent internet connectivity, we may also want to add a timeout to the connection attempt, perhaps a few seconds.
This can be achieved when calling the urlopen() function via the timeout argument.
...
# open a connection to the server with a timeout
with urlopen(url, timeout=3) as connection:
    # get the response code, e.g. 200
    code = connection.getcode()
Opening an HTTP connection can fail for many reasons, such as a malformed URL or an unreachable server.
We can choose to handle some of these cases and return an appropriate message.
For example, if we are able to connect but the server reports an error, then an HTTPError will be raised on which we can retrieve the specific HTTP status code.
...
# handle connection errors
try:
    # open a connection to the server with a timeout
    with urlopen(url, timeout=3) as connection:
        # get the response code, e.g. 200
        code = connection.getcode()
        # ...
except HTTPError as e:
    return e.code
Alternatively, a URLError could be thrown if we are unable to connect, perhaps because the domain could not be resolved via DNS, in which case we might choose to retrieve the error message as the status.
...
# handle connection errors
try:
    # open a connection to the server with a timeout
    with urlopen(url, timeout=3) as connection:
        # get the response code, e.g. 200
        code = connection.getcode()
        # ...
except HTTPError as e:
    return e.code
except URLError as e:
    return e.reason
Finally, if anything else goes wrong, and this is always possible, we can trap the general exception and return it as the status.
Tying this together, the get_website_status() function below takes a URL and returns the HTTP status, or a useful error message if there is a failure.
# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e
Once we have an error status, we need to interpret it.
Python offers the http module and the HTTPStatus class with constants for the most common HTTP status codes.
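For example, HTTPStatus is an IntEnum, so each constant compares equal to its integer status code and can be checked directly against the value returned by getcode(); a quick illustration:

from http import HTTPStatus
# each constant compares equal to its integer status code
print(HTTPStatus.OK == 200)                   # True
print(int(HTTPStatus.NOT_FOUND))              # 404
print(HTTPStatus.SERVICE_UNAVAILABLE == 503)  # True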
Nevertheless, we can keep our script simple by checking for a status 200 (OK), otherwise reporting an error.
The get_status() function below takes an HTTP status code returned from the get_website_status() function and interprets it into a human-readable message, either 'OK' or 'ERROR'.
# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'
Next, we need to get the status, interpret it, and report the results for each URL in a list of URLs.
This ties the program together.
The list of URLs may be loaded from a database or a text file in practice. In this case, we will define a constant named URLS that lists ten of the most popular websites on the internet in no particular order.
# list of urls to check
URLS = ['https://twitter.com',
    'https://google.com',
    'https://facebook.com',
    'https://reddit.com',
    'https://youtube.com',
    'https://amazon.com',
    'https://wikipedia.org',
    'https://ebay.com',
    'https://instagram.com',
    'https://cnn.com']
We can then define a function that takes a list of URLs and gets the status for each with a call first to the get_website_status() function to get the HTTP code, then to get_status() to interpret the code, before calling the print function to report the status.
...
# get the status for the website
code = get_website_status(url)
# interpret the status
status = get_status(code)
# report status
print(f'{url:20s}\t{status:5s}\t{code}')
Tying this together, the check_status_urls() function that reports on the status of a list of URLs is listed below.
# check status of a list of websites
def check_status_urls(urls):
    for url in urls:
        # get the status for the website
        code = get_website_status(url)
        # interpret the status
        status = get_status(code)
        # report status
        print(f'{url:20s}\t{status:5s}\t{code}')
We can then call this function with our predefined list of URLs.
...
# check all urls
check_status_urls(URLS)
And that’s it.
Tying this together, the complete example of checking the status of a list of URLs is listed below.
# SuperFastPython.com
# example of pinging the status of a set of websites
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from http import HTTPStatus

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'

# check status of a list of websites
def check_status_urls(urls):
    for url in urls:
        # get the status for the website
        code = get_website_status(url)
        # interpret the status
        status = get_status(code)
        # report status
        print(f'{url:20s}\t{status:5s}\t{code}')

# list of urls to check
URLS = ['https://twitter.com',
    'https://google.com',
    'https://facebook.com',
    'https://reddit.com',
    'https://youtube.com',
    'https://amazon.com',
    'https://wikipedia.org',
    'https://ebay.com',
    'https://instagram.com',
    'https://cnn.com']
# check all urls
check_status_urls(URLS)
Running the example gets the status code, which is then interpreted and reported for each URL in the list.
In this case, all URLs report an OK status except amazon.com, which reports a 503 error. This might be because Amazon prefers the www prefix, or because it does not like to be pinged by unknown web clients.
https://twitter.com 	OK   	200
https://google.com  	OK   	200
https://facebook.com	OK   	200
https://reddit.com  	OK   	200
https://youtube.com 	OK   	200
https://amazon.com  	ERROR	503
https://wikipedia.org	OK   	200
https://ebay.com    	OK   	200
https://instagram.com	OK   	200
https://cnn.com     	OK   	200
Checking the status of websites one-by-one is slow.
On my system, it took a little less than ten seconds to query all ten websites.
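If you want to time it on your own machine, you can wrap the call to check_status_urls() in a simple timer; a minimal sketch using the time module:

...
# record the start time
from time import perf_counter
start = perf_counter()
# check all urls
check_status_urls(URLS)
# report the overall duration
print(f'Took {perf_counter() - start:.3f} seconds')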
How long did it take to run on your system?
Let me know in the comments below.
Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up the status checking process.
Create a Pool of Worker Threads With ThreadPoolExecutor
We can use the ThreadPoolExecutor to speed up the checking of multiple website statuses.
The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.
The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.
Generally, ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like calculating.
The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the “automatic mode” for Python threads.
- Create the thread pool by calling ThreadPoolExecutor().
- Submit tasks and get futures by calling submit().
- Wait and get results as tasks complete by calling as_completed().
- Shut down the thread pool by calling shutdown().
Step 1: Create the Thread Pool
First, a ThreadPoolExecutor instance must be created.
By default, it will create a pool with min(32, os.cpu_count() + 4) worker threads, that is, the number of logical CPUs you have available plus four, capped at 32.
This is good for most purposes.
...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()
You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands.
You can specify the number of threads to create in the pool via the max_workers argument; for example:
...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)
Step 2: Submit Tasks to the Thread Pool
Once created, we can send tasks into the pool to be completed using the submit() function.
This function takes the name of the function to call and any arguments, and returns a Future object.
The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.
...
# submit a task and receive a future
future = executor.submit(task, arg1, arg2, ...)
The return value from a function executed by the thread pool can be accessed via the result() method on the Future object. It will block until the result is available, if needed, or return immediately if the task has already completed.
For example:
...
# get the result from a future
result = future.result()
An alternate and perhaps simpler approach to submitting asynchronous tasks to the thread pool is to use the map() function provided on the ThreadPoolExecutor.
It works just the same as the built-in map() function. It takes the name of the target task function and an iterable. The target function is then called for each item in the iterable, and because we are using a thread pool, each call is submitted as an asynchronous task.
The map() function then returns an iterable over the results from each function call. We can call the map() function and iterate over the results in a for-loop directly; for example:
# perform all tasks in parallel
for result in executor.map(task, items):
    # ...
Step 3: Get Results as Tasks Complete
The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.
The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just like its name suggests.
We can call the function and provide it a list of Future objects created by calling submit(), and it will yield Future objects in whatever order they complete.
For example, we can use a list comprehension to submit the tasks and create the list of Future objects:
...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(work, task) for task in tasks]
Then get results for tasks as they complete in a for loop:
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
    # get the result
    result = future.result()
    # do something with the result...
Step 4: Shutdown the Thread Pool
Once all tasks are completed, we can close down the thread pool, which will release each thread and any resources it may hold (e.g. the stack space).
...
# shutdown the thread pool
executor.shutdown()
An easier way to use the thread pool is via the context manager (the with keyword), which ensures it is closed automatically once we are finished with it.
...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
    # submit tasks
    futures = [executor.submit(work, task) for task in tasks]
    # get results as they are available
    for future in as_completed(futures):
        # get the result
        result = future.result()
        # do something with the result...
Now that we are familiar with ThreadPoolExecutor and how to use it, let’s look at how we can adapt our program for checking the status of websites to make use of it.
Check the Status of Websites Concurrently
We can update our program to make use of the ThreadPoolExecutor with very few changes.
All of the changes are contained within the check_status_urls() function.
First, we can create a thread pool using the context manager and set the number of worker threads to the number of URLs in our list. This way, we can attempt to query all ten websites at once.
...
# create the thread pool
with ThreadPoolExecutor(len(urls)) as executor:
    # ...
Next, we need to perform asynchronous calls to the get_website_status() function.
Perhaps the simplest approach is to use the map() function on the ThreadPoolExecutor.
Each task is a call to get_website_status() with a URL argument, which returns an HTTP response code that we can then interpret and report.
...
# check the status of each url
for code in executor.map(get_website_status, urls):
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{status:5s}\t{code}')
The problem with this approach is that we have no way to associate the HTTP response code with each URL.
We could add a counter and increment it as we iterate through the results, then use the counter to look up the matching URL in the list of URLs.
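A rough sketch of this counter-based approach might look as follows (map() returns results in the same order as the submitted URLs, so the counter stays in step):

...
# check the status of each url
i = 0
for code in executor.map(get_website_status, urls):
    # look up the url that produced this result
    url = urls[i]
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{url:20s}\t{status:5s}\t{code}')
    # move on to the next url
    i += 1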
A better version of this solution is to first retrieve the iterable of results from the call to map().
...
# check the status of each url
results = executor.map(get_website_status, urls)
We can then call the enumerate() built-in function on the iterable to get an index and the item from the iterable for each iteration in a for loop. The index can be used directly to get the URL in the list for the result.
...
# enumerate the results in the same order as the list of urls
for i,code in enumerate(results):
    # get the url
    url = urls[i]
    # ...
For example:
...
# check the status of each url
results = executor.map(get_website_status, urls)
# enumerate the results in the same order as the list of urls
for i,code in enumerate(results):
    # get the url
    url = urls[i]
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{url:20s}\t{status:5s}\t{code}')
Tying this together, the complete example of querying a list of websites asynchronously using a ThreadPoolExecutor in Python is listed below.
# SuperFastPython.com
# example of pinging the status of a set of websites asynchronously
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from http import HTTPStatus
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'

# check status of a list of websites
def check_status_urls(urls):
    # create the thread pool
    with ThreadPoolExecutor(len(urls)) as executor:
        # check the status of each url
        results = executor.map(get_website_status, urls)
        # enumerate the results in the same order as the list of urls
        for i,code in enumerate(results):
            # get the url
            url = urls[i]
            # interpret the status
            status = get_status(code)
            # report status
            print(f'{url:20s}\t{status:5s}\t{code}')

# list of urls to check
URLS = ['https://twitter.com',
    'https://google.com',
    'https://facebook.com',
    'https://reddit.com',
    'https://youtube.com',
    'https://amazon.com',
    'https://wikipedia.org',
    'https://ebay.com',
    'https://instagram.com',
    'https://cnn.com']
# check all urls
check_status_urls(URLS)
Running the example reports the websites and their status as before, but this time, the code is much faster.
On my system, all sites are queried in about 1.5 seconds, which is much faster than nearly ten seconds in the sequential version, about a 5x speedup.
https://twitter.com 	OK   	200
https://google.com  	OK   	200
https://facebook.com	OK   	200
https://reddit.com  	OK   	200
https://youtube.com 	OK   	200
https://amazon.com  	OK   	200
https://wikipedia.org	OK   	200
https://ebay.com    	OK   	200
https://instagram.com	OK   	200
https://cnn.com     	OK   	200
What if we wanted to use the submit() function to dispatch the tasks and report the status for results as the tasks are completed, rather than the order that the URLs are listed?
This would require a slightly more complicated solution.
Recall that the submit() function returns a Future object that provides a handle on the asynchronous task. For example, we can use a list comprehension to create a list of Future objects by calling submit() for the get_website_status() function with each URL in the list of URLs.
...
# submit each url as a task
futures = [executor.submit(get_website_status, url) for url in urls]
The tricky part is that we don’t have a way to associate the Future or its result with the URL that was submitted.
We could enumerate the futures in the same order as the list of URLs and use the enumerate() function as we did with the iterable of results returned when calling map(); for example:
...
# submit each url as a task
futures = [executor.submit(get_website_status, url) for url in urls]
# get results as they are available
for i,future in enumerate(futures):
    # get the url for the future
    url = urls[i]
    # get the status for the website
    code = future.result()
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{url:20s}\t{status:5s}\t{code}')
This would work, but it is a more complicated version of calling map(); we may as well just call map().
It also does not give us the benefit of reporting website status as it becomes available via the as_completed() function.
An alternative approach is to create a dictionary that maps each Future object to the data that was passed to the Future’s task, in this case, the URL argument to get_website_status(). This idea is also suggested in the ThreadPoolExecutor’s API documentation.
For example:
...
# submit each task, create a mapping of futures to urls
future_to_url = {executor.submit(get_website_status, url):url for url in urls}
We can then call the as_completed() function with the future_to_url dictionary and process the Future objects in the order that they are completed.
...
# get results as they are available
for future in as_completed(future_to_url):
    # ...
We can then look-up the URL value in the future_to_url dictionary for the given Future key; for example:
...
# get the url for the future
url = future_to_url[future]
We now have everything we need to report website status in the order that the status is returned, rather than the order that we query the websites. This might be slightly faster and more responsive than the linear version.
The updated version of the check_status_urls() function is listed below.
# check status of a list of websites
def check_status_urls(urls):
    # create the thread pool
    with ThreadPoolExecutor(len(urls)) as executor:
        # submit each task, create a mapping of futures to urls
        future_to_url = {executor.submit(get_website_status, url):url for url in urls}
        # get results as they are available
        for future in as_completed(future_to_url):
            # get the url for the future
            url = future_to_url[future]
            # get the status for the website
            code = future.result()
            # interpret the status
            status = get_status(code)
            # report status
            print(f'{url:20s}\t{status:5s}\t{code}')
Tying this together, the complete example of querying websites and reporting results in task completion order is listed below.
# SuperFastPython.com
# example of pinging the status of a set of websites asynchronously
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from http import HTTPStatus
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'

# check status of a list of websites
def check_status_urls(urls):
    # create the thread pool
    with ThreadPoolExecutor(len(urls)) as executor:
        # submit each task, create a mapping of futures to urls
        future_to_url = {executor.submit(get_website_status, url):url for url in urls}
        # get results as they are available
        for future in as_completed(future_to_url):
            # get the url for the future
            url = future_to_url[future]
            # get the status for the website
            code = future.result()
            # interpret the status
            status = get_status(code)
            # report status
            print(f'{url:20s}\t{status:5s}\t{code}')

# list of urls to check
URLS = ['https://twitter.com',
    'https://google.com',
    'https://facebook.com',
    'https://reddit.com',
    'https://youtube.com',
    'https://amazon.com',
    'https://wikipedia.org',
    'https://ebay.com',
    'https://instagram.com',
    'https://cnn.com']
# check all urls
check_status_urls(URLS)
Running the example, we can see that website statuses are reported in a different order from the list of URLs.
Specifically, the results are reported in the order that websites responded to the query. We can see that Amazon was the slowest to respond, likely because it does not like the specific address or Python client we are using to make the query.
On my system, this example is also slightly faster, shaving off a fraction of a second.
https://twitter.com 	OK   	200
https://cnn.com     	OK   	200
https://reddit.com  	OK   	200
https://youtube.com 	OK   	200
https://google.com  	OK   	200
https://wikipedia.org	OK   	200
https://facebook.com	OK   	200
https://instagram.com	OK   	200
https://ebay.com    	OK   	200
https://amazon.com  	ERROR	503
Free Python ThreadPoolExecutor Course
Download your FREE ThreadPoolExecutor PDF cheat sheet and get BONUS access to my free 7-day crash course on the ThreadPoolExecutor API.
Discover how to use the ThreadPoolExecutor class including how to configure the number of workers and how to execute tasks asynchronously.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ThreadPoolExecutor Jump-Start, Jason Brownlee, (my book!)
- Concurrent Futures API Interview Questions
- ThreadPoolExecutor Class API Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ThreadPoolExecutor: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
- Python Threading: The Complete Guide
- Python ThreadPool: The Complete Guide
Overwhelmed by the Python concurrency APIs?
Find relief, download my FREE Python Concurrency Mind Maps
Takeaways
In this tutorial, you discovered how to concurrently check the status of websites on the internet using thread pools in Python. Specifically, you learned:
- How to check the status of websites on the internet sequentially in Python and how slow it can be.
- How to use the ThreadPoolExecutor to manage a pool of worker threads.
- How to update code to check multiple websites at the same time and dramatically accelerate the process.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Photo by Simon Connellan on Unsplash
Stefano says
When I launch the script and it checks HTTPS URLs, it returns this error:
[SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:997)
Jason Brownlee says
Looks like you might have an issue with your Python installation.
Perhaps try searching/posting on stackoverflow.
Shreyas Mehra says
How do I store these results in a csv?
Jason Brownlee says
Sorry, I don’t have tutorials on writing csv files.
Perhaps prepare the data you want to write as a string, then open the file with the open() function and call the write() method.
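A minimal sketch of that idea using the standard csv module (the results list below is hypothetical, assuming the url, status, and code values were collected into tuples rather than printed):

import csv
# hypothetical results collected as (url, status, code) tuples instead of printed
results = [('https://google.com', 'OK', 200), ('https://amazon.com', 'ERROR', 503)]
# write one row per website to a csv file
with open('status.csv', 'w', newline='') as handle:
    writer = csv.writer(handle)
    # write a header row followed by the results
    writer.writerow(['url', 'status', 'code'])
    writer.writerows(results)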