Check the Status of Multiple Websites Concurrently in Python

November 17, 2021

The ThreadPoolExecutor class in Python can be used to check the status of multiple websites concurrently.

Using threads to check website status can dramatically speed up the process compared to checking the status of each website sequentially, one by one.

In this tutorial, you will discover how to concurrently check the status of websites on the internet using thread pools in Python.

After completing this tutorial, you will know:

  - How to check the status of websites sequentially, and why it is slow.
  - How to create and use a pool of worker threads with the ThreadPoolExecutor class.
  - How to check the status of many websites concurrently and report results as they become available.

Let's dive in.

Check the Status of Websites One-by-One (slowly)

Checking the status of websites on the internet is a common task.

For example, we may want to confirm that the websites we manage are up, verify whether a reported outage is real, or check a list of links for dead pages.

There are professional services we may use for some of these tasks. Nevertheless, sometimes we need a quick script to check the status of a number of websites.

There are many ways to check on the status of a website in Python via HTTP.
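
For example, one alternative (a minimal sketch, separate from the approach used in the rest of this tutorial) is to use the standard library http.client module directly; the host name below is just an illustration:

import http.client

# open a connection to the host with a timeout (host is an example only)
connection = http.client.HTTPSConnection('www.python.org', timeout=3)
try:
    # issue a HEAD request, which retrieves headers but no body
    connection.request('HEAD', '/')
    # read the response and report the status code, e.g. 200
    response = connection.getresponse()
    print(response.status)
finally:
    # always close the connection
    connection.close()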

Perhaps the most common is to open a connection to the server using the urlopen() function, then call the getcode() method on the HTTPResponse object that is returned.

For example, we can open the HTTP connection using the context manager interface and then get the status code. This ensures that the connection is closed automatically when we exit the block, whether normally or via an exception.

...
# open a connection to the server
with urlopen(url) as connection:
    # get the response code, e.g. 200
    code = connection.getcode()
    # ...

Given that we may call this script many times, possibly with intermittent internet connectivity, we may also want to add a timeout of a few seconds to the connection attempt.

This can be achieved when calling the urlopen() function via the timeout argument.

...
# open a connection to the server with a timeout
with urlopen(url, timeout=3) as connection:
    # get the response code, e.g. 200
    code = connection.getcode()

Opening an HTTP connection can fail for many reasons, such as a malformed URL or an unreachable server.

We can choose to handle some of these cases and return an appropriate message.

For example, if we are able to connect but the server reports an error, then an HTTPError will be raised, from which we can retrieve the specific HTTP status code.

...
# handle connection errors
try:
    # open a connection to the server with a timeout
    with urlopen(url, timeout=3) as connection:
        # get the response code, e.g. 200
        code = connection.getcode()
        # ...
except HTTPError as e:
    return e.code

Alternatively, a URLError may be raised if we are unable to connect, perhaps because the domain could not be resolved via DNS, in which case we might choose to retrieve the error message as the status.

...
# handle connection errors
try:
    # open a connection to the server with a timeout
    with urlopen(url, timeout=3) as connection:
        # get the response code, e.g. 200
        code = connection.getcode()
        # ...
except HTTPError as e:
    return e.code
except URLError as e:
    return e.reason

Finally, if anything else goes wrong, which is always possible, we can trap the general exception and return it as the status.

Tying this together, the get_website_status() function below takes a URL and returns the HTTP status, or a useful error message if there is a failure.

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

Once we have an error status, we need to interpret it.

Python offers the http module and the HTTPStatus class with constants for the most common HTTP status codes.

Nevertheless, we can keep our script simple by checking for a status 200 (OK), otherwise reporting an error.
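
The HTTPStatus constants are integer enumerations that compare equal to plain int status codes, which is why they can be checked directly against the code returned by getcode(); a quick standalone sketch:

from http import HTTPStatus

# members of HTTPStatus are integer enumerations
print(HTTPStatus.OK == 200)
# each member also carries a human-readable phrase
print(HTTPStatus.SERVICE_UNAVAILABLE.phrase)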

The get_status() function below takes an HTTP status code returned from the get_website_status() function and interprets it, returning the human-readable string 'OK' or 'ERROR'.

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'
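
For example, calling the helper directly (a quick hypothetical check):

...
# interpret example codes, e.g. 'OK' then 'ERROR'
print(get_status(200))
print(get_status(503))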

Next, we need to get the status, interpret it, and report the results for each URL in a list of URLs.

This ties the program together.

The list of URLs may be loaded from a database or a text file in practice. In this case, we will define a constant named URLS that lists ten of the most popular websites on the internet in no particular order.

# list of urls to check
URLS = ['https://twitter.com',
        'https://google.com',
        'https://facebook.com',
        'https://reddit.com',
        'https://youtube.com',
        'https://amazon.com',
        'https://wikipedia.org',
        'https://ebay.com',
        'https://instagram.com',
        'https://cnn.com']

We can then define a function that takes a list of URLs and gets the status of each: first calling the get_website_status() function to get the HTTP code, then get_status() to interpret the code, before calling the print() function to report the status.

...
# get the status for the website
code = get_website_status(url)
# interpret the status
status = get_status(code)
# report status
print(f'{url:20s}\t{status:5s}\t{code}')

Tying this together, the check_status_urls() function below reports on the status of a list of URLs.

# check status of a list of websites
def check_status_urls(urls):
    for url in urls:
        # get the status for the website
        code = get_website_status(url)
        # interpret the status
        status = get_status(code)
        # report status
        print(f'{url:20s}\t{status:5s}\t{code}')

We can then call this function with our predefined list of URLs.

...
# check all urls
check_status_urls(URLS)

And that's it.

Tying this together, the complete example of checking the status of a list of URLs is listed below.

# SuperFastPython.com
# example of pinging the status of a set of websites
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from http import HTTPStatus

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'

# check status of a list of websites
def check_status_urls(urls):
    for url in urls:
        # get the status for the website
        code = get_website_status(url)
        # interpret the status
        status = get_status(code)
        # report status
        print(f'{url:20s}\t{status:5s}\t{code}')

# list of urls to check
URLS = ['https://twitter.com',
        'https://google.com',
        'https://facebook.com',
        'https://reddit.com',
        'https://youtube.com',
        'https://amazon.com',
        'https://wikipedia.org',
        'https://ebay.com',
        'https://instagram.com',
        'https://cnn.com']
# check all urls
check_status_urls(URLS)

Running the example gets the status code, which is then interpreted and reported for each URL in the list.

In this case, all URLs report an OK status except amazon.com, which reports a 503 (Service Unavailable) error. This might be because Amazon prefers the www prefix, or because it does not like being pinged by unknown web clients.

https://twitter.com 	OK   	200
https://google.com  	OK   	200
https://facebook.com	OK   	200
https://reddit.com  	OK   	200
https://youtube.com 	OK   	200
https://amazon.com  	ERROR	503
https://wikipedia.org	OK   	200
https://ebay.com    	OK   	200
https://instagram.com	OK   	200
https://cnn.com     	OK   	200
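
If a site rejects the default Python client like this, one possible workaround (a sketch only, not used in the examples in this tutorial) is to supply a browser-like User-Agent header via a Request object:

...
from urllib.request import Request, urlopen
# build a request with a browser-like user agent (header value is illustrative)
request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# open the connection as before, with a timeout
with urlopen(request, timeout=3) as connection:
    # get the response code, e.g. 200
    code = connection.getcode()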

Checking the status of websites one-by-one is slow.

On my system, it took a little less than ten seconds to query all ten websites.

How long did it take to run on your system?
Let me know in the comments below.
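
If you would like to measure the runtime yourself, a minimal sketch (assuming the complete example above) wraps the call with time.perf_counter():

...
from time import perf_counter
# record the start time
start = perf_counter()
# check all urls
check_status_urls(URLS)
# calculate and report the overall duration
duration = perf_counter() - start
print(f'Took {duration:.3f} seconds')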

Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up the status checking process.

Create a Pool of Worker Threads With ThreadPoolExecutor

We can use the ThreadPoolExecutor to speed up the checking of the status of multiple websites.

The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.

The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.

Generally, the ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like crunching numbers.

The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the "automatic mode" for Python threads. The general usage pattern is:

  1. Create the thread pool by calling ThreadPoolExecutor().
  2. Submit tasks and get futures by calling submit().
  3. Wait and get results as tasks complete by calling as_completed().
  4. Shutdown the thread pool by calling shutdown().

Step 1: Create the Thread Pool

First, a ThreadPoolExecutor instance must be created.

By default, it will create a pool of min(32, os.cpu_count() + 4) worker threads, that is, the number of logical CPUs you have available plus four, capped at 32.

This is good for most purposes.

...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()
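
If you are curious, the default worker count can be computed directly; this sketch assumes the Python 3.8+ rule:

import os

# default number of workers in Python 3.8+: logical cpus plus four, capped at 32
default_workers = min(32, os.cpu_count() + 4)
print(default_workers)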

You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands.

You can specify the number of threads to create in the pool via the max_workers argument; for example:

...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)

Step 2: Submit Tasks to the Thread Pool

Once created, we can send tasks into the pool to be completed using the submit() function.

This function takes the function to call and any and all arguments, and returns a Future object.

The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.

...
# submit a task and receive a future
future = executor.submit(task, arg1, arg2, ...)

The return value from a function executed by the thread pool can be accessed via the result() function on the Future object. It will block until the result is available, if needed, or return immediately if the task has already completed.

For example:

...
# get the result from a future
result = future.result()
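
The Future provides other useful methods too; for example, this sketch checks for completion without blocking and bounds how long we wait for a result:

...
# check whether the task has finished, without blocking
if future.done():
    result = future.result()
# wait at most 5 seconds, raising a TimeoutError if the task is not done in time
result = future.result(timeout=5)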

An alternate and perhaps simpler approach to submitting asynchronous tasks to the thread pool is to use the map() function provided on the ThreadPoolExecutor.

It works just the same as the built-in map() function. It takes the name of the target task function and an iterable. The target function is then called for each item in the iterable, and because we are using a thread pool, each call is submitted as an asynchronous task.

The map() function then returns an iterable over the results from each function call. We can call the map() function and iterate over the results in a for-loop directly; for example:

...
# perform all tasks in parallel
for result in executor.map(task, items):
    # ...

Step 3: Get Results as Tasks Complete

The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.

The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just like its name suggests.

We can call the function with the list of Future objects created by calling submit(), and it will yield Future objects as they are completed, in whatever order that happens.

For example, we can use a list comprehension to submit the tasks and create the list of Future objects:

...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(work, task) for task in tasks]

Then get results for tasks as they complete in a for loop:

# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
	# get the result
	result = future.result()
	# do something with the result...

Step 4: Shutdown the Thread Pool

Once all tasks are completed, we can close down the thread pool, which will release each thread and any resources it may hold (e.g. the stack space).

...
# shutdown the thread pool
executor.shutdown()
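
The shutdown() function also takes a wait argument, and from Python 3.9 onward a cancel_futures argument, for finer control; for example:

...
# return immediately; running tasks complete in the background
executor.shutdown(wait=False)
# python 3.9+: also cancel any tasks that have not started running yet
executor.shutdown(wait=False, cancel_futures=True)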

An easier way to use the thread pool is via the context manager (the with keyword), which ensures it is closed automatically once we are finished with it.

...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
	# submit tasks
	futures = [executor.submit(work, task) for task in tasks]
	# get results as they are available
	for future in as_completed(futures):
		# get the result
		result = future.result()
		# do something with the result...

Now that we are familiar with ThreadPoolExecutor and how to use it, let's look at how we can adapt our program for checking the status of websites to make use of it.

Check the Status of Websites Concurrently

We can update our program to make use of the ThreadPoolExecutor with very few changes.

All of the changes are contained to the check_status_urls() function.

First, we can create a thread pool using the context manager and set the number of worker threads to the number of URLs in our list. This way, we can attempt to query all ten websites at once.

...
# create the thread pool
with ThreadPoolExecutor(len(urls)) as executor:
	# ...

Next, we need to perform asynchronous calls to the get_website_status() function.

Perhaps the simplest approach is to use the map() function on the ThreadPoolExecutor.

Each task is a call to get_website_status() with a URL argument, returning an HTTP response code that we can then interpret and report.

...
# check the status of each url
for code in executor.map(get_website_status, urls):
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{status:5s}\t{code}')

The problem with this approach is that we have no way to associate the HTTP response code with each URL.

We could add a counter and increment it as we iterate through the results, then use the counter to access the appropriate address in the list of URLs.
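
For example, a sketch of that counter-based approach (shown only for contrast) might look as follows:

...
# counter used to look up the matching url
i = 0
# check the status of each url
for code in executor.map(get_website_status, urls):
    # get the url for this result via the counter
    url = urls[i]
    i += 1
    # ...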

A better version of this solution is to first retrieve the iterable of results from the call to map().

...
# check the status of each url
results = executor.map(get_website_status, urls)

We can then call the enumerate() built-in function on the iterable to get an index and the item from the iterable for each iteration in a for loop. The index can be used directly to get the URL in the list for the result.

...
# enumerate the results in the same order as the list of urls
for i,code in enumerate(results):
    # get the url
    url = urls[i]
    # ...

For example:

...
# check the status of each url
results = executor.map(get_website_status, urls)
# enumerate the results in the same order as the list of urls
for i,code in enumerate(results):
    # get the url
    url = urls[i]
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{url:20s}\t{status:5s}\t{code}')

Tying this together, the complete example of querying a list of websites asynchronously using a ThreadPoolExecutor in Python is listed below.

# SuperFastPython.com
# example of pinging the status of a set of websites asynchronously
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from http import HTTPStatus
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'

# check status of a list of websites
def check_status_urls(urls):
    # create the thread pool
    with ThreadPoolExecutor(len(urls)) as executor:
        # check the status of each url
        results = executor.map(get_website_status, urls)
        # enumerate the results in the same order as the list of urls
        for i,code in enumerate(results):
            # get the url
            url = urls[i]
            # interpret the status
            status = get_status(code)
            # report status
            print(f'{url:20s}\t{status:5s}\t{code}')

# list of urls to check
URLS = ['https://twitter.com',
        'https://google.com',
        'https://facebook.com',
        'https://reddit.com',
        'https://youtube.com',
        'https://amazon.com',
        'https://wikipedia.org',
        'https://ebay.com',
        'https://instagram.com',
        'https://cnn.com']
# check all urls
check_status_urls(URLS)

Running the example reports the websites and their status as before, but this time, the code is much faster.

On my system, all sites are queried in about 1.5 seconds, which is much faster than nearly ten seconds in the sequential version, about a 5x speedup.

https://twitter.com 	OK   	200
https://google.com  	OK   	200
https://facebook.com	OK   	200
https://reddit.com  	OK   	200
https://youtube.com 	OK   	200
https://amazon.com  	OK   	200
https://wikipedia.org	OK   	200
https://ebay.com    	OK   	200
https://instagram.com	OK   	200
https://cnn.com     	OK   	200

What if we wanted to use the submit() function to dispatch the tasks and report the status for results as the tasks are completed, rather than the order that the URLs are listed?

This would require a slightly more complicated solution.

Recall that the submit() function returns a Future object that provides a handle on the asynchronous task. For example, we can use a list comprehension to create a list of Future objects by calling submit() for the get_website_status() function with each URL in the list of URLs.

...
# submit each url as a task
futures = [executor.submit(get_website_status, url) for url in urls]

The tricky part is we don't have a way to associate the Future or its result with the URL.

We could enumerate the futures in the same order as the list of URLs and use the enumerate() function as we did with the iterable of results returned when calling map(); for example:

...
# submit each url as a task
futures = [executor.submit(get_website_status, url) for url in urls]
# get results as they are available
for i,future in enumerate(futures):
    # get the url for the future
    url = urls[i]
    # get the status for the website
    code = future.result()
    # interpret the status
    status = get_status(code)
    # report status
    print(f'{url:20s}\t{status:5s}\t{code}')

This would work, but it is a more complicated version of calling map(); we may as well just call map().

It also does not give us the benefit of reporting website status as it becomes available via the as_completed() function.

An alternative approach is to create a dictionary that maps each Future object to the data that was passed to the Future's task, in this case, the URL argument to get_website_status(). This idea is also suggested in the ThreadPoolExecutor's API documentation.

For example:

...
# submit each task, create a mapping of futures to urls
future_to_url = {executor.submit(get_website_status, url):url for url in urls}

We can then call the as_completed() function with the future_to_url dictionary and process the Future objects in the order that they are completed.

...
# get results as they are available
for future in as_completed(future_to_url):
	# ...

We can then look-up the URL value in the future_to_url dictionary for the given Future key; for example:

...
# get the url for the future
url = future_to_url[future]

We now have everything we need to report website status in the order that the status is returned, rather than the order in which we query the websites. This can make the program feel slightly faster and more responsive than reporting results in submission order.

The updated version of the check_status_urls() function is listed below.

# check status of a list of websites
def check_status_urls(urls):
    # create the thread pool
    with ThreadPoolExecutor(len(urls)) as executor:
        # submit each task, create a mapping of futures to urls
        future_to_url = {executor.submit(get_website_status, url):url for url in urls}
        # get results as they are available
        for future in as_completed(future_to_url):
            # get the url for the future
            url = future_to_url[future]
            # get the status for the website
            code = future.result()
            # interpret the status
            status = get_status(code)
            # report status
            print(f'{url:20s}\t{status:5s}\t{code}')

Tying this together, the complete example of querying websites and reporting results in task completion order is listed below.

# SuperFastPython.com
# example of pinging the status of a set of websites asynchronously
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from http import HTTPStatus
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed

# get the status of a website
def get_website_status(url):
    # handle connection errors
    try:
        # open a connection to the server with a timeout
        with urlopen(url, timeout=3) as connection:
            # get the response code, e.g. 200
            code = connection.getcode()
            return code
    except HTTPError as e:
        return e.code
    except URLError as e:
        return e.reason
    except Exception as e:
        return e

# interpret an HTTP response code into a status
def get_status(code):
    if code == HTTPStatus.OK:
        return 'OK'
    return 'ERROR'

# check status of a list of websites
def check_status_urls(urls):
    # create the thread pool
    with ThreadPoolExecutor(len(urls)) as executor:
        # submit each task, create a mapping of futures to urls
        future_to_url = {executor.submit(get_website_status, url):url for url in urls}
        # get results as they are available
        for future in as_completed(future_to_url):
            # get the url for the future
            url = future_to_url[future]
            # get the status for the website
            code = future.result()
            # interpret the status
            status = get_status(code)
            # report status
            print(f'{url:20s}\t{status:5s}\t{code}')

# list of urls to check
URLS = ['https://twitter.com',
        'https://google.com',
        'https://facebook.com',
        'https://reddit.com',
        'https://youtube.com',
        'https://amazon.com',
        'https://wikipedia.org',
        'https://ebay.com',
        'https://instagram.com',
        'https://cnn.com']
# check all urls
check_status_urls(URLS)

Running the example, we can see that website statuses are reported in a different order from the list of URLs.

Specifically, the results are reported in the order that websites responded to the query. We can see that Amazon was the slowest to respond, likely because it does not like the specific address or Python client we are using to make the query.

On my system, this example is also slightly faster, shaving off a fraction of a second.

https://twitter.com 	OK   	200
https://cnn.com     	OK   	200
https://reddit.com  	OK   	200
https://youtube.com 	OK   	200
https://google.com  	OK   	200
https://wikipedia.org	OK   	200
https://facebook.com	OK   	200
https://instagram.com	OK   	200
https://ebay.com    	OK   	200
https://amazon.com  	ERROR	503

Takeaways

In this tutorial, you discovered how to concurrently check the status of websites on the internet using thread pools in Python.