How to Validate Links Concurrently in Python

December 24, 2021 Python ThreadPoolExecutor

The ThreadPoolExecutor class in Python can be used to validate multiple URL links at the same time.

This can dramatically speed-up the process compared to validating each URL sequentially, one by one.

In this tutorial, you will discover how to concurrently validate URL links in a webpage in Python.

After completing this tutorial, you will know:

  - How to download a webpage and validate each of its links one by one, and why this is slow.
  - How to use the ThreadPoolExecutor class to create a pool of worker threads.
  - How to validate multiple links concurrently and dramatically speed up the process.

Let's dive in.

Validate Links One-By-One (slowly)

Validating URL links in a web page is a common task.

Websites change over time and some close down, others rename pages. There are many reasons why the links in a webpage may break.

Finding broken URLs on a webpage can be time-consuming. In the worst case, we have to click each URL in turn and determine whether the target page loads or not. For more than one link, this could take a very long time.

Thankfully, we're developers, so we can write a script to first discover all links on a webpage, then check each to see if they work.

We can develop a program to validate links on a webpage one by one.

There are a few parts to this program; for example:

  1. Download the Target Webpage
  2. Parse HTML For Links to Validate
  3. Filter the Links
  4. Validate the Links
  5. Complete Example

Let's look at each piece in turn.

Note: we are only going to implement the most basic error handling. If you change the target URL, you may need to adapt the code for the specific details of the target webpage.

Download the Target Webpage

The first step is to download a target webpage.

There are many ways to download a URL in Python. In this case, we will use the urlopen() function to open a connection to the URL and call the read() function to download the contents of the file into memory.

To ensure the connection is closed automatically once we are finished downloading, we will use a context manager, e.g. the with keyword.

You can learn more about opening and reading from URL connections in the urllib.request API documentation.

We will set a timeout on the download of a few seconds. If a connection cannot be made within the specified timeout, an exception is raised. We can catch this exception and return None instead of the contents of the file to indicate to the caller that an error occurred.

The download_url() function below implements this, taking a URL and returning the contents of the file.

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # any failure (timeout, http error, bad url) means the download failed
        return None

We will use this function to download the target webpage that contains the links we want to validate.

We can also use this same function to validate each link on the webpage.

Parse HTML for Links to Validate

Once the target webpage has been downloaded, we must parse it and extract all of the links to validate.

I recommend the BeautifulSoup Python library whenever HTML documents need to be parsed.

If you're new to BeautifulSoup, you can install it easily with your Python package manager, such as pip:

pip3 install beautifulsoup4

First, we must decode the raw bytes downloaded into text.

This can be achieved by calling the decode() function on the raw bytes and specifying a standard text encoding, such as UTF-8.

...
# decode the provided content as utf-8 text
html = content.decode('utf-8')

Next, we can parse the text of the HTML document with BeautifulSoup using the default parser. It is a good idea to use a more sophisticated parser that works the same on all platforms, but in this case, we will use the default parser as it does not require you to install anything extra.

...
# parse the document as best we can
soup = BeautifulSoup(html, 'html.parser')

We can then retrieve all <a href=""> HTML tags from the document, as these contain the URLs we wish to validate.

...
# find all of the <a href=""> tags in the document
atags = soup.find_all('a')

We can then iterate through all of the found tags and retrieve the contents of the href property on each, e.g. get the links to files from each tag.

This can be done in a list comprehension giving us a list of URLs to validate.

...
# get all links from a tags
return [tag.get('href') for tag in atags]

Tying this all together, the get_urls_from_html() function below takes the downloaded HTML content and returns a list of URLs to validate.

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

We will use this function to extract all of the links from the target HTML page that need to be validated.

Filter the Links

The list of links extracted from the target webpage need to be filtered.

This includes removing any links that don't have any content or have a different protocol.

...
# skip missing or empty urls
if url is None or not url.strip():
    continue
# skip other protocols
if url.startswith('javascript'):
    continue

It also involves changing any links that are relative to be absolute.

Only absolute URLs can be validated; relative URLs must be converted to be absolute before being validated.

You may recall that a webpage link can be relative to the current webpage. For example, if the URL is https://www.python.org and links to /data/test1.html, then this is relative to the current webpage.

We can convert a relative URL to be an absolute URL by pre-pending the URL path for the current webpage, e.g. /data/test1.html becomes https://www.python.org/data/test1.html

We can do this using the urljoin() function from the urllib.parse module.

You can learn more about this module in the urllib.parse API documentation.
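For example, joining the base URL with a relative path produces the absolute URL (the paths here are just for illustration):

```python
from urllib.parse import urljoin

# join a base url with a relative path to get an absolute url
absolute = urljoin('https://www.python.org', '/data/test1.html')
print(absolute)  # https://www.python.org/data/test1.html
```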

A crude check (that won't work for all cases) for whether a URL is relative is to determine whether it starts with "http", the protocol part of the URL. If not, we assume the URL is relative and convert it to be absolute.

...
# check if the url is relative
if not url.startswith('http'):
    # join the base url with the relative fragment
    url = urljoin(base, url)

The filter_urls() function below takes the target webpage URL and the list of raw URLs extracted from the target webpage and filters the list.

URLs with missing content or the wrong protocol are removed, relative URLs are converted to absolute URLs, and finally, we only keep one version of each absolute URL, using a set.

This set of filtered unique absolute URLs is then returned, ready for validation.

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or empty urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

Validate the Links

The next step is to validate the set of filtered links.

This can be achieved by enumerating the collection of URLs and attempting to download each.

If a URL cannot be downloaded, it is probably broken and is reported.

We can use the download_url() function developed previously to download each link.

The validate() function below takes a collection of filtered URLs and validates each, reporting with a print statement all those that appear to be broken.

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # validate each link
    for link in filtered_links:
        data = download_url(link)
        if data is None:
            print(f'.broken: {link}')

Complete Example

We now have all of the elements to validate the links in a target webpage.

First, we can download a target URL by calling the download_url() function, extract all of the raw links by calling get_urls_from_html(), filter the raw links by calling filter_urls(), then validate the filtered links by calling validate().

The validate_urls() function below implements this procedure for a given target webpage provided as an argument.

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate the links
    validate(filtered_links)
    print('Done')

We can then call this function for a target URL.

We'll use www.python.org as the target website.

...
# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)

Tying this together, the complete example of validating all URLs one by one in the target website python.org is listed below.

# SuperFastPython.com
# validate all urls on a webpage one at a time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # any failure (timeout, http error, bad url) means the download failed
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or empty urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # validate each link
    for link in filtered_links:
        data = download_url(link)
        if data is None:
            print(f'.broken: {link}')

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate the links
    validate(filtered_links)
    print('Done')

# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)

Running the example first downloads the target webpage.

The downloaded web page is parsed and all links are extracted; in this case, 212 raw links are reported.

The raw links are filtered, resulting in a set of 130 unique absolute URLs to be validated.

The links are validated and only one is reported to be broken (http://www.saltstack.com). The link may not be broken; it may just not like being connected to from a Python user agent.

Downloaded: https://www.python.org/
.found 212 raw links
.found 130 unique links
.broken: http://www.saltstack.com
Done
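If you suspect a link is being rejected because of the default Python user agent, one workaround (a sketch only; the header value is an arbitrary example and not part of the tutorial's code) is to set a browser-like User-Agent header via a Request object:

```python
from urllib.request import Request, urlopen

# download a url using a custom user agent header
def download_url_with_agent(urlpath, agent='Mozilla/5.0'):
    # build a request carrying the custom user agent
    request = Request(urlpath, headers={'User-Agent': agent})
    try:
        # open the connection and read the contents
        with urlopen(request, timeout=3) as connection:
            return connection.read()
    except Exception:
        return None
```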

What results did you get?
Let me know in the comments below.

The program works fine, but it is very slow as each link on the web page must be checked one by one.

On my system, it took approximately 74 seconds to complete.

How long did it take to run on your computer?
Let me know in the comments below.

Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up this validation process.

Create a Pool of Worker Threads

We can use the ThreadPoolExecutor to speed up the validation of multiple links on a webpage.

The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.

The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.

Generally, the ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like numerical computation.

The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the "automatic mode" for Python threads. The general pattern is:

  1. Create the thread pool by calling ThreadPoolExecutor().
  2. Submit tasks and get futures by calling submit().
  3. Wait and get results as tasks complete by calling as_completed().
  4. Shut down the thread pool by calling shutdown().
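The four steps above can be sketched end to end with a trivial task (the task function here is just for demonstration):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# a trivial task used only to demonstrate the pattern
def task(value):
    return value * 2

# 1. create the thread pool
executor = ThreadPoolExecutor(max_workers=4)
# 2. submit tasks and get future objects
futures = [executor.submit(task, i) for i in range(5)]
# 3. get results as tasks complete (completion order is arbitrary)
results = sorted(future.result() for future in as_completed(futures))
# 4. shut down the thread pool
executor.shutdown()
print(results)  # [0, 2, 4, 6, 8]
```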

Create the Thread Pool

First, a ThreadPoolExecutor instance must be created. By default, it will create a pool of min(32, os.cpu_count() + 4) worker threads (in Python 3.8 and later). This is good for most purposes.

...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()

You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands. You can specify the number of threads to create in the pool via the max_workers argument; for example:

...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)

Submit Tasks to the Thread Pool

Once created, we can send tasks into the pool to be completed using the submit() function.

This function takes the function to call and any arguments, and returns a Future object.

The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.

...
# submit a task and get a future object
future = executor.submit(task, arg1, arg2, ...)

The return from a function executed by the thread pool can be accessed via the result() function on the Future object. It will wait until the result is available, if needed, or return immediately if the result is available.

For example:

...
# get the result from a future
result = future.result()
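The Future also provides a done() method for checking completion without blocking; a small sketch (the slow_task function is just for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# simulate a slow io-bound task
def slow_task():
    time.sleep(0.5)
    return 'ok'

with ThreadPoolExecutor() as executor:
    future = executor.submit(slow_task)
    # done() returns immediately; the task is probably still running
    print(future.done())
    # result() blocks until the task finishes
    result = future.result()
    print(future.done())  # True, the task has completed
print(result)  # ok
```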

Get Results as Tasks Complete

The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.

The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just as its name suggests.

We can call the function and provide it a list of Future objects created by calling submit(), and it will yield Future objects in the order they are completed, whatever that order happens to be.

For example, we can use a list comprehension to submit the tasks and create the list of Future objects:

...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(task, item) for item in items]

Then get results for tasks as they complete in a for loop:

...
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
	# get the result
	result = future.result()
	# do something with the result...

Shut Down the Thread Pool

Once all tasks are completed, we can close down the thread pool, which will release each thread and any resources it may hold (e.g. the stack space).

...
# shutdown the thread pool
executor.shutdown()

An easier way to use the thread pool is via the context manager (the "with" keyword), which ensures it is closed automatically once we are finished with it.

...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
	# submit tasks
	futures = [executor.submit(task, item) for item in items]
	# get results as they are available
	for future in as_completed(futures):
		# get the result
		result = future.result()
		# do something with the result...

Now that we are familiar with ThreadPoolExecutor and how to use it, let's look at how we can adapt our program for validating links to make use of it.

Validate Multiple Links Concurrently

The program for validating links can be adapted to use the ThreadPoolExecutor with very little change.

The validate() function currently enumerates the list of links extracted from the target webpage and calls the download_url() function for each.

This loop can be updated to submit tasks to a ThreadPoolExecutor object via submit(), and then we can wait on the Future objects via a call to as_completed() and report progress.

First, we can create the thread pool. We can create one thread for each page to be downloaded. To ensure we don't create too many threads, we can limit the total to 1,000 if needed.

...
# select the number of workers, no more than 1,000
n_threads = min(1000, len(filtered_links))
# validate each link
with ThreadPoolExecutor(n_threads) as executor:
	# ...

Importantly, we need to know which URL each task is associated with so we can report a message if download_url() returns None to indicate a broken link.

This can be achieved by first creating a dictionary that maps each Future object returned from the submit() function to the URL passed as an argument to download_url().

We can do this in a dictionary comprehension; for example:

...
# validate each url
futures_to_data = {executor.submit(download_url, link):link for link in filtered_links}

We can then process results in the order that pages are downloaded by calling the as_completed() module function on the futures_to_data dictionary, whose keys are the Future objects.

...
# report results as they are available
for future in as_completed(futures_to_data):
	# ...

For each result we get, we can then get the result from the Future by calling result(), and if it is None (e.g. broken), we can retrieve the URL for the task and report a message.

...
# get result from the download
data = future.result()
# check for broken url
if data is None:
    # look up the url associated with this future
    link = futures_to_data[future]
    print(f'.broken: {link}')

Tying this together, the concurrent version of the validate() function for validating links concurrently using the ThreadPoolExecutor is listed below.

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # select the number of workers, no more than 1,000
    n_threads = min(1000, len(filtered_links))
    # validate each link
    with ThreadPoolExecutor(n_threads) as executor:
        # validate each url
        futures_to_data = {executor.submit(download_url, link):link for link in filtered_links}
        # report results as they are available
        for future in as_completed(futures_to_data):
            # get result from the download
            data = future.result()
            # get the link for the result
            link = futures_to_data[future]
            # check for broken url
            if data is None:
                print(f'.broken: {link}')

The complete example with this updated validate() function is listed below.

# SuperFastPython.com
# validate all urls on a webpage concurrently
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
from bs4 import BeautifulSoup

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=5) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # any failure (timeout, http error, bad url) means the download failed
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or empty urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # select the number of workers, no more than 1,000
    n_threads = min(1000, len(filtered_links))
    # validate each link
    with ThreadPoolExecutor(n_threads) as executor:
        # validate each url
        futures_to_data = {executor.submit(download_url, link):link for link in filtered_links}
        # report results as they are available
        for future in as_completed(futures_to_data):
            # get result from the download
            data = future.result()
            # get the link for the result
            link = futures_to_data[future]
            # check for broken url
            if data is None:
                print(f'.broken: {link}')

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate all links
    validate(filtered_links)
    print('Done')

# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)

Running the example reports the same result as before, as we would expect.

Importantly, the program is dramatically faster, taking only 4.8 seconds on my system compared to about 74 seconds for the sequential version we developed above. That is a 15x speedup.

Downloaded: https://www.python.org/
.found 212 raw links
.found 130 unique links
.broken: http://www.saltstack.com
Done

How long did it take to run on your computer?
Let me know in the comments below.

Extensions

This section lists ideas for extending the tutorial.

  - Update the example to report the HTTP status code for each broken link.
  - Update the example to follow links and validate pages across an entire site.
  - Update the example to use the ProcessPoolExecutor instead and compare performance.

Share your extensions in the comments below; it would be great to see what you come up with.

Takeaways

In this tutorial, you discovered how to concurrently validate URL links in a webpage in Python.