How to Validate Links Concurrently in Python

December 24, 2021 Python ThreadPoolExecutor

The ThreadPoolExecutor class in Python can be used to validate multiple URL links at the same time.

This can dramatically speed-up the process compared to validating each URL sequentially, one by one.

In this tutorial, you will discover how to concurrently validate URL links in a webpage in Python.

After completing this tutorial, you will know:

  - How to download a webpage and validate each of its links one by one, and why this is slow.
  - How to use the ThreadPoolExecutor class to create a pool of worker threads.
  - How to validate multiple links concurrently and dramatically speed up the process.

Let's dive in.

Validate Links One-By-One (slowly)

Validating URL links in a web page is a common task.

Websites change over time and some close down, others rename pages. There are many reasons why the links in a webpage may break.

Finding broken URLs on a webpage can be time-consuming. In the worst case, we have to click each URL in turn and determine whether the target page loads or not. For more than one link, this could take a very long time.

Thankfully, we're developers, so we can write a script to first discover all links on a webpage, then check each to see if they work.

We can develop a program to validate links on a webpage one by one.

There are a few parts to this program; for example:

  1. Download the Target Webpage
  2. Parse HTML For Links to Validate
  3. Filter the Links
  4. Validate the Links
  5. Complete Example

Let's look at each piece in turn.

Note: we are only going to implement the most basic error handling. If you change the target URL, you may need to adapt the code for the specific details of the target webpage.

Download the Target Webpage

The first step is to download a target webpage.

There are many ways to download a URL in Python. In this case, we will use the urlopen() function to open a connection to the URL and call the read() function to download the contents of the file into memory.

To ensure the connection is closed automatically once we are finished downloading, we will use a context manager, e.g. the with keyword.

You can learn more about opening and reading from URL connections in the urllib.request API documentation.

We will set a timeout on the download of a few seconds. If a connection cannot be made within the specified timeout, an exception is raised. We can catch this exception and return None instead of the contents of the file to indicate to the caller that an error occurred.

The download_url() function below implements this, taking a URL and returning the contents of the file.

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # any failure (timeout, http error, bad url) means the download failed
        return None

We will use this function to download the target webpage that contains the links we want to validate.

We can also use this same function to validate each link on the webpage.

Parse HTML for Links to Validate

Once the target webpage has been downloaded, we must parse it and extract all of the links to validate.

I recommend the BeautifulSoup Python library whenever HTML documents need to be parsed.

If you're new to BeautifulSoup, you can install it easily with your Python package manager, such as pip:

pip3 install beautifulsoup4

First, we must decode the raw bytes downloaded into text.

This can be achieved by calling the decode() function on the raw bytes and specifying a standard text encoding, such as UTF-8.

...
# decode the provided content as utf-8 text
html = content.decode('utf-8')

Next, we can parse the text of the HTML document with BeautifulSoup using the default parser. It is a good idea to use a more sophisticated parser that works the same on all platforms, but in this case, we will use the default parser as it does not require you to install anything extra.

...
# parse the document as best we can
soup = BeautifulSoup(html, 'html.parser')

We can then retrieve all <a href=""> HTML tags from the document, as these contain the URLs we wish to validate.

...
# find all of the <a href=""> tags in the document
atags = soup.find_all('a')

We can then iterate through all of the found tags and retrieve the contents of the href property on each, e.g. get the links to files from each tag.

This can be done in a list comprehension giving us a list of URLs to validate.

...
# get all links from a tags
return [tag.get('href') for tag in atags]

Tying this all together, the get_urls_from_html() function below takes the downloaded HTML content and returns a list of URLs to validate.

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

We will use this function to extract all of the links from the target HTML page that need to be validated.

Filter the Links

The list of links extracted from the target webpage need to be filtered.

This includes removing any links that don't have any content or have a different protocol.

...
# skip missing or empty urls
if url is None or not url.strip():
    continue
# skip other protocols
if url.startswith('javascript'):
    continue

It also involves changing any links that are relative to be absolute.

Only absolute URLs can be validated; relative URLs must be converted to be absolute before being validated.

You may recall that a webpage link can be relative to the current webpage. For example, if the URL is https://www.python.org and links to /data/test1.html, then this is relative to the current webpage.

We can convert a relative URL to be an absolute URL by pre-pending the URL path for the current webpage, e.g. /data/test1.html becomes https://www.python.org/data/test1.html

We can do this using the urljoin() function from the urllib.parse module.

You can learn more about this module in the urllib.parse API documentation.
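For example, joining the base URL with a relative path produces the absolute URL (the paths here are just for illustration):

```python
from urllib.parse import urljoin

# join a base url with a relative path to get an absolute url
absolute = urljoin('https://www.python.org', '/data/test1.html')
print(absolute)  # https://www.python.org/data/test1.html
```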

A crude check (that won't work for all cases) for whether a URL is relative is to determine whether it starts with "http", the protocol part of the URL. If not, we assume the URL is relative and convert it to be absolute.

...
# check if the url is relative
if not url.startswith('http'):
    # join the base url with the relative fragment
    url = urljoin(base, url)

The filter_urls() function below takes the target webpage URL and the list of raw URLs extracted from the target webpage and filters the list.

URLs with missing content or the wrong protocol are removed, relative URLs are converted to absolute URLs, and finally, we only keep one version of each absolute URL, using a set.

This set of filtered unique absolute URLs is then returned, ready for validation.

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or empty urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

Validate the Links

The next step is to validate the set of filtered links.

This can be achieved by enumerating the collection of URLs and attempting to download each.

If a URL cannot be downloaded, it is probably broken and is reported.

We can use the download_url() function developed previously to download each link.

The validate() function below takes a collection of filtered URLs and validates each, reporting with a print statement all those that appear to be broken.

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # validate each link
    for link in filtered_links:
        data = download_url(link)
        if data is None:
            print(f'.broken: {link}')

Complete Example

We now have all of the elements to validate the links in a target webpage.

First, we can download a target URL by calling the download_url() function, extract all of the raw links by calling get_urls_from_html(), filter the raw links by calling filter_urls(), then validate the filtered links by calling validate().

The validate_urls() function below implements this procedure for a given target webpage provided as an argument.

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate the links
    validate(filtered_links)
    print('Done')

We can then call this function for a target URL.

We'll use www.python.org as the target website.

...
# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)

Tying this together, the complete example of validating all URLs one by one in the target website python.org is listed below.

# SuperFastPython.com
# validate all urls on a webpage one at a time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # any failure (timeout, http error, bad url) means the download failed
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or empty urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # validate each link
    for link in filtered_links:
        data = download_url(link)
        if data is None:
            print(f'.broken: {link}')

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate the links
    validate(filtered_links)
    print('Done')

# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)

Running the example first downloads the target webpage.

The downloaded web page is parsed and all links are extracted; in this case, 212 raw links are reported.

The raw links are filtered, resulting in a set of 130 unique absolute URLs to be validated.

The links are validated and only one is reported to be broken (http://www.saltstack.com). The link may not be broken; it may just not like being connected to from a Python user agent.

Downloaded: https://www.python.org/
.found 212 raw links
.found 130 unique links
.broken: http://www.saltstack.com
Done
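If you suspect a link is being rejected because of the default Python user agent, one workaround (a sketch only; the header value is an arbitrary example and not part of the tutorial's code) is to set a browser-like User-Agent header via a Request object:

```python
from urllib.request import Request, urlopen

# download a url using a custom user agent header
def download_url_with_agent(urlpath, agent='Mozilla/5.0'):
    # build a request carrying the custom user agent
    request = Request(urlpath, headers={'User-Agent': agent})
    try:
        # open the connection and read the contents
        with urlopen(request, timeout=3) as connection:
            return connection.read()
    except Exception:
        return None
```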

What results did you get?
Let me know in the comments below.

The program works fine, but it is very slow as each link on the web page must be checked one by one.

On my system, it took approximately 74 seconds to complete.

How long did it take to run on your computer?
Let me know in the comments below.

Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up this validation process.

Create a Pool of Worker Threads

We can use the ThreadPoolExecutor to speed up the validation of multiple links on a webpage.

The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.

The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.

Generally, the ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like numerical computation.

The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the "automatic mode" for Python threads. The general pattern is:

  1. Create the thread pool by calling ThreadPoolExecutor().
  2. Submit tasks and get futures by calling submit().
  3. Wait and get results as tasks complete by calling as_completed().
  4. Shut down the thread pool by calling shutdown().
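The four steps above can be sketched end to end with a trivial task (the task function here is just for demonstration):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# a trivial task used only to demonstrate the pattern
def task(value):
    return value * 2

# 1. create the thread pool
executor = ThreadPoolExecutor(max_workers=4)
# 2. submit tasks and get future objects
futures = [executor.submit(task, i) for i in range(5)]
# 3. get results as tasks complete (completion order is arbitrary)
results = sorted(future.result() for future in as_completed(futures))
# 4. shut down the thread pool
executor.shutdown()
print(results)  # [0, 2, 4, 6, 8]
```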

Create the Thread Pool

First, a ThreadPoolExecutor instance must be created. By default, it will create a pool of min(32, os.cpu_count() + 4) worker threads (in Python 3.8 and later). This is good for most purposes.

...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()

You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands. You can specify the number of threads to create in the pool via the max_workers argument; for example:

...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)

Submit Tasks to the Thread Pool

Once created, we can send tasks into the pool to be completed using the submit() function.

This function takes the function to call and any arguments, and returns a Future object.

The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.

...
# submit a task and get a future object
future = executor.submit(task, arg1, arg2, ...)

The return from a function executed by the thread pool can be accessed via the result() function on the Future object. It will wait until the result is available, if needed, or return immediately if the result is available.

For example:

...
# get the result from a future
result = future.result()
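The Future also provides a done() method for checking completion without blocking; a small sketch (the slow_task function is just for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# simulate a slow io-bound task
def slow_task():
    time.sleep(0.5)
    return 'ok'

with ThreadPoolExecutor() as executor:
    future = executor.submit(slow_task)
    # done() returns immediately; the task is probably still running
    print(future.done())
    # result() blocks until the task finishes
    result = future.result()
    print(future.done())  # True, the task has completed
print(result)  # ok
```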

Get Results as Tasks Complete

The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.

The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just as its name suggests.

We can call the function and provide it a list of Future objects created by calling submit(), and it will yield Future objects in the order they are completed, whatever that order happens to be.

For example, we can use a list comprehension to submit the tasks and create the list of Future objects:

...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(task, item) for item in items]

Then get results for tasks as they complete in a for loop:

...
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
	# get the result
	result = future.result()
	# do something with the result...

Shut Down the Thread Pool

Once all tasks are completed, we can close down the thread pool, which will release each thread and any resources it may hold (e.g. the stack space).

...
# shutdown the thread pool
executor.shutdown()

An easier way to use the thread pool is via the context manager (the "with" keyword), which ensures it is closed automatically once we are finished with it.

...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
	# submit tasks
	futures = [executor.submit(task, item) for item in items]
	# get results as they are available
	for future in as_completed(futures):
		# get the result
		result = future.result()
		# do something with the result...

Now that we are familiar with ThreadPoolExecutor and how to use it, let's look at how we can adapt our program for validating links to make use of it.

Validate Multiple Links Concurrently

The program for validating links can be adapted to use the ThreadPoolExecutor with very little change.

The validate() function currently enumerates the list of links extracted from the target webpage and calls the download_url() function for each.

This loop can be updated to submit tasks to a ThreadPoolExecutor object via submit(), and then we can wait on the Future objects via a call to as_completed() and report progress.

First, we can create the thread pool. We can create one thread for each page to be downloaded. To ensure we don't create too many threads, we can limit the total to 1,000 if needed.

...
# select the number of workers, no more than 1,000
n_threads = min(1000, len(filtered_links))
# validate each link
with ThreadPoolExecutor(n_threads) as executor:
	# ...

Importantly, we need to know which URL each task is associated with so we can report a message if download_url() returns None to indicate a broken link.

This can be achieved by first creating a dictionary that maps each Future object returned from the submit() function to the URL passed as an argument to download_url().

We can do this in a dictionary comprehension; for example:

...
# validate each url
futures_to_data = {executor.submit(download_url, link):link for link in filtered_links}

We can then process results in the order that pages are downloaded by calling the as_completed() module function on the futures_to_data dictionary, whose keys are the Future objects.

...
# report results as they are available
for future in as_completed(futures_to_data):
	# ...

For each result we get, we can then get the result from the Future by calling result(), and if it is None (e.g. broken), we can retrieve the URL for the task and report a message.

...
# get result from the download
data = future.result()
# check for broken url
if data is None:
    # look up the url associated with this future
    link = futures_to_data[future]
    print(f'.broken: {link}')

Tying this together, the concurrent version of the validate() function for validating links concurrently using the ThreadPoolExecutor is listed below.

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # select the number of workers, no more than 1,000
    n_threads = min(1000, len(filtered_links))
    # validate each link
    with ThreadPoolExecutor(n_threads) as executor:
        # validate each url
        futures_to_data = {executor.submit(download_url, link):link for link in filtered_links}
        # report results as they are available
        for future in as_completed(futures_to_data):
            # get result from the download
            data = future.result()
            # get the link for the result
            link = futures_to_data[future]
            # check for broken url
            if data is None:
                print(f'.broken: {link}')

The complete example with this updated validate() function is listed below.

# SuperFastPython.com
# validate all urls on a webpage concurrently
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
from bs4 import BeautifulSoup

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=5) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # any failure (timeout, http error, bad url) means the download failed
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or empty urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # select the number of workers, no more than 1,000
    n_threads = min(1000, len(filtered_links))
    # validate each link
    with ThreadPoolExecutor(n_threads) as executor:
        # validate each url
        futures_to_data = {executor.submit(download_url, link):link for link in filtered_links}
        # report results as they are available
        for future in as_completed(futures_to_data):
            # get result from the download
            data = future.result()
            # get the link for the result
            link = futures_to_data[future]
            # check for broken url
            if data is None:
                print(f'.broken: {link}')

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate all links
    validate(filtered_links)
    print('Done')

# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)

Running the example reports the same result as before, as we would expect.

Importantly, the program is dramatically faster, taking only 4.8 seconds on my system compared to about 74 seconds for the sequential version we developed above. That is a 15x speedup.

Downloaded: https://www.python.org/
.found 212 raw links
.found 130 unique links
.broken: http://www.saltstack.com
Done

How long did it take to run on your computer?
Let me know in the comments below.

Extensions

This section lists ideas for extending the tutorial.

  - Update the example to report the HTTP status code for each broken link.
  - Update the example to follow links and validate pages across an entire site.
  - Update the example to use the ProcessPoolExecutor instead and compare performance.

Share your extensions in the comments below; it would be great to see what you come up with.

Takeaways

In this tutorial, you discovered how to concurrently validate URL links in a webpage in Python.