Last Updated on September 12, 2022
The ThreadPoolExecutor class in Python can be used to validate multiple URL links at the same time.
This can dramatically speed up the process compared to validating each URL sequentially, one by one.
In this tutorial, you will discover how to concurrently validate URL links in a webpage in Python.
After completing this tutorial, you will know:
- How to validate all links in a webpage sequentially in Python and how slow it can be.
- How to use the ThreadPoolExecutor to manage a pool of worker threads.
- How to update the code to validate multiple URLs at the same time and dramatically accelerate the process.
Let’s dive in.
Validate Links One-By-One (slowly)
Validating URL links in a web page is a common task.
Websites change over time and some close down, others rename pages. There are many reasons why the links in a webpage may break.
Finding broken URLs on a webpage can be time-consuming. In the worst case, we have to click each URL in turn and check whether the target page loads. For more than a handful of links, this could take a very long time.
Thankfully, we’re developers, so we can write a script to first discover all links on a webpage, then check each to see if they work.
We can develop a program to validate links on a webpage one by one.
There are a few parts to this program; for example:
- Download URL Target Webpage
- Parse HTML For Links to Validate
- Filter the Links
- Validate the Links
- Complete Example
Let’s look at each piece in turn.
Note: we are only going to implement the most basic error handling. If you change the target URL, you may need to adapt the code for the specific details of the target webpage.
Download URL Target Webpage
The first step is to download a target webpage.
There are many ways to download a URL in Python. In this case, we will use the urlopen() function to open a connection to the URL and call the read() function to download the contents of the file into memory.
To ensure the connection is closed automatically once we are finished downloading, we will use a context manager, e.g. the with keyword.
You can learn more about opening and reading from URL connections in the Python API here:
- urllib.request API documentation: https://docs.python.org/3/library/urllib.request.html
We will set a timeout on the download of a few seconds. If a connection cannot be made within the specified timeout, an exception is raised. We can catch this exception and return None instead of the contents of the file to indicate to the caller that an error occurred.
The download_url() function below implements this, taking a URL and returning the contents of the file.
# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except:
        return None
We will use this function to download the target webpage that contains the links we want to validate.
We can also use this same function to validate each link on the webpage.
Parse HTML for Links to Validate
Once the target webpage has been downloaded, we must parse it and extract all of the links to validate.
I recommend the BeautifulSoup Python library whenever HTML documents need to be parsed.
If you’re new to BeautifulSoup, you can install it easily with your Python package manager, such as pip:
pip3 install beautifulsoup4
First, we must decode the raw bytes that were downloaded into text.
This can be achieved by calling the decode() function on the downloaded data and specifying a standard text encoding, such as UTF-8.
...
# decode the provided content as utf-8 text
html = content.decode('utf-8')
Next, we can parse the text of the HTML document with BeautifulSoup using the default parser. It is a good idea to use a more sophisticated parser that works the same on all platforms, but in this case, we will use the default parser as it does not require you to install anything extra.
...
# parse the document as best we can
soup = BeautifulSoup(html, 'html.parser')
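If you prefer a third-party parser that behaves the same on all platforms, a minimal variation of mine (assuming you have installed the lxml package, e.g. via pip3 install lxml) is to pass its name to BeautifulSoup instead:

...
# parse the document using the third-party lxml parser instead
soup = BeautifulSoup(html, 'lxml')

The rest of the tutorial uses the built-in 'html.parser' so that no extra installation is required.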
We can then retrieve all <a href=""> HTML tags from the document, as these will contain the URLs we wish to validate.
...
# find all of the <a href=""> tags in the document
atags = soup.find_all('a')
We can then iterate through all of the found tags and retrieve the contents of the href property of each, i.e. get the link from each tag.
This can be done in a list comprehension giving us a list of URLs to validate.
...
# get all links from a tags
return [tag.get('href') for tag in atags]
Tying this all together, the get_urls_from_html() function below takes the downloaded HTML content and returns a list of URLs to validate.
# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]
We will use this function to extract all of the links from the target HTML page that need to be validated.
Filter the Links
The list of links extracted from the target webpage needs to be filtered.
This includes removing any links that don’t have any content or have a different protocol.
...
# skip missing or bad urls
if url is None or not url.strip():
    continue
# skip other protocols
if url.startswith('javascript'):
    continue
It also involves changing any links that are relative to be absolute.
Only absolute URLs can be validated; relative URLs must be converted to be absolute before being validated.
You may recall that a webpage link can be relative to the current webpage. For example, if the current page is https://www.python.org and it links to /data/test1.html, then that link is relative to the current webpage.
We can convert a relative URL to an absolute URL by prepending the base URL of the current webpage, e.g. /data/test1.html becomes https://www.python.org/data/test1.html
We can do this using the urljoin() function from the urllib.parse module.
You can learn more about this module here:
- urllib.parse API documentation: https://docs.python.org/3/library/urllib.parse.html
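For example, a quick interactive check (my own illustration, not part of the tutorial's program) confirms the conversion described above:

...
from urllib.parse import urljoin
# join a relative link onto the base url of the current webpage
print(urljoin('https://www.python.org', '/data/test1.html'))
# prints: https://www.python.org/data/test1.html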
A crude check (that won’t work for all cases) for whether a URL is relative is to see if it starts with “http”, the protocol part of an absolute URL. If it does not, we can assume the URL is relative and convert it to be absolute.
...
# check if the url is relative
if not url.startswith('http'):
    # join the base url with the relative fragment
    url = urljoin(base, url)
The filter_urls() function below takes the target webpage URL and the list of raw URLs extracted from the target webpage and filters the list.
URLs with missing content or the wrong protocol are removed, relative URLs are converted to absolute URLs, and finally, we only keep one version of each absolute URL, using a set.
This set of filtered unique absolute URLs is then returned, ready for validation.
# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or bad urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered
Validate the Links
The next step is to validate the set of filtered links.
This can be achieved by enumerating the collection of URLs and attempting to download each.
If a URL cannot be downloaded, it is probably broken and is reported.
We can use the download_url() function developed previously to download each link.
The validate() function below takes a collection of filtered URLs and validates each, reporting with a print statement all those that appear to be broken.
# validate a list of urls, report all broken urls
def validate(filtered_links):
    # validate each link
    for link in filtered_links:
        data = download_url(link)
        if data is None:
            print(f'.broken: {link}')
Complete Example
We now have all of the elements to validate the links in a target webpage.
First, we can download a target URL by calling the download_url() function, extract all of the raw links by calling get_urls_from_html(), filter the raw links by calling filter_urls(), then validate the filtered links by calling validate().
The validate_urls() function below implements this procedure for a given target webpage provided as an argument.
# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate the links
    validate(filtered_links)
    print('Done')
We can then call this function for a target URL.
We’ll use www.python.org as the target website.
...
# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)
Tying this together, the complete example of validating all URLs one by one in the target website python.org is listed below.
# SuperFastPython.com
# validate all urls on a webpage one at a time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except:
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or bad urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # validate each link
    for link in filtered_links:
        data = download_url(link)
        if data is None:
            print(f'.broken: {link}')

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate the links
    validate(filtered_links)
    print('Done')

# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)
Running the example first downloads the target webpage.
The downloaded web page is parsed and all links are extracted; in this case, 212 raw links are reported.
The raw links are filtered, resulting in a set of 130 unique absolute URLs to be validated.
The links are validated and only one is reported to be broken (http://www.saltstack.com). The link may not be broken; it may just not like being connected to from a Python user agent.
Downloaded: https://www.python.org/
.found 212 raw links
.found 130 unique links
.broken: http://www.saltstack.com
Done
What results did you get?
Let me know in the comments below.
The program works fine, but it is very slow as each link on the web page must be checked one by one.
On my system, it took approximately 74 seconds to complete.
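If you want to time the run yourself, one simple option (an addition of mine, not part of the listing above) is to wrap the call to validate_urls() with time.perf_counter():

...
from time import perf_counter
# record the start time
start = perf_counter()
# validate all urls on the target webpage
validate_urls(URL)
# report the overall duration in seconds
print(f'Took {perf_counter() - start:.1f} seconds')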
How long did it take to run on your computer?
Let me know in the comments below.
Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up this validation process.
Create a Pool of Worker Threads
We can use the ThreadPoolExecutor to speed up the validation of multiple links on a webpage.
The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.
The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.
Generally, the ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like number crunching.
The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the “automatic mode” for Python threads. The general pattern is:
- Create the thread pool by calling ThreadPoolExecutor().
- Submit tasks and get futures by calling submit().
- Wait and get results as tasks complete by calling as_completed().
- Shut down the thread pool by calling shutdown().
Create the Thread Pool
First, a ThreadPoolExecutor instance must be created. By default (since Python 3.8), it will create a pool with min(32, os.cpu_count() + 4) worker threads. This is good for most purposes.
...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()
You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands. You can specify the number of threads to create in the pool via the max_workers argument; for example:
...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)
Submit Tasks to the Thread Pool
Once created, we can send tasks into the pool to be completed using the submit() function.
This function takes the name of the function to call and any arguments, and returns a Future object.
The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.
...
# submit a task and get a future object
future = executor.submit(task, arg1, arg2, ...)
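That completion check is provided by the done() method on the Future object; a tiny fragment of my own to illustrate:

...
# check whether the submitted task has finished, without blocking
if future.done():
    print('task is finished')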
The return value from a function executed by the thread pool can be accessed via the result() function on the Future object. It will block until the result is available, or return immediately if the task has already completed.
For example:
...
# get the result from a future
result = future.result()
Get Results as Tasks Complete
The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.
The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just as its name suggests.
We can call this function with the list of Future objects created by calling submit(), and it will yield each Future as it completes, in whatever order the tasks finish.
For example, we can use a list comprehension to submit the tasks and create the list of Future objects:
...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(task, item) for item in items]
Then get results for tasks as they complete in a for loop:
...
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
    # get the result
    result = future.result()
    # do something with the result...
Shut Down the Thread Pool
Once all tasks are completed, we can shut down the thread pool, which will release each worker thread and any resources it may hold (e.g. its stack space).
...
# shutdown the thread pool
executor.shutdown()
An easier way to use the thread pool is via the context manager (the “with” keyword), which ensures it is closed automatically once we are finished with it.
...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
    # submit tasks
    futures = [executor.submit(task, item) for item in items]
    # get results as they are available
    for future in as_completed(futures):
        # get the result
        result = future.result()
        # do something with the result...
Now that we are familiar with ThreadPoolExecutor and how to use it, let’s look at how we can adapt our program for validating links to make use of it.
Validate Multiple Links Concurrently
The program for validating links can be adapted to use the ThreadPoolExecutor with very little change.
The validate() function currently enumerates the list of links extracted from the target webpage and calls the download_url() function for each.
This loop can be updated to submit() tasks to a ThreadPoolExecutor object, and then we can wait on the Future objects via a call to as_completed() and report progress.
First, we can create the thread pool. We can create one thread for each page to be downloaded. To ensure we don’t create too many threads, we can limit the total to 1,000 if needed.
...
# select the number of workers, no more than 1,000
n_threads = min(1000, len(filtered_links))
# validate each link
with ThreadPoolExecutor(n_threads) as executor:
    # ...
Importantly, we need to know which URL each task is associated with so we can report a message if download_url() returns None to indicate a broken link.
This can be achieved by first creating a dictionary that maps the Future objects returned from submit() to the URL that was passed as an argument to download_url().
We can do this in a dictionary comprehension; for example:
...
# validate each url
futures_to_data = {executor.submit(download_url, link): link for link in filtered_links}
We can then process results in the order that pages are downloaded by calling the as_completed() module function on futures_to_data, which contains the Future objects as keys.
...
# report results as they are available
for future in as_completed(futures_to_data):
    # ...
For each completed Future, we can get the downloaded data by calling result(), and if it is None (e.g. the link is broken), we can look up the URL for that task and report a message.
...
# get result from the download
data = future.result()
# check for broken url
if data is None:
    print(f'.broken: {futures_to_data[future]}')
Tying this together, the concurrent version of the validate() function for validating links concurrently using the ThreadPoolExecutor is listed below.
# validate a list of urls, report all broken urls
def validate(filtered_links):
    # select the number of workers, no more than 1,000
    n_threads = min(1000, len(filtered_links))
    # validate each link
    with ThreadPoolExecutor(n_threads) as executor:
        # validate each url
        futures_to_data = {executor.submit(download_url, link): link for link in filtered_links}
        # report results as they are available
        for future in as_completed(futures_to_data):
            # get result from the download
            data = future.result()
            # get the link for the result
            link = futures_to_data[future]
            # check for broken url
            if data is None:
                print(f'.broken: {link}')
The complete example with this updated validate() function is listed below.
# SuperFastPython.com
# validate all urls on a webpage concurrently
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
from bs4 import BeautifulSoup

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=5) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except:
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# filter out bad urls, convert relative to absolute if needed
def filter_urls(base, urls):
    filtered = set()
    for url in urls:
        # skip missing or bad urls
        if url is None or not url.strip():
            continue
        # skip other protocols
        if url.startswith('javascript'):
            continue
        # check if the url is relative
        if not url.startswith('http'):
            # join the base url with the relative fragment
            url = urljoin(base, url)
        # store in the filtered set
        filtered.add(url)
    return filtered

# validate a list of urls, report all broken urls
def validate(filtered_links):
    # select the number of workers, no more than 1,000
    n_threads = min(1000, len(filtered_links))
    # validate each link
    with ThreadPoolExecutor(n_threads) as executor:
        # validate each url
        futures_to_data = {executor.submit(download_url, link): link for link in filtered_links}
        # report results as they are available
        for future in as_completed(futures_to_data):
            # get result from the download
            data = future.result()
            # check for broken url
            if data is None:
                print(f'.broken: {futures_to_data[future]}')

# find all broken urls listed on this website
def validate_urls(base):
    # download the page
    content = download_url(base)
    if content is None:
        print(f'Failed to download: {base}')
        return
    print(f'Downloaded: {base}')
    # extract all the links
    raw_links = get_urls_from_html(content)
    print(f'.found {len(raw_links)} raw links')
    # filter links
    filtered_links = filter_urls(base, raw_links)
    print(f'.found {len(filtered_links)} unique links')
    # validate all links
    validate(filtered_links)
    print('Done')

# website to check
URL = 'https://www.python.org/'
# validate all urls
validate_urls(URL)
Running the example reports the same result as before, as we would expect.
Importantly, the program is dramatically faster, taking only 4.8 seconds on my system compared to about 74 seconds for the sequential version we developed above. That is a 15x speedup.
Downloaded: https://www.python.org/
.found 212 raw links
.found 130 unique links
.broken: http://www.saltstack.com
Done
How long did it take to run on your computer?
Let me know in the comments below.
Free Python ThreadPoolExecutor Course
Download your FREE ThreadPoolExecutor PDF cheat sheet and get BONUS access to my free 7-day crash course on the ThreadPoolExecutor API.
Discover how to use the ThreadPoolExecutor class including how to configure the number of workers and how to execute tasks asynchronously.
Extensions
This section lists ideas for extending the tutorial.
- Parse URLs During Filtering: Update the example to parse the URLs during the filtering process and perhaps filter out and even report any that fail to parse correctly.
- Filter out More Protocols: Update the example to filter out more protocols, such as ftp, mailto, and anything else we might not want to validate.
- Change User Agent: Update the example to use a common browser user agent and potentially allow more URLs to be validated correctly (see the sketch after this list).
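As a sketch of the last idea (my own addition, assuming a browser-like User-Agent header is acceptable to the sites being checked), one way to change the user agent with urllib is to build a Request object with a custom header and pass it to urlopen():

from urllib.request import Request, urlopen

# download a url using a browser-like user agent (placeholder header string)
def download_url_with_agent(urlpath):
    try:
        # build a request with a custom User-Agent header
        request = Request(urlpath, headers={'User-Agent': 'Mozilla/5.0'})
        # open the connection and read the contents as bytes
        with urlopen(request, timeout=3) as connection:
            return connection.read()
    except:
        return None

Swapping something like this in for download_url() may reduce false positives like the saltstack link reported above.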
Share your extensions in the comments below; it would be great to see what you come up with.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ThreadPoolExecutor Jump-Start, Jason Brownlee, (my book!)
- Concurrent Futures API Interview Questions
- ThreadPoolExecutor Class API Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ThreadPoolExecutor: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
- Python Threading: The Complete Guide
- Python ThreadPool: The Complete Guide
Takeaways
In this tutorial, you discovered how to concurrently validate URL links in a webpage in Python. Specifically, you learned:
- How to validate all links in a webpage sequentially in Python and how slow it can be.
- How to use the ThreadPoolExecutor to manage a pool of worker threads.
- How to update the code to validate multiple URLs at the same time and dramatically accelerate the process.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Photo by Rob Wingate on Unsplash