How to Download Books Concurrently from Project Gutenberg

December 25, 2021 Python ThreadPoolExecutor

The ThreadPoolExecutor class in Python can be used to download multiple books from Project Gutenberg at the same time.

This can dramatically speed up the process compared to downloading books sequentially, one by one.

In this tutorial, you will discover how to concurrently download books from Project Gutenberg in Python.

After completing this tutorial, you will know:

  1. How to download books from Project Gutenberg one-by-one, and why it is slow.
  2. How to use the ThreadPoolExecutor class to create a pool of worker threads.
  3. How to adapt the program to download multiple books concurrently.

Let's dive in.

Download Books One-by-One (slowly)

Project Gutenberg is an amazing resource providing a library of over 60,000 ebooks for free.

This includes a range of widely known classic books such as Frankenstein, Pride and Prejudice, Alice's Adventures in Wonderland, and so much more.

These books can be downloaded and read for pleasure, but they can also be used as the basis for interesting text-based analysis projects.

Downloading a large number of books from the website manually is a slow process as you must visit the webpage for each book, select a format, then download it.

We can automate this process by writing a Python script to automatically download books from the website.

Perhaps a simple approach to this task would be to first download the webpage of the most popular books on Project Gutenberg, retrieve a list of these books (via their book ids) then download each book.

There are a few parts to this program; for example:

  1. Download the Top Books Webpage
  2. Parse HTML for Links
  3. Extract Book Identifiers
  4. Download a Single Book via Its Identifier
  5. Complete Example

Let's look at each piece in turn.

Note: we are only going to implement the most basic error handling as we are focused on adapting the program for concurrency.

Download the Top Books Webpage

The first step is to download the page that lists the top books on Project Gutenberg.

This page is available here:

https://www.gutenberg.org/browse/scores/top

First, let's download this webpage so we can parse it for useful data.

There are many ways to download a URL in Python. In this case, we will use the urlopen() function to open a connection to the URL and call the read() function to download the contents of the file into memory.

To ensure the connection is closed automatically once we are finished downloading, we will use a context manager, e.g. the with keyword.

You can learn more about opening and reading from URL connections in the urllib.request module documentation in the Python standard library.

We will set a timeout on the download of a few seconds. If a connection cannot be made within the specified timeout, an exception is raised. We can catch this exception and return None instead of the contents of the file to indicate to the caller that an error occurred.

The download_url() function below implements this, taking a URL and returning the contents of the file.

# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except Exception:
        # bad url, socket timeout, http forbidden, etc.
        return None

We will use this function to download the webpage of top books from Project Gutenberg, but we can also re-use it to download each book when the time comes.

Next, let's look at parsing the webpage of top books.

Parse HTML for Links

Once the webpage listing the top books has been downloaded, we must parse it and extract all of the links, some of which will be links to books.

I recommend the BeautifulSoup Python library anytime HTML documents need to be parsed.

If you're new to BeautifulSoup, you can install it easily with your Python package manager, such as pip:

pip3 install beautifulsoup4

First, we must decode the raw bytes that were downloaded into text.

This can be achieved by calling the decode() function on the raw data and specifying a standard text encoding, such as UTF-8.

...
# decode the provided content as utf-8 text
html = content.decode('utf-8')

Next, we can parse the text of the HTML document using BeautifulSoup with the default parser. It can be a good idea to use a more sophisticated parser that works the same on all platforms, such as lxml, but in this case, we will use the default parser as it does not require you to install anything extra.

...
# parse the document as best we can
soup = BeautifulSoup(html, 'html.parser')

We can then retrieve all link HTML tags from the document as these will contain the URLs for the files we wish to download.

...
# find all of the <a href=""> tags in the document
atags = soup.find_all('a')

We can then iterate through all of the found tags and retrieve the contents of the href property on each, e.g. get the links to files from each tag.

This can be done in a list comprehension, giving us a list of URLs to validate.

...
# get all links from a tags
return [tag.get('href') for tag in atags]

Tying this all together, the get_urls_from_html() function below takes the downloaded HTML content and returns a list of URLs to validate.

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

We will use this function to extract all of the links from the list of top books.

Next, we can extract just the book identifiers from these links.

Extract Book Identifiers

We have a list of links extracted from the webpage of the most popular books on Project Gutenberg.

Reviewing the source code for the page, we can see that generally, books are linked using a URL structure like '/ebooks/nnn' where 'nnn' is the book identifier.

We can collect all of these book identifiers and download each book in turn using a standard URL structure on the website.

Given this '/ebooks/nnn' structure of links to books, we can define a regular expression and find all links that match this structure.

...
# define a url pattern we are looking for
pattern = re.compile('/ebooks/[0-9]+')

We can then enumerate all links extracted from the webpage and skip all of those that do not match this structure.

...
for link in links:
    # check if the link matches the pattern
    if not pattern.match(link):
        continue

For those links that do match, we can extract just the book identifier. These are the integer characters after the last '/' in the URL, or simply everything after the eight characters that make up the string '/ebooks/'. For example:

...
# extract the book id from /ebooks/nnn
book_id = link[8:]

We don't want to download the same book more than once; that would be a waste of time and effort.

Therefore, we can store all of the extracted book identifiers in a set that ensures only unique identifiers are stored.

Tying this all together, the get_book_identifiers() function below will take a list of raw URLs extracted from the list of top books webpage and return a set of unique book identifiers.

# return all book unique identifiers from a list of raw links
def get_book_identifiers(links):
    # define a url pattern we are looking for
    pattern = re.compile('/ebooks/[0-9]+')
    # process the list of links for those that match the pattern
    books = set()
    for link in links:
        # check if the link matches the pattern
        if not pattern.match(link):
            continue
        # extract the book id from /ebooks/nnn
        book_id = link[8:]
        # store in the set, only keep unique ids
        books.add(book_id)
    return books

Download a Single Book via Its Identifier

If we click a few books on the site, we can quickly see that a common website structure is used.

Each book has a webpage that uses its book identifier; for example: 'https://www.gutenberg.org/ebooks/nnn' where nnn is the book id.

If we try to download the ASCII version of any book, we get a URL that looks like 'https://www.gutenberg.org/files/nnn/nnn-0.txt', again, where nnn is the book identifier.

Therefore, we can create a string template using this direct download link and substitute each book identifier we have collected to create a download link for each top book on the website.

For example:

...
# construct the download url
url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'

We can then download this book in ASCII format using our download_url() function developed above.

...
# download the content
data = download_url(url)

Finally, we can save the downloaded file locally.

Given a local directory where we wish to save books (e.g. './books/'), we can first construct a local path for saving the book by joining the save path with the filename, then use this path to save the contents of the book file.

...
# create local path
save_file = join(save_path, f'{book_id}.txt')
# save book to file
with open(save_file, 'wb') as file:
    file.write(data)

Tying this together, the download_book() function below takes a book identifier and a local directory save path and will download and save the book and return a message of either success or failure.

# download one book from project gutenberg
def download_book(book_id, save_path):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content
    data = download_url(url)
    if data is None:
        return f'Failed to download {url}'
    # create local path
    save_file = join(save_path, f'{book_id}.txt')
    # save book to file
    with open(save_file, 'wb') as file:
        file.write(data)
    return f'Saved {save_file}'

Finally, we can put this all together into a complete example.

Complete Example

We now have all of the elements to download all of the top books from Project Gutenberg.

First, the web page that lists the top books is downloaded, then parsed for links. All unique book identifiers are retrieved from the list of links, then each book is downloaded in turn.

The download_all_books() function below ties these elements together, taking the URL of the list of top books webpage and a save path as arguments and reporting progress as each book is downloaded.

Importantly, the output path directory is created as needed using the makedirs() function.

# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book in turn
    for book_id in book_ids:
        # download and save this book
        result = download_book(book_id, save_path)
        # report result
        print(result)

We can then call this function with the URL to the list of top books and specify a 'books' subdirectory of the current working directory as the location to save the downloaded books.

...
# entry point
URL = 'https://www.gutenberg.org/browse/scores/top'
DIR = 'books'
# download top books
download_all_books(URL, DIR)

Tying this together, the complete example is listed below.

# SuperFastPython.com
# download all the top books from gutenberg.org
from os.path import join
from os import makedirs
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# download a file from a URL, return the content or None on error
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the file as bytes
            return connection.read()
    except Exception:
        # bad url, socket timeout, http forbidden, etc.
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# return all book unique identifiers from a list of raw links
def get_book_identifiers(links):
    # define a url pattern we are looking for
    pattern = re.compile('/ebooks/[0-9]+')
    # process the list of links for those that match the pattern
    books = set()
    for link in links:
        # check if the link matches the pattern
        if not pattern.match(link):
            continue
        # extract the book id from /ebooks/nnn
        book_id = link[8:]
        # store in the set, only keep unique ids
        books.add(book_id)
    return books

# download one book from project gutenberg
def download_book(book_id, save_path):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content
    data = download_url(url)
    if data is None:
        return f'Failed to download {url}'
    # create local path
    save_file = join(save_path, f'{book_id}.txt')
    # save book to file
    with open(save_file, 'wb') as file:
        file.write(data)
    return f'Saved {save_file}'

# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book in turn
    for book_id in book_ids:
        # download and save this book
        result = download_book(book_id, save_path)
        # report result
        print(result)

# entry point
URL = 'https://www.gutenberg.org/browse/scores/top'
DIR = 'books'
# download top books
download_all_books(URL, DIR)

Running the example first downloads the webpage with the list of the most popular books.

Next, the HTML document is parsed and all links are extracted, then all book IDs are extracted from the links.

Finally, each book is downloaded one-by-one.

In this case, we can see that 675 links were extracted from the HTML document and a total of 120 unique book identifiers were retrieved.

It takes a long time to download (or attempt to download) all 120 books. Not all books can be downloaded; many fail, perhaps because the book downloads don't conform to the expected link structure.

Nevertheless, 100 books are successfully downloaded, in my case, and it takes about 312.6 seconds to complete (just over 5 minutes).
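One way to reduce these failures, as an optional extension, would be to try more than one candidate URL per book. The helper below is a hypothetical sketch; the second URL pattern ('/cache/epub/nnn/pgnnn.txt') is an assumption based on browsing the site and may not hold for every book.

```python
# construct candidate download urls for a book id; the second
# pattern is an assumption and may not exist for every book
def get_candidate_urls(book_id):
    return [
        f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt',
        f'https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt',
    ]

# example: each candidate could be passed to download_url() until one succeeds
urls = get_candidate_urls('84')
```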

How many books did you download and how long did it take?
Let me know in the comments below.

.downloaded https://www.gutenberg.org/browse/scores/top
.found 675 links on the page
.found 120 unique book ids
Failed to download https://www.gutenberg.org/files/768/768-0.txt
Saved books/1998.txt
Failed to download https://www.gutenberg.org/files/63256/63256-0.txt
Failed to download https://www.gutenberg.org/files/1399/1399-0.txt
Saved books/209.txt
Saved books/41.txt
Saved books/2600.txt
Saved books/3207.txt
Saved books/4517.txt
Saved books/203.txt
Saved books/66743.txt
Saved books/36.txt
Saved books/1934.txt
Saved books/1080.txt
Saved books/46.txt
Saved books/408.txt
Saved books/66749.txt
...

Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up this download process.

How to Create a Pool of Worker Threads with ThreadPoolExecutor

We can use the ThreadPoolExecutor to speed up the process of downloading multiple books.

The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.

The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.

Generally, the ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like CPU-intensive calculations.
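As a toy illustration of why threads suit IO-bound work, the sketch below simulates ten "downloads" that each sleep for 0.2 seconds; with ten worker threads, the total wall time is close to a single sleep rather than ten (exact timings will vary by machine).

```python
import time
from concurrent.futures import ThreadPoolExecutor

# simulate an io-bound task, e.g. waiting on a socket
def fake_download(number):
    time.sleep(0.2)
    return number

start = time.perf_counter()
# run all ten simulated downloads concurrently
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fake_download, range(10)))
elapsed = time.perf_counter() - start
print(results)
print(f'{elapsed:.2f} seconds')
```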

The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the "automatic mode" for Python threads.

  1. Create the thread pool by calling ThreadPoolExecutor().
  2. Submit tasks and get futures by calling submit().
  3. Wait and get results as tasks complete by calling as_completed().
  4. Shut down the thread pool by calling shutdown().
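The four steps above can be sketched end-to-end with a toy task (the task() function here is illustrative, not part of the book-downloading program):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# a trivial task to run in the pool
def task(number):
    return number * number

# 1. create the thread pool
executor = ThreadPoolExecutor(max_workers=4)
# 2. submit tasks and get futures
futures = [executor.submit(task, i) for i in range(5)]
# 3. get results as tasks complete
results = sorted(future.result() for future in as_completed(futures))
# 4. shut down the thread pool
executor.shutdown()
print(results)  # [0, 1, 4, 9, 16]
```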

Create the Thread Pool

First, a ThreadPoolExecutor instance must be created. By default (since Python 3.8), it will create a pool with min(32, os.cpu_count() + 4) worker threads. This is good for most purposes.

...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()

You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands. You can specify the number of threads to create in the pool via the max_workers argument; for example:

...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)

Submit Tasks to the Thread Pool

Once created, we can send tasks into the pool to be completed using the submit() function.

This function takes the function to call and any arguments, and returns a Future object.

The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.
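For example, with a toy task (Python's built-in pow(), not part of the tutorial's program), we can read the return value with result() and confirm completion with done():

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    # submit a task and get a future object
    future = executor.submit(pow, 2, 8)
    # block until the task's return value is available
    value = future.result()
    # once the result has been returned, the task is done
    print(value, future.done())  # 256 True
```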

...
# submit a task and get a future object
future = executor.submit(task, arg1, arg2, ...)

The return value from a function executed by the thread pool can be accessed via the result() function on the Future object. It will block until the result is available, if needed, or return immediately if the task has already completed.

For example:

...
# get the result from a future
result = future.result()

Get Results as Tasks Complete

The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.

The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just like its name suggests.

We can call the function with the list of Future objects created by calling submit(), and it will yield Future objects as their tasks complete, in whatever order they finish.

For example, we can use a list comprehension to submit the tasks and create the list of Future objects:

...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(task, item) for item in items]

Then get results for tasks as they complete in a for loop:

...
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
	# get the result
	result = future.result()
	# do something with the result...

Shut Down the Thread Pool

Once all tasks are completed, we can shut down the thread pool, which will release each thread and any resources it may hold (e.g. its stack space).

...
# shutdown the thread pool
executor.shutdown()

An easier way to use the thread pool is via the context manager (the "with" keyword), which ensures it is closed automatically once we are finished with it.

...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
	# submit tasks
	futures = [executor.submit(task, item) for item in items]
	# get results as they are available
	for future in as_completed(futures):
		# get the result
		result = future.result()
		# do something with the result...

Now that we are familiar with ThreadPoolExecutor and how to use it, let's look at how we can adapt our program for downloading books to make use of it.

Download Multiple Books Concurrently

The program for downloading books can be adapted to use the ThreadPoolExecutor with very little change.

The download_book() function can be called separately for each book identifier in a separate thread. We can use the submit() function on the ThreadPoolExecutor to call the function, then report the results of each download as the downloads complete using the as_completed() function.

This can be achieved with a minor change to the end of the download_all_books() function used to drive the download process.

Firstly, we can create the thread pool with one thread per book to be downloaded.

...
# download and save each book concurrently
with ThreadPoolExecutor(len(book_ids)) as exe:
	# ...

Next, we can call submit() for each book identifier and collect the Future objects that are returned.

...
futures = [exe.submit(download_book, book, save_path) for book in book_ids]

Finally, we can loop over tasks as they complete and report the success or failure message directly.

...
# report results as books are downloaded
for future in as_completed(futures):
    # report the result messages
    print(future.result())

And that's all there is to it; the updated version of the download_all_books() function that makes use of the ThreadPoolExecutor to download books concurrently is listed below.

# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book concurrently
    with ThreadPoolExecutor(len(book_ids)) as exe:
        futures = [exe.submit(download_book, book, save_path) for book in book_ids]
        # report results as books are downloaded
        for future in as_completed(futures):
            # report the result messages
            print(future.result())

Tying this together, the complete example of the concurrent version of the book downloading program is listed below.

# SuperFastPython.com
# download all the top books from gutenberg.org concurrently
import re
from os.path import join
from os import makedirs
from urllib.request import urlopen
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
from bs4 import BeautifulSoup

# download a file from a URL, return the content or None on error
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the file as bytes
            return connection.read()
    except Exception:
        # bad url, socket timeout, http forbidden, etc.
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# return all book unique identifiers from a list of raw links
def get_book_identifiers(links):
    # define a url pattern we are looking for
    pattern = re.compile('/ebooks/[0-9]+')
    # process the list of links for those that match the pattern
    books = set()
    for link in links:
        # check if the link matches the pattern
        if not pattern.match(link):
            continue
        # extract the book id from /ebooks/nnn
        book_id = link[8:]
        # store in the set, only keep unique ids
        books.add(book_id)
    return books

# download one book from project gutenberg
def download_book(book_id, save_path):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content
    data = download_url(url)
    if data is None:
        return f'Failed to download {url}'
    # create local path
    save_file = join(save_path, f'{book_id}.txt')
    # save book to file
    with open(save_file, 'wb') as file:
        file.write(data)
    return f'Saved {save_file}'

# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book concurrently
    with ThreadPoolExecutor(len(book_ids)) as exe:
        futures = [exe.submit(download_book, book, save_path) for book in book_ids]
        # report results as books are downloaded
        for future in as_completed(futures):
            # report the result messages
            print(future.result())

# entry point
URL = 'https://www.gutenberg.org/browse/scores/top'
DIR = 'books'
# download top books
download_all_books(URL, DIR)

Running the program first collects all book IDs from the list of top books webpage then downloads each, just like before.

Unlike the previous example, all downloads occur concurrently with one download thread per book identifier.

The process is significantly faster, completing in about 15.4 seconds on my system, compared to 312.6 seconds for the serial case. That is a speedup of about 20 times.

.downloaded https://www.gutenberg.org/browse/scores/top
.found 675 links on the page
.found 120 unique book ids
Failed to download https://www.gutenberg.org/files/8800/8800-0.txt
Failed to download https://www.gutenberg.org/files/43453/43453-0.txt
Failed to download https://www.gutenberg.org/files/3296/3296-0.txt
Saved books/58585.txt
Failed to download https://www.gutenberg.org/files/5739/5739-0.txt
Failed to download https://www.gutenberg.org/files/32992/32992-0.txt
Saved books/64317.txt
Saved books/4517.txt
Saved books/66746.txt
Saved books/2814.txt
Failed to download https://www.gutenberg.org/files/27827/27827-0.txt
Saved books/1250.txt
Saved books/1934.txt
Saved books/32449.txt
Saved books/902.txt
Failed to download https://www.gutenberg.org/files/4363/4363-0.txt
Saved books/16.txt
Failed to download https://www.gutenberg.org/files/42324/42324-0.txt
Saved books/41445.txt
Saved books/41.txt
Failed to download https://www.gutenberg.org/files/7370/7370-0.txt
Saved books/3825.txt
Saved books/113.txt
Saved books/514.txt
Saved books/236.txt
Saved books/730.txt
Saved books/844.txt
...

Theoretically, the program should take about as long as the slowest download, e.g. perhaps 5-10 seconds. The fact that it is longer might mean that the server is imposing some rate limiting, that saving files concurrently is a burden on the hard drive, or that the threads are context switching.

It might be interesting to explore further improvements, such as saving files sequentially in the main thread rather than in each worker thread, to see whether this would be faster.
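As a hedged sketch of that idea, the helper below assumes the worker task is changed to return raw (book_id, data) pairs instead of saving files, so the main thread performs all file writes; the function name save_books() is illustrative, not part of the tutorial's code.

```python
import tempfile
from os.path import join

# save downloaded books sequentially in the main thread,
# given (book_id, data) pairs produced by the worker threads
def save_books(results, save_path):
    saved = []
    for book_id, data in results:
        # skip failed downloads
        if data is None:
            continue
        save_file = join(save_path, f'{book_id}.txt')
        with open(save_file, 'wb') as file:
            file.write(data)
        saved.append(save_file)
    return saved

# demo with fake download results, using a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_books([('1', b'abc'), ('2', None)], tmp)
    print(len(saved))  # 1
```

In download_all_books(), the loop over as_completed(futures) would then collect the (book_id, data) pairs from future.result() and pass them to save_books(), so that only downloading runs in the worker threads.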

Extensions

This section lists ideas for extending the tutorial.

  1. Vary the number of worker threads in the pool and compare the overall download time.
  2. Retry failed downloads using an alternate URL structure.
  3. Save downloaded files sequentially in the main thread and measure whether it is faster.

Share your extensions in the comments below; it would be great to see what you come up with.

Takeaways

In this tutorial, you discovered how to concurrently download books from Project Gutenberg in Python.



If you enjoyed this tutorial, you will love my book: Python ThreadPoolExecutor Jump-Start. It covers everything you need to master the topic with hands-on examples and clear explanations.