Last Updated on September 12, 2022
The ThreadPoolExecutor class in Python can be used to download multiple books from Project Gutenberg at the same time.
This can dramatically speed-up the process compared to downloading books sequentially, one by one.
In this tutorial, you will discover how to concurrently download books from Project Gutenberg in Python.
After completing this tutorial, you will know:
- How to download all of the top books from Project Gutenberg sequentially and how slow it can be.
- How to use the ThreadPoolExecutor to manage a pool of worker threads.
- How to update the code to download multiple books at the same time and dramatically accelerate the process.
Let’s dive in.
Download Books One-by-One (slowly)
Project Gutenberg is an amazing resource providing a library of over 60,000 ebooks for free.
This includes a range of widely known classic books such as Frankenstein, Pride and Prejudice, Alice’s Adventures in Wonderland, and so much more.
These books can be downloaded and read for pleasure, but they can also be used as the basis for interesting text-based analysis projects.
Downloading a large number of books from the website manually is a slow process as you must visit the webpage for each book, select a format, then download it.
We can automate this process with a Python script that downloads books from the website for us.
Perhaps a simple approach to this task would be to first download the webpage of the most popular books on Project Gutenberg, retrieve a list of these books (via their book ids) then download each book.
There are a few parts to this program; for example:
- Download the Top Books Webpage
- Parse HTML for Links
- Extract Book Identifiers
- Download a Single Book via Its Identifier
- Complete Example
Let’s look at each piece in turn.
Note: we are only going to implement the most basic error handling as we are focused on adapting the program for concurrency.
Download the Top Books Webpage
The first step is to download the page that lists the top books on Project Gutenberg.
This page is available here: https://www.gutenberg.org/browse/scores/top
First, let’s download this webpage so we can parse it for useful data.
There are many ways to download a URL in Python. In this case, we will use the urlopen() function to open a connection to the URL and call the read() function to download the contents of the file into memory.
To ensure the connection is closed automatically once we are finished downloading, we will use a context manager, e.g. the with keyword.
You can learn more about opening and reading from URL connections in the Python API documentation for urllib.request.
We will set a timeout on the download of a few seconds. If a connection cannot be made within the specified timeout, an exception is raised. We can catch this exception and return None instead of the contents of the file to indicate to the caller that an error occurred.
The download_url() function below implements this, taking a URL and returning the contents of the file.
# download a file from a URL, returns content of downloaded file
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the url as bytes and return it
            return connection.read()
    except:
        return None
We will use this function to download the webpage of top books from Project Gutenberg, but we can also re-use it to download each book when the time comes.
Next, let’s look at parsing the webpage of top books.
Parse HTML for Links
Once the list of top books webpage has been downloaded, we must parse it and extract all of the links, some of which will be links to books.
I recommend the BeautifulSoup Python library anytime HTML documents need to be parsed.
If you’re new to BeautifulSoup, you can install it easily with your Python package manager, such as pip:
pip3 install beautifulsoup4
First, we must decode the downloaded raw bytes into text.
This can be achieved by calling the decode() function on the raw data and specifying a standard text encoding, such as UTF-8.
...
# decode the provided content as utf-8 text
html = content.decode('utf-8')
Next, we can parse the text of the HTML document using BeautifulSoup with the default parser. It can be a good idea to use a more sophisticated parser that works the same on all platforms, but in this case we will use the default parser as it does not require you to install anything extra.
...
# parse the document as best we can
soup = BeautifulSoup(html, 'html.parser')
We can then retrieve all link HTML tags from the document as these will contain the URLs for the files we wish to download.
...
# find all of the <a href=""> tags in the document
atags = soup.find_all('a')
We can then iterate through all of the found tags and retrieve the contents of the href property on each, e.g. get the links to files from each tag.
This can be done in a list comprehension, giving us a list of URLs to validate.
...
# get all links from a tags
return [tag.get('href') for tag in atags]
Tying this all together, the get_urls_from_html() function below takes the downloaded HTML content and returns a list of URLs to validate.
# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]
We will use this function to extract all of the links from the list of top books.
Next, we can extract just the book identifiers from these links.
Extract Book Identifiers
We have a list of links extracted from the webpage of the most popular books on Project Gutenberg.
Reviewing the source code for the page, we can see that generally, books are linked using a URL structure like ‘/ebooks/nnn‘ where ‘nnn‘ is the book identifier.
We can collect all of these book identifiers and download each book in turn using a standard URL structure on the website.
Given this ‘/ebooks/nnn‘ structure of links to books, we can define a regular expression and find all links that match this structure.
...
# define a url pattern we are looking for
pattern = re.compile('/ebooks/[0-9]+')
We can then enumerate all links extracted from the webpage and skip all of those that do not match this structure.
...
for link in links:
    # check if the link matches the pattern
    if not pattern.match(link):
        continue
For those links that do match, we can extract just the book identifier. This is any integer characters after the last ‘/‘ in the URL, or simply everything after the eight characters that make up the string ‘/ebooks/‘. For example:
...
# extract the book id from /ebooks/nnn
book_id = link[8:]
We don’t want to download the same book more than once; that would be a waste of time and effort.
Therefore, we can store all of the extracted book identifiers in a set that ensures only unique identifiers are stored.
Tying this all together, the get_book_identifiers() function below will take a list of raw URLs extracted from the list of top books webpage and return a set of unique book identifiers.
# return all unique book identifiers from a list of raw links
def get_book_identifiers(links):
    # define a url pattern we are looking for
    pattern = re.compile('/ebooks/[0-9]+')
    # process the list of links for those that match the pattern
    books = set()
    for link in links:
        # check if the link matches the pattern
        if not pattern.match(link):
            continue
        # extract the book id from /ebooks/nnn
        book_id = link[8:]
        # store in the set, only keep unique ids
        books.add(book_id)
    return books
Download a Single Book via Its Identifier
If we click a few books on the site, we can quickly see that a common website structure is used.
Each book has a webpage that uses its book identifier; for example: ‘https://www.gutenberg.org/ebooks/nnn‘ where nnn is the book id.
If we try to download the ASCII version of any book, we get a URL that looks like ‘https://www.gutenberg.org/files/nnn/nnn-0.txt‘, again, where nnn is the book identifier.
Therefore, we can create a string template using this direct download link and substitute each book identifier we have collected to create a download link for each top book on the website.
For example:
...
# construct the download url
url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
We can then download this book in ASCII format using our download_url() function developed above.
...
# download the content
data = download_url(url)
Finally, we can save the downloaded file locally.
Given a local directory where we wish to save books (e.g. ‘./books/‘), we can first construct a local path for saving the book by joining the save path with the filename, then use this path to save the contents of the book file.
...
# create local path
save_file = join(save_path, f'{book_id}.txt')
# save book to file
with open(save_file, 'wb') as file:
    file.write(data)
Tying this together, the download_book() function below takes a book identifier and a local directory save path and will download and save the book and return a message of either success or failure.
# download one book from project gutenberg
def download_book(book_id, save_path):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content
    data = download_url(url)
    if data is None:
        return f'Failed to download {url}'
    # create local path
    save_file = join(save_path, f'{book_id}.txt')
    # save book to file
    with open(save_file, 'wb') as file:
        file.write(data)
    return f'Saved {save_file}'
Finally, we can put this all together into a complete example.
Complete Example
We now have all of the elements to download all of the top books from Project Gutenberg.
First, the web page that lists the top books is downloaded, then parsed for links. All unique book identifiers are retrieved from the list of links, then each book is downloaded in turn.
The download_all_books() function below ties these elements together, taking the URL of the list of top books websites and a save path as arguments and reporting progress as each book is downloaded.
Importantly, the output path directory is created as needed using the makedirs() function.
# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book in turn
    for book_id in book_ids:
        # download and save this book
        result = download_book(book_id, save_path)
        # report result
        print(result)
We can then call this function with the URL to the list of top books and specify a ‘books‘ subdirectory of the current working directory as the location to save the downloaded books.
...
# entry point
URL = 'https://www.gutenberg.org/browse/scores/top'
DIR = 'books'
# download top books
download_all_books(URL, DIR)
Tying this together, the complete example is listed below.
# SuperFastPython.com
# download all the top books from gutenberg.org
from os.path import join
from os import makedirs
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# download html and return the raw content or None on error
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the html doc
            return connection.read()
    except:
        # bad url, socket timeout, http forbidden, etc.
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# return all unique book identifiers from a list of raw links
def get_book_identifiers(links):
    # define a url pattern we are looking for
    pattern = re.compile('/ebooks/[0-9]+')
    # process the list of links for those that match the pattern
    books = set()
    for link in links:
        # check if the link matches the pattern
        if not pattern.match(link):
            continue
        # extract the book id from /ebooks/nnn
        book_id = link[8:]
        # store in the set, only keep unique ids
        books.add(book_id)
    return books

# download one book from project gutenberg
def download_book(book_id, save_path):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content
    data = download_url(url)
    if data is None:
        return f'Failed to download {url}'
    # create local path
    save_file = join(save_path, f'{book_id}.txt')
    # save book to file
    with open(save_file, 'wb') as file:
        file.write(data)
    return f'Saved {save_file}'

# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book in turn
    for book_id in book_ids:
        # download and save this book
        result = download_book(book_id, save_path)
        # report result
        print(result)

# entry point
URL = 'https://www.gutenberg.org/browse/scores/top'
DIR = 'books'
# download top books
download_all_books(URL, DIR)
Running the example first downloads the webpage with the list of the most popular books.
First, the HTML document is parsed and all links are extracted, then all book IDs are extracted from the links.
Finally, each book is downloaded one-by-one.
In this case, we can see that 675 links were extracted from the HTML document and a total of 120 unique book identifiers were retrieved.
It takes a long time to download (or attempt to download) all 120 books. Not all books can be downloaded; many fail, perhaps because the book downloads don’t conform to the expected link structure.
Nevertheless, in my case, 100 books are successfully downloaded, and it takes about 312.6 seconds to complete (just over 5 minutes).
How many books did you download and how long did it take?
Let me know in the comments below.
.downloaded https://www.gutenberg.org/browse/scores/top
.found 675 links on the page
.found 120 unique book ids
Failed to download https://www.gutenberg.org/files/768/768-0.txt
Saved books/1998.txt
Failed to download https://www.gutenberg.org/files/63256/63256-0.txt
Failed to download https://www.gutenberg.org/files/1399/1399-0.txt
Saved books/209.txt
Saved books/41.txt
Saved books/2600.txt
Saved books/3207.txt
Saved books/4517.txt
Saved books/203.txt
Saved books/66743.txt
Saved books/36.txt
Saved books/1934.txt
Saved books/1080.txt
Saved books/46.txt
Saved books/408.txt
Saved books/66749.txt
...
Next, we will look at the ThreadPoolExecutor class that can be used to create a pool of worker threads that will allow us to speed up this download process.
How to Create a Pool of Worker Threads with ThreadPoolExecutor
We can use the ThreadPoolExecutor to speed up the process of downloading multiple books.
The ThreadPoolExecutor class is provided as part of the concurrent.futures module for easily running concurrent tasks.
The ThreadPoolExecutor provides a pool of worker threads, which is different from the ProcessPoolExecutor that provides a pool of worker processes.
Generally, ThreadPoolExecutor should be used for concurrent IO-bound tasks, like downloading URLs, and the ProcessPoolExecutor should be used for concurrent CPU-bound tasks, like calculating.
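For example, the two classes share the same Executor interface, so the calling code looks almost identical; only the type of worker changes. The sketch below is illustrative only; the fetch() and crunch() task functions are made-up placeholders, not part of the book-downloading program.

# sketch: the same submit()/result() API works for both executor types
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
from urllib.request import urlopen

# an io-bound task: download a url (a good fit for a thread pool)
def fetch(url):
    with urlopen(url, timeout=3) as connection:
        return connection.read()

# a cpu-bound task: sum a range of squares (a good fit for a process pool)
def crunch(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # io-bound work in a pool of threads
    with ThreadPoolExecutor() as executor:
        future = executor.submit(fetch, 'https://www.gutenberg.org/')
        print(len(future.result()))
    # cpu-bound work in a pool of processes, using the same calls
    with ProcessPoolExecutor() as executor:
        future = executor.submit(crunch, 10_000_000)
        print(future.result())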
The ThreadPoolExecutor was designed to be easy and straightforward to use. It is like the “automatic mode” for Python threads. The general usage pattern is as follows:
- Create the thread pool by calling ThreadPoolExecutor().
- Submit tasks and get futures by calling submit().
- Wait and get results as tasks complete by calling as_completed().
- Shut down the thread pool by calling shutdown().
Create the Thread Pool
First, a ThreadPoolExecutor instance must be created. By default (in Python 3.8 and later), it will create a pool with min(32, os.cpu_count() + 4) worker threads. This is good for most purposes.
...
# create a thread pool with the default number of worker threads
executor = ThreadPoolExecutor()
You can run tens to hundreds of concurrent IO-bound threads per CPU, although perhaps not thousands or tens of thousands. You can specify the number of threads to create in the pool via the max_workers argument; for example:
...
# create a thread pool with 10 worker threads
executor = ThreadPoolExecutor(max_workers=10)
Submit Tasks to the Thread Pool
Once created, we can send tasks into the pool to be completed using the submit() function.
This function takes the function to call and any arguments, and returns a Future object.
The Future object is a promise to return the results from the task (if any) and provides a way to determine if a specific task has been completed or not.
...
# submit a task and get a future object
future = executor.submit(task, arg1, arg2, ...)
The return value from a function executed by the thread pool can be accessed via the result() function on the Future object. It will wait until the result is available, if needed, or return immediately if the task has already completed.
For example:
...
# get the result from a future
result = future.result()
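Because result() will block, it can also be useful to check whether a task has finished without waiting. A minimal sketch using the done() method on the Future object:

...
# check whether the task has finished without blocking
if future.done():
    # the result is available immediately
    result = future.result()
else:
    # the task is still running or waiting to run
    print('Task not finished yet')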
Get Results as Tasks Complete
The beauty of performing tasks concurrently is that we can get results as they become available, rather than waiting for tasks to be completed in the order they were submitted.
The concurrent.futures module provides an as_completed() function that we can use to get results for tasks as they are completed, just like its name suggests.
We can call the function and provide it a list of Future objects created by calling submit(), and it will yield Future objects in the order that the tasks complete, whatever that order happens to be.
For example, we can use a list comprehension to submit the tasks and create the list of Future objects:
...
# submit all tasks into the thread pool and create a list of futures
futures = [executor.submit(task, item) for item in items]
Then get results for tasks as they complete in a for loop:
...
# iterate over all submitted tasks and get results as they are available
for future in as_completed(futures):
    # get the result
    result = future.result()
    # do something with the result...
Shut Down the Thread Pool
Once all tasks are completed, we can shut down the thread pool, which will release each thread and any resources it may hold (e.g. the stack space).
...
# shutdown the thread pool
executor.shutdown()
An easier way to use the thread pool is via the context manager (the “with” keyword), which ensures it is closed automatically once we are finished with it.
...
# create a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
    # submit tasks
    futures = [executor.submit(task, item) for item in items]
    # get results as they are available
    for future in as_completed(futures):
        # get the result
        result = future.result()
        # do something with the result...
Now that we are familiar with ThreadPoolExecutor and how to use it, let’s look at how we can adapt our program for downloading books to make use of it.
Download Multiple Books Concurrently
The program for downloading books can be adapted to use the ThreadPoolExecutor with very little change.
The download_book() function can be called separately for each book identifier in a separate thread. We can use the submit() function on the ThreadPoolExecutor to call the function, then report the results of each download as the downloads complete using the as_completed() function.
This can be achieved with a minor change to the end of the download_all_books() function used to drive the download process.
Firstly, we can create the thread pool with one thread per book to be downloaded.
...
# create the thread pool, with one worker per book to download
with ThreadPoolExecutor(len(book_ids)) as exe:
    # ...
Next, we can call submit() for each book identifier and collect the Future objects that are returned.
...
futures = [exe.submit(download_book, book, save_path) for book in book_ids]
Finally, we can loop over tasks as they complete and report the success or failure message directly.
...
# report results as books are downloaded
for future in as_completed(futures):
    # report the result messages
    print(future.result())
And that’s all there is to it; the updated version of the download_all_books() function that makes use of the ThreadPoolExecutor to download books concurrently is listed below.
# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book concurrently
    with ThreadPoolExecutor(len(book_ids)) as exe:
        futures = [exe.submit(download_book, book, save_path) for book in book_ids]
        # report results as books are downloaded
        for future in as_completed(futures):
            # report the result messages
            print(future.result())
Tying this together, the complete example of the concurrent version of the book downloading program is listed below.
# SuperFastPython.com
# download all the top books from gutenberg.org concurrently
import re
from os.path import join
from os import makedirs
from urllib.request import urlopen
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
from bs4 import BeautifulSoup

# download html and return the raw content or None on error
def download_url(urlpath):
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # read the contents of the html doc
            return connection.read()
    except:
        # bad url, socket timeout, http forbidden, etc.
        return None

# decode downloaded html and extract all <a href=""> links
def get_urls_from_html(content):
    # decode the provided content as utf-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a href=""> tags in the document
    atags = soup.find_all('a')
    # get all links from a tags
    return [tag.get('href') for tag in atags]

# return all unique book identifiers from a list of raw links
def get_book_identifiers(links):
    # define a url pattern we are looking for
    pattern = re.compile('/ebooks/[0-9]+')
    # process the list of links for those that match the pattern
    books = set()
    for link in links:
        # check if the link matches the pattern
        if not pattern.match(link):
            continue
        # extract the book id from /ebooks/nnn
        book_id = link[8:]
        # store in the set, only keep unique ids
        books.add(book_id)
    return books

# download one book from project gutenberg
def download_book(book_id, save_path):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content
    data = download_url(url)
    if data is None:
        return f'Failed to download {url}'
    # create local path
    save_file = join(save_path, f'{book_id}.txt')
    # save book to file
    with open(save_file, 'wb') as file:
        file.write(data)
    return f'Saved {save_file}'

# download all top books from project gutenberg
def download_all_books(url, save_path):
    # download the page that lists top books
    data = download_url(url)
    print(f'.downloaded {url}')
    # extract all links from the page
    links = get_urls_from_html(data)
    print(f'.found {len(links)} links on the page')
    # retrieve all unique book ids
    book_ids = get_book_identifiers(links)
    print(f'.found {len(book_ids)} unique book ids')
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download and save each book concurrently
    with ThreadPoolExecutor(len(book_ids)) as exe:
        futures = [exe.submit(download_book, book, save_path) for book in book_ids]
        # report results as books are downloaded
        for future in as_completed(futures):
            # report the result messages
            print(future.result())

# entry point
URL = 'https://www.gutenberg.org/browse/scores/top'
DIR = 'books'
# download top books
download_all_books(URL, DIR)
Running the program first collects all book IDs from the list of top books webpage then downloads each, just like before.
Unlike the previous example, all downloads occur concurrently with one download thread per book identifier.
The process is significantly faster, completing in about 15.4 seconds on my system, compared to 312.6 seconds for the serial case. That is a speedup of about 20 times.
.downloaded https://www.gutenberg.org/browse/scores/top
.found 675 links on the page
.found 120 unique book ids
Failed to download https://www.gutenberg.org/files/8800/8800-0.txt
Failed to download https://www.gutenberg.org/files/43453/43453-0.txt
Failed to download https://www.gutenberg.org/files/3296/3296-0.txt
Saved books/58585.txt
Failed to download https://www.gutenberg.org/files/5739/5739-0.txt
Failed to download https://www.gutenberg.org/files/32992/32992-0.txt
Saved books/64317.txt
Saved books/4517.txt
Saved books/66746.txt
Saved books/2814.txt
Failed to download https://www.gutenberg.org/files/27827/27827-0.txt
Saved books/1250.txt
Saved books/1934.txt
Saved books/32449.txt
Saved books/902.txt
Failed to download https://www.gutenberg.org/files/4363/4363-0.txt
Saved books/16.txt
Failed to download https://www.gutenberg.org/files/42324/42324-0.txt
Saved books/41445.txt
Saved books/41.txt
Failed to download https://www.gutenberg.org/files/7370/7370-0.txt
Saved books/3825.txt
Saved books/113.txt
Saved books/514.txt
Saved books/236.txt
Saved books/730.txt
Saved books/844.txt
...
Theoretically, the program should take about as long as the slowest download, e.g. perhaps 5-10 seconds. The fact that it is longer might mean that the server is imposing some rate limiting, that saving files concurrently is a burden on the hard drive, or that the threads are context switching.
It might be interesting to explore further improvements, such as saving files sequentially in the main thread rather than in each worker thread and whether this would be faster.
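As a starting point for that experiment, one possible structure is sketched below. It is only a sketch, not a tested improvement: the download_book_data() helper is a hypothetical variant of download_book() that returns the raw bytes instead of saving them, and the other functions are reused from the complete example above.

# sketch: download in worker threads, save files sequentially in the main thread
# note: download_book_data() is a hypothetical variant of download_book()
def download_book_data(book_id):
    # construct the download url
    url = f'https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt'
    # download the content, may be None on failure
    return book_id, download_url(url)

# download concurrently, but perform all file writing in the main thread
def download_all_books_save_in_main(url, save_path):
    # collect the unique book ids as before
    links = get_urls_from_html(download_url(url))
    book_ids = get_book_identifiers(links)
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # one worker thread per book, as in the concurrent example
    with ThreadPoolExecutor(len(book_ids)) as exe:
        futures = [exe.submit(download_book_data, book_id) for book_id in book_ids]
        # save each book in the main thread as its download completes
        for future in as_completed(futures):
            book_id, data = future.result()
            if data is None:
                print(f'Failed to download book {book_id}')
                continue
            save_file = join(save_path, f'{book_id}.txt')
            with open(save_file, 'wb') as file:
                file.write(data)
            print(f'Saved {save_file}')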
Extensions
This section lists ideas for extending the tutorial.
- Fix Broken Downloads. Investigate why some books cannot be downloaded and update the program accordingly so all popular books can be downloaded.
- Save Files in the Main Thread. Update the example to save all files in the main thread rather than worker threads and report any speed difference.
- Tune The Number of Worker Threads. Test with different numbers of worker threads to see if performance can be improved and in turn whether thread context switching is degrading performance.
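As a starting point for the last extension, the sketch below wraps the concurrent download in a simple timer so different pool sizes can be compared. It assumes the functions and constants from the complete example above are already defined, and the timings it prints will of course vary with your network connection.

# sketch: compare the download time for different numbers of worker threads
from time import perf_counter

def time_downloads(url, save_path, n_workers):
    # start the clock
    start = perf_counter()
    # collect the unique book ids
    links = get_urls_from_html(download_url(url))
    book_ids = get_book_identifiers(links)
    # create the save directory if needed
    makedirs(save_path, exist_ok=True)
    # download all books using a fixed-size pool
    with ThreadPoolExecutor(n_workers) as exe:
        futures = [exe.submit(download_book, book_id, save_path) for book_id in book_ids]
        for future in as_completed(futures):
            future.result()
    # report how long this pool size took
    duration = perf_counter() - start
    print(f'{n_workers} workers: {duration:.1f} seconds')

# try a few pool sizes
for n_workers in [10, 20, 50, 100]:
    time_downloads(URL, DIR, n_workers)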
Share your extensions in the comments below; it would be great to see what you come up with.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ThreadPoolExecutor Jump-Start, Jason Brownlee, (my book!)
- Concurrent Futures API Interview Questions
- ThreadPoolExecutor Class API Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ThreadPoolExecutor: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
- Python Threading: The Complete Guide
- Python ThreadPool: The Complete Guide
Takeaways
In this tutorial, you discovered how to concurrently download books from Project Gutenberg in Python.
- How to download all of the top books from Project Gutenberg sequentially and how slow it can be.
- How to use the ThreadPoolExecutor to manage a pool of worker threads.
- How to update the code to download multiple books at the same time and dramatically accelerate the process.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Gao says
Thanks, a great article. It taught me a lot about threads. But I am confused by this code: “for future in as_completed(futures):”. It seems like an iterator; could I traverse it and add something to it (I mean a newly finished thread, maybe) at the same time?
Jason Brownlee says
The as_completed() function will return futures in the order that the tasks are completed. For example:
https://superfastpython.com/threadpoolexecutor-wait-vs-as-completed/
I believe you cannot add additional tasks to the list passed to the function once called. Interesting ideas though.
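One pattern that gets close to this is to manage the set of futures yourself with the wait() function and the FIRST_COMPLETED option, submitting new tasks as earlier ones finish. A minimal sketch, where task() is just a placeholder:

# sketch: grow the set of pending tasks while waiting on results
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# placeholder task that may produce follow-up work
def task(value):
    return value * 2

with ThreadPoolExecutor(10) as executor:
    # submit an initial batch of tasks
    pending = {executor.submit(task, i) for i in range(1, 6)}
    while pending:
        # wait for at least one task to complete
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            result = future.result()
            print(result)
            # submit a new task based on the result, if more work is needed
            if result < 20:
                pending.add(executor.submit(task, result))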