You can download webpages asynchronously using asyncio in Python.
In this tutorial, you will discover how to download an HTML webpage over HTTP and HTTPS using asyncio.
Let’s get started.
How to Download a Webpage with Asyncio
The asyncio module provides support for opening socket connections and reading and writing data via streams.
We can use this capability to download webpages.
This involves four steps:
- 1. Open a connection
- 2. Write a request
- 3. Read a response
- 4. Close the connection
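Taken together, the four steps can be sketched as a single coroutine. This is a minimal sketch, not the worked example developed below; the build_request() helper is an assumption for illustration.

```python
import asyncio

def build_request(path, host):
    # construct a minimal HTTP/1.1 GET request string
    return f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'

async def fetch(host, path='/'):
    # 1. open a connection
    reader, writer = await asyncio.open_connection(host, 80)
    # 2. write a request
    writer.write(build_request(path, host).encode())
    await writer.drain()
    # 3. read a response (first line only, for brevity)
    status = await reader.readline()
    # 4. close the connection
    writer.close()
    await writer.wait_closed()
    return status.decode().strip()
```

It could be run with, for example, asyncio.run(fetch('www.google.com')).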
Let’s take a closer look at each part in turn.
Open HTTP Connection
A connection can be opened in asyncio using the asyncio.open_connection() function.
Establish a network connection and return a pair of (reader, writer) objects.
— Asyncio Streams
Among many arguments, the function takes a string hostname and an integer port number.
This is a coroutine that must be awaited and returns a StreamReader and a StreamWriter for reading and writing with the socket.
This can be used to open an HTTP connection on port 80.
For example:
```python
...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 80)
```
We can also open an SSL connection using the ssl=True argument. This can be used to open an HTTPS connection on port 443.
For example:
```python
...
# open a socket connection with SSL
reader, writer = await asyncio.open_connection('www.google.com', 443, ssl=True)
```
Write HTTP Request
Once open, we can write a query to the StreamWriter to make an HTTP request.
For example, an HTTP version 1.1 request is in plain text. We can request the file path ‘/’, which may look as follows:
```
GET / HTTP/1.1
Host: www.google.com
```
Importantly, there must be a carriage return and a line feed (\r\n) at the end of each line, and an empty line at the end.
As Python strings this may look as follows:
```python
'GET / HTTP/1.1\r\n'
'Host: www.google.com\r\n'
'\r\n'
```
You can learn more about HTTP v1.1 request messages here:
This string must be encoded as bytes before being written to the StreamWriter.
Represents a writer object that provides APIs to write data to the IO stream.
— Asyncio Streams
This can be achieved using the encode() method on the string itself.
The default ‘utf-8‘ encoding may be sufficient.
For example:
```python
...
# encode string as bytes
byte_data = string.encode()
```
You can see a listing of encodings here:
The bytes can then be written to the socket using the write() method on the StreamWriter.
For example:
```python
...
# write query to socket
writer.write(byte_data)
```
After writing the request, it is a good idea to wait for the byte data to be sent and for the socket to be ready.
This can be achieved by the drain() method.
Wait until it is appropriate to resume writing to the stream.
— Asyncio Streams
This is a coroutine that must be awaited.
For example:
```python
...
# wait for the socket to be ready
await writer.drain()
```
Read HTTP Response
Once the HTTP request has been made, we can read the response.
This can be achieved via the StreamReader for the socket.
Represents a reader object that provides APIs to read data from the IO stream.
— Asyncio Streams
The response can be read using the read() method which will read a chunk of bytes, or the readline() method which will read one line of bytes.
We might prefer the readline() method because HTTP is a text-based protocol that sends its data one line at a time.
The readline() method is a coroutine and must be awaited.
For example:
```python
...
# read one line of response
line_bytes = await reader.readline()
```
HTTP 1.1 responses are composed of two parts: a header, then an empty line, then the body.
The header has information about whether the request was successful and what type of file will be sent, and the body contains the content of the file, such as an HTML webpage.
You can learn more about HTTP v1.1 responses here:
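To make this structure concrete, here is a small sketch that splits a raw response into its header and body on the first empty line. The sample response text is invented for illustration.

```python
# a made-up raw HTTP response for illustration
raw = ('HTTP/1.1 200 OK\r\n'
       'Content-Type: text/html\r\n'
       '\r\n'
       '<html><body>Hello</body></html>\r\n')

# the header ends at the first empty line (\r\n\r\n)
header, _, body = raw.partition('\r\n\r\n')
# the first header line is the status line
print(header.splitlines()[0])  # → HTTP/1.1 200 OK
```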
Each line must be decoded from bytes into a string.
This can be achieved using the decode() method on the byte data. Again, the default encoding is ‘utf_8‘.
For example:
```python
...
# decode bytes into a string
line_data = line_bytes.decode()
```
We can continue to read lines in a loop until an empty line (containing only the '\r\n' characters) is returned. This will read the header. We can then do the same thing in order to read the body.
For example:
```python
# read the header
while True:
    # read one line of response
    line_bytes = await reader.readline()
    # decode bytes into a string
    line_data = line_bytes.decode()
    # remove white space
    line_data = line_data.strip()
    # check if empty
    if not line_data:
        break
```
Helpfully, the StreamReader implements the interface for the asynchronous iterator.
We can see this if we look in the source code for the StreamReader. It implements the __anext__() method.
This capability does not appear to be described in the API documentation, at least at the time of writing.
This will call the readline() method repeatedly until an empty bytes object is read, signaling the end of the file.
This means we can read lines from the StreamReader using the “async for” expression.
For example:
```python
...
# read the header
async for line_bytes in reader:
    # decode bytes into a string
    line_data = line_bytes.decode()
    # remove white space
    line_data = line_data.strip()
```
This will automatically suspend the caller and await each line.
You can learn more about the “async for” expression in the tutorial:
Close HTTP Connection
We can close the socket connection by closing the StreamWriter.
This can be achieved by calling the close() method.
For example:
```python
...
# close the connection
writer.close()
```
This does not block and may not close the socket immediately.
We can wait for the socket to close completely.
This can be achieved using the wait_closed() method.
This is a coroutine that we can await.
For example:
```python
...
# wait for the socket to close completely
await writer.wait_closed()
```
Now that we know how to download a webpage asynchronously using asyncio, let’s look at some worked examples.
Example of Downloading HTML over HTTP in Asyncio
We can develop an example to download an HTML webpage over HTTP using asyncio.
Let’s dive in.
Write HTTP Query
We can develop a coroutine that will issue an HTTP GET.
The coroutine will take three arguments: the StreamWriter for writing the request to the socket, the path of the file to download, and the hostname.
```python
# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # ...
```
The first step is to construct the HTTP request using v1.1 of the protocol.
This can be achieved in a single line.
```python
...
# define request
query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
```
The request string can then be encoded to bytes using the UTF-8 encoding.
```python
...
# encode request as bytes
query = query.encode()
```
The byte string can then be written to the socket via the StreamWriter.
```python
...
# write query to socket
writer.write(query)
```
Finally, we can drain the socket, waiting until it is ready for further use.
```python
...
# wait for the bytes to be written to the socket
await writer.drain()
```
Tying this together, the complete http_get() coroutine is listed below.
```python
# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()
```
Next, we can read the response.
Read HTTP Query Results
We can develop a coroutine to read a response from the server.
This coroutine will take a StreamReader as an argument.
```python
# coroutine to read an http block from a socket
async def http_read(reader):
    # ...
```
It will read lines from the StreamReader until an empty line is read, after which it will return all lines as a single string with newline separators.
We will do this using a loop, although a more Pythonic approach might use a list comprehension.
Firstly, we can initialize the target string that will be used to store the complete response.
```python
...
data = ''
```
We can then use an “async for” expression to read lines from the StreamReader.
This will automatically await each line for us.
```python
...
# read all data, decode and add to string
async for line in reader:
    # ...
```
Each line can then be decoded using UTF-8 encoding, the default for converting bytes to a string.
```python
...
# decode into a string
line = line.decode()
```
We can then strip the white space from the string, leaving only the content.
```python
...
# strip white space
line = line.strip()
```
If there is no content, it is likely the end of a block, such as the end of a header or the end of the body.
Therefore, we can stop reading content from the socket.
```python
...
# check for end of block
if not line:
    return data
```
Otherwise, we can append the line to our response string.
```python
...
# store line
data += line + '\n'
```
Tying this together, the complete http_read() coroutine is listed below.
```python
# coroutine to read a http block from a socket
async def http_read(reader):
    data = ''
    # read all data, decode and add to string
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'
```
Next, we can open a connection and tie the asyncio program together.
Open Connection
We can develop a coroutine to open a connection, make the request and read the response.
First, we can define the host and file we wish to download.
We will download the Google homepage in this example as it’s likely to always be there and accessible to all readers.
```python
...
# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
```
Next, we can open a connection to the host on port 80, the standard port for HTTP web server socket connections.
```python
...
# open the connection
reader, writer = await asyncio.open_connection(host, 80)
print(f'Connected to {host}')
```
This returns a StreamReader and StreamWriter that we can use to interact with the host.
Next, we can issue the request to GET the target file.
This will involve awaiting our http_get() coroutine.
```python
...
# issue the request
await http_get(writer, path, host)
print(f'Issued GET for {path}')
```
Next, we can read the header of the HTTP response.
This involves awaiting our http_read() coroutine.
In this case, we will report the contents of the header, out of interest.
```python
...
# read the header
header = await http_read(reader)
# report the header
print(header)
```
We will then read the body of the response and report the number of characters read. This is a rough approximation because we strip white space from each line and add some back when processing the response.
Again, we will use our http_read() coroutine.
```python
...
# read the body
body = await http_read(reader)
# report the details of the body
print(f'Body: {len(body)} characters')
```
Finally, we can close the connection.
This is achieved by closing the StreamWriter and waiting for it to close.
```python
...
# close writer stream
writer.close()
# wait for the writer to close
await writer.wait_closed()
```
Tying this together, the complete main() coroutine is listed below.
```python
# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 80)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()
    # wait for the writer to close
    await writer.wait_closed()
```
Complete Example
We can tie all of these elements together.
The complete example of downloading an HTML webpage over HTTP using asyncio is listed below.
```python
# SuperFastPython.com
# example of downloading a webpage over HTTP in asyncio
import asyncio

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

# coroutine to read a http block from a socket
async def http_read(reader):
    # read all data, decode and add to string
    data = ''
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 80)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()
    # wait for the writer to close
    await writer.wait_closed()

# entry point
asyncio.run(main())
```
Running the example first creates the main() coroutine and uses it as the entry point for the program.
The main() coroutine runs and defines the host and file to download.
The main() coroutine opens the connection, suspending while the socket connection is established on port 80.
Once ready, the coroutine returns a StreamReader and StreamWriter for interacting with the socket.
Next, the request is made. The main() coroutine calls our http_get() coroutine in order to make a request.
The http_get() coroutine runs, constructing and encoding the query, then writing the bytes to the socket. The http_get() coroutine then suspends and waits for the socket to be ready before returning.
The main() coroutine resumes and reads the header, calling our http_read() coroutine and suspending.
The http_read() coroutine runs, reading lines and appending them to a string until an empty line is read, likely signaling the end of a header or body block.
Each call to the readline() coroutine is awaited automatically using the “async for” expression on the StreamReader.
The http_read() coroutine terminates and the main() coroutine resumes, reporting the contents of the HTTP header.
This contains lots of interesting detail, such as the response status, content type, and cookies.
The body is then read using our http_read() coroutine and the number of characters in the content is reported.
Finally, the socket is closed and the program closes.
This highlights how we can write to and read from sockets in asyncio to download an HTML webpage using the HTTP protocol.
```
Connected to www.google.com
Issued GET for /
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 21:18:16 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-10-11-21; expires=Thu, 10-Nov-2022 21:18:16 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AakniGNyVHd3MuuDP6JuAtJRtKlAWwZEp04ju-OwpoymwbIp7w77Ieg1Cfs; expires=Sun, 09-Apr-2023 21:18:16 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=AT-bATDPaoz9wWo8BQKDycF49LdbCvtiW2-o7_Nwrd3guh3Clgae5kTv-tUqDHVVpRDmzsrWoRzjDVQ6stpkSacom7yv-5Dvaz0ZhhuK8tdKaElUwURML8DAW2spQMJfXHa31aizmwRk1iotwCHtQdmt1cwqulOx10DW6prWIjk; expires=Wed, 12-Apr-2023 21:18:16 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Body: 24446 characters
```
Next, let’s look at how we might update the example to download the same file using HTTPS.
Example of Downloading HTML over HTTPS in Asyncio
We can update the previous example to download a file using HTTPS instead of HTTP.
Recall that HTTPS is a secure version of the HTTP protocol that uses SSL to encrypt the plain text request and responses.
HTTPS socket connections are made on port 443 instead of port 80.
We can specify this port in the call to the asyncio.open_connection() function.
For example:
```python
...
# open the connection
reader, writer = await asyncio.open_connection(host, 443)
```
We must also specify that we want to use SSL when making the connection, i.e. HTTP over SSL.
This can be achieved by setting the “ssl” argument to True.
For example:
```python
...
# open the connection
reader, writer = await asyncio.open_connection(host, 443, ssl=True)
```
And that is all there is to it.
Except, there is one small issue.
There is a known bug that sometimes occurs when waiting for an SSL connection to close.
Specifically, this line:
```python
...
# wait for the writer to close
await writer.wait_closed()
```
This may raise an ssl.SSLError exception.
For example:
```
Traceback (most recent call last):
...
ssl.SSLError: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2672)
```
It is a known bug and not fixed at the time of writing:
We can avoid the exception by removing our call to wait_closed().
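Alternatively, we could keep the call and suppress the error. This is a sketch of one possible approach using the standard ssl module; the close_stream() helper name is an assumption for illustration.

```python
import asyncio
import ssl

async def close_stream(writer):
    # close the stream and tolerate the known SSL shutdown bug
    writer.close()
    try:
        # may raise ssl.SSLError on some platforms when using SSL
        await writer.wait_closed()
    except ssl.SSLError:
        # ignore application data arriving after close notify
        pass
```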
The complete example updated to download the HTML webpage using HTTPS instead of HTTP with asyncio is listed below.
```python
# SuperFastPython.com
# example of downloading a webpage over HTTPS in asyncio
import asyncio

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

# coroutine to read a http block from a socket
async def http_read(reader):
    # read all data, decode and add to string
    data = ''
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 443, ssl=True)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()

# entry point
asyncio.run(main())
```
Running the example downloads the header and body of the file from google.com as before.
In this case, it is downloaded using the secure HTTPS protocol, instead of the HTTP protocol.
```
Connected to www.google.com
Issued GET for /
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 21:18:01 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-10-11-21; expires=Thu, 10-Nov-2022 21:18:01 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AakniGP5QnZi1OqK0kP5npSVuHy9w4-_JX6zyLxd_P5fPj5St2ev7q-yIg; expires=Sun, 09-Apr-2023 21:18:01 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=Wy_eSG5Wg2UvkdHpO6NhymmpxMfWoUrjDLnpklb86V9cYIirGv92MHJfDw7VQP2k_ouIT_3wxAj_6AIddvwohdaFim8SdL_6alVxF-UzARFVeFb6X6Q_BM1Ht5ludtGnPRgAJPa1txxK0F0R0aVvl7qXZHOSFDtrF5hWOWpRiho; expires=Wed, 12-Apr-2023 21:18:01 GMT; path=/; domain=.google.com; HttpOnly
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Body: 24376 characters
```
Further Reading
This section provides additional resources that you may find helpful.
Python Asyncio Books
- Python Asyncio Mastery, Jason Brownlee (my book!)
- Python Asyncio Jump-Start, Jason Brownlee.
- Python Asyncio Interview Questions, Jason Brownlee.
- Asyncio Module API Cheat Sheet
I also recommend the following books:
- Python Concurrency with asyncio, Matthew Fowler, 2022.
- Using Asyncio in Python, Caleb Hattingh, 2020.
- asyncio Recipes, Mohamed Mustapha Tahrioui, 2019.
APIs
- asyncio — Asynchronous I/O
- Asyncio Coroutines and Tasks
- Asyncio Streams
- Asyncio Subprocesses
- Asyncio Queues
- Asyncio Synchronization Primitives
Takeaways
You now know how to download an HTML webpage over HTTP and HTTPS using asyncio.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Devon Janse van Rensburg on Unsplash