How to Download a Webpage with Asyncio

December 19, 2022 Python Asyncio

You can download webpages asynchronously using asyncio in Python.

In this tutorial, you will discover how to download an HTML webpage using HTTP and HTTPS using asyncio.

Let's get started.

How to Download a Webpage with Asyncio

The asyncio module provides support for opening socket connections and reading and writing data via streams.

We can use this capability to download webpages.

This involves perhaps four steps; they are:

1. Open a connection to the server.
2. Write the HTTP GET request.
3. Read the HTTP response.
4. Close the connection.

Let's take a closer look at each part in turn.

Open HTTP Connection

A connection can be opened in asyncio using the asyncio.open_connection() function.

Establish a network connection and return a pair of (reader, writer) objects.

-- Asyncio Streams

Among many arguments, the function takes the string hostname and the integer port number.

This is a coroutine that must be awaited and returns a StreamReader and a StreamWriter for reading and writing with the socket.

This can be used to open an HTTP connection on port 80.

For example:

...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 80)

We can also open an SSL connection using the ssl=True argument. This can be used to open an HTTPS connection on port 443.

For example:

...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 443, ssl=True)

Write HTTP Request

Once open, we can write a query to the StreamWriter to make an HTTP request.

For example, an HTTP version 1.1 request is in plain text. We can request the file path '/', which may look as follows:

GET / HTTP/1.1
Host: www.google.com

Importantly, there must be a carriage return and a line feed (\r\n) at the end of each line, and an empty line at the end.

As Python strings this may look as follows:

'GET / HTTP/1.1\r\n'
'Host: www.google.com\r\n'
'\r\n'
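As a quick sanity check, we can build this request from its lines and confirm that the encoded bytes end with the required blank line:

```python
# build the request string from its lines
request = ('GET / HTTP/1.1\r\n'
           'Host: www.google.com\r\n'
           '\r\n')
# encode to bytes, as required before writing to the socket
data = request.encode()
# the final bare \r\n is the empty line that terminates the request
print(data.endswith(b'\r\n\r\n'))  # → True
```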


This string must be encoded as bytes before being written to the StreamWriter.

Represents a writer object that provides APIs to write data to the IO stream.

-- Asyncio Streams

This can be achieved using the encode() method on the string itself.

The default 'utf-8' encoding may be sufficient.

For example:

...
# encode string as bytes
byte_data = string.encode()


The bytes can then be written to the socket via the StreamWriter via the write() method.

For example:

...
# write query to socket
writer.write(byte_data)

After writing the request, it is a good idea to wait for the byte data to be sent and for the socket to be ready.

This can be achieved by the drain() method.

Wait until it is appropriate to resume writing to the stream.

-- Asyncio Streams

This is a coroutine that must be awaited.

For example:

...
# wait for the socket to be ready.
await writer.drain()

Read HTTP Response

Once the HTTP request has been made, we can read the response.

This can be achieved via the StreamReader for the socket.

Represents a reader object that provides APIs to read data from the IO stream.

-- Asyncio Streams

The response can be read using the read() method which will read a chunk of bytes, or the readline() method which will read one line of bytes.

We might prefer the readline() method because we are using the text-based HTTP protocol, which sends the header one line at a time.

The readline() method is a coroutine and must be awaited.

For example:

...
# read one line of response
line_bytes = await reader.readline()

HTTP 1.1 responses are composed of two parts: a header terminated by an empty line, then the body, which also terminates with an empty line.

The header has information about whether the request was successful and what type of file will be sent, and the body contains the content of the file, such as an HTML webpage.
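For example, the first line of the header is the status line. As a small illustration (using a made-up status line, not a live response), it can be split into its three parts:

```python
# a made-up HTTP/1.1 status line for illustration
status_line = 'HTTP/1.1 200 OK'
# split into protocol version, status code, and reason phrase
version, code, reason = status_line.split(' ', 2)
print(version, code, reason)  # → HTTP/1.1 200 OK
```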


Each line must be decoded from bytes into a string.

This can be achieved using the decode() method on the byte data. Again, the default encoding is 'utf-8'.

For example:

...
# decode bytes into a string
line_data = line_bytes.decode()

We can continue to read lines in a loop until an empty string (a line containing only '\r\n') is returned. This will read the header. We can then do the same thing in order to read the body.

For example:

# read the header
while True:
	# read one line of response
	line_bytes = await reader.readline()
	# decode bytes into a string
	line_data = line_bytes.decode()
	# remove white space
	line_data = line_data.strip()
	# check if empty
	if not line_data:
		break

Helpfully, the StreamReader implements the asynchronous iterator protocol.

We can see this if we look in the source code for the StreamReader. It implements the __anext__() method.

This capability does not appear to be described in the API documentation, at least at the time of writing.

This will call the readline() method repeatedly until an empty bytes object is read, signaling the end of the stream.

This means we can read lines from the StreamReader using the "async for" expression.

For example:

...
# read the header
async for line_bytes in reader:
	# decode bytes into a string
	line_data = line_bytes.decode()
	# remove white space
	line_data = line_data.strip()

This will automatically suspend the caller and await each line.
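To see this behavior in isolation, we can feed a StreamReader some bytes directly and iterate over it. This is a toy demonstration with made-up header lines and no socket involved, not part of the download example:

```python
import asyncio

# toy demonstration: iterate a StreamReader line by line without a socket,
# by feeding it bytes directly (the header lines here are made up)
async def demo():
    reader = asyncio.StreamReader()
    # feed two header-like lines, then signal the end of the stream
    reader.feed_data(b'HTTP/1.1 200 OK\r\nHost: example\r\n')
    reader.feed_eof()
    lines = []
    # the async for awaits each readline() for us
    async for line_bytes in reader:
        lines.append(line_bytes.decode().strip())
    return lines

# run the demonstration
print(asyncio.run(demo()))  # → ['HTTP/1.1 200 OK', 'Host: example']
```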


Close HTTP Connection

We can close the socket connection by closing the StreamWriter.

This can be achieved by calling the close() method.

For example:

...
# close the connection
writer.close()

This does not block and may not close the socket immediately.

We can wait for the socket to close completely.

This can be achieved using the wait_closed() method.

This is a coroutine that we can await.

For example:

...
# wait for the socket to close completely.
await writer.wait_closed()

Now that we know how to download a webpage asynchronously using asyncio, let's look at some worked examples.

Example of Downloading HTML over HTTP in Asyncio

We can develop an example to download an HTML webpage over HTTP using asyncio.

Let's dive in.

Write HTTP Query

We can develop a coroutine that will issue an HTTP GET.

The coroutine will take three arguments: the StreamWriter for writing the request to the socket, the path of the file to download, and the hostname.

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
	# ...

The first step is to construct the HTTP request using v1.1 of the protocol.

This can be achieved in a single line.

...
# define request
query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'

The request string can then be encoded to bytes using the UTF-8 encoding.

...
# encode request as bytes
query = query.encode()

The byte string can then be written to the socket via the StreamWriter.

...
# write query to socket
writer.write(query)

Finally, we can drain the socket, waiting until it is ready to continue using.

...
# wait for the bytes to be written to the socket
await writer.drain()

Tying this together, the complete http_get() coroutine is listed below.

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

Next, we can read the response.

Read HTTP Query Results

We can develop a coroutine to read a response from the server.

This coroutine will take a StreamReader as an argument.

# coroutine to read an http block from a socket
async def http_read(reader):
	# ...

It will read lines from the StreamReader until an empty line is read, after which it will return all lines as a single string with new line separators.

We will do this using an "async for" loop over the StreamReader.

First, we can initialize the target string that will be used to store the complete response.

...
data = ''

We can then use an "async for" expression to read lines from the StreamReader.

This will automatically await each line for us.

...
# read all data, decode and add to string
async for line in reader:
	# ...

Each line can then be decoded using UTF-8 encoding, the default for converting bytes to a string.

...
# decode into a string
line = line.decode()

We can then strip the white space from the string which leaves the content only.

...
# strip white space
line = line.strip()

If there is no content, it is likely the end of a block, such as the end of a header or the end of the body.

Therefore, we can stop reading content from the socket.

...
# check for end of block
if not line:
    return data

Otherwise, we can append the line to our response string.

...
# store line
data += line + '\n'

Tying this together, the complete http_read() coroutine is listed below.

# coroutine to read a http block from a socket
async def http_read(reader):
    data = ''
    # read all data, decode and add to string
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

Next, we can open a connection and tie the asyncio program together.

Open Connection

We can develop a coroutine to open a connection, make the request and read the response.

First, we can define the host and file we wish to download.

We will download the Google homepage in this example, as it's likely to always be available and accessible to all readers.

...
# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'

Next, we can open a connection to the host on port 80, the standard port for HTTP.

...
# open the connection
reader, writer = await asyncio.open_connection(host, 80)
print(f'Connected to {host}')

This returns a StreamReader and StreamWriter that we can use to interact with the host.

Next, we can issue the request to GET the target file.

This will involve awaiting our http_get() coroutine.

...
# issue the request
await http_get(writer, path, host)
print(f'Issued GET for {path}')

Next, we can read the header of the HTTP response.

This involves awaiting our http_read() coroutine.

In this case, we will report the contents of the header, out of interest.

...
# read the header
header = await http_read(reader)
# report the header
print(header)

We will then read the body of the response and report the number of characters read. This count is a rough approximation, because we strip white space from each line and add a newline back when processing the response.

Again, we will use our http_read() coroutine.

...
# read the body
body = await http_read(reader)
# report the details of the body
print(f'Body: {len(body)} characters')

Finally, we can close the connection.

This is achieved by closing the StreamWriter and waiting for it to close.

...
# close writer stream
writer.close()
# wait for the writer to close
await writer.wait_closed()

Tying this together, the complete main() coroutine is listed below.

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 80)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()
    # wait for the writer to close
    await writer.wait_closed()

Complete Example

We can tie all of these elements together.

The complete example of downloading an HTML webpage over HTTP using asyncio is listed below.

# SuperFastPython.com
# example of downloading a webpage over HTTP in asyncio
import asyncio

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

# coroutine to read a http block from a socket
async def http_read(reader):
    # read all data, decode and add to string
    data = ''
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 80)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()
    # wait for the writer to close
    await writer.wait_closed()

# entry point
asyncio.run(main())

Running the example first creates the main() coroutine and uses it as the entry point for the program.

The main() coroutine runs and defines the host and file to download.

The main() coroutine opens the connection, suspending until the socket is open and ready on port 80.

Once ready, the coroutine returns a StreamReader and StreamWriter for interacting with the socket.

Next, the request is made. The main() coroutine calls our http_get() coroutine in order to make a request.

The http_get() coroutine runs, constructing and encoding the query, then writing the bytes to the socket. The http_get() coroutine then suspends and waits for the socket to be ready before returning.

The main() coroutine resumes and reads the header, calling our http_read() coroutine and suspending.

The http_read() coroutine runs, reading lines and appending them to a string until an empty line is read, likely signaling the end of a header or body block.

Each call to the readline() coroutine is awaited automatically using the "async for" expression on the StreamReader.

The http_read() coroutine terminates and the main() coroutine resumes, reporting the contents of the HTTP header.

This contains lots of interesting stuff.

The body is then read using our http_read() coroutine and the number of characters in the content is reported.

Finally, the socket is closed and the program exits.

This highlights how we can write to and read from sockets in asyncio to download an HTML webpage using the HTTP protocol.

Connected to www.google.com
Issued GET for /
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 21:18:16 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-10-11-21; expires=Thu, 10-Nov-2022 21:18:16 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AakniGNyVHd3MuuDP6JuAtJRtKlAWwZEp04ju-OwpoymwbIp7w77Ieg1Cfs; expires=Sun, 09-Apr-2023 21:18:16 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=AT-bATDPaoz9wWo8BQKDycF49LdbCvtiW2-o7_Nwrd3guh3Clgae5kTv-tUqDHVVpRDmzsrWoRzjDVQ6stpkSacom7yv-5Dvaz0ZhhuK8tdKaElUwURML8DAW2spQMJfXHa31aizmwRk1iotwCHtQdmt1cwqulOx10DW6prWIjk; expires=Wed, 12-Apr-2023 21:18:16 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Body: 24446 characters

Next, let's look at how we might update the example to download the same file using HTTPS.

Example of Downloading HTML over HTTPS in Asyncio

We can update the previous example to download a file using HTTPS instead of HTTP.

Recall that HTTPS is a secure version of the HTTP protocol that uses SSL to encrypt the plain text request and responses.

HTTPS socket connections are made on port 443 instead of port 80.

We can specify this port in the call to the asyncio.open_connection() function.

For example:

...
# open the connection
reader, writer = await asyncio.open_connection(host, 443)

We also must specify that we want to use SSL when making the connection, i.e. HTTP over SSL.

This can be achieved by setting the "ssl" argument to True.

For example:

...
# open the connection
reader, writer = await asyncio.open_connection(host, 443, ssl=True)
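Alternatively, instead of ssl=True, we can pass an explicit ssl.SSLContext, which allows customizing certificate handling if needed. This is a sketch assuming the default context (which verifies certificates) is what we want; the open_https() name is our own, not part of asyncio:

```python
import asyncio
import ssl

# a sketch: pass an explicit SSLContext instead of ssl=True,
# which allows customizing certificate handling if needed
async def open_https(host):
    # create a context with sensible defaults (verifies certificates)
    context = ssl.create_default_context()
    # open the connection using the explicit context
    reader, writer = await asyncio.open_connection(host, 443, ssl=context)
    return reader, writer
```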

And that is all there is to it.

Except, there is one small issue.

There is a known bug that can sometimes occur when waiting for an SSL connection to close.

Specifically, this line:

...
# wait for the writer to close
await writer.wait_closed()

This may raise an ssl.SSLError exception.

For example:

Traceback (most recent call last):
  ...
ssl.SSLError: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2672)

It is a known bug and not fixed at the time of writing.

We can avoid the exception by removing our call to wait_closed().
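Alternatively, instead of removing the call entirely, we could keep waiting and suppress just this exception. The helper below is a sketch of that approach (the close_writer() name is our own, not part of asyncio):

```python
import asyncio
import ssl

# a sketch: close a StreamWriter but tolerate the known SSL shutdown error,
# so we still wait for plain (non-SSL) connections to close completely
async def close_writer(writer):
    writer.close()
    try:
        # wait for the underlying transport to close
        await writer.wait_closed()
    except ssl.SSLError:
        # known shutdown race for SSL connections; safe to ignore here
        pass
```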

The complete example updated to download the HTML webpage using HTTPS instead of HTTP with asyncio is listed below.

# SuperFastPython.com
# example of downloading a webpage over HTTPS in asyncio
import asyncio

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

# coroutine to read a http block from a socket
async def http_read(reader):
    # read all data, decode and add to string
    data = ''
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 443, ssl=True)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()

# entry point
asyncio.run(main())

Running the example downloads the header and body of the file from google.com as before.

In this case, it is downloaded using the secure HTTPS protocol, instead of the HTTP protocol.

Connected to www.google.com
Issued GET for /
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 21:18:01 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-10-11-21; expires=Thu, 10-Nov-2022 21:18:01 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AakniGP5QnZi1OqK0kP5npSVuHy9w4-_JX6zyLxd_P5fPj5St2ev7q-yIg; expires=Sun, 09-Apr-2023 21:18:01 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=Wy_eSG5Wg2UvkdHpO6NhymmpxMfWoUrjDLnpklb86V9cYIirGv92MHJfDw7VQP2k_ouIT_3wxAj_6AIddvwohdaFim8SdL_6alVxF-UzARFVeFb6X6Q_BM1Ht5ludtGnPRgAJPa1txxK0F0R0aVvl7qXZHOSFDtrF5hWOWpRiho; expires=Wed, 12-Apr-2023 21:18:01 GMT; path=/; domain=.google.com; HttpOnly
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Body: 24376 characters

Takeaways

You now know how to download an HTML webpage using HTTP and HTTPS using asyncio.



If you enjoyed this tutorial, you will love my book: Python Asyncio Jump-Start. It covers everything you need to master the topic with hands-on examples and clear explanations.