You can download webpages asynchronously using asyncio in Python.
In this tutorial, you will discover how to download an HTML webpage over HTTP and HTTPS using asyncio.
Let’s get started.
How to Download a Webpage with Asyncio
The asyncio module provides support for opening socket connections and reading and writing data via streams.
We can use this capability to download webpages.
This involves four steps:
- 1. Open a connection
- 2. Write a request
- 3. Read a response
- 4. Close the connection
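Taken together, the four steps can be sketched as a single coroutine. This is a minimal sketch, not the worked example developed below; the build_request() helper is an assumption for illustration.

```python
import asyncio

def build_request(path, host):
    # construct a minimal HTTP/1.1 GET request string
    return f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'

async def fetch(host, path='/'):
    # 1. open a connection
    reader, writer = await asyncio.open_connection(host, 80)
    # 2. write a request
    writer.write(build_request(path, host).encode())
    await writer.drain()
    # 3. read a response (first line only, for brevity)
    status = await reader.readline()
    # 4. close the connection
    writer.close()
    await writer.wait_closed()
    return status.decode().strip()
```

It could be run with, for example, asyncio.run(fetch('www.google.com')).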
Let’s take a closer look at each part in turn.
Open HTTP Connection
A connection can be opened in asyncio using the asyncio.open_connection() function.
Establish a network connection and return a pair of (reader, writer) objects.
— Asyncio Streams
Among many arguments, the function takes a string hostname and an integer port number.
This is a coroutine that must be awaited and returns a StreamReader and a StreamWriter for reading and writing with the socket.
This can be used to open an HTTP connection on port 80.
For example:
```python
...
# open a socket connection
reader, writer = await asyncio.open_connection('www.google.com', 80)
```
We can also open an SSL connection using the ssl=True argument. This can be used to open an HTTPS connection on port 443.
For example:
```python
...
# open a socket connection with SSL
reader, writer = await asyncio.open_connection('www.google.com', 443, ssl=True)
```
Write HTTP Request
Once open, we can write a query to the StreamWriter to make an HTTP request.
For example, an HTTP version 1.1 request is in plain text. We can request the file path ‘/’, which may look as follows:
```
GET / HTTP/1.1
Host: www.google.com
```
Importantly, there must be a carriage return and a line feed (\r\n) at the end of each line, and an empty line at the end.
As Python strings this may look as follows:
```python
'GET / HTTP/1.1\r\n'
'Host: www.google.com\r\n'
'\r\n'
```
You can learn more about HTTP v1.1 request messages here:
This string must be encoded as bytes before being written to the StreamWriter.
Represents a writer object that provides APIs to write data to the IO stream.
— Asyncio Streams
This can be achieved using the encode() method on the string itself.
The default ‘utf-8‘ encoding may be sufficient.
For example:
```python
...
# encode string as bytes
byte_data = string.encode()
```
You can see a listing of encodings here:
The bytes can then be written to the socket using the write() method on the StreamWriter.
For example:
```python
...
# write query to socket
writer.write(byte_data)
```
After writing the request, it is a good idea to wait for the byte data to be sent and for the socket to be ready.
This can be achieved by the drain() method.
Wait until it is appropriate to resume writing to the stream.
— Asyncio Streams
This is a coroutine that must be awaited.
For example:
```python
...
# wait for the socket to be ready
await writer.drain()
```
Read HTTP Response
Once the HTTP request has been made, we can read the response.
This can be achieved via the StreamReader for the socket.
Represents a reader object that provides APIs to read data from the IO stream.
— Asyncio Streams
The response can be read using the read() method which will read a chunk of bytes, or the readline() method which will read one line of bytes.
We might prefer the readline() method because HTTP is a text-based protocol that sends its data one line at a time.
The readline() method is a coroutine and must be awaited.
For example:
```python
...
# read one line of response
line_bytes = await reader.readline()
```
HTTP 1.1 responses are composed of two parts: a header, then an empty line, then the body.
The header has information about whether the request was successful and what type of file will be sent, and the body contains the content of the file, such as an HTML webpage.
You can learn more about HTTP v1.1 responses here:
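To make this structure concrete, here is a small sketch that splits a raw response into its header and body on the first empty line. The sample response text is invented for illustration.

```python
# a made-up raw HTTP response for illustration
raw = ('HTTP/1.1 200 OK\r\n'
       'Content-Type: text/html\r\n'
       '\r\n'
       '<html><body>Hello</body></html>\r\n')

# the header ends at the first empty line (\r\n\r\n)
header, _, body = raw.partition('\r\n\r\n')
# the first header line is the status line
print(header.splitlines()[0])  # → HTTP/1.1 200 OK
```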
Each line must be decoded from bytes into a string.
This can be achieved using the decode() method on the byte data. Again, the default encoding is ‘utf_8‘.
For example:
```python
...
# decode bytes into a string
line_data = line_bytes.decode()
```
We can continue to read lines in a loop until an empty line (containing only the '\r\n' characters) is returned. This will read the header. We can then do the same thing in order to read the body.
For example:
```python
# read the header
while True:
    # read one line of response
    line_bytes = await reader.readline()
    # decode bytes into a string
    line_data = line_bytes.decode()
    # remove white space
    line_data = line_data.strip()
    # check if empty
    if not line_data:
        break
```
Helpfully, the StreamReader implements the interface for the asynchronous iterator.
We can see this if we look in the source code for the StreamReader. It implements the __anext__() method.
This capability does not appear to be described in the API documentation, at least at the time of writing.
This will call the readline() method repeatedly until an empty bytes object is read, signaling the end of the file.
This means we can read lines from the StreamReader using the “async for” expression.
For example:
```python
...
# read the header
async for line_bytes in reader:
    # decode bytes into a string
    line_data = line_bytes.decode()
    # remove white space
    line_data = line_data.strip()
```
This will automatically suspend the caller and await each line.
You can learn more about the “async for” expression in the tutorial:
Close HTTP Connection
We can close the socket connection by closing the StreamWriter.
This can be achieved by calling the close() method.
For example:
```python
...
# close the connection
writer.close()
```
This does not block and may not close the socket immediately.
We can wait for the socket to close completely.
This can be achieved using the wait_closed() method.
This is a coroutine that we can await.
For example:
```python
...
# wait for the socket to close completely
await writer.wait_closed()
```
Now that we know how to download a webpage asynchronously using asyncio, let’s look at some worked examples.
Example of Downloading HTML over HTTP in Asyncio
We can develop an example to download an HTML webpage over HTTP using asyncio.
Let’s dive in.
Write HTTP Query
We can develop a coroutine that will issue an HTTP GET.
The coroutine will take three arguments: the StreamWriter for writing the request to the socket, the path of the file to download, and the hostname.
```python
# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # ...
```
The first step is to construct the HTTP request using v1.1 of the protocol.
This can be achieved in a single line.
```python
...
# define request
query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
```
The request string can then be encoded to bytes using the UTF-8 encoding.
```python
...
# encode request as bytes
query = query.encode()
```
The byte string can then be written to the socket via the StreamWriter.
```python
...
# write query to socket
writer.write(query)
```
Finally, we can drain the socket, waiting until it is ready for further use.
```python
...
# wait for the bytes to be written to the socket
await writer.drain()
```
Tying this together, the complete http_get() coroutine is listed below.
```python
# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()
```
Next, we can read the response.
Read HTTP Query Results
We can develop a coroutine to read a response from the server.
This coroutine will take a StreamReader as an argument.
```python
# coroutine to read an http block from a socket
async def http_read(reader):
    # ...
```
It will read lines from the StreamReader until an empty line is read, after which it will return all lines as a single string with newline separators.
We will do this using a loop, although a more Pythonic approach might use a list comprehension.
Firstly, we can initialize the target string that will be used to store the complete response.
```python
...
data = ''
```
We can then use an “async for” expression to read lines from the StreamReader.
This will automatically await each line for us.
```python
...
# read all data, decode and add to string
async for line in reader:
    # ...
```
Each line can then be decoded using UTF-8 encoding, the default for converting bytes to a string.
```python
...
# decode into a string
line = line.decode()
```
We can then strip the white space from the string, leaving only the content.
```python
...
# strip white space
line = line.strip()
```
If there is no content, it is likely the end of a block, such as the end of a header or the end of the body.
Therefore, we can stop reading content from the socket.
```python
...
# check for end of block
if not line:
    return data
```
Otherwise, we can append the line to our response string.
```python
...
# store line
data += line + '\n'
```
Tying this together, the complete http_read() coroutine is listed below.
```python
# coroutine to read a http block from a socket
async def http_read(reader):
    data = ''
    # read all data, decode and add to string
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'
```
Next, we can open a connection and tie the asyncio program together.
Open Connection
We can develop a coroutine to open a connection, make the request and read the response.
First, we can define the host and file we wish to download.
We will download the Google homepage in this example as it’s likely to always be there and accessible to all readers.
```python
...
# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
```
Next, we can open a connection to the host on port 80, the standard port for HTTP web server socket connections.
```python
...
# open the connection
reader, writer = await asyncio.open_connection(host, 80)
print(f'Connected to {host}')
```
This returns a StreamReader and StreamWriter that we can use to interact with the host.
Next, we can issue the request to GET the target file.
This will involve awaiting our http_get() coroutine.
```python
...
# issue the request
await http_get(writer, path, host)
print(f'Issued GET for {path}')
```
Next, we can read the header of the HTTP response.
This involves awaiting our http_read() coroutine.
In this case, we will report the contents of the header, out of interest.
```python
...
# read the header
header = await http_read(reader)
# report the header
print(header)
```
We will then read the body of the response and report the number of characters read. This is a rough approximation because we strip white space from each line and add some back when processing the response.
Again, we will use our http_read() coroutine.
```python
...
# read the body
body = await http_read(reader)
# report the details of the body
print(f'Body: {len(body)} characters')
```
Finally, we can close the connection.
This is achieved by closing the StreamWriter and waiting for it to close.
```python
...
# close writer stream
writer.close()
# wait for the writer to close
await writer.wait_closed()
```
Tying this together, the complete main() coroutine is listed below.
```python
# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 80)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()
    # wait for the writer to close
    await writer.wait_closed()
```
Complete Example
We can tie all of these elements together.
The complete example of downloading an HTML webpage over HTTP using asyncio is listed below.
```python
# SuperFastPython.com
# example of downloading a webpage over HTTP in asyncio
import asyncio

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

# coroutine to read a http block from a socket
async def http_read(reader):
    # read all data, decode and add to string
    data = ''
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 80)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()
    # wait for the writer to close
    await writer.wait_closed()

# entry point
asyncio.run(main())
```
Running the example first creates the main() coroutine and uses it as the entry point for the program.
The main() coroutine runs and defines the host and file to download.
The main() coroutine opens the connection, suspending while the socket connection is established on port 80.
Once ready, the coroutine returns a StreamReader and StreamWriter for interacting with the socket.
Next, the request is made. The main() coroutine calls our http_get() coroutine in order to make a request.
The http_get() coroutine runs, constructing and encoding the query, then writing the bytes to the socket. The http_get() coroutine then suspends and waits for the socket to be ready before returning.
The main() coroutine resumes and reads the header, calling our http_read() coroutine and suspending.
The http_read() coroutine runs, reading lines and appending them to a string until an empty line is read, likely signaling the end of a header or body block.
Each call to the readline() coroutine is awaited automatically using the “async for” expression on the StreamReader.
The http_read() coroutine terminates and the main() coroutine resumes, reporting the contents of the HTTP header.
This contains lots of interesting detail, such as the response status, content type, and cookies.
The body is then read using our http_read() coroutine and the number of characters in the content is reported.
Finally, the socket is closed and the program closes.
This highlights how we can write to and read from sockets in asyncio to download an HTML webpage using the HTTP protocol.
```
Connected to www.google.com
Issued GET for /
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 21:18:16 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-10-11-21; expires=Thu, 10-Nov-2022 21:18:16 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AakniGNyVHd3MuuDP6JuAtJRtKlAWwZEp04ju-OwpoymwbIp7w77Ieg1Cfs; expires=Sun, 09-Apr-2023 21:18:16 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=AT-bATDPaoz9wWo8BQKDycF49LdbCvtiW2-o7_Nwrd3guh3Clgae5kTv-tUqDHVVpRDmzsrWoRzjDVQ6stpkSacom7yv-5Dvaz0ZhhuK8tdKaElUwURML8DAW2spQMJfXHa31aizmwRk1iotwCHtQdmt1cwqulOx10DW6prWIjk; expires=Wed, 12-Apr-2023 21:18:16 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Body: 24446 characters
```
Next, let’s look at how we might update the example to download the same file using HTTPS.
Example of Downloading HTML over HTTPS in Asyncio
We can update the previous example to download a file using HTTPS instead of HTTP.
Recall that HTTPS is a secure version of the HTTP protocol that uses SSL to encrypt the plain text request and responses.
HTTPS socket connections are made on port 443 instead of port 80.
We can specify this port in the call to the asyncio.open_connection() function.
For example:
```python
...
# open the connection
reader, writer = await asyncio.open_connection(host, 443)
```
We must also specify that we want to use SSL when making the connection, i.e. HTTP over SSL.
This can be achieved by setting the “ssl” argument to True.
For example:
```python
...
# open the connection
reader, writer = await asyncio.open_connection(host, 443, ssl=True)
```
And that is all there is to it.
Except, there is one small issue.
There is a known bug that sometimes occurs when waiting for an SSL connection to close.
Specifically, this line:
```python
...
# wait for the writer to close
await writer.wait_closed()
```
This may raise an ssl.SSLError exception.
For example:
```
Traceback (most recent call last):
...
ssl.SSLError: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2672)
```
It is a known bug and not fixed at the time of writing:
We can avoid the exception by removing our call to wait_closed().
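Alternatively, we could keep the call and suppress the error. This is a sketch of one possible approach using the standard ssl module; the close_stream() helper name is an assumption for illustration.

```python
import asyncio
import ssl

async def close_stream(writer):
    # close the stream and tolerate the known SSL shutdown bug
    writer.close()
    try:
        # may raise ssl.SSLError on some platforms when using SSL
        await writer.wait_closed()
    except ssl.SSLError:
        # ignore application data arriving after close notify
        pass
```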
The complete example updated to download the HTML webpage using HTTPS instead of HTTP with asyncio is listed below.
```python
# SuperFastPython.com
# example of downloading a webpage over HTTPS in asyncio
import asyncio

# coroutine to construct a query and write to socket
async def http_get(writer, path, host):
    # define request
    query = f'GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n'
    # encode request as bytes
    query = query.encode()
    # write query to socket
    writer.write(query)
    # wait for the bytes to be written to the socket
    await writer.drain()

# coroutine to read a http block from a socket
async def http_read(reader):
    # read all data, decode and add to string
    data = ''
    async for line in reader:
        # decode into a string
        line = line.decode()
        # strip white space
        line = line.strip()
        # check for end of block
        if not line:
            return data
        # store line
        data += line + '\n'

# main coroutine
async def main():
    # define the web page we want to download
    host = 'www.google.com'
    path = '/'
    # open the connection
    reader, writer = await asyncio.open_connection(host, 443, ssl=True)
    print(f'Connected to {host}')
    # issue the request
    await http_get(writer, path, host)
    print(f'Issued GET for {path}')
    # read the header
    header = await http_read(reader)
    # report the header
    print(header)
    # read the body
    body = await http_read(reader)
    # report the details of the body
    print(f'Body: {len(body)} characters')
    # close writer stream
    writer.close()

# entry point
asyncio.run(main())
```
Running the example downloads the header and body of the file from google.com as before.
In this case, it is downloaded using the secure HTTPS protocol, instead of the HTTP protocol.
```
Connected to www.google.com
Issued GET for /
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 21:18:01 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-10-11-21; expires=Thu, 10-Nov-2022 21:18:01 GMT; path=/; domain=.google.com; Secure
Set-Cookie: AEC=AakniGP5QnZi1OqK0kP5npSVuHy9w4-_JX6zyLxd_P5fPj5St2ev7q-yIg; expires=Sun, 09-Apr-2023 21:18:01 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
Set-Cookie: NID=511=Wy_eSG5Wg2UvkdHpO6NhymmpxMfWoUrjDLnpklb86V9cYIirGv92MHJfDw7VQP2k_ouIT_3wxAj_6AIddvwohdaFim8SdL_6alVxF-UzARFVeFb6X6Q_BM1Ht5ludtGnPRgAJPa1txxK0F0R0aVvl7qXZHOSFDtrF5hWOWpRiho; expires=Wed, 12-Apr-2023 21:18:01 GMT; path=/; domain=.google.com; HttpOnly
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Body: 24376 characters
```
Further Reading
This section provides additional resources that you may find helpful.
Python Asyncio Books
- Python Asyncio Mastery, Jason Brownlee (my book!)
- Python Asyncio Jump-Start, Jason Brownlee.
- Python Asyncio Interview Questions, Jason Brownlee.
- Asyncio Module API Cheat Sheet
I also recommend the following books:
- Python Concurrency with asyncio, Matthew Fowler, 2022.
- Using Asyncio in Python, Caleb Hattingh, 2020.
- asyncio Recipes, Mohamed Mustapha Tahrioui, 2019.
APIs
- asyncio — Asynchronous I/O
- Asyncio Coroutines and Tasks
- Asyncio Streams
- Asyncio Subprocesses
- Asyncio Queues
- Asyncio Synchronization Primitives
Takeaways
You now know how to download an HTML webpage over HTTP and HTTPS using asyncio.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Devon Janse van Rensburg on Unsplash