You can profile slow asyncio programs to discover the coroutines that occupy most of the runtime and the functions that are called more often than any others.
The profilers built into the Python standard library are helpful for quickly discovering why an asyncio program is slow and where to focus optimization efforts.
In this tutorial, you will discover how to profile asyncio programs in order to discover why they are slow.
Let’s get started.
Need to Profile Slow Asyncio Programs
Sometimes our asyncio Python programs are slow.
We can use profiling to identify which parts of an asyncio program are slow so that we can fix or optimize them.
In software engineering, profiling (“program profiling”, “software profiling”) is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.
— Profiling (computer programming), Wikipedia.
Profiling slow asyncio programs in Python is crucial for several reasons.
- Optimizing I/O-Bound Applications. Asyncio is often used in I/O-bound applications where the performance gains of asynchronous programming are most significant. Profiling helps identify which parts of the code are causing slowdowns, enabling us to prioritize optimizations and ensure that I/O-bound tasks are executed as efficiently as possible.
- Managing Complexity in Asyncio Programs. Asyncio programs can be complex, involving numerous asynchronous tasks and callbacks. Profiling provides insights into the execution flow, revealing whether tasks are waiting excessively or if there are concurrency issues that need addressing. This information is invaluable for fine-tuning event loop management and task scheduling.
- Resource Bottleneck Detection. Profiling helps detect resource bottlenecks, such as excessive CPU usage, memory leaks, or blocking operations that hinder the responsiveness of asyncio programs. Identifying and resolving these issues is critical for maintaining smooth and responsive asynchronous applications.
- Opportunities for Code Optimization. Profiling can highlight opportunities for code optimization and performance improvements. By pinpointing specific functions or parts of the code that consume the most time or resources, we can focus our efforts on areas that provide the most significant performance gains.
Profiling slow asyncio programs not only helps diagnose performance issues but also guides optimizations and ultimately leads to more responsive and scalable asynchronous Python applications.
Now that we know that we need to profile our slow asyncio programs, let’s look at how we can use the built-in Python profiler to do so.
How to Use Python Profiler
Python provides an efficient C-based profiler in the cProfile module, built into the Python standard library.
A pure-Python profiler is provided in the profile module that offers the same API as the cProfile module. This can be used on those platforms that do not support the cProfile module.
cProfile and profile provide deterministic profiling of Python programs. A profile is a set of statistics that describes how often and for how long various parts of the program executed.
— The Python Profilers
We can use the Python profiler via the command line interface.
You may recall that we can run a module as a program via the -m flag.
Therefore, we can apply the Python profiler to an asyncio program file named program.py as follows:
```shell
python -m cProfile program.py
```
This runs program.py normally. Once the program is done, profile information is reported on standard output.
The cProfile program takes a number of command line flags.
The -o flag allows us to specify the file in which to store the output of the profiler.
Storing profile results to a file is a good practice.
For example:
```shell
python -m cProfile -o profile.out program.py
```
Note that the data is written in a binary format suitable for the pstats module to read and use.
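Since the saved data is binary, it can be loaded and summarized with the pstats module. Below is a minimal, self-contained sketch: the trivial work() function and the profile.out filename are illustrative, standing in for a real program's saved profile.

```python
import cProfile
import pstats

def work():
    # a trivial function so there is something to profile
    return sum(i * i for i in range(100000))

# profile the call and save the stats to a file, as the -o flag would
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats('profile.out')

# load the binary stats file and print a readable report
stats = pstats.Stats('profile.out')
stats.strip_dirs().sort_stats('cumtime').print_stats(10)
```

The strip_dirs() call removes leading path information from filenames, which makes the report easier to read.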
We can provide the -s flag to sort the profile results by one of the table headings.
By default, the results are sorted by the cumulative time spent in each function and its sub-functions, e.g. "cumtime".
cumtime: is the cumulative time spent in this and all subfunctions (from invocation till exit). This figure is accurate even for recursive functions.
— The Python Profilers
For example:
```shell
python -m cProfile -s cumtime program.py
```
Another helpful approach is to sort the results by the number of calls made to each function, e.g. "ncalls".
ncalls: for the number of calls.
— The Python Profilers
For example:
```shell
python -m cProfile -s ncalls program.py
```
This can highlight functions that are called many times and, in turn, take a long overall time.
We might also sort the results by the total time spent on each function call, e.g. “tottime“.
tottime: for the total time spent in the given function (and excluding time made in calls to sub-functions)
— The Python Profilers
For example:
```shell
python -m cProfile -s tottime program.py
```
This can highlight individual function calls that take a long time.
You can learn more about the cProfile command line flags and API in the Python Profiler API documentation.
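Beyond the command line interface, the profiler can also be driven programmatically, which is convenient when you only want to profile part of a program. Below is a minimal sketch; the main() coroutine here is a hypothetical stand-in for a real asyncio program.

```python
import asyncio
import cProfile
import pstats

async def main():
    # simulate some asynchronous work (a stand-in for a real program)
    await asyncio.sleep(0.1)

# collect profile data around the event loop run
profiler = cProfile.Profile()
profiler.enable()
asyncio.run(main())
profiler.disable()

# report the results sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumtime').print_stats(10)
```

This captures only the code executed between enable() and disable(), rather than the entire interpreter run, including imports.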
Now that we know how to use the Python profiler, let’s explore how we might use it to identify slow parts of asyncio programs.
Example of Profiling a Program With a Slow Function Call
We can explore using the Python profiler on an asyncio program that is slow because it has a long-running function call.
In this example, we will define a coroutine that performs a long blocking call. The coroutine will then be awaited from the main task.
We will call the time.sleep() function, which blocks the entire thread, preventing the asyncio event loop from progressing.
```python
# long running task
async def work():
    # block the event loop for a long time
    time.sleep(5)
```
Tying this together, the complete example is listed below.
```python
# SuperFastPython.com
# example of a program that has a slow function call
import time
import asyncio

# long running task
async def work():
    # block the event loop for a long time
    time.sleep(5)

# main program
async def main():
    # report a message
    print('Main is running')
    # await our task
    await work()
    # report a message
    print('Main is done.')

# start the event loop
asyncio.run(main())
```
Running the program first starts the event loop and runs the main() coroutine.
The main() coroutine runs and reports a start message then awaits the work() coroutine.
The work() coroutine runs and blocks the event loop with a call to time.sleep() for 5 seconds.
This is a bad practice as it will prevent any other coroutines from running at the same time. Instead, the program should use asyncio.sleep().
The work() coroutine resumes and terminates.
The main() coroutine resumes, reports a final message, and terminates.
```
Main is running
Main is done.
```
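As an aside, the blocking time.sleep() call can be replaced with asyncio.sleep() to keep the event loop responsive. Below is a minimal sketch of the non-blocking approach, using one-second sleeps for brevity, that demonstrates two sleeps running concurrently.

```python
import asyncio
import time

# non-blocking version of the work() coroutine
async def work():
    # suspend this coroutine without blocking the event loop
    await asyncio.sleep(1)

async def main():
    # two non-blocking sleeps run concurrently, so this takes ~1 second, not 2
    start = time.perf_counter()
    await asyncio.gather(work(), work())
    print(f'Took {time.perf_counter() - start:.1f} seconds')

asyncio.run(main())
```

If time.sleep() were used instead, the two calls would run back-to-back and take about two seconds in total.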
Next, we can profile the program, pretending we don't know why it is slow.
We will use the cProfile module with default sorting and focus on the first 20 lines of output.
This can be achieved by piping the output from the cProfile program through the “head” command and keeping only the first 20 lines.
```shell
python -m cProfile program.py | head -20
```
If you don’t have access to the head command, just run the cProfile without it and look at the first 20 lines.
If you are unable to run the cProfile program on your system, use the profile module instead. The results will be comparable, although the program will run more slowly under the pure-Python profiler.
The cProfile output reports some details about the program.
For example, there were 33 thousand function calls and the program ran for a little over 5 seconds.
The default output from cProfile does not seem helpful in this case, perhaps sorting by cumulative function execution time (cumtime) is not the best approach.
```
33232 function calls (32511 primitive calls) in 5.045 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     46/1    0.000    0.000    5.045    5.045 {built-in method builtins.exec}
        1    0.000    0.000    5.045    5.045 program.py:1(<module>)
        1    0.000    0.000    5.007    5.007 runners.py:160(run)
        3    0.000    0.000    5.006    1.669 base_events.py:617(run_until_complete)
        1    0.000    0.000    5.006    5.006 runners.py:86(run)
        3    0.000    0.000    5.006    1.669 base_events.py:593(run_forever)
        8    0.000    0.000    5.006    0.626 base_events.py:1845(_run_once)
        8    0.000    0.000    5.005    0.626 events.py:78(_run)
        8    0.000    0.000    5.005    0.626 {method 'run' of '_contextvars.Context' objects}
        1    0.000    0.000    5.005    5.005 program.py:7(work)
        1    5.005    5.005    5.005    5.005 {built-in method time.sleep}
        4    0.000    0.000    0.050    0.013 __init__.py:1(<module>)
     60/1    0.000    0.000    0.039    0.039 <frozen importlib._bootstrap>:1165(_find_and_load)
```
One approach we can try is to filter all results by the name of our program, e.g. “program.py“.
This may give insight into the cumulative execution time of each one of our coroutines.
A simple way to achieve this is to run cProfile and filter the results for the program name via grep.
For example:
```shell
python -m cProfile program.py | grep program.py
```
Note that this assumes you have access to the grep program on your system. It may not be available on Windows.
The results show only the output lines from cProfile that mention our program name.
We can see that the call with the longest cumulative time (column 4) is the start of the program. The second is the work() coroutine and the third is the main() coroutine.
This highlights that the problem is probably somewhere in the work() coroutine.
```
1    0.000    0.000    5.045    5.045 program.py:1(<module>)
1    0.000    0.000    5.005    5.005 program.py:7(work)
2    0.000    0.000    0.000    0.000 program.py:12(main)
```
Next, let’s try sorting the output of cProfile by the total execution time of each function, e.g. tottime.
```shell
python -m cProfile -s tottime program.py | head -20
```
This helps.
We can see that the first function in the list that has the longest overall execution time is the call to time.sleep.
This highlights how we can use the output from cProfile to zero in on long-running functions in our asyncio programs.
```
33232 function calls (32511 primitive calls) in 5.044 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    5.004    5.004    5.004    5.004 {built-in method time.sleep}
       45    0.004    0.000    0.004    0.000 {built-in method marshal.loads}
       12    0.004    0.000    0.005    0.000 {built-in method _imp.create_dynamic}
      249    0.002    0.000    0.002    0.000 {built-in method posix.stat}
  153/152    0.002    0.000    0.006    0.000 {built-in method builtins.__build_class__}
       17    0.001    0.000    0.003    0.000 enum.py:1657(convert_class)
       45    0.001    0.000    0.001    0.000 {built-in method io.open_code}
      129    0.001    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1604(find_spec)
       12    0.001    0.000    0.001    0.000 {built-in method _imp.exec_dynamic}
       28    0.001    0.000    0.001    0.000 {built-in method posix.getcwd}
      646    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:128(<listcomp>)
  267/239    0.001    0.000    0.001    0.000 {built-in method __new__ of type object at 0x104b38ee0}
       13    0.000    0.000    0.002    0.000 enum.py:907(<listcomp>)
```
Next, let’s look at another profiling example, this time with functions that are called many times.
Example of Profiling a Program With Repeated Function Calls
We can explore using the Python profiler on an asyncio program that is slow because it repeatedly makes the same function call.
In this case, we can define a coroutine that performs the same mathematical operation many millions of times, e.g. nearly 100 million calls to math.sqrt().
Collectively, this is a slow, CPU-bound task, although because it is performed in a list comprehension, it may not look like one.
```python
# long running task
async def work():
    # block the event loop for a long time
    data = [math.sqrt(i) for i in range(1, 100000000)]
```
Tying this together, the complete example is listed below.
```python
# SuperFastPython.com
# example of a program that calls a function repeatedly
import math
import asyncio

# long running task
async def work():
    # block the event loop for a long time
    data = [math.sqrt(i) for i in range(1, 100000000)]

# main program
async def main():
    # report a message
    print('Main is running')
    # create the task
    task = asyncio.create_task(work())
    # await our task
    await task
    # report a message
    print('Main is done.')

# start the event loop
asyncio.run(main())
```
Running the program first starts the event loop and runs the main() coroutine.
The main() coroutine runs and reports a start message, then creates and awaits a task for the work() coroutine.
The work() coroutine runs and executes the list comprehension, creating a list of the square roots of nearly 100 million integers.
This takes a while and prevents the asyncio event loop from progressing.
Ideally, this CPU-bound task would be performed outside of the event loop, such as in a new thread or a new child process that will not be limited by the global interpreter lock.
Once the list comprehension is complete, work() terminates.
The main() coroutine resumes, reports a final message, and terminates.
```
Main is running
Main is done.
```
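To illustrate the point about moving CPU-bound work off the event loop, below is a sketch that offloads the list comprehension to a child process via loop.run_in_executor() with a ProcessPoolExecutor. The iteration count is reduced to one million for brevity, and the __main__ guard is needed because process pools may re-import the module on some platforms.

```python
import asyncio
import math
from concurrent.futures import ProcessPoolExecutor

# the CPU-bound work as a regular (pickleable) module-level function
def work():
    # reduced to one million iterations for brevity
    return [math.sqrt(i) for i in range(1, 1000000)]

async def main():
    loop = asyncio.get_running_loop()
    # run the CPU-bound task in a child process, leaving the event loop free
    with ProcessPoolExecutor() as executor:
        data = await loop.run_in_executor(executor, work)
    print(f'Computed {len(data)} square roots')

if __name__ == '__main__':
    asyncio.run(main())
```

A thread via asyncio.to_thread() is an alternative, but for pure-Python CPU-bound work a child process is usually preferable because threads are limited by the global interpreter lock.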
Next, let's profile our program and pretend we don't know why it is slow.
We can start off by running cProfile with default sorting and reviewing the first 20 lines of output.
```shell
python -m cProfile program.py | head -20
```
This does not appear helpful in this case.
It is reporting the internal infrastructure and the first line of our program as having the highest overall cumulative execution time.
```
100033231 function calls (100032516 primitive calls) in 30.865 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     46/1    0.000    0.000   30.865   30.865 {built-in method builtins.exec}
        1    0.000    0.000   30.865   30.865 program.py:1(<module>)
        1    0.000    0.000   30.827   30.827 runners.py:160(run)
        3    0.000    0.000   30.826   10.275 base_events.py:617(run_until_complete)
        1    0.000    0.000   30.826   30.826 runners.py:86(run)
        3    0.000    0.000   30.826   10.275 base_events.py:593(run_forever)
        8    0.000    0.000   30.826    3.853 base_events.py:1845(_run_once)
        8    0.000    0.000   30.825    3.853 events.py:78(_run)
        8    1.076    0.135   30.825    3.853 {method 'run' of '_contextvars.Context' objects}
        1    0.000    0.000   29.749   29.749 program.py:7(work)
        1   20.309   20.309   29.749   29.749 program.py:9(<listcomp>)
 99999999    9.440    0.000    9.440    0.000 {built-in method math.sqrt}
        4    0.000    0.000    0.049    0.012 __init__.py:1(<module>)
```
Next, let’s try sorting results by the total execution time of each function.
```shell
python -m cProfile -s tottime program.py | head -20
```
Here, we can see that line 9 of our program has the highest overall execution time at 19.547 seconds.
This is the line that contains the list comprehension, pointing us in the right direction.
The next function with the largest overall time is math.sqrt at 9 seconds.
```
100033231 function calls (100032516 primitive calls) in 30.190 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   19.547   19.547   29.043   29.043 program.py:9(<listcomp>)
 99999999    9.496    0.000    9.496    0.000 {built-in method math.sqrt}
        8    1.107    0.138   30.150    3.769 {method 'run' of '_contextvars.Context' objects}
       45    0.004    0.000    0.004    0.000 {built-in method marshal.loads}
       12    0.004    0.000    0.005    0.000 {built-in method _imp.create_dynamic}
      249    0.002    0.000    0.002    0.000 {built-in method posix.stat}
  153/152    0.002    0.000    0.006    0.000 {built-in method builtins.__build_class__}
       17    0.001    0.000    0.003    0.000 enum.py:1657(convert_class)
       45    0.001    0.000    0.001    0.000 {built-in method io.open_code}
      129    0.001    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1604(find_spec)
       12    0.001    0.000    0.001    0.000 {built-in method _imp.exec_dynamic}
       28    0.001    0.000    0.001    0.000 {built-in method posix.getcwd}
       45    0.001    0.000    0.001    0.000 {method 'read' of '_io.BufferedReader' objects}
```
Let’s try another approach.
In this case, let’s sort the results by the total number of times each function is called, e.g. ncalls.
```shell
python -m cProfile -s ncalls program.py | head -20
```
Now we see some striking results.
We can see that the math.sqrt function was called nearly 100 million times, 99999999 to be precise.
This is far beyond the number of calls to any other function and is the source of our problem.
```
100033231 function calls (100032516 primitive calls) in 30.185 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 99999999    9.428    0.000    9.428    0.000 {built-in method math.sqrt}
2156/2090    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     1947    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
     1626    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
     1382    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
     1166    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
     1106    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
     1059    0.000    0.000    0.001    0.000 {built-in method builtins.getattr}
     1027    0.000    0.000    0.000    0.000 {method 'isupper' of 'str' objects}
      749    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
      709    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:244(_verbose_message)
      646    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
      646    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:128(<listcomp>)
```
This highlights how we can zero in on the cause of a problem by reviewing the same profile data in different ways, both execution time and number of function calls.
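Rather than re-running the program under each sort flag, we can capture the profile once and re-sort the same statistics with the pstats module. Below is a minimal sketch; the work() function stands in for a real program and profile.out is an illustrative filename.

```python
import cProfile
import pstats
import math

def work():
    # stand-in workload: many calls to math.sqrt()
    return [math.sqrt(i) for i in range(1, 100000)]

# profile once and save the stats, as: python -m cProfile -o profile.out program.py
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats('profile.out')

# load once, then review the same data sorted in different ways
stats = pstats.Stats('profile.out')
stats.sort_stats('tottime').print_stats(5)   # longest internal time
stats.sort_stats('ncalls').print_stats(5)    # most frequently called
```

This avoids re-running a slow program just to view its profile from a different angle.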
Further Reading
This section provides additional resources that you may find helpful.
Python Asyncio Books
- Python Asyncio Mastery, Jason Brownlee (my book!)
- Python Asyncio Jump-Start, Jason Brownlee.
- Python Asyncio Interview Questions, Jason Brownlee.
- Asyncio Module API Cheat Sheet
I also recommend the following books:
- Python Concurrency with asyncio, Matthew Fowler, 2022.
- Using Asyncio in Python, Caleb Hattingh, 2020.
- asyncio Recipes, Mohamed Mustapha Tahrioui, 2019.
APIs
- asyncio — Asynchronous I/O
- Asyncio Coroutines and Tasks
- Asyncio Streams
- Asyncio Subprocesses
- Asyncio Queues
- Asyncio Synchronization Primitives
Takeaways
You now know how to profile asyncio programs in order to discover why they are slow.
Did I make a mistake? See a typo?
I’m a simple humble human. Correct me, please!
Do you have any additional tips?
I’d love to hear about them!
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by John O’Nolan on Unsplash