Last Updated on September 27, 2023
You can benchmark the performance of coroutines, threads, and processes on a suite of (simulated) common tasks.
This is helpful for testing the refrain that “coroutines are faster than threads”, and the sometimes opposite refrain that “threads are faster than coroutines”.
Benchmark results across different common types of tasks show that both threads and coroutines are about as fast as each other, even when the startup time of threads and coroutines is taken into account, at least with a modest number of concurrent tasks, e.g. 100.
In this tutorial, you will discover whether coroutines really are faster than threads, or not, given a suite of benchmarks.
Big thank you to Timothy Beamish for the discussion that prompted this tutorial.
Let’s get started.
Are Asyncio Coroutines Faster Than Threads!?
There is much confusion in the Python community about asyncio vs. threads.
For example, some claim that asyncio is faster than threads and that we all need to go full asyncio ASAP.
This is often the refrain of the beginner or the newly converted.
Other more seasoned developers claim that threads are faster than asyncio.
For example:
Asyncio will make my code blazing fast. Unfortunately, no. In fact, most benchmarks seem to show that threading solutions are slightly faster than their comparable Asyncio solutions.
— Page 6, Using Asyncio in Python, 2020.
Also see the piece: Async Python is not faster.
The problem with both claims is that they motivate adopting asyncio because of execution speed. This is an error.
We adopt asyncio because we want asynchronous programming in Python and to adopt capabilities like non-blocking I/O with sockets and subprocesses.
You can learn more about this stance in the tutorial:
Nevertheless, sometimes the speed of execution is important. The most important.
Sometimes completing many concurrent tasks faster with one approach than another matters.
We will explore the question of whether asyncio coroutines are faster than threads, or not.
Before we do, let’s briefly consider the differences between coroutines and threads.
Coroutines vs Threads
Coroutines and threads are quite different.
A coroutine is a function. A thread is an object connected to a native “thread” in the operating system.
You can learn more about asyncio vs. threads in the tutorial:
These structural differences result in two performance differences that we can state upfront:
- Coroutines are faster to start than threads.
- Coroutines use less memory than threads.
Let’s take a closer look at each in turn.
Asyncio Coroutines Are Faster To Start Than Threads
Generally, coroutines are faster to “start” than threads.
This should make sense.
A coroutine is just a function. A thread is a Python object tied to a native thread which is manipulated by an operating system-specific API under the covers.
Running a function is faster than creating an object and calling down to a C API to do things in the OS.
This should not be surprising.
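To make this concrete, here is a minimal sketch (my own illustration, not one of the benchmarks below) contrasting the two. Creating a coroutine is an ordinary function call that runs nothing yet, whereas a thread is an object that must be started, crossing into the OS.

# contrast creating a coroutine with creating and starting a thread
import asyncio
import threading

async def task():
    pass

# creating a coroutine object is just a function call; nothing runs yet
coro = task()
coro.close()  # close it explicitly since we never await it

# creating a Thread object and starting it calls down into the OS
thread = threading.Thread(target=lambda: None)
thread.start()
thread.join()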
You can learn more about this in the tutorial:
The implication is that if all things are equal, e.g. the task takes the same time to execute in a coroutine and in a thread, then the startup time of the unit of concurrency might make coroutines slightly faster.
Yep, this may be.
Generally, my thoughts are:
- Startup time is still tiny, e.g. sub 2 ms (see below).
- Startup time is a constant; we typically exclude/ignore constants when waving our hands about and talking about benchmarking.
- Startup time can be overcome by using a pool of workers started once and reused (see the sketch below).
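For example, here is a minimal sketch of the reuse idea using a ThreadPoolExecutor: the pool's worker threads are started once and then reused across repeated batches of tasks, so the startup cost is paid once.

# a minimal sketch: pay the thread startup cost once by reusing a pool
from concurrent.futures import ThreadPoolExecutor
import time

def work():
    time.sleep(0.1)

# the pool's worker threads are started once...
with ThreadPoolExecutor(max_workers=100) as pool:
    # ...and reused across many batches of tasks
    for _ in range(3):
        futures = [pool.submit(work) for _ in range(100)]
        for future in futures:
            future.result()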
Asyncio Coroutines Use Less Memory Than Threads
Coroutines use less memory than threads.
This is for the same reason why coroutines are faster to start than threads. They use less memory because they are just functions, whereas threads and processes are represented in Python as objects and are connected to very real native threads and processes in the operating system, with all the baggage that entails.
Again, this should not be surprising.
You can learn more about this in the tutorial:
The upshot is important.
We may want to adopt asyncio because it can support more concurrent tasks than another type of concurrency.
It can “do more” concurrent tasks per unit of concurrency compared to threads and processes if memory requirements per task are a constraint.
This may or may not hold in practice, and there are many benchmark results all over the web, mostly web server-focused, that go either way.
Your mileage may vary. Test.
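As a quick taste of the scalability argument, a sketch like the following will happily run ten thousand concurrent coroutines; the same count of Thread objects would consume far more memory for their stacks.

# a minimal sketch: asyncio running ten thousand concurrent tasks
import asyncio

async def work():
    await asyncio.sleep(1)

async def main():
    # create and await 10,000 concurrent coroutines
    tasks = [asyncio.create_task(work()) for _ in range(10000)]
    _ = await asyncio.gather(*tasks)

asyncio.run(main())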
Benchmark Consistency
The focus here will be the execution time of the same tasks with different units of concurrency.
We will compare coroutines and threads, and rope in processes for good measure.
Benchmarking will aim to be consistent across all tests.
We will aim to complete the same task using coroutines, threads, and processes. Naturally, the asyncio version will use coroutine functions, whereas the thread and process versions must use normal Python functions.
In all tests, we will create 100 concurrent tasks and wait for them to complete. The coroutine versions will create and await tasks, whereas the thread and process versions will create, start, and join Thread and Process objects.
Each benchmark will be repeated 30 times and the average of the 30 runs reported as the benchmark result for the given unit of concurrency. We take average time to give an expected performance and to counter variation introduced by the operating system, other programs that happen to be running, garbage collection, and so on.
Timing will be recorded using the time.perf_counter() function, preferred over time.time() because it may have higher precision and is not subject to changes to the system clock due to synchronization.
You can learn more about time.perf_counter() in the tutorial:
Yes, we could use the timeit API. I find it cumbersome in practice, and more critically, I find it confusing to readers. Nevertheless, you can adapt the examples below to use the timeit.timeit() API directly if you prefer.
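For completeness, here is a minimal sketch of what a timeit.timeit() version of the harness might look like; the main() body is a stand-in for the benchmark task, not one of the benchmarks below.

# a minimal sketch of the harness using timeit.timeit() instead
import time
import timeit

def main():
    time.sleep(0.01)  # stand-in for the benchmark body

n_repeat = 30
# timeit calls main() n_repeat times and returns the total elapsed time
total = timeit.timeit(main, number=n_repeat)
print(f'Took {total / n_repeat} seconds on average (n={n_repeat})')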
We will explore 5 benchmark use cases that I think attack the central question of coroutines vs threads from a speed perspective; they are:
- Benchmark fixed duration tasks.
- Benchmark variable duration tasks.
- Benchmark startup time only.
- Benchmark blocking IO-bound task.
- Benchmark blocking CPU-bound task.
Relax. Don’t take the results as the final word. They’re just another set of empirical results we can talk about.
For reference, all tests were performed with Python 3.11.4 on macOS on a 4.2 GHz Quad-Core Intel Core i7. Take the results as relative and directional of expected performance, rather than absolute and definitive.
Repeat all the tests on your machine and share the results in the comments; I’d love to see if we get the same findings (we should).
Think there is a fatal flaw in the reasoning here? Shout out in the comments, help me to see things the way you do.
Let’s dive in.
Test 01: Benchmark Fixed Duration Tasks
We can explore how coroutines, threads, and processes compare when executing a fixed-duration task.
In this case, the fixed duration task will be a sleep, native to the unit of concurrency.
This means that coroutine tasks will await a call to asyncio.sleep(), whereas threads and processes will call time.sleep().
The expectation is that coroutines, threads, and processes should all have times that are very similar, and dominated by the fixed-duration task. We may expect the processes to take longer given the added time to start up (spawn) child processes compared to starting threads and coroutines.
Example Benchmark with Coroutines
We can benchmark 100 concurrent coroutines with a fixed-duration task that sleeps for 5 seconds.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking fixed-duration tasks with coroutines
import time
import asyncio

# fixed duration task
async def work():
    # fixed duration task
    await asyncio.sleep(5)

# main program
async def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [asyncio.create_task(work()) for _ in range(100)]
    # wait for all tasks to be done
    _ = await asyncio.gather(*tasks)
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [asyncio.run(main()) for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 5.003148159776659 seconds.
>took 5.005606721970253 seconds
>took 5.005901835975237 seconds
>took 5.003903842996806 seconds
>took 5.004879605025053 seconds
>took 5.002281563938595 seconds
>took 5.003532132017426 seconds
>took 5.00103097804822 seconds
>took 5.001082134083845 seconds
>took 5.0028190850280225 seconds
>took 5.002347280969843 seconds
>took 5.0025020180037245 seconds
>took 5.002327059046365 seconds
>took 5.002018683007918 seconds
>took 5.001448153983802 seconds
>took 5.0015862949658185 seconds
>took 5.0029983459971845 seconds
>took 5.0019010490505025 seconds
>took 5.001183673040941 seconds
>took 5.002747917082161 seconds
>took 5.00373182806652 seconds
>took 5.005316140013747 seconds
>took 5.005476810969412 seconds
>took 5.003614882007241 seconds
>took 5.002382114995271 seconds
>took 5.002130510983989 seconds
>took 5.0052623560186476 seconds
>took 5.001809012028389 seconds
>took 5.00466235098429 seconds
>took 5.001973979989998 seconds
>took 5.005986433010548 seconds
Took 5.003148159776659 seconds on average (n=30)
Example Benchmark with Threads
We can benchmark 100 concurrent threads with a fixed-duration task that sleeps for 5 seconds.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking fixed-duration tasks with threads
import time
from threading import Thread

# fixed duration task
def work():
    # fixed duration task
    time.sleep(5)

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Thread(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 5.007344934732343 seconds.
>took 5.007713600993156 seconds
>took 5.005233556963503 seconds
>took 5.0035047109704465 seconds
>took 5.008227261947468 seconds
>took 5.007400047034025 seconds
>took 5.008706768043339 seconds
>took 5.007896321010776 seconds
>took 5.005644620046951 seconds
>took 5.00834703608416 seconds
>took 5.008340638014488 seconds
>took 5.007637929986231 seconds
>took 5.008523405063897 seconds
>took 5.004624187014997 seconds
>took 5.007473903940991 seconds
>took 5.007817853009328 seconds
>took 5.007267071981914 seconds
>took 5.008219021023251 seconds
>took 5.007796313031577 seconds
>took 5.007497936952859 seconds
>took 5.004158415016718 seconds
>took 5.007114102947526 seconds
>took 5.007810879033059 seconds
>took 5.008114760974422 seconds
>took 5.007892998983152 seconds
>took 5.007765169953927 seconds
>took 5.00831244699657 seconds
>took 5.008351197000593 seconds
>took 5.007599718985148 seconds
>took 5.007581111975014 seconds
>took 5.007775056990795 seconds
Took 5.007344934732343 seconds on average (n=30)
Example Benchmark with Processes
We can benchmark 100 concurrent processes with a fixed-duration task that sleeps for 5 seconds.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking fixed-duration tasks with processes
import time
from multiprocessing import Process

# fixed duration task
def work():
    # fixed duration task
    time.sleep(5)

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Process(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 5.926129415029815 seconds.
>took 5.974190598004498 seconds
>took 5.895346604986116 seconds
>took 5.979979321942665 seconds
>took 5.902725514955819 seconds
>took 5.958364231046289 seconds
>took 5.923908116994426 seconds
>took 5.898951902985573 seconds
>took 5.890727333026007 seconds
>took 5.904744418920018 seconds
>took 5.9984695360763 seconds
>took 5.977242992958054 seconds
>took 5.91637431900017 seconds
>took 5.915665547014214 seconds
>took 5.899516494944692 seconds
>took 5.924616903997958 seconds
>took 5.92795381101314 seconds
>took 5.903348780004308 seconds
>took 5.909356849035248 seconds
>took 5.974689574097283 seconds
>took 5.903668463928625 seconds
>took 5.9071951870573685 seconds
>took 5.910321347997524 seconds
>took 5.918995287967846 seconds
>took 5.913291255012155 seconds
>took 5.905852591036819 seconds
>took 5.980654550949112 seconds
>took 5.950753453071229 seconds
>took 5.895213556941599 seconds
>took 5.899536328972317 seconds
>took 5.922227576957084 seconds
Took 5.926129415029815 seconds on average (n=30)
Comparison of Results
The table below summarizes the results of executing 100 fixed-duration tasks with coroutines, threads, and processes.
Method     | Average Time (sec)
-----------|-------------------
Coroutines | 5.003148159776659
Threads    | 5.007344934732343
Processes  | 5.926129415029815
We can see that the overall time is dominated by the duration of the 5-second task, as expected.
We can also see that the timings for coroutines and threads are very similar with a difference of 0.004196774955684 seconds or about 4 milliseconds. This is so tiny that it may be within the margin of measurement error.
We can see that the example with processes is nearly a full second slower than the coroutine example, e.g. 0.922981255253156 seconds. This was expected as generally spawning child processes is significantly slower than starting threads and coroutines.
Test 02: Benchmark Variable Duration Tasks
We can explore how coroutines, threads, and processes compare when executing a variable-duration task.
In this case, the variable duration task will be a sleep for a fraction of 5 seconds determined by a pseudorandom number generator random.random(). No care will be taken to ensure the thread safety of the random numbers. Maybe this is sloppy, but I think it has no bearing (prove me wrong?).
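If you did want to rule out any interaction on the shared module-level generator, one option (an assumption on my part, not used in the benchmarks below) is to give each task its own random.Random instance:

# a minimal sketch: a per-task random generator, no shared state
import time
import random

def work():
    # each call gets its own generator, seeded from OS entropy
    rng = random.Random()
    time.sleep(5 * rng.random())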
A native sleep will be used for each method of concurrency. This means that coroutine tasks will await a call to asyncio.sleep(), whereas threads and processes will call time.sleep().
The expectation is that the overall group of tasks will take as long as the longest-duration task, which will be close to 5 seconds. This duration will dominate the timing. Again, the long start-up time of processes will make the execution time using child processes longer than either coroutines or threads.
The expectation is that the timing of threads and coroutines will be similar, perhaps separated by random variability in the task durations.
Example Benchmark with Coroutines
We can benchmark 100 concurrent coroutines with a variable-duration task that sleeps for a fraction of 5 seconds.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking variable-duration tasks with coroutines
import time
import random
import asyncio

# variable duration task
async def work():
    # variable duration task
    await asyncio.sleep(5 * random.random())

# main program
async def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [asyncio.create_task(work()) for _ in range(100)]
    # wait for all tasks to be done
    _ = await asyncio.gather(*tasks)
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [asyncio.run(main()) for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 4.9519578189007 seconds.
>took 4.9794992649694905 seconds
>took 4.982328077079728 seconds
>took 4.996504585957155 seconds
>took 4.942020524991676 seconds
>took 4.997824985999614 seconds
>took 4.988499372964725 seconds
>took 4.994024669984356 seconds
>took 4.96711247600615 seconds
>took 4.857614416978322 seconds
>took 4.970976333948784 seconds
>took 4.6339615900069475 seconds
>took 4.954266816959716 seconds
>took 4.934364631073549 seconds
>took 4.910091287922114 seconds
>took 4.979386947932653 seconds
>took 4.989278680062853 seconds
>took 4.970042674918659 seconds
>took 4.992362047079951 seconds
>took 4.949385932995938 seconds
>took 4.993119681952521 seconds
>took 4.958530042087659 seconds
>took 4.969902457087301 seconds
>took 4.991431168047711 seconds
>took 4.987695010029711 seconds
>took 4.998500703950413 seconds
>took 4.987839314038865 seconds
>took 4.938185329083353 seconds
>took 4.975908385007642 seconds
>took 4.916640411945991 seconds
>took 4.851436745957471 seconds
Took 4.9519578189007 seconds on average (n=30)
Example Benchmark with Threads
We can benchmark 100 concurrent threads with a variable-duration task that sleeps for a fraction of 5 seconds.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking variable-duration tasks with threads
import time
import random
from threading import Thread

# variable duration task
def work():
    # variable duration task
    time.sleep(5 * random.random())

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Thread(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 4.961258380332341 seconds.
>took 5.001233504968695 seconds
>took 4.934257935034111 seconds
>took 4.991752875968814 seconds
>took 4.976597366970964 seconds
>took 4.949255422921851 seconds
>took 4.922064215992577 seconds
>took 4.970534094958566 seconds
>took 4.96110759396106 seconds
>took 5.012660198030062 seconds
>took 4.930330466013402 seconds
>took 4.9412664350820705 seconds
>took 4.83308906503953 seconds
>took 4.990854993928224 seconds
>took 4.993070785072632 seconds
>took 4.9742144220508635 seconds
>took 4.899490543059073 seconds
>took 4.998987291008234 seconds
>took 4.944359121960588 seconds
>took 4.975336050963961 seconds
>took 4.990968734025955 seconds
>took 4.998389234999195 seconds
>took 4.959146684035659 seconds
>took 4.933758645085618 seconds
>took 4.92325114493724 seconds
>took 4.915331566007808 seconds
>took 4.98300524498336 seconds
>took 4.9956415949855 seconds
>took 4.956055152928457 seconds
>took 5.002632780000567 seconds
>took 4.979108244995587 seconds
Took 4.961258380332341 seconds on average (n=30)
Example Benchmark with Processes
We can benchmark 100 concurrent processes with a variable-duration task that sleeps for a fraction of 5 seconds.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking variable-duration tasks with processes
import time
import random
from multiprocessing import Process

# variable duration task
def work():
    # variable duration task
    time.sleep(5 * random.random())

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Process(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 5.741915735187164 seconds.
>took 5.878824210958555 seconds
>took 5.696801570942625 seconds
>took 5.749754771008156 seconds
>took 5.839017973979935 seconds
>took 5.907578862970695 seconds
>took 5.811398584046401 seconds
>took 5.7286086570238695 seconds
>took 5.710756152984686 seconds
>took 5.57938848703634 seconds
>took 5.417791018029675 seconds
>took 5.939029362984002 seconds
>took 5.671715242904611 seconds
>took 5.70442625193391 seconds
>took 5.872526949970052 seconds
>took 5.639694338082336 seconds
>took 5.7143202039878815 seconds
>took 5.763406664947979 seconds
>took 5.65289467794355 seconds
>took 5.759069915045984 seconds
>took 5.635834189946763 seconds
>took 5.702374238986522 seconds
>took 5.743349586962722 seconds
>took 5.908026635996066 seconds
>took 5.7052182629704475 seconds
>took 5.653151003993116 seconds
>took 5.791833568015136 seconds
>took 5.8976868039462715 seconds
>took 5.589777515968308 seconds
>took 5.884207276045345 seconds
>took 5.709009076002985 seconds
Took 5.741915735187164 seconds on average (n=30)
Comparison of Results
The table below summarizes the results of executing 100 variable-duration tasks with coroutines, threads, and processes.
Method     | Average Time (sec)
-----------|-------------------
Coroutines | 4.9519578189007
Threads    | 4.961258380332341
Processes  | 5.741915735187164
We can see that as expected, in all cases timing is dominated by the longest task which is a major fraction of 5 seconds.
We can see that the timing of coroutines and threads are very similar, as expected. The difference is only about 0.009300561431641 seconds or about 9 milliseconds in favor of coroutines.
We can see again that using processes was nearly a second slower than threads and coroutines, as we expected given the slow start-up time, e.g. 0.789957916286464 seconds slower than coroutines.
This highlights that our expectations are holding so far: with fixed and variable duration tasks, coroutines and threads perform about the same from an execution time perspective.
Test 03: Benchmark Startup Time Only
We can explore how coroutines, threads, and processes compare when executing an empty task.
This will compare task start-up time only.
We know from previous experiments that coroutines start faster than threads because a coroutine is a type of function, whereas a thread is a Python object tied to a real thread in the underlying operating system.
You can learn more about this in the tutorial:
Nevertheless, we have a robust test harness in this tutorial, so let’s use it and confirm the findings.
Example Benchmark with Coroutines
We can benchmark 100 concurrent coroutines with an empty task to test start-up time.
The complete example is listed below.
# SuperFastPython.com
# example of repeated benchmarking tasks with coroutines
import time
import asyncio

# empty task
async def work():
    pass

# main program
async def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [asyncio.create_task(work()) for _ in range(100)]
    # wait for all tasks to be done
    _ = await asyncio.gather(*tasks)
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [asyncio.run(main()) for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 0.004739730646057675 seconds.
>took 0.005347909987904131 seconds
>took 0.004523460986092687 seconds
>took 0.004601942957378924 seconds
>took 0.004634479992091656 seconds
>took 0.004282271023839712 seconds
>took 0.0046805780148133636 seconds
>took 0.004773869994096458 seconds
>took 0.00460624904371798 seconds
>took 0.004660396953113377 seconds
>took 0.004591502016410232 seconds
>took 0.006978853023611009 seconds
>took 0.004605635069310665 seconds
>took 0.004293047008104622 seconds
>took 0.0043420830043032765 seconds
>took 0.004730738000944257 seconds
>took 0.004409621004015207 seconds
>took 0.004569402080960572 seconds
>took 0.004708648077212274 seconds
>took 0.0046031050151214 seconds
>took 0.004314463003538549 seconds
>took 0.004589941003359854 seconds
>took 0.004424743005074561 seconds
>took 0.004493659012950957 seconds
>took 0.004938154947012663 seconds
>took 0.004276170046068728 seconds
>took 0.0064506810158491135 seconds
>took 0.0047845469089224935 seconds
>took 0.0044050110736861825 seconds
>took 0.004408059059642255 seconds
>took 0.005162697052583098 seconds
Took 0.004739730646057675 seconds on average (n=30)
Example Benchmark with Threads
We can benchmark 100 concurrent threads with an empty task to test start-up time.
The complete example is listed below.
# SuperFastPython.com
# example of repeated benchmark tasks with threads
import time
from threading import Thread

# empty task
def work():
    pass

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Thread(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 0.0061288173349263765 seconds.
>took 0.017854186007753015 seconds
>took 0.007404224015772343 seconds
>took 0.008532406063750386 seconds
>took 0.0070331760216504335 seconds
>took 0.007874942035414279 seconds
>took 0.01920797792263329 seconds
>took 0.015867796959355474 seconds
>took 0.004812743980437517 seconds
>took 0.004920091014355421 seconds
>took 0.0044369869865477085 seconds
>took 0.004294598009437323 seconds
>took 0.0040072170086205006 seconds
>took 0.004762348951771855 seconds
>took 0.004084032028913498 seconds
>took 0.004062370979227126 seconds
>took 0.004322782042436302 seconds
>took 0.004614311037585139 seconds
>took 0.004278869018889964 seconds
>took 0.0040333300130441785 seconds
>took 0.004042938002385199 seconds
>took 0.004748346051201224 seconds
>took 0.004133644979447126 seconds
>took 0.004252930986694992 seconds
>took 0.004081127932295203 seconds
>took 0.004312386037781835 seconds
>took 0.004327230039052665 seconds
>took 0.004057571990415454 seconds
>took 0.00452194397803396 seconds
>took 0.004909696988761425 seconds
>took 0.004072312964126468 seconds
Took 0.0061288173349263765 seconds on average (n=30)
Example Benchmark with Processes
We can benchmark 100 concurrent processes with an empty task to test start-up time.
The complete example is listed below.
# SuperFastPython.com
# example of repeated benchmark tasks with processes
import time
from multiprocessing import Process

# empty task
def work():
    pass

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Process(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 1.1669189567444846 seconds.
>took 1.1468775400426239 seconds
>took 1.1204417319968343 seconds
>took 1.1563100520288572 seconds
>took 1.181375317974016 seconds
>took 1.1913440090138465 seconds
>took 1.2022203609813005 seconds
>took 1.152252175961621 seconds
>took 1.214157114038244 seconds
>took 1.1970876979175955 seconds
>took 1.171911056037061 seconds
>took 1.1844127650838345 seconds
>took 1.233062399085611 seconds
>took 1.1473784600384533 seconds
>took 1.1950394000159577 seconds
>took 1.1723082889802754 seconds
>took 1.1344309240812436 seconds
>took 1.1255874049384147 seconds
>took 1.0986476360121742 seconds
>took 1.1141970760654658 seconds
>took 1.1348516209982336 seconds
>took 1.1226598740322515 seconds
>took 1.1108944459119812 seconds
>took 1.2577478190651163 seconds
>took 1.2719450040021911 seconds
>took 1.2095733399037272 seconds
>took 1.239167366991751 seconds
>took 1.18447899306193 seconds
>took 1.1385733240749687 seconds
>took 1.0991249760845676 seconds
>took 1.09951052791439 seconds
Took 1.1669189567444846 seconds on average (n=30)
Comparison of Results
The table below summarizes the results of executing 100 empty tasks with coroutines, threads, and processes.
Method     | Average Time (sec)
-----------|-------------------
Coroutines | 0.004739730646057675
Threads    | 0.0061288173349263765
Processes  | 1.1669189567444846
We can see that as we expected, the timing shows that coroutines are slightly faster than threads. The difference is about 0.001389086688869 seconds or about 1.4 milliseconds.
This is less than the difference seen above of 4 and 9 milliseconds for fixed-duration and variable-duration tasks respectively. The difference is tiny.
Perhaps within the margin of measurement error.
Perhaps the interpreter optimizes the empty tasks away in some cases.
We can see that as expected processes are much slower to start than both threads and coroutines, more than a second in both cases, e.g. 1.162179226098427 seconds slower than coroutines.
I suspect this is a fairer test than the prior tests and shows that the difference in starting time between threads and coroutines, at least for 100 units, is barely measurable.
Test 04: Benchmark Many Short Blocking IO-Bound Calls
We can explore how coroutines, threads, and processes compare when executing a blocking IO-bound function call.
In this case, we will use the time.sleep() function with zero seconds. Each task will execute this call in a loop, repeating it 10,000 times.
Calling time.sleep() in coroutines will block the event loop, and we expect that making 10,000 calls to time.sleep(0) will negatively impact the overall time of the 100 coroutines, forcing all 100 tasks to execute sequentially rather than concurrently.
Calling time.sleep(0) in Python threads and processes is expected to signal to the operating system to context switch, which the OS may or may not act upon. I suspect that the aggressive looping of 10,000 calls will result in a lot of context switching, perhaps negatively impacting the performance of threads and processes, but not preventing them from executing concurrently.
My thinking was that this might be a rare use case where threads and processes might outperform the coroutine version.
We can estimate the time of one task using timeit, for example on the command line:
python -m timeit -s "import time" "for i in range(10000): time.sleep(0)"
This reports the results:
50 loops, best of 5: 4.74 msec per loop
It suggests that one task takes about 5 milliseconds; therefore, if 100 tasks are executed sequentially, they should take about 5 * 100 = 500 milliseconds, or 0.5 seconds. If the result comes in under this, then some concurrency was achieved.
Example Benchmark with Coroutines
We can benchmark 100 concurrent coroutines with a task that loops 10,000 calls to time.sleep(0).
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking many small blocking calls with coroutines
import time
import asyncio

# task making many short blocking calls
async def work():
    # lots of short blocking calls
    for i in range(10000):
        time.sleep(0)

# main program
async def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [asyncio.create_task(work()) for _ in range(100)]
    # wait for all tasks to be done
    _ = await asyncio.gather(*tasks)
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [asyncio.run(main()) for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 0.4891133862081915 seconds.
>took 0.4878994390601292 seconds
>took 0.4997508830856532 seconds
>took 0.4862078399164602 seconds
>took 0.5073097830172628 seconds
>took 0.4989006270188838 seconds
>took 0.49623313394840807 seconds
>took 0.4978959229774773 seconds
>took 0.4810180439380929 seconds
>took 0.48254009906668216 seconds
>took 0.482198101002723 seconds
>took 0.48019723896868527 seconds
>took 0.48233177699148655 seconds
>took 0.4847577419131994 seconds
>took 0.49468081595841795 seconds
>took 0.49306463403627276 seconds
>took 0.4792039319872856 seconds
>took 0.4786785760661587 seconds
>took 0.4797952549997717 seconds
>took 0.4790061500389129 seconds
>took 0.4833352550631389 seconds
>took 0.5536640679929405 seconds
>took 0.49406041705515236 seconds
>took 0.5106325470842421 seconds
>took 0.4812723359791562 seconds
>took 0.48067384399473667 seconds
>took 0.47890967002604157 seconds
>took 0.47735631896648556 seconds
>took 0.47941096604336053 seconds
>took 0.48271288396790624 seconds
>took 0.4797032860806212 seconds
Took 0.4891133862081915 seconds on average (n=30)
Example Benchmark with Threads
We can benchmark 100 concurrent threads with a task that loops 10,000 calls to time.sleep(0).
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking many small blocking calls with threads
import time
from threading import Thread

# task making many short blocking calls
def work():
    # lots of short blocking calls
    for i in range(10000):
        time.sleep(0)

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Thread(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 4.327525828394573 seconds.
>took 4.293873360031284 seconds
>took 4.376115886028856 seconds
>took 4.475935518974438 seconds
>took 4.433653346030042 seconds
>took 4.43829051789362 seconds
>took 4.341383031918667 seconds
>took 4.397841645986773 seconds
>took 4.275599429034628 seconds
>took 4.250186958000995 seconds
>took 4.241187260020524 seconds
>took 4.243161487043835 seconds
>took 4.225471145939082 seconds
>took 4.282393084024079 seconds
>took 4.358860425068997 seconds
>took 4.369623012025841 seconds
>took 4.443313602008857 seconds
>took 4.414740073028952 seconds
>took 4.471585859078914 seconds
>took 4.479904160951264 seconds
>took 4.423275716020726 seconds
>took 4.415262004942633 seconds
>took 4.248398449970409 seconds
>took 4.221310467924923 seconds
>took 4.262456791009754 seconds
>took 4.260851696017198 seconds
>took 4.237421901896596 seconds
>took 4.224775590002537 seconds
>took 4.2305707620689645 seconds
>took 4.227306513930671 seconds
>took 4.261025154963136 seconds
Took 4.327525828394573 seconds on average (n=30)
Example Benchmark with Processes
We can benchmark 100 concurrent processes with a task that loops 10,000 calls to time.sleep(0).
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking many small blocking calls with processes
import time
from multiprocessing import Process

# task making many short blocking calls
def work():
    # lots of short blocking calls
    for i in range(10000):
        time.sleep(0)

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Process(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 1.1696696432229752 seconds.
>took 1.2296249369392172 seconds
>took 1.1739729280816391 seconds
>took 1.1501781400293112 seconds
>took 1.168456916930154 seconds
>took 1.1498259670333937 seconds
>took 1.1953479090007022 seconds
>took 1.1676800379063934 seconds
>took 1.1929733549477533 seconds
>took 1.184617949067615 seconds
>took 1.1395360080059618 seconds
>took 1.1461424650624394 seconds
>took 1.178561212029308 seconds
>took 1.212078657001257 seconds
>took 1.1456095179310068 seconds
>took 1.1693206139607355 seconds
>took 1.1887674330500886 seconds
>took 1.1555459150113165 seconds
>took 1.148561573936604 seconds
>took 1.1502260669367388 seconds
>took 1.1507865960011259 seconds
>took 1.1596319419331849 seconds
>took 1.193502969108522 seconds
>took 1.1817562490468845 seconds
>took 1.1622605109587312 seconds
>took 1.159014487056993 seconds
>took 1.1667339309351519 seconds
>took 1.1656803669175133 seconds
>took 1.1696518359240144 seconds
>took 1.1615390349179506 seconds
>took 1.17250377102755 seconds
Took 1.1696696432229752 seconds on average (n=30)
Comparison of Results
The table below summarizes the results of executing 100 tasks that loop 10,000 times with calls to time.sleep(0) using coroutines, threads, and processes.
Method     | Average Time (sec)
-----------|-------------------
Coroutines | 0.4891133862081915
Threads    | 4.327525828394573
Processes  | 1.1696696432229752
The results surprised me.
I expected coroutines to be slower than threads because of the sequential execution of all 100 tasks. Instead, we can see that coroutines were faster than threads with a difference of 3.838412442186382 seconds.
Nevertheless, the coroutines took as long as we expected from a theoretical perspective to execute all 100 tasks sequentially, e.g. about half a second.
When it comes to the threads, I suspect that the operating system context switches the threads like crazy and that the overhead of so much context switching was worse than executing all 100 tasks sequentially in the asyncio event loop. Fascinating.
We can also see that the processes outperformed the threads, in this case by about 3.157856185171598 seconds.
So if the theory that threads context switched like crazy is the reason why threads performed so badly here, then why didn’t processes show the same effect? Child processes execute the task in a main thread. Is the request to context switch less likely to be honored if there is only a single thread in the process rather than 100? I doubt it.
More work is needed to understand this finding.
Test 05: Benchmark Many Short CPU-Bound Calls
We can explore how coroutines, threads, and processes compare when executing a blocking CPU-bound function call.
In this case, we will create a list of one million squared integers in a list comprehension, for example:
...
# long cpu-bound task
[i*i for i in range(1000000)]
This is a CPU-bound task and a blocking call.
The expectation is that it will block the asyncio event loop and cause all coroutines to execute their tasks sequentially instead of concurrently.
Further, the expectation is that each task will hold the GIL while it runs, causing all threads to execute their tasks sequentially instead of concurrently.
Perhaps this will mean that threads and coroutines will perform similarly in this case.
The expectation is that processes will execute the task concurrently and in parallel given the one GIL per process, but perhaps the one-second overhead of starting the child processes may overwhelm any benefit.
We can time the snippet with timeit, for example from the command line:
python -m timeit "[i*i for i in range(1000000)]"
This gives:
5 loops, best of 5: 58.9 msec per loop
So, we can expect that each task will take about 60 milliseconds.
A total of 100 tasks will take 100*60 or 6,000 milliseconds or 6 seconds to complete sequentially, and this may be the expected time to complete all tasks with coroutines and threads.
Example Benchmark with Coroutines
We can benchmark 100 concurrent coroutines with a CPU-bound task.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking many small cpu-bound calls with coroutines
import time
import asyncio

# cpu-bound task
async def work():
    # long cpu-bound task
    [i*i for i in range(1000000)]

# main program
async def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [asyncio.create_task(work()) for _ in range(100)]
    # wait for all tasks to be done
    _ = await asyncio.gather(*tasks)
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [asyncio.run(main()) for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 6.203456498515637 seconds.
>took 6.26790036901366 seconds
>took 6.087358352961019 seconds
>took 6.140904022962786 seconds
>took 6.112774651963264 seconds
>took 6.098512538941577 seconds
>took 6.180439983960241 seconds
>took 6.289000188931823 seconds
>took 6.111457554041408 seconds
>took 6.225467278971337 seconds
>took 6.246900469996035 seconds
>took 6.221396305016242 seconds
>took 6.250519399996847 seconds
>took 6.172272679046728 seconds
>took 6.137484053964727 seconds
>took 6.237339652958326 seconds
>took 6.166124025941826 seconds
>took 6.210153375985101 seconds
>took 6.222252605948597 seconds
>took 6.250817980966531 seconds
>took 6.255593008012511 seconds
>took 6.253392655984499 seconds
>took 6.285578145063482 seconds
>took 6.222089056973346 seconds
>took 6.308556971955113 seconds
>took 6.1545265859458596 seconds
>took 6.274848052999005 seconds
>took 6.2174068910535425 seconds
>took 6.180353172007017 seconds
>took 6.117866926942952 seconds
>took 6.20440799696371 seconds
Took 6.203456498515637 seconds on average (n=30)
Example Benchmark with Threads
We can benchmark 100 concurrent threads with a CPU-bound task.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking many small cpu-bound calls with threads
import time
from threading import Thread

# cpu-bound task
def work():
    # long cpu-bound task
    [i*i for i in range(1000000)]

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Thread(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 5.568172295178131 seconds.
>took 5.549847619025968 seconds
>took 5.59445943904575 seconds
>took 5.458023883984424 seconds
>took 5.4705089629860595 seconds
>took 5.4718697289936244 seconds
>took 5.516852483036928 seconds
>took 5.4527759230695665 seconds
>took 5.556496188044548 seconds
>took 5.483522651018575 seconds
>took 5.480848659994081 seconds
>took 5.532506332034245 seconds
>took 5.553190176025964 seconds
>took 5.557271999074146 seconds
>took 5.787318727001548 seconds
>took 6.308657961082645 seconds
>took 5.633192701963708 seconds
>took 5.592133978032507 seconds
>took 5.596235709963366 seconds
>took 5.51918849197682 seconds
>took 5.6621356919640675 seconds
>took 5.573206617031246 seconds
>took 5.486937104025856 seconds
>took 5.483662002021447 seconds
>took 5.484646036056802 seconds
>took 5.484199552913196 seconds
>took 5.74580648902338 seconds
>took 5.555926422006451 seconds
>took 5.505242593935691 seconds
>took 5.507720064953901 seconds
>took 5.440784665057436 seconds
Took 5.568172295178131 seconds on average (n=30)
Example Benchmark with Processes
We can benchmark 100 concurrent processes with a CPU-bound task.
The complete example is listed below.
# SuperFastPython.com
# example of benchmarking many small cpu-bound calls with processes
import time
from multiprocessing import Process

# cpu-bound task
def work():
    # long cpu-bound task
    [i*i for i in range(1000000)]

# main program
def main():
    # record the start time
    time_start = time.perf_counter()
    # create many tasks for concurrent execution
    tasks = [Process(target=work) for _ in range(100)]
    # start all tasks
    for task in tasks:
        task.start()
    # block and wait for all tasks to complete
    for task in tasks:
        task.join()
    # record the end time
    time_end = time.perf_counter()
    # calculate and report the duration
    time_duration = time_end - time_start
    print(f'>took {time_duration} seconds')
    # return the duration
    return time_duration

# protect the entry point
if __name__ == '__main__':
    # number of repeats
    n_repeat = 30
    # collect results
    results = [main() for _ in range(n_repeat)]
    # calculate the average
    time_avg = sum(results) / n_repeat
    # report average
    print(f'Took {time_avg} seconds on average (n={n_repeat})')
Running the example, we can see that the average time to complete all tasks is about 2.660895350195157 seconds.
>took 2.6358229879988357 seconds
>took 2.696469657937996 seconds
>took 2.7975021119927987 seconds
>took 2.6465017489390448 seconds
>took 2.6222428189357743 seconds
>took 2.6590355520602316 seconds
>took 2.657320981961675 seconds
>took 2.6290031319949776 seconds
>took 2.654458374949172 seconds
>took 2.65421222592704 seconds
>took 2.663659874931909 seconds
>took 2.7032910980051383 seconds
>took 2.6661344930762425 seconds
>took 2.66898558603134 seconds
>took 2.6605309769511223 seconds
>took 2.6630956949666142 seconds
>took 2.6483234349871054 seconds
>took 2.660338986082934 seconds
>took 2.6446798619581386 seconds
>took 2.638959081028588 seconds
>took 2.6404150440357625 seconds
>took 2.638270872994326 seconds
>took 2.8066139520378783 seconds
>took 2.6292464279104024 seconds
>took 2.632936291047372 seconds
>took 2.6408941299887374 seconds
>took 2.6718718740157783 seconds
>took 2.627940032980405 seconds
>took 2.647110069054179 seconds
>took 2.6209931310731918 seconds
Took 2.660895350195157 seconds on average (n=30)
Comparison of Results
The table below summarizes the results of executing 100 tasks that create a list of one million squared integers with coroutines, threads, and processes.
Method     | Average Time (sec)
-----------|-------------------
Coroutines | 6.203456498515637
Threads    | 5.568172295178131
Processes  | 2.660895350195157
We can see that the results for coroutines match the theoretical expectation of 100 tasks executed sequentially, e.g. about 6 seconds.
Interestingly, we can see that the threads were faster than coroutines in this case, faster by about 0.635284203337506 seconds.
This is surprising; I expected both coroutines and threads to have very similar performance. Nevertheless, the finding is inconsequential: we do not want to execute CPU-bound tasks in this way, e.g. sequentially by blocking the event loop and holding the GIL.
Why? I expect switching between blocking tasks in the event loop to be faster than context-switching threads in the OS. Maybe this assumption is wrong. We are switching sequentially in the event loop as each task is completed. The OS is probably doing the same thing, but maybe not.
As we expected, using processes was the fastest because processes can achieve full parallelism, with one GIL per child process.
Another good reminder, if we needed it, is that pure-Python CPU-bound tasks should be executed with multiprocessing, probably in a process pool or ProcessPoolExecutor.
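For example, here is a minimal sketch of the same CPU-bound task pushed through a ProcessPoolExecutor, which starts a fixed set of worker processes once and reuses them; it is an illustration, not one of the benchmarks above.

# a minimal sketch: the cpu-bound task in a reusable process pool
from concurrent.futures import ProcessPoolExecutor

# long cpu-bound task
def work():
    [i*i for i in range(1000000)]

# protect the entry point
if __name__ == '__main__':
    # the pool defaults to one worker per logical CPU
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(work) for _ in range(100)]
        for future in futures:
            future.result()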
Review and Analysis
What did we find?
Coroutines are not faster than threads. They perform about the same.
Well, maybe coroutines are slightly faster to start, on the order of a few milliseconds, but this might be within the bounds of measurement error. If I could be bothered I’d spin up my stats and comment on the means and the standard error. Not today.
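That said, the calculation is short if you want to run it on your own results; a minimal sketch using the standard library (fed here with the first five coroutine timings from test 01, rounded):

# a minimal sketch: mean and standard error of benchmark results
from statistics import mean, stdev

def summarize(results):
    # standard error of the mean = sample stdev / sqrt(n)
    n = len(results)
    return mean(results), stdev(results) / n ** 0.5

avg, stderr = summarize([5.0056, 5.0059, 5.0039, 5.0049, 5.0023])
print(f'mean={avg:.4f} stderr={stderr:.4f}')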
Remember, we don’t reach for asyncio and coroutines because they are faster than threads.
We reach for asyncio and coroutines for other reasons, like:
- We want (or are forced by an API) to do asynchronous programming in Python.
- We want the scalability offered by units of concurrency that use less memory.
- Other reasons.
For more on this, see the tutorial:
We did discover that threads go crazy in one process when requested to context switch a ton with time.sleep(0) calls. Maybe the test design was bad and we found some strangeness; I’d bet on this. A better test might be to have all tasks read an OS file and count bytes, e.g. a real blocking I/O-bound task.
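A sketch of such a task might look like the following; the file path is a placeholder of my choosing, any large local file would do.

# a minimal sketch of a real blocking io-bound task: read a file
def work():
    # open and read an entire file, then count its bytes
    with open('/usr/share/dict/words', 'rb') as handle:
        data = handle.read()
    return len(data)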
We did discover that even with the GIL, threads can do slightly better than coroutines when abused to execute many CPU-bound tasks concurrently. I think the finding is real, and that some who quote threads as doing better are testing this kind of scenario (e.g. not going async “all the way down”). Not sure it matters here, as we are misusing threads and coroutines for an inappropriate task.
This was fun.
Thoughts?
Further Reading
This section provides additional resources that you may find helpful.
Python Asyncio Books
- Python Asyncio Mastery, Jason Brownlee (my book!)
- Python Asyncio Jump-Start, Jason Brownlee.
- Python Asyncio Interview Questions, Jason Brownlee.
- Asyncio Module API Cheat Sheet
I also recommend the following books:
- Python Concurrency with asyncio, Matthew Fowler, 2022.
- Using Asyncio in Python, Caleb Hattingh, 2020.
- asyncio Recipes, Mohamed Mustapha Tahrioui, 2019.
APIs
- asyncio — Asynchronous I/O
- Asyncio Coroutines and Tasks
- Asyncio Streams
- Asyncio Subprocesses
- Asyncio Queues
- Asyncio Synchronization Primitives
Takeaways
You now know that threads and coroutines perform about the same on average.
Did I make a mistake? See a typo?
I’m a simple humble human. Correct me, please!
Do you have any additional tips?
I’d love to hear about them!
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Kenny Eliason on Unsplash
Rene Nejsum says
Nice job 🙂
Did you consider doing the last test with a non-GIL Python? I think it would be interesting to see, how the non-GIL version performs in these tests.
Jason Brownlee says
Thanks for the suggestion.
Frank says
I tested the CPU-bound calls and asyncio runs 1.15x faster.
env: python3.11; os: windows 10; cpu: Intel Core i7-9750H
Jason Brownlee says
Nice, thank you for sharing!