Repeat Benchmarks to Get Stable Results


You can improve the benchmark estimate by repeating a benchmark many times and reporting the average score.

Each time a benchmark is performed, a different result will be reported. This happens for many reasons, such as other activity in the operating system, external dependencies, hardware behavior, and more. The result is an estimate of the performance of a program with added random statistical noise.

We can counter the statistical noise in single benchmark results by repeating a benchmark many times and averaging the results, canceling out much of the variation. This gives a more stable estimate of performance, one that, if the procedure were repeated, would produce a very similar or nearly identical result.

In this tutorial, you will discover how to develop stable benchmark measures in Python.

Let's get started.

Benchmark Results Differ Every Time

A problem with benchmarking Python code is that we will get a different benchmark result every time.

For example, we can use functions like time.time() or time.perf_counter() to record a start time before a block of code and an end time after it, and then calculate and report the difference between the two times.
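A minimal sketch of this pattern might look as follows; the block of code being timed is just a placeholder:

# time a block of code with perf_counter()
from time import perf_counter
# record the start time
time_start = perf_counter()
# the block of code being benchmarked (placeholder work)
data = [i*i for i in range(1000000)]
# record the end time
time_end = perf_counter()
# calculate and report the difference between the two times
print(f'Took {time_end - time_start} seconds')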

Repeating this simple benchmark more than once will give different results.

Which value should we report as the benchmark score?

How do we handle the case that benchmark results differ from one run to the next?

Next, let's consider the problems that varying benchmark results introduce.

The Problem With Benchmark Results That Vary

Using and reporting benchmark results from a single run, without repeating the benchmark and reporting an average, can lead to several issues and inaccuracies in performance analysis.

This can lead to problems such as unreliable and unreproducible results, misleading conclusions about whether a change made the code faster or slower, and unfair comparisons between alternative implementations.

Next, let's consider why benchmark execution times vary.

Why Do Execution Times Vary?

Execution time results can differ in each run due to various factors and sources of variability.

Understanding these factors is crucial for accurate performance analysis.

Some common reasons why execution time results can vary from run to run include other activity in the operating system, external dependencies such as disk and network I/O, hardware behavior such as CPU frequency scaling and caching, and background work within the Python runtime such as garbage collection.

The system is never the same from moment to moment.

We must expect benchmark results to vary.

Therefore, we need to manage or control for the variability in benchmark results.

Use Repetition to Calculate Fair Estimates of Performance

We can use simple statistics to develop a stable benchmark result.

This can be achieved by repeating a given benchmark more than once and calculating the average.

From a statistical perspective, there is a true but unknown underlying benchmark time. Each time we perform a benchmark and collect a benchmark time, we are estimating this true but unknown value. The problem is that each estimate has statistical noise added to it, for the reasons discussed above.

Therefore, to overcome the statistical noise, we can draw many samples of the benchmark time by repeating the benchmark many times.

Calculating the average of these samples will get us much closer to the true but unknown underlying benchmark time. The random statistical noise will largely cancel out, and we will be left with a much closer approximation of the true benchmark time.
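We can illustrate this idea with a quick simulation (not a real benchmark); the true time and the amount of noise below are made-up numbers, chosen only to show that the average of many noisy samples lands much closer to the true value than any single sample:

# simulate noisy measurements around a known 'true' benchmark time
import random
# the true but unknown underlying benchmark time (made-up for illustration)
true_time = 5.0
# draw 30 samples, each with random statistical noise added
samples = [true_time + random.gauss(0, 0.2) for _ in range(30)]
# a single sample may be off by a lot
print(f'single sample: {samples[0]:.3f}')
# the average of many samples is much closer to the true time
print(f'average of 30: {sum(samples)/len(samples):.3f}')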

What Is The Procedure?

The old procedure is as follows:

1. Record the start time.
2. Execute the target code.
3. Record the end time.
4. Report the difference between the end time and the start time as the benchmark score.

We know that this is not good enough, as we will report a different score every time the benchmark is performed.

Therefore, the better procedure for repeating a benchmark and reporting the average is as follows:

1. Repeat the benchmark many times, recording the duration of each run.
2. Store all of the recorded durations in a list.
3. Calculate the average of the collected durations.
4. Report the average as the benchmark score.

This will give a better estimate of the benchmark.

How Many Times to Repeat The Benchmark?

The result of the procedure will be a stable estimate of the execution time of the target code.

Stable means that we could repeat the same procedure on the same code and on the same computer system at a later date and get an estimate of the execution time that is close to the same value, much closer than two individual runs would be to each other.

Two average execution times will not be identical; there will always be small differences. This is the nature of statistical measurements. The difference accounts for statistical noise that we are unable to remove from the estimate. If there were no statistical noise, we would not need to estimate in the first place; one run and one measurement would be sufficient.

We can improve the estimate of the performance, which will reduce the difference between two runs of the procedure.

This can be achieved by increasing the number of samples, e.g. the number of repetitions of the benchmark.

How many repetitions are needed?

Going beyond 100 or 1,000 repetitions quickly reaches a point of diminishing returns, depending on the code that is being benchmarked.

Generally, the number of repetitions comes down to how long we can afford to wait.

Fewer repetitions are typically used with slower code, and more repetitions are used with faster code.

How to Calculate The Average Benchmark Score

The first step is to collect multiple individual benchmark scores.

These can be collected in a list.

For example, if we had a function named benchmark() that performed a single benchmark of the target code and returned a time in seconds, we could call this many times and store the results in a list.

...
# the number of times to repeat the benchmark
n_repeats = 30
# collect a list of benchmark scores
all_times = [benchmark() for _ in range(n_repeats)]

Or, we might use a for loop and report each individual score along the way to show progress.

I like this approach; it might look as follows:

...
# the number of times to repeat the benchmark
n_repeats = 30
# benchmark results
all_times = list()
# repeat the benchmark many times
for i in range(n_repeats):
    # calculate the duration
    time_duration = benchmark()
    # report the duration
    print(f'>trial {i} took {time_duration} seconds')
    # store the result
    all_times.append(time_duration)

The average is also referred to by the technical name "arithmetic mean" and is calculated as the sum of all times divided by the number of times that were collected.

In mathematics and statistics, the arithmetic mean, arithmetic average, or just the mean or average (when the context is clear) is the sum of a collection of numbers divided by the count of numbers in the collection.

-- Arithmetic mean, Wikipedia.

For example, if we collected n benchmark times t1, t2, ..., tn, the average would be (t1 + t2 + ... + tn) / n.

In code, we can use the built-in sum() function, for example:

...
# calculate the average duration
time_avg = sum(all_times) / n_repeats
# report the average time
print(f'Took {time_avg} seconds on average')

We can also calculate it using the statistics.mean() function.

For example:

...
# calculate the mean score
time_avg = statistics.mean(all_times)

I prefer to calculate the mean manually instead of using the statistics.mean() function, as doing so does not require any additional imports.

When we report the score, we must indicate that it is an average rather than a single estimate of performance.

For example:

...
Took 2.3 seconds on average

It might also be a good idea to report the standard deviation.

Recall that the mean is the expected value or middle of a normal (bell curve) distribution. The standard deviation summarizes the average spread of values around the mean in the same distribution.

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

-- Standard deviation, Wikipedia.

If we have both the mean and the standard deviation, we can characterize the distribution of benchmark times (assuming it is approximately normal). This is helpful when reporting results formally, such as in a report or research context. Someone familiar with statistics can then use statistical tools to compare distributions directly (e.g. benchmarks with and without a change) and report the statistical significance of the results (e.g. did the change really make a difference). This is out of the scope of this tutorial.

Don't calculate the standard deviation manually. There are a few ways to calculate it and you may use the wrong one (e.g. population vs sample standard deviation). Instead, we can use the statistics.stdev() function to calculate the standard deviation of the sample.

For example:

...
# calculate the standard deviation
time_stdv = statistics.stdev(all_times)

We can report the mean and standard deviation together:

...
Took 2.3 seconds on average (stdev=0.1)

For completeness, we can also report the number of benchmark repetitions that were sampled, reported as "n".

For example:

...
Took 2.3 seconds on average (stdev=0.1, n=10)
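Putting these pieces together, the reporting line might look as follows; this is a sketch that assumes all_times, time_avg, time_stdv, and n_repeats have been calculated as above:

...
# report the average, standard deviation, and number of repetitions
print(f'Took {time_avg:.1f} seconds on average (stdev={time_stdv:.1f}, n={n_repeats})')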

In most cases, reporting the average time is sufficient for making decisions ourselves.

I would recommend including the standard deviation (stdev) and the number of repetitions (n) only when reporting results formally, or when someone else also needs to look at the results to help make a decision.

Why Repeat Benchmarks Many Times?

Repeating a code benchmark many times and reporting the average is a common practice in performance analysis for several reasons: it cancels out much of the random noise in individual measurements, it produces an estimate that is reproducible from run to run, it allows the variability of the results to be quantified with the standard deviation, and it supports fair comparisons between alternative implementations.

Repeating a code benchmark multiple times and reporting the average is essential for obtaining accurate, reliable, and statistically meaningful performance measurements.

Now that we know how to calculate the average benchmark score, let's look at some worked examples.

Example of Variable Benchmark Results

Before we look at how to repeat a benchmark and calculate the average score, let's confirm that performing the same single benchmark many times results in a different score each time.

In this case, we will define a task that creates a list of squared integers. We will then benchmark this function and report the duration in seconds. We will record the time before and after the target code using the time.perf_counter() function.

Firstly, we can define a task() function that performs the CPU-intensive task of creating a list of 100 million squared integers in a list comprehension.

# function to benchmark
def task():
    # create a large list
    data = [i*i for i in range(100000000)]

Next, we define a benchmark() function that records the start time, calls the task() function, records the end time, and calculates and reports the overall duration in seconds.

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # report the duration
    print(f'Took {time_duration:.3f} seconds')

The benchmarking of the target task() function is performed in a standalone function named benchmark(). We will call this function multiple times, and a benchmark score will be reported each time.

...
# repeat the same benchmark a few times
benchmark()
benchmark()
benchmark()

Tying this together, the complete example is listed below.

# SuperFastPython.com
# example of variability in benchmark results
from time import perf_counter

# function to benchmark
def task():
    # create a large list
    data = [i*i for i in range(100000000)]

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # report the duration
    print(f'Took {time_duration:.3f} seconds')

# repeat the same benchmark a few times
benchmark()
benchmark()
benchmark()

Running the example benchmarks the same task() function three times.

In this case, we can see that each time the task() function is benchmarked, we get a different score.

The results are rounded to 3 decimal places to keep things simple. If we allowed full double floating-point precision and ran the same benchmark thousands of times, we would likely get thousands of different benchmark scores.

The reason for the difference is random statistical noise in the estimate of the benchmark time.

This highlights that we can expect to get a different benchmark result every time we execute a single benchmark.

Took 6.138 seconds
Took 5.920 seconds
Took 5.904 seconds

Next, let's explore how we can make the benchmark more stable by repeating it many times and reporting the average.

Example of Average Benchmark Results

We can explore how to repeat a benchmark many times and report the average result.

In this case, we can update the above example to perform the same benchmark many times, collect the results, and report the average.

Firstly, we can update the benchmark() function to return the duration rather than report it.

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # return the duration
    return time_duration

Next, we can define a new function repeat_benchmark() that defines the number of repeats and a list for storing all of the duration times. It then loops, calling the benchmark() function many times and storing each result. Finally, it calculates the average of all duration times and reports it.

# repeat a benchmark
def repeat_benchmark():
    # define the number of repeats
    n_repeats = 30
    # list for storing all results
    all_times = list()
    # repeat the benchmark many times
    for i in range(n_repeats):
        # benchmark and retrieve the duration
        time_duration = benchmark()
        # report the duration
        print(f'>trial {i} took {time_duration:.3f} seconds')
        # store the result
        all_times.append(time_duration)
    # calculate the average duration
    time_avg = sum(all_times) / n_repeats
    # report the average time
    print(f'Took {time_avg:.3f} seconds on average')

We can then call this function to perform the repeated benchmark of the task() function and calculate a stable estimate of the function's performance.

...
# execute a repeated benchmark
repeat_benchmark()

Tying this together, the complete example is listed below.

# SuperFastPython.com
# example of stable benchmark results
from time import perf_counter

# function to benchmark
def task():
    # create a large list
    data = [i*i for i in range(100000000)]

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # return the duration
    return time_duration

# repeat a benchmark
def repeat_benchmark():
    # define the number of repeats
    n_repeats = 30
    # list for storing all results
    all_times = list()
    # repeat the benchmark many times
    for i in range(n_repeats):
        # benchmark and retrieve the duration
        time_duration = benchmark()
        # report the duration
        print(f'>trial {i} took {time_duration:.3f} seconds')
        # store the result
        all_times.append(time_duration)
    # calculate the average duration
    time_avg = sum(all_times) / n_repeats
    # report the average time
    print(f'Took {time_avg:.3f} seconds on average')

# execute a repeated benchmark
repeat_benchmark()

Running the example benchmarks the task() function 30 times and reports the average duration.

Helpfully, the estimated performance is reported in each iteration, showing how the estimated time of the task() function varies each time the benchmark() function is run.

In this case, the average duration is reported as 5.997 seconds.

>trial 0 took 5.999 seconds
>trial 1 took 6.250 seconds
>trial 2 took 6.115 seconds
>trial 3 took 6.057 seconds
>trial 4 took 6.132 seconds
>trial 5 took 6.155 seconds
>trial 6 took 6.106 seconds
>trial 7 took 6.119 seconds
>trial 8 took 5.938 seconds
>trial 9 took 5.949 seconds
>trial 10 took 5.935 seconds
>trial 11 took 5.912 seconds
>trial 12 took 6.102 seconds
>trial 13 took 6.015 seconds
>trial 14 took 5.980 seconds
>trial 15 took 5.905 seconds
>trial 16 took 5.899 seconds
>trial 17 took 5.916 seconds
>trial 18 took 5.912 seconds
>trial 19 took 5.990 seconds
>trial 20 took 5.966 seconds
>trial 21 took 5.933 seconds
>trial 22 took 6.043 seconds
>trial 23 took 6.008 seconds
>trial 24 took 5.919 seconds
>trial 25 took 5.897 seconds
>trial 26 took 5.931 seconds
>trial 27 took 5.919 seconds
>trial 28 took 5.953 seconds
>trial 29 took 5.954 seconds
Took 5.997 seconds on average

This result is more stable than a single run of the benchmark() function.

We can demonstrate this.

Re-run the entire program and note the result.

In this case, the result is 5.994 seconds. This is very close to the 5.997 seconds reported in the previous run, showing the stability of the procedure for the task() function.

We could achieve even more stability by increasing the number of repeats from 30 to 100 and perhaps increasing the precision of the average result from 3 decimal places to 6 to look for any differences.

...
>trial 25 took 5.965 seconds
>trial 26 took 5.951 seconds
>trial 27 took 5.966 seconds
>trial 28 took 5.947 seconds
>trial 29 took 6.036 seconds
Took 5.994 seconds on average
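Increasing the repetitions and precision, as suggested above, would only require updating two lines of the repeat_benchmark() function; a sketch of the updated lines is below:

...
# define the number of repeats (increased from 30 to 100)
n_repeats = 100
...
# report the average time with 6 decimal places of precision
print(f'Took {time_avg:.6f} seconds on average')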

Takeaways

You now know how to develop stable benchmark measures in Python.



If you enjoyed this tutorial, you will love my book: Python Benchmarking. It covers everything you need to master the topic with hands-on examples and clear explanations.