Repeat Benchmarks to Get Stable Results


You can improve the benchmark estimate by repeating a benchmark many times and reporting the average score.

Each time a benchmark is performed, a different result will be reported. This happens for many reasons, such as other activity in the operating system, external dependencies, hardware behavior, and more. The result is an estimate of the performance of a program with added random statistical noise.

We can counter the statistical noise in single benchmark results by repeating a benchmark many times and averaging the results, canceling out much of the variation. This gives a more stable estimate of performance, one that, if the procedure were repeated, would produce a very similar or nearly identical result.

In this tutorial, you will discover how to develop stable benchmark measures in Python.

Let's get started.

Benchmark Results Differ Every Time

A problem with benchmarking Python code is that we will get a different benchmark result every time.

For example, we can use functions like time.time() or time.perf_counter() to record a start time before a block of code and an end time after it, and then calculate and report the difference between the two times.
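A minimal sketch of this pattern might look as follows; the block of code being timed is just a placeholder:

# time a block of code with perf_counter()
from time import perf_counter
# record the start time
time_start = perf_counter()
# the block of code being benchmarked (placeholder work)
data = [i*i for i in range(1000000)]
# record the end time
time_end = perf_counter()
# calculate and report the difference between the two times
print(f'Took {time_end - time_start} seconds')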

Repeating this simple benchmark more than once will give different results.

Which value should we report as the benchmark score?

How do we handle the case that benchmark results differ from one run to the next?

Next, let's consider the problems that varying benchmark results introduce.

The Problem With Benchmark Results That Vary

Using and reporting benchmark results from a single run, without repeating the benchmark and reporting an average, can lead to several issues and inaccuracies in performance analysis.

This can lead to problems such as unreliable and unreproducible results, misleading conclusions about whether a change made the code faster or slower, and unfair comparisons between alternative implementations.

Next, let's consider why benchmark execution times vary.

Why Do Execution Times Vary?

Execution time results can differ in each run due to various factors and sources of variability.

Understanding these factors is crucial for accurate performance analysis.

Some common reasons why execution time results can vary from run to run include other activity in the operating system, external dependencies such as disk and network I/O, hardware behavior such as CPU frequency scaling and caching, and background work within the Python runtime such as garbage collection.

The system is never the same from moment to moment.

We must expect benchmark results to vary.

Therefore, we need to manage or control for the variability in benchmark results.

Use Repetition to Calculate Fair Estimates of Performance

We can use simple statistics to develop a stable benchmark result.

This can be achieved by repeating a given benchmark more than once and calculating the average.

From a statistical perspective, there is a true but unknown underlying benchmark time. Each time we perform a benchmark and collect a benchmark time, we are estimating this true but unknown value. The problem is that each estimate has statistical noise added to it, for the reasons discussed above.

Therefore, to overcome the statistical noise, we can draw many samples of the benchmark time by repeating the benchmark many times.

Calculating the average of these samples will get us much closer to the true but unknown underlying benchmark time. The random statistical noise will largely cancel out, and we will be left with a much closer approximation of the true benchmark time.
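We can illustrate this idea with a quick simulation (not a real benchmark); the true time and the amount of noise below are made-up numbers, chosen only to show that the average of many noisy samples lands much closer to the true value than any single sample:

# simulate noisy measurements around a known 'true' benchmark time
import random
# the true but unknown underlying benchmark time (made-up for illustration)
true_time = 5.0
# draw 30 samples, each with random statistical noise added
samples = [true_time + random.gauss(0, 0.2) for _ in range(30)]
# a single sample may be off by a lot
print(f'single sample: {samples[0]:.3f}')
# the average of many samples is much closer to the true time
print(f'average of 30: {sum(samples)/len(samples):.3f}')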

What Is The Procedure?

The old procedure is as follows:

1. Record the start time.
2. Execute the target code.
3. Record the end time.
4. Report the difference between the end time and the start time as the benchmark score.

We know that this is not good enough, as we will report a different score every time the benchmark is performed.

Therefore, the better procedure for repeating a benchmark and reporting the average is as follows:

1. Repeat the benchmark many times, recording the duration of each run.
2. Store all of the recorded durations in a list.
3. Calculate the average of the collected durations.
4. Report the average as the benchmark score.

This will give a better estimate of the benchmark.

How Many Times to Repeat The Benchmark?

The result of the procedure will be a stable estimate of the execution time of the target code.

Stable means that we could repeat the same procedure on the same code and on the same computer system at a later date and get an estimate of the execution time that is close to the same value, much closer than two individual runs would be to each other.

Two average execution times will not be identical; there will always be small differences. This is the nature of statistical measurements. The difference accounts for statistical noise that we are unable to remove from the estimate. If there were no statistical noise, we would not need to estimate in the first place; one run and one measurement would be sufficient.

We can improve the estimate of the performance, which will reduce the difference between two runs of the procedure.

This can be achieved by increasing the number of samples, e.g. the number of repetitions of the benchmark.

How many repetitions are needed?

Going beyond 100 or 1,000 repetitions quickly reaches a point of diminishing returns, depending on the code that is being benchmarked.

Generally, the number of repetitions comes down to how long we can afford to wait.

Fewer repetitions are typically used with slower code, and more repetitions are used with faster code.

How to Calculate The Average Benchmark Score

The first step is to collect multiple individual benchmark scores.

These can be collected in a list.

For example, if we had a function named benchmark() that performed a single benchmark of the target code and returned a time in seconds, we could call this many times and store the results in a list.

...
# the number of times to repeat the benchmark
n_repeats = 30
# collect a list of benchmark scores
all_times = [benchmark() for _ in range(n_repeats)]

Or, we might use a for loop and report each individual score along the way to show progress.

I like this approach; it might look as follows:

...
# the number of times to repeat the benchmark
n_repeats = 30
# benchmark results
all_times = list()
# repeat the benchmark many times
for i in range(n_repeats):
    # calculate the duration
    time_duration = benchmark()
    # report the duration
    print(f'>trial {i} took {time_duration} seconds')
    # store the result
    all_times.append(time_duration)

The average is also referred to by the technical name "arithmetic mean" and is calculated as the sum of all times divided by the number of times that were collected.

In mathematics and statistics, the arithmetic mean, arithmetic average, or just the mean or average (when the context is clear) is the sum of a collection of numbers divided by the count of numbers in the collection.

-- Arithmetic mean, Wikipedia.

For example, if we collected n benchmark times t1, t2, ..., tn, the average would be (t1 + t2 + ... + tn) / n.

In code, we can use the built-in sum() function, for example:

...
# calculate the average duration
time_avg = sum(all_times) / n_repeats
# report the average time
print(f'Took {time_avg} seconds on average')

We can also calculate it using the statistics.mean() function.

For example:

...
# calculate the mean score
time_avg = statistics.mean(all_times)

I prefer to calculate the mean manually instead of using the statistics.mean() function, as doing so does not require any additional imports.

When we report the score, we must indicate that it is an average rather than a single estimate of performance.

For example:

...
Took 2.3 seconds on average

It might also be a good idea to report the standard deviation.

Recall that the mean is the expected value or middle of a normal (bell curve) distribution. The standard deviation summarizes the average spread of values around the mean in the same distribution.

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

-- Standard deviation, Wikipedia.

If we have both the mean and the standard deviation, we can characterize the distribution of benchmark times (assuming it is approximately normal). This is helpful when reporting results formally, such as in a report or research context. Someone familiar with statistics can then use statistical tools to compare distributions directly (e.g. benchmarks with and without a change) and report the statistical significance of the results (e.g. did the change really make a difference). This is out of the scope of this tutorial.

Don't calculate the standard deviation manually. There are a few ways to calculate it and you may use the wrong one (e.g. population vs sample standard deviation). Instead, we can use the statistics.stdev() function to calculate the standard deviation of the sample.

For example:

...
# calculate the standard deviation
time_stdv = statistics.stdev(all_times)

We can report the mean and standard deviation together:

...
Took 2.3 seconds on average (stdev=0.1)

For completeness, we can also report the number of benchmark repetitions that were sampled, reported as "n".

For example:

...
Took 2.3 seconds on average (stdev=0.1, n=10)
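Putting these pieces together, the reporting line might look as follows; this is a sketch that assumes all_times, time_avg, time_stdv, and n_repeats have been calculated as above:

...
# report the average, standard deviation, and number of repetitions
print(f'Took {time_avg:.1f} seconds on average (stdev={time_stdv:.1f}, n={n_repeats})')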

In most cases, reporting the average time is sufficient for making decisions ourselves.

I would recommend including the standard deviation (stdev) and the number of repetitions (n) only when reporting results formally, or when someone else also needs to look at the results to help make a decision.

Why Repeat Benchmarks Many Times?

Repeating a code benchmark many times and reporting the average is a common practice in performance analysis for several reasons: it cancels out much of the random noise in individual measurements, it produces an estimate that is reproducible from run to run, it allows the variability of the results to be quantified with the standard deviation, and it supports fair comparisons between alternative implementations.

Repeating a code benchmark multiple times and reporting the average is essential for obtaining accurate, reliable, and statistically meaningful performance measurements.

Now that we know how to calculate the average benchmark score, let's look at some worked examples.

Example of Variable Benchmark Results

Before we look at how to repeat a benchmark and calculate the average score, let's confirm that performing the same single benchmark many times results in a different score each time.

In this case, we will define a task that creates a list of squared integers. We will then benchmark this function and report the duration in seconds. We will record the time before and after the target code using the time.perf_counter() function.

Firstly, we can define a task() function that performs the CPU-intensive task of creating a list of 100 million squared integers in a list comprehension.

# function to benchmark
def task():
    # create a large list
    data = [i*i for i in range(100000000)]

Next, we define a benchmark() function that records the start time, calls the task() function, records the end time, and calculates and reports the overall duration in seconds.

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # report the duration
    print(f'Took {time_duration:.3f} seconds')

The benchmarking of the target task() function is performed in a standalone function named benchmark(). We will call this function multiple times, and a benchmark score will be reported each time.

...
# repeat the same benchmark a few times
benchmark()
benchmark()
benchmark()

Tying this together, the complete example is listed below.

# SuperFastPython.com
# example of variability in benchmark results
from time import perf_counter

# function to benchmark
def task():
    # create a large list
    data = [i*i for i in range(100000000)]

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # report the duration
    print(f'Took {time_duration:.3f} seconds')

# repeat the same benchmark a few times
benchmark()
benchmark()
benchmark()

Running the example benchmarks the same task() function three times.

In this case, we can see that each time the task() function is benchmarked, we get a different score.

The results are rounded to 3 decimal places to keep things simple. If we allowed full double floating-point precision and ran the same benchmark thousands of times, we would likely get thousands of different benchmark scores.

The reason for the difference is random statistical noise in the estimate of the benchmark time.

This highlights that we can expect to get a different benchmark result every time we execute a single benchmark.

Took 6.138 seconds
Took 5.920 seconds
Took 5.904 seconds

Next, let's explore how we can make the benchmark more stable by repeating it many times and reporting the average.

Example of Average Benchmark Results

We can explore how to repeat a benchmark many times and report the average result.

In this case, we can update the above example to perform the same benchmark many times, collect the results, and report the average.

Firstly, we can update the benchmark() function to return the duration rather than report it.

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # return the duration
    return time_duration

Next, we can define a new function repeat_benchmark() that defines the number of repeats and a list for storing all of the duration times. It then loops, calling the benchmark() function many times and storing each result. Finally, it calculates the average of all duration times and reports it.

# repeat a benchmark
def repeat_benchmark():
    # define the number of repeats
    n_repeats = 30
    # list for storing all results
    all_times = list()
    # repeat the benchmark many times
    for i in range(n_repeats):
        # benchmark and retrieve the duration
        time_duration = benchmark()
        # report the duration
        print(f'>trial {i} took {time_duration:.3f} seconds')
        # store the result
        all_times.append(time_duration)
    # calculate the average duration
    time_avg = sum(all_times) / n_repeats
    # report the average time
    print(f'Took {time_avg:.3f} seconds on average')

We can then call this function to perform the repeated benchmark of the task() function and calculate a stable estimate of the function's performance.

...
# execute a repeated benchmark
repeat_benchmark()

Tying this together, the complete example is listed below.

# SuperFastPython.com
# example of stable benchmark results
from time import perf_counter

# function to benchmark
def task():
    # create a large list
    data = [i*i for i in range(100000000)]

# benchmark the task() function
def benchmark():
    # record start time
    time_start = perf_counter()
    # execute the function
    task()
    # record end time
    time_end = perf_counter()
    # calculate the duration
    time_duration = time_end - time_start
    # return the duration
    return time_duration

# repeat a benchmark
def repeat_benchmark():
    # define the number of repeats
    n_repeats = 30
    # list for storing all results
    all_times = list()
    # repeat the benchmark many times
    for i in range(n_repeats):
        # benchmark and retrieve the duration
        time_duration = benchmark()
        # report the duration
        print(f'>trial {i} took {time_duration:.3f} seconds')
        # store the result
        all_times.append(time_duration)
    # calculate the average duration
    time_avg = sum(all_times) / n_repeats
    # report the average time
    print(f'Took {time_avg:.3f} seconds on average')

# execute a repeated benchmark
repeat_benchmark()

Running the example benchmarks the task() function 30 times and reports the average duration.

Helpfully, the estimated performance is reported in each iteration, showing how the estimated time of the task() function varies each time the benchmark() function is run.

In this case, the average duration is reported as 5.997 seconds.

>trial 0 took 5.999 seconds
>trial 1 took 6.250 seconds
>trial 2 took 6.115 seconds
>trial 3 took 6.057 seconds
>trial 4 took 6.132 seconds
>trial 5 took 6.155 seconds
>trial 6 took 6.106 seconds
>trial 7 took 6.119 seconds
>trial 8 took 5.938 seconds
>trial 9 took 5.949 seconds
>trial 10 took 5.935 seconds
>trial 11 took 5.912 seconds
>trial 12 took 6.102 seconds
>trial 13 took 6.015 seconds
>trial 14 took 5.980 seconds
>trial 15 took 5.905 seconds
>trial 16 took 5.899 seconds
>trial 17 took 5.916 seconds
>trial 18 took 5.912 seconds
>trial 19 took 5.990 seconds
>trial 20 took 5.966 seconds
>trial 21 took 5.933 seconds
>trial 22 took 6.043 seconds
>trial 23 took 6.008 seconds
>trial 24 took 5.919 seconds
>trial 25 took 5.897 seconds
>trial 26 took 5.931 seconds
>trial 27 took 5.919 seconds
>trial 28 took 5.953 seconds
>trial 29 took 5.954 seconds
Took 5.997 seconds on average

This result is more stable than a single run of the benchmark() function.

We can demonstrate this.

Re-run the entire program and note the result.

In this case, the result is 5.994 seconds. This is very close to the 5.997 seconds reported in the previous run, showing the stability of the procedure for the task() function.

We could achieve even more stability by increasing the number of repeats from 30 to 100 and perhaps increasing the precision of the average result from 3 decimal places to 6 to look for any differences.

...
>trial 25 took 5.965 seconds
>trial 26 took 5.951 seconds
>trial 27 took 5.966 seconds
>trial 28 took 5.947 seconds
>trial 29 took 6.036 seconds
Took 5.994 seconds on average
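Increasing the repetitions and precision, as suggested above, would only require updating two lines of the repeat_benchmark() function; a sketch of the updated lines is below:

...
# define the number of repeats (increased from 30 to 100)
n_repeats = 100
...
# report the average time with 6 decimal places of precision
print(f'Took {time_avg:.6f} seconds on average')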

Takeaways

You now know how to develop stable benchmark measures in Python.



If you enjoyed this tutorial, you will love my book: Python Benchmarking. It covers everything you need to master the topic with hands-on examples and clear explanations.