Benchmark Fastest Way To Calculate Sum of NumPy Arrays

November 18, 2023 Python Benchmarking

You can benchmark functions and algorithms to calculate the sum of NumPy arrays to discover the fastest approaches to use.

Generally, it is significantly faster to use the numpy.sum() module function and ndarray.sum() method over approaches that calculate the sum via multiplication. Further, using the sum() method may be slightly faster than the module function, although the difference is minor and may be due to random variation.

A surprising result is that it is significantly faster to use numpy.einsum() over the sum() functions for arrays of floating point values and some integer types. This is a valuable finding as the speedup can be significant, up to 3.071x in some cases.

In this tutorial, you will discover how to benchmark the fastest approach to calculate the sum of a NumPy array in Python.

Let's get started.

Need Fast Sum of NumPy Arrays

Calculating the sum of NumPy arrays in our Python program is a common operation.

It is perhaps one of the most commonly used summary statistics.

It seems straightforward, call sum() and we have the result.

But is this the fastest method?

In fact, there are many ways we can calculate the sum of NumPy arrays.

For example, some approaches include:

- the numpy.sum() module function
- the numpy.ndarray.sum() method
- the numpy.add.reduce() function
- the numpy.einsum() function
- multiplication by an array of ones via numpy.dot(), numpy.inner(), or numpy.matmul()

There are also more exotic methods we could devise.
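
Before benchmarking, a quick sketch can confirm that several of these approaches produce the same result on the same array (numpy.allclose() is used because the accumulation order, and therefore the floating point rounding, can differ slightly between approaches):

```python
# SuperFastPython.com
# confirm that several sum approaches agree on the same array
import numpy
# create a small array of random floats
rng = numpy.random.default_rng(1)
A = rng.random(1000)
# calculate the sum with several different approaches
results = [numpy.sum(A), A.sum(), numpy.add.reduce(A),
           numpy.einsum('i->', A), numpy.dot(A, numpy.ones_like(A))]
# all approaches should agree to within floating point tolerance
assert all(numpy.allclose(r, results[0]) for r in results)
print('all approaches agree')
```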

Nevertheless, if we need to calculate the sum of a NumPy array frequently in our program, such as within a loop, what is the fastest way we should use?

What is the fastest way to calculate the sum of a NumPy array?

Benchmark Sum of NumPy Arrays

We can explore the question of how fast the different approaches to calculate the sum of NumPy arrays are using benchmarking.

In this case, we will use an approach to calculate the sum of an array of random numbers of a modest fixed size, then repeat this process many times to give an estimated time. We can then compare the times to see the relative performance of the approaches tested.

You can use this approach to benchmark your own favorite NumPy array operations.

If you use or extend the NumPy benchmarking approach used in this tutorial, let me know in the comments below. I'd love to see what you come up with.

We could use the time.perf_counter() function directly and develop a helper function to perform the benchmarking and report results.
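
For example, a minimal helper might look like the sketch below (the benchmark() helper name and interface are illustrative, not part of this tutorial's code):

```python
# SuperFastPython.com
# example of a simple benchmark helper using time.perf_counter()
import time

# benchmark a function by running it many times and timing the total
def benchmark(name, func, n=1000):
    # record the start time
    start = time.perf_counter()
    # run the target function many times
    for _ in range(n):
        func()
    # calculate and report the total duration
    duration = time.perf_counter() - start
    print(f'{name} {duration:.3f} seconds')
    return duration

# example usage: time summing a list of numbers
data = list(range(1000))
benchmark('sum()', lambda: sum(data))
```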

You can learn more about benchmarking with the time.perf_counter() function in a separate tutorial.

Instead, in this case, we will use the timeit API, specifically the timeit.timeit() function and specify the string of sum calculation code to run and a fixed number of times to run it.

We will also provide the "globals" argument for any constants defined in our benchmark code, such as the defined array to operate upon.

For example:

...
# benchmark a thing
result = timeit.timeit('...', globals=globals(), number=N)
print(f'approach {result:.3f} seconds')

You can learn more about benchmarking with the timeit.timeit() function in a separate tutorial.

The number of runs in each benchmark was tuned to ensure that each snippet was executed in more than one second and less than about 10 seconds.
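
As an aside, if you do not want to tune the run count by hand, the timeit.Timer.autorange() method can choose it automatically: it increases the number of runs until the total time is at least 0.2 seconds and returns both the count and the time taken. For example:

```python
# SuperFastPython.com
# example of choosing the number of runs automatically with autorange
import timeit
import numpy
# define the data used in the benchmark
rng = numpy.random.default_rng(1)
A = rng.random(500000)
# let timeit pick a run count that takes at least 0.2 seconds in total
number, total = timeit.Timer('numpy.sum(A)', globals=globals()).autorange()
print(f'numpy.sum() {number} runs in {total:.3f} seconds')
```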

Let's get started.

Fastest Way to Calculate Sum of 1D Float NumPy Arrays

We can explore the fastest way to calculate the sum of a modestly sized NumPy array of random floating point values in [0,1).

In this case, we will create a fixed-size 1D array with half a million elements (500,000) of random floats with the default data type, float64 on most platforms. Each approach will be used to calculate the sum of the array 100,000 times.

Generally, we may expect the numpy.sum() function to be the fastest.

The complete example is listed below.

# SuperFastPython.com
# benchmark calculating the sum of a 1d array of floats
import numpy
import timeit
# define the data used in all runs
rng = numpy.random.default_rng(1)
A = rng.random(500000)
# number of times to run each snippet
N = 100000
# numpy.sum()
result = timeit.timeit('numpy.sum(A)', globals=globals(), number=N)
print(f'numpy.sum() {result:.3f} seconds')
# numpy.sum(axis=0)
result = timeit.timeit('numpy.sum(A, axis=0)', globals=globals(), number=N)
print(f'numpy.sum(axis=0) {result:.3f} seconds')
# numpy.ndarray.sum()
result = timeit.timeit('A.sum()', globals=globals(), number=N)
print(f'numpy.ndarray.sum() {result:.3f} seconds')
# numpy.ndarray.sum(axis=0)
result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
print(f'numpy.ndarray.sum(axis=0) {result:.3f} seconds')
# numpy.add.reduce()
result = timeit.timeit('numpy.add.reduce(A)', globals=globals(), number=N)
print(f'numpy.add.reduce() {result:.3f} seconds')
# numpy.einsum('i->')
result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
print(f'numpy.einsum(i->) {result:.3f} seconds')
# numpy.inner(numpy.ones_like())
result = timeit.timeit('numpy.inner(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.inner(numpy.ones_like()) {result:.3f} seconds')
# numpy.dot(numpy.ones_like())
result = timeit.timeit('numpy.dot(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.dot(numpy.ones_like()) {result:.3f} seconds')
# numpy.matmul(numpy.ones_like())
result = timeit.timeit('numpy.matmul(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.matmul(numpy.ones_like()) {result:.3f} seconds')

Running the example benchmarks each approach and reports the overall execution time.

numpy.sum() 14.688 seconds
numpy.sum(axis=0) 14.320 seconds
numpy.ndarray.sum() 14.311 seconds
numpy.ndarray.sum(axis=0) 14.172 seconds
numpy.add.reduce() 13.749 seconds
numpy.einsum(i->) 8.731 seconds
numpy.inner(numpy.ones_like()) 76.145 seconds
numpy.dot(numpy.ones_like()) 77.916 seconds
numpy.matmul(numpy.ones_like()) 80.903 seconds

We can restructure the output into a table for comparison.

Approach                        | Time (sec)
--------------------------------|------------
numpy.sum()                     | 14.688
numpy.sum(axis=0)               | 14.320
numpy.ndarray.sum()             | 14.311
numpy.ndarray.sum(axis=0)       | 14.172
numpy.add.reduce()              | 13.749
numpy.einsum(i->)               | 8.731
numpy.inner(numpy.ones_like())  | 76.145
numpy.dot(numpy.ones_like())    | 77.916
numpy.matmul(numpy.ones_like()) | 80.903

The results show that numpy.sum() and numpy.ndarray.sum() have similar performance.

The results are all very close, and I suspect any differences are likely due to random fluctuation.

Not surprisingly, the approaches that use multiplication, such as numpy.inner(), numpy.dot(), and numpy.matmul(), are all significantly slower than the sum() functions. These approaches require creating and initializing a new array of ones before performing the operation.

One difference is the numpy.einsum() function shows a result that is significantly faster than the sum() functions and the multiplication functions.

In fact, in this case, it took 8.731 seconds compared to about 14 seconds for the sum() functions, a speedup factor of about 1.682x.
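
The speedup factor here is simply the ratio of the two measured times:

```python
# SuperFastPython.com
# calculate the speedup of numpy.einsum() over numpy.sum()
t_sum = 14.688
t_einsum = 8.731
# speedup is the slower time divided by the faster time
speedup = t_sum / t_einsum
print(f'{speedup:.3f}x')  # 1.682x
```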

Re-running the benchmark, we see a similar pattern. The sum() functions all have similar results, the multiplication functions are significantly slower, and numpy.einsum() is significantly faster.

numpy.sum() 14.040 seconds
numpy.sum(axis=0) 13.775 seconds
numpy.ndarray.sum() 13.603 seconds
numpy.ndarray.sum(axis=0) 13.656 seconds
numpy.add.reduce() 15.289 seconds
numpy.einsum(i->) 8.760 seconds
numpy.inner(numpy.ones_like()) 75.926 seconds
numpy.dot(numpy.ones_like()) 76.079 seconds

Next, let's see if the results hold for different float data types.

Float Data Types Matter When Summing

We can explore whether the pattern of results holds with different float data types.

Specifically, we are interested in whether it is faster to use numpy.einsum() over numpy.sum() and similar for both arrays of float32 and float64 types.

In this case, we can test both functions with arrays of random floats of both types.

The complete example is listed below.

# SuperFastPython.com
# benchmark summing arrays of random numbers with different float data types
import numpy
import timeit
# number of times to run each snippet
N = 100000
# list of types to compare
types = ['float32', 'float64']
# benchmark each data type
for t in types:
    # create data to sum
    rng = numpy.random.default_rng(1)
    A = rng.random(500000, dtype=t)
    # numpy.ndarray.sum(axis=0)
    result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
    print(f'numpy.ndarray.sum(axis=0) {t} {result:.3f} seconds')
    # numpy.einsum('i->')
    result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
    print(f'numpy.einsum(i->) {t} {result:.3f} seconds')

Running the example benchmarks each approach and reports the overall execution time.

numpy.ndarray.sum(axis=0) float32 13.390 seconds
numpy.einsum(i->) float32 4.360 seconds
numpy.ndarray.sum(axis=0) float64 13.370 seconds
numpy.einsum(i->) float64 8.222 seconds

We can restructure the output into a table for comparison.

Approach                  | Type    | Time (sec)
--------------------------|---------|------------
numpy.ndarray.sum(axis=0) | float32 | 13.390
numpy.einsum(i->)         | float32 | 4.360
numpy.ndarray.sum(axis=0) | float64 | 13.370
numpy.einsum(i->)         | float64 | 8.222

The results show that it is indeed faster to use numpy.einsum() over the sum() function for arrays of float32 and float64 types.

We can also see a larger benefit for the smaller float32 type. Specifically, numpy.einsum() is about 3.071x faster than sum() for float32 and 1.626x faster when summing float64.

Repeating the benchmark shows the same general pattern of results.

numpy.ndarray.sum(axis=0) float32 13.631 seconds
numpy.einsum(i->) float32 4.416 seconds
numpy.ndarray.sum(axis=0) float64 13.966 seconds
numpy.einsum(i->) float64 8.392 seconds

This is a valuable finding.

Next, let's explore a similar benchmark with integer values.

Fastest Way to Calculate Sum of 1D Integer NumPy Arrays

We can explore whether the same findings from the previous benchmark hold for arrays of integer values.

In this case, we will update the example to generate an array of random integer values between 0 and 100 (inclusive) and then calculate the sum of each using the same approaches.

The complete example is listed below.

# SuperFastPython.com
# benchmark calculating the sum of a 1d array of integers
import numpy
import timeit
# define the data used in all runs
rng = numpy.random.default_rng(1)
A = rng.integers(0, 100+1, 500000)
# number of times to run each snippet
N = 100000
# numpy.sum()
result = timeit.timeit('numpy.sum(A)', globals=globals(), number=N)
print(f'numpy.sum() {result:.3f} seconds')
# numpy.sum(axis=0)
result = timeit.timeit('numpy.sum(A, axis=0)', globals=globals(), number=N)
print(f'numpy.sum(axis=0) {result:.3f} seconds')
# numpy.ndarray.sum()
result = timeit.timeit('A.sum()', globals=globals(), number=N)
print(f'numpy.ndarray.sum() {result:.3f} seconds')
# numpy.ndarray.sum(axis=0)
result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
print(f'numpy.ndarray.sum(axis=0) {result:.3f} seconds')
# numpy.add.reduce()
result = timeit.timeit('numpy.add.reduce(A)', globals=globals(), number=N)
print(f'numpy.add.reduce() {result:.3f} seconds')
# numpy.einsum('i->')
result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
print(f'numpy.einsum(i->) {result:.3f} seconds')
# numpy.inner(numpy.ones_like())
result = timeit.timeit('numpy.inner(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.inner(numpy.ones_like()) {result:.3f} seconds')
# numpy.dot(numpy.ones_like())
result = timeit.timeit('numpy.dot(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.dot(numpy.ones_like()) {result:.3f} seconds')
# numpy.matmul(numpy.ones_like())
result = timeit.timeit('numpy.matmul(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.matmul(numpy.ones_like()) {result:.3f} seconds')

Running the example benchmarks each approach and reports the overall execution time.

numpy.sum() 7.351 seconds
numpy.sum(axis=0) 7.320 seconds
numpy.ndarray.sum() 7.173 seconds
numpy.ndarray.sum(axis=0) 6.605 seconds
numpy.add.reduce() 6.649 seconds
numpy.einsum(i->) 12.219 seconds
numpy.inner(numpy.ones_like()) 69.057 seconds
numpy.dot(numpy.ones_like()) 63.762 seconds
numpy.matmul(numpy.ones_like()) 66.072 seconds

We can restructure the output into a table for comparison.

Approach                        | Time (sec)
--------------------------------|------------
numpy.sum()                     | 7.351
numpy.sum(axis=0)               | 7.320
numpy.ndarray.sum()             | 7.173
numpy.ndarray.sum(axis=0)       | 6.605
numpy.add.reduce()              | 6.649
numpy.einsum(i->)               | 12.219
numpy.inner(numpy.ones_like())  | 69.057
numpy.dot(numpy.ones_like())    | 63.762
numpy.matmul(numpy.ones_like()) | 66.072

We see a similar pattern in these results as we did for summing arrays of floating point values.

The results show that all of the sum functions have a similar result and that the multiplication approaches are significantly slower.

Interestingly, we can see that the numpy.einsum() function does not have better performance than the sum functions for the default integer type. This may suggest that it is preferable for some data types and not others, e.g. floating point types.

Re-running the benchmark, we see a similar pattern of performance.

It may be the case that using the sum() method on the ndarray is slightly faster than using the module function, although the small difference may be due to natural variance in the results.

numpy.sum() 6.907 seconds
numpy.sum(axis=0) 6.564 seconds
numpy.ndarray.sum() 6.394 seconds
numpy.ndarray.sum(axis=0) 6.640 seconds
numpy.add.reduce() 6.511 seconds
numpy.einsum(i->) 12.833 seconds
numpy.inner(numpy.ones_like()) 71.127 seconds

Next, let's see if the results hold for different integer data types.

Integer Data Types Matter When Summing

We can explore the effect of summing different integer data types.

Specifically, we are interested to see if numpy.einsum() can offer a benefit over the sum() function with any integer type, besides the default int64.

In this case, we can test both functions with arrays of random integers with a range of integer types from 8 to 64 bits, both signed and unsigned.

The complete example is listed below.

# SuperFastPython.com
# benchmark summing arrays of random numbers with different int data types
import numpy
import timeit
# number of times to run each snippet
N = 100000
# list of types to compare
types = ['int8', 'uint8', 'int16', 'uint16', 'int32', 'uint32', 'int64', 'uint64']
# benchmark each data type
for t in types:
    # create data to sum
    rng = numpy.random.default_rng(1)
    A = rng.integers(0, 100+1, 500000, dtype=t)
    # numpy.ndarray.sum(axis=0)
    result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
    print(f'numpy.ndarray.sum(axis=0) {t} {result:.3f} seconds')
    # numpy.einsum('i->')
    result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
    print(f'numpy.einsum(i->) {t} {result:.3f} seconds')

Running the example benchmarks each approach and reports the overall execution time.

numpy.ndarray.sum(axis=0) int8 13.070 seconds
numpy.einsum(i->) int8 5.562 seconds
numpy.ndarray.sum(axis=0) uint8 13.001 seconds
numpy.einsum(i->) uint8 5.566 seconds
numpy.ndarray.sum(axis=0) int16 12.704 seconds
numpy.einsum(i->) int16 13.356 seconds
numpy.ndarray.sum(axis=0) uint16 12.725 seconds
numpy.einsum(i->) uint16 13.397 seconds
numpy.ndarray.sum(axis=0) int32 13.067 seconds
numpy.einsum(i->) int32 9.942 seconds
numpy.ndarray.sum(axis=0) uint32 13.059 seconds
numpy.einsum(i->) uint32 10.212 seconds
numpy.ndarray.sum(axis=0) int64 6.155 seconds
numpy.einsum(i->) int64 11.444 seconds
numpy.ndarray.sum(axis=0) uint64 6.167 seconds
numpy.einsum(i->) uint64 12.241 seconds

We can restructure the output into a table for comparison.

Approach                  | Type    | Time (sec)
--------------------------|---------|------------
numpy.ndarray.sum(axis=0) | int8    | 13.070
numpy.einsum(i->)         | int8    | 5.562
numpy.ndarray.sum(axis=0) | uint8   | 13.001
numpy.einsum(i->)         | uint8   | 5.566
numpy.ndarray.sum(axis=0) | int16   | 12.704
numpy.einsum(i->)         | int16   | 13.356
numpy.ndarray.sum(axis=0) | uint16  | 12.725
numpy.einsum(i->)         | uint16  | 13.397
numpy.ndarray.sum(axis=0) | int32   | 13.067
numpy.einsum(i->)         | int32   | 9.942
numpy.ndarray.sum(axis=0) | uint32  | 13.059
numpy.einsum(i->)         | uint32  | 10.212
numpy.ndarray.sum(axis=0) | int64   | 6.155
numpy.einsum(i->)         | int64   | 11.444
numpy.ndarray.sum(axis=0) | uint64  | 6.167
numpy.einsum(i->)         | uint64  | 12.241

The results show that it is indeed faster to use numpy.einsum() over the sum() function for arrays of integers of many types.

Specifically, we see a benefit in using numpy.einsum() for the types:

- int8 and uint8
- int32 and uint32

The benefit is more significant with int8, showing a 2.349x speedup, whereas int32 shows a 1.314x speedup.

We see a benefit in using sum() for int64 and uint64, and perhaps parity, or a slight benefit for sum(), with int16 and uint16.

The results are very similar for signed and unsigned types.
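
A likely explanation, and an important caveat when using numpy.einsum() for integer sums, is the accumulator data type. numpy.sum() upcasts integer types smaller than the platform default integer before summing, whereas numpy.einsum() keeps the array's own dtype by default, which may be faster for narrow types but can silently wrap around on overflow. A small sketch demonstrates the difference:

```python
# SuperFastPython.com
# demonstrate the different result types of sum() and einsum()
import numpy
# two int8 values whose true sum (200) exceeds the int8 maximum (127)
A = numpy.array([100, 100], dtype=numpy.int8)
# sum() upcasts to the default integer and gives the correct total
print(A.sum())  # 200
# einsum() keeps int8 and the result wraps around: 200 - 256 = -56
print(numpy.einsum('i->', A))  # -56
```

This means that for narrow integer types, the numpy.einsum() speedup comes with a risk of incorrect results unless you know the total fits in the array's data type.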

Repeating the benchmark shows the same general pattern of results.

numpy.ndarray.sum(axis=0) int8 13.680 seconds
numpy.einsum(i->) int8 5.762 seconds
numpy.ndarray.sum(axis=0) uint8 13.546 seconds
numpy.einsum(i->) uint8 5.636 seconds
numpy.ndarray.sum(axis=0) int16 12.958 seconds
numpy.einsum(i->) int16 13.500 seconds
numpy.ndarray.sum(axis=0) uint16 12.874 seconds
numpy.einsum(i->) uint16 13.693 seconds
numpy.ndarray.sum(axis=0) int32 13.542 seconds
numpy.einsum(i->) int32 10.149 seconds
numpy.ndarray.sum(axis=0) uint32 13.358 seconds
numpy.einsum(i->) uint32 10.551 seconds
numpy.ndarray.sum(axis=0) int64 6.236 seconds
numpy.einsum(i->) int64 11.559 seconds
numpy.ndarray.sum(axis=0) uint64 6.191 seconds
numpy.einsum(i->) uint64 11.991 seconds

This is also a valuable finding.

Takeaways

You now know how to benchmark the fastest approach to calculate the sum of a NumPy array in Python.



If you enjoyed this tutorial, you will love my book: Python Benchmarking. It covers everything you need to master the topic with hands-on examples and clear explanations.