You can benchmark functions and algorithms to calculate the sum of NumPy arrays to discover the fastest approaches to use.
Generally, it is significantly faster to use the numpy.sum() module function and the ndarray.sum() method than approaches that calculate the sum via multiplication. Further, the sum() method may be slightly faster than the module function, although the difference is minor and may be due to random variation.
A surprising result is that numpy.einsum() is significantly faster than the sum() functions for arrays of floating point values and some integer types. This is a valuable finding, as the speedup can be significant, up to 3.071x in some cases.
In this tutorial, you will discover how to benchmark the fastest approach to calculate the sum of a NumPy array in Python.
Let’s get started.
Need Fast Sum of NumPy Arrays
Calculating the sum of NumPy arrays in our Python program is a common operation.
It is perhaps one of the most commonly used summary statistics.
It seems straightforward: call sum() and we have the result.
But is this the fastest method?
In fact, there are many ways we can calculate the sum of NumPy arrays.
For example, some approaches include:
- numpy.sum()
- numpy.sum(axis=0)
- numpy.ndarray.sum()
- numpy.ndarray.sum(axis=0)
- numpy.add.reduce()
- numpy.einsum('i->')
- numpy.inner(numpy.ones_like())
- numpy.dot(numpy.ones_like())
- numpy.matmul(numpy.ones_like())
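Before benchmarking, it is worth confirming that these approaches all compute the same value. A quick sanity check on a small array (note that the multiplication-based approaches sum by taking a product with an array of ones):

```python
# sanity check: confirm the candidate sum approaches all agree
import numpy

# small array of random floats
rng = numpy.random.default_rng(1)
A = rng.random(1000)

# compute the sum with each candidate approach
results = [
    numpy.sum(A),
    numpy.sum(A, axis=0),
    A.sum(),
    A.sum(axis=0),
    numpy.add.reduce(A),
    numpy.einsum('i->', A),
    numpy.inner(A, numpy.ones_like(A)),
    numpy.dot(A, numpy.ones_like(A)),
    numpy.matmul(A, numpy.ones_like(A)),
]
# all approaches should agree to within floating point tolerance
assert all(numpy.isclose(r, results[0]) for r in results)
print('all approaches agree')
```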
There are also more exotic methods we could devise.
Nevertheless, if we need to calculate the sum of a NumPy array frequently in our program, such as within a loop, what is the fastest way we should use?
What is the fastest way to calculate the sum of a NumPy array?
Benchmark Sum of NumPy Arrays
We can explore the question of how fast the different approaches to calculate the sum of NumPy arrays are using benchmarking.
In this case, we will use an approach to calculate the sum of an array of random numbers of a modest fixed size, then repeat this process many times to give an estimated time. We can then compare the times to see the relative performance of the approaches tested.
You can use this approach to benchmark your own favorite NumPy array operations.
If you use or extend the NumPy benchmarking approach used in this tutorial, let me know in the comments below. I’d love to see what you come up with.
We could use the time.perf_counter() function directly and develop a helper function to perform the benchmarking and report results.
You can learn more about benchmarking with the time.perf_counter() function in the tutorial:
Instead, in this case, we will use the timeit API, specifically the timeit.timeit() function and specify the string of sum calculation code to run and a fixed number of times to run it.
We will also provide the “globals” argument for any constants defined in our benchmark code, such as the defined array to operate upon.
For example:
...
# benchmark a thing
result = timeit.timeit('...', globals=globals(), number=N)
print(f'approach {result:.3f} seconds')
You can learn more about benchmarking with the timeit.timeit() function in the tutorial:
The number of runs in each benchmark was tuned so that each snippet took more than one second and less than about ten seconds in total.
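If you prefer not to tune the run count by hand, the timeit.Timer.autorange() method can estimate one for you. Below is a sketch that scales the estimated count toward a target of roughly five seconds; the five-second target is my own choice for illustration, not a timeit default.

```python
# estimate a suitable run count for a benchmark snippet
import numpy
import timeit

# the data operated on by the benchmarked snippet
rng = numpy.random.default_rng(1)
A = rng.random(500000)

timer = timeit.Timer('numpy.sum(A)', globals=globals())
# autorange() finds a run count where total time is at least 0.2 seconds
number, elapsed = timer.autorange()
# scale the count so the full benchmark takes roughly five seconds
target = max(1, round(number * 5.0 / elapsed))
print(f'{number} runs took {elapsed:.3f}s, use about {target} runs')
```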
Let’s get started.
Fastest Way to Calculate Sum of 1D Float NumPy Arrays
We can explore the fastest way to calculate the sum of a modestly sized NumPy array of random floating point values in [0,1).
In this case, we will create a fixed-size 1d array with half a million elements (500,000) of random floats with the default data type, which is float64 on most platforms. Each approach will be used to calculate the sum of the array 100,000 times.
Generally, we may expect the numpy.sum() function to be the fastest.
The complete example is listed below.
# SuperFastPython.com
# benchmark calculating the sum of a 1d array of floats
import numpy
import timeit
# define the data used in all runs
rng = numpy.random.default_rng(1)
A = rng.random(500000)
# number of times to run each snippet
N = 100000
# numpy.sum()
result = timeit.timeit('numpy.sum(A)', globals=globals(), number=N)
print(f'numpy.sum() {result:.3f} seconds')
# numpy.sum(axis=0)
result = timeit.timeit('numpy.sum(A, axis=0)', globals=globals(), number=N)
print(f'numpy.sum(axis=0) {result:.3f} seconds')
# numpy.ndarray.sum()
result = timeit.timeit('A.sum()', globals=globals(), number=N)
print(f'numpy.ndarray.sum() {result:.3f} seconds')
# numpy.ndarray.sum(axis=0)
result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
print(f'numpy.ndarray.sum(axis=0) {result:.3f} seconds')
# numpy.add.reduce()
result = timeit.timeit('numpy.add.reduce(A)', globals=globals(), number=N)
print(f'numpy.add.reduce() {result:.3f} seconds')
# numpy.einsum('i->')
result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
print(f'numpy.einsum(i->) {result:.3f} seconds')
# numpy.inner(numpy.ones_like())
result = timeit.timeit('numpy.inner(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.inner(numpy.ones_like()) {result:.3f} seconds')
# numpy.dot(numpy.ones_like())
result = timeit.timeit('numpy.dot(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.dot(numpy.ones_like()) {result:.3f} seconds')
# numpy.matmul(numpy.ones_like())
result = timeit.timeit('numpy.matmul(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.matmul(numpy.ones_like()) {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.sum() 14.688 seconds
numpy.sum(axis=0) 14.320 seconds
numpy.ndarray.sum() 14.311 seconds
numpy.ndarray.sum(axis=0) 14.172 seconds
numpy.add.reduce() 13.749 seconds
numpy.einsum(i->) 8.731 seconds
numpy.inner(numpy.ones_like()) 76.145 seconds
numpy.dot(numpy.ones_like()) 77.916 seconds
numpy.matmul(numpy.ones_like()) 80.903 seconds
We can restructure the output into a table for comparison.
Approach                        | Time (sec)
--------------------------------|------------
numpy.sum()                     | 14.688
numpy.sum(axis=0)               | 14.320
numpy.ndarray.sum()             | 14.311
numpy.ndarray.sum(axis=0)       | 14.172
numpy.add.reduce()              | 13.749
numpy.einsum(i->)               | 8.731
numpy.inner(numpy.ones_like())  | 76.145
numpy.dot(numpy.ones_like())    | 77.916
numpy.matmul(numpy.ones_like()) | 80.903
The results show that numpy.sum() and numpy.ndarray.sum() have similar performance.
The results are all very close and I suspect any differences are likely due to random fluctuations.
Not surprisingly, the approaches that use multiplication, such as numpy.inner(), numpy.dot(), and numpy.matmul(), are all significantly slower than the sum() functions. These approaches must create and initialize a new array of ones before performing the operation.
One clear difference is that the numpy.einsum() function is significantly faster than both the sum() functions and the multiplication functions.
In fact, in this case, it took 8.731 seconds compared to about 14 seconds for the sum() functions, a speedup factor of about 1.682x.
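Based on this result, if float sums are a hot spot in your program, one option is to wrap the choice in a small helper. The fast_sum() function below is purely hypothetical (it is not part of NumPy), a sketch under the assumption that 1d float arrays benefit from einsum:

```python
import numpy

def fast_sum(a):
    # hypothetical helper: einsum was faster for 1d float arrays in
    # the benchmark above; fall back to ndarray.sum() for other cases
    if a.ndim == 1 and a.dtype.kind == 'f':
        return numpy.einsum('i->', a)
    return a.sum()

# confirm it matches the built-in sum within floating point tolerance
rng = numpy.random.default_rng(1)
A = rng.random(1000)
assert numpy.isclose(fast_sum(A), A.sum())
print('fast_sum matches A.sum()')
```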
Re-running the benchmark, we see a similar pattern. The sum() functions all have similar results, the multiplication functions are significantly slower, and numpy.einsum() is significantly faster.
numpy.sum() 14.040 seconds
numpy.sum(axis=0) 13.775 seconds
numpy.ndarray.sum() 13.603 seconds
numpy.ndarray.sum(axis=0) 13.656 seconds
numpy.add.reduce() 15.289 seconds
numpy.einsum(i->) 8.760 seconds
numpy.inner(numpy.ones_like()) 75.926 seconds
numpy.dot(numpy.ones_like()) 76.079 seconds
Next, let’s see if the results hold for different float data types.
Float Data Types Matter When Summing
We can explore whether the pattern of results holds with different float data types.
Specifically, we are interested in whether numpy.einsum() is faster than numpy.sum() and related functions for arrays of both float32 and float64 types.
In this case, we can test both functions with arrays of random floats of both types.
The complete example is listed below.
# SuperFastPython.com
# benchmark summing arrays of random numbers with different float data types
import numpy
import timeit
# number of times to run each snippet
N = 100000
# list of types to compare
types = ['float32', 'float64']
# benchmark each data type
for t in types:
    # create data to sum
    rng = numpy.random.default_rng(1)
    A = rng.random(500000, dtype=t)
    # numpy.ndarray.sum(axis=0)
    result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
    print(f'numpy.ndarray.sum(axis=0) {t} {result:.3f} seconds')
    # numpy.einsum('i->')
    result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
    print(f'numpy.einsum(i->) {t} {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.ndarray.sum(axis=0) float32 13.390 seconds
numpy.einsum(i->) float32 4.360 seconds
numpy.ndarray.sum(axis=0) float64 13.370 seconds
numpy.einsum(i->) float64 8.222 seconds
We can restructure the output into a table for comparison.
Approach                  | Type    | Time (sec)
--------------------------|---------|------------
numpy.ndarray.sum(axis=0) | float32 | 13.390
numpy.einsum(i->)         | float32 | 4.360
numpy.ndarray.sum(axis=0) | float64 | 13.370
numpy.einsum(i->)         | float64 | 8.222
The results show that it is indeed faster to use numpy.einsum() over the sum() function for arrays of float32 and float64 types.
We can also see a larger benefit for the smaller float32 type. Specifically, numpy.einsum() is about 3.071x faster than sum() for float32 and 1.626x faster when summing float64.
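One caveat worth noting: numpy.sum() is reported to use pairwise summation for float arrays while numpy.einsum() is not, so the two can differ slightly in accuracy on large float32 arrays. If accuracy matters, it may be worth comparing both against a float64 reference on your own data, for example:

```python
# compare float32 sum accuracy of ndarray.sum() and einsum
# against a higher-precision float64 reference
import numpy

rng = numpy.random.default_rng(1)
A = rng.random(500000, dtype='float32')

# reference sum computed at higher precision
reference = A.astype('float64').sum()

error_sum = abs(float(A.sum()) - reference)
error_einsum = abs(float(numpy.einsum('i->', A)) - reference)
print(f'sum() error: {error_sum:.6f}, einsum error: {error_einsum:.6f}')
```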
Repeating the benchmark shows the same general pattern of results.
numpy.ndarray.sum(axis=0) float32 13.631 seconds
numpy.einsum(i->) float32 4.416 seconds
numpy.ndarray.sum(axis=0) float64 13.966 seconds
numpy.einsum(i->) float64 8.392 seconds
This is a valuable finding.
Next, let’s explore a similar benchmark with integer values.
Fastest Way to Calculate Sum of 1D Integer NumPy Arrays
We can explore whether the same findings from the previous benchmark hold for arrays of integer values.
In this case, we will update the example to generate an array of random integer values between 0 and 100 (inclusive) and then calculate the sum of each using the same approaches.
The complete example is listed below.
# SuperFastPython.com
# benchmark calculating the sum of a 1d array of integers
import numpy
import timeit
# define the data used in all runs
rng = numpy.random.default_rng(1)
A = rng.integers(0, 100+1, 500000)
# number of times to run each snippet
N = 100000
# numpy.sum()
result = timeit.timeit('numpy.sum(A)', globals=globals(), number=N)
print(f'numpy.sum() {result:.3f} seconds')
# numpy.sum(axis=0)
result = timeit.timeit('numpy.sum(A, axis=0)', globals=globals(), number=N)
print(f'numpy.sum(axis=0) {result:.3f} seconds')
# numpy.ndarray.sum()
result = timeit.timeit('A.sum()', globals=globals(), number=N)
print(f'numpy.ndarray.sum() {result:.3f} seconds')
# numpy.ndarray.sum(axis=0)
result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
print(f'numpy.ndarray.sum(axis=0) {result:.3f} seconds')
# numpy.add.reduce()
result = timeit.timeit('numpy.add.reduce(A)', globals=globals(), number=N)
print(f'numpy.add.reduce() {result:.3f} seconds')
# numpy.einsum('i->')
result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
print(f'numpy.einsum(i->) {result:.3f} seconds')
# numpy.inner(numpy.ones_like())
result = timeit.timeit('numpy.inner(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.inner(numpy.ones_like()) {result:.3f} seconds')
# numpy.dot(numpy.ones_like())
result = timeit.timeit('numpy.dot(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.dot(numpy.ones_like()) {result:.3f} seconds')
# numpy.matmul(numpy.ones_like())
result = timeit.timeit('numpy.matmul(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.matmul(numpy.ones_like()) {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.sum() 7.351 seconds
numpy.sum(axis=0) 7.320 seconds
numpy.ndarray.sum() 7.173 seconds
numpy.ndarray.sum(axis=0) 6.605 seconds
numpy.add.reduce() 6.649 seconds
numpy.einsum(i->) 12.219 seconds
numpy.inner(numpy.ones_like()) 69.057 seconds
numpy.dot(numpy.ones_like()) 63.762 seconds
numpy.matmul(numpy.ones_like()) 66.072 seconds
We can restructure the output into a table for comparison.
Approach                        | Time (sec)
--------------------------------|------------
numpy.sum()                     | 7.351
numpy.sum(axis=0)               | 7.320
numpy.ndarray.sum()             | 7.173
numpy.ndarray.sum(axis=0)       | 6.605
numpy.add.reduce()              | 6.649
numpy.einsum(i->)               | 12.219
numpy.inner(numpy.ones_like())  | 69.057
numpy.dot(numpy.ones_like())    | 63.762
numpy.matmul(numpy.ones_like()) | 66.072
We see a similar pattern in these results as we did for summing arrays of floating point values.
The results show that all of the sum functions have similar results and that the multiplication approaches are significantly slower.
Interestingly, we can see that the numpy.einsum() function does not perform better than the sum functions here. This suggests that it is better suited to some data types than others, e.g. floating point types.
Re-running the benchmark, we see a similar pattern of performance.
It may be that using the sum() method on the ndarray is slightly faster than using the module function, although the small difference may be due to natural variance in the results.
numpy.sum() 6.907 seconds
numpy.sum(axis=0) 6.564 seconds
numpy.ndarray.sum() 6.394 seconds
numpy.ndarray.sum(axis=0) 6.640 seconds
numpy.add.reduce() 6.511 seconds
numpy.einsum(i->) 12.833 seconds
numpy.inner(numpy.ones_like()) 71.127 seconds
Next, let’s see if the results hold for different integer data types.
Integer Data Types Matter When Summing
We can explore the effect of summing different integer data types.
Specifically, we are interested to see if numpy.einsum() can offer a benefit over the sum() function with any integer type, besides the default int64.
In this case, we can test both functions with arrays of random integers with a range of integer types from 8 to 64 bits, both signed and unsigned.
The complete example is listed below.
# SuperFastPython.com
# benchmark summing arrays of random numbers with different int data types
import numpy
import timeit
# number of times to run each snippet
N = 100000
# list of types to compare
types = ['int8', 'uint8', 'int16', 'uint16', 'int32', 'uint32', 'int64', 'uint64']
# benchmark each data type
for t in types:
    # create data to sum
    rng = numpy.random.default_rng(1)
    A = rng.integers(0, 100+1, 500000, dtype=t)
    # numpy.ndarray.sum(axis=0)
    result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
    print(f'numpy.ndarray.sum(axis=0) {t} {result:.3f} seconds')
    # numpy.einsum('i->')
    result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
    print(f'numpy.einsum(i->) {t} {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.ndarray.sum(axis=0) int8 13.070 seconds
numpy.einsum(i->) int8 5.562 seconds
numpy.ndarray.sum(axis=0) uint8 13.001 seconds
numpy.einsum(i->) uint8 5.566 seconds
numpy.ndarray.sum(axis=0) int16 12.704 seconds
numpy.einsum(i->) int16 13.356 seconds
numpy.ndarray.sum(axis=0) uint16 12.725 seconds
numpy.einsum(i->) uint16 13.397 seconds
numpy.ndarray.sum(axis=0) int32 13.067 seconds
numpy.einsum(i->) int32 9.942 seconds
numpy.ndarray.sum(axis=0) uint32 13.059 seconds
numpy.einsum(i->) uint32 10.212 seconds
numpy.ndarray.sum(axis=0) int64 6.155 seconds
numpy.einsum(i->) int64 11.444 seconds
numpy.ndarray.sum(axis=0) uint64 6.167 seconds
numpy.einsum(i->) uint64 12.241 seconds
We can restructure the output into a table for comparison.
Approach                  | Type   | Time (sec)
--------------------------|--------|------------
numpy.ndarray.sum(axis=0) | int8   | 13.070
numpy.einsum(i->)         | int8   | 5.562
numpy.ndarray.sum(axis=0) | uint8  | 13.001
numpy.einsum(i->)         | uint8  | 5.566
numpy.ndarray.sum(axis=0) | int16  | 12.704
numpy.einsum(i->)         | int16  | 13.356
numpy.ndarray.sum(axis=0) | uint16 | 12.725
numpy.einsum(i->)         | uint16 | 13.397
numpy.ndarray.sum(axis=0) | int32  | 13.067
numpy.einsum(i->)         | int32  | 9.942
numpy.ndarray.sum(axis=0) | uint32 | 13.059
numpy.einsum(i->)         | uint32 | 10.212
numpy.ndarray.sum(axis=0) | int64  | 6.155
numpy.einsum(i->)         | int64  | 11.444
numpy.ndarray.sum(axis=0) | uint64 | 6.167
numpy.einsum(i->)         | uint64 | 12.241
The results show that numpy.einsum() is indeed faster than the sum() function for arrays of some integer types.
Specifically, we see a benefit in using numpy.einsum() for the following types (and their unsigned equivalents):
- int8
- int32
The benefit is more significant with int8, showing a 2.349x speedup, whereas int32 shows a 1.314x speedup.
We see a benefit in using sum() for int64 and uint64, and near parity in the results for int16 and uint16.
The results are very similar for signed and unsigned types.
Repeating the benchmark shows the same general pattern of results.
numpy.ndarray.sum(axis=0) int8 13.680 seconds
numpy.einsum(i->) int8 5.762 seconds
numpy.ndarray.sum(axis=0) uint8 13.546 seconds
numpy.einsum(i->) uint8 5.636 seconds
numpy.ndarray.sum(axis=0) int16 12.958 seconds
numpy.einsum(i->) int16 13.500 seconds
numpy.ndarray.sum(axis=0) uint16 12.874 seconds
numpy.einsum(i->) uint16 13.693 seconds
numpy.ndarray.sum(axis=0) int32 13.542 seconds
numpy.einsum(i->) int32 10.149 seconds
numpy.ndarray.sum(axis=0) uint32 13.358 seconds
numpy.einsum(i->) uint32 10.551 seconds
numpy.ndarray.sum(axis=0) int64 6.236 seconds
numpy.einsum(i->) int64 11.559 seconds
numpy.ndarray.sum(axis=0) uint64 6.191 seconds
numpy.einsum(i->) uint64 11.991 seconds
This is also a valuable finding.
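One caveat before adopting numpy.einsum() for small integer types: sum() promotes small integer dtypes to the platform default integer while accumulating, whereas einsum accumulates in the input dtype by default and can silently overflow. A quick demonstration:

```python
# demonstrate silent overflow when summing small integer types with einsum
import numpy

# 1,000 values of 100 stored as int8: the true sum is 100,000
A = numpy.full(1000, 100, dtype='int8')

# sum() accumulates in the default integer type, so the result is exact
assert int(A.sum()) == 100000

# einsum accumulates in int8 by default, so the result wraps around
wrapped = numpy.einsum('i->', A)
assert int(wrapped) != 100000

# passing an explicit dtype restores the exact result
exact = numpy.einsum('i->', A, dtype='int64')
assert int(exact) == 100000
print('einsum overflows on int8 unless given an explicit dtype')
```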
Further Reading
This section provides additional resources that you may find helpful.
Books
- Python Benchmarking, Jason Brownlee (my book!)
Also, the following Python books have chapters on benchmarking that may be helpful:
- Python Cookbook, 2013. (sections 9.1, 9.10, 9.22, 13.13, and 14.13)
- High Performance Python, 2020. (chapter 2)
Guides
- 4 Ways to Benchmark Python Code
- 5 Ways to Measure Execution Time in Python
- Python Benchmark Comparison Metrics
Benchmarking APIs
- time — Time access and conversions
- timeit — Measure execution time of small code snippets
- The Python Profilers
Takeaways
You now know how to benchmark the fastest approach to calculate the sum of a NumPy array in Python.
Did I make a mistake? See a typo?
I’m a simple humble human. Correct me, please!
Do you have any additional tips?
I’d love to hear about them!
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Vincent Ghilione on Unsplash