You can benchmark functions and algorithms to calculate the sum of NumPy arrays to discover the fastest approaches to use.
Generally, it is significantly faster to use the numpy.sum() module function and the ndarray.sum() method than approaches that calculate the sum via multiplication. Further, the sum() method may be slightly faster than the module function, although the difference is minor and may be due to random variation.
A surprising result is that numpy.einsum() is significantly faster than the sum() functions for arrays of floating point values and some integer types. This is a valuable finding, as the speedup can be significant, up to 3.071x in some cases.
In this tutorial, you will discover how to benchmark the fastest approach to calculate the sum of a NumPy array in Python.
Let’s get started.
Need Fast Sum of NumPy Arrays
Calculating the sum of NumPy arrays in our Python program is a common operation.
It is perhaps one of the most commonly used summary statistics.
It seems straightforward: call sum() and we have the result.
But is this the fastest method?
In fact, there are many ways we can calculate the sum of NumPy arrays.
For example, some approaches include:
- numpy.sum()
- numpy.sum(axis=0)
- numpy.ndarray.sum()
- numpy.ndarray.sum(axis=0)
- numpy.add.reduce()
- numpy.einsum('i->')
- numpy.inner(numpy.ones_like())
- numpy.dot(numpy.ones_like())
- numpy.matmul(numpy.ones_like())
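Before benchmarking, it is worth confirming that these approaches all compute the same value. A quick sanity check on a small array (note that the multiplication-based approaches sum by taking a product with an array of ones):

```python
# sanity check: confirm the candidate sum approaches all agree
import numpy

# small array of random floats
rng = numpy.random.default_rng(1)
A = rng.random(1000)

# compute the sum with each candidate approach
results = [
    numpy.sum(A),
    numpy.sum(A, axis=0),
    A.sum(),
    A.sum(axis=0),
    numpy.add.reduce(A),
    numpy.einsum('i->', A),
    numpy.inner(A, numpy.ones_like(A)),
    numpy.dot(A, numpy.ones_like(A)),
    numpy.matmul(A, numpy.ones_like(A)),
]
# all approaches should agree to within floating point tolerance
assert all(numpy.isclose(r, results[0]) for r in results)
print('all approaches agree')
```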
There are also more exotic methods we could devise.
Nevertheless, if we need to calculate the sum of a NumPy array frequently in our program, such as within a loop, what is the fastest way we should use?
What is the fastest way to calculate the sum of a NumPy array?
Benchmark Sum of NumPy Arrays
We can explore the question of how fast the different approaches to calculate the sum of NumPy arrays are using benchmarking.
In this case, we will use an approach to calculate the sum of an array of random numbers of a modest fixed size, then repeat this process many times to give an estimated time. We can then compare the times to see the relative performance of the approaches tested.
You can use this approach to benchmark your own favorite NumPy array operations.
If you use or extend the NumPy benchmarking approach used in this tutorial, let me know in the comments below. I’d love to see what you come up with.
We could use the time.perf_counter() function directly and develop a helper function to perform the benchmarking and report results.
You can learn more about benchmarking with the time.perf_counter() function in the tutorial:
Instead, in this case, we will use the timeit API, specifically the timeit.timeit() function and specify the string of sum calculation code to run and a fixed number of times to run it.
We will also provide the “globals” argument for any constants defined in our benchmark code, such as the defined array to operate upon.
For example:
...
# benchmark a thing
result = timeit.timeit('...', globals=globals(), number=N)
print(f'approach {result:.3f} seconds')
You can learn more about benchmarking with the timeit.timeit() function in the tutorial:
The number of runs in each benchmark was tuned so that each snippet took more than one second and less than about ten seconds in total.
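If you prefer not to tune the run count by hand, the timeit.Timer.autorange() method can estimate one for you. Below is a sketch that scales the estimated count toward a target of roughly five seconds; the five-second target is my own choice for illustration, not a timeit default.

```python
# estimate a suitable run count for a benchmark snippet
import numpy
import timeit

# the data operated on by the benchmarked snippet
rng = numpy.random.default_rng(1)
A = rng.random(500000)

timer = timeit.Timer('numpy.sum(A)', globals=globals())
# autorange() finds a run count where total time is at least 0.2 seconds
number, elapsed = timer.autorange()
# scale the count so the full benchmark takes roughly five seconds
target = max(1, round(number * 5.0 / elapsed))
print(f'{number} runs took {elapsed:.3f}s, use about {target} runs')
```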
Let’s get started.
Fastest Way to Calculate Sum of 1D Float NumPy Arrays
We can explore the fastest way to calculate the sum of a modestly sized NumPy array of random floating point values in [0,1).
In this case, we will create a fixed-size 1d array with half a million elements (500,000) of random floats with the default data type, which is float64 on most platforms. Each approach will be used to calculate the sum of the array 100,000 times.
Generally, we may expect the numpy.sum() function to be the fastest.
The complete example is listed below.
# SuperFastPython.com
# benchmark calculating the sum of a 1d array of floats
import numpy
import timeit
# define the data used in all runs
rng = numpy.random.default_rng(1)
A = rng.random(500000)
# number of times to run each snippet
N = 100000
# numpy.sum()
result = timeit.timeit('numpy.sum(A)', globals=globals(), number=N)
print(f'numpy.sum() {result:.3f} seconds')
# numpy.sum(axis=0)
result = timeit.timeit('numpy.sum(A, axis=0)', globals=globals(), number=N)
print(f'numpy.sum(axis=0) {result:.3f} seconds')
# numpy.ndarray.sum()
result = timeit.timeit('A.sum()', globals=globals(), number=N)
print(f'numpy.ndarray.sum() {result:.3f} seconds')
# numpy.ndarray.sum(axis=0)
result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
print(f'numpy.ndarray.sum(axis=0) {result:.3f} seconds')
# numpy.add.reduce()
result = timeit.timeit('numpy.add.reduce(A)', globals=globals(), number=N)
print(f'numpy.add.reduce() {result:.3f} seconds')
# numpy.einsum('i->')
result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
print(f'numpy.einsum(i->) {result:.3f} seconds')
# numpy.inner(numpy.ones_like())
result = timeit.timeit('numpy.inner(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.inner(numpy.ones_like()) {result:.3f} seconds')
# numpy.dot(numpy.ones_like())
result = timeit.timeit('numpy.dot(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.dot(numpy.ones_like()) {result:.3f} seconds')
# numpy.matmul(numpy.ones_like())
result = timeit.timeit('numpy.matmul(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.matmul(numpy.ones_like()) {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.sum() 14.688 seconds
numpy.sum(axis=0) 14.320 seconds
numpy.ndarray.sum() 14.311 seconds
numpy.ndarray.sum(axis=0) 14.172 seconds
numpy.add.reduce() 13.749 seconds
numpy.einsum(i->) 8.731 seconds
numpy.inner(numpy.ones_like()) 76.145 seconds
numpy.dot(numpy.ones_like()) 77.916 seconds
numpy.matmul(numpy.ones_like()) 80.903 seconds
We can restructure the output into a table for comparison.
Approach                        | Time (sec)
--------------------------------|------------
numpy.sum()                     | 14.688
numpy.sum(axis=0)               | 14.320
numpy.ndarray.sum()             | 14.311
numpy.ndarray.sum(axis=0)       | 14.172
numpy.add.reduce()              | 13.749
numpy.einsum(i->)               | 8.731
numpy.inner(numpy.ones_like())  | 76.145
numpy.dot(numpy.ones_like())    | 77.916
numpy.matmul(numpy.ones_like()) | 80.903
The results show that numpy.sum() and numpy.ndarray.sum() have similar performance.
The results are all very close and I suspect any differences are likely due to random fluctuations.
Not surprisingly, the approaches that use multiplication, such as numpy.inner(), numpy.dot(), and numpy.matmul(), are all significantly slower than the sum() functions. These approaches must create and initialize a new array of ones before performing the operation.
One clear difference is that the numpy.einsum() function is significantly faster than both the sum() functions and the multiplication functions.
In fact, in this case, it took 8.731 seconds compared to about 14 seconds for the sum() functions, a speedup factor of about 1.682x.
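Based on this result, if float sums are a hot spot in your program, one option is to wrap the choice in a small helper. The fast_sum() function below is purely hypothetical (it is not part of NumPy), a sketch under the assumption that 1d float arrays benefit from einsum:

```python
import numpy

def fast_sum(a):
    # hypothetical helper: einsum was faster for 1d float arrays in
    # the benchmark above; fall back to ndarray.sum() for other cases
    if a.ndim == 1 and a.dtype.kind == 'f':
        return numpy.einsum('i->', a)
    return a.sum()

# confirm it matches the built-in sum within floating point tolerance
rng = numpy.random.default_rng(1)
A = rng.random(1000)
assert numpy.isclose(fast_sum(A), A.sum())
print('fast_sum matches A.sum()')
```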
Re-running the benchmark, we see a similar pattern. The sum() functions all have similar results, the multiplication functions are significantly slower, and numpy.einsum() is significantly faster.
numpy.sum() 14.040 seconds
numpy.sum(axis=0) 13.775 seconds
numpy.ndarray.sum() 13.603 seconds
numpy.ndarray.sum(axis=0) 13.656 seconds
numpy.add.reduce() 15.289 seconds
numpy.einsum(i->) 8.760 seconds
numpy.inner(numpy.ones_like()) 75.926 seconds
numpy.dot(numpy.ones_like()) 76.079 seconds
Next, let’s see if the results hold for different float data types.
Float Data Types Matter When Summing
We can explore whether the pattern of results holds with different float data types.
Specifically, we are interested in whether numpy.einsum() is faster than numpy.sum() and related functions for arrays of both float32 and float64 types.
In this case, we can test both functions with arrays of random floats of both types.
The complete example is listed below.
# SuperFastPython.com
# benchmark summing arrays of random numbers with different float data types
import numpy
import timeit
# number of times to run each snippet
N = 100000
# list of types to compare
types = ['float32', 'float64']
# benchmark each data type
for t in types:
    # create data to sum
    rng = numpy.random.default_rng(1)
    A = rng.random(500000, dtype=t)
    # numpy.ndarray.sum(axis=0)
    result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
    print(f'numpy.ndarray.sum(axis=0) {t} {result:.3f} seconds')
    # numpy.einsum('i->')
    result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
    print(f'numpy.einsum(i->) {t} {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.ndarray.sum(axis=0) float32 13.390 seconds
numpy.einsum(i->) float32 4.360 seconds
numpy.ndarray.sum(axis=0) float64 13.370 seconds
numpy.einsum(i->) float64 8.222 seconds
We can restructure the output into a table for comparison.
Approach                  | Type    | Time (sec)
--------------------------|---------|------------
numpy.ndarray.sum(axis=0) | float32 | 13.390
numpy.einsum(i->)         | float32 | 4.360
numpy.ndarray.sum(axis=0) | float64 | 13.370
numpy.einsum(i->)         | float64 | 8.222
The results show that it is indeed faster to use numpy.einsum() over the sum() function for arrays of float32 and float64 types.
We can also see a larger benefit for the smaller float32 type. Specifically, numpy.einsum() is about 3.071x faster than sum() for float32 and 1.626x faster when summing float64.
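One caveat worth noting: numpy.sum() is reported to use pairwise summation for float arrays while numpy.einsum() is not, so the two can differ slightly in accuracy on large float32 arrays. If accuracy matters, it may be worth comparing both against a float64 reference on your own data, for example:

```python
# compare float32 sum accuracy of ndarray.sum() and einsum
# against a higher-precision float64 reference
import numpy

rng = numpy.random.default_rng(1)
A = rng.random(500000, dtype='float32')

# reference sum computed at higher precision
reference = A.astype('float64').sum()

error_sum = abs(float(A.sum()) - reference)
error_einsum = abs(float(numpy.einsum('i->', A)) - reference)
print(f'sum() error: {error_sum:.6f}, einsum error: {error_einsum:.6f}')
```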
Repeating the benchmark shows the same general pattern of results.
numpy.ndarray.sum(axis=0) float32 13.631 seconds
numpy.einsum(i->) float32 4.416 seconds
numpy.ndarray.sum(axis=0) float64 13.966 seconds
numpy.einsum(i->) float64 8.392 seconds
This is a valuable finding.
Next, let’s explore a similar benchmark with integer values.
Fastest Way to Calculate Sum of 1D Integer NumPy Arrays
We can explore whether the same findings from the previous benchmark hold for arrays of integer values.
In this case, we will update the example to generate an array of random integer values between 0 and 100 (inclusive) and then calculate the sum of each using the same approaches.
The complete example is listed below.
# SuperFastPython.com
# benchmark calculating the sum of a 1d array of integers
import numpy
import timeit
# define the data used in all runs
rng = numpy.random.default_rng(1)
A = rng.integers(0, 100+1, 500000)
# number of times to run each snippet
N = 100000
# numpy.sum()
result = timeit.timeit('numpy.sum(A)', globals=globals(), number=N)
print(f'numpy.sum() {result:.3f} seconds')
# numpy.sum(axis=0)
result = timeit.timeit('numpy.sum(A, axis=0)', globals=globals(), number=N)
print(f'numpy.sum(axis=0) {result:.3f} seconds')
# numpy.ndarray.sum()
result = timeit.timeit('A.sum()', globals=globals(), number=N)
print(f'numpy.ndarray.sum() {result:.3f} seconds')
# numpy.ndarray.sum(axis=0)
result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
print(f'numpy.ndarray.sum(axis=0) {result:.3f} seconds')
# numpy.add.reduce()
result = timeit.timeit('numpy.add.reduce(A)', globals=globals(), number=N)
print(f'numpy.add.reduce() {result:.3f} seconds')
# numpy.einsum('i->')
result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
print(f'numpy.einsum(i->) {result:.3f} seconds')
# numpy.inner(numpy.ones_like())
result = timeit.timeit('numpy.inner(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.inner(numpy.ones_like()) {result:.3f} seconds')
# numpy.dot(numpy.ones_like())
result = timeit.timeit('numpy.dot(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.dot(numpy.ones_like()) {result:.3f} seconds')
# numpy.matmul(numpy.ones_like())
result = timeit.timeit('numpy.matmul(A, numpy.ones_like(A))', globals=globals(), number=N)
print(f'numpy.matmul(numpy.ones_like()) {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.sum() 7.351 seconds
numpy.sum(axis=0) 7.320 seconds
numpy.ndarray.sum() 7.173 seconds
numpy.ndarray.sum(axis=0) 6.605 seconds
numpy.add.reduce() 6.649 seconds
numpy.einsum(i->) 12.219 seconds
numpy.inner(numpy.ones_like()) 69.057 seconds
numpy.dot(numpy.ones_like()) 63.762 seconds
numpy.matmul(numpy.ones_like()) 66.072 seconds
We can restructure the output into a table for comparison.
Approach                        | Time (sec)
--------------------------------|------------
numpy.sum()                     | 7.351
numpy.sum(axis=0)               | 7.320
numpy.ndarray.sum()             | 7.173
numpy.ndarray.sum(axis=0)       | 6.605
numpy.add.reduce()              | 6.649
numpy.einsum(i->)               | 12.219
numpy.inner(numpy.ones_like())  | 69.057
numpy.dot(numpy.ones_like())    | 63.762
numpy.matmul(numpy.ones_like()) | 66.072
We see a similar pattern in these results as we did for summing arrays of floating point values.
The results show that all of the sum functions have similar results and that the multiplication approaches are significantly slower.
Interestingly, we can see that the numpy.einsum() function does not perform better than the sum functions here. This suggests that it is better suited to some data types than others, e.g. floating point types.
Re-running the benchmark, we see a similar pattern of performance.
It may be that using the sum() method on the ndarray is slightly faster than using the module function, although the small difference may be due to natural variance in the results.
numpy.sum() 6.907 seconds
numpy.sum(axis=0) 6.564 seconds
numpy.ndarray.sum() 6.394 seconds
numpy.ndarray.sum(axis=0) 6.640 seconds
numpy.add.reduce() 6.511 seconds
numpy.einsum(i->) 12.833 seconds
numpy.inner(numpy.ones_like()) 71.127 seconds
Next, let’s see if the results hold for different integer data types.
Integer Data Types Matter When Summing
We can explore the effect of summing different integer data types.
Specifically, we are interested to see if numpy.einsum() can offer a benefit over the sum() function with any integer type, besides the default int64.
In this case, we can test both functions with arrays of random integers with a range of integer types from 8 to 64 bits, both signed and unsigned.
The complete example is listed below.
# SuperFastPython.com
# benchmark summing arrays of random numbers with different int data types
import numpy
import timeit
# number of times to run each snippet
N = 100000
# list of types to compare
types = ['int8', 'uint8', 'int16', 'uint16', 'int32', 'uint32', 'int64', 'uint64']
# benchmark each data type
for t in types:
    # create data to sum
    rng = numpy.random.default_rng(1)
    A = rng.integers(0, 100+1, 500000, dtype=t)
    # numpy.ndarray.sum(axis=0)
    result = timeit.timeit('A.sum(axis=0)', globals=globals(), number=N)
    print(f'numpy.ndarray.sum(axis=0) {t} {result:.3f} seconds')
    # numpy.einsum('i->')
    result = timeit.timeit("numpy.einsum('i->', A)", globals=globals(), number=N)
    print(f'numpy.einsum(i->) {t} {result:.3f} seconds')
Running the example benchmarks each approach and reports the sum execution time.
numpy.ndarray.sum(axis=0) int8 13.070 seconds
numpy.einsum(i->) int8 5.562 seconds
numpy.ndarray.sum(axis=0) uint8 13.001 seconds
numpy.einsum(i->) uint8 5.566 seconds
numpy.ndarray.sum(axis=0) int16 12.704 seconds
numpy.einsum(i->) int16 13.356 seconds
numpy.ndarray.sum(axis=0) uint16 12.725 seconds
numpy.einsum(i->) uint16 13.397 seconds
numpy.ndarray.sum(axis=0) int32 13.067 seconds
numpy.einsum(i->) int32 9.942 seconds
numpy.ndarray.sum(axis=0) uint32 13.059 seconds
numpy.einsum(i->) uint32 10.212 seconds
numpy.ndarray.sum(axis=0) int64 6.155 seconds
numpy.einsum(i->) int64 11.444 seconds
numpy.ndarray.sum(axis=0) uint64 6.167 seconds
numpy.einsum(i->) uint64 12.241 seconds
We can restructure the output into a table for comparison.
Approach                  | Type   | Time (sec)
--------------------------|--------|------------
numpy.ndarray.sum(axis=0) | int8   | 13.070
numpy.einsum(i->)         | int8   | 5.562
numpy.ndarray.sum(axis=0) | uint8  | 13.001
numpy.einsum(i->)         | uint8  | 5.566
numpy.ndarray.sum(axis=0) | int16  | 12.704
numpy.einsum(i->)         | int16  | 13.356
numpy.ndarray.sum(axis=0) | uint16 | 12.725
numpy.einsum(i->)         | uint16 | 13.397
numpy.ndarray.sum(axis=0) | int32  | 13.067
numpy.einsum(i->)         | int32  | 9.942
numpy.ndarray.sum(axis=0) | uint32 | 13.059
numpy.einsum(i->)         | uint32 | 10.212
numpy.ndarray.sum(axis=0) | int64  | 6.155
numpy.einsum(i->)         | int64  | 11.444
numpy.ndarray.sum(axis=0) | uint64 | 6.167
numpy.einsum(i->)         | uint64 | 12.241
The results show that numpy.einsum() is indeed faster than the sum() function for arrays of some integer types.
Specifically, we see a benefit in using numpy.einsum() for the following types (and their unsigned equivalents):
- int8
- int32
The benefit is more significant with int8, showing a 2.349x speedup, whereas int32 shows a 1.314x speedup.
We see a benefit in using sum() for int64 and uint64, and near parity in the results for int16 and uint16.
The results are very similar for signed and unsigned types.
Repeating the benchmark shows the same general pattern of results.
numpy.ndarray.sum(axis=0) int8 13.680 seconds
numpy.einsum(i->) int8 5.762 seconds
numpy.ndarray.sum(axis=0) uint8 13.546 seconds
numpy.einsum(i->) uint8 5.636 seconds
numpy.ndarray.sum(axis=0) int16 12.958 seconds
numpy.einsum(i->) int16 13.500 seconds
numpy.ndarray.sum(axis=0) uint16 12.874 seconds
numpy.einsum(i->) uint16 13.693 seconds
numpy.ndarray.sum(axis=0) int32 13.542 seconds
numpy.einsum(i->) int32 10.149 seconds
numpy.ndarray.sum(axis=0) uint32 13.358 seconds
numpy.einsum(i->) uint32 10.551 seconds
numpy.ndarray.sum(axis=0) int64 6.236 seconds
numpy.einsum(i->) int64 11.559 seconds
numpy.ndarray.sum(axis=0) uint64 6.191 seconds
numpy.einsum(i->) uint64 11.991 seconds
This is also a valuable finding.
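One caveat before adopting numpy.einsum() for small integer types: sum() promotes small integer dtypes to the platform default integer while accumulating, whereas einsum accumulates in the input dtype by default and can silently overflow. A quick demonstration:

```python
# demonstrate silent overflow when summing small integer types with einsum
import numpy

# 1,000 values of 100 stored as int8: the true sum is 100,000
A = numpy.full(1000, 100, dtype='int8')

# sum() accumulates in the default integer type, so the result is exact
assert int(A.sum()) == 100000

# einsum accumulates in int8 by default, so the result wraps around
wrapped = numpy.einsum('i->', A)
assert int(wrapped) != 100000

# passing an explicit dtype restores the exact result
exact = numpy.einsum('i->', A, dtype='int64')
assert int(exact) == 100000
print('einsum overflows on int8 unless given an explicit dtype')
```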
Further Reading
This section provides additional resources that you may find helpful.
Books
- Python Benchmarking, Jason Brownlee (my book!)
Also, the following Python books have chapters on benchmarking that may be helpful:
- Python Cookbook, 2013. (sections 9.1, 9.10, 9.22, 13.13, and 14.13)
- High Performance Python, 2020. (chapter 2)
Guides
- 4 Ways to Benchmark Python Code
- 5 Ways to Measure Execution Time in Python
- Python Benchmark Comparison Metrics
Benchmarking APIs
- time — Time access and conversions
- timeit — Measure execution time of small code snippets
- The Python Profilers
Takeaways
You now know how to benchmark the fastest approach to calculate the sum of a NumPy array in Python.
Did I make a mistake? See a typo?
I’m a simple humble human. Correct me, please!
Do you have any additional tips?
I’d love to hear about them!
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Vincent Ghilione on Unsplash