Last Updated on September 29, 2023

You can multiply a matrix by a vector in parallel with numpy.

Matrix-vector multiplication can be achieved in numpy using the **numpy.dot()** method, the ‘**@**‘ operator, and the **numpy.matmul()** function. All three approaches call down into the BLAS library, which implements the operation in parallel using native threads.

This means that matrix-vector multiplication is parallel using multithreading by default.

In this tutorial, you will discover how to perform matrix-vector multiplication in parallel using threads in numpy.

Let’s get started.

## Need Parallel Matrix-Vector Multiplication

Linear algebra is a field of mathematics concerned with linear equations with arrays and matrices of numbers.

Numpy is a Python library for working with arrays of numbers. As such, it implements many linear algebra functions.

One important linear algebra operation is multiplying a vector by a matrix.

For example:

```
[1, 1, 1]   [1]   [3]
[1, 1, 1] * [1] = [3]
[1, 1, 1]   [1]   [3]
```

This is a fundamental mathematical operation.
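The worked example above can be confirmed with a few lines of numpy. This is a minimal sketch that builds both arrays with **numpy.ones()**, mirroring the 3×3 case shown:

```python
# confirm the worked example: a 3x3 matrix of ones times a vector of ones
from numpy import ones

matrix = ones((3, 3))    # 3x3 matrix, all ones
vector = ones((3, 1))    # column vector of ones
result = matrix @ vector # each entry is the sum of three ones
print(result.flatten())  # [3. 3. 3.]
```

Each element of the result is the dot product of one row of the matrix with the vector, which here is simply 1+1+1 = 3.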

You can learn more about matrix-matrix and matrix-vector multiplication here:

This operation can be slow when multiplying a large matrix and vector. Performance is worse as the size of the matrix and vector is increased.

As such, we need a way to make use of modern multiple CPU core systems to speed up the matrix-vector multiplication by executing it in parallel.

**How can we execute matrix-vector multiplication in parallel with numpy?**


## Matrix-Vector Multiplication is Parallel in Numpy

Matrix multiplication is multithreaded in numpy.

There are perhaps three common ways to multiply a matrix by a vector in numpy. They are:

- The **numpy.dot()** function and **ndarray.dot()** method.
- The ‘**@**‘ operator.
- The **numpy.matmul()** function.

All three of these approaches to multiplying a matrix by a vector use the underlying BLAS library.

BLAS is an acronym that stands for Basic Linear Algebra Subprograms. It is a specification for matrix and vector mathematical operations that is implemented efficiently by third-party libraries such as OpenBLAS and MKL.

You can learn more about BLAS in numpy in the tutorial:

Most BLAS library implementations used by numpy provide multithreaded versions of basic matrix and vector operations, such as matrix-vector multiplication.

BLAS is able to use native threads (called pthreads) independent of python threads. This means BLAS threads offer true parallelism and speed-up many linear algebra operations in numpy via multithreading behind the scenes.
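You can check which BLAS implementation your numpy installation is linked against. The **numpy.show_config()** function prints the build information, including the BLAS and LAPACK libraries; the exact output varies by installation:

```python
# report the BLAS/LAPACK build information for the installed numpy
import numpy

numpy.show_config()
```

For example, an OpenBLAS-backed install will list openblas libraries in the output, while an MKL-backed install (such as from Anaconda) will list mkl.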

Now that we know that matrix-vector multiplication is multithreaded in numpy, let’s look at some worked examples.

## Example of Single-Threaded Matrix-Vector Multiplication

Before we explore multithreaded matrix-vector multiplication, let’s develop some single-threaded versions.

We can control the number of threads used by BLAS operations in numpy using an environment variable. The specific environment variable depends on the BLAS library installed, although the **OMP_NUM_THREADS** variable works in many cases.

We can set the **OMP_NUM_THREADS** variable to the number of threads to use in BLAS function calls, such as 1 thread if we want to perform single-threaded matrix-vector multiplication.

This can be achieved before starting the Python program or within the Python program prior to importing numpy via the **os.environ** mapping.

For example:

```python
...
from os import environ
environ['OMP_NUM_THREADS'] = '1'
```

You can learn more about setting the number of threads used by BLAS in the tutorial:

We will benchmark matrix-vector multiplication with a single thread using three common approaches, the **dot()** method, the ‘**@**‘ operator, and the **numpy.matmul()** function.

### Matrix-Vector Multiplication with dot()

We can benchmark single-threaded matrix-vector multiplication with the **ndarray.dot()** method.

Firstly, we will define a task function that creates a matrix and vector with a modest size. In this case, the matrix will have the dimensions 50,000×50,000, and the vector will have 50,000 elements.

Each array will be created using the **numpy.ones()** function.

Once created, the matrix and vector are multiplied together using the **dot()** method.

```python
...
# perform the multiplication
result = matrix.dot(vector)
```

The **task()** function below implements this, creating the arrays and performing the operation and returning the time to complete in seconds.

```python
# multiply a matrix by a vector
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1s
    matrix = ones((n,n))
    # create the new vector filled with 1s
    vector = ones((n,1))
    # perform the multiplication
    result = matrix.dot(vector)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration
```

Running the function a single time may report a time that is not representative of the time to complete the operation. This may be due to loading Python libraries or the operating system performing background tasks.

Therefore, we can run the task many times and report the average time to complete the operation.

The **experiment()** function below implements this, performing the task 3 times and returning the average time over the three trials.

```python
# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats
```

Finally, we can call the **experiment()** function and report the average time to multiply the matrix by the vector.

```python
...
# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
```

Tying this together, the complete example is listed below.

```python
# single-threaded matrix-vector multiplication with dot()
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones

# multiply a matrix by a vector
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1s
    matrix = ones((n,n))
    # create the new vector filled with 1s
    vector = ones((n,1))
    # perform the multiplication
    result = matrix.dot(vector)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
```

Running the example creates a matrix, a vector, then multiplies them together.

This task is repeated three times and the average time is reported in seconds.

In this case, the example took about 7.142 seconds on my system.

It may take more or less time to complete on your system, depending on the speed of your hardware.

```
Took 7.142 seconds
```

Next, let’s explore the same experiment with the ‘**@**‘ operator.

### Matrix-Vector Multiplication with @

We can re-run the same single-threaded matrix-vector multiplication experiment using the ‘**@**‘ operator.

This can be achieved by changing the call to the **dot()** method to instead use the ‘**@**‘ operator.

For example:

```python
...
# perform the multiplication
result = matrix @ vector
```

We expect this to be functionally equivalent to the **dot()** method and to take about the same time.

The complete example with this change is listed below.

```python
# single-threaded matrix-vector multiplication with @
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones

# multiply a matrix by a vector
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1s
    matrix = ones((n,n))
    # create the new vector filled with 1s
    vector = ones((n,1))
    # perform the multiplication
    result = matrix @ vector
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
```

Running the example took about 7.001 seconds on my system.

This is approximately equivalent to the **dot()** method which took about 7.142 seconds.

```
Took 7.001 seconds
```

Next, let’s explore the same experiment using the **numpy.matmul()** function.

### Matrix-Vector Multiplication with numpy.matmul()

We can re-run the same single-threaded matrix-vector multiplication experiment using the **numpy.matmul()** function.

This can be achieved by changing the call to the **dot()** method to instead use **matmul()**.

For example:

```python
...
# perform the multiplication
result = matmul(matrix, vector)
```

We expect this to be functionally equivalent to the **dot()** method and the ‘**@**‘ operator and to take about the same time.
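We can sanity-check this functional equivalence directly on a small example. This is a quick sketch, separate from the benchmark, using a 100×100 matrix so it runs instantly:

```python
# confirm that dot(), @, and matmul() agree on a small matrix-vector product
from numpy import allclose, matmul, ones

matrix = ones((100, 100))
vector = ones((100, 1))
# compute the product three ways
a = matrix.dot(vector)
b = matrix @ vector
c = matmul(matrix, vector)
# all three approaches should produce the same result
print(allclose(a, b) and allclose(b, c)) # True
```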

The complete example with this change is listed below.

```python
# single-threaded matrix-vector multiplication with matmul()
from os import environ
environ['OMP_NUM_THREADS'] = '1'
from time import time
from numpy import ones
from numpy import matmul

# multiply a matrix by a vector
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1s
    matrix = ones((n,n))
    # create the new vector filled with 1s
    vector = ones((n,1))
    # perform the multiplication
    result = matmul(matrix, vector)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
```

Running the example took about 7.028 seconds on my system.

This is approximately equivalent to the **dot()** method which took about 7.142 seconds and the ‘@’ operator which took about 7.001 seconds.

```
Took 7.028 seconds
```

Next, let’s explore multithreaded versions of the matrix-vector multiplication.

## Example of Multithreaded Matrix-Vector Multiplication

We can explore multithreaded matrix-vector multiplication.

The previous section showed that all three common approaches to matrix-vector multiplication in numpy have similar performance characteristics. We can therefore choose one, such as the **dot()** method, and benchmark its performance using multiple threads.

The **dot()** method in numpy is multithreaded via the installed BLAS library.

We can configure the number of threads used by the BLAS library via the appropriate environment variable.

In this case, we will set the number of threads to be equal to the number of physical CPU cores in the system.

We can achieve better performance with CPU-bound operations in some cases by setting the number of threads to be equal to the number of physical CPU cores.

I have 4 physical CPU cores in my system, so this example will use 4 BLAS threads. If you have more or fewer physical CPU cores, adjust the configuration accordingly.
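If you are unsure how many cores your system has, **os.cpu_count()** reports the number of logical cores. Note that on systems with hyperthreading this may be double the physical core count; counting physical cores directly requires a third-party package such as **psutil**:

```python
# report the number of logical CPU cores available
import os

logical_cores = os.cpu_count()
print(f'logical cores: {logical_cores}')
# note: on systems with hyperthreading, the physical core count may be
# half this number; psutil.cpu_count(logical=False) reports physical
# cores directly (requires the third-party psutil package)
```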

```python
...
from os import environ
environ['OMP_NUM_THREADS'] = '4'
```

Tying this together, the complete example is listed below.

```python
# multithreaded matrix-vector multiplication
from os import environ
environ['OMP_NUM_THREADS'] = '4'
from time import time
from numpy import ones

# multiply a matrix by a vector
def task(n=50000):
    # record the start time
    start = time()
    # create a new matrix and fill with 1s
    matrix = ones((n,n))
    # create the new vector filled with 1s
    vector = ones((n,1))
    # perform the multiplication
    result = matrix.dot(vector)
    # calculate and report duration
    duration = time() - start
    # return duration
    return duration

# experiment that averages duration of task function
def experiment(repeats=3):
    # repeat the experiment and gather results
    results = [task() for _ in range(repeats)]
    # return the average of the results
    return sum(results) / repeats

# run the experiment and report the result
duration = experiment()
print(f'Took {duration:.3f} seconds')
```

Running the example multiplies the matrix by the vector using multiple threads and reports the duration averaged over three repeats.

In this case, the example takes about 6.646 seconds to complete.

This is about 0.496 seconds (or 496 milliseconds) faster than the single-threaded version, or a speed-up of about 1.07x.

This is not a massive speed-up, perhaps because matrix-vector multiplication is challenging to parallelize effectively across multiple threads.

```
Took 6.646 seconds
```

**Can you achieve better performance?**

For example, you can repeat the experiment with different configurations, such as:

- BLAS configured with one fewer than the number of physical CPU cores (e.g. 3).
- BLAS configured with double the number of physical CPU cores (e.g. 8).
- BLAS configured with an arbitrary number of threads (e.g. 5).

If you try any of these configurations, let me know how you go.
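Because **OMP_NUM_THREADS** must be set before numpy is imported, one way to sweep these configurations is to run the benchmark in a fresh child process per thread count. The sketch below does this with a much smaller array (2,000×2,000, an assumption chosen so it runs quickly; increase it toward the article's 50,000 for a realistic comparison):

```python
# benchmark matrix-vector multiplication under different BLAS thread counts
# by running each trial in a fresh child process (OMP_NUM_THREADS must be
# set before numpy is imported, hence a subprocess per configuration)
import os
import subprocess
import sys

# small benchmark program executed in each child process
PROGRAM = '''
from time import time
from numpy import ones
start = time()
result = ones((2000, 2000)).dot(ones((2000, 1)))
print(f'{time() - start:.3f}')
'''

results = {}
for threads in ['1', '3', '4', '8']:
    # set the thread count in the child's environment before numpy loads
    env = dict(os.environ, OMP_NUM_THREADS=threads)
    proc = subprocess.run([sys.executable, '-c', PROGRAM],
                          capture_output=True, text=True, env=env)
    results[threads] = proc.stdout.strip()

# report the duration for each configuration
for threads, duration in results.items():
    print(f'{threads} threads: {duration} seconds')
```

At this small size the timings will be dominated by array creation, so expect the differences between configurations to be modest until the array size is increased.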

## Further Reading

This section provides additional resources that you may find helpful.

**Books**

- Concurrent NumPy in Python, Jason Brownlee (**my book!**)

**Guides**

- Concurrent NumPy 7-Day Course
- Which NumPy Functions Are Multithreaded
- Numpy Multithreaded Matrix Multiplication (up to 5x faster)
- NumPy vs the Global Interpreter Lock (GIL)
- ThreadPoolExecutor Fill NumPy Array (3x faster)
- Fastest Way To Share NumPy Array Between Processes

**Documentation**

- Parallel Programming with numpy and scipy, SciPy Cookbook, 2015
- Parallel Programming with numpy and scipy (older archived version)
- Parallel Random Number Generation, NumPy API

**Concurrency APIs**

- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks

## Takeaways

You now know how to perform matrix-vector multiplication in parallel using threads in numpy.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.
