We expect the performance of executing independent tasks in parallel to scale with the number of physical CPU cores available.
This assumption holds for many types of tasks, but not all. For some tasks, such as operations on large lists, the expectation does not hold.
In this tutorial, you will discover some independent tasks that do scale and some tasks that do not scale with the number of CPUs in Python.
This tutorial was prompted by questions and discussion with William G. Thank you deeply! If you have a question about Python concurrency, message me any time.
Let’s get started.
Performance Scales with CPU Cores
When we execute independent tasks in parallel, we generally expect the performance to improve as the number of CPU cores is increased.
That is, if executing 4 equivalent and independent tasks sequentially takes 4 units of time, then executing all 4 tasks in parallel should take about one unit of time.
This is the basis of parallel programming.
We can implement parallel programming in Python using the multiprocessing module and process pools such as the multiprocessing.Pool class and the concurrent.futures.ProcessPoolExecutor class.
These process pools create pools of reusable workers for executing independent tasks using child processes, in parallel.
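As a point of reference, a minimal sketch of using either pool class might look like the following (the task() function here is a placeholder, not the benchmark task used below).

# minimal sketch of running independent tasks in a process pool
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

# placeholder task that does some independent work
def task(value):
    return value * value

# entry point
if __name__ == '__main__':
    # multiprocessing.Pool version
    with Pool(processes=4) as pool:
        print(pool.map(task, range(4)))
    # concurrent.futures.ProcessPoolExecutor version
    with ProcessPoolExecutor(max_workers=4) as exe:
        print(list(exe.map(task, range(4))))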
Let’s explore the case of executing a fixed-duration independent task using a process pool and increasing the number of parallel tasks and parallel workers together in concert.
If we increase the number of parallel tasks and workers from 1 to the number of CPU cores, then we expect the time taken to execute all tasks to remain fixed at the duration of one task (or close enough, because of any overhead in managing the pool and other things running on the operating system).
The example below explores this.
It defines a task that sleeps for one second. I have 4 physical CPU cores (8 logical cores with hyperthreading) in my system. Therefore, as we increase the number of tasks and workers from 1 to 4, we expect the overall duration of the batch of tasks to remain fixed at around one second.
If we had more workers and tasks than physical CPU cores, we might expect the performance of the parallel tasks to begin to degrade as the operating system starts time-slicing and context-switching between the competing processes.
The complete example is listed below.
Adapt the number of loop iterations (workers and tasks) to the number of physical CPU cores in your system as needed.
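If you are unsure how many cores your system has, a quick check might look like the following sketch (os.cpu_count() reports logical cores; the physical core count assumes the third-party psutil package is installed).

# report the number of logical cpu cores
import os
print(os.cpu_count())

# report the number of physical cpu cores (requires the third-party psutil package)
import psutil
print(psutil.cpu_count(logical=False))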
# SuperFastPython.com
# example of exploring the duration of fixed-length tasks in a process pool
from time import time
from time import sleep
from concurrent.futures import ProcessPoolExecutor

# task that does some work
def task(value):
    # simulate work for a fixed interval
    sleep(1)

# entry point
if __name__ == '__main__':
    # test an increasing number of processes
    for i in range(1,5):
        # record start time
        start = time()
        # create the pool
        with ProcessPoolExecutor(max_workers=i) as exe:
            # issue one task per processor
            _ = exe.map(task, range(i))
            # wait for tasks to complete...
        # record end time
        end = time()
        # calculate time taken in seconds
        duration = end - start
        # report duration
        print(f'>{i} worker/task took {duration:.3f} seconds')
Running the example performs the experiment.
It runs a loop from 1 to 4; each iteration creates a pool with the given number of child process workers, issues the same number of tasks, and waits for all tasks to complete.
The time taken for each test is reported and each test is close to 1 second, as expected.
>1 worker/task took 1.073 seconds
>2 worker/task took 1.062 seconds
>3 worker/task took 1.068 seconds
>4 worker/task took 1.076 seconds
Let’s repeat the experiment with a task that takes about one second, but requires engaging the CPU, rather than blocking and doing nothing.
We can define a task that creates a list of square root numbers.
The example below provides the updated example with a CPU-bound task.
Note, I had to finesse the number of iterations in the task to make it take about one second. If you have faster or slower CPU cores, adapt the example accordingly.
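For example, a small standalone sketch like the following can help calibrate the iteration count before running the benchmark (the starting value matches the example below; adjust it until the loop takes about one second on your machine).

# time the cpu-bound work in isolation to calibrate the iteration count
from time import time
from math import sqrt

# adjust this value until the loop takes about one second on your machine
n = 10800000
start = time()
data = [sqrt(i) for i in range(1, n)]
print(f'{n} iterations took {time() - start:.3f} seconds')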
# SuperFastPython.com
# example of exploring the duration of fixed length cpu-bound tasks in a process pool
from time import time
from time import sleep
from math import sqrt
from concurrent.futures import ProcessPoolExecutor

# task that does some work
def task(value):
    # simulate work for a fixed interval
    data = [sqrt(i) for i in range(1, 10800000)]

# entry point
if __name__ == '__main__':
    # test an increasing number of processes
    for i in range(1,5):
        # record start time
        start = time()
        # create the pool
        with ProcessPoolExecutor(max_workers=i) as exe:
            # issue one task per processor
            _ = exe.map(task, range(i))
            # wait for tasks to complete...
        # record end time
        end = time()
        # calculate time taken in seconds
        duration = end - start
        # report duration
        print(f'>{i} worker/task took {duration:.3f} seconds')
Running the example performs the experiment with much the same results.
The tasks are independent.
The tasks fully occupy the CPU cores.
The duration of 1, 2, 3, and 4 parallel tasks is roughly equivalent because I have 4 physical CPU cores.
>1 worker/task took 1.019 seconds
>2 worker/task took 1.030 seconds
>3 worker/task took 1.072 seconds
>4 worker/task took 1.120 seconds
No surprises.
Now, let’s look at a task that does not behave as expected.
Tasks That Do Not Scale (surprisingly!)
It seems there are some typical tasks in regular Python that do not behave as expected when executed in parallel using process-based concurrency.
These are operations that involve large lists of objects, such as integers.
Examples include (see the sketch after this list):
- Creating a copy of a large list via the copy() method.
- Using the plus operator to append items to a large list iteratively in a loop, e.g. a = a + [i]
- Slicing a large list, e.g. a = b[:]
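To make these concrete, here is a rough sketch of what each operation might look like as a task function for the benchmark harness used above (the list sizes and loop counts are arbitrary placeholders).

# sketches of task functions built around large-list operations
# (list sizes and loop counts are arbitrary placeholders)

# create a large list and copy it via the copy() method
def task_copy(value):
    mylist = [i for i in range(20000000)]
    mylist2 = mylist.copy()

# grow a list with the plus operator in a loop
def task_concat(value):
    test = []
    for j in range(80000):
        test = test + [j]

# create a large list and slice it
def task_slice(value):
    b = [i for i in range(20000000)]
    a = b[:]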
Executing an operation of this type looks like an independent task.
Executing two or more of these operations in parallel should therefore scale with the number of physical CPU cores. That is, executing 1, 2, 3, or 4 tasks of this type in parallel should take about the same time as executing one of them.
Unfortunately, this is not the case.
The example below demonstrates this.
# SuperFastPython.com
# example of an independent task that does not scale with the number of cpu cores
from time import time
from time import sleep
from concurrent.futures import ProcessPoolExecutor

# task that does some work
def task(value):
    # create a large list
    mylist = [i for i in range(20000000)]
    # copy the list
    mylist2 = mylist.copy()

# entry point
if __name__ == '__main__':
    # test an increasing number of processes
    for i in range(1,5):
        # record start time
        start = time()
        # create the pool
        with ProcessPoolExecutor(max_workers=i) as exe:
            # issue one task per processor
            _ = exe.map(task, range(i))
            # wait for tasks to complete
        # record end time
        end = time()
        # calculate time taken in seconds
        duration = end - start
        # report duration
        print(duration)
Running the example performs 4 experiments.
The number of tasks and workers in the pool is increased from 1 to 4.
We expect the overall duration to remain fixed (roughly) for each experiment, because I have 4 physical CPU cores, allowing all tasks to run in parallel.
Instead, we see a trend of increasingly worse performance from about 1 second to nearly 1.5 seconds with 4 tasks and workers.
1.0626261234283447
1.126122236251831
1.2505171298980713
1.4071791172027588
We can exaggerate the trend by making the task take longer, such as 5 seconds.
The updated example is listed below.
# SuperFastPython.com
# example of an independent task that does not scale with the number of cpu cores
from time import time
from time import sleep
from concurrent.futures import ProcessPoolExecutor

# task that does some work
def task(value):
    # create a large list
    mylist = [i for i in range(100000000)]
    # copy the list
    mylist2 = mylist.copy()

# entry point
if __name__ == '__main__':
    # test an increasing number of processes
    for i in range(1,5):
        # record start time
        start = time()
        # create the pool
        with ProcessPoolExecutor(max_workers=i) as exe:
            # issue one task per processor
            _ = exe.map(task, range(i))
            # wait for tasks to complete
        # record end time
        end = time()
        # calculate time taken in seconds
        duration = end - start
        # report duration
        print(duration)
Running the example performs the same experiment with a longer duration task.
As we increase the number of parallel tasks and workers, we see the same pattern of increasing time, from about 5 seconds with one task and worker to nearly 7 seconds for 4 tasks and workers.
4.839679002761841
5.280073881149292
5.9325501918792725
6.95676589012146
Appending items to a large list using the plus operator seems to exaggerate the pattern even more.
The example below provides an updated task with this change, with each task expected to take about 7 seconds.
# SuperFastPython.com
# example of performance degrading with increase in parallel tasks and workers
from time import time
from time import sleep
from concurrent.futures import ProcessPoolExecutor

# task that does some work
def task(value):
    # do work
    test = []
    for j in range(80000):
        test = test + [j]

# entry point
if __name__ == '__main__':
    # test an increasing number of processes
    for i in range(1,5):
        # record start time
        start = time()
        # create the pool
        with ProcessPoolExecutor(max_workers=i) as exe:
            # issue one task per processor
            _ = exe.map(task, range(i))
            # wait for tasks to complete
        # record end time
        end = time()
        # calculate time taken in seconds
        duration = end - start
        # report duration
        print(duration)
We see the same pattern, but much worse.
One task takes about 7 seconds.
Two tasks in parallel take about 7.5 seconds. Longer, but not crazy.
Three tasks take double the expected duration at 14 seconds.
Four parallel tasks executed by four parallel workers take about 28 seconds, 4 times longer.
Yes, 4 times longer: the same length of time we would expect if the 4 tasks were executed sequentially.
Why?
6.903233051300049
7.598363876342773
14.149807691574097
28.29033398628235
Analysis
At first, I thought there was a bug in the experimental code.
I tried in vain to sniff out the bug.
Negative Findings
Some things I discovered:
- The behavior is replicated with the task function executed by workers in the multiprocessing.Pool class and with separate multiprocessing.Process instances.
- The behavior is replicated if alternate methods are used to issue the tasks, e.g. submit() on the ProcessPoolExecutor or apply_async() when using Pool (see the sketch after this list).
- The behavior is replicated with the 'fork' and the 'spawn' methods for starting child processes.
- The behavior is replicated when the number of workers is fixed at the number of physical CPU cores for each experimental run.
- The behavior is replicated if the task function returns a value or does not.
- The behavior is replicated regardless of the size of the thread stack space.
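For reference, a sketch of what those variations might look like is below. This is a reconstruction for illustration, not the exact test code; it combines submit() on the ProcessPoolExecutor, apply_async() on the Pool, and an explicit start method.

# sketch of issuing the same tasks via submit() and apply_async()
from time import time
import multiprocessing
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

# same list-copying task as above
def task(value):
    # create and copy a large list
    mylist = [i for i in range(20000000)]
    mylist2 = mylist.copy()

# entry point
if __name__ == '__main__':
    # optionally force the start method, e.g. 'fork' (unix only) or 'spawn'
    multiprocessing.set_start_method('spawn')
    # test an increasing number of workers and tasks
    for i in range(1, 5):
        # issue tasks via submit() on the ProcessPoolExecutor
        start = time()
        with ProcessPoolExecutor(max_workers=i) as exe:
            futures = [exe.submit(task, j) for j in range(i)]
            # exiting the context manager waits for all futures to complete
        print(f'submit: {i} workers took {time() - start:.3f} seconds')
        # issue tasks via apply_async() on the multiprocessing Pool
        start = time()
        with Pool(processes=i) as pool:
            results = [pool.apply_async(task, (j,)) for j in range(i)]
            # wait for all tasks to complete
            _ = [r.get() for r in results]
        print(f'apply_async: {i} workers took {time() - start:.3f} seconds')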
Positive Findings
I did discover a little more about the shape of the issue.
The issue was discovered and pointed out to me by William G.
His initial examples used list concatenation in a loop, which remains the most exaggerated example found so far.
Testing showed that the effect is also observed, to a lesser degree, with other operations on large lists, like slicing and copying, as stated above.
Confirming that non-list-based tasks scale across child processes as expected, as we did above, shows that the issue is task-dependent.
The issue does not appear to replicate with dicts, so it may be something specific to large lists. I have not tried other sequence types, like bytes and strings.
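For comparison, a dict-based version of the task might look like the following sketch (the size is arbitrary); substituting something like it into the benchmark above is one way to check the dict case.

# sketch of an equivalent dict-based task for comparison
def task(value):
    # create a large dict
    mydict = {i: i for i in range(20000000)}
    # copy the dict
    mydict2 = mydict.copy()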
The issue does appear to replicate with Python 2.7 (using the Pool class), Python 3.9, and Python 3.10. This suggests the performance profile is intentional or expected and the task is egregiously bad (anti-pythonic) in some specific way.
Hypotheses (I basically have nothing)
My guess is that the tasks are interacting in some way.
This may be in Python, e.g. a lock preventing the tasks from progressing. Not the GIL, as tasks are in separate child processes, but perhaps something deep down in the CPython runtime.
We are not using NumPy/SciPy in these examples, so there is not some back-end library spinning up C-threads to do list operations in parallel (as far as I know).
A quick skim of the code for list objects (see listobject.c) does not show anything obvious.
All tasks occupy their CPU core at 100% for the duration according to top. This may suggest a Python-level lock is less likely, although far from definitive.
Tasks may be interacting at the operating system level.
We may be doing too much object creation somehow, running out of stack or heap memory, and then swapping to disk.
This is hard to imagine, and top does not support it, with memory usage capped at less than 20 megabytes per child process.
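One way to gather this kind of evidence directly, rather than eyeballing top, might be to poll the child processes from the parent with the third-party psutil package, along the lines of the sketch below (the worker count, list size, and sampling interval are arbitrary).

# sketch of monitoring child process cpu and memory usage with psutil
# (requires the third-party psutil package)
from time import sleep
from concurrent.futures import ProcessPoolExecutor
import psutil

# same list-copying task as above
def task(value):
    mylist = [i for i in range(20000000)]
    mylist2 = mylist.copy()

# entry point
if __name__ == '__main__':
    parent = psutil.Process()
    with ProcessPoolExecutor(max_workers=4) as exe:
        # issue the tasks without blocking the parent process
        futures = [exe.submit(task, j) for j in range(4)]
        # sample the child processes while the tasks run
        # note: the first cpu_percent() sample for each process reports 0.0
        while not all(f.done() for f in futures):
            for child in parent.children(recursive=True):
                rss_mb = child.memory_info().rss / (1024 * 1024)
                print(f'pid={child.pid} cpu={child.cpu_percent():.0f}% rss={rss_mb:.1f} MB')
            sleep(0.5)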
Perhaps it is an issue of thrashing RAM somehow, e.g. a hardware bottleneck of some kind.
What is going on?
If you have some good testable theories, please share them in the comments below.
Update: There is a discussion of this issue on StackOverflow here:
Comments on SO suggest the issue is a memory bandwidth bottleneck. I'm not yet convinced; it feels hand-wavy. I'd like a way to gather direct evidence with a profiler or similar.
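One indirect way to test the memory bandwidth theory from Python might be to run a task that is almost pure memory traffic, such as copying large byte buffers, and see whether the same degradation appears; a sketch is below (the buffer size and repeat count are arbitrary, so adjust them to suit your RAM). If this pure-copy task shows the same worsening trend as the list tasks, that would at least be consistent with a bandwidth bottleneck.

# sketch of a memory-traffic-heavy task to probe the bandwidth theory
from time import time
from concurrent.futures import ProcessPoolExecutor

# task that is dominated by raw memory allocation and copying
def task(value):
    # allocate a large buffer (adjust the size to suit your RAM)
    data = bytearray(200 * 1024 * 1024)
    # copy it repeatedly to generate memory traffic
    for _ in range(5):
        copied = bytes(data)

# entry point
if __name__ == '__main__':
    # test an increasing number of workers and tasks
    for i in range(1, 5):
        start = time()
        with ProcessPoolExecutor(max_workers=i) as exe:
            _ = list(exe.map(task, range(i)))
        duration = time() - start
        print(f'>{i} worker/task took {duration:.3f} seconds')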
Takeaways
You now know of some tasks that do not scale with the number of CPU cores when executed in parallel.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Andy Ke says
I tested this code:
for j in range(80000):
    test = test + [j]
On my PC, I got similar results to yours.
BUT on a server, the results are different:
root@server:~# python mp_test.py
17.817729949951172
18.301397562026978
18.36843514442444
19.379492044448853
20.381116151809692
Server’s CPU: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Jason Brownlee says
Fascinating.
Are you able to check which version of Python is installed?
publicznyprofil1 says
For my case it takes:
9.345500946044922
10.307498931884766
10.06449842453003
10.81950068473816