Last Updated on October 28, 2023
You can share a large data structure between child processes and achieve a speedup by operating on the structure in parallel.
Thank you to Carter D. who prompted the development of this tutorial. If you have a Python concurrency problem, reach out; perhaps I can offer some suggestions and develop a tutorial to help you too.
In this tutorial, you will discover how to speed up a program composed of tasks that operate on the same shared large data structure in memory.
Let’s get started.
Problem of Large Data Structure Shared Among Processes
Sharing a large data structure between many child processes for parallel processing in Python can be slow.
Consider this situation:
We have a large, complex data structure in memory: a dict of lists of custom objects or similar that is 500+ megabytes in size.
- The data is stored in custom objects, making the use of a database challenging.
- The size of the data means we are limited in the number of copies we can have in memory.
We need to perform the same general task on the data many times.
- Each task needs full access to the data structure, and the work is computationally intensive (CPU-bound).
Executing the tasks sequentially, one by one, is slow and we wish to speed things up by executing the tasks in parallel.
The computational nature of the tasks means we prefer multiprocessing child processes; threads would probably not be appropriate unless the data structure is composed of NumPy arrays.
The large complex data structure needs to be available to each child process. The use of multiprocessing means that sharing the large data structure can be very slow, potentially requiring inter-process communication. In many cases, this can be even slower than the sequential version of the program.
How can we share the data structure and parallelize the tasks with multiprocessing and achieve a speed-up?
Do you have a problem of this type?
Let me know in the comments below.
How to Share a Large Data Structure Among Processes
A good first approach is to share copies of the in-memory data structure with child processes.
There are three main approaches we can use for this:
- Inherit a copy of the structure in each process
- Pass a copy of the structure to each task via an argument
- Initialize process workers with a copy of the structure once.
Other approaches we might consider include:
- Share data via a Queue or Pipe
- Share data using shared ctypes
- Share data using shared_memory
Using a queue or a pipe is generally equivalent to passing the structure via an argument. In fact, it is likely significantly slower. The use of shared ctypes and shared_memory is precluded given the complex nature of the shared data structure and the use of custom objects (in this hypothetical scenario).
A good second-order approach would be to consider how the data structure could be stored and accessed by separate processes without having to hold the entire structure in memory within each process.
This would require developing a database-like capability where the structure is stored once, in memory or on disk, and queried on demand by child processes.
Some approaches include:
- Host the structure with a manager
- Store the structure in a binary or memory-mapped file
For more ideas of approaches to try, see the tutorials:
Now that we have some ideas of approaches we could explore, let’s look at how we might develop each in turn.
Inherit a Copy of the Data Structure
A good simple approach is to have each child process that performs the task inherit the data structure.
This requires that the “fork” start method be used to create and start child processes. The parent process is “forked”, which copies all of the memory in the parent process into the child process, and this operation is very fast.
The child process can then use the copy of the data “inherited” from the parent to perform its task.
This can be achieved by setting the process start method to “fork” via the set_start_method() function.
For example:
...
# set fork start method
set_start_method('fork')
The data can be stored in a global variable and then accessed directly in child processes. Specifically, the task function executing in the child process would declare the global variable and then make use of it directly.
For example:
# task function executed in the child process
def task():
    # declare inherited global data
    global data
    # use data
    ...
The benefit of this approach is that it is very fast, as the copy of the memory is performed efficiently by the operating system.
The downside is that the fork start method is not available on all platforms, e.g. it is not available on Windows.
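If you are unsure whether your platform supports it, a minimal check (not part of the original example) is to list the supported start methods:

# check whether the 'fork' start method is available on this platform
from multiprocessing import get_all_start_methods

# protect the entry point
if __name__ == '__main__':
    # e.g. ['fork', 'spawn', 'forkserver'] on Linux, only ['spawn'] on Windows
    print(get_all_start_methods())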
You can learn more about setting the process start method in the tutorial:
You can learn more about inheriting data in child processes in the tutorials:
- Multiprocessing Inherit Global Variables in Python
- Inheriting is 34x Faster Than Sending Data Between Processes in Python
Pass Copy of Data Via Argument
Another approach is to pass the data structure as a function argument to the child process.
The task function executed in the child process would be configured to take the data as an argument.
For example:
# task function executed in the child process
def task(data):
    ...
This function can then be executed using a new multiprocessing.Process instance, passing the data via the “args” argument.
For example:
...
# create process, pass data as an argument
process = Process(target=task, args=(data,))
Alternatively, the data could be provided as an argument to each task executed by a multiprocessing.Pool or ProcessPoolExecutor.
The benefit of this approach is that it is very simple.
The downside is that it is slow. The data must be pickled, transmitted, received, and unpickled for each child process.
This approach will likely be slower than the sequential version of the program that does not use child processes.
You can learn more about passing data to child processes via arguments in the tutorial:
Initialize Workers With Copy of Data
We can get an improvement with a variation of the above approach.
Instead of using inter-process communication to share the data structure with a child process for each task, we can share the structure once for a fixed number of worker processes and reuse the workers for each task.
For example, if we had 100 tasks, we might otherwise transmit the structure 100 times. Instead, with 4 workers executing all 100 tasks, we would transmit the structure only 4 times, and each worker would keep and reuse its local copy.
This could be achieved using the ProcessPoolExecutor and the ability to specify a function and arguments to initialize worker processes.
A custom worker initialization function can be defined that takes the data structure as an argument and stores it in a global variable.
For example:
# initialize worker process
def init_worker(data_arg):
    # declare global variable
    global data
    # define the global variable
    data = data_arg
The ProcessPoolExecutor can then be configured to initialize each worker with this function.
For example:
...
# create process pool and init workers with data
exe = ProcessPoolExecutor(4, initializer=init_worker, initargs=(data,))
The task() function executed in the child worker processes can then access the global variable within each worker process.
For example:
# perform a task on the structure
def task(value):
    # declare global variable
    global data
    ...
The benefit of this approach is that the data structure only needs to be shared with each worker process once.
The downside is that we must still use inter-process communication to share the pickled data structure once with each worker process, which is still slow.
You can learn more about initializing ProcessPoolExecutor worker processes in the tutorial:
This same approach can be used with the multiprocessing.Pool class.
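For example, a minimal sketch of the same pattern with multiprocessing.Pool might look like the following (a small toy list stands in for the large structure):

# sketch: initialize Pool workers with one copy of the data each
from multiprocessing import Pool

# store a copy of the data in a per-worker global variable
def init_worker(data_arg):
    # declare global variable
    global data
    # define the global variable
    data = data_arg

# task that reuses the worker's copy of the data
def task(value):
    # declare global variable
    global data
    return sum([value + v for v in data])

# protect the entry point
if __name__ == '__main__':
    # toy data structure
    data = [i for i in range(1000)]
    # each of the 4 workers is initialized with one copy of the data
    with Pool(processes=4, initializer=init_worker, initargs=(data,)) as pool:
        for result in pool.map(task, range(20)):
            print(result)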
For more on initializing workers in the Pool class, see the tutorial:
Share Data with Queue or Pipe
Another approach is to create a shared multiprocessing Queue or Pipe and share the data structure with each child process via the Queue or Pipe.
This is possible, but not recommended.
It is expected to be as slow as passing the structure as an argument to a child process, perhaps even slower.
The reason is that each time data is placed on the Queue or transmitted via the Pipe, it must be pickled and unpickled and one copy of the data structure would have to be transmitted for each task.
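For completeness, a minimal sketch of this idea might look like the following (a small toy list stands in for the large structure):

# sketch: share the data structure with a child process via a Queue
from multiprocessing import Process, Queue

# task executed in a child process
def task(value, queue):
    # block until a copy of the data arrives (pickled and unpickled under the covers)
    data = queue.get()
    # operate on the local copy
    result = sum([value + v for v in data])
    print(f'>task {value} got {result}')

# protect the entry point
if __name__ == '__main__':
    # toy data structure
    data = [i for i in range(1000)]
    # create the shared queue
    queue = Queue()
    # create and start the child process
    process = Process(target=task, args=(0, queue))
    process.start()
    # one copy of the structure must be placed on the queue per task
    queue.put(data)
    # wait for the child to finish
    process.join()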
Nevertheless, if you want to explore this approach, you can learn more about sharing data via a queue in the tutorials:
- Multiprocessing Queue in Python
- Multiprocessing Pipe in Python
- How to Share a Queue with a Multiprocessing Pool
Host the Structure With a Manager
Rather than sharing the data structure, an alternative approach is to store one copy of the structure and access it many times from each task.
One approach to implementing this is to store the structure on a server process and have each child process message the server process with requests or queries.
This can be implemented using the multiprocessing.Manager.
A Manager will create a server process and host objects.
Child processes can then get proxy objects that can be used to interact with the hosted object without copying data back and forth between the processes.
The benefit of this approach is that it avoids copying the data. This may be needed if the data is being updated often while tasks are operating upon it or if the memory on the machine is limited.
The downside is that any data from the main structure needed in the task must be transmitted using inter-process communication, which is slow. This could be addressed by developing helper functions on the data structure to be executed in the server process and sending only summaries and results back to the child processes. This in turn may limit any benefits of operating on the structure in parallel with multiple processes. It really depends on the nature of the tasks and the data needed from the main structure.
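As a rough sketch of this idea (the DataHost class and its query() method are hypothetical, and a small toy list stands in for the large structure), a custom object can be hosted by a manager and queried via proxy objects:

# sketch: host the structure in a manager server process and query it via proxies
from multiprocessing import Process
from multiprocessing.managers import BaseManager

# custom class that wraps the structure and answers queries in the server process
class DataHost:
    def __init__(self):
        # the structure lives only in the manager's server process
        self.data = [i for i in range(1000)]

    def query(self, value):
        # do the heavy work server-side and return only a small result
        return sum([value + v for v in self.data])

# custom manager used to register the hosted class
class CustomManager(BaseManager):
    pass

# task executed in a child process, given only a proxy object
def task(value, proxy):
    result = proxy.query(value)
    print(f'>task {value} got {result}')

# protect the entry point
if __name__ == '__main__':
    # register the hosted class with the manager
    CustomManager.register('DataHost', DataHost)
    # create the manager and its server process
    with CustomManager() as manager:
        # create the hosted object and get back a proxy
        proxy = manager.DataHost()
        # each child interacts with the single hosted copy via the proxy
        processes = [Process(target=task, args=(i, proxy)) for i in range(4)]
        for process in processes:
            process.start()
        for process in processes:
            process.join()

Each call on the proxy transmits only the query and the small result between processes; the large structure itself stays in the server process.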
You can learn more about using a Manager to share custom objects in the tutorial:
Store Data Structure in File
Another approach to centralizing the data structure is to store it in a file on disk.
Each process can then read the structure into memory as needed.
Reading the entire data structure into memory in each task would likely be slower than the sequential version if simple pickling is used. Instead, this process could be made more efficient using a memory-mapped file or storing the data in a binary format.
This would allow tasks to read only those parts of the data structure needed for the task. It may also allow the operating system to cache file data in memory to serve multiple tasks operating on the same structure in parallel (depending on the implementation).
The benefit of this approach is that tasks can read data piecewise which could be more efficient. Note that if tasks only need portions of the data, then perhaps only portions of the data could be shared directly rather than using the disk as intermediate storage.
The downside is that it may only offer a benefit if saving/loading the binary data is faster than pickling and inter-process communication. This may very well be the case, but only if a binary format can be found for the data structure (something other than the pickle module).
It may also only offer a benefit if the tasks only need access to portions of the data rather than the entire data structure.
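As an illustration of this idea (the file name and helper functions are hypothetical, and a list of integers is stored as 64-bit values), the structure could be saved once in a binary format and slices read back on demand via a memory-mapped file:

# sketch: store the structure in a binary file and read only slices of it on demand
from array import array
import mmap

# bytes per 64-bit signed integer
ITEM_SIZE = 8

# write the structure to a binary file once, e.g. in the parent process
def save_data(path, data):
    with open(path, 'wb') as handle:
        array('q', data).tofile(handle)

# read only a slice of the structure via a memory-mapped file
def load_slice(path, start, count):
    with open(path, 'rb') as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # slice out just the bytes needed for this task
            chunk = mapped[start * ITEM_SIZE:(start + count) * ITEM_SIZE]
            values = array('q')
            values.frombytes(chunk)
            return list(values)

# protect the entry point
if __name__ == '__main__':
    # save a toy structure once
    save_data('data.bin', [i for i in range(1000)])
    # a task pays only for the portion it needs
    print(load_slice('data.bin', 10, 5))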
You can learn more about memory-mapped files in the mmap module in the standard library:
You can also see an example of sharing NumPy arrays between processes using a memory-mapped file:
Now that we have explored some ideas on how to share the large data structure, let’s look at some worked examples.
Example of Sequential Version
Before we explore parallel versions that share the data structure between processes, let’s start with a simple and slow sequential version.
In this example, we will define a data structure that is about 500 megabytes in memory. We will then execute a suite of time-consuming tasks on the structure and time the overall execution time of the entire program.
Firstly, we can define a large in-memory data structure.
There are many ways we could do this, although in this case, we will create a large list of integers – mainly because it’s fast to run relative to other structures we could create.
# create the large data structure
def create_data():
    # create a large list of integers
    n = 1024 * 1024 * 64
    return [i for i in range(n)]
Next, we can define the task that operates on the structure.
This is arbitrary and could just be a sleep. Nevertheless, we will make use of the structure: each task adds its unique task id to every value in the list, sums the result, and reports it.
# perform a task on the structure
def task(value, data):
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')
Next, we can define the main() function.
This will first create the large data structure and report its in-memory size in megabytes via the sys.getsizeof() function (note that getsizeof() reports the size of the list object itself, not the integer objects it refers to).
Next, 20 tasks are executed, each provided a unique task id and access to the shared data structure.
The main() function below implements this.
# the main program
def main():
    # create the structure
    data = create_data()
    # determine size of data in mb
    data_size = getsizeof(data) / (1024 * 1024)
    print(f'Prepared data: {data_size:.3f} MiB')
    # perform tasks
    for i in range(20):
        task(i, data)
    print('done')
Finally, we can define the entry point of the program.
This involves executing the main() function. This is wrapped in benchmark code that records the start and end time, then calculates and reports the overall execution time.
# protect the entry point
if __name__ == '__main__':
    # record start time
    time_start = perf_counter()
    # run the program
    main()
    # record end time
    time_end = perf_counter()
    # calculate execution time
    time_duration = time_end - time_start
    # report the execution time
    print(f'Took {time_duration:.3f} seconds')
We use the time.perf_counter() function to record the time, which is preferred for benchmarking.
You can learn more about this function in the tutorial:
Tying this together, the complete example is listed below.
# SuperFastPython.com
# example of sequential tasks on large collection of objects
from sys import getsizeof
from time import perf_counter

# create the large data structure
def create_data():
    # create a large list of integers
    n = 1024 * 1024 * 64
    return [i for i in range(n)]

# perform a task on the structure
def task(value, data):
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')

# the main program
def main():
    # create the structure
    data = create_data()
    # determine size of data in mb
    data_size = getsizeof(data) / (1024 * 1024)
    print(f'Prepared data: {data_size:.3f} MiB')
    # perform tasks
    for i in range(20):
        task(i, data)
    print('done')

# protect the entry point
if __name__ == '__main__':
    # record start time
    time_start = perf_counter()
    # run the program
    main()
    # record end time
    time_end = perf_counter()
    # calculate execution time
    time_duration = time_end - time_start
    # report the execution time
    print(f'Took {time_duration:.3f} seconds')
Running the example first creates the large data structure and reports its size.
In this case, we can see that it is about 559.366 megabytes.
Next, 20 tasks are executed sequentially on the data structure.
Finally, the overall execution time is reported.
In this case, the total execution time for the program is about 71.598 seconds.
Your results may differ.
This provides a baseline in performance that we wish to improve upon using concurrency.
Prepared data: 559.366 MiB
>task 0 got 2251799780130816
>task 1 got 2251799847239680
>task 2 got 2251799914348544
>task 3 got 2251799981457408
>task 4 got 2251800048566272
>task 5 got 2251800115675136
>task 6 got 2251800182784000
>task 7 got 2251800249892864
>task 8 got 2251800317001728
>task 9 got 2251800384110592
>task 10 got 2251800451219456
>task 11 got 2251800518328320
>task 12 got 2251800585437184
>task 13 got 2251800652546048
>task 14 got 2251800719654912
>task 15 got 2251800786763776
>task 16 got 2251800853872640
>task 17 got 2251800920981504
>task 18 got 2251800988090368
>task 19 got 2251801055199232
done
Took 71.598 seconds
Next, let’s see if we can improve the performance of the program using a process pool and data inheritance.
Example of Inheriting the Data Structure
We can explore how to speed up the program using process-based concurrency and share the large data structure using inheritance.
Each task is CPU-bound, meaning we would not get a benefit using threads and instead should consider using processes.
We will use the ProcessPoolExecutor with one worker per physical CPU core, 4 in this case.
...
# create process pool
with ProcessPoolExecutor(4) as exe:
    # ...
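The number of workers should suit your hardware; if you are unsure how many CPUs you have, a quick check (not part of the benchmarked example) is:

# report the number of logical CPUs, to help size the process pool
from os import cpu_count

# protect the entry point
if __name__ == '__main__':
    # note: cpu_count() reports logical CPUs, which may be double the physical cores
    print(cpu_count())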
You can learn more about how to use the ProcessPoolExecutor in the tutorial:
We must configure our program to use the ‘fork‘ method for starting child processes.
This will allow child processes to inherit global variables from the parent process.
...
# set fork start method
set_start_method('fork')
Next, we can explicitly declare and define the “data” variable as a global variable.
We will then issue all 20 tasks to the pool for execution, as shown further below.
...
# declare global variable
global data
# create and define the structure
data = create_data()
We can update the task() function so that it no longer takes the data structure as an argument and instead declares the global variable inherited from the parent process.
# perform a task on the structure
def task(value):
    # declare global variable
    global data
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')
Finally, we can issue all 20 tasks via the submit() method on the ProcessPoolExecutor.
...
# issue all tasks
for i in range(20):
    _ = exe.submit(task, i)
Tying this together, the complete example is listed below.
Note that the fork start method is not available on all platforms, e.g. Windows. If you are on Windows, you may not be able to execute this example or use this approach.
# SuperFastPython.com
# example of concurrent tasks on inherited shared data
from sys import getsizeof
from time import perf_counter
from multiprocessing import set_start_method
from concurrent.futures import ProcessPoolExecutor

# create the large data structure
def create_data():
    # create a large list of integers
    n = 1024 * 1024 * 64
    return [i for i in range(n)]

# perform a task on the structure
def task(value):
    # declare global variable
    global data
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')

# the main program
def main():
    # declare global variable
    global data
    # create and define the structure
    data = create_data()
    # determine size of data in mb
    data_size = getsizeof(data) / (1024 * 1024)
    print(f'Prepared data: {data_size:.3f} MiB')
    # create process pool
    with ProcessPoolExecutor(4) as exe:
        # issue all tasks
        for i in range(20):
            _ = exe.submit(task, i)
    print('done')

# protect the entry point
if __name__ == '__main__':
    # set fork start method
    set_start_method('fork')
    # record start time
    time_start = perf_counter()
    # run the program
    main()
    # record end time
    time_end = perf_counter()
    # calculate execution time
    time_duration = time_end - time_start
    # report the execution time
    print(f'Took {time_duration:.3f} seconds')
Running the example creates the large data structure as before and reports its size.
Next, the process pool is created and all 20 tasks are issued.
Each process that is created is started using the fork start method allowing it to inherit all data from the parent process, such as the data structure stored in a global variable.
The tasks are executed by the child worker processes, each accessing their inherited copy of the data structure.
Once all tasks are done the overall execution time is reported.
In this case, we can see that the program was completed in about 25.685 seconds. This is a speedup of about 2.79x over the sequential version.
Prepared data: 559.366 MiB
>task 1 got 2251799847239680
>task 2 got 2251799914348544
>task 0 got 2251799780130816
>task 3 got 2251799981457408
>task 4 got 2251800048566272
>task 5 got 2251800115675136
>task 6 got 2251800182784000
>task 7 got 2251800249892864
>task 8 got 2251800317001728
>task 10 got 2251800451219456
>task 9 got 2251800384110592
>task 11 got 2251800518328320
>task 12 got 2251800585437184
>task 13 got 2251800652546048
>task 14 got 2251800719654912
>task 15 got 2251800786763776
>task 16 got 2251800853872640
>task 18 got 2251800988090368
>task 19 got 2251801055199232
>task 17 got 2251800920981504
done
Took 25.685 seconds
Next, let’s look at the case of pickling the data structure for each task executed in a separate process.
Example of Passing the Data Structure By Argument
We can explore how to pass the data structure to child processes by argument.
When we pass data to child processes as arguments, it must be pickled in the sending process and unpickled in the receiving process. This can be very slow, and the cost is proportional to the size of the data structure.
As such, we might expect this approach to have worse performance than the inheritance approach described above. Nevertheless, it does not require the ‘fork’ start method and will work on any platform.
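If you want a feel for this cost before running the full example, a small sketch (not part of the original program) can time how long pickling a structure takes:

# estimate the one-way pickling cost of a large list
from pickle import dumps
from time import perf_counter

# protect the entry point
if __name__ == '__main__':
    # a smaller list than the main example, to keep the check quick
    data = [i for i in range(1024 * 1024)]
    # time the pickling of the structure
    time_start = perf_counter()
    payload = dumps(data)
    time_duration = perf_counter() - time_start
    # this cost is paid (plus unpickling) for every copy sent to a child process
    print(f'pickled {len(payload) / (1024 * 1024):.1f} MiB in {time_duration:.3f} seconds')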
This approach requires updating the task() function to take the structure as an argument.
# perform a task on the structure
def task(value, data):
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')
It then requires that we update the submit() method on the ProcessPoolExecutor to pass the data structure to the task() function as an argument.
...
# create process pool
with ProcessPoolExecutor(4) as exe:
    # issue all tasks
    for i in range(20):
        _ = exe.submit(task, i, data)
The updated main() function with this change is listed below.
# the main program
def main():
    # create the structure
    data = create_data()
    # determine size of data in mb
    data_size = getsizeof(data) / (1024 * 1024)
    print(f'Prepared data: {data_size:.3f} MiB')
    # create process pool
    with ProcessPoolExecutor(4) as exe:
        # issue all tasks
        for i in range(20):
            _ = exe.submit(task, i, data)
    print('done')
Tying this together, the complete example is listed below.
# SuperFastPython.com
# example of concurrent tasks on shared data via argument
from sys import getsizeof
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor

# create the large data structure
def create_data():
    # create a large list of integers
    n = 1024 * 1024 * 64
    return [i for i in range(n)]

# perform a task on the structure
def task(value, data):
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')

# the main program
def main():
    # create the structure
    data = create_data()
    # determine size of data in mb
    data_size = getsizeof(data) / (1024 * 1024)
    print(f'Prepared data: {data_size:.3f} MiB')
    # create process pool
    with ProcessPoolExecutor(4) as exe:
        # issue all tasks
        for i in range(20):
            _ = exe.submit(task, i, data)
    print('done')

# protect the entry point
if __name__ == '__main__':
    # record start time
    time_start = perf_counter()
    # run the program
    main()
    # record end time
    time_end = perf_counter()
    # calculate execution time
    time_duration = time_end - time_start
    # report the execution time
    print(f'Took {time_duration:.3f} seconds')
Running the example creates the large data structure as before and reports its size.
Next, the process pool is created and all 20 tasks are issued with the data argument, one copy per task.
The tasks are executed by the child worker processes, each accessing their own passed-in copy of the data structure.
Once all tasks are done the overall execution time is reported.
In this case, we can see that the program was completed in about 142.346 seconds.
This is significantly slower than the sequential version of the program, almost 2x slower. The worse performance is not surprising given that one copy of the data structure is pickled and transmitted for each task.
This highlights how we can indeed achieve worse performance when using parallelism, due to the overhead of inter-process communication.
Prepared data: 559.366 MiB
>task 0 got 2251799780130816
>task 4 got 2251800048566272
>task 8 got 2251800317001728
>task 12 got 2251800585437184
>task 16 got 2251800853872640
>task 1 got 2251799847239680
>task 5 got 2251800115675136
>task 9 got 2251800384110592
>task 13 got 2251800652546048
>task 17 got 2251800920981504
>task 2 got 2251799914348544
>task 6 got 2251800182784000
>task 10 got 2251800451219456
>task 14 got 2251800719654912
>task 18 got 2251800988090368
>task 3 got 2251799981457408
>task 7 got 2251800249892864
>task 11 got 2251800518328320
>task 15 got 2251800786763776
>task 19 got 2251801055199232
done
Took 142.346 seconds
Next, let’s look at the case of passing the data as an argument in a more efficient manner.
Example of Global Data Structure Per Worker
We can explore a more efficient way of passing the data structure to child processes.
Instead of passing one copy of the structure to child processes per task as we did in the previous example, we can share one copy of the structure per child process and reuse it across each task.
This means that if we have 4 worker processes, we only need to create and transmit 4 copies of the data.
This can be achieved by initializing each worker process in the ProcessPoolExecutor with a copy of the data structure, having each worker store the structure in a global variable, and then having each task access the worker's copy via that global variable.
It provides an intermediate approach between the first approach using data inheritance and the second approach that passes a copy of the data per task. We pass one copy per process and each task reuses the same structure.
Firstly, we can define a worker process initialization function that takes the data structure as an argument, declares a global variable, and stores the argument in the global variable.
# initialize worker process
def init_worker(data_arg):
    # declare global variable
    global data
    # define the global variable
    data = data_arg
Next, we can update the task() function to no longer take the data structure as a function argument and instead declare the data global variable.
# perform a task on the structure
def task(value):
    # declare global variable
    global data
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')
We can then create the ProcessPoolExecutor and initialize each worker process using our custom initialization function and pass in a copy of the data structure.
...
# create process pool
with ProcessPoolExecutor(4, initializer=init_worker, initargs=(data,)) as exe:
    # ...
You can learn more about initializing ProcessPoolExecutor worker processes in the tutorial:
Finally, we can execute the tasks without passing a copy of the data structure for each task.
...
# issue all tasks
for i in range(20):
    _ = exe.submit(task, i)
Tying this together, the complete example is listed below.
# SuperFastPython.com
# example of concurrent tasks on shared data via init worker
from sys import getsizeof
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor

# create the large data structure
def create_data():
    # create a large list of integers
    n = 1024 * 1024 * 64
    return [i for i in range(n)]

# initialize worker process
def init_worker(data_arg):
    # declare global variable
    global data
    # define the global variable
    data = data_arg

# perform a task on the structure
def task(value):
    # declare global variable
    global data
    # iterate the structure and do something
    result = sum([value + v for v in data])
    # report result
    print(f'>task {value} got {result}')

# the main program
def main():
    # create the structure
    data = create_data()
    # determine size of data in mb
    data_size = getsizeof(data) / (1024 * 1024)
    print(f'Prepared data: {data_size:.3f} MiB')
    # create process pool
    with ProcessPoolExecutor(4, initializer=init_worker, initargs=(data,)) as exe:
        # issue all tasks
        for i in range(20):
            _ = exe.submit(task, i)
    print('done')

# protect the entry point
if __name__ == '__main__':
    # record start time
    time_start = perf_counter()
    # run the program
    main()
    # record end time
    time_end = perf_counter()
    # calculate execution time
    time_duration = time_end - time_start
    # report the execution time
    print(f'Took {time_duration:.3f} seconds')
Running the example creates the large data structure as before and reports its size.
Next, the process pool is created and each worker in the pool is initialized with a copy of the data structure. The initialization function runs for each process and declares and defines a process-specific global variable for its copy of the data structure.
All 20 tasks are issued and the tasks are executed by the child worker processes, each accessing their own local copy of the data structure.
Once all tasks are done the overall execution time is reported.
In this case, we can see that the program was completed in about 41.876 seconds.
This is a speedup of about 1.71x over the sequential version of the program, although it is still slower than the version that used data inheritance.
This example highlights how we can achieve some benefits by using an intermediate approach between the data inheritance and passing data copies by function arguments.
Prepared data: 559.366 MiB
>task 3 got 2251799981457408
>task 8 got 2251800317001728
>task 12 got 2251800585437184
>task 16 got 2251800853872640
>task 5 got 2251800115675136
>task 9 got 2251800384110592
>task 13 got 2251800652546048
>task 17 got 2251800920981504
>task 0 got 2251799780130816
>task 2 got 2251799914348544
>task 7 got 2251800249892864
>task 10 got 2251800451219456
>task 15 got 2251800786763776
>task 18 got 2251800988090368
>task 1 got 2251799847239680
>task 4 got 2251800048566272
>task 6 got 2251800182784000
>task 11 got 2251800518328320
>task 14 got 2251800719654912
>task 19 got 2251801055199232
done
Took 41.876 seconds
Next, let’s review and compare the performance of each approach.
Comparison of Results
We have explored four different versions of the program that create a large in-memory data structure and execute a suite of tasks on that same structure.
The table below summarizes the execution times and the speedups of each approach over the sequential version.
Approach       | Time (sec) | Speedup
---------------|------------|--------
Sequential     | 71.598     | n/a
Inherited      | 25.685     | 2.79x
Argument       | 142.346    | 0.50x
Initialization | 41.876     | 1.71x
We can see that the sequential version of the program provides the baseline execution time for any improved version to improve upon.
The updated version that used 4 child worker processes and data inheritance was the fastest, with a speedup of 2.79x. This approach would likely continue to scale as more CPU cores were made available and more child processes were created in response. Although fast, this approach is limited to those platforms that support the fork start method for child processes.
The updated version that passed one copy of the data structure to each task via function argument showed the worst performance, with a speedup of 0.5x, that is, a slowdown of 2x compared to the sequential version. The reason for the worse performance is the inter-process communication required to share a copy of the data structure with each task. It is a reminder to avoid pickling and transmission of large data structures between processes whenever possible.
The updated version that created one copy of the data per worker process that was reused for each task provided a good middle ground between the two previous approaches, and most importantly, a speedup of about 1.71x compared to the sequential version. This approach too is expected to continue to scale as the number of physical CPU cores and child processes is increased.
We have only explored examples that share copies of the data structure with child processes. There is scope to explore versions of the program that host the structure in a server process and versions that store the structure in a binary file. These are left as an exercise to the reader to explore.
Further Reading
This section provides additional resources that you may find helpful.
Python Multiprocessing Books
- Python Multiprocessing Jump-Start, Jason Brownlee (my book!)
- Multiprocessing API Interview Questions
- Multiprocessing API Cheat Sheet
I would also recommend specific chapters in the books:
- Effective Python, Brett Slatkin, 2019.
- See: Chapter 7: Concurrency and Parallelism
- High Performance Python, Ian Ozsvald and Micha Gorelick, 2020.
- See: Chapter 9: The multiprocessing Module
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See: Chapter 14: Threads and Processes
Guides
- Python Multiprocessing: The Complete Guide
- Python Multiprocessing Pool: The Complete Guide
- Python ProcessPoolExecutor: The Complete Guide
Takeaways
You now know how to speed up a program composed of tasks that operate on the same shared large data structure in memory.
Did I make a mistake? See a typo?
I’m a simple humble human. Correct me, please!
Do you have any additional tips?
I’d love to hear about them!
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Photo by Daniel Salgado on Unsplash