Last Updated on August 21, 2023
Deleting a single file is fast, but deleting many files can be slow.
It can become painfully slow when you need to delete thousands of files. The hope is that multithreading can be used to speed up the file deletion operation.
In this tutorial, you will explore how to delete thousands of files using multithreading.
Let’s dive in.
Create 10,000 Files To Delete
Before we can delete thousands of files, we must create them.
We can write a script that will create 10,000 CSV files, each file with 10,000 lines and each line containing 10 random numeric values.
Each line of ten random floats stored as text is roughly 190 bytes, so each file will be about 1.9 megabytes in size, and all 10,000 files together will be about 19 gigabytes in size.
First, we can create a function that will take a file path and string data and save it to file.
The save_file() function below implements this, taking the full path and the string data, opening the file in text mode for writing, and writing the data to the file. A context manager is used so that the file is closed automatically.
# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)
Next, we can generate the data to be saved.
First, we can write a function to generate a single line of data. In this case, a line is composed of 10 random numeric values in the range between 0 and 1, stored in CSV format (i.e. values separated by commas).
The generate_line() function below implements this, returning a string of random data.
# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])
Next, we can write a function to generate the contents of one file.
This will be 10,000 lines of random values, where each line is separated by a new line.
The generate_file_data() function below implements this, returning a string representing the data for a single data file.
# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)
Finally, we can generate data for all of the files and save each with a separate file name.
First, we can create a directory to store all of the created files (e.g. 'tmp') under the current working directory.
This can be achieved using the os.makedirs() function, for example:
...
# create a local directory to save files
makedirs(path, exist_ok=True)
We can then loop 10,000 times and create the data and a unique filename for each iteration, then save the generated contents of the file to disk.
...
# create all files
for i in range(10000):
    # generate data
    data = generate_file_data()
    # create filenames
    filepath = join(path, f'data-{i:04d}.csv')
    # save data file
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}')
The generate_all_files() function listed below implements this, creating the 10,000 files of 10,000 lines of data.
# generate 10K files in a directory
def generate_all_files(path='tmp'):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # create all files
    for i in range(10000):
        # generate data
        data = generate_file_data()
        # create filenames
        filepath = join(path, f'data-{i:04d}.csv')
        # save data file
        save_file(filepath, data)
        # report progress
        print(f'.saved {filepath}')
Tying this together, the complete example of creating a large number of CSV files is listed below.
# SuperFastPython.com
# create a large number of data files
from os import makedirs
from os.path import join
from random import random

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])

# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)

# generate 10K files in a directory
def generate_all_files(path='tmp'):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # create all files
    for i in range(10000):
        # generate data
        data = generate_file_data()
        # create filenames
        filepath = join(path, f'data-{i:04d}.csv')
        # save data file
        save_file(filepath, data)
        # report progress
        print(f'.saved {filepath}')

# entry point, generate all of the files
generate_all_files()
Running the example will create 10,000 CSV files of random data in a tmp/ sub-directory.
It can take a long time to run, depending on the speed of your hard drive; the SSDs used in modern computer systems are quite fast.
On my system it takes about 11 minutes to complete.
...
.saved tmp/data-9990.csv
.saved tmp/data-9991.csv
.saved tmp/data-9992.csv
.saved tmp/data-9993.csv
.saved tmp/data-9994.csv
.saved tmp/data-9995.csv
.saved tmp/data-9996.csv
.saved tmp/data-9997.csv
.saved tmp/data-9998.csv
.saved tmp/data-9999.csv
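Before moving on, it can be worth confirming that the files exist and are about the expected size. The short sketch below is not part of the original example; it simply sums os.path.getsize() over the directory listing.

# report the number of generated files and their total size
from os import listdir
from os.path import join, getsize

def report_files(path='tmp'):
    # build the full path for every file in the directory
    filepaths = [join(path, name) for name in listdir(path)]
    # sum the sizes of all files in bytes
    total = sum(getsize(filepath) for filepath in filepaths)
    # report the count and the total size in gigabytes
    print(f'{len(filepaths)} files, {total / 1e9:.1f} GB')

# entry point
report_files()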
Next, we can explore ways of deleting all of the files from the directory.
Delete Files One-By-One (slowly)
Deleting all of the files in a directory in Python is relatively straightforward.
A file can be deleted in Python by calling the os.remove() function and specifying the path to the file.
For example:
...
# delete the file
remove(filepath)
First, we can create a list of all file paths to delete.
This can be achieved by calling the os.listdir() function to list the contents of the directory, then calling the os.path.join() function on each file name to join it with the directory name and create a full path.
This can be achieved in a list comprehension, for example:
...
# create a list of all files to delete
files = [join(path, filename) for filename in listdir(path)]
We can then iterate over all of the prepared file paths and call the os.remove() function on each.
...
# delete all files
for filepath in files:
    # delete the file
    remove(filepath)
Tying this together, the complete example of deleting all files is listed below.
# SuperFastPython.com
# delete files in a directory sequentially
from os import listdir
from os import remove
from os.path import join

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # delete all files
    for filepath in files:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}')
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example will iterate over all files in the tmp/ sub-directory and delete each file in turn, one-by-one.
Even though there are 10,000 files and about 19 gigabytes of data total, deleting the files is relatively fast.
On my system it finishes in under a second, specifically 671.6 milliseconds (recall there are 1,000 milliseconds in a second).
[Finished in 666ms]
[Finished in 647ms]
[Finished in 702ms]
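The timings reported above come from my editor. If you prefer to time the script yourself, a minimal sketch using time.perf_counter() (not part of the original example) can wrap the call to the main() function defined above.

# time the file deletion by wrapping the main() function defined above
from time import perf_counter

# entry point
if __name__ == '__main__':
    # record the start time
    start = perf_counter()
    # delete all files in the tmp/ directory
    main()
    # report the overall duration in milliseconds
    duration_ms = (perf_counter() - start) * 1000
    print(f'Took {duration_ms:.1f} ms')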
How long did the example take to run on your system?
Let me know in the comments below.
Even though the deletion process is fast, we can use this as a baseline to see if the file deletion process can be sped up using multiple threads.
How to Delete Files Concurrently
File deletion is performed by the operating system, which calls down to the file system; deleting a file typically amounts to updating the file system's metadata (such as the file allocation table or an equivalent structure) rather than touching the file's data.
We might expect that file deletion is an IO-bound task, limited by the speed that files can be “deleted” (perhaps just dropped from the allocation table) by the hard disk drive.
This suggests that threads may be appropriate to perform concurrent deletion operations. This is because the CPython interpreter will release the global interpreter lock when performing IO operations, perhaps such as deleting a file. This is opposed to processes that are better suited to CPU-bound tasks, that execute as fast as the CPU can run them.
Given that deletion does not involve moving data around on the disk, deletion is a fast process generally.
Ultimately, we might expect that file deletion on the disk is performed sequentially; that is, the disk can only delete one file at a time.
As such, our expectation may be that file deletion cannot be performed in parallel on a single disk and using threads will not offer any speed-up.
Let’s explore this in the next section.
Delete Files Concurrently With Threads
The ThreadPoolExecutor provides a pool of worker threads that we can use to multithread the deleting of thousands of data files.
Each file in the tmp/ directory will represent a task that will require the file to be deleted from disk by a call to the os.remove() function.
First, we can define a function that takes the path of a file to delete, removes the file, then reports progress.
The delete_file() function below implements this.
# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}')
Next, we can call this function for each file in the tmp/ directory.
We can then create the thread pool with ten worker threads. We will use the context manager to ensure the thread pool is closed automatically once all pending and running tasks have been completed.
...
# create the thread pool
with ThreadPoolExecutor(10) as exe:
    # ...
We can then call the submit() function for each file to be deleted, calling the delete_file() function with the file path as an argument.
This can be achieved in a list comprehension, and the resulting list of Future objects can be ignored as we do not require any result from the task.
...
# delete all files
_ = [exe.submit(delete_file, filepath) for filepath in files]
Tying this together, the complete example of multithreaded deletion of files from one directory is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with threads
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ThreadPoolExecutor

# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}')

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # create a thread pool
    with ThreadPoolExecutor(10) as exe:
        # delete all files
        _ = [exe.submit(delete_file, filepath) for filepath in files]
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all files from the directory, as before.
...
.deleted tmp/data-0962.csv
.deleted tmp/data-9946.csv
.deleted tmp/data-4292.csv
.deleted tmp/data-0751.csv
.deleted tmp/data-8127.csv
.deleted tmp/data-6334.csv
.deleted tmp/data-0792.csv
.deleted tmp/data-2191.csv
.deleted tmp/data-8866.csv
.deleted tmp/data-2152.csv
Done
We do not see any speed benefit from multithreading the file deletion process, nor any penalty that slows it down.
On my system it takes about 629.6 milliseconds, which is roughly the same, probably within the margin of measurement error, as the sequential case that took 671.6 milliseconds.
[Finished in 622ms]
[Finished in 645ms]
[Finished in 622ms]
How long did it take to execute on your system?
Let me know in the comments below.
We can try increasing the number of worker threads to see if it offers any further benefit.
For example, we could try ten times more threads, e.g. 100 workers.
This requires simply changing the argument to the ThreadPoolExecutor class from 10 to 100.
...
# create the thread pool
with ThreadPoolExecutor(100) as exe:
    # ...
Running the example with the increase in worker threads deletes all files as before.
In this case, we do not see an improvement in performance. In fact we see a small increase in the overall task duration.
On my system it took about 723.3 milliseconds with 100, compared to about 629.6 milliseconds with 10 threads. This may be within the margin of measurement error, or perhaps caused by the overhead of having to manage additional threads.
[Finished in 754ms]
[Finished in 719ms]
[Finished in 697ms]
You might like to test additional configurations such as 500 or even 1,000 threads to see if it makes a reasonable difference in speed.
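If you want to compare several configurations in one place, a rough benchmarking sketch like the one below may help; it assumes the delete_file() function defined above and that the tmp/ directory is regenerated (e.g. with the file-creation script) before each trial.

# benchmark the thread pool for a given number of worker threads
from time import perf_counter
from os import listdir
from os.path import join
from concurrent.futures import ThreadPoolExecutor

def benchmark(n_workers, path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # record the start time
    start = perf_counter()
    # delete all files with the given number of worker threads
    with ThreadPoolExecutor(n_workers) as exe:
        _ = [exe.submit(delete_file, filepath) for filepath in files]
    # report the duration for this configuration
    print(f'{n_workers} workers: {(perf_counter() - start) * 1000:.1f} ms')

# try one configuration per run, e.g. 500 workers (regenerate tmp/ between runs)
benchmark(500)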
Delete Files Concurrently with Threads in Batch
Each task in the thread pool adds overhead that slows down the overall task.
This includes function calls and object creation for each task, occurring internally within the ThreadPoolExecutor class.
We might reduce this overhead by performing file operations in a batch within each worker thread.
This can be achieved by updating the delete_file() function to receive a list of files to delete, and splitting up the files in the main() function into chunks to be submitted to worker threads for batch processing.
The hope is that this would reduce some of the overhead required for submitting so many small tasks by replacing it with a few larger batch tasks.
First, we can change the target function to receive a list of file names to delete, listed below.
# delete files and report progress
def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}')
Next, in the main() function we can select a chunksize based on the number of worker threads and the number of files to delete.
First, we must create a list of all files to delete that we can split up later.
...
# create a list of all files to delete
files = [join(path, filename) for filename in listdir(path)]
In this case, we will use 10 worker threads for the 10,000 files to delete. Dividing the files evenly between workers (10,000 / 10) gives 1,000 files for each worker to delete.
...
# determine chunksize
n_workers = 10
chunksize = round(len(files) / n_workers)
Next, we can create the thread pool with the parameterized number of worker threads.
...
# create the thread pool
with ThreadPoolExecutor(n_workers) as exe:
    ...
We can then iterate over the list of files to delete and split them into chunks defined by a chunksize.
This can be achieved by iterating over the list of files and using the chunksize as the increment size in a for-loop. We can then split off a chunk of files to send to a worker thread as a task.
...
# split the delete operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit the batch delete task
    _ = exe.submit(delete_files, filenames)
And that’s it.
The complete example of deleting files in a batch concurrently using worker threads is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with threads in batch
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ThreadPoolExecutor

# delete files and report progress
def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}')

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # determine chunksize
    n_workers = 10
    chunksize = round(len(files) / n_workers)
    # create the thread pool
    with ThreadPoolExecutor(n_workers) as exe:
        # split the delete operations into chunks
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # submit the batch delete task
            _ = exe.submit(delete_files, filenames)
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all 10,000 files from the tmp/ sub-directory.
...
.deleted tmp/data-5412.csv
.deleted tmp/data-3413.csv
.deleted tmp/data-9228.csv
.deleted tmp/data-1674.csv
.deleted tmp/data-7205.csv
.deleted tmp/data-8136.csv
.deleted tmp/data-1660.csv
.deleted tmp/data-7211.csv
.deleted tmp/data-8122.csv
.deleted tmp/data-4718.csv
Done
In this case, we do see a lift in performance.
On my system it takes about 394.6 milliseconds, compared to about 671.6 milliseconds in the case where files were deleted one-by-one. This is about a 1.7x speed improvement.
Not bad.
[Finished in 395ms]
[Finished in 417ms]
[Finished in 372ms]
Delete Files Concurrently with Processes
We can also try to delete files concurrently with processes instead of threads.
It is unclear whether processes can offer a speed benefit in this case. Given that we can get a benefit using threads with batch deletion, we know it is possible. Nevertheless, using processes requires data for each task to be serialised which introduces additional overhead that might negate any speed-up from performing file operations with true parallelism via processes.
We can explore using processes to delete files concurrently using the ProcessPoolExecutor.
This can be achieved by switching out the ThreadPoolExecutor for the ProcessPoolExecutor and specifying the number of worker processes.
We will use 8 processes in this case as I have 8 logical CPU cores, although you may want to try different configurations based on the number of logical CPU cores you have available.
...
# create the process pool
with ProcessPoolExecutor(8) as exe:
    # ...
Tying this together, the complete example of deleting files concurrently using the process pool is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with processes
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ProcessPoolExecutor

# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}', flush=True)

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # create a process pool
    with ProcessPoolExecutor(8) as exe:
        # delete all files
        _ = [exe.submit(delete_file, filepath) for filepath in files]
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all files as before.
...
.deleted tmp/data-5832.csv
.deleted tmp/data-2185.csv
.deleted tmp/data-5826.csv
.deleted tmp/data-2191.csv
.deleted tmp/data-1498.csv
.deleted tmp/data-0786.csv
.deleted tmp/data-7957.csv
.deleted tmp/data-6491.csv
.deleted tmp/data-5198.csv
.deleted tmp/data-4286.csv
Done
We do not see any lift in performance. In fact, we see a dramatic drop in performance compared to both the sequential and multithreaded examples.
On my system it takes about 2.23 seconds, compared to about 671.6 milliseconds for the sequential example above. This is about 3.3x slower.
The dramatic decrease in performance is very likely due to the overhead in serialising the arguments for the 10,000 tasks.
[Finished in 1.9s]
[Finished in 2.4s]
[Finished in 2.4s]
Delete Files Concurrently with Processes in Batch
Any data sent to a worker process or received from a worker process must be serialised (pickled).
This adds overhead for each task executed by worker processes.
This is likely the cause of worse performance seen when using the process pool in the previous section.
We can address this by batching the delete tasks into chunks to be executed by each worker process, just as we did when we batched filenames for the thread pools.
Again, this can be achieved by adapting the batched version of ThreadPoolExecutor from a previous section to use the ProcessPoolExecutor class.
We will use 8 worker processes. With 10,000 files to be deleted, this means that each worker will receive a batch of 1,250 files to delete.
...
# determine chunksize
n_workers = 8
chunksize = round(len(files) / n_workers)
We can then create the process pool using the parameterized number of workers.
...
# create the process pool
with ProcessPoolExecutor(n_workers) as exe:
    # ...
And that’s it.
The complete example of batch deleting files using process pools is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with processes in batch
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ProcessPoolExecutor

# delete files and report progress
def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}', flush=True)

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(files) / n_workers)
    # create the process pool
    with ProcessPoolExecutor(n_workers) as exe:
        # split the delete operations into chunks
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # submit the batch delete task
            _ = exe.submit(delete_files, filenames)
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all 10,000 files as we did previously.
...
.deleted tmp/data-7920.csv
.deleted tmp/data-2680.csv
.deleted tmp/data-2858.csv
.deleted tmp/data-2694.csv
.deleted tmp/data-9919.csv
.deleted tmp/data-0083.csv
.deleted tmp/data-7934.csv
.deleted tmp/data-6394.csv
.deleted tmp/data-4583.csv
.deleted tmp/data-5845.csv
Done
On my system, the program took about 497.3 milliseconds, which may be slightly slower than the batch multithreaded example that took about 394.6 milliseconds, although slightly faster than the sequential example that took about 671.6 milliseconds (about 1.35x faster).
[Finished in 519ms]
[Finished in 481ms]
[Finished in 492ms]
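As an aside, the executor's map() method accepts a chunksize argument that batches the submitted tasks for you, which may achieve a similar effect with less code. A minimal sketch, assuming the one-file-at-a-time delete_file() function from the earlier process pool example, might look like the following.

# delete files with a process pool, letting map() batch the tasks
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ProcessPoolExecutor

# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}', flush=True)

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # create the process pool
    with ProcessPoolExecutor(8) as exe:
        # issue the delete tasks in chunks of 1,250 file paths per batch
        _ = exe.map(delete_file, files, chunksize=1250)
    print('Done')

# entry point
if __name__ == '__main__':
    main()

Whether this performs as well as the manual batching above is something you would need to measure on your own system.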
Delete Files Concurrently with AsyncIO
We can also delete files concurrently using asyncio.
Generally, Python does not support non-blocking operations when working with files; this capability is not provided in the asyncio module because it is challenging to implement in a general, cross-platform manner.
The third-party aiofiles library provides file operations for use in asyncio programs, but again, the operations are not true non-blocking IO; instead, the concurrency is simulated using threads.
Nevertheless, we can use aiofiles in an asyncio program to delete files concurrently.
First, we can update the delete_file() function to be an async function that can be awaited and change the call to the os.remove() function to use the aiofiles.os.remove() that can be awaited.
The updated version of delete_file() is listed below.
# delete a file and report progress
async def delete_file(filepath):
    # delete the file
    await remove(filepath)
    # report progress
    print(f'.deleted {filepath}')
Next, we can update the main() function.
First, we can create a list of coroutines (asynchronous functions) that will execute concurrently. Each coroutine will be a call to the delete_file() function with a separate file to delete.
This can be achieved in a list comprehension, for example:
...
# create a list of coroutines for deleting files
tasks = [delete_file(name) for name in filepaths]
You might recall that this only defines the list of coroutines; the asynchronous function is not executed while the list is being constructed.
Next, we can execute the list of coroutines and wait for them to complete. This will attempt to perform all 10,000 file delete operations concurrently, allowing each task to yield control while waiting for the delete operation to complete via the await within the delete_file() function.
This can be achieved using the asyncio.gather() function that takes a collection of async function calls. Given that we have a list of these calls, we can unpack the function calls with the star (*) operator.
...
# execute and wait for all tasks to complete
await asyncio.gather(*tasks)
Finally, we can start the asyncio event loop by calling the asyncio.run() function and pass it a call to our main() function.
# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
The complete example of deleting files concurrently with asyncio is listed below.
# SuperFastPython.com
# delete all files from a directory concurrently with asyncio
from os import listdir
from os.path import join
from aiofiles.os import remove
import asyncio

# delete a file and report progress
async def delete_file(filepath):
    # delete the file
    await remove(filepath)
    # report progress
    print(f'.deleted {filepath}')

# delete all files in a directory
async def main(path='tmp'):
    # create a list of all file paths to delete
    filepaths = [join(path, filename) for filename in listdir(path)]
    # create a list of coroutines for deleting files
    tasks = [delete_file(name) for name in filepaths]
    # execute and wait for all tasks to complete
    await asyncio.gather(*tasks)
    print('Done')

# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
Running the example deletes all 10,000 files as before.
...
.deleted tmp/data-5826.csv
.deleted tmp/data-7957.csv
.deleted tmp/data-5198.csv
.deleted tmp/data-2191.csv
.deleted tmp/data-0792.csv
.deleted tmp/data-4286.csv
.deleted tmp/data-4292.csv
.deleted tmp/data-2634.csv
.deleted tmp/data-6491.csv
.deleted tmp/data-5832.csv
Done
We do not see any benefit in using asyncio to delete files concurrently. In fact, the results are worse than the sequential example we developed first.
On my system, the program completes in about 1.33 seconds, which is about 2x worse than the sequential example that took about 671.6 milliseconds.
This is not surprising given that asyncio is operating in a single thread, that aiofiles is simulating non-blocking IO using threads internally and that we are attempting to run 10,000 coroutines simultaneously.
Nevertheless, it was an interesting exercise in case Python does support non-blocking file operations in the future.
[Finished in 1.3s]
[Finished in 1.3s]
[Finished in 1.4s]
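Another option worth sketching, before turning to batching in the next section, is to keep one coroutine per file but cap how many delete operations are in flight at once using an asyncio.Semaphore. This is not part of the original example, and I would not expect it to beat the sequential version for the same reasons given above.

# delete files concurrently with asyncio, limited by a semaphore
from os import listdir
from os.path import join
from aiofiles.os import remove
import asyncio

# delete a file once a slot on the semaphore is available
async def delete_file(semaphore, filepath):
    async with semaphore:
        # delete the file
        await remove(filepath)
        # report progress
        print(f'.deleted {filepath}')

# delete all files in a directory
async def main(path='tmp'):
    # create a list of all file paths to delete
    filepaths = [join(path, filename) for filename in listdir(path)]
    # allow at most 100 delete operations to run at the same time
    semaphore = asyncio.Semaphore(100)
    # create one coroutine per file and wait for them all to complete
    await asyncio.gather(*[delete_file(semaphore, fp) for fp in filepaths])
    print('Done')

# entry point
if __name__ == '__main__':
    asyncio.run(main())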
Delete Files Concurrently with AsyncIO in Batch
Perhaps we can reduce some of the overhead in deleting files concurrently with asyncio by performing tasks in batch.
In the previous example, each coroutine performed a single IO-bound operation. We can update the delete_file() function to process multiple files in a batch as we did in previous batch processing examples.
This will reduce the number of coroutines in the asyncio runtime and may reduce the amount of data moving around. At the very least, it may be an interesting exercise.
The adaptation of the asyncio example to delete files in batch is very similar to the previous examples that we adapted to operate in batch.
The updated version of the task function that will process a batch of files to delete is listed below. The task will yield control for each delete operation, but only to other coroutines. The for-loop within a task will not progress while it is waiting.
# delete files and report progress
async def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        await remove(filepath)
        # report progress
        print(f'.deleted {filepath}')
Next, we can choose the number of coroutines we wish to use and then divide the total number of files up evenly into chunks.
We will use 100 in this case, chosen arbitrarily.
...
# determine chunksize
n_workers = 100
chunksize = round(len(filepaths) / n_workers)
We can then iterate over the chunks and add each coroutine to a list for execution at the end.
...
# split the delete operations into chunks
tasks = list()
for i in range(0, len(filepaths), chunksize):
    # select a chunk of filenames
    filenames = filepaths[i:(i + chunksize)]
    # create the batch delete task
    tasks.append(delete_files(filenames))
Finally, we can execute all 100 coroutines concurrently.
...
# execute all coroutines concurrently and wait for them to finish
await asyncio.gather(*tasks)
Tying this together, a complete example of batch file deleting using asyncio is listed below.
Again, we don’t expect a speed benefit over the single-threaded example, but perhaps a small lift in performance over the previous example where we had 10,000 coroutines operating concurrently.
# SuperFastPython.com
# delete all files from a directory concurrently with asyncio in batch
from os import listdir
from os.path import join
from aiofiles.os import remove
import asyncio

# delete files and report progress
async def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        await remove(filepath)
        # report progress
        print(f'.deleted {filepath}')

# delete all files in a directory
async def main(path='tmp'):
    # create a list of all files to delete
    filepaths = [join(path, filename) for filename in listdir(path)]
    # determine chunksize
    n_workers = 100
    chunksize = round(len(filepaths) / n_workers)
    # split the delete operations into chunks
    tasks = list()
    for i in range(0, len(filepaths), chunksize):
        # select a chunk of filenames
        filenames = filepaths[i:(i + chunksize)]
        # create the batch delete task
        tasks.append(delete_files(filenames))
    # execute all coroutines concurrently and wait for them to finish
    await asyncio.gather(*tasks)
    print('Done')

# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
Running the example deletes approximately 19 gigabytes of files as we did in all previous examples.
...
.deleted tmp/data-2208.csv
.deleted tmp/data-2959.csv
.deleted tmp/data-8523.csv
.deleted tmp/data-5210.csv
.deleted tmp/data-1702.csv
.deleted tmp/data-1216.csv
.deleted tmp/data-9729.csv
.deleted tmp/data-4286.csv
.deleted tmp/data-7942.csv
.deleted tmp/data-4700.csv
Done
In this case, we don’t see any performance benefit. The example appears to take about as long as the non-batch asyncio example above.
On my system it takes about 1.23 seconds compared to about 1.33 seconds for the non-batched example. A small difference that might be within the margin of measurement error.
[Finished in 1.2s]
[Finished in 1.2s]
[Finished in 1.3s]
Takeaways
In this tutorial, you explored how to delete thousands of files using multithreading, including submitting one task per file and submitting files in batch.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.