Last Updated on August 21, 2023
Deleting a single file is fast, but deleting many files can be slow.
It can become painfully slow when you need to delete thousands of files. The hope is that multithreading can be used to speed up the file deletion operation.
In this tutorial, you will explore how to delete thousands of files using multithreading.
Let’s dive in.
Create 10,000 Files To Delete
Before we can delete thousands of files, we must create them.
We can write a script that will create 10,000 CSV files, each file with 10,000 lines and each line containing 10 random numeric values.
Each line of ten random floats stored as text is roughly 190 bytes, so each file will be about 1.9 megabytes in size, and all 10,000 files together will be about 19 gigabytes in size.
First, we can create a function that will take a file path and string data and save it to file.
The save_file() function below implements this, taking the full path and the string data, opening the file in text mode for writing, and writing the data to the file. A context manager is used so that the file is closed automatically.
# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)
Next, we can generate the data to be saved.
First, we can write a function to generate a single line of data. In this case, a line is composed of 10 random numeric values in the range between 0 and 1, stored in CSV format (i.e. values separated by commas).
The generate_line() function below implements this, returning a string of random data.
# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])
Next, we can write a function to generate the contents of one file.
This will be 10,000 lines of random values, where each line is separated by a new line.
The generate_file_data() function below implements this, returning a string representing the data for a single data file.
# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)
Finally, we can generate data for all of the files and save each with a separate file name.
First, we can create a directory to store all of the created files (e.g. 'tmp') under the current working directory.
This can be achieved using the os.makedirs() function, for example:
...
# create a local directory to save files
makedirs(path, exist_ok=True)
We can then loop 10,000 times and create the data and a unique filename for each iteration, then save the generated contents of the file to disk.
...
# create all files
for i in range(10000):
    # generate data
    data = generate_file_data()
    # create filenames
    filepath = join(path, f'data-{i:04d}.csv')
    # save data file
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}')
The generate_all_files() function listed below implements this, creating the 10,000 files of 10,000 lines of data.
# generate 10K files in a directory
def generate_all_files(path='tmp'):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # create all files
    for i in range(10000):
        # generate data
        data = generate_file_data()
        # create filenames
        filepath = join(path, f'data-{i:04d}.csv')
        # save data file
        save_file(filepath, data)
        # report progress
        print(f'.saved {filepath}')
Tying this together, the complete example of creating a large number of CSV files is listed below.
# SuperFastPython.com
# create a large number of data files
from os import makedirs
from os.path import join
from random import random

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])

# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)

# generate 10K files in a directory
def generate_all_files(path='tmp'):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # create all files
    for i in range(10000):
        # generate data
        data = generate_file_data()
        # create filenames
        filepath = join(path, f'data-{i:04d}.csv')
        # save data file
        save_file(filepath, data)
        # report progress
        print(f'.saved {filepath}')

# entry point, generate all of the files
generate_all_files()
Running the example will create 10,000 CSV files of random data in a tmp/ sub-directory.
It can take a long time to run, depending on the speed of your hard drive; the SSDs used in modern computer systems are quite fast.
On my system it takes about 11 minutes to complete.
...
.saved tmp/data-9990.csv
.saved tmp/data-9991.csv
.saved tmp/data-9992.csv
.saved tmp/data-9993.csv
.saved tmp/data-9994.csv
.saved tmp/data-9995.csv
.saved tmp/data-9996.csv
.saved tmp/data-9997.csv
.saved tmp/data-9998.csv
.saved tmp/data-9999.csv
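Before moving on, it can be worth confirming that the files exist and are about the expected size. The short sketch below is not part of the original example; it simply sums os.path.getsize() over the directory listing.

# report the number of generated files and their total size
from os import listdir
from os.path import join, getsize

def report_files(path='tmp'):
    # build the full path for every file in the directory
    filepaths = [join(path, name) for name in listdir(path)]
    # sum the sizes of all files in bytes
    total = sum(getsize(filepath) for filepath in filepaths)
    # report the count and the total size in gigabytes
    print(f'{len(filepaths)} files, {total / 1e9:.1f} GB')

# entry point
report_files()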
Next, we can explore ways of deleting all of the files from the directory.
Delete Files One-By-One (slowly)
Deleting all of the files in a directory in Python is relatively straightforward.
A file can be deleted in Python by calling the os.remove() function and specifying the path to the file.
For example:
...
# delete the file
remove(filepath)
First, we can create a list of all file paths to delete.
This can be achieved by calling the os.listdir() function to list the contents of the directory, then calling the os.path.join() function on each file name to join it with the directory name and create a full path.
This can be achieved in a list comprehension, for example:
...
# create a list of all files to delete
files = [join(path, filename) for filename in listdir(path)]
We can then iterate over all of the prepared file paths and call the os.remove() function on each.
...
# delete all files
for filepath in files:
    # delete the file
    remove(filepath)
Tying this together, the complete example of deleting all files is listed below.
# SuperFastPython.com
# delete files in a directory sequentially
from os import listdir
from os import remove
from os.path import join

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # delete all files
    for filepath in files:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}')
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example will iterate over all files in the tmp/ sub-directory and delete each file in turn, one-by-one.
Even though there are 10,000 files and about 19 gigabytes of data total, deleting the files is relatively fast.
On my system it finishes in under a second, specifically 671.6 milliseconds (recall there are 1,000 milliseconds in a second).
[Finished in 666ms]
[Finished in 647ms]
[Finished in 702ms]
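The timings reported above come from my editor. If you prefer to time the script yourself, a minimal sketch using time.perf_counter() (not part of the original example) can wrap the call to the main() function defined above.

# time the file deletion by wrapping the main() function defined above
from time import perf_counter

# entry point
if __name__ == '__main__':
    # record the start time
    start = perf_counter()
    # delete all files in the tmp/ directory
    main()
    # report the overall duration in milliseconds
    duration_ms = (perf_counter() - start) * 1000
    print(f'Took {duration_ms:.1f} ms')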
How long did the example take to run on your system?
Let me know in the comments below.
Even though the deletion process is fast, we can use this as a baseline to see if the file deletion process can be sped up using multiple threads.
How to Delete Files Concurrently
File deletion is performed by the operating system, which calls down to the file system; deleting a file typically amounts to updating the file system's metadata (such as the file allocation table or an equivalent structure) rather than touching the file's data.
We might expect that file deletion is an IO-bound task, limited by the speed that files can be “deleted” (perhaps just dropped from the allocation table) by the hard disk drive.
This suggests that threads may be appropriate to perform concurrent deletion operations. This is because the CPython interpreter will release the global interpreter lock when performing IO operations, perhaps such as deleting a file. This is opposed to processes that are better suited to CPU-bound tasks, that execute as fast as the CPU can run them.
Given that deletion does not involve moving data around on the disk, deletion is a fast process generally.
Ultimately, we might expect that file deletion on the disk is performed sequentially; that is, the disk can only delete one file at a time.
As such, our expectation may be that file deletion cannot be performed in parallel on a single disk and using threads will not offer any speed-up.
Let’s explore this in the next section.
Delete Files Concurrently With Threads
The ThreadPoolExecutor provides a pool of worker threads that we can use to multithread the deleting of thousands of data files.
Each file in the tmp/ directory will represent a task that will require the file to be deleted from disk by a call to the os.remove() function.
First, we can define a function that takes the path of a file to delete, removes the file, then reports progress.
The delete_file() function below implements this.
# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}')
Next, we can call this function for each file in the tmp/ directory.
We can then create the thread pool with ten worker threads. We will use the context manager to ensure the thread pool is closed automatically once all pending and running tasks have been completed.
...
# create the thread pool
with ThreadPoolExecutor(10) as exe:
    # ...
We can then call the submit() function for each file to be deleted, calling the delete_file() function with the file path as an argument.
This can be achieved in a list comprehension, and the resulting list of Future objects can be ignored as we do not require any result from the task.
...
# delete all files
_ = [exe.submit(delete_file, filepath) for filepath in files]
Tying this together, the complete example of multithreaded deletion of files from one directory is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with threads
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ThreadPoolExecutor

# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}')

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # create a thread pool
    with ThreadPoolExecutor(10) as exe:
        # delete all files
        _ = [exe.submit(delete_file, filepath) for filepath in files]
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all files from the directory, as before.
...
.deleted tmp/data-0962.csv
.deleted tmp/data-9946.csv
.deleted tmp/data-4292.csv
.deleted tmp/data-0751.csv
.deleted tmp/data-8127.csv
.deleted tmp/data-6334.csv
.deleted tmp/data-0792.csv
.deleted tmp/data-2191.csv
.deleted tmp/data-8866.csv
.deleted tmp/data-2152.csv
Done
We do not see any speed benefit from multithreading the file deletion process, nor any penalty that slows it down.
On my system it takes about 629.6 milliseconds, which is roughly the same, probably within the margin of measurement error, as the sequential case that took 671.6 milliseconds.
[Finished in 622ms]
[Finished in 645ms]
[Finished in 622ms]
How long did it take to execute on your system?
Let me know in the comments below.
We can try increasing the number of worker threads to see if it offers any further benefit.
For example, we could try ten times more threads, e.g. 100 workers.
This requires simply changing the argument to the ThreadPoolExecutor class from 10 to 100.
...
# create the thread pool
with ThreadPoolExecutor(100) as exe:
    # ...
Running the example with the increase in worker threads deletes all files as before.
In this case, we do not see an improvement in performance. In fact we see a small increase in the overall task duration.
On my system it took about 723.3 milliseconds with 100, compared to about 629.6 milliseconds with 10 threads. This may be within the margin of measurement error, or perhaps caused by the overhead of having to manage additional threads.
[Finished in 754ms]
[Finished in 719ms]
[Finished in 697ms]
You might like to test additional configurations such as 500 or even 1,000 threads to see if it makes a reasonable difference in speed.
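If you want to compare several configurations in one place, a rough benchmarking sketch like the one below may help; it assumes the delete_file() function defined above and that the tmp/ directory is regenerated (e.g. with the file-creation script) before each trial.

# benchmark the thread pool for a given number of worker threads
from time import perf_counter
from os import listdir
from os.path import join
from concurrent.futures import ThreadPoolExecutor

def benchmark(n_workers, path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # record the start time
    start = perf_counter()
    # delete all files with the given number of worker threads
    with ThreadPoolExecutor(n_workers) as exe:
        _ = [exe.submit(delete_file, filepath) for filepath in files]
    # report the duration for this configuration
    print(f'{n_workers} workers: {(perf_counter() - start) * 1000:.1f} ms')

# try one configuration per run, e.g. 500 workers (regenerate tmp/ between runs)
benchmark(500)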
Delete Files Concurrently with Threads in Batch
Each task in the thread pool adds overhead that slows down the overall task.
This includes function calls and object creation for each task, occurring internally within the ThreadPoolExecutor class.
We might reduce this overhead by performing file operations in a batch within each worker thread.
This can be achieved by updating the delete_file() function to receive a list of files to delete, and splitting up the files in the main() function into chunks to be submitted to worker threads for batch processing.
The hope is that this would reduce some of the overhead required for submitting so many small tasks by replacing it with a few larger batch tasks.
First, we can change the target function to receive a list of file names to delete, listed below.
# delete files and report progress
def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}')
Next, in the main() function we can select a chunksize based on the number of worker threads and the number of files to delete.
First, we must create a list of all files to delete that we can split up later.
...
# create a list of all files to delete
files = [join(path, filename) for filename in listdir(path)]
In this case, we will use 10 worker threads for the 10,000 files to delete. Dividing the files evenly between workers (10,000 / 10) gives 1,000 files for each worker to delete.
...
# determine chunksize
n_workers = 10
chunksize = round(len(files) / n_workers)
Next, we can create the thread pool with the parameterized number of worker threads.
...
# create the thread pool
with ThreadPoolExecutor(n_workers) as exe:
    ...
We can then iterate over the list of files to delete and split them into chunks defined by a chunksize.
This can be achieved by iterating over the list of files and using the chunksize as the increment size in a for-loop. We can then split off a chunk of files to send to a worker thread as a task.
...
# split the delete operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit the batch delete task
    _ = exe.submit(delete_files, filenames)
And that’s it.
The complete example of deleting files in a batch concurrently using worker threads is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with threads in batch
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ThreadPoolExecutor

# delete files and report progress
def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}')

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # determine chunksize
    n_workers = 10
    chunksize = round(len(files) / n_workers)
    # create the thread pool
    with ThreadPoolExecutor(n_workers) as exe:
        # split the delete operations into chunks
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # submit the batch delete task
            _ = exe.submit(delete_files, filenames)
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all 10,000 files from the tmp/ sub-directory.
...
.deleted tmp/data-5412.csv
.deleted tmp/data-3413.csv
.deleted tmp/data-9228.csv
.deleted tmp/data-1674.csv
.deleted tmp/data-7205.csv
.deleted tmp/data-8136.csv
.deleted tmp/data-1660.csv
.deleted tmp/data-7211.csv
.deleted tmp/data-8122.csv
.deleted tmp/data-4718.csv
Done
In this case, we do see a lift in performance.
On my system it takes about 394.6 milliseconds, compared to about 671.6 milliseconds in the case where files were deleted one-by-one. This is about a 1.7x speed improvement.
Not bad.
[Finished in 395ms]
[Finished in 417ms]
[Finished in 372ms]
Delete Files Concurrently with Processes
We can also try to delete files concurrently with processes instead of threads.
It is unclear whether processes can offer a speed benefit in this case. Given that we can get a benefit using threads with batch deletion, we know it is possible. Nevertheless, using processes requires data for each task to be serialised which introduces additional overhead that might negate any speed-up from performing file operations with true parallelism via processes.
We can explore using processes to delete files concurrently using the ProcessPoolExecutor.
This can be achieved by switching out the ThreadPoolExecutor for the ProcessPoolExecutor and specifying the number of worker processes.
We will use 8 processes in this case as I have 8 logical CPU cores, although you may want to try different configurations based on the number of logical CPU cores you have available.
...
# create the process pool
with ProcessPoolExecutor(8) as exe:
    # ...
Tying this together, the complete example of deleting files concurrently using the process pool is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with processes
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ProcessPoolExecutor

# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}', flush=True)

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # create a process pool
    with ProcessPoolExecutor(8) as exe:
        # delete all files
        _ = [exe.submit(delete_file, filepath) for filepath in files]
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all files as before.
...
.deleted tmp/data-5832.csv
.deleted tmp/data-2185.csv
.deleted tmp/data-5826.csv
.deleted tmp/data-2191.csv
.deleted tmp/data-1498.csv
.deleted tmp/data-0786.csv
.deleted tmp/data-7957.csv
.deleted tmp/data-6491.csv
.deleted tmp/data-5198.csv
.deleted tmp/data-4286.csv
Done
We do not see any lift in performance. In fact, we see a dramatic drop in performance compared to both the sequential and multithreaded examples.
On my system it takes about 2.23 seconds, compared to about 671.6 milliseconds for the sequential example above. This is about 3.3x slower.
The dramatic decrease in performance is very likely due to the overhead in serialising the arguments for the 10,000 tasks.
[Finished in 1.9s]
[Finished in 2.4s]
[Finished in 2.4s]
Delete Files Concurrently with Processes in Batch
Any data sent to a worker process or received from a worker process must be serialised (pickled).
This adds overhead for each task executed by worker processes.
This is likely the cause of worse performance seen when using the process pool in the previous section.
We can address this by batching the delete tasks into chunks to be executed by each worker process, just as we did when we batched filenames for the thread pools.
Again, this can be achieved by adapting the batched version of ThreadPoolExecutor from a previous section to use the ProcessPoolExecutor class.
We will use 8 worker processes. With 10,000 files to be deleted, this means that each worker will receive a batch of 1,250 files to delete.
...
# determine chunksize
n_workers = 8
chunksize = round(len(files) / n_workers)
We can then create the process pool using the parameterized number of workers.
...
# create the process pool
with ProcessPoolExecutor(n_workers) as exe:
    # ...
And that’s it.
The complete example of batch deleting files using process pools is listed below.
# SuperFastPython.com
# delete files in a directory concurrently with processes in batch
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ProcessPoolExecutor

# delete files and report progress
def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        remove(filepath)
        # report progress
        print(f'.deleted {filepath}', flush=True)

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(files) / n_workers)
    # create the process pool
    with ProcessPoolExecutor(n_workers) as exe:
        # split the delete operations into chunks
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # submit the batch delete task
            _ = exe.submit(delete_files, filenames)
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example deletes all 10,000 files as we did previously.
...
.deleted tmp/data-7920.csv
.deleted tmp/data-2680.csv
.deleted tmp/data-2858.csv
.deleted tmp/data-2694.csv
.deleted tmp/data-9919.csv
.deleted tmp/data-0083.csv
.deleted tmp/data-7934.csv
.deleted tmp/data-6394.csv
.deleted tmp/data-4583.csv
.deleted tmp/data-5845.csv
Done
On my system, the program took about 497.3 milliseconds, which may be slightly slower than the batch multithreaded example that took about 394.6 milliseconds, although slightly faster than the sequential example that took about 671.6 milliseconds (about 1.35x faster).
[Finished in 519ms]
[Finished in 481ms]
[Finished in 492ms]
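As an aside, the executor's map() method accepts a chunksize argument that batches the submitted tasks for you, which may achieve a similar effect with less code. A minimal sketch, assuming the one-file-at-a-time delete_file() function from the earlier process pool example, might look like the following.

# delete files with a process pool, letting map() batch the tasks
from os import listdir
from os import remove
from os.path import join
from concurrent.futures import ProcessPoolExecutor

# delete a file and report progress
def delete_file(filepath):
    # delete the file
    remove(filepath)
    # report progress
    print(f'.deleted {filepath}', flush=True)

# delete all files in a directory
def main(path='tmp'):
    # create a list of all files to delete
    files = [join(path, filename) for filename in listdir(path)]
    # create the process pool
    with ProcessPoolExecutor(8) as exe:
        # issue the delete tasks in chunks of 1,250 file paths per batch
        _ = exe.map(delete_file, files, chunksize=1250)
    print('Done')

# entry point
if __name__ == '__main__':
    main()

Whether this performs as well as the manual batching above is something you would need to measure on your own system.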
Delete Files Concurrently with AsyncIO
We can also delete files concurrently using asyncio.
Generally, Python does not support non-blocking operations when working with files; this capability is not provided in the asyncio module because it is challenging to implement in a general, cross-platform manner.
The third-party aiofiles library provides file operations for use in asyncio programs, but again, the operations are not true non-blocking IO; instead, the concurrency is simulated using threads.
Nevertheless, we can use aiofiles in an asyncio program to delete files concurrently.
First, we can update the delete_file() function to be an async function that can be awaited and change the call to the os.remove() function to use the aiofiles.os.remove() that can be awaited.
The updated version of delete_file() is listed below.
# delete a file and report progress
async def delete_file(filepath):
    # delete the file
    await remove(filepath)
    # report progress
    print(f'.deleted {filepath}')
Next, we can update the main() function.
First, we can create a list of coroutines (asynchronous functions) that will execute concurrently. Each coroutine will be a call to the delete_file() function with a separate file to delete.
This can be achieved in a list comprehension, for example:
...
# create a list of coroutines for deleting files
tasks = [delete_file(name) for name in filepaths]
You might recall that this only defines the list of coroutines; the asynchronous function is not executed while the list is being constructed.
Next, we can execute the list of coroutines and wait for them to complete. This will attempt to perform all 10,000 file delete operations concurrently, allowing each task to yield control while waiting for the delete operation to complete via the await within the delete_file() function.
This can be achieved using the asyncio.gather() function that takes a collection of async function calls. Given that we have a list of these calls, we can unpack the function calls with the star (*) operator.
...
# execute and wait for all tasks to complete
await asyncio.gather(*tasks)
Finally, we can start the asyncio event loop by calling the asyncio.run() function and pass it a call to our main() function.
# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
The complete example of deleting files concurrently with asyncio is listed below.
# SuperFastPython.com
# delete all files from a directory concurrently with asyncio
from os import listdir
from os.path import join
from aiofiles.os import remove
import asyncio

# delete a file and report progress
async def delete_file(filepath):
    # delete the file
    await remove(filepath)
    # report progress
    print(f'.deleted {filepath}')

# delete all files in a directory
async def main(path='tmp'):
    # create a list of all file paths to delete
    filepaths = [join(path, filename) for filename in listdir(path)]
    # create a list of coroutines for deleting files
    tasks = [delete_file(name) for name in filepaths]
    # execute and wait for all tasks to complete
    await asyncio.gather(*tasks)
    print('Done')

# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
Running the example deletes all 10,000 files as before.
...
.deleted tmp/data-5826.csv
.deleted tmp/data-7957.csv
.deleted tmp/data-5198.csv
.deleted tmp/data-2191.csv
.deleted tmp/data-0792.csv
.deleted tmp/data-4286.csv
.deleted tmp/data-4292.csv
.deleted tmp/data-2634.csv
.deleted tmp/data-6491.csv
.deleted tmp/data-5832.csv
Done
We do not see any benefit in using asyncio to delete files concurrently. In fact, the results are worse than the sequential example we developed first.
On my system, the program completes in about 1.33 seconds, which is about 2x worse than the sequential example that took about 671.6 milliseconds.
This is not surprising given that asyncio is operating in a single thread, that aiofiles is simulating non-blocking IO using threads internally and that we are attempting to run 10,000 coroutines simultaneously.
Nevertheless, it was an interesting exercise in case Python does support non-blocking file operations in the future.
[Finished in 1.3s]
[Finished in 1.3s]
[Finished in 1.4s]
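Another option worth sketching, before turning to batching in the next section, is to keep one coroutine per file but cap how many delete operations are in flight at once using an asyncio.Semaphore. This is not part of the original example, and I would not expect it to beat the sequential version for the same reasons given above.

# delete files concurrently with asyncio, limited by a semaphore
from os import listdir
from os.path import join
from aiofiles.os import remove
import asyncio

# delete a file once a slot on the semaphore is available
async def delete_file(semaphore, filepath):
    async with semaphore:
        # delete the file
        await remove(filepath)
        # report progress
        print(f'.deleted {filepath}')

# delete all files in a directory
async def main(path='tmp'):
    # create a list of all file paths to delete
    filepaths = [join(path, filename) for filename in listdir(path)]
    # allow at most 100 delete operations to run at the same time
    semaphore = asyncio.Semaphore(100)
    # create one coroutine per file and wait for them all to complete
    await asyncio.gather(*[delete_file(semaphore, fp) for fp in filepaths])
    print('Done')

# entry point
if __name__ == '__main__':
    asyncio.run(main())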
Delete Files Concurrently with AsyncIO in Batch
Perhaps we can reduce some of the overhead in deleting files concurrently with asyncio by performing tasks in batch.
In the previous example, each coroutine performed a single IO-bound operation. We can update the delete_file() function to process multiple files in a batch as we did in previous batch processing examples.
This will reduce the number of coroutines in the asyncio runtime and may reduce the amount of data moving around. At the very least, it may be an interesting exercise.
The adaptation of the asyncio example to delete files in batch is very similar to the previous examples that we adapted to operate in batch.
The updated version of the task function that will process a batch of files to delete is listed below. The task will yield control for each delete operation, but only to other coroutines. The for-loop within a task will not progress while it is waiting.
# delete files and report progress
async def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        await remove(filepath)
        # report progress
        print(f'.deleted {filepath}')
Next, we can choose the number of coroutines we wish to use and then divide the total number of files up evenly into chunks.
We will use 100 in this case, chosen arbitrarily.
...
# determine chunksize
n_workers = 100
chunksize = round(len(filepaths) / n_workers)
We can then iterate over the chunks and add each coroutine to a list for execution at the end.
...
# split the delete operations into chunks
tasks = list()
for i in range(0, len(filepaths), chunksize):
    # select a chunk of filenames
    filenames = filepaths[i:(i + chunksize)]
    # create the batch delete task
    tasks.append(delete_files(filenames))
Finally, we can execute all 100 coroutines concurrently.
...
# execute all coroutines concurrently and wait for them to finish
await asyncio.gather(*tasks)
Tying this together, a complete example of batch file deleting using asyncio is listed below.
Again, we don’t expect a speed benefit over the single-threaded example, but perhaps a small lift in performance over the previous example where we had 10,000 coroutines operating concurrently.
# SuperFastPython.com
# delete all files from a directory concurrently with asyncio in batch
from os import listdir
from os.path import join
from aiofiles.os import remove
import asyncio

# delete files and report progress
async def delete_files(filepaths):
    # process all file names
    for filepath in filepaths:
        # delete the file
        await remove(filepath)
        # report progress
        print(f'.deleted {filepath}')

# delete all files in a directory
async def main(path='tmp'):
    # create a list of all files to delete
    filepaths = [join(path, filename) for filename in listdir(path)]
    # determine chunksize
    n_workers = 100
    chunksize = round(len(filepaths) / n_workers)
    # split the delete operations into chunks
    tasks = list()
    for i in range(0, len(filepaths), chunksize):
        # select a chunk of filenames
        filenames = filepaths[i:(i + chunksize)]
        # create the batch delete task
        tasks.append(delete_files(filenames))
    # execute all coroutines concurrently and wait for them to finish
    await asyncio.gather(*tasks)
    print('Done')

# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
Running the example deletes approximately 19 gigabytes of files as we did in all previous examples.
...
.deleted tmp/data-2208.csv
.deleted tmp/data-2959.csv
.deleted tmp/data-8523.csv
.deleted tmp/data-5210.csv
.deleted tmp/data-1702.csv
.deleted tmp/data-1216.csv
.deleted tmp/data-9729.csv
.deleted tmp/data-4286.csv
.deleted tmp/data-7942.csv
.deleted tmp/data-4700.csv
Done
In this case, we don’t see any performance benefit. The example appears to take about as long as the non-batch asyncio example above.
On my system it takes about 1.23 seconds compared to about 1.33 seconds for the non-batched example. A small difference that might be within the margin of measurement error.
[Finished in 1.2s]
[Finished in 1.2s]
[Finished in 1.3s]
Takeaways
In this tutorial, you explored how to delete thousands of files using multithreading, including submitting one task per file and submitting files in batch.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.