Last Updated on August 21, 2023
Renaming a single file is fast, but renaming many files can be slow.
It can become painfully slow in situations where you may need to rename thousands or millions of files. The hope is that multithreading can be used to speed up the file renaming operation.
In this tutorial, you will explore how to rename thousands of files using multithreading.
Let’s dive in.
Create 10,000 Files To Rename
Before we can rename thousands of files, we must create them.
We can write a script that will create 10,000 CSV files, each file with 10,000 lines and each line containing 10 random numeric values.
Each file will be about 1.9 megabytes, and all files together will be about 19 gigabytes in size.
First, we can create a function that will take a file path and string data and save it to file.
The save_file() function below implements this, taking the full path and the data, opening the file in write mode, and writing the data to the file. A context manager is used so that the file is closed automatically.
# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)
Next, we can generate the data to be saved.
First, we can write a function to generate a single line of data. In this case, a line is composed of 10 random numeric values between 0 and 1, stored in CSV format (i.e. values separated by commas).
The generate_line() function below implements this, returning a string of random data.
# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])
Next, we can write a function to generate the contents of one file.
This will be 10,000 lines of random values, where each line is separated by a new line.
The generate_file_data() function below implements this, returning a string representing the data for a single data file.
# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)
Finally, we can generate data for all of the files and save each with a separate file name.
First, we can create a directory to store all of the created files (e.g. 'tmp') under the current working directory.
This can be achieved using the os.makedirs() function, for example:
...
# create a local directory to save files
makedirs(path, exist_ok=True)
We can then loop 10,000 times and create the data and a unique filename for each iteration, then save the generated contents of the file to disk.
...
# create all files
for i in range(10000):
    # generate data
    data = generate_file_data()
    # create filenames
    filepath = join(path, f'data-{i:04d}.csv')
    # save data file
    save_file(filepath, data)
    # report progress
    print(f'.saved {filepath}')
The generate_all_files() function listed below implements this, creating the 10,000 files of 10,000 lines of data.
# generate 10K files in a directory
def generate_all_files(path='tmp'):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # create all files
    for i in range(10000):
        # generate data
        data = generate_file_data()
        # create filenames
        filepath = join(path, f'data-{i:04d}.csv')
        # save data file
        save_file(filepath, data)
        # report progress
        print(f'.saved {filepath}')
Tying this together, the complete example of creating a large number of CSV files is listed below.
# SuperFastPython.com
# create a large number of data files
from os import makedirs
from os.path import join
from random import random

# save data to a file
def save_file(filepath, data):
    # open the file
    with open(filepath, 'w') as handle:
        # save the data
        handle.write(data)

# generate a line of mock data of 10 random data points
def generate_line():
    return ','.join([str(random()) for _ in range(10)])

# generate file data of 10K lines each with 10 data points
def generate_file_data():
    # generate many lines of data
    lines = [generate_line() for _ in range(10000)]
    # convert to a single ascii doc with new lines
    return '\n'.join(lines)

# generate 10K files in a directory
def generate_all_files(path='tmp'):
    # create a local directory to save files
    makedirs(path, exist_ok=True)
    # create all files
    for i in range(10000):
        # generate data
        data = generate_file_data()
        # create filenames
        filepath = join(path, f'data-{i:04d}.csv')
        # save data file
        save_file(filepath, data)
        # report progress
        print(f'.saved {filepath}')

# entry point, generate all of the files
generate_all_files()
Running the example will create 10,000 CSV files of random data into a tmp/ directory.
It takes a while to run, depending on the speed of your hard drive; SSDs used in modern systems are quite fast.
On my system it takes about 11 minutes to complete.
...
.saved tmp/data-9990.csv
.saved tmp/data-9991.csv
.saved tmp/data-9992.csv
.saved tmp/data-9993.csv
.saved tmp/data-9994.csv
.saved tmp/data-9995.csv
.saved tmp/data-9996.csv
.saved tmp/data-9997.csv
.saved tmp/data-9998.csv
.saved tmp/data-9999.csv
Next, we can explore ways of renaming a large number of files.
Rename Files One-By-One (slowly)
We are interested in renaming a list of files that could be in the same or different directories.
We will simulate the general file renaming task by renaming a list of files in one source directory.
Renaming all files in one directory in Python is relatively straightforward.
Python provides a number of ways to rename files, although the os.rename() function is perhaps the standard approach.
In this tutorial, we will use the os.rename() function to rename files and provide it with the relative path to the source file and the relative path to the destination (renamed) file in the same directory.
For example:
...
# rename source file to destination file
rename(src_path, dest_path)
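As an aside, the same rename can be expressed with the standard library pathlib module if you prefer an object-oriented API. The snippet below is a minimal sketch (the tmp/data-0000.csv path is only an illustration); the rest of this tutorial sticks with os.rename().

# rename a single file using pathlib (illustrative path only)
from pathlib import Path
src_path = Path('tmp/data-0000.csv')
# build the destination name from the stem and suffix
dest_path = src_path.with_name(f'{src_path.stem}-new{src_path.suffix}')
# rename the file on disk
src_path.rename(dest_path)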
First, we can iterate over all files in the source directory using the os.listdir() function, which returns the name of each file.
...
# rename all files in a directory
for filename in listdir(src):
    # ...
For each file to be renamed, we will first split the filename into a name and an extension so that we can devise a new filename. This can be achieved using the os.path.splitext() function, for example:
...
# separate filename into name and extension
name, extension = splitext(filename)
We can then devise a new filename. In this case we will append the string “-new” to the filename.
...
# devise new filename
dest_filename = f'{name}-new{extension}'
Next, we can define the relative path to the existing file and the new file that will be used as arguments to the rename function. This can be achieved using the os.path.join() function that takes the directory as the first argument and the filename as the second argument.
...
# devise source and destination relative paths
src_path = join(src, filename)
dest_path = join(src, dest_filename)
Finally, we can rename the file and report progress.
...
# rename file
rename(src_path, dest_path)
# report progress
print(f'.renamed {src_path} to {dest_path}')
Tying this together, the complete example of renaming all files in a directory sequentially (one-by-one) is listed below.
# SuperFastPython.com
# rename files in a directory sequentially
from os import listdir
from os import rename
from os.path import join
from os.path import splitext

# rename files in src
def main(src='tmp'):
    # rename all files in a directory
    for filename in listdir(src):
        # separate filename into name and extension
        name, extension = splitext(filename)
        # devise new filename
        dest_filename = f'{name}-new{extension}'
        # devise source and destination relative paths
        src_path = join(src, filename)
        dest_path = join(src, dest_filename)
        # rename file
        rename(src_path, dest_path)
        # report progress
        print(f'.renamed {src_path} to {dest_path}')
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the program will rename all files in the tmp/ subdirectory to have a “-new” string in the filename.
...
.renamed tmp/data-5832.csv to tmp/data-5832-new.csv
.renamed tmp/data-2185.csv to tmp/data-2185-new.csv
.renamed tmp/data-5826.csv to tmp/data-5826-new.csv
.renamed tmp/data-2191.csv to tmp/data-2191-new.csv
.renamed tmp/data-1498.csv to tmp/data-1498-new.csv
.renamed tmp/data-0786.csv to tmp/data-0786-new.csv
.renamed tmp/data-7957.csv to tmp/data-7957-new.csv
.renamed tmp/data-6491.csv to tmp/data-6491-new.csv
.renamed tmp/data-5198.csv to tmp/data-5198-new.csv
.renamed tmp/data-4286.csv to tmp/data-4286-new.csv
Done
Renaming files is faster than copying files, although it will still take a moment as there are 10,000 files, each about 2 megabytes in size.
The overall execution time will depend on the speed of your hardware, specifically the speed of your hard drive.
On my system with an SSD hard drive, it takes about 1.36 seconds to complete (resetting the tmp/ directory between runs; a small reset helper is sketched after the timings below).
[Finished in 1.4s]
[Finished in 1.4s]
[Finished in 1.3s]
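For completeness, the timings above were gathered by resetting the tmp/ directory between runs. The helper below is a small sketch of one way to do that (the reset_files() function is hypothetical and not part of the benchmark); it simply strips the '-new' suffix added by the rename.

# hypothetical helper: undo the rename so the benchmark can be repeated
from os import listdir
from os import rename
from os.path import join
from os.path import splitext

def reset_files(src='tmp'):
    for filename in listdir(src):
        # split the filename into name and extension
        name, extension = splitext(filename)
        # only touch files that were renamed previously
        if name.endswith('-new'):
            # drop the trailing '-new' and restore the original name
            original = name[:-4] + extension
            rename(join(src, filename), join(src, original))

if __name__ == '__main__':
    reset_files()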
How long does it take to run on your system?
Let me know in the comments below.
Now that we know how to rename thousands of files, let’s consider how we might speed it up using multiple threads.
How to Rename Files Concurrently
File renaming is performed by the operating system which then calls down to the hardware, probably to do something with the table of contents for files on the disk (e.g. a file allocation table or the modern equivalent).
As such, we might expect that file renaming is an IO-bound task, limited by the speed that files can be “renamed” by the disk, e.g. perhaps just updates to the hard drives file allocation table.
This suggests that threads may be appropriate for performing concurrent file renaming operations. This is because the CPython interpreter releases the global interpreter lock when performing IO operations, such as renaming a file. Processes, by contrast, are better suited to CPU-bound tasks that execute as fast as the CPU can run them.
Given that renaming files does not involve creating new data or moving data around on the disk, file renaming is a generally fast process.
Ultimately, we might expect that file renaming on the disk is performed sequentially. That is, the disk is only able to rename one file at a time.
As such, our expectation may be that file renaming cannot be performed in parallel on a single disk and using threads will not offer any speed-up.
Let’s explore this in the next section.
Rename Files Concurrently with Threads
The ThreadPoolExecutor provides a pool of worker threads that we can use to multithread the renaming of thousands of files.
Each file in the tmp/ directory will represent a task that will require the file to be renamed from the source name to the destination name using the os.rename() function.
First, we can define a function that represents the task. The function will take the name of the file and the source directory and will perform the rename operation and report progress to stdout.
The rename_file() function below implements this.
# rename a file
def rename_file(filename, src):
    # separate filename into name and extension
    name, extension = splitext(filename)
    # devise new filename
    dest_filename = f'{name}-new{extension}'
    # devise source and destination relative paths
    src_path = join(src, filename)
    dest_path = join(src, dest_filename)
    # rename file
    rename(src_path, dest_path)
    # report progress
    print(f'.renamed {src_path} to {dest_path}')
Next, we can call this function for each file in the tmp/ directory.
First, we can create the thread pool with ten worker threads, chosen arbitrarily.
We will use the context manager to ensure the thread pool is closed automatically once all pending and running tasks have been completed.
...
# create the thread pool
with ThreadPoolExecutor(10) as exe:
    # ...
We can then call the submit() function for each file to be renamed, calling the rename_file() function with the filename and directory arguments.
This can be achieved in a list comprehension, and the resulting list of Future objects can be ignored as we do not require any result from the task.
...
# submit all file rename tasks
_ = [exe.submit(rename_file, name, src) for name in listdir(src)]
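Note that any exception raised inside a submitted task is captured in its Future and will pass silently if the Future is ignored. If you want rename failures to surface, one variation (a sketch, not required for the example below) is to keep the futures and check each result:

...
# submit all file rename tasks and keep the futures
futures = [exe.submit(rename_file, name, src) for name in listdir(src)]
# re-raise any exception captured by a task, e.g. a missing file
for future in futures:
    future.result()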
Tying this together, the complete example of multithreaded renaming of files is listed below.
# SuperFastPython.com
# rename all files in a directory concurrently with threads
from os import listdir
from os import rename
from os.path import join
from os.path import splitext
from concurrent.futures import ThreadPoolExecutor

# rename a file
def rename_file(filename, src):
    # separate filename into name and extension
    name, extension = splitext(filename)
    # devise new filename
    dest_filename = f'{name}-new{extension}'
    # devise source and destination relative paths
    src_path = join(src, filename)
    dest_path = join(src, dest_filename)
    # rename file
    rename(src_path, dest_path)
    # report progress
    print(f'.renamed {src_path} to {dest_path}')

# rename files in src
def main(src='tmp'):
    # create the thread pool
    with ThreadPoolExecutor(10) as exe:
        # submit all file rename tasks
        _ = [exe.submit(rename_file, name, src) for name in listdir(src)]
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example renames all files in the tmp/ subdirectory, as before.
...
.renamed tmp/data-6491.csv to tmp/data-6491-new.csv
.renamed tmp/data-7980.csv to tmp/data-7980-new.csv
.renamed tmp/data-8457.csv to tmp/data-8457-new.csv
.renamed tmp/data-8480.csv to tmp/data-8480-new.csv
.renamed tmp/data-5832.csv to tmp/data-5832-new.csv
.renamed tmp/data-3933.csv to tmp/data-3933-new.csv
.renamed tmp/data-6446.csv to tmp/data-6446-new.csv
.renamed tmp/data-5615.csv to tmp/data-5615-new.csv
.renamed tmp/data-2146.csv to tmp/data-2146-new.csv
.renamed tmp/data-2620.csv to tmp/data-2620-new.csv
Done
We may see a small increase in speed in performing the rename operations concurrently with threads.
On my system, the task completed in about 1.06 seconds, compared to about 1.36 seconds with the sequential case. This is about 1.28x faster.
[Finished in 1.1s]
[Finished in 991ms]
[Finished in 1.1s]
How long does it take to run on your system?
Let me know in the comments below.
We can try increasing the number of worker threads to see if it offers any further benefit.
For example, we could try ten times more threads, e.g. 100 workers.
This requires simply changing the argument to the ThreadPoolExecutor class from 10 to 100.
...
# create the thread pool
with ThreadPoolExecutor(100) as exe:
    # ...
Running the example with the increase in worker threads renames all files as before.
In this case, we may see a small further speed increase, or it may be within the margin of error.
On my system it took about 967.6 milliseconds (recall there are 1,000 milliseconds in a second), compared to about 1.36 seconds for the single-threaded version. That is about 1.41x faster.
[Finished in 947ms]
[Finished in 956ms]
[Finished in 1.0s]
You might like to test additional configurations such as 500 or even 1,000 threads to see if it makes a reasonable difference in speed.
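One way to compare configurations is to wrap the pool in a timer and loop over a few candidate sizes. The sketch below assumes the rename_file() function defined above and that the tmp/ directory is reset between measurements (each pass renames the files again, appending another '-new'), so treat it as a rough harness rather than a definitive benchmark.

# rough harness to time the multithreaded rename for several pool sizes
from time import perf_counter
from os import listdir
from concurrent.futures import ThreadPoolExecutor

# time one multithreaded rename pass with a given number of workers
def time_rename(n_workers, src='tmp'):
    start = perf_counter()
    with ThreadPoolExecutor(n_workers) as exe:
        # submit all file rename tasks (rename_file() defined as above)
        _ = [exe.submit(rename_file, name, src) for name in listdir(src)]
    return perf_counter() - start

if __name__ == '__main__':
    for n_workers in [10, 100, 500, 1000]:
        duration = time_rename(n_workers)
        print(f'{n_workers} workers: {duration:.3f} seconds')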
Rename Files Concurrently with Threads in Batch
Each task in the thread pool adds overhead that slows down the overall task.
This includes function calls and object creation for each task, occurring internally within the ThreadPoolExecutor class.
We might reduce this overhead by performing file operations in a batch within each worker thread.
This can be achieved by updating the rename_file() function to receive a list of files to rename, and splitting up the files in the main() function into chunks to be submitted to worker threads for batch processing.
The hope is that this would reduce some of the overhead required for submitting so many small tasks by replacing it with a few larger batch tasks.
First, we can change the target function to receive a list of file names to rename, listed below.
# rename files
def rename_files(filenames, src):
    # rename all files
    for filename in filenames:
        # separate filename into name and extension
        name, extension = splitext(filename)
        # devise new filename
        dest_filename = f'{name}-new{extension}'
        # devise source and destination relative paths
        src_path = join(src, filename)
        dest_path = join(src, dest_filename)
        # rename file
        rename(src_path, dest_path)
        # report progress
        print(f'.renamed {src_path} to {dest_path}')
Next, in the main() function we can select a chunksize based on the number of worker threads and the number of files to rename.
First, we must create a list of all files to rename that we can split up later.
...
# list of all files to rename
files = [name for name in listdir(src)]
In this case, we will use 100 worker threads with 10,000 files to rename. Dividing the files evenly between workers (10,000 / 100) gives 100 files to rename for each worker.
...
# determine chunksize
n_workers = 100
chunksize = round(len(files) / n_workers)
Next, we can create the thread pool with the parameterized number of worker threads.
...
# create the thread pool
with ThreadPoolExecutor(n_workers) as exe:
    ...
We can then iterate over the list of files to rename and split them into chunks defined by a chunksize.
This can be achieved by iterating over the list of files and using the chunksize as the increment size in a for-loop. We can then split off a chunk of files to send to a worker thread as a task.
...
# split the rename operations into chunks
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # submit file rename tasks
    _ = exe.submit(rename_files, filenames, src)
And that’s it.
The complete example of renaming files in a batch concurrently using worker threads is listed below.
# SuperFastPython.com
# rename all files in a directory concurrently with threads in batch
from os import listdir
from os import rename
from os.path import join
from os.path import splitext
from concurrent.futures import ThreadPoolExecutor

# rename files
def rename_files(filenames, src):
    # rename all files
    for filename in filenames:
        # separate filename into name and extension
        name, extension = splitext(filename)
        # devise new filename
        dest_filename = f'{name}-new{extension}'
        # devise source and destination relative paths
        src_path = join(src, filename)
        dest_path = join(src, dest_filename)
        # rename file
        rename(src_path, dest_path)
        # report progress
        print(f'.renamed {src_path} to {dest_path}')

# rename files in src
def main(src='tmp'):
    # list of all files to rename
    files = [name for name in listdir(src)]
    # determine chunksize
    n_workers = 100
    chunksize = round(len(files) / n_workers)
    # create the thread pool
    with ThreadPoolExecutor(n_workers) as exe:
        # split the rename operations into chunks
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # submit file rename tasks
            _ = exe.submit(rename_files, filenames, src)
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example renames all 10,000 files in the tmp/ subdirectory as before.
...
.renamed tmp/data-2992.csv to tmp/data-2992-new.csv
.renamed tmp/data-5462.csv to tmp/data-5462-new.csv
.renamed tmp/data-0288.csv to tmp/data-0288-new.csv
.renamed tmp/data-3454.csv to tmp/data-3454-new.csv
.renamed tmp/data-6821.csv to tmp/data-6821-new.csv
.renamed tmp/data-0175.csv to tmp/data-0175-new.csv
.renamed tmp/data-1828.csv to tmp/data-1828-new.csv
.renamed tmp/data-8529.csv to tmp/data-8529-new.csv
.renamed tmp/data-7281.csv to tmp/data-7281-new.csv
.renamed tmp/data-9637.csv to tmp/data-9637-new.csv
Done
In this case, we see a further minor decrease in overall execution time.
On my system, renaming all files takes about 823 milliseconds compared to about 967.6 milliseconds with 100 workers in non-batch mode and 1.36 seconds in the sequential case. This is about 1.65x faster compared to the sequential example.
[Finished in 800ms]
[Finished in 802ms]
[Finished in 867ms]
Rename Files Concurrently with Processes
We can also rename files concurrently with processes instead of threads.
It is unclear whether processes can offer a speed benefit in this case. Given that we can get a benefit using threads, we know it is possible. Nevertheless, using processes requires the data for each task to be serialised, which introduces additional overhead that might negate any speed-up from performing file operations concurrently.
We can explore using processes to rename files concurrently using the ProcessPoolExecutor.
This can be achieved by swapping the ThreadPoolExecutor for the ProcessPoolExecutor and specifying the number of worker processes.
We will use 8 processes in this case as I have 8 logical CPU cores, although you may want to try different configurations based on the number of logical CPU cores you have available.
...
# create the process pool
with ProcessPoolExecutor(8) as exe:
    # ...
Tying this together, the complete example of renaming files concurrently using the process pool is listed below.
# SuperFastPython.com
# rename all files in a directory concurrently with processes
from os import listdir
from os import rename
from os.path import join
from os.path import splitext
from concurrent.futures import ProcessPoolExecutor

# rename a file
def rename_file(filename, src):
    # separate filename into name and extension
    name, extension = splitext(filename)
    # devise new filename
    dest_filename = f'{name}-new{extension}'
    # devise source and destination relative paths
    src_path = join(src, filename)
    dest_path = join(src, dest_filename)
    # rename file
    rename(src_path, dest_path)
    # report progress
    print(f'.renamed {src_path} to {dest_path}', flush=True)

# rename files in src
def main(src='tmp'):
    # create the process pool
    with ProcessPoolExecutor(8) as exe:
        # submit all file rename tasks
        _ = [exe.submit(rename_file, name, src) for name in listdir(src)]
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example renames all 10,000 files in the tmp/ subdirectory as before.
.renamed tmp/data-8457.csv to tmp/data-8457-new.csv
.renamed tmp/data-5601.csv to tmp/data-5601-new.csv
.renamed tmp/data-8443.csv to tmp/data-8443-new.csv
.renamed tmp/data-1329.csv to tmp/data-1329-new.csv
.renamed tmp/data-5629.csv to tmp/data-5629-new.csv
.renamed tmp/data-8319.csv to tmp/data-8319-new.csv
.renamed tmp/data-6452.csv to tmp/data-6452-new.csv
.renamed tmp/data-2813.csv to tmp/data-2813-new.csv
.renamed tmp/data-2185.csv to tmp/data-2185-new.csv
.renamed tmp/data-4286.csv to tmp/data-4286-new.csv
Done
In this case, we do not see a speed-up. Instead, the task is slower than even the sequential case.
On my system it takes about 2.7 seconds, compared to the 1.36 seconds for the sequential case. This is about 2.0x slower.
[Finished in 2.7s]
[Finished in 2.7s]
[Finished in 2.7s]
Rename Files Concurrently with Processes in Batch
Any data sent to a worker process or received from a worker process must be serialised (pickled).
This adds overhead for each task executed by worker processes.
This is likely the cause of the worse performance seen when using the process pool in the previous section.
We can address this by batching the rename tasks into chunks to be executed by each worker process, just as we did when we batched filenames for the thread pools in a previous section.
Again, this can be achieved by adapting the batched version of ThreadPoolExecutor from a previous section to use the ProcessPoolExecutor.
We will use 8 worker processes. With 10,000 files to be renamed, this means that each worker will receive a batch of 1,250 files to rename.
...
# determine chunksize
n_workers = 8
chunksize = round(len(files) / n_workers)
We can then create the process pool using the parameterized number of workers.
...
# create the process pool
with ProcessPoolExecutor(n_workers) as exe:
    # ...
And that’s it.
The complete example of batch renaming files using process pools is listed below.
# SuperFastPython.com
# rename all files in a directory concurrently with processes in batch
from os import listdir
from os import rename
from os.path import join
from os.path import splitext
from concurrent.futures import ProcessPoolExecutor

# rename files
def rename_files(filenames, src):
    # rename all files
    for filename in filenames:
        # separate filename into name and extension
        name, extension = splitext(filename)
        # devise new filename
        dest_filename = f'{name}-new{extension}'
        # devise source and destination relative paths
        src_path = join(src, filename)
        dest_path = join(src, dest_filename)
        # rename file
        rename(src_path, dest_path)
        # report progress
        print(f'.renamed {src_path} to {dest_path}', flush=True)

# rename files in src
def main(src='tmp'):
    # list of all files to rename
    files = [name for name in listdir(src)]
    # determine chunksize
    n_workers = 8
    chunksize = round(len(files) / n_workers)
    # create the process pool
    with ProcessPoolExecutor(n_workers) as exe:
        # split the rename operations into chunks
        for i in range(0, len(files), chunksize):
            # select a chunk of filenames
            filenames = files[i:(i + chunksize)]
            # submit file rename tasks
            _ = exe.submit(rename_files, filenames, src)
    print('Done')

# entry point
if __name__ == '__main__':
    main()
Running the example renames all of the files as before.
...
.renamed tmp/data-5832.csv to tmp/data-5832-new.csv
.renamed tmp/data-2185.csv to tmp/data-2185-new.csv
.renamed tmp/data-5826.csv to tmp/data-5826-new.csv
.renamed tmp/data-2191.csv to tmp/data-2191-new.csv
.renamed tmp/data-1498.csv to tmp/data-1498-new.csv
.renamed tmp/data-0786.csv to tmp/data-0786-new.csv
.renamed tmp/data-7957.csv to tmp/data-7957-new.csv
.renamed tmp/data-6491.csv to tmp/data-6491-new.csv
.renamed tmp/data-5198.csv to tmp/data-5198-new.csv
.renamed tmp/data-4286.csv to tmp/data-4286-new.csv
Done
We do see a small improvement in performance as compared to the sequential case, although the results are not as good as the batch multithreaded case.
On my system it takes about 976 milliseconds, compared to about 823 milliseconds in the multithreaded case, and compared to 1.36 seconds for the sequential case. That is about 1.39x faster.
It is possible that further improvements can be achieved by tuning the division of rename tasks to the worker processes, e.g. optimising the chunksize (one option, using map() with a chunksize argument, is sketched after the timings below).
[Finished in 928ms]
[Finished in 1.0s]
[Finished in 1.0s]
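As an alternative to splitting the list of filenames manually, the ProcessPoolExecutor.map() function accepts a chunksize argument that batches the submitted items internally. The snippet below is a sketch of this approach using the one-file rename_file() function from earlier; the chunksize of 1,250 simply mirrors the manual split above.

...
# rename files with a process pool, letting map() batch the tasks
with ProcessPoolExecutor(8) as exe:
    files = listdir(src)
    # send filenames to workers in chunks of 1,250 to reduce IPC overhead
    results = exe.map(rename_file, files, [src] * len(files), chunksize=1250)
    # consume the iterator so any exception raised in a worker is re-raised
    list(results)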
Rename Files Concurrently with AsyncIO
We can also rename files concurrently using asyncio.
Generally, Python does not support non-blocking IO operations when working with files. It is not provided in the asyncio module. This is because it is challenging to implement in a general cross-platform manner.
The third-party aiofiles library provides file operations for use in asyncio programs, but again, the operations are not true non-blocking IO; instead, the concurrency is simulated using threads.
Nevertheless, we can use aiofiles in an asyncio program to rename files concurrently with coroutines.
First, we can update the rename_file() function to be an async function and change the call to rename() to use aiofiles.os.rename(), which must be awaited.
The updated version of rename_file() is listed below.
# rename a file
async def rename_file(filename, src):
    # separate filename into name and extension
    name, extension = splitext(filename)
    # devise new filename
    dest_filename = f'{name}-new{extension}'
    # devise source and destination relative paths
    src_path = join(src, filename)
    dest_path = join(src, dest_filename)
    # rename file
    await rename(src_path, dest_path)
    # report progress
    print(f'.renamed {src_path} to {dest_path}', flush=True)
Next, we can update the main() function.
First, we can create a list of coroutines (asynchronous functions) that will execute concurrently. Each coroutine will be a call to the rename_file() function with a separate file to rename.
This can be achieved in a list comprehension, for example:
...
# create one coroutine for each file to rename
tasks = [rename_file(filename, src) for filename in listdir(src)]
Recall that this creates a list of coroutine objects; the asynchronous function is not executed while the list is constructed.
Next, we can execute the list of coroutines and wait for them to complete. This will attempt to perform all 10,000 file rename operations concurrently, allowing each task to yield control while waiting for the rename operation to complete via the await within the rename_file() function.
This can be achieved using the asyncio.gather() function that takes a collection of async function calls. Given that we have a list of these calls, we can unpack the function calls with the star (*) operator.
...
# execute and wait for all coroutines
await asyncio.gather(*tasks)
Finally, we can start the asyncio event loop by calling the asyncio.run() function.
# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
The complete example of renaming files concurrently with asyncio is listed below.
# SuperFastPython.com
# rename all files in a directory concurrently with asyncio
from os import listdir
from os.path import join
from os.path import splitext
from aiofiles.os import rename
import asyncio

# rename a file
async def rename_file(filename, src):
    # separate filename into name and extension
    name, extension = splitext(filename)
    # devise new filename
    dest_filename = f'{name}-new{extension}'
    # devise source and destination relative paths
    src_path = join(src, filename)
    dest_path = join(src, dest_filename)
    # rename file
    await rename(src_path, dest_path)
    # report progress
    print(f'.renamed {src_path} to {dest_path}', flush=True)

# rename files in src
async def main(src='tmp'):
    # create one coroutine for each file to rename
    tasks = [rename_file(filename, src) for filename in listdir(src)]
    # execute and wait for all coroutines
    await asyncio.gather(*tasks)
    print('Done')

# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
Running the example renames all 10,000 files as before.
.renamed tmp/data-2185.csv to tmp/data-2185-new.csv
.renamed tmp/data-2152.csv to tmp/data-2152-new.csv
.renamed tmp/data-4286.csv to tmp/data-4286-new.csv
.renamed tmp/data-4292.csv to tmp/data-4292-new.csv
.renamed tmp/data-0786.csv to tmp/data-0786-new.csv
.renamed tmp/data-4537.csv to tmp/data-4537-new.csv
.renamed tmp/data-0792.csv to tmp/data-0792-new.csv
.renamed tmp/data-5198.csv to tmp/data-5198-new.csv
.renamed tmp/data-6491.csv to tmp/data-6491-new.csv
.renamed tmp/data-1498.csv to tmp/data-1498-new.csv
Done
The asyncio version is slightly slower than the sequential version.
On my system, it takes about 1.63 seconds, compared to about 1.36 seconds for the sequential version, which is about 1.2x slower.
It is possible that the task is slower because of the overhead of the asyncio event loop and the simulated concurrency achieved using threads by the aiofiles library.
This is not a surprising result, although it was an interesting exercise.
[Finished in 1.6s]
[Finished in 1.6s]
[Finished in 1.7s]
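Before turning to batching, another way to rein in 10,000 simultaneous coroutines is to cap concurrency with an asyncio.Semaphore. The sketch below is a hedged alternative (the limit of 100 is arbitrary) that reuses the rename_file() coroutine and imports from the example above.

...
# limit the number of rename coroutines that run at the same time
async def rename_with_limit(filename, src, semaphore):
    # only 100 coroutines may hold the semaphore at once
    async with semaphore:
        await rename_file(filename, src)

# rename files in src with bounded concurrency
async def main(src='tmp'):
    semaphore = asyncio.Semaphore(100)
    # create one guarded coroutine per file
    tasks = [rename_with_limit(name, src, semaphore) for name in listdir(src)]
    # execute and wait for all coroutines
    await asyncio.gather(*tasks)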
Rename Files Concurrently with AsyncIO in Batch
Perhaps we can reduce some of the overhead in renaming files concurrently with asyncio by performing tasks in batch.
In the previous example, each coroutine performed a single IO-bound operation. We can update the rename_file() function to rename_files() to process multiple files in a batch as we did in previous batch processing examples.
This will reduce the number of coroutines in the asyncio runtime and may reduce the amount of data moving around. At the very least, it may be an interesting exercise.
The adaptation of the asyncio example to rename files in batch is very similar to the previous examples that we adapted to operate in batch.
The updated version of the task function that will process a list of files is listed below. The task will yield control for each rename operation, but only to other coroutines. The for-loop within a task will not progress while it is waiting.
# rename files
async def rename_files(filenames, src):
    # rename all files
    for filename in filenames:
        # separate filename into name and extension
        name, extension = splitext(filename)
        # devise new filename
        dest_filename = f'{name}-new{extension}'
        # devise source and destination relative paths
        src_path = join(src, filename)
        dest_path = join(src, dest_filename)
        # rename file
        await rename(src_path, dest_path)
        # report progress
        print(f'.renamed {src_path} to {dest_path}', flush=True)
Next, we can choose the number of coroutines we wish to use and then divide the total number of files up evenly into chunks.
We will use 100 in this case, chosen arbitrarily.
...
# determine chunksize
n_workers = 100
chunksize = round(len(files) / n_workers)
We can then iterate over the chunks and add each coroutine to a list for execution at the end.
...
# split the rename operations into chunks
tasks = list()
for i in range(0, len(files), chunksize):
    # select a chunk of filenames
    filenames = files[i:(i + chunksize)]
    # add coroutine to list of tasks
    tasks.append(rename_files(filenames, src))
Finally, we can execute all 100 coroutines concurrently.
...
# execute and wait for all coroutines
await asyncio.gather(*tasks)
Tying this together, a complete example of batch file renaming using asyncio is listed below.
Again, we don’t expect a speed benefit over the single-threaded example, but perhaps a small lift in performance over the previous example where we had 10,000 coroutines operating concurrently.
# SuperFastPython.com
# rename all files in a directory concurrently with asyncio in batch
from os import listdir
from os.path import join
from os.path import splitext
from aiofiles.os import rename
import asyncio

# rename files
async def rename_files(filenames, src):
    # rename all files
    for filename in filenames:
        # separate filename into name and extension
        name, extension = splitext(filename)
        # devise new filename
        dest_filename = f'{name}-new{extension}'
        # devise source and destination relative paths
        src_path = join(src, filename)
        dest_path = join(src, dest_filename)
        # rename file
        await rename(src_path, dest_path)
        # report progress
        print(f'.renamed {src_path} to {dest_path}', flush=True)

# rename files in src
async def main(src='tmp'):
    # list of all files to rename
    files = [name for name in listdir(src)]
    # determine chunksize
    n_workers = 100
    chunksize = round(len(files) / n_workers)
    # split the rename operations into chunks
    tasks = list()
    for i in range(0, len(files), chunksize):
        # select a chunk of filenames
        filenames = files[i:(i + chunksize)]
        # add coroutine to list of tasks
        tasks.append(rename_files(filenames, src))
    # execute and wait for all coroutines
    await asyncio.gather(*tasks)
    print('Done')

# entry point
if __name__ == '__main__':
    # execute the asyncio run loop
    asyncio.run(main())
Running the example renames all 10,000 files in the tmp/ subdirectory as before.
...
.renamed tmp/data-9616.csv to tmp/data-9616-new.csv
.renamed tmp/data-4286.csv to tmp/data-4286-new.csv
.renamed tmp/data-7010.csv to tmp/data-7010-new.csv
.renamed tmp/data-5014.csv to tmp/data-5014-new.csv
.renamed tmp/data-5462.csv to tmp/data-5462-new.csv
.renamed tmp/data-5026.csv to tmp/data-5026-new.csv
.renamed tmp/data-9700.csv to tmp/data-9700-new.csv
.renamed tmp/data-3237.csv to tmp/data-3237-new.csv
.renamed tmp/data-2664.csv to tmp/data-2664-new.csv
.renamed tmp/data-3413.csv to tmp/data-3413-new.csv
Done
In this case, we may see a small improvement in performance over the non-batch asyncio example, although perhaps still slower than the single-threaded example we developed first.
On my system, the example takes about 1.5 seconds, which may be better than the non-batch version that took 1.63 seconds, or may be within the margin of error. That is about 1.1x slower.
[Finished in 1.5s]
[Finished in 1.5s]
[Finished in 1.5s]
Further Reading
This section provides additional resources that you may find helpful.
Books
- Concurrent File I/O in Python, Jason Brownlee (my book!)
Guides
Python File I/O APIs
- Built-in Functions
- os - Miscellaneous operating system interfaces
- os.path - Common pathname manipulations
- shutil - High-level file operations
- zipfile — Work with ZIP archives
- Python Tutorial: Chapter 7. Input and Output
Python Concurrency APIs
- threading — Thread-based parallelism
- multiprocessing — Process-based parallelism
- concurrent.futures — Launching parallel tasks
- asyncio — Asynchronous I/O
File I/O in Asyncio
Takeaways
In this tutorial, you discovered how to use multithreading to rename thousands of files.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.