Last Updated on September 12, 2022
The ProcessPoolExecutor class in Python can be used to search multiple text files at the same time.
This can dramatically speed up your program compared to searching text files sequentially, one by one.
In this tutorial, you will discover how to search text files concurrently using a pool of worker processes.
Let’s dive in.
How to Search Text Files One At A Time (slowly)
Searching ASCII text files is a common task we might perform.
Most modern operating systems provide some interface for searching the contents of files in a directory, but it is often terribly slow.
We might want to write a program to search the contents of a directory of text files to give more control over the process, such as over the search itself or how to use the results.
Let’s explore this project as the basis for how to use the ProcessPoolExecutor class effectively to execute tasks concurrently.
Directory of Text Files
Firstly, we need a directory of text files.
For the purposes of this tutorial, I recommend downloading the top 100 free books from Project Gutenberg:
Below is a .zip file that contains most of the top books from the site at the time of writing.
Create a new directory named ‘txt/‘ in your current working directory, e.g. where you are creating your Python scripts.
Unzip the contents of the .zip file into this new directory. You should see one file per book, like the following:
```
10-0.txt
11-0.txt
16-0.txt
23-0.txt
35-0.txt
...
```
Next, we can develop some code to load and search through each book file for a query word or token.
Load a Text File
First, let’s open and load a given file.
We can achieve this using the built-in open() function, specifying the path of the file and the encoding, such as UTF-8.
We typically open files via a context manager, to ensure the file is closed correctly once we leave the block.
Once open, we can read the contents and return it as a string via the read() function.
The load_txt_file() function below implements this, taking a file path and returning the contents of the file as a string.
```python
# load the doc
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()
```
Search a Text File
Next, we can search the contents of the text file.
One approach would be to split the file into lines, then search each line. This can be achieved via the splitlines() function for strings.
```python
...
# split contents into lines
lines = content.splitlines()
```
We can then return only those lines that contain the given query.
This is the function where we might customize the manner in which we search the contents of the files, e.g. use regex or similar.
To make things simple, we can also make the search case insensitive by making both the query and the line of text lower case when checking for a match.
```python
...
# find all lines that contain the query
lower_query = query.lower()
matches = [line for line in lines if lower_query in line.lower()]
```
Tying this together, the search_file_contents() function below takes the contents of a file and query and returns a list of all lines that contain a match.
```python
# return all lines in the contents that contain the query
def search_file_contents(content, query):
    # split contents into lines
    lines = content.splitlines()
    # find all lines that contain the query
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]
```
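As a quick sanity check, we can exercise this function on an in-memory string. The snippet below reproduces the function so it runs on its own; the sample text is made up:

```python
# return all lines in the contents that contain the query
def search_file_contents(content, query):
    lines = content.splitlines()
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

# try it on a small made-up document
sample = 'The Argonautica\nJason sailed forth\nnothing to see here\nJASON returns'
print(search_file_contents(sample, 'jason'))
# ['Jason sailed forth', 'JASON returns']
```

Note that the match is case insensitive, so both 'Jason' and 'JASON' lines are returned.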
Task Function
We can then define a task function that performs the search operation for one file.
That is, given a file path and a query, load the file and perform the search and return all matches.
The search_txt_file() function below implements this.
```python
# open a text file and return all lines that contain the query
def search_txt_file(filepath, query):
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    return search_file_contents(content, query)
```
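If you want to try the task function end-to-end without downloading the Gutenberg books, you can write a throwaway text file and search it. A self-contained sketch (the helper functions from above are reproduced, and the sample text is made up):

```python
import os
import tempfile

# helpers reproduced from above so this snippet runs on its own
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()

def search_file_contents(content, query):
    lines = content.splitlines()
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

def search_txt_file(filepath, query):
    return search_file_contents(load_txt_file(filepath), query)

# write a throwaway file, search it, then clean up
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('The Argonautica\nJason sailed forth\nno mention on this line\n')
    path = tmp.name
try:
    matches = search_txt_file(path, 'jason')
    print(matches)  # ['Jason sailed forth']
finally:
    os.remove(path)
```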
Search All Files in a Directory
Next, we can search each text file in a directory.
First, we can iterate over all files in a directory using the os.listdir() function that takes the path to the directory and returns each filename in the directory.
```python
...
# search each file in the directory
for filename in listdir(dirpath):
    # ...
```
For each filename, we can then construct a path to the file using the os.path.join() function to concatenate the directory path with the filename.
```python
...
# construct a path
filepath = join(dirpath, filename)
```
We can then call our search_txt_file() function with the file path and a query to get all matching lines.
```python
...
# get all results of query in the file
results = search_txt_file(filepath, query)
```
If we have results, we can report them with some indenting to make things readable.
```python
...
# report results
if len(results) > 0:
    print(f'>{len(results)} results found in {filename}')
    for line in results:
        print(f'\t{line}')
```
Tying this together, the search_txt_files() function below takes a directory path and a query, searches all text files in the directory, and reports all lines in each file that match the query.
```python
# search all txt files in a directory
def search_txt_files(dirpath, query):
    # search each file in the directory
    for filename in listdir(dirpath):
        # construct a path
        filepath = join(dirpath, filename)
        # get all results of query in the file
        results = search_txt_file(filepath, query)
        # report results
        if len(results) > 0:
            print(f'>{len(results)} results found in {filename}')
            for line in results:
                print(f'\t{line}')
```
Complete Example of Searching Text Files
We can tie all of this together into a complete example.
We need to specify the directory path and the query, then call our search_txt_files() function.
This can be done in the entry point, for example:
```python
# entry point
if __name__ == '__main__':
    # location of txt files
    DIRPATH = 'txt'
    # text to search for
    QUERY = 'jason'
    # search for query in all text files
    search_txt_files(DIRPATH, QUERY)
```
The complete example of searching text files is listed below.
```python
# SuperFastPython.com
# search all text files in a directory for a query
from os import listdir
from os.path import join

# load the doc
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()

# return all lines in the contents that contain the query
def search_file_contents(content, query):
    # split contents into lines
    lines = content.splitlines()
    # find all lines that contain the query
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

# open a text file and return all lines that contain the query
def search_txt_file(filepath, query):
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    return search_file_contents(content, query)

# search all txt files in a directory
def search_txt_files(dirpath, query):
    # search each file in the directory
    for filename in listdir(dirpath):
        # construct a path
        filepath = join(dirpath, filename)
        # get all results of query in the file
        results = search_txt_file(filepath, query)
        # report results
        if len(results) > 0:
            print(f'>{len(results)} results found in {filename}')
            for line in results:
                print(f'\t{line}')

# entry point
if __name__ == '__main__':
    # location of txt files
    DIRPATH = 'txt'
    # text to search for
    QUERY = 'jason'
    # search for query in all text files
    search_txt_files(DIRPATH, QUERY)
```
Running the example searches the 80+ text files in the directory for the query “jason” (feel free to change the query to anything you like).
In this case, we can see 7 files containing lines with the query.
The example runs relatively quickly, completing in about 800 milliseconds on my system.
We can imagine that it would be dramatically slower if we had thousands or millions of text files to search.
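If you want to measure the runtime on your own system, a small helper around time.perf_counter() can time any call. A sketch (the search_txt_files call in the comment assumes the program above has been defined):

```python
import time

def timed(fn, *args, **kwargs):
    # measure the wall-clock duration of a single call
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration = time.perf_counter() - start
    return result, duration

# e.g.:
# _, seconds = timed(search_txt_files, 'txt', 'jason')
# print(f'took {seconds:.3f} seconds')
```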
It might be more interesting if we reported the book title (on the first line of the file) along with the filenames in the results.
```
>5 results found in 10-0.txt
	set all the city on an uproar, and assaulted the house of Jason, and
	17:6 And when they found them not, they drew Jason and certain
	the world upside down are come hither also; 17:7 Whom Jason hath
	17:9 And when they had taken security of Jason, and of the other, they
	16:21 Timotheus my workfellow, and Lucius, and Jason, and Sosipater,
>1 results found in 3600-0.txt
	Sometimes she plays the physician. Jason of Pheres being given over by
>1 results found in 408-0.txt
	after which Jason and his Argonauts went vaguely wandering into the
>2 results found in 3207-0.txt
	St. Luke) Jason and certain Brethren unto the Rulers of the City,
	also, whom Jason hath received. And these all do contrary to the Decrees
>1 results found in 996-0.txt
	Rich spoils than Jason’s; who a point so keen
>3 results found in 6130-0.txt
	and Jason in Apollonius, and several others in the same manner.
	To Jason, shepherd of his people, bore,)
	Where Jason’s son the price demanded gave;
>1 results found in 1727-0.txt
	to Jason.
```
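One way to realize the title idea: a Project Gutenberg text file typically carries the title on its first line, so a variation of the search function could return it alongside the matches. A minimal sketch (the helper name and sample text are hypothetical):

```python
# return the first line (assumed to be the title) plus all matching lines
def search_with_title(content, query):
    lines = content.splitlines()
    title = lines[0] if lines else ''
    lower_query = query.lower()
    matches = [line for line in lines if lower_query in line.lower()]
    return (title, matches)

sample = 'The Argonautica\nto Jason.\nno match on this line'
print(search_with_title(sample, 'jason'))
# ('The Argonautica', ['to Jason.'])
```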
Next, let’s look at how we can adapt our program for searching text files concurrently with the ProcessPoolExecutor.
Search Multiple Text Files Concurrently
The program for searching text files can be adapted to use the ProcessPoolExecutor with just a few changes.
The search_txt_file() function can be our task function, and we will submit one task per file in the directory. Each task will load the contents of the file from disk and search the contents in a separate Python process, then return the matches.
One approach is to use the submit() function to dispatch the tasks to the process pool. This will give one Future object for each task. We can then use as_completed() to report results as tasks are completed, making the program responsive.
To implement this approach we will have to make some changes to the search_txt_file() target task function.
First, we will need to construct the file path within the function, meaning the filename and directory path will need to be provided as arguments.
Secondly, we can return both the matches found in the contents of the file and the file name, both details that will be needed for reporting. We could avoid this change if the search_txt_files() that dispatches tasks associates the filename with each Future object, e.g. via a dictionary.
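For illustration, the dictionary-based alternative just mentioned can be sketched by mapping each Future object back to the name it was submitted for. The worker below is a hypothetical stand-in that searches an in-memory string rather than loading a file, so the snippet runs on its own:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# hypothetical stand-in for search_txt_file: counts occurrences of the
# query in an in-memory string instead of loading a real file
def count_matches(text, query):
    return text.lower().count(query.lower())

# map each Future back to the name it was submitted for, so the
# worker itself does not need to return the filename
def search_all(texts, query):
    results = {}
    with ProcessPoolExecutor() as exe:
        future_to_name = {exe.submit(count_matches, text, query): name
                          for name, text in texts.items()}
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            results[name] = future.result()
    return results

if __name__ == '__main__':
    books = {'a.txt': 'Jason and the Argonauts', 'b.txt': 'no match here'}
    print(search_all(books, 'jason'))  # {'a.txt': 1, 'b.txt': 0} (order may vary)
```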
The updated version of the search_txt_file() with these changes is listed below.
```python
# open a text file and return all lines that contain the query
def search_txt_file(filename, dirpath, query):
    # construct a path
    filepath = join(dirpath, filename)
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    matches = search_file_contents(content, query)
    return (filename, matches)
```
Next, we need to submit calls to this updated search_txt_file() function to the process pool for asynchronous execution.
We can create the process pool with the default number of processes, e.g. matching the number of logical CPU cores Python can detect in your system.
```python
...
# create process pool
with ProcessPoolExecutor() as exe:
    # ...
```
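As an aside, if you want to size the pool explicitly rather than accept the default, ProcessPoolExecutor accepts a max_workers argument. A small sketch (the cap of four workers here is arbitrary):

```python
from concurrent.futures import ProcessPoolExecutor
import os

# by default the pool sizes itself to the number of logical CPUs;
# pass max_workers to choose the size explicitly (4 here is arbitrary)
n_workers = min(4, os.cpu_count() or 1)
with ProcessPoolExecutor(max_workers=n_workers) as exe:
    pass  # submit tasks as usual
```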
Next, we can submit one call to search_txt_file() for each filename returned from os.listdir(dirpath) and collect the Future objects into a list.
```python
...
# search text files asynchronously
futures = [exe.submit(search_txt_file, f, dirpath, query) for f in listdir(dirpath)]
```
Next, we can process the results as they become available by iterating over the Future objects in the order they complete, using the as_completed() function.
```python
...
# process results
for future in as_completed(futures):
    # ...
```
For each Future object, we can first get the result as a tuple of filename and query matches, then report the results with print() statements as we did in the serial version.
```python
...
# get results
filename, matches = future.result()
# report results
if len(matches) > 0:
    print(f'>{len(matches)} results found in {filename}')
    for line in matches:
        print(f'\t{line}')
```
Tying this together, the updated version of the search_txt_files() function that searches text files for a query asynchronously using the process pool is listed below.
```python
# search all txt files in a directory
def search_txt_files(dirpath, query):
    # create process pool
    with ProcessPoolExecutor() as exe:
        # search text files asynchronously
        futures = [exe.submit(search_txt_file, f, dirpath, query) for f in listdir(dirpath)]
        # process results
        for future in as_completed(futures):
            # get results
            filename, matches = future.result()
            # report results
            if len(matches) > 0:
                print(f'>{len(matches)} results found in {filename}')
                for line in matches:
                    print(f'\t{line}')
```
And that’s it.
The complete example is listed below.
```python
# SuperFastPython.com
# search all text files in a directory for a query concurrently
from os import listdir
from os.path import join
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed

# load the doc
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()

# return all lines in the contents that contain the query
def search_file_contents(content, query):
    # split contents into lines
    lines = content.splitlines()
    # find all lines that contain the query
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

# open a text file and return all lines that contain the query
def search_txt_file(filename, dirpath, query):
    # construct a path
    filepath = join(dirpath, filename)
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    matches = search_file_contents(content, query)
    return (filename, matches)

# search all txt files in a directory
def search_txt_files(dirpath, query):
    # create process pool
    with ProcessPoolExecutor() as exe:
        # search text files asynchronously
        futures = [exe.submit(search_txt_file, f, dirpath, query) for f in listdir(dirpath)]
        # process results
        for future in as_completed(futures):
            # get results
            filename, matches = future.result()
            # report results
            if len(matches) > 0:
                print(f'>{len(matches)} results found in {filename}')
                for line in matches:
                    print(f'\t{line}')

# entry point
if __name__ == '__main__':
    # location of txt files
    DIRPATH = 'txt'
    # text to search for
    QUERY = 'jason'
    # search for query in all text files
    search_txt_files(DIRPATH, QUERY)
```
Running the example submits a call into the process pool for each file in the directory.
The program reports the same results as the serial example above, although much faster.
In this case, the program takes approximately 400 milliseconds to complete, about half the time of the serial case.
How long does it take to run on your system?
Let me know in the comments below.
Experiment with different queries; I’d love to hear what you discover.
```
>1 results found in 3600-0.txt
	Sometimes she plays the physician. Jason of Pheres being given over by
>5 results found in 10-0.txt
	set all the city on an uproar, and assaulted the house of Jason, and
	17:6 And when they found them not, they drew Jason and certain
	the world upside down are come hither also; 17:7 Whom Jason hath
	17:9 And when they had taken security of Jason, and of the other, they
	16:21 Timotheus my workfellow, and Lucius, and Jason, and Sosipater,
>1 results found in 408-0.txt
	after which Jason and his Argonauts went vaguely wandering into the
>2 results found in 3207-0.txt
	St. Luke) Jason and certain Brethren unto the Rulers of the City,
	also, whom Jason hath received. And these all do contrary to the Decrees
>3 results found in 6130-0.txt
	and Jason in Apollonius, and several others in the same manner.
	To Jason, shepherd of his people, bore,)
	Where Jason’s son the price demanded gave;
>1 results found in 1727-0.txt
	to Jason.
>1 results found in 996-0.txt
	Rich spoils than Jason’s; who a point so keen
```
Extensions
This section lists ideas for extending the tutorial.
- Show book title. Update the example to report the book title for text files that have a match.
- Map Future objects to filenames. Update the search_txt_files() to map filenames to Future objects so that results can be reported and search_txt_file() only needs to return matches.
- Download books, then search. Update the example to first download books from gutenberg.org using a ThreadPoolExecutor, then search the contents of each book using a ProcessPoolExecutor.
Share your extensions in the comments below, it would be great to see what you come up with.
Free Python ProcessPoolExecutor Course
Download your FREE ProcessPoolExecutor PDF cheat sheet and get BONUS access to my free 7-day crash course on the ProcessPoolExecutor API.
Discover how to use the ProcessPoolExecutor class including how to configure the number of workers and how to execute tasks asynchronously.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ProcessPoolExecutor Jump-Start, Jason Brownlee (my book!)
- Concurrent Futures API Interview Questions
- ProcessPoolExecutor PDF Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ProcessPoolExecutor: The Complete Guide
- Python ThreadPoolExecutor: The Complete Guide
- Python Multiprocessing: The Complete Guide
- Python Pool: The Complete Guide
References
- Thread (computing), Wikipedia.
- Process (computing), Wikipedia.
- Thread Pool, Wikipedia.
- Futures and promises, Wikipedia.
Takeaways
In this tutorial you discovered how to search text files concurrently using a pool of worker processes.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Photo by Christopher Jolly on Unsplash