Last Updated on September 12, 2022
The ProcessPoolExecutor class in Python can be used to search multiple text files at the same time.
This can dramatically speed up your program compared to searching text files sequentially, one by one.
In this tutorial, you will discover how to search text files concurrently using a pool of worker processes.
Let’s dive in.
How to Search Text Files One At A Time (slowly)
Searching ASCII text files is a common task we might perform.
Most modern operating systems provide some interface for searching the contents of files in a directory, but it is often terribly slow.
We might want to write a program to search the contents of a directory of text files to give more control over the process, such as over the search itself or how to use the results.
Let’s explore this project as the basis for how to use the ProcessPoolExecutor class effectively to execute tasks concurrently.
Directory of Text Files
Firstly, we need a directory of text files.
For the purposes of this tutorial, I recommend downloading the top 100 free books from Project Gutenberg:
Below is a .zip file that contains most of the top books from the site at the time of writing.
Create a new directory named ‘txt/‘ in your current working directory, e.g. where you are creating your Python scripts.
Unzip the contents of the .zip file into this new directory. You should see one file per book, like the following:
```
10-0.txt
11-0.txt
16-0.txt
23-0.txt
35-0.txt
...
```
Next, we can develop some code to load and search through each book file for a query word or token.
Load a Text File
First, let’s open and load a given file.
We can achieve this using the built-in open() function, specifying the path of the file and the encoding, such as UTF-8.
We typically open files via a context manager, to ensure the file is closed correctly once we leave the block.
Once open, we can read the contents and return it as a string via the read() function.
The load_txt_file() function below implements this, taking a file path and returning the contents of the file as a string.
```python
# load the doc
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()
```
Search a Text File
Next, we can search the contents of the text file.
One approach would be to split the file into lines, then search each line. This can be achieved via the splitlines() function for strings.
```python
...
# split contents into lines
lines = content.splitlines()
```
We can then return only those lines that contain the given query.
This is the function where we might customize the manner in which we search the contents of the files, e.g. use regex or similar.
To make things simple, we can also make the search case insensitive by making both the query and the line of text lower case when checking for a match.
```python
...
# find all lines that contain the query
lower_query = query.lower()
matches = [line for line in lines if lower_query in line.lower()]
```
Tying this together, the search_file_contents() function below takes the contents of a file and query and returns a list of all lines that contain a match.
```python
# return all lines in the contents that contain the query
def search_file_contents(content, query):
    # split contents into lines
    lines = content.splitlines()
    # find all lines that contain the query
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]
```
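As a quick sanity check, we can exercise this function on an in-memory string. The snippet below reproduces the function so it runs on its own; the sample text is made up:

```python
# return all lines in the contents that contain the query
def search_file_contents(content, query):
    lines = content.splitlines()
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

# try it on a small made-up document
sample = 'The Argonautica\nJason sailed forth\nnothing to see here\nJASON returns'
print(search_file_contents(sample, 'jason'))
# ['Jason sailed forth', 'JASON returns']
```

Note that the match is case insensitive, so both 'Jason' and 'JASON' lines are returned.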
Task Function
We can then define a task function that performs the search operation for one file.
That is, given a file path and a query, load the file and perform the search and return all matches.
The search_txt_file() function below implements this.
```python
# open a text file and return all lines that contain the query
def search_txt_file(filepath, query):
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    return search_file_contents(content, query)
```
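If you want to try the task function end-to-end without downloading the Gutenberg books, you can write a throwaway text file and search it. A self-contained sketch (the helper functions from above are reproduced, and the sample text is made up):

```python
import os
import tempfile

# helpers reproduced from above so this snippet runs on its own
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()

def search_file_contents(content, query):
    lines = content.splitlines()
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

def search_txt_file(filepath, query):
    return search_file_contents(load_txt_file(filepath), query)

# write a throwaway file, search it, then clean up
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('The Argonautica\nJason sailed forth\nno mention on this line\n')
    path = tmp.name
try:
    matches = search_txt_file(path, 'jason')
    print(matches)  # ['Jason sailed forth']
finally:
    os.remove(path)
```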
Search All Files in a Directory
Next, we can search each text file in a directory.
First, we can iterate over all files in a directory using the os.listdir() function that takes the path to the directory and returns each filename in the directory.
```python
...
# search each file in the directory
for filename in listdir(dirpath):
    # ...
```
For each filename, we can then construct a path to the file using the os.path.join() function to concatenate the directory path with the filename.
```python
...
# construct a path
filepath = join(dirpath, filename)
```
We can then call our search_txt_file() function with the file path and a query to get all matching lines.
```python
...
# get all results of query in the file
results = search_txt_file(filepath, query)
```
If we have results, we can report them with some indenting to make things readable.
```python
...
# report results
if len(results) > 0:
    print(f'>{len(results)} results found in {filename}')
    for line in results:
        print(f'\t{line}')
```
Tying this together, the search_txt_files() function below takes a directory path and a query, searches all text files in the directory, and reports all lines in each file that match the query.
```python
# search all txt files in a directory
def search_txt_files(dirpath, query):
    # search each file in the directory
    for filename in listdir(dirpath):
        # construct a path
        filepath = join(dirpath, filename)
        # get all results of query in the file
        results = search_txt_file(filepath, query)
        # report results
        if len(results) > 0:
            print(f'>{len(results)} results found in {filename}')
            for line in results:
                print(f'\t{line}')
```
Complete Example of Searching Text Files
We can tie all of this together into a complete example.
We need to specify the directory path and the query, then call our search_txt_files() function.
This can be done in the entry point, for example:
```python
# entry point
if __name__ == '__main__':
    # location of txt files
    DIRPATH = 'txt'
    # text to search for
    QUERY = 'jason'
    # search for query in all text files
    search_txt_files(DIRPATH, QUERY)
```
The complete example of searching text files is listed below.
```python
# SuperFastPython.com
# search all text files in a directory for a query
from os import listdir
from os.path import join

# load the doc
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()

# return all lines in the contents that contain the query
def search_file_contents(content, query):
    # split contents into lines
    lines = content.splitlines()
    # find all lines that contain the query
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

# open a text file and return all lines that contain the query
def search_txt_file(filepath, query):
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    return search_file_contents(content, query)

# search all txt files in a directory
def search_txt_files(dirpath, query):
    # search each file in the directory
    for filename in listdir(dirpath):
        # construct a path
        filepath = join(dirpath, filename)
        # get all results of query in the file
        results = search_txt_file(filepath, query)
        # report results
        if len(results) > 0:
            print(f'>{len(results)} results found in {filename}')
            for line in results:
                print(f'\t{line}')

# entry point
if __name__ == '__main__':
    # location of txt files
    DIRPATH = 'txt'
    # text to search for
    QUERY = 'jason'
    # search for query in all text files
    search_txt_files(DIRPATH, QUERY)
```
Running the example searches the 80+ text files in the directory for the query “jason” (feel free to change the query to anything you like).
In this case, we can see 7 files containing lines with the query.
The example runs relatively quickly, completing in about 800 milliseconds on my system.
We can imagine that it would be dramatically slower if we had thousands or millions of text files to search.
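If you want to measure the runtime on your own system, a small helper around time.perf_counter() can time any call. A sketch (the search_txt_files call in the comment assumes the program above has been defined):

```python
import time

def timed(fn, *args, **kwargs):
    # measure the wall-clock duration of a single call
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration = time.perf_counter() - start
    return result, duration

# e.g.:
# _, seconds = timed(search_txt_files, 'txt', 'jason')
# print(f'took {seconds:.3f} seconds')
```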
It might be more interesting if we reported the book title (on the first line of the file) along with the filenames in the results.
```
>5 results found in 10-0.txt
	set all the city on an uproar, and assaulted the house of Jason, and
	17:6 And when they found them not, they drew Jason and certain
	the world upside down are come hither also; 17:7 Whom Jason hath
	17:9 And when they had taken security of Jason, and of the other, they
	16:21 Timotheus my workfellow, and Lucius, and Jason, and Sosipater,
>1 results found in 3600-0.txt
	Sometimes she plays the physician. Jason of Pheres being given over by
>1 results found in 408-0.txt
	after which Jason and his Argonauts went vaguely wandering into the
>2 results found in 3207-0.txt
	St. Luke) Jason and certain Brethren unto the Rulers of the City,
	also, whom Jason hath received. And these all do contrary to the Decrees
>1 results found in 996-0.txt
	Rich spoils than Jason’s; who a point so keen
>3 results found in 6130-0.txt
	and Jason in Apollonius, and several others in the same manner.
	To Jason, shepherd of his people, bore,)
	Where Jason’s son the price demanded gave;
>1 results found in 1727-0.txt
	to Jason.
```
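One way to realize the title idea: a Project Gutenberg text file typically carries the title on its first line, so a variation of the search function could return it alongside the matches. A minimal sketch (the helper name and sample text are hypothetical):

```python
# return the first line (assumed to be the title) plus all matching lines
def search_with_title(content, query):
    lines = content.splitlines()
    title = lines[0] if lines else ''
    lower_query = query.lower()
    matches = [line for line in lines if lower_query in line.lower()]
    return (title, matches)

sample = 'The Argonautica\nto Jason.\nno match on this line'
print(search_with_title(sample, 'jason'))
# ('The Argonautica', ['to Jason.'])
```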
Next, let’s look at how we can adapt our program for searching text files concurrently with the ProcessPoolExecutor.
Search Multiple Text Files Concurrently
The program for searching text files can be adapted to use the ProcessPoolExecutor with just a few changes.
The search_txt_file() function can be our task function, and we will submit one task per file in the directory. Each task will load the contents of the file from disk and search the contents in a separate Python process, then return the matches.
One approach is to use the submit() function to dispatch the tasks to the process pool. This will give one Future object for each task. We can then use as_completed() to report results as tasks are completed, making the program responsive.
To implement this approach we will have to make some changes to the search_txt_file() target task function.
First, we will need to construct the file path within the function, meaning the filename and directory path will need to be provided as arguments.
Secondly, we can return both the matches found in the contents of the file and the file name, both details that will be needed for reporting. We could avoid this change if the search_txt_files() that dispatches tasks associates the filename with each Future object, e.g. via a dictionary.
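For illustration, the dictionary-based alternative just mentioned can be sketched by mapping each Future object back to the name it was submitted for. The worker below is a hypothetical stand-in that searches an in-memory string rather than loading a file, so the snippet runs on its own:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# hypothetical stand-in for search_txt_file: counts occurrences of the
# query in an in-memory string instead of loading a real file
def count_matches(text, query):
    return text.lower().count(query.lower())

# map each Future back to the name it was submitted for, so the
# worker itself does not need to return the filename
def search_all(texts, query):
    results = {}
    with ProcessPoolExecutor() as exe:
        future_to_name = {exe.submit(count_matches, text, query): name
                          for name, text in texts.items()}
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            results[name] = future.result()
    return results

if __name__ == '__main__':
    books = {'a.txt': 'Jason and the Argonauts', 'b.txt': 'no match here'}
    print(search_all(books, 'jason'))  # {'a.txt': 1, 'b.txt': 0} (order may vary)
```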
The updated version of the search_txt_file() with these changes is listed below.
```python
# open a text file and return all lines that contain the query
def search_txt_file(filename, dirpath, query):
    # construct a path
    filepath = join(dirpath, filename)
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    matches = search_file_contents(content, query)
    return (filename, matches)
```
Next, we need to submit calls to this updated search_txt_file() function to the process pool for asynchronous execution.
We can create the process pool with the default number of processes, e.g. matching the number of logical CPU cores Python can detect in your system.
```python
...
# create process pool
with ProcessPoolExecutor() as exe:
    # ...
```
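As an aside, if you want to size the pool explicitly rather than accept the default, ProcessPoolExecutor accepts a max_workers argument. A small sketch (the cap of four workers here is arbitrary):

```python
from concurrent.futures import ProcessPoolExecutor
import os

# by default the pool sizes itself to the number of logical CPUs;
# pass max_workers to choose the size explicitly (4 here is arbitrary)
n_workers = min(4, os.cpu_count() or 1)
with ProcessPoolExecutor(max_workers=n_workers) as exe:
    pass  # submit tasks as usual
```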
Next, we can submit one call to search_txt_file() for each filename returned from os.listdir(dirpath) and collect the Future objects into a list.
```python
...
# search text files asynchronously
futures = [exe.submit(search_txt_file, f, dirpath, query) for f in listdir(dirpath)]
```
Next, we can process the results as they become available by iterating over the Future objects in the order they complete, using the as_completed() function.
```python
...
# process results
for future in as_completed(futures):
    # ...
```
For each Future object, we can first get the result as a tuple of filename and query matches, then report the results with print() statements as we did in the serial version.
```python
...
# get results
filename, matches = future.result()
# report results
if len(matches) > 0:
    print(f'>{len(matches)} results found in {filename}')
    for line in matches:
        print(f'\t{line}')
```
Tying this together, the updated version of the search_txt_files() function that searches text files for a query asynchronously using the process pool is listed below.
```python
# search all txt files in a directory
def search_txt_files(dirpath, query):
    # create process pool
    with ProcessPoolExecutor() as exe:
        # search text files asynchronously
        futures = [exe.submit(search_txt_file, f, dirpath, query) for f in listdir(dirpath)]
        # process results
        for future in as_completed(futures):
            # get results
            filename, matches = future.result()
            # report results
            if len(matches) > 0:
                print(f'>{len(matches)} results found in {filename}')
                for line in matches:
                    print(f'\t{line}')
```
And that’s it.
The complete example is listed below.
```python
# SuperFastPython.com
# search all text files in a directory for a query concurrently
from os import listdir
from os.path import join
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed

# load the doc
def load_txt_file(filepath):
    with open(filepath, encoding='utf-8') as file:
        return file.read()

# return all lines in the contents that contain the query
def search_file_contents(content, query):
    # split contents into lines
    lines = content.splitlines()
    # find all lines that contain the query
    lower_query = query.lower()
    return [line for line in lines if lower_query in line.lower()]

# open a text file and return all lines that contain the query
def search_txt_file(filename, dirpath, query):
    # construct a path
    filepath = join(dirpath, filename)
    # open the file
    content = load_txt_file(filepath)
    # search the contents
    matches = search_file_contents(content, query)
    return (filename, matches)

# search all txt files in a directory
def search_txt_files(dirpath, query):
    # create process pool
    with ProcessPoolExecutor() as exe:
        # search text files asynchronously
        futures = [exe.submit(search_txt_file, f, dirpath, query) for f in listdir(dirpath)]
        # process results
        for future in as_completed(futures):
            # get results
            filename, matches = future.result()
            # report results
            if len(matches) > 0:
                print(f'>{len(matches)} results found in {filename}')
                for line in matches:
                    print(f'\t{line}')

# entry point
if __name__ == '__main__':
    # location of txt files
    DIRPATH = 'txt'
    # text to search for
    QUERY = 'jason'
    # search for query in all text files
    search_txt_files(DIRPATH, QUERY)
```
Running the example submits a call into the process pool for each file in the directory.
The program reports the same results as the serial example above, although much faster.
In this case, the program takes approximately 400 milliseconds to complete, about half the time of the serial case.
How long does it take to run on your system?
Let me know in the comments below.
Experiment with different queries; I’d love to hear what you discover.
```
>1 results found in 3600-0.txt
	Sometimes she plays the physician. Jason of Pheres being given over by
>5 results found in 10-0.txt
	set all the city on an uproar, and assaulted the house of Jason, and
	17:6 And when they found them not, they drew Jason and certain
	the world upside down are come hither also; 17:7 Whom Jason hath
	17:9 And when they had taken security of Jason, and of the other, they
	16:21 Timotheus my workfellow, and Lucius, and Jason, and Sosipater,
>1 results found in 408-0.txt
	after which Jason and his Argonauts went vaguely wandering into the
>2 results found in 3207-0.txt
	St. Luke) Jason and certain Brethren unto the Rulers of the City,
	also, whom Jason hath received. And these all do contrary to the Decrees
>3 results found in 6130-0.txt
	and Jason in Apollonius, and several others in the same manner.
	To Jason, shepherd of his people, bore,)
	Where Jason’s son the price demanded gave;
>1 results found in 1727-0.txt
	to Jason.
>1 results found in 996-0.txt
	Rich spoils than Jason’s; who a point so keen
```
Extensions
This section lists ideas for extending the tutorial.
- Show book title. Update the example to report the book title for text files that have a match.
- Map Future objects to filenames. Update the search_txt_files() to map filenames to Future objects so that results can be reported and search_txt_file() only needs to return matches.
- Download books, then search. Update the example to first download books from gutenberg.org using a ThreadPoolExecutor, then search the contents of each book using a ProcessPoolExecutor.
Share your extensions in the comments below, it would be great to see what you come up with.
Free Python ProcessPoolExecutor Course
Download your FREE ProcessPoolExecutor PDF cheat sheet and get BONUS access to my free 7-day crash course on the ProcessPoolExecutor API.
Discover how to use the ProcessPoolExecutor class including how to configure the number of workers and how to execute tasks asynchronously.
Further Reading
This section provides additional resources that you may find helpful.
Books
- ProcessPoolExecutor Jump-Start, Jason Brownlee (my book!)
- Concurrent Futures API Interview Questions
- ProcessPoolExecutor PDF Cheat Sheet
I also recommend specific chapters from the following books:
- Effective Python, Brett Slatkin, 2019.
- See Chapter 7: Concurrency and Parallelism
- Python in a Nutshell, Alex Martelli, et al., 2017.
- See Chapter 14: Threads and Processes
Guides
- Python ProcessPoolExecutor: The Complete Guide
- Python ThreadPoolExecutor: The Complete Guide
- Python Multiprocessing: The Complete Guide
- Python Pool: The Complete Guide
References
- Thread (computing), Wikipedia.
- Process (computing), Wikipedia.
- Thread Pool, Wikipedia.
- Futures and promises, Wikipedia.
Takeaways
In this tutorial you discovered how to search text files concurrently using a pool of worker processes.
Do you have any questions?
Leave your question in a comment below and I will reply fast with my best advice.
Photo by Christopher Jolly on Unsplash