
How to Create a Matching Program with Python

It’s a common problem, especially on the web, to validate entered form data and to filter out the important information from the input.

Of course, this is also possible with normal string operations, but this problem can be solved elegantly and with relatively little code using regular expressions.

 

Our sample Python program is supposed to read all relevant information from a kind of electronic business card and prepare it in a machine-readable form. The business card is saved in a text file in the following format:

 

Name: John Doe
Addr: Samplestreet 123
      12345 Sampletown
P:    +1 781 228 5070

 

The program should read this text file, extract the information it contains, and store it in a dictionary like the following:

 

{
    'Name': ('John', 'Doe'),
    'Addr': ('Samplestreet', '123', '12345', 'Sampletown'),
    'P': ('+1', '781', '228', '5070')
}

 

We assume that the text file always contains only one data record.

 

Let’s first go into more detail about how the sample program works. The business card consists of various pieces of information, each introduced by a heading or category ("Name", "Addr", and "P"). Separating the category from the information isn’t complicated: the colon doesn’t occur within the category names, so the first colon in a line always marks the transition between category and information. The third line of the sample needs special handling because it has no explicit heading. In such a case, the line is appended to the information of the previous heading. In this way, a dictionary can be created that maps the headings to the relevant information.

 

Let's now move on to the implementation. To do this, we first write a function that reads the data line by line and formats it into a dictionary:

 

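def read_file(filename):
    d = {}
    with open(filename) as f:
        for line in f:
            if ":" in line:
                # Split at the first colon; key is assigned before d[key]
                key, d[key] = (s.strip() for s in line.split(":", 1))
            elif "key" in locals():
                # Continuation line: append it to the most recent heading
                d[key] += "\n{}".format(line.strip())
    return d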

 

The read_file function is passed the filename as a string, optionally including a path. Within the function, the file is read line by line. Each line that contains a colon is divided at the first colon into two parts, category and information, and the strip method removes the surrounding whitespace from both. Then the heading and information are written to dictionary d, and the current heading is additionally referenced by key.
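The combined assignment to key and d[key] relies on Python assigning unpacking targets from left to right: key is bound first, and only then is d[key] looked up. The following is a minimal sketch of that single step on one sample line; the variable names line and parts are used here for illustration only.

```python
d = {}
line = "P:    +1 781 228 5070\n"

# Split only at the first colon, then strip whitespace from both parts.
parts = [s.strip() for s in line.split(":", 1)]
print(parts)  # ['P', '+1 781 228 5070']

# Targets are assigned left to right: key first, then d[key].
key, d[key] = parts
print(key)  # P
print(d)    # {'P': '+1 781 228 5070'}
```

The second argument to split limits the operation to one split, so a colon inside the information part would not break the line into more than two pieces.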

 

A line without a colon means that the information has been wrapped across several lines. In that case, we apply the strip method to the complete line and append the result, preceded by a newline, to the value already stored in the dictionary under the key heading. For this to work, the key reference must exist, of course. Because it’s only created within the if branch, the code assumes that a line with a colon comes before any line without one. Although no meaningful file violates this assumption, the elif branch explicitly checks whether the key reference exists.
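As an alternative to looking up the name in locals(), the current heading can be initialized to None before the loop. The sketch below shows that variant. It is not the original function: parse_lines is a hypothetical name, it takes any iterable of lines instead of a filename so that it can be tried without a file, and it additionally skips blank lines, which the original does not.

```python
def parse_lines(lines):
    """Group card lines by heading; `lines` is any iterable of strings."""
    d = {}
    key = None  # no heading seen yet
    for line in lines:
        if ":" in line:
            key, value = (s.strip() for s in line.split(":", 1))
            d[key] = value
        elif key is not None and line.strip():
            # Non-blank continuation line: append to the most recent heading.
            d[key] += "\n" + line.strip()
    return d


card = [
    "Name: John Doe\n",
    "Addr: Samplestreet 123\n",
    "      12345 Sampletown\n",
    "P:    +1 781 228 5070\n",
]
print(parse_lines(card))
```

Because an open file object is itself an iterable of lines, such a function could also be called with the result of open.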

 

The result of this function is a dictionary with the headings as keys and the associated information (in the form of strings) as values. The second function in the example parses the data using regular expressions and then stores it as a tuple in the dictionary. To do this, we first create a dictionary called regexp that provides a regular expression for each heading that can be used to validate the information:

 

regexp = {
    "Name": r"([A-Za-z]+)\s([A-Za-z]+)",
    "Addr": r"([A-Za-z]+)\s(\d+)\s*(\d{5})\s([A-Za-z]+)",
    "P": r"(\+\d{1,3})\s(\d{3})\s(\d{3})\s(\d{4,})"
}

 

Each of these regular expressions contains several groups, which make it easy to split the information into its individual parts.
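To see how the groups split a value, the address pattern can be tried on the string that read_file would produce for the sample card, where the two address lines are joined by a newline. This is a small illustration; the names addr_pattern and addr_text are chosen here and are not part of the program.

```python
import re

addr_pattern = r"([A-Za-z]+)\s(\d+)\s*(\d{5})\s([A-Za-z]+)"
addr_text = "Samplestreet 123\n12345 Sampletown"

m = re.match(addr_pattern, addr_text)
print(m.groups())  # ('Samplestreet', '123', '12345', 'Sampletown')
print(m.group(3))  # 12345
```

The \s* between the house number and the postal code is what absorbs the newline, since \s matches any whitespace character.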

 

The function used to analyze the data looks as follows:

 

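import re

def analyze_data(data, regexp):
    for key in data:
        # Reject headings for which no pattern is defined
        if key not in regexp:
            return False
        m = re.match(regexp[key], data[key])
        if not m:
            return False
        # Replace the raw string with the tuple of captured groups
        data[key] = m.groups()
    return True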

 

The analyze_data function is passed two dictionaries as parameters: first, the dictionary created by the read_file function, which contains the read data, and second, the regexp dictionary that was just created.

 

The function iterates over the data dictionary in a for loop. If a heading has no entry in regexp, the function returns False immediately. Otherwise, it uses the re.match function to apply the regular expression for the current heading to the string that was read. The returned match object is referenced by m.

 

Then, we test whether re.match returned the value None. If so, the analyze_data function returns False. Otherwise, the current value in the data dictionary is overwritten with the substrings that matched the groups of the regular expression. The groups method of the match object returns them as a tuple of strings. After analyze_data has run successfully, the dictionary contains the required data in formatted form.
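One detail worth knowing when using re.match for validation: it anchors the pattern only at the start of the string, so trailing text after a valid prefix is still accepted. The sketch below contrasts this with re.fullmatch, which requires the whole string to match. Using re.fullmatch is a possible stricter variant, not what the program above does, and the sample strings are made up for illustration.

```python
import re

phone_pattern = r"(\+\d{1,3})\s(\d{3})\s(\d{3})\s(\d{4,})"

good = "+1 781 228 5070"
no_prefix = "781 228 5070"
trailing = "+1 781 228 5070 ext. 9"

print(re.match(phone_pattern, good).groups())  # ('+1', '781', '228', '5070')
print(re.match(phone_pattern, no_prefix))      # None

# re.match accepts the valid prefix and ignores the rest ...
print(re.match(phone_pattern, trailing) is not None)      # True
# ... whereas re.fullmatch rejects the string as a whole.
print(re.fullmatch(phone_pattern, trailing) is not None)  # False
```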

 

Last but not least, the code that triggers the reading and preparation of the data is still missing:

 

data = read_file("id.txt")
if analyze_data(data, regexp):
    print(data)
else:
    print("The data is incorrect")

 

Depending on the truth value returned by the analyze_data function, either the prepared data or an error message will be output.

 

Hopefully, this example has given you a hands-on introduction to the world of regular expressions. The program works as presented, but it’s far from perfect. Feel free to expand or adapt it as you wish. For example, the regular expressions don’t yet allow umlauts or punctuation marks in the street name. You could also add an email address to the business card and the program.
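As a starting point for such extensions, the following sketch shows one possible way to widen the address pattern and add an email pattern. Everything in it is an assumption made for illustration: the dictionary name extended, the "Mail" heading, the sample values, and the deliberately simple email pattern, which is far looser than what the email standards allow.

```python
import re

# Hypothetical extensions; "Mail" is an assumed heading for the card.
extended = {
    # Street and town names may contain German umlauts and sharp s;
    # the street name may also contain periods and hyphens.
    "Addr": r"([A-Za-zÄÖÜäöüß.\-]+)\s(\d+)\s*(\d{5})\s([A-Za-zÄÖÜäöüß]+)",
    # Simplified pattern: local part, "@", and a dotted domain name.
    "Mail": r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)",
}

addr = re.match(extended["Addr"], "Müller-Str. 7\n12345 Köln")
mail = re.match(extended["Mail"], "john.doe@example.com")

print(addr.groups())  # ('Müller-Str.', '7', '12345', 'Köln')
print(mail.groups())  # ('john.doe', 'example.com')
```

Street names that consist of several words would still be rejected, so the first group is one place where further adaptation may be needed.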

 

Editor’s note: This post has been adapted from a section of the book Python 3: The Comprehensive Guide by Johannes Ernesti and Peter Kaiser.
