
Regular Expressions in Python Programming

This post provides insight into the extensive possibilities of regular expressions in Python.

 

Regular expressions describe search patterns for finding and changing character strings, and they are available in many programming languages; see https://en.wikipedia.org/wiki/Regular_expression.

 

Regular expressions are often used to check and evaluate entries or to search for files. If the program determines that an incorrect or incomplete entry has been made, the system can provide assistance and prompt you for a new entry.

 

In Python, the re module allows you to use regular expressions. The two programs we’ll write in this blog post demonstrate some of its extensive possibilities.
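Before turning to those programs, here is a minimal sketch of the kind of input check described above. The function name is_valid_number and the expected format (four digits, a hyphen, six digits) are assumptions made purely for illustration. The re.fullmatch() function only succeeds if the entire entry fits the pattern.

```python
import re

def is_valid_number(entry):
    """Return True if entry consists of 4 digits, a hyphen, and 6 digits."""
    # fullmatch() returns None unless the whole string fits the pattern
    return re.fullmatch("[0-9]{4}-[0-9]{6}", entry) is not None

# Hypothetical entries: one valid, two invalid
for entry in ["0172-445633", "0172445633", "0172-44563x"]:
    if is_valid_number(entry):
        print(entry, "is accepted")
    else:
        print(entry, "is rejected, please enter the number again")
```

In an interactive program, the rejected branch would typically lead back to a new input prompt.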

 

Finding Text Parts

In the next program, the findall() function is used to search for parts of a text. It finds all text segments that match the regular expression and returns them as a list for further evaluation. For a better overview, the program is presented in three parts, and the individual expressions and their associated outputs are numbered.

 

Let’s look at the first part of this program.

 

import re

 

tx = "house and mouse and louse"

print(tx)

print("1:", re.findall("mouse",tx))

print("2:", re.findall("[hm]ouse",tx))

print("3:", re.findall("[l-m]ouse",tx))

print("4:", re.findall("[^l-m]ouse",tx))

print("5:", re.findall(".ouse",tx))

print("6:", re.findall("^.ouse",tx))

print("7:", re.findall(".ouse$",tx))

print()

...

 

The following output, including the numbers, results from the first part of this program:

house and mouse and louse

 

1: ['mouse']

2: ['house', 'mouse']

3: ['mouse', 'louse']

4: ['house']

5: ['house', 'mouse', 'louse']

6: ['house']

7: ['louse']

 

The first seven regular expressions refer to the house and mouse and louse text:

  1. The mouse string is searched for. All occurrences of this character string are returned.
  2. The text parts that begin with an h or an m and end with ouse are searched for. The square brackets contain the set of permitted characters, so [hm] means “h or m”.
  3. The text parts that begin with one of the characters from l to m and end with ouse are searched for. A range is specified in the square brackets using the hyphen.
  4. The text parts that do not begin with one of the characters from l to m but end with ouse are searched for. As the first character inside square brackets, ^ negates the set.
  5. The text parts that begin with any character and end with ouse are searched for. The . character means “any character” (except a line break, by default).
  6. As in expression 5, but only a text part at the beginning of the text being analyzed is found. Outside of square brackets, the ^ character means “at the beginning”.
  7. As in expression 5, but only a text part at the end of the text being analyzed is found. The $ character means “at the end”.
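Two details of the bracket notation are easy to miss and can be checked with a short sketch. The sample text, which ends in |ouse, is made up for this purpose. Inside square brackets, the permitted characters are simply listed one after another, so a | placed there is treated as an ordinary character and not as “or”. In addition, ^ only negates when it is the first character inside the brackets.

```python
import re

tx = "house and mouse and |ouse"

# Characters inside brackets are listed without a separator
print(re.findall("[hm]ouse", tx))    # ['house', 'mouse']

# A | inside brackets is a literal character, so "|ouse" matches as well
print(re.findall("[h|m]ouse", tx))   # ['house', 'mouse', '|ouse']

# ^ as the first character inside brackets negates the set ...
print(re.findall("[^hm]ouse", tx))   # ['|ouse']

# ... while ^ outside brackets anchors the match at the start of the text
print(re.findall("^[hm]ouse", tx))   # ['house']
```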

Let’s now consider the second part of this program.

 

...

tx = "0172-445633"

print(tx)

print("8:", re.findall("[0-2]",tx))

print("9:", re.findall("[^0-2]",tx))

print("10:", re.findall("[047-]",tx))

print()

...

 

The following output results from the second part of this program:

 

0172-445633

8: ['0', '1', '2']

9: ['7', '-', '4', '4', '5', '6', '3', '3']

10: ['0', '7', '-', '4', '4']

 

The next three expressions refer to the text 0172-445633:

  8. One of the digits from 0 to 2 is searched for as a text part. This time, the range consists of digits.
  9. A character that is not in the range from 0 to 2 is searched for as a text part. All digits from 3 upward and all non-digits are found.
  10. One of the characters from the specified set is searched for as a text part. Every 0, 4, 7, and hyphen is found. The hyphen at the end of the set stands for itself and does not define a range.
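The position of the hyphen inside the brackets decides whether it is a literal character or defines a range. The following sketch compares the set from the program with two variants that are not part of the original program and are added here only for comparison:

```python
import re

tx = "0172-445633"

# Hyphen at the end of the set: a literal hyphen
print(re.findall("[047-]", tx))    # ['0', '7', '-', '4', '4']

# Hyphen between two characters: the range 0 to 4, plus the digit 7
print(re.findall("[0-47]", tx))    # ['0', '1', '7', '2', '4', '4', '3', '3']

# A backslash makes the hyphen literal in any position
print(re.findall(r"[0\-47]", tx))  # ['0', '7', '-', '4', '4']
```

The third pattern is written as a raw string (with the r prefix) so that Python passes the backslash to the re module unchanged.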

Finally, here is the last part of this program:

 

...

tx = "aa and aba and abba and abbba and aca"

print(tx)

print("11:", re.findall("ab*a",tx))

print("12:", re.findall("ab+a",tx))

print("13:", re.findall("ab?a",tx))

print("14:", re.findall("ab{2,3}a",tx))

 

The output of this final part of the program reads as follows:

 

aa and aba and abba and abbba and aca

11: ['aa', 'aba', 'abba', 'abbba']

12: ['aba', 'abba', 'abbba']

13: ['aa', 'aba']

14: ['abba', 'abbba']

 

The last four regular expressions refer to the text aa and aba and abba and abbba and aca. This text illustrates how to specify the number of occurrences of a character.

  11. All text parts are found that contain, in sequence, an a, any number of the character b (0 is also possible), and again an a. The * character (asterisk) means “any number of occurrences of the preceding character”.
  12. All text parts are found that contain an a, at least one b, and again an a. The + (plus) sign means “the number of occurrences of the preceding character is 1 or greater”.
  13. All text parts are found that contain an a, no b or one b, and again an a. In this context, the question mark ? means “the number of occurrences of the preceding character is 0 or 1”.
  14. All text parts are found that contain an a, the character b two or three times, and again an a. The curly brackets allow you to specify the desired number of occurrences of the preceding character.
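The curly brackets also accept an exact number or an open upper limit, and a quantifier can follow a set in square brackets instead of a single character. The patterns in the following sketch are additional variations on the same text and do not appear in the original program:

```python
import re

tx = "aa and aba and abba and abbba and aca"

# Exactly two b characters
print(re.findall("ab{2}a", tx))    # ['abba']

# At least two b characters, no upper limit
print(re.findall("ab{2,}a", tx))   # ['abba', 'abbba']

# {0,1} has the same meaning as ?
print(re.findall("ab{0,1}a", tx))  # ['aa', 'aba']

# The quantifier applies to the whole set: one or more of b or c
print(re.findall("a[bc]+a", tx))   # ['aba', 'abba', 'abbba', 'aca']
```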

Replacing Text Parts

The sub() function replaces all text parts that match the regular expression with another text. For ease of understanding, let’s use the same texts and regular expressions in this next program as in the previous program.

 

import re

 

tx = "house and mouse and louse"

print(tx)

print("1:", re.sub("mouse","x",tx))

print("2:", re.sub("[hm]ouse","x",tx))

print("3:", re.sub("[l-m]ouse","x",tx))

print("4:", re.sub("[^l-m]ouse","x",tx))

print("5:", re.sub(".ouse","x",tx))

print("6:", re.sub("^.ouse","x",tx))

print("7:", re.sub(".ouse$","x",tx))

print()

 

tx = "0172-445633"

print(tx)

print("8:", re.sub("[0-2]","x",tx))

print("9:", re.sub("[^0-2]","x",tx))

print("10:", re.sub("[047-]","x",tx))

print()

 

tx = "aa and aba and abba and abbba and aca"

print(tx)

print("11:", re.sub("ab*a","x",tx))

print("12:", re.sub("ab+a","x",tx))

print("13:", re.sub("ab?a","x",tx))

print("14:", re.sub("ab{2,3}a","x",tx))

 

The output of this program reads as follows:

 

house and mouse and louse

1: house and x and louse

2: x and x and louse

3: house and x and x

4: x and mouse and louse

5: x and x and x

6: x and mouse and louse

7: house and mouse and x

 

 

0172-445633

8: xx7x-445633

9: 01x2xxxxxxx

10: x1x2xxx5633

 

 

aa and aba and abba and abbba and aca

11: x and x and x and x and aca

12: aa and x and x and x and aca

13: x and x and abba and abbba and aca

14: aa and aba and x and x and aca

 

To make the matches easy to spot, every text part found is replaced by x.
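If not every match should be replaced, or if the number of replacements is of interest, the re module offers two further options. The following sketch goes beyond the original program: it uses the optional count parameter of sub() and the related subn() function. The variable names new_tx and number are chosen for illustration.

```python
import re

tx = "house and mouse and louse"

# count=1 replaces only the first match
print(re.sub(".ouse", "x", tx, count=1))    # x and mouse and louse

# subn() returns the new text and the number of replacements made
new_tx, number = re.subn(".ouse", "x", tx)
print(new_tx, "-", number, "replacements")  # x and x and x - 3 replacements
```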

 

Editor’s note: This post has been adapted from a section of the book Getting Started with Python by Thomas Theis.

by Rheinwerk Computing

Rheinwerk Computing is an imprint of Rheinwerk Publishing and publishes books by leading experts in the fields of programming, administration, security, analytics, and more.
