This post gives an overview of the extensive possibilities of regular expressions in Python.
Regular expressions use search patterns to find and change character strings, and they are available in many programming languages; see https://en.wikipedia.org/wiki/Regular_expression.
Regular expressions are often used to check and evaluate entries or to search for files. If the program detects an incorrect or incomplete entry, it can provide assistance and prompt the user for a new entry.
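As a minimal sketch of such an input check, the following function tests whether an entry looks like a phone number with an area code, a hyphen, and a subscriber number. The function name is_valid_number and the rule itself are assumptions made for illustration only; real validation rules depend on the application.

```python
import re

# Hypothetical rule: 3 to 5 digits, a hyphen, then 4 to 8 digits
NUMBER_PATTERN = re.compile("[0-9]{3,5}-[0-9]{4,8}")

def is_valid_number(entry):
    """Return True if the whole entry matches the pattern."""
    # fullmatch() only succeeds if the pattern covers the entire string
    return NUMBER_PATTERN.fullmatch(entry) is not None

for entry in ("0172-445633", "0172 445633", "0172-44"):
    if is_valid_number(entry):
        print(entry, "is accepted")
    else:
        print(entry, "is rejected, please enter the number again")
```

Only the first entry is accepted. The second contains a space instead of a hyphen, and the third has too few digits after the hyphen.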
In Python, the re module allows you to use regular expressions. The two programs we’ll write in this blog post demonstrate some of its extensive possibilities.
The first program uses the findall() function to search a text. It finds all text segments that match the regular expression and returns them as a list for further evaluation. For a better overview, the program is presented in three parts, and the individual statements and their outputs are numbered.
Let’s look at the first part of this program.
import re
tx = "house and mouse and louse"
print(tx)
print("1:", re.findall("mouse",tx))
print("2:", re.findall("[hm]ouse",tx))
print("3:", re.findall("[l-m]ouse",tx))
print("4:", re.findall("[^l-m]ouse",tx))
print("5:", re.findall(".ouse",tx))
print("6:", re.findall("^.ouse",tx))
print("7:", re.findall(".ouse$",tx))
print()
...
The following output, including the numbers, results from the first part of this program:
house and mouse and louse
1: ['mouse']
2: ['house', 'mouse']
3: ['mouse', 'louse']
4: ['house']
5: ['house', 'mouse', 'louse']
6: ['house']
7: ['louse']
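The first seven regular expressions refer to the text house and mouse and louse:
Expression 1 searches for the exact text mouse.
Expression 2 uses square brackets to search for one character from a set: h or m, followed by ouse.
Expression 3 searches for one character from the range l to m, followed by ouse.
Expression 4 negates the range with ^ inside the brackets: one character that is not in the range l to m, followed by ouse.
Expression 5 uses the dot, which stands for any single character (except a line break), followed by ouse.
Expression 6 uses ^ outside of brackets to require that the match is at the beginning of the text.
Expression 7 uses $ to require that the match is at the end of the text.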
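To see where in the text each expression matches, the re module also offers the finditer() function, which the program above does not use. The following sketch, added here for illustration, lists every match together with its start position for expressions 5 to 7. The helper name matches_with_positions is an assumption, not part of the re module.

```python
import re

tx = "house and mouse and louse"

def matches_with_positions(pattern, text):
    """Return (matched text, start index) pairs for all matches."""
    # finditer() yields match objects instead of plain strings
    return [(m.group(), m.start()) for m in re.finditer(pattern, text)]

for pattern in (".ouse", "^.ouse", ".ouse$"):
    print(pattern, matches_with_positions(pattern, tx))
```

Only the match at position 0 satisfies ^, and only the match that reaches the end of the text satisfies $.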
Let’s now consider the second part of this program.
...
tx = "0172-445633"
print(tx)
print("8:", re.findall("[0-2]",tx))
print("9:", re.findall("[^0-2]",tx))
print("10:", re.findall("[047-]",tx))
print()
...
The following output results from the second part of this program:
0172-445633
8: ['0', '1', '2']
9: ['7', '-', '4', '4', '5', '6', '3', '3']
10: ['0', '7', '-', '4', '4']
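The next three expressions refer to the text 0172-445633:
Expression 8 searches for every character in the range 0 to 2.
Expression 9 searches for every character that is not in the range 0 to 2.
Expression 10 searches for each of the characters 0, 4, 7, and the hyphen. Because the hyphen is the last character inside the brackets, it stands for itself and does not define a range.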
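The position of the hyphen inside the square brackets matters. The following sketch, which is not part of the original program, compares [047-] with a hypothetical variant [0-47], in which the hyphen sits between two characters and therefore defines a range.

```python
import re

tx = "0172-445633"

# Hyphen at the end of the set: it stands for itself
literal_hyphen = re.findall("[047-]", tx)
# Hyphen between 0 and 4: the range 0 to 4, plus the digit 7
range_hyphen = re.findall("[0-47]", tx)

print(literal_hyphen)  # ['0', '7', '-', '4', '4']
print(range_hyphen)    # ['0', '1', '7', '2', '4', '4', '3', '3']
```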
Finally, here is the last part of this program:
...
tx = "aa and aba and abba and abbba and aca"
print(tx)
print("11:", re.findall("ab*a",tx))
print("12:", re.findall("ab+a",tx))
print("13:", re.findall("ab?a",tx))
print("14:", re.findall("ab{2,3}a",tx))
The output of this final part of the program reads as follows:
aa and aba and abba and abbba and aca
11: ['aa', 'aba', 'abba', 'abbba']
12: ['aba', 'abba', 'abbba']
13: ['aa', 'aba']
14: ['abba', 'abbba']
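The last four regular expressions refer to the text aa and aba and abba and abbba and aca. They show how to specify how often a character may occur:
Expression 11 uses the asterisk: the character b may occur any number of times, including not at all.
Expression 12 uses the plus sign: the character b must occur at least once.
Expression 13 uses the question mark: the character b may occur once or not at all.
Expression 14 uses curly brackets: the character b must occur two to three times.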
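The four quantifiers can also be compared word by word. The following sketch is an illustration under the assumption that each word is checked as a whole with fullmatch(). The word abbbba with four b characters does not occur in the original text and is added here only to show the upper limit of {2,3}.

```python
import re

words = ["aa", "aba", "abba", "abbba", "abbbba"]
patterns = ["ab*a", "ab+a", "ab?a", "ab{2,3}a"]

# For each pattern, collect the words that it matches completely
matches = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, found in matches.items():
    print(pattern, found)
```

The word abbbba is matched by ab*a and ab+a, but not by ab{2,3}a, because it contains more than three b characters.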
The sub() function replaces all text parts that match the regular expression with another text. For ease of understanding, let’s use the same texts and regular expressions in this next program as in the previous program.
import re
tx = "house and mouse and louse"
print(tx)
print("1:", re.sub("mouse","x",tx))
print("2:", re.sub("[hm]ouse","x",tx))
print("3:", re.sub("[l-m]ouse","x",tx))
print("4:", re.sub("[^l-m]ouse","x",tx))
print("5:", re.sub(".ouse","x",tx))
print("6:", re.sub("^.ouse","x",tx))
print("7:", re.sub(".ouse$","x",tx))
print()
tx = "0172-445633"
print(tx)
print("8:", re.sub("[0-2]","x",tx))
print("9:", re.sub("[^0-2]","x",tx))
print("10:", re.sub("[047-]","x",tx))
print()
tx = "aa and aba and abba and abbba and aca"
print(tx)
print("11:", re.sub("ab*a","x",tx))
print("12:", re.sub("ab+a","x",tx))
print("13:", re.sub("ab?a","x",tx))
print("14:", re.sub("ab{2,3}a","x",tx))
The output of this program reads as follows:
house and mouse and louse
1: house and x and louse
2: x and x and louse
3: house and x and x
4: x and mouse and louse
5: x and x and x
6: x and mouse and louse
7: house and mouse and x
0172-445633
8: xx7x-445633
9: 01x2xxxxxxx
10: x1x2xxx5633
aa and aba and abba and abbba and aca
11: x and x and x and x and aca
12: aa and x and x and x and aca
13: x and x and abba and abbba and aca
14: aa and aba and x and x and aca
For clarity, every text part found is replaced by x.
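By default, sub() replaces every match. As a short sketch beyond the program above, the optional count parameter limits the number of replacements, and the related subn() function additionally reports how many replacements were made. The variable names are chosen for illustration.

```python
import re

tx = "house and mouse and louse"

# count=1 limits the replacement to the first match
first_only = re.sub(".ouse", "x", tx, count=1)
# subn() returns the new text together with the number of replacements
all_replaced, number = re.subn(".ouse", "x", tx)

print(first_only)            # x and mouse and louse
print(all_replaced, number)  # x and x and x 3
```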
Editor’s note: This post has been adapted from a section of the book Getting Started with Python by Thomas Theis.