This post looks at what regular expressions can do in Python.
Regular expressions use search patterns to find and modify character strings, and they are available in many programming languages; see https://en.wikipedia.org/wiki/Regular_expression.
Regular expressions are often used to check and evaluate entries or to search for files. If the program determines that an incorrect or incomplete entry has been made, the system can provide assistance and prompt you for a new entry.
In Python, the re module allows you to use regular expressions. The two programs we’ll write in this blog post demonstrate some of its extensive possibilities.
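Before turning to those programs, here is a minimal sketch of the entry check mentioned above. The rule itself (four digits, a hyphen, six digits) and the names ENTRY_PATTERN and is_valid_entry are assumptions made for illustration only; re.fullmatch() succeeds only if the entire text fits the pattern.

```python
import re

# Hypothetical rule for illustration: a valid entry consists of
# four digits, a hyphen, and six digits, for example "0172-445633".
ENTRY_PATTERN = r"[0-9]{4}-[0-9]{6}"

def is_valid_entry(entry):
    """Return True if the whole entry matches the assumed pattern."""
    return re.fullmatch(ENTRY_PATTERN, entry) is not None

print(is_valid_entry("0172-445633"))   # True
print(is_valid_entry("0172445633"))    # False, the hyphen is missing
```

A program could call such a function in a loop and ask for a new entry until the check succeeds.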
Finding Text Parts
In the next program, the findall() function is used to search for parts of a text. It finds all text segments that match the regular expression and returns them as a list for further evaluation. The program is split into three parts for easier reading, and the individual statements and their outputs are numbered.
Let’s look at the first part of this program.
import re
tx = "house and mouse and louse"
print(tx)
print("1:", re.findall("mouse",tx))
print("2:", re.findall("[hm]ouse",tx))
print("3:", re.findall("[l-m]ouse",tx))
print("4:", re.findall("[^l-m]ouse",tx))
print("5:", re.findall(".ouse",tx))
print("6:", re.findall("^.ouse",tx))
print("7:", re.findall(".ouse$",tx))
print()
...
The following output, including the numbers, results from the first part of this program:
house and mouse and louse
1: ['mouse']
2: ['house', 'mouse']
3: ['mouse', 'louse']
4: ['house']
5: ['house', 'mouse', 'louse']
6: ['house']
7: ['louse']
The first seven regular expressions refer to the house and mouse and louse text:
- The mouse string is searched for. All occurrences of this character string are returned.
- The text parts that begin with an h or an m and end with ouse are searched for. The square brackets define a set of characters; here, we are basically saying “Search for h or m”.
- The text parts that begin with one of the characters from l to m and end with ouse are searched for. A range is specified in the square brackets using the hyphen.
- The text parts that do not begin with one of the characters from l to m but end with ouse are searched for. Placed directly after the opening square bracket, the ^ character negates the set.
- The system searches for text parts that begin with any character and end with ouse. The . character means “any character” (by default, any character except a line break).
- As before, but only valid for text parts that appear at the beginning of the text being analyzed. The ^ character means “at the beginning”.
- As before, but only valid for text parts that are at the end of the text being analyzed. The $ character means “at the end”.
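As a supplement to the list above, the following sketch shows two related details. The finditer() function of the re module returns match objects that also report where each match starts, and inside square brackets the | character is an ordinary character rather than an “or”. The sample text containing |ouse is made up for illustration.

```python
import re

tx = "house and mouse and louse"

# finditer() yields match objects instead of plain strings,
# so the position of each match is available as well.
for m in re.finditer(r"[hm]ouse", tx):
    print(m.group(), "starts at index", m.start())
# house starts at index 0
# mouse starts at index 10

sample = "house |ouse mouse"   # made-up text for illustration

# Inside square brackets, | is just another character of the set.
print(re.findall(r"[h|m]ouse", sample))    # ['house', '|ouse', 'mouse']

# Alternation with | belongs outside the brackets, here in a group.
print(re.findall(r"(?:h|m)ouse", sample))  # ['house', 'mouse']
```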
Let’s now consider the second part of this program.
...
tx = "0172-445633"
print(tx)
print("8:", re.findall("[0-2]",tx))
print("9:", re.findall("[^0-2]",tx))
print("10:", re.findall("[047-]",tx))
print()
...
The following output results from the second part of this program:
0172-445633
8: ['0', '1', '2']
9: ['7', '-', '4', '4', '5', '6', '3', '3']
10: ['0', '7', '-', '4', '4']
The next three expressions refer to the text 0172-445633:
- One of the digits from 0 to 2 is searched for as a text segment. This time, the range includes digits.
- A character that is not in the range of digits from 0 to 2 is searched for as a text part. All digits from 3 upward and all non-digit characters are found.
- One of the characters from the specified set of characters is searched for as a text part. Because the hyphen is the last character in the square brackets, it does not define a range but stands for itself. All characters mentioned in the set are found.
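The role of the hyphen depends on its position in the set. The following sketch compares three variants on the same text; the second and third patterns are additions for illustration and are not part of the original program.

```python
import re

tx = "0172-445633"

# Hyphen at the end of the set: it stands for itself.
print(re.findall(r"[047-]", tx))   # ['0', '7', '-', '4', '4']

# Hyphen between two characters: a range, here the digits 0 to 7.
print(re.findall(r"[0-7]", tx))
# ['0', '1', '7', '2', '4', '4', '5', '6', '3', '3']

# Escaped hyphen: it stands for itself, wherever it is placed.
print(re.findall(r"[0\-7]", tx))   # ['0', '7', '-']
```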
Finally, here is the last part of this program:
...
tx = "aa and aba and abba and abbba and aca"
print(tx)
print("11:", re.findall("ab*a",tx))
print("12:", re.findall("ab+a",tx))
print("13:", re.findall("ab?a",tx))
print("14:", re.findall("ab{2,3}a",tx))
The output of this final part of the program reads as follows:
aa and aba and abba and abbba and aca
11: ['aa', 'aba', 'abba', 'abbba']
12: ['aba', 'abba', 'abbba']
13: ['aa', 'aba']
14: ['abba', 'abbba']
The last four regular expressions refer to the text, aa and aba and abba and abbba and aca. This text illustrates the behavior regarding the number of occurrences of certain characters.
- All text parts are found that contain, in sequence, an a, any number (0 is also possible) of the character b, and again an a. The * character (asterisk) means “any number of occurrences of the preceding character”.
- All text parts are found that contain an a, at least one b, and again an a. The + (plus) sign means “The number of occurrences of the preceding character is 1 or greater”.
- All text parts are found that contain an a, no b or one b, and again an a. In this context, the question mark ? means “The number of occurrences of the preceding character is 0 or 1”.
- All text parts are found that contain an a, the character b two or three times, and again an a. The curly brackets allow you to specify the desired number of occurrences; {2,3} means at least two and at most three.
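The curly brackets accept a few more forms than the one used in the program. The following sketch, which goes slightly beyond the original examples, shows an exact count, an open upper limit, and a quantifier that applies to a whole group instead of a single character. The text "ab abab" is made up for illustration.

```python
import re

tx = "aa and aba and abba and abbba and aca"

# Exactly two occurrences of b.
print(re.findall(r"ab{2}a", tx))     # ['abba']

# Two or more occurrences of b (no upper limit).
print(re.findall(r"ab{2,}a", tx))    # ['abba', 'abbba']

# Zero or one occurrence of b, equivalent to ab?a.
print(re.findall(r"ab{0,1}a", tx))   # ['aa', 'aba']

# A quantifier normally applies to the single preceding character.
# With a group, it applies to the whole sequence ab.
print(re.findall(r"(?:ab)+", "ab abab"))   # ['ab', 'abab']
```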
Replacing Text Parts
The sub() function replaces all text parts that match the regular expression with another text. For ease of understanding, let’s use the same texts and regular expressions in this next program as in the previous program.
import re
tx = "house and mouse and louse"
print(tx)
print("1:", re.sub("mouse","x",tx))
print("2:", re.sub("[hm]ouse","x",tx))
print("3:", re.sub("[l-m]ouse","x",tx))
print("4:", re.sub("[^l-m]ouse","x",tx))
print("5:", re.sub(".ouse","x",tx))
print("6:", re.sub("^.ouse","x",tx))
print("7:", re.sub(".ouse$","x",tx))
print()
tx = "0172-445633"
print(tx)
print("8:", re.sub("[0-2]","x",tx))
print("9:", re.sub("[^0-2]","x",tx))
print("10:", re.sub("[047-]","x",tx))
print()
tx = "aa and aba and abba and abbba and aca"
print(tx)
print("11:", re.sub("ab*a","x",tx))
print("12:", re.sub("ab+a","x",tx))
print("13:", re.sub("ab?a","x",tx))
print("14:", re.sub("ab{2,3}a","x",tx))
The output of this program reads as follows:
house and mouse and louse
1: house and x and louse
2: x and x and louse
3: house and x and x
4: x and mouse and louse
5: x and x and x
6: x and mouse and louse
7: house and mouse and x
0172-445633
8: xx7x-445633
9: 01x2xxxxxxx
10: x1x2xxx5633
aa and aba and abba and abbba and aca
11: x and x and x and x and aca
12: aa and x and x and x and aca
13: x and x and abba and abbba and aca
14: aa and aba and x and x and aca
To make the matches easy to spot, every text part found is replaced by x.
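The sub() function does not have to replace every match with a fixed text. The following sketch, an addition to the program above, uses the optional count parameter, the related subn() function, and a function as the replacement.

```python
import re

tx = "house and mouse and louse"

# count limits the number of replacements; the default 0 means "all".
print(re.sub(r".ouse", "x", tx, count=1))
# x and mouse and louse

# subn() returns the new text together with the number of replacements.
print(re.subn(r".ouse", "x", tx))
# ('x and x and x', 3)

# The replacement can be a function that receives each match object
# and returns the text to insert in its place.
print(re.sub(r".ouse", lambda m: m.group().upper(), tx))
# HOUSE and MOUSE and LOUSE
```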
Editor’s note: This post has been adapted from a section of the book Getting Started with Python by Thomas Theis.