Regular Expressions, often shortened as regex, are a sequence of characters used to check whether a pattern exists in a given text (string) or not. If you’ve ever used search engines, search and replace tools of word processors and text editors - you’ve already seen regular expressions in use. They are used at the server side to validate the format of email addresses or passwords during registration, used for parsing text data files to find, replace, or delete certain string, etc. They help in manipulating textual data, which is often a prerequisite for data science projects involving text mining.
In Python, regular expressions are supported by the re module. That means that if you want to start using them in your Python scripts, you have to import this module with the help of import:
import re
we can easily tackle many basic patterns in Python using ordinary characters. Ordinary characters are the simplest regular expressions. They match themselves exactly and do not have a special meaning in their regular expression syntax. Examples are ‘A’, ‘a’, ‘X’, ‘5’.
pattern = r”Cookie” sequence = “Cookie” if re.match(pattern, sequence): print(“Match!”) else: print(“Not a match!”)
Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression. For simple understanding, they can be thought of as reserved metacharacters that denote something else and not what they look like.
A period. Matches any single character except the newline character.
re.search(r’Co.k.e’, ‘Cookie’).group() /// Cookie
re.search(r’^Eat’, “Eat cake!”).group() /// Eat
re.search(r’cake$’, “Cake! Let’s eat cake”).group() /// Cake
re.search(r’[0-6]’, ‘Number: 5’).group() /// ‘5’
re.search(r’Number: [^5]’, ‘Number: 0’).group() /// ‘Number: 0’
re.search(r’Just a \regular character’, ‘Just a \regular character’).group() /// ‘Just a \regular character’
print(“Lowercase w:”, re.search(r’Co\wk\we’, ‘Cookie’).group()) /// Lowercase w: Cookie
print(“Lowercase s:”, re.search(r’Eat\scake’, ‘Eat cake’).group()) /// Lowercase s: Eat cake
print(“How many cookies do you want? “, re.search(r’\d+’, ‘100 cookies’).group()) /// How many cookies do you want?
It becomes quite tedious if you are looking to find long patterns in a sequence. Fortunately, the re module handles repetitions using the following special characters:
Checks if the preceding character appears one or more times starting from that position.
re.search(r’Co+kie’, ‘Cooookie’).group() /// ‘Cooookie’
re.search(r’Caokie’, ‘Cookie’).group() /// ‘Cookie’
re.search(r’Colou?r’, ‘Color’).group() /// ‘Color’
The group feature of regular expression allows you to pick up parts of the matching text. Parts of a regular expression pattern bounded by parenthesis () are called groups. The parenthesis does not change what the expression matches, but rather forms groups within the matched sequence. You have been using the group() function all along in this tutorial’s examples. The plain match.group() without any argument is still the whole matched text as usual.
When a special character matches as much of the search sequence (string) as possible, it is said to be a “Greedy Match”. It is the normal behavior of a regular expression, but sometimes this behavior is not desired:
Regular expressions are handled as strings by Python. However, with compile(), you can computer a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, using compile() to save the resulting regular expression object for reuse is more efficient than saving it as a string. This is because the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached.
With this function, you scan through the given string/sequence, looking for the first location where the regular expression produces a match. It returns a corresponding match object if found, else returns None if no position in the string matches the pattern. Note that None is different from finding a zero-length match at some point in the string.
Returns a corresponding match object if zero or more characters at the beginning of string match the pattern. Else it returns None, if the string does not match the given pattern.
Finds all the possible matches in the entire sequence and returns them as a list of strings. Each returned string represents one match.
Similar to findall() - it finds all the possible matches in the entire sequence but returns regex match objects as an iterator.
sub() is the substitute function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found, then the string is returned unchanged.
The subn() is similar to sub(). However, it returns a tuple containing the new string value and the number of replacements that were performed in the statement.
This splits the strings wherever the pattern matches and returns a list. If the optional argument maxsplit is nonzero, then the maximum ‘maxsplit’ number of splits are performed.
An expression’s behavior can be modified by specifying a flag value. You can add flags as an extra argument to the different functions that you have seen in this tutorial. Some of the more useful ones are: