Python Regular Expression


A regular expression (regex for short) is a sequence of characters that specifies a pattern of text to search for. In this tutorial, we look at how to work with regular expressions in Python.

How to use Python Regular Expression

Using a regular expression in Python takes four steps:

Import the regex module

All Python regex functions live in the re module. Remember to import it at the beginning of your Python code, and again any time IDLE is restarted.

Create Regex object

We create a Regex object by passing a string that represents the regular expression to re.compile().

For example, to match the phone number pattern:
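A minimal sketch, assuming a hypothetical phone number format of three digits, a hyphen, three digits, a hyphen, and four digits. The name phone_regex is ours, chosen only for illustration:

```python
import re

# Raw string (r'...') so the backslashes reach the regex engine unchanged.
# The 3-3-4 digit layout is an assumed format, not the only possible one.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

print(phone_regex.pattern)
```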

Get Match object

The Regex object has a search() method that searches a string for the first match of the regex. It returns:
– None if the regex pattern is not found
– a Match object if the pattern is found
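Both outcomes can be sketched as follows. The phone pattern and the sample sentences are assumptions made for this illustration:

```python
import re

phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')  # assumed phone format

found = phone_regex.search('Call 415-555-1234 tomorrow.')
missing = phone_regex.search('No number here.')

print(found is not None)  # True: a Match object was returned
print(missing)            # None: the pattern was not found
```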

Get matched text

We call Match object’s group() method to get the actual matched text from the searched string.

At a glance:
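The four steps fit together roughly like this. It is a sketch with an assumed phone pattern and a made-up sentence, not the only way to arrange the calls:

```python
import re                                        # 1. import the regex module

phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')   # 2. create a Regex object

match = phone_regex.search('My number is 415-555-1234.')  # 3. get a Match object

if match is not None:                            # search() may return None
    print(match.group())                         # 4. get the matched text
```

Checking for None before calling group() avoids an AttributeError when the pattern is not found.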

Basic Python Regular Expression example

Use parentheses to group

– We add parentheses to create groups in the regex: (group_1)text(group_2)text(group3).

– Then we use the group() method of the Match object to grab the matching text. The first set of parentheses in a regex string is group(1), the second set is group(2), and so on. group(0) or group() returns the entire matched text.

– We can get all groups at once with groups() method that returns a tuple of multiple values:
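As an illustration, here is a sketch that splits an assumed phone format into an area code and a main number. The variable names are hypothetical:

```python
import re

# Two groups: the area code, then the rest of the number.
phone_regex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
match = phone_regex.search('My number is 415-555-1234.')

area_code = match.group(1)    # first set of parentheses
main_number = match.group(2)  # second set of parentheses
whole = match.group(0)        # entire matched text, same as match.group()
all_groups = match.groups()   # tuple with every group

print(area_code, main_number, whole, all_groups)
```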

Match multiple groups

We use the | character (pipe) to match one of many expressions.
For example, regular expression r'grokonez|JavaSampleApproach' will match either 'grokonez' or 'JavaSampleApproach'.

When both expressions occur in the searched string, the first occurrence of matching text will be returned.
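A short sketch of this behaviour, using two made-up sentences that contain both names in a different order:

```python
import re

name_regex = re.compile(r'grokonez|JavaSampleApproach')

# Whichever alternative appears first in the searched string wins.
first = name_regex.search('JavaSampleApproach and grokonez')
second = name_regex.search('grokonez and JavaSampleApproach')

print(first.group())
print(second.group())
```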

We can also use the | character inside parentheses to match one of several alternatives as only part of a regular expression, with a common prefix or suffix outside the group.
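For instance, assuming we want a fixed prefix 'Java' followed by one of several suffixes (the suffixes here are hypothetical):

```python
import re

# 'Java' is required; the group matches exactly one of the alternatives.
java_regex = re.compile(r'Java(Sample|Script|Doc)')
match = java_regex.search('I am reading JavaScript notes.')

full_text = match.group()  # the whole match
suffix = match.group(1)    # only the alternative that matched

print(full_text, suffix)
```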

To find every matching occurrence instead of only the first one, we can use the findall() method, which is covered later in this tutorial.

Match optionally

We can make the regex match whether or not a piece of text is present by using the ? character.
The group preceding the ? character becomes an optional part of the pattern: it may appear once or not at all.
Remember that the first occurrence of matching text will be returned.
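A small sketch, assuming a pattern in which the middle part 'Sample' is optional:

```python
import re

# (Sample)? means the group may appear once or not at all.
optional_regex = re.compile(r'Java(Sample)?Approach')

with_group = optional_regex.search('Visit JavaSampleApproach today')
without_group = optional_regex.search('Visit JavaApproach today')

print(with_group.group())
print(without_group.group())
```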

Match zero or more

The group that precedes the star * can occur any number of times (zero or more) in the text.
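Using the same hypothetical pattern with a star, the group may be absent or repeated:

```python
import re

# (Sample)* means the group may appear zero, one, or many times.
star_regex = re.compile(r'Java(Sample)*Approach')

zero_times = star_regex.search('JavaApproach')
two_times = star_regex.search('JavaSampleSampleApproach')

print(zero_times.group())
print(two_times.group())
```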

Match one or more

Unlike the star, the plus + character requires the group preceding it to appear at least once.
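With the same illustrative pattern, a plus turns the missing group into a failed match:

```python
import re

# (Sample)+ means the group must appear one or more times.
plus_regex = re.compile(r'Java(Sample)+Approach')

zero_times = plus_regex.search('JavaApproach')
one_time = plus_regex.search('JavaSampleApproach')

print(zero_times)        # None: the group is missing
print(one_time.group())
```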

Match with specific repetition

We can specify the number of times that a group repeats by using a number in curly brackets.

We can also give a range by adding a second number in the curly brackets: {n,m} matches at least n and at most m repetitions.
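Here is a sketch of both forms, using the short string 'gkz' as the repeated group (the test strings are assumptions):

```python
import re

exactly_three = re.compile(r'(gkz){3}')
two_to_three = re.compile(r'(gkz){2,3}')

three_found = exactly_three.search('gkzgkzgkz')  # three repetitions: match
too_few = exactly_three.search('gkzgkz')         # only two: no match
in_range = two_to_three.search('gkzgkz')         # two is within 2..3: match

print(three_found.group(), too_few, in_range.group())
```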

Greedy and Nongreedy matching

(gkz){3,5} can match 3, 4, or 5 instances of 'gkz' in the string 'gkzgkzgkzgkzgkz'.

By default, Python regular expressions are greedy, which means that the longest possible string will be matched.

To match the shortest string (nongreedy), we use a question mark ? character right after the curly brackets:
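The difference can be seen by running both versions against the same string:

```python
import re

text = 'gkzgkzgkzgkzgkz'  # five instances of 'gkz'

greedy_regex = re.compile(r'(gkz){3,5}')
nongreedy_regex = re.compile(r'(gkz){3,5}?')

greedy_text = greedy_regex.search(text).group()        # takes as many as allowed
nongreedy_text = nongreedy_regex.search(text).group()  # takes as few as allowed

print(greedy_text)
print(nongreedy_text)
```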

Get all matches

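The Regex object has a findall() method that returns a list of all matches in the searched string.

When the regular expression has no groups, findall() returns a list of strings (shown in square brackets []), each string representing one match. If the regular expression contains two or more groups (with ()), findall() returns a list of tuples, one tuple per match. With exactly one group, it returns a list of the strings matched by that group.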
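A sketch of both return shapes, assuming the phone format used for illustration and a made-up contact line:

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups: a list of strings.
no_groups = re.compile(r'\d{3}-\d{3}-\d{4}').findall(text)

# Three groups: a list of tuples, one tuple per match.
with_groups = re.compile(r'(\d{3})-(\d{3})-(\d{4})').findall(text)

print(no_groups)
print(with_groups)
```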

Python Regex Symbols

Basic character classes

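We use \d to match any numeric digit. For the ASCII digits, \d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9). With str patterns in Python 3, \d also matches other Unicode decimal digits unless the re.A flag is set.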

There are other shorthand character classes:

\d a numeric digit from 0 to 9
\D NOT a numeric digit from 0 to 9
\w letter, numeric digit, underscore character
\W NOT a letter, numeric digit, underscore character
\s space, tab, newline character
\S NOT a space, tab, newline

Regular expression \w+:\s\d+ will match text that:
– has one or more letter/digit/underscore characters (\w+)
– followed by a : character
– followed by a space, tab, or newline character (\s)
– ends with one or more numeric digits (\d+)
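To see the pattern at work, here is a sketch run against a made-up line of key/value pairs. Only the pairs whose value is numeric should match:

```python
import re

pair_regex = re.compile(r'\w+:\s\d+')

# 'name: Bob' is skipped because 'Bob' is not made of digits.
pairs = pair_regex.findall('age: 30, height: 180, name: Bob')

print(pairs)
```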

Custom character classes
Define custom character class

We can define our own character class using square brackets.

For example, [aeiouAEIOU] will match any vowel (lowercase and uppercase):
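A brief sketch with an arbitrary sample phrase:

```python
import re

vowel_regex = re.compile(r'[aeiouAEIOU]')

# Each match is a single character from the class.
vowels = vowel_regex.findall('Regular Expression')

print(vowels)
```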

Character class in Range

We can include ranges of letters or numbers by using a hyphen:
– Character class [1-7]: any single digit from 1 to 7.
– Character class [b-f]: any single letter from b to f (b, c, d, e, f).
– Character class [a-zA-Z0-9]: any single lowercase letter, uppercase letter, or digit.
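The three ranges above can be tried out like this. The input strings are assumptions picked to show what falls inside and outside each range:

```python
import re

digits_1_to_7 = re.findall(r'[1-7]', '0123456789')
letters_b_to_f = re.findall(r'[b-f]', 'abcdefg')

# Adding + matches runs of such characters instead of single ones.
alphanumeric_runs = re.findall(r'[a-zA-Z0-9]+', 'Hello, World 42!')

print(digits_1_to_7, letters_b_to_f, alphanumeric_runs)
```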

Negative character class

To make a negative character class, which matches any character that is not listed in the class, we put a caret symbol ^ right after the character class's opening bracket [.
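For example, negating the vowel class leaves everything that is not a vowel, including punctuation (the sample word is arbitrary):

```python
import re

non_vowel_regex = re.compile(r'[^aeiouAEIOU]')

non_vowels = non_vowel_regex.findall('Regex!')

print(non_vowels)
```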

Caret symbol & Dollar sign
Caret symbol

To indicate that a match must occur at the beginning of the searched text, we use the caret symbol ^ as the first character of the regex.
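A sketch with two made-up strings, only one of which starts with the word we look for:

```python
import re

starts_with_hello = re.compile(r'^Hello')

at_start = starts_with_hello.search('Hello world')
not_at_start = starts_with_hello.search('Say Hello')

print(at_start.group())
print(not_at_start)  # None: 'Hello' is not at the beginning
```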

Dollar sign

To indicate that a match must occur at the end of the searched text, we use the dollar sign $ as the last character of the regex.
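Similarly, assuming we want strings that end with a digit:

```python
import re

ends_with_digit = re.compile(r'\d$')

at_end = ends_with_digit.search('Version 3')
not_at_end = ends_with_digit.search('3 versions')

print(at_end.group())
print(not_at_end)  # None: the last character is not a digit
```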

Match entire string

We can use ^ and $ together to indicate that the entire string must match the regex:
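A minimal sketch, assuming we want to accept only strings made up entirely of digits:

```python
import re

whole_number_regex = re.compile(r'^\d+$')

all_digits = whole_number_regex.search('1234567890')
mixed = whole_number_regex.search('12345xyz67890')

print(all_digits.group())
print(mixed)  # None: the letters break the all-digits requirement
```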

Wildcard character

The wildcard character . (dot) in a regex matches any single character (except the newline character).
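Because the dot stands for exactly one character, a word such as 'flat' contributes only 'lat' in this sketch (the sentence is made up):

```python
import re

at_regex = re.compile(r'.at')

at_words = at_regex.findall('The cat in the hat sat on the flat mat.')

print(at_words)
```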

Match everything except newline

We already know that:
– the . (dot) character means any single character except the newline
– the * (star) character means zero or more of the preceding item

So dot-star .* matches everything (except newline characters).
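Here is a sketch that uses dot-star to capture free text between fixed labels, and that also shows the match stopping at a newline. The labels 'First:' and 'Last:' are hypothetical:

```python
import re

label_regex = re.compile(r'First: (.*) Last: (.*)')
captured = label_regex.search('First: Python Last: Regex').groups()

# Without extra flags, .* stops at the first newline.
first_line = re.compile(r'.*').search('line one\nline two').group()

print(captured)
print(first_line)
```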

Match everything

We can pass re.DOTALL as the second argument to re.compile() to make the dot character match all characters (including the newline).
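Comparing the same pattern with and without the flag on a two-line sample string:

```python
import re

text = 'line one\nline two'

without_flag = re.compile(r'.*').search(text).group()
with_dotall = re.compile(r'.*', re.DOTALL).search(text).group()

print(repr(without_flag))
print(repr(with_dotall))
```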

Greedy & Nongreedy

By default, dot-star works in greedy mode: it matches as much text as possible.
To match as little text as possible (nongreedy mode), add a question mark after it: .*?
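A sketch using angle brackets as delimiters. The HTML-like sample text is only an assumption for illustration:

```python
import re

text = '<b>bold</b> text'

greedy_tag = re.compile(r'<.*>').search(text).group()      # runs to the last '>'
nongreedy_tag = re.compile(r'<.*?>').search(text).group()  # stops at the first '>'

print(greedy_tag)
print(nongreedy_tag)
```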

Python Regex Symbols Review
? zero or one
+ one or more (nongreedy: +?)
* zero or more (nongreedy: *?)
{n} exactly n times
{n,} n or more
{,n} 0 to n
{n,m} at least n & at most m (nongreedy: {n,m}?)
^text must begin with text
text$ must end with text
. any character, except newline
\d, \w, \s digit, word, or space character
\D, \W, \S anything except digit, word, or space character
[abc] any character of a, b, c
[^abc] any character, except a, b, c

Python Regular Expression with Flags

Many functions in the re module, including re.compile(), accept a flags argument that changes how the regex pattern is interpreted:
re.A: ASCII-only matching
re.I: ignore case
re.L: locale dependent
re.M: multi-line
re.S: dot matches all
re.U: Unicode matching
re.X: verbose (whitespace in the pattern is ignored and comments are allowed)

Case-insensitive Regex

If we want to match text regardless of uppercase or lowercase, we pass re.I (or re.IGNORECASE) as the second argument to re.compile().
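For example, with a made-up sentence in which the name appears in uppercase:

```python
import re

name_regex = re.compile(r'grokonez', re.I)

match = name_regex.search('Welcome to GROKONEZ')
matched_text = match.group()  # the text is returned as it appears in the string

print(matched_text)
```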

Combine Flags

If we want to ignore capitalization and also let the dot character match newlines, we combine re.I and re.S (or re.IGNORECASE and re.DOTALL) using the pipe character |:
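A sketch that needs both flags to succeed. The words 'first' and 'last' and the two-line sample text are assumptions:

```python
import re

text = 'FIRST line\nand the LAST one'

# re.I handles the uppercase words; re.S lets .* cross the newline.
combined_regex = re.compile(r'first.*last', re.I | re.S)
plain_regex = re.compile(r'first.*last')

combined_text = combined_regex.search(text).group()
plain_match = plain_regex.search(text)

print(repr(combined_text))
print(plain_match)  # None without the flags
```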

By grokonez | January 3, 2019.
