A regular expression (regex for short) is a sequence of characters that specifies a text pattern to search for. In this tutorial, we look at ways to work with Python regular expressions.
Related Posts:
– Python Regular Expression to extract phone number from text
– Python Regular Expression to extract email from text
How to use Python Regular Expression
Working with a regular expression takes four steps:
Import the regex module
All Python regex functions are in the re module. Remember to import it at the beginning of your Python code or any time IDLE is restarted.
>>> import re
Create Regex object
We create a Regex object by passing a string that represents the regular expression to re.compile().
For example, to match the phone number pattern:
>>> phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
Get Match object
A Regex object has a search() method that scans the given string for the first text that matches the regex. It returns:
– None if the regex pattern is not found
– a Match object if the pattern is found
>>> mo = phoneRegex.search('My phone number is 123-555-4242.')
Get matched text
We call the Match object’s group() method to get the actual matched text from the searched string.
>>> print('Found: ' + mo.group())
At a glance:
>>> import re
>>> phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneRegex.search('My phone number is 123-555-4242.')
>>> print('Found: ' + mo.group())
Found: 123-555-4242
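Because search() returns None when nothing matches, calling group() directly on the result can fail with an AttributeError. The following is a minimal sketch of the four steps with that check added. The names PHONE_REGEX and find_phone are hypothetical and used only for illustration.

```python
import re

# Hypothetical names for illustration only; they are not part of the re module.
PHONE_REGEX = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')


def find_phone(text):
    """Return the first phone-like substring in text, or None if there is none."""
    mo = PHONE_REGEX.search(text)
    if mo is None:  # search() found nothing, so there is no Match object
        return None
    return mo.group()


print(find_phone('My phone number is 123-555-4242.'))
print(find_phone('No number here.'))
```

Checking for None before calling group() keeps the lookup safe for text that contains no phone number.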
Basic Python Regular Expression example
Use parentheses to group
– We add parentheses to create groups in the regex: (group_1)text(group_2)text(group_3).
– Then we use the group() method of the Match object to grab the matching text. The first set of parentheses in a regex string is group(1), the second set is group(2), and so on. group(0) or group() returns the entire matched text.
>>> regex = re.compile(r'(\d\d)(\d)-(\d\d\d-\d\d\d\d)')
>>> mo = regex.search('My phone number is 123-555-4242.')
>>> mo.group(1)
'12'
>>> mo.group(2)
'3'
>>> mo.group(3)
'555-4242'
>>> mo.group(0)
'123-555-4242'
>>> mo.group()
'123-555-4242'
– We can get all groups at once with the groups() method, which returns a tuple with one value per group:
>>> mo.groups()
('12', '3', '555-4242')
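Since groups() returns a tuple, its values can be unpacked into separate variables in one assignment. A small sketch, assuming a two-group phone pattern; the variable names area_code and main_number are illustrative choices:

```python
import re

# Two groups: the area code and the rest of the number.
regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = regex.search('My phone number is 123-555-4242.')

# Tuple unpacking: one variable per group, in order.
area_code, main_number = mo.groups()
print(area_code)
print(main_number)
```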
Match multiple groups
We use the | (pipe) character to match one of several expressions.
For example, the regular expression r'grokonez|JavaSampleApproach' will match either 'grokonez' or 'JavaSampleApproach'.
>>> regex = re.compile(r'grokonez|\w\w\w\wSampleApproach')
>>> mo = regex.search('What does grokonez mean?')
>>> mo.group()
'grokonez'
When both expressions occur in the searched string, the first occurrence of matching text will be returned.
>>> regex = re.compile(r'grokonez|\w\w\w\wSampleApproach')
>>> mo = regex.search('JavaSampleApproach.com was the predecessor website to grokonez.com.')
>>> mo.group()
'JavaSampleApproach'
We can also use the | character inside parentheses to match one of several patterns as part of a larger regular expression.
>>> regex = re.compile(r'(grokonez|gkz|grokee).com')
>>> mo = regex.search('JavaSampleApproach.com was the predecessor website to grokonez.com.')
>>> mo.group()
'grokonez.com'
>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'gkz.com'
Using the findall() method, we can find all matching occurrences, as shown later in this tutorial.
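When the alternatives sit inside parentheses, group(1) tells us which alternative actually matched. A short sketch of that idea; note that it writes the dot as \. so that it matches only a literal dot, which is an addition to the pattern used above:

```python
import re

# The backslash makes the dot literal, so only a real '.' is accepted before 'com'.
regex = re.compile(r'(grokonez|gkz|grokee)\.com')

mo = regex.search('gkz.com and grokonez.com are one.')
whole_match = mo.group()     # the entire matched text
alternative = mo.group(1)    # the alternative inside the parentheses

print(whole_match)
print(alternative)
```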
Match optionally
We can make part of a regex optional by using the ? character.
The group preceding the ? character is an optional part of the pattern: the regex matches whether or not that text is there.
Remember that the first occurrence of matching text will be returned.
>>> regex = re.compile(r'(gro)?konez.com')
>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'grokonez.com'
>>> mo = regex.search('konez.com is the parent site of grokonez.com.')
>>> mo.group()
'konez.com'
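One detail worth knowing: when an optional group does not take part in the match, group() for that group returns None rather than an empty string. A minimal sketch, with illustrative variable names:

```python
import re

regex = re.compile(r'(gro)?konez\.com')

with_prefix = regex.search('grokonez.com')
without_prefix = regex.search('konez.com')

# The optional group matched 'gro' in the first string ...
print(with_prefix.group(1))
# ... and did not participate in the second match, so it is None.
print(without_prefix.group(1))
```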
Match zero or more
The group that precedes the star * can occur any number of times (zero or more) in the text.
>>> regex = re.compile(r'gro(ko)*nez.com')
>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'grokonez.com'
>>> mo = regex.search('gkz.com and gronez.com.')
>>> mo.group()
'gronez.com'
>>> mo = regex.search('gkz.com and grokokokonez.com are one.')
>>> mo.group()
'grokokokonez.com'
Match one or more
Unlike the star, the plus + character indicates that the group preceding it must appear at least once.
>>> regex = re.compile(r'gro(ko)+nez.com')
>>> mo = regex.search('gkz.com and gronez.com.')
>>> mo is None
True
>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'grokonez.com'
>>> mo = regex.search('gkz.com and grokokokonez.com are one.')
>>> mo.group()
'grokokokonez.com'
Match with specific repetition
We can specify the number of times that a group repeats by using a number in curly brackets.
>>> regex = re.compile(r'gro(ko){3}nez.com')
>>> mo = regex.search('gkz.com and grokokokonez.com are one.')
>>> mo.group()
'grokokokonez.com'
>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo is None
True
We can also give a range by adding a second number in curly brackets: {3,5} matches at least 3 and at most 5 repetitions.
>>> regex = re.compile(r'gro(ko){3,5}nez.com')
>>> # 3 'ko'
>>> mo = regex.search('grokokokonez.com.')
>>> mo.group()
'grokokokonez.com'
>>> # 4 'ko'
>>> mo = regex.search('grokokokokonez.com.')
>>> mo.group()
'grokokokokonez.com'
>>> # 5 'ko'
>>> mo = regex.search('grokokokokokonez.com.')
>>> mo.group()
'grokokokokokonez.com'
>>> # 6 'ko': no match, so search() returns None
>>> mo = regex.search('grokokokokokokonez.com.')
>>> mo is None
True
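Either number in the curly brackets can be left out: {n,} means n or more, and {,n} means from 0 to n. A brief sketch of both open-ended forms; the variable names are illustrative:

```python
import re

at_least_two = re.compile(r'gro(ko){2,}nez')  # two or more 'ko'
at_most_two = re.compile(r'gro(ko){,2}nez')   # zero, one, or two 'ko'

# Only one 'ko': too few for {2,}
print(at_least_two.search('grokonez'))
# Three 'ko': accepted by {2,}
print(at_least_two.search('grokokokonez').group())

# Zero 'ko': accepted by {,2}
print(at_most_two.search('gronez').group())
# Three 'ko': too many for {,2}
print(at_most_two.search('grokokokonez'))
```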
Greedy and Nongreedy matching
(gkz){3,5} can match 3, 4, or 5 instances of 'gkz' in the string 'gkzgkzgkzgkzgkz'.
By default, Python regular expressions are greedy, which means that the longest possible string will be matched.
>>> regex = re.compile(r'(gkz){3,5}')
>>> mo = regex.search('gkzgkzgkzgkzgkz.')
>>> mo.group()
'gkzgkzgkzgkzgkz'  # instead of 'gkzgkzgkz' (3)
To match the shortest string (nongreedy), we put a question mark ? right after the curly brackets:
>>> regex = re.compile(r'(gkz){3,5}?')
>>> mo = regex.search('gkzgkzgkzgkzgkz.')
>>> mo.group()
'gkzgkzgkz'
Get all matches
A Regex object has a findall() method that returns a list of all matches (each string representing one match) in the searched string.
>>> regex = re.compile(r'gro[ko]*nez.com')
>>> regex.findall('gkz.com, grokonez.com, grokokonez.com, grokokokonez.com are one.')
['grokonez.com', 'grokokonez.com', 'grokokokonez.com']
Note that the code above uses square brackets [] (a character class), not a group, so findall() returns a list of strings. If the regular expression contains two or more groups (with ()), findall() returns a list of tuples, one tuple per match; with exactly one group, it returns a list of that group's strings.
>>> regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> regex.findall('Cell: 123-678-6789 Work: 123-555-9999')
[('123', '678-6789'), ('123', '555-9999')]
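The shape of the findall() result depends on how many groups the pattern has. The sketch below runs three variants of the phone pattern on the same text to make the difference visible; the variable names are illustrative:

```python
import re

text = 'Cell: 123-678-6789 Work: 123-555-9999'

# No groups: a list of whole matches.
no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d').findall(text)

# One group: a list of that group's text only.
one_group = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d').findall(text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)').findall(text)

print(no_groups)
print(one_group)
print(two_groups)
```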
Python Regex Symbols
Basic character classes
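In the code above, we have used \d for any numeric digit. For ASCII text, \d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9). In Python 3, \d in a str pattern also matches other Unicode decimal digits unless the re.A flag is set.
There are other shorthand character classes: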
\d | a numeric digit from 0 to 9 |
\D | NOT a numeric digit from 0 to 9 |
\w | letter, numeric digit, underscore character |
\W | NOT a letter, numeric digit, underscore character |
\s | space, tab, newline character |
\S | NOT a space, tab, newline |
The regular expression \w+:\s\d+ will match text that:
+ has one or more letter/digit/underscore characters (\w+)
+ followed by a : character
+ followed by a space, tab, or newline character (\s)
+ ends with one or more numeric digits (\d+)
>>> regex = re.compile(r'\w+:\s\d+')
>>> text = 'The zoo has cats: 12, dogs: 8, elephants: 6...'
>>> regex.findall(text)
['cats: 12', 'dogs: 8', 'elephants: 6']
>>> regex = re.compile(r'(\w+):\s(\d+)')
>>> text = 'The zoo has cats: 12, dogs: 8, elephants: 6...'
>>> regex.findall(text)
[('cats', '12'), ('dogs', '8'), ('elephants', '6')]
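The list of tuples that findall() returns for a two-group pattern converts naturally into a dictionary. A small sketch, assuming we want the counts as integers; the name counts is an illustrative choice:

```python
import re

regex = re.compile(r'(\w+):\s(\d+)')
text = 'The zoo has cats: 12, dogs: 8, elephants: 6...'

# Each tuple is (name, number); convert the number text to int.
counts = {name: int(number) for name, number in regex.findall(text)}
print(counts)
```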
Custom character classes
Define custom character class
We can define our own character class using square brackets.
For example, [aeiouAEIOU] will match any vowel (lowercase and uppercase):
>>> regex = re.compile(r'[aeiouAEIOU]')
>>> regex.findall('grokonez Programming Tutorials')
['o', 'o', 'e', 'o', 'a', 'i', 'u', 'o', 'i', 'a']
Character class in Range
We can include ranges of letters or numbers by using a hyphen:
– Character class [1-7]: only the digits from 1 to 7.
– Character class [b-f]: only the letters from b to f (b, c, d, e, f).
– Character class [a-zA-Z0-9]: all lowercase letters, uppercase letters, and digits.
>>> regex = re.compile(r'[a-f]')
>>> regex.findall('grokonez Programming tutorials')
['e', 'a', 'a']
Negative character class
To make a negative character class, we put a caret symbol ^ right after the character class’s opening bracket [.
>>> regex = re.compile(r'[^3-7]')
>>> regex.findall('123456789')
['1', '2', '8', '9']
>>> # consonant
>>> regex = re.compile(r'[^aeiouAEIOU]')
>>> regex.findall('grokonez')
['g', 'r', 'k', 'n', 'z']
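A range class and its negation split a string into two complementary sets of characters. As a sketch, the class [0-9] below collects the digits of a phone number while [^0-9] collects everything else; the variable names are illustrative:

```python
import re

phone = '123-555-4242'

# [0-9] keeps only digits; joining them gives the bare number.
digits_only = ''.join(re.compile(r'[0-9]').findall(phone))

# [^0-9] keeps everything that is NOT a digit, here the separators.
separators = re.compile(r'[^0-9]').findall(phone)

print(digits_only)
print(separators)
```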
Caret symbol & Dollar sign
Caret symbol
To indicate that a match must occur at the beginning of the searched text, we use the caret symbol ^ as the first character of the regex.
>>> regex = re.compile(r'^grok')
>>> regex.search('grokonez Programming Tutorials')
<re.Match object; span=(0, 4), match='grok'>
>>> regex.search('Learn programming with grokonez') is None
True
Dollar sign
To indicate that a match must occur at the end of the searched text, we use the dollar sign $ as the last character of the regex.
>>> regex = re.compile(r'konez$')
>>> regex.search('Learn programming with grokonez')
<re.Match object; span=(26, 31), match='konez'>
>>> regex.search('grokonez Programming Tutorials') is None
True
Match entire string
We can use ^ and $ together to indicate that the entire string must match the regex:
>>> regex = re.compile(r'^\d+$')
>>> regex.search('The quantity is 3457439') is None
True
>>> regex.search('3457439')
<re.Match object; span=(0, 7), match='3457439'>
>>> regex.search('345 7439') is None
True
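As an alternative to wrapping a pattern in ^ and $, a compiled pattern also has a fullmatch() method that succeeds only when the whole string matches. The sketch below compares the two; is_all_digits is a hypothetical helper name. One difference to be aware of: $ also matches just before a newline at the very end of the string, while fullmatch() does not accept that trailing newline.

```python
import re

anchored = re.compile(r'^\d+$')
digits = re.compile(r'\d+')


def is_all_digits(text):
    """Hypothetical helper: True only if the whole string consists of digits."""
    return digits.fullmatch(text) is not None


print(is_all_digits('3457439'))
print(is_all_digits('The quantity is 3457439'))
print(is_all_digits('345 7439'))

# '$' tolerates one trailing newline; fullmatch() does not.
print(anchored.search('3457439\n') is not None)
print(is_all_digits('3457439\n'))
```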
Wildcard character
The wildcard character . (dot) in a regex matches any character (except the newline character).
>>> regex = re.compile(r'.ro')
>>> regex.findall('introduction to grokonez robust programming tutorials')
['tro', 'gro', ' ro', 'pro']
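Because the dot is a wildcard, a pattern such as gkz.com also accepts text where the dot position holds some other character. To match a literal dot, escape it with a backslash. A minimal sketch of the difference, using the illustrative names loose and strict:

```python
import re

loose = re.compile(r'gkz.com')    # '.' matches any character except newline
strict = re.compile(r'gkz\.com')  # '\.' matches only a literal dot

# The wildcard version accepts a hyphen in place of the dot.
print(loose.search('gkz-com').group())
# The escaped version rejects it ...
print(strict.search('gkz-com'))
# ... but still matches the real domain.
print(strict.search('visit gkz.com today').group())
```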
Match everything except newline
We already know that:
– the . (dot) character means any single character except the newline
– the * (star) character means zero or more
=> So dot-star .* matches everything (except the newline character).
>>> regex = re.compile('Name: (.*) - Location: (.*)')
>>> mo = regex.search('Name: grokoneer - Location: US')
>>> mo.groups()
('grokoneer', 'US')
Match everything
We can pass re.DOTALL to the compile() function as the second argument to make the dot character match all characters (including the newline).
>>> regex = re.compile('.*')
>>> mo = regex.search('grokonez\nProgramming tutorials')
>>> mo.group()
'grokonez'
>>> regex = re.compile('.*', re.DOTALL)
>>> mo = regex.search('grokonez\nProgramming tutorials')
>>> mo.group()
'grokonez\nProgramming tutorials'
Greedy & Nongreedy
By default, dot-star works in greedy mode: it matches as much text as possible.
To match text in nongreedy mode, follow it with a question mark: .*?
>>> # greedy
>>> regex = re.compile(r'<!--.*-->')
>>> mo = regex.search('<!--grokonez code-->regular expression testing code-->')
>>> mo.group()
'<!--grokonez code-->regular expression testing code-->'
>>> # nongreedy
>>> regex = re.compile(r'<!--.*?-->')
>>> mo = regex.search('<!--grokonez code-->regular expression testing code-->')
>>> mo.group()
'<!--grokonez code-->'
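The difference matters most when a string holds several delimited pieces. In the sketch below, an assumed sample string contains two comments; the greedy pattern swallows both into one match, while the nongreedy pattern returns each comment body separately.

```python
import re

# Assumed sample text with two separate comments.
text = '<!--first--> code <!--second-->'

# Greedy: .* runs to the LAST '-->' in the string.
greedy = re.compile(r'<!--(.*)-->').findall(text)

# Nongreedy: .*? stops at the FIRST '-->' after each '<!--'.
nongreedy = re.compile(r'<!--(.*?)-->').findall(text)

print(greedy)
print(nongreedy)
```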
Python Regex Symbols Review
? | zero or one |
+ | one or more (nongreedy: +? ) |
* | zero or more (nongreedy: *? ) |
{n} | exactly n times |
{n,} | n or more |
{,n} | 0 to n |
{n,m} | at least n & at most m (nongreedy: {n,m}? ) |
^text | must begin with text |
text$ | must end with text |
. | any character, except newline |
\d , \w , \s | digit, word, or space character |
\D , \W , \S | anything except digit, word, or space character |
[abc] | any character of a, b, c |
[^abc] | any character, except a, b, c |
Python Regular Expression with Flags
Many Python regex methods and functions accept flag arguments that change how the pattern is matched:
– re.A: ASCII-only matching
– re.I: ignore case
– re.L: locale dependent
– re.M: multi-line
– re.S: dot matches all
– re.U: Unicode matching
– re.X: verbose (allows whitespace and comments in the pattern)
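Two of these flags are easiest to understand from an example. The sketch below shows re.M, which lets ^ match at the start of every line, and re.X, which lets a pattern be spread over several lines with comments. The sample strings and variable names are assumptions made for illustration.

```python
import re

text = 'grokonez\nProgramming tutorials'

# Without re.M, ^ matches only at the very start of the string.
first_word_only = re.compile(r'^\w+').findall(text)
# With re.M, ^ also matches right after each newline.
first_word_per_line = re.compile(r'^\w+', re.M).findall(text)

# With re.X, whitespace in the pattern is ignored and # starts a comment.
phone_regex = re.compile(r'''
    (\d\d\d)            # area code
    -                   # separator
    (\d\d\d-\d\d\d\d)   # main number
''', re.X)
phone_groups = phone_regex.search('Cell: 123-678-6789').groups()

print(first_word_only)
print(first_word_per_line)
print(phone_groups)
```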
Case-insensitive Regex
If we want to match text regardless of uppercase or lowercase, we pass re.I to re.compile().
>>> regex = re.compile(r'grokonez', re.I)
>>> regex.search('Grokonez programming tutorials')
<re.Match object; span=(0, 8), match='Grokonez'>
>>> regex.search('GroKonez Python tutorials')
<re.Match object; span=(0, 8), match='GroKonez'>
>>> regex.search('GROKONEZ tutorials')
<re.Match object; span=(0, 8), match='GROKONEZ'>
Combine Flags
If we want to ignore capitalization and also let the dot character match newlines, we combine re.I and re.S (or re.IGNORECASE & re.DOTALL) using the pipe character |:
>>> regex = re.compile('name: j.*location: us', re.I | re.S)
>>> text = '''
... name: Jack
... location: US
... '''
>>> regex.search(text).group()
'name: Jack\nlocation: US'
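The short and long flag names refer to the same flags, so either spelling can be used. A brief sketch that compiles the same pattern both ways and compares the results; the variable names are illustrative:

```python
import re

short_form = re.compile('name: j.*location: us', re.I | re.S)
long_form = re.compile('name: j.*location: us', re.IGNORECASE | re.DOTALL)

text = '\nname: Jack\nlocation: US\n'

# Both compiled patterns find the same text across the newline.
short_result = short_form.search(text).group()
long_result = long_form.search(text).group()

print(short_result == long_result)
print(repr(short_result))
```

The long names are more verbose but can make the intent clearer to readers who do not know the one-letter forms.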