Python Concepts/Regular Expressions

Objective



 * What is a regular expression?


 * How to test a string for content that matches a regular expression?


 * How to retrieve content that matches a regular expression?


 * How to split a string at points in the string that match a given regular expression?


 * How to replace parts of a string that match a given regular expression?

Lesson
A regular expression is a string that describes a pattern. Python's re (regular expression) methods scan a supplied string to determine whether it contains text matching the regular expression. If such text is found, the method may report it, split the string at the matching text, or replace the matching text.

A regular expression may be as simple as a few characters to be interpreted literally, e.g., 'abc'. A regular expression may also contain special characters that tell the regular expression method how to interpret the literal characters, e.g., 'abc*'. That expression matches 'a' + 'b' + any number of 'c'.
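A minimal sketch of the three actions described above (test, split, replace), using the pattern 'abc*'. The sample string s is an assumption made up for illustration.

```python
import re

s = 'xx ab yy abccc zz'   # hypothetical sample string
pattern = r'abc*'         # 'a' + 'b' + any number of 'c'

found = re.search(pattern, s) is not None   # test for matching content
pieces = re.split(pattern, s)               # split the string at each match
replaced = re.sub(pattern, '#', s)          # replace each match with '#'

print(found)      # True
print(pieces)     # ['xx ', ' yy ', ' zz']
print(replaced)   # xx # yy # zz
```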

Matching literal characters
A regular expression may be as simple as one character, searched for within a string. A longer run of literal characters, such as a whole word, is searched for in the same way.

Method re.findall(....) produces a list of all matches found:
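As a hedged illustration of literal matching, the sketch below searches a made-up sentence (an assumption, not the lesson's original string) for one character and for a short word.

```python
import re

s = 'The quick brown fox jumps over the lazy dog.'  # hypothetical sample

# re.search returns a match object for the first match, or None.
m = re.search('o', s)
print(m.span(), m.group())     # (12, 13) o

# re.findall returns every non-overlapping match as a list of strings.
all_o = re.findall('o', s)
all_the = re.findall('the', s)  # case sensitive: 'The' is not matched
print(all_o)     # ['o', 'o', 'o', 'o']
print(all_the)   # ['the']
```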

Modifying the search
Flag re.VERBOSE permits white space and comments in the regular expression. Flags are combined with '|'. re.IGNORECASE|re.VERBOSE is read as 're.IGNORECASE or re.VERBOSE' (inclusive or).
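A short sketch of combining flags. The sample string and the layout of the pattern are assumptions chosen for illustration.

```python
import re

s = 'The cat sat on THE mat.'   # hypothetical sample

# With re.VERBOSE, white space in the pattern is ignored and '#' starts a comment.
pattern = r'''
    t h e     # the letters 't', 'h', 'e'
'''

plain = re.findall('the', s)                                   # case sensitive
flagged = re.findall(pattern, s, re.IGNORECASE | re.VERBOSE)   # both flags

print(plain)     # []
print(flagged)   # ['The', 'THE']
```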

Matching groups of characters
Regular expressions can become complicated and unintelligible quickly. It may help to name the more common expressions. By naming expressions you can specify exactly what you want.

To match 'ee':

The special characters {m,n} cause the resulting RE to match from m to n repetitions of the preceding RE. Common matchings are:

To match one or more of

Matching members of a set
The string 'abc' means match 'abc' exactly. If 'a', 'b' and 'c' are members of a set (within brackets [ ]), the expression '[abc]' means 'a' or 'b' or 'c'.
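The difference between a literal string and a set can be sketched as follows. The sample string is hypothetical.

```python
import re

s = 'a cab in a back alley'   # hypothetical sample

literal = re.findall('abc', s)     # the exact sequence 'abc'
members = re.findall('[abc]', s)   # any single 'a' or 'b' or 'c'
runs = re.findall('[abc]+', s)     # one or more members of the set in a row

print(literal)   # []
print(members)   # ['a', 'c', 'a', 'b', 'a', 'b', 'a', 'c', 'a']
print(runs)      # ['a', 'cab', 'a', 'bac', 'a']
```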

Alpha-numeric
Find all groups of alpha characters:

Find all groups of numeric characters:

Find all words in the string that contain the letters

Find all words in the string that contain at least 5 letters:

It's OK to be lazy. The important thing is to define the pattern accurately and then let the re method make sense of it. However, with a little practice you will probably write the above search as:

Non alpha-numeric
The caret ^ at the beginning of a set negates all the members of the set. '[^abc]' means any character that is not ('a' or 'b' or 'c').
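A minimal sketch of negated sets, using a made-up sample string.

```python
import re

s = 'abc123def'   # hypothetical sample

not_abc = re.findall('[^abc]', s)      # any character except 'a', 'b', 'c'
non_digits = re.findall('[^0-9]+', s)  # groups of non numeric characters

print(not_abc)      # ['1', '2', '3', 'd', 'e', 'f']
print(non_digits)   # ['abc', 'def']
```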

Find all groups that contain non numeric characters:

Find all groups containing non alpha characters:

Non white space
Find all blocks of non white space:

Find all blocks of non white space that contain at least 4 letters:

Matching white space
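In ASCII text, white space is any one of: space, tab (\t), newline (\n), carriage return (\r), form feed (\f) or vertical tab (\v).

The regular expression that means 'any white space character' is \s. It may help to name the most common regular expressions: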

Some special characters that tell the methods how to interpret the other characters in the regular expression are:

Searching for white space:

Iterating over matches found:

Anchoring the pattern:

Searching for white space at extremities of string:
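A hedged sketch of anchored white space searches. The string s is an assumption; ^ anchors the pattern at the beginning of the string and $ at the end.

```python
import re

s = '  \t hello world \n'   # hypothetical sample

leading = re.search(r'^\s+', s)    # white space at the beginning
trailing = re.search(r'\s+$', s)   # white space at the end

# Removing white space at both extremities, similar in effect to str.strip().
stripped = re.sub(r'^\s+|\s+$', '', s)

print(repr(leading.group()))    # '  \t '
print(repr(trailing.group()))   # ' \n'
print(repr(stripped))           # 'hello world'
```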

Splitting on white space
Remove white space from beginning of s1, but preserve white space at beginning of line 1a:

Remove white space from end of s2, but preserve white space at end of line 2b:

Split s3 into paragraphs:

Produce s4, equivalent to s1 without extraneous white space:

Special characters
Special characters are sometimes called metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

Special characters [ ]
Brackets contain members of a class:

Special characters { }
Braces indicate a range:

e{17} # Match exactly 'e' * 17

[0123456789]{3,} # Match 3 or more numeric characters.

[abc]{3,5} # Match 3 or 4 or 5 of ('a' or 'b' or 'c')

p{,3} # Match 0 or 1 or 2 or 3 of 'p'.
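The brace examples above can be checked with a short sketch. The sample string is hypothetical; re.fullmatch succeeds only if the whole string matches the pattern.

```python
import re

s = 'p pp ppp abcab 12 12345'   # hypothetical sample

digits = re.findall(r'[0123456789]{3,}', s)   # 3 or more numeric characters
letters = re.findall(r'[abc]{3,5}', s)        # 3 to 5 of 'a', 'b', 'c'

seventeen = re.fullmatch(r'e{17}', 'e' * 17) is not None   # exactly 17 'e'
three_p = re.fullmatch(r'p{,3}', 'ppp') is not None        # 0 to 3 'p'
four_p = re.fullmatch(r'p{,3}', 'pppp') is not None        # 4 is too many

print(digits)     # ['12345']
print(letters)    # ['abcab']
print(seventeen, three_p, four_p)   # True True False
```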

Special characters ( )
Parentheses define a group. The method matches whatever regular expression is inside the parentheses. The contents of the group can be matched later in the expression with the \number special sequence, or can be retrieved from the match object after the method returns, for example with m.group(1).

Matching contents of group
The \number special sequence matches the contents of the group of the same number, where number is limited to 99. A sequence such as \nnn, where n is an octal digit, is interpreted as the character with octal value nnn.

Groups are numbered starting from 1.
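A small sketch of a backreference, assuming a made-up sentence with doubled words. \1 matches whatever group 1 matched, while a three-digit octal sequence such as \101 is read as a character ('A').

```python
import re

s = 'this is is a test of of the the system'   # hypothetical sample

# A word, white space, then the same word again.
doubled = re.findall(r'\b(\w+)\s+\1\b', s)

# Three octal digits are a character code, not a group reference.
octal = re.findall(r'\101', 'ABA')

print(doubled)   # ['is', 'of', 'the']
print(octal)     # ['A', 'A']
```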

Named groups
In the following regular expression the syntax (?P<name>...) identifies a named group.
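As a hedged illustration, the pattern below is a made-up example (not the lesson's original expression) with two named groups, key and value.

```python
import re

# Hypothetical pattern: a word, '=', then an integer.
pattern = r'(?P<key>\w+)\s*=\s*(?P<value>\d+)'

m = re.search(pattern, 'width = 42')

print(m.group('key'))     # width
print(m['value'])         # 42  (same as m.group('value'))
print(m.group(1))         # width  (named groups are also numbered)
print(m.groupdict())      # {'key': 'width', 'value': '42'}
```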

An attempt to use optional parameter re.VERBOSE produces strange results:

Optional parameter re.VERBOSE works well provided that additional white space does not change the definition of a token:

Special characters * + ?
Special character * means 'any number of'. The following are equivalent:

p*              p{0,}              # Any number of 'p'.
[0123456789]*   [0123456789]{0,}   # Any number of numeric.

Special character + means '1 or more of'. The following are equivalent:

p+              p{1,}              # 1 or more of 'p'.
[0123456789]+   [0123456789]{1,}   # 1 or more of numeric.

Special character ? means '0 or 1 of'. The following are equivalent:

p?              p{0,1}             # 0 or 1 of 'p'.
[0123456789]?   [0123456789]{0,1}  # 0 or 1 of numeric.
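The equivalences above can be checked with a short sketch. The sample string is an assumption.

```python
import re

s = 'p pp 7 42 x'   # hypothetical sample

pairs = [
    (r'p*', r'p{0,}'),
    (r'p+', r'p{1,}'),
    (r'p?', r'p{0,1}'),
    (r'[0123456789]+', r'[0123456789]{1,}'),
]

# Each shorthand finds exactly the same matches as its brace form.
all_equal = all(re.findall(short, s) == re.findall(braces, s)
                for short, braces in pairs)

print(all_equal)                         # True
print(re.findall(r'p+', s))              # ['p', 'pp']
print(re.findall(r'[0123456789]+', s))   # ['7', '42']
```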

Special characters ^ $
Special character ^ anchors the search at the beginning of the string.

Special character $ anchors the search at the end of the string.

When both ^ and $ are used, the regular expression must match the whole string.
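A minimal sketch of anchoring, using made-up strings.

```python
import re

starts = re.search(r'^\d+', '123abc') is not None   # digits at the beginning
ends = re.search(r'\d+$', '123abc') is not None     # digits at the end
whole_a = re.search(r'^\d+$', '123abc') is not None # whole string numeric?
whole_b = re.search(r'^\d+$', '123') is not None

print(starts, ends, whole_a, whole_b)   # True False False True
```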

Special character ^ in a set
When the caret is the first character in a set, it negates the whole set.

Special character .
In the default mode, this matches any character except a newline. It is equivalent to [^\n].

Display all lines in the string s1:
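A hedged sketch: the value of s1 below is an assumption, since any multi-line string will do. Because '.' does not match a newline, '.+' matches one non-empty line at a time.

```python
import re

s1 = 'line one\nline two\n\nline four'   # hypothetical multi-line string

lines = re.findall(r'.+', s1)   # each non-empty line; the empty line is skipped

# In the default mode '.' and '[^\n]' match the same characters.
same = re.findall(r'.', s1) == re.findall(r'[^\n]', s1)

print(lines)   # ['line one', 'line two', 'line four']
print(same)    # True
```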

\S and \s
Special character \S means any non white space character. Special character \s means any white space character. The following match \s:

\D and \d
Special character \D means any non numeric character. Special character \d means any numeric character. The following match \d:

\W and \w
Special character \W means any non word character, where "word" is a word in Python. Special character \w means any word character. The following 134 characters match \w:

Some words in English carry an accent. Special character \w matches all letters in these words.

To limit special character \w to ASCII characters:
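One way to do this is the re.ASCII flag. The sample words below are assumptions; the accented letters are written as escapes so that each is a single code point.

```python
import re

s = 'na\u00efve caf\u00e9'   # hypothetical sample: two accented words

unicode_words = re.findall(r'\w+', s)            # default: Unicode word characters
ascii_words = re.findall(r'\w+', s, re.ASCII)    # \w limited to [a-zA-Z0-9_]

print(unicode_words)   # both words, accents included
print(ascii_words)     # ['na', 've', 'caf']
```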

To produce words instead of match objects:

International characters
The methods work with international characters:

Find all words that contain the letter 'α' (Greek alpha):

List all the words in the string:

The special character \w matches any word character in both English and Greek.
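A minimal sketch with a made-up string that mixes English and (unaccented) Greek words. Note that Latin 'a' and Greek 'α' are different characters.

```python
import re

s = 'alpha αλφα beta βητα two δυο'   # hypothetical sample

with_alpha = re.findall(r'\w*α\w*', s)   # words containing Greek alpha
all_words = re.findall(r'\w+', s)        # every word, English or Greek

print(with_alpha)   # ['αλφα', 'βητα']
print(all_words)    # ['alpha', 'αλφα', 'beta', 'βητα', 'two', 'δυο']
```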

Matching literally
Within a set (within brackets [ ]) special characters lose their special significance. To search for a special character such as '*' literally, search for '[*]'.

Characters listed individually within brackets [ ]:

The caret ^ has a special meaning when it is the first character in a set: match all characters not in the set. To match a caret:

or put it after first place in the set:
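Both approaches can be sketched as follows, with a made-up sample string.

```python
import re

s = 'x^2 + y^2'   # hypothetical sample

escaped = re.findall(r'\^', s)       # escape the caret with a backslash
in_set = re.findall(r'[\^]', s)      # an escaped caret inside a set
not_first = re.findall(r'[2^]', s)   # a caret after first place is literal

print(escaped)     # ['^', '^']
print(in_set)      # ['^', '^']
print(not_first)   # ['^', '2', '^', '2']
```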

Characters that may have a special meaning within a set include ^, -, ] and \. For consistent results every time, escape them with a backslash:

You can see that regular expressions can become complicated and unintelligible quickly.

Pattern escaped:
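One way to escape a pattern, which may differ from the approach originally shown here, is the standard function re.escape. The literal text below is an assumption.

```python
import re

literal = 'a.b*c'                # text to be matched literally
escaped = re.escape(literal)     # special characters get a backslash

s = 'a.b*c aXbbc'                # hypothetical sample

unescaped_matches = re.findall(literal, s)   # '.' and '*' act as special characters
escaped_matches = re.findall(escaped, s)     # only the literal text matches

print(escaped)             # a\.b\*c
print(unescaped_matches)   # ['aXbbc']
print(escaped_matches)     # ['a.b*c']
```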

Matching dates
A date has format mm/dd/yyyy or Month dd, yyyy. Liberal use of white space is acceptable, as is a month name shortened to 3 or more characters. The following are acceptable dates:

    3 /9  / 1923
    11/ 22/  1987
    Aug23,2017
    Septe 4,  2001

The ultimate regular expression will be

    \b       # word boundary
    \d{1,2}  # 1 or 2 numeric
    \s*      # any white
    /
    \s*      # any white
    \d{1,2}  # 1 or 2 numeric
    \s*      # any white
    /
    \s*      # any white
    \d{4}    # 4 numeric
    \b       # word boundary
  |
    \b       # word boundary
    [ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,} # upper + 2 or more lower
    \s*      # any white
    \d{1,2}  # 1 or 2 numeric
    \s*      # any white
    ,
    \s*      # any white
    \d{4}    # 4 numeric
    \b       # word boundary

The above verbose format is much more readable than:

r'\b\d{1,2}\s*/\s*\d{1,2}\s*/\s*\d{4}\b|\b[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,}\s*\d{1,2}\s*,\s*\d{4}\b'

    7/4 / 1776
    3/2/2001
    12 / 19 / 2007
    Jul4,1776
    July 4, 1776
    August13 ,2003
    Nove 22, 2007
    February14,1776
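A compact sketch of the same idea, written with re.VERBOSE. The ranges [A-Z] and [a-z] are used here as shorthand for the spelled-out alphabets, and the sample sentence is an assumption.

```python
import re

date_pattern = r'''
    \b \d{1,2} \s* / \s* \d{1,2} \s* / \s* \d{4} \b      # mm/dd/yyyy
  |
    \b [A-Z][a-z]{2,} \s* \d{1,2} \s* , \s* \d{4} \b     # Month dd, yyyy
'''

# Hypothetical text containing dates in both formats.
s = 'On 7/4 / 1776 and Jul4,1776 and August13 ,2003 and 12 / 19 / 2007.'

dates = re.findall(date_pattern, s, re.VERBOSE)
for date in dates:
    print(date)
```

This pattern checks only the shape of a date. It does not verify that the month is between 1 and 12 or that the day exists.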

Unix dates
A "Unix date" has format

    $ date
    Wed Feb 14 08:24:24 CST 2018

In this section a regular expression to match a Unix date will accept

    Wed Feb 14 08:24:24 CST 2018
    Wednes Feb 14 08:24:24 CST 2018  # More than 3 letters in name of day.
    Wed Febru 14 08:24:24 CST 2018   # More than 3 letters in name of month.
    Wed Feb 14 8:24 : 24 CST 2018    # White space in hh:mm:ss.
    wed FeB 14 8:24 : 24 cSt 2018    # Bad punctuation.

Build parts of the regular expression.

    January|Januar|Janua|Janu|Jan|
    February|Februar|Februa|Febru|Febr|Feb|
    March|Marc|Mar|
    April|Apri|Apr|
    May|
    June|Jun|
    July|Jul|
    August|Augus|Augu|Aug|
    September|Septembe|Septemb|Septem|Septe|Sept|Sep|
    October|Octobe|Octob|Octo|Oct|
    November|Novembe|Novemb|Novem|Nove|Nov|
    December|Decembe|Decemb|Decem|Dece|Dec

    Sunday|Sunda|Sund|Sun|
    Monday|Monda|Mond|Mon|
    Tuesday|Tuesda|Tuesd|Tues|Tue|
    Wednesday|Wednesda|Wednesd|Wednes|Wedne|Wedn|Wed|
    Thursday|Thursda|Thursd|Thurs|Thur|Thu|
    Friday|Frida|Frid|Fri|
    Saturday|Saturda|Saturd|Satur|Satu|Sat

Build the regular expression.

    \b # Word boundary.
    (?P<day>
        Sunday|Sunda|Sund|Sun|
        Monday|Monda|Mond|Mon|
        Tuesday|Tuesda|Tuesd|Tues|Tue|
        Wednesday|Wednesda|Wednesd|Wednes|Wedne|Wedn|Wed|
        Thursday|Thursda|Thursd|Thurs|Thur|Thu|
        Friday|Frida|Frid|Fri|
        Saturday|Saturda|Saturd|Satur|Satu|Sat
    )
    \s+
    (?P<month>
        January|Januar|Janua|Janu|Jan|
        February|Februar|Februa|Febru|Febr|Feb|
        March|Marc|Mar|
        April|Apri|Apr|
        May|
        June|Jun|
        July|Jul|
        August|Augus|Augu|Aug|
        September|Septembe|Septemb|Septem|Septe|Sept|Sep|
        October|Octobe|Octob|Octo|Oct|
        November|Novembe|Novemb|Novem|Nove|Nov|
        December|Decembe|Decemb|Decem|Dece|Dec
    )
    \s+
    (?P<date> ([1-9])  |  ([12][0-9])  |  (3[01])  ) # 1 through 31
    \s+
    (?P<hours>  ((0{0,1}|1)[0-9])  |  (2[0-3])  ) # (0 or 00) through 23
    \s*\:\s*
    (?P<minutes> [0-5]{0,1}[0-9]  ) # (0 or 00) through 59
    \s*\:\s*
    (?P<seconds> [0-5]{0,1}[0-9]  ) # (0 or 00) through 59
    \s+
    (?P<time_zone> [ECMP][SD]T  )
    \s+
    (?P<year> (19[0-9][0-9])  |  (20[01][0-9])  ) # 1900 through 2019
    \b

The regular expression above contains 16 groups, of which 8 are named groups. The named groups make the expression easier to comprehend without comments. It is a relatively simple expression. Without named groups and appropriate formatting as above, regular expressions quickly become incomprehensible.

List all valid dates in string  above.

A listcomp accepts free-format Python:

    <_sre.SRE_Match object; span=(1, 34), match='MON Februar 12 0:30 : 19 CST 2018'>
    MON Februar 12 0:30 : 19 CST 2018
    {'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}

    <_sre.SRE_Match object; span=(155, 251), match='Thursda              feb             29         > # Output here is clipped.
    Thursda              feb             29                  00:30:19           CST            1944   # Correct data here.
    {'day': 'Thursda', 'month': 'feb', 'date': '29', 'hours': '00', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '1944'}

To access the groupdict of a field that matches:

d2 = {'day': 'MON', 'month': 'Februar', 'date': '12', 'hours': '0', 'minutes': '30', 'seconds': '19', 'time_zone': 'CST', 'year': '2018'}
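A reduced sketch of the approach, assuming a simplified pattern that accepts only 3-letter day and month names. The group names follow the groupdict output shown above; the pattern itself and the sample text are hypothetical.

```python
import re

# Simplified, hypothetical version of the Unix date pattern.
unix_date = re.compile(r'''
    \b
    (?P<day>       Sun|Mon|Tue|Wed|Thu|Fri|Sat )                      \s+
    (?P<month>     Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec )  \s+
    (?P<date>      3[01] | [12][0-9] | [1-9] )                        \s+
    (?P<hours>     2[0-3] | [01]?[0-9] )        \s* : \s*
    (?P<minutes>   [0-5]?[0-9] )                \s* : \s*
    (?P<seconds>   [0-5]?[0-9] )                \s+
    (?P<time_zone> [ECMP][SD]T )                \s+
    (?P<year>      19[0-9][0-9] | 20[01][0-9] )
    \b
''', re.VERBOSE | re.IGNORECASE)

m = unix_date.search('Built on Wed Feb 14 08:24:24 CST 2018 by hand')

print(m.group())       # Wed Feb 14 08:24:24 CST 2018
d2 = m.groupdict()     # one entry per named group
print(d2['month'], d2['year'])   # Feb 2018
```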

A little philosophy
Because this example is contained within the page "Regular Expressions," there is much decision making contained within the regular expression. For an example of code in which there is less decision making in the regular expression and more decision making in the listcomp, see an earlier version of Unix Dates.

The code in this section:

1) focuses on matching alpha-numeric patterns. Verification that February 29, 1944 was in fact a Thursday is outside the scope of this section.

2) does not consider the possibility of leap seconds. Saturday December 31 23:59:60 UTC 2016 was a legitimate time. It seems that accurate time (and how to display it) is a field of science unto itself and not yet standardized.

3) is not complete until properly tested. Testing the code could consume 3-10 times as much effort as writing it.

4) highlights that a listcomp is an ideal place for (almost) format-free Python code.

5) shows that, as a regular expression becomes more complicated, you may have to write Python code just to produce the regular expression.

Integers
Python's regular expressions scan strings; therefore, in this context, "integer" means a string representing an integer. Python's int() function tolerates some white space, therefore strings with white space around the digits are also examples of integers.

Do not rely on Python's  function to determine what a string represents:

Searching for integers: Method str.strip produces (almost) a clean int:

Method str.replace hides errors:

To produce a clean int:
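One possible approach is sketched below. The pattern and the helper name clean_int are assumptions for illustration: an optional sign, optional white space, then digits, with white space allowed at both ends.

```python
import re

# Hypothetical pattern: optional sign, optional white space, then digits.
int_pattern = re.compile(r'\s*([+-]?)\s*(\d+)\s*')

def clean_int(text):
    """Return the int represented by text, or None if text is not an integer."""
    m = int_pattern.fullmatch(text)
    if m is None:
        return None
    # Join sign and digits, dropping any white space between them.
    return int(m.group(1) + m.group(2))

print(clean_int('  -  456 '))   # -456
print(clean_int('+12'))         # 12
print(clean_int('12.5'))        # None
print(clean_int('1_000'))       # None, although int('1_000') returns 1000
```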

Floats
Examples of point floats:

Examples of exponent floats:

An exponent float can contain an  as   where   is the and  the

If not exponent float, it must be point float. This means at least one '.' and one digit.
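A hedged sketch of both kinds of float, modelled on Python's float literals. The pattern names and the sample list are assumptions, and signs and white space are deliberately ignored here.

```python
import re

# Hypothetical patterns.
point_float = r'(?:\d+\.\d*|\.\d+)'                          # 3.14  10.  .001
exponent_float = r'(?:(?:\d+\.\d*|\.\d+|\d+)[eE][+-]?\d+)'   # 1e100  3.14e-10
any_float = re.compile(exponent_float + '|' + point_float)

samples = ['3.14', '10.', '.001', '1e100', '3.14e-10', '10', '.', 'e5']
results = {s: any_float.fullmatch(s) is not None for s in samples}

for s, ok in results.items():
    print(repr(s), ok)   # the last three samples are not floats
```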

Matching any float
This example shows how substrings that match may be retrieved quickly and accurately from named groups.

The regular expression above contains four named groups. As a match for a float it is simple and correct but insufficient. Find all the floats in a string:

m = <_sre.SRE_Match object; span=(2, 7), match='+ 5e2'>

Information available in all groups:
    m[0] = '+ 5e2'  # Substring that matched reg_exp.
    m[1] = '+'      # Same as m['sign_of_float'].
    m[2] = '5'      # Same as m['significand'].
    m[3] = 'e2'     # Group not named.
    m[4] = None     # Same as m['sign_of_exponent'].
    m[5] = '2'      # Same as m['exponent'].

m = <_sre.SRE_Match object; span=(49, 53), match=' 3.3'>

Information available in all groups:
    m[0] = ' 3.3'   # Substring that matched reg_exp.
    m[1] = None     # Same as m['sign_of_float'].
    m[2] = '3.3'    # Same as m['significand'].
    m[3] = None     # Group not named.
    m[4] = None     # Same as m['sign_of_exponent'].
    m[5] = None     # Same as m['exponent'].

m = <_sre.SRE_Match object; span=(55, 63), match='- 3.3E+1'>

Information available in all groups:
    m[0] = '- 3.3E+1' # Substring that matched reg_exp.
    m[1] = '-'      # Same as m['sign_of_float'].
    m[2] = '3.3'    # Same as m['significand'].
    m[3] = 'E+1'    # Group not named.
    m[4] = '+'      # Same as m['sign_of_exponent'].
    m[5] = '1'      # Same as m['exponent'].
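The pattern below is a hypothetical reconstruction with the same group layout as the output above (four named groups plus one unnamed group). It is a sketch, not necessarily the original expression, and like the original it is deliberately simple.

```python
import re

reg_exp = re.compile(r'''
    (?P<sign_of_float> [+-] )?  \s*
    (?P<significand>   \d+\.\d* | \.\d+ | \d+ )
    ( [eE] (?P<sign_of_exponent> [+-] )? (?P<exponent> \d+ ) )?   # group 3, not named
''', re.VERBOSE)

m1 = reg_exp.search('+ 5e2')
m2 = reg_exp.search('- 3.3E+1')

print(m1.groups())         # ('+', '5', 'e2', None, '2')
print(m2.groups())         # ('-', '3.3', 'E+1', '+', '1')
print(m2['significand'])   # 3.3  (same as m2[2])
```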

Decoding a bytes object
L2 contains the contents of a bytes object presented in binary format:

L2 = (
    ['11001110', '10010010', '11001110', '10111001', '11001110', '10111010', '11001110'] +
    ['10111001', '00100000', '11101100', '10011100', '10000100', '11101101', '10000010'] +
    ['10100100', '11101011', '10110000', '10110000', '11101100', '10011011', '10000000'] +
    ['00100000', '01010111', '01101001', '01101011', '01101001']
)

Produce list L4 that contains L2 in a format that conforms to the UTF-8 standard:

L4 = (
    ['1100111010010010', '1100111010111001', '1100111010111010', '1100111010111001'] + # Greek
    ['00100000'] + # '\x20' is a space.
    ['111011001001110010000100', '111011011000001010100100', '111010111011000010110000', '111011001001101110000000'] + # Korean
    ['00100000'] + # '\x20' is a space.
    ['01010111', '01101001', '01101011', '01101001']
) # English

Decode L4:

L5 = ['Β', 'ι', 'κ', 'ι', ' ', '위', '키', '배', '움', ' ', 'W', 'i', 'k', 'i']
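A hedged sketch of one way to group and decode the bytes with a regular expression. The list below is a shortened sample (one Greek letter, a space, one Korean syllable and 'W') taken from the data above; the names utf8_char and decode are assumptions.

```python
import re

# Shortened sample of the data above.
L2 = ['11001110', '10010010', '00100000',
      '11101100', '10011100', '10000100', '01010111']

# The leading bits of the first byte tell how many bytes form one character.
utf8_char = re.compile(r'''
    0[01]{7}                         # 1 byte:  0xxxxxxx
  | 110[01]{5}   (?:10[01]{6})       # 2 bytes: 110xxxxx 10xxxxxx
  | 1110[01]{4}  (?:10[01]{6}){2}    # 3 bytes: 1110xxxx + 2 continuation bytes
  | 11110[01]{3} (?:10[01]{6}){3}    # 4 bytes: 11110xxx + 3 continuation bytes
''', re.VERBOSE)

L4 = utf8_char.findall(''.join(L2))

def decode(bits):
    """Convert a string of bits (a multiple of 8 long) to one character."""
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return raw.decode('utf-8')

L5 = [decode(bits) for bits in L4]
print(L4)
print(L5)
```

This sketch assumes the input is valid UTF-8; it does not report malformed sequences.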

Compiling regular expressions
If a regular expression is complicated or is to be used frequently, it can be compiled to produce a pattern object.

The regular expression  represents an integer. Produce a pattern object called 'integer'.

The compiled pattern called 'integer' has methods similar to the module-level functions of re, such as search and findall.
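A minimal sketch of a compiled pattern. The expression used for 'integer' here is a simplified assumption (optional sign followed by digits), and s1 is a made-up string.

```python
import re

integer = re.compile(r'[+-]?\d+')   # hypothetical, simplified integer pattern
s1 = 'a 12 b -7 c +300'             # hypothetical sample

first = integer.search(s1)                          # first match only
everything = integer.findall(s1)                    # all matches as strings
spans = [m.span() for m in integer.finditer(s1)]    # all matches as match objects
second = integer.search(s1, first.end())            # resume after the first match

print(first.group())    # 12
print(everything)       # ['12', '-7', '+300']
print(spans)            # [(2, 4), (7, 9), (12, 16)]
print(second.group())   # -7
```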

Displaying all matches
Displaying all matches manually, one after the other.

The search method of a compiled pattern accepts optional positional parameters pos and endpos:

Iterating through all matches.
or:

Output is same as above.

Splitting the string

Without preserving substrings that match
In  below note that parentheses have been removed from the expressions   and
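The effect of parentheses on re.split can be sketched with a made-up string: with a capturing group the matching substrings are preserved in the result, without one they are dropped.

```python
import re

s = 'a1b22c'   # hypothetical sample

kept = re.split(r'(\d+)', s)    # separators preserved
dropped = re.split(r'\d+', s)   # separators discarded

print(kept)      # ['a', '1', 'b', '22', 'c']
print(dropped)   # ['a', 'b', 'c']
```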

Replacing all substrings that match
Replacing all integers in string s1:

After splitting the string
is to be replaced by

is to be replaced by

is to be replaced by

is to be replaced by

Without splitting the string
s2 = '   123       -  456((     !!+++    2345 !! -2##'

4 matches found:
    <_sre.SRE_Match object; span=(4, 7), match='123'>
    <_sre.SRE_Match object; span=(14, 20), match='- 456'>
    <_sre.SRE_Match object; span=(31, 40), match='+   2345'>
    <_sre.SRE_Match object; span=(44, 46), match='-2'>

L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']

s2 = '   123       -  456((     !!+++    2345 !! INT_4##' after replacing span (44, 46)
s2 = '   123       -  456((     !!++INT_3 !! INT_4##' after replacing span (31, 40)
s2 = '   123       INT_2((     !!++INT_3 !! INT_4##' after replacing span (14, 20)
s2 = '   INT_1       INT_2((     !!++INT_3 !! INT_4##' after replacing span (4, 7)
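The replacement order shown above (last match first) can be sketched as follows. The integer pattern and the shortened string are assumptions. Working from right to left keeps the spans of the earlier matches valid while the string changes length.

```python
import re

integer = re.compile(r'[+-]?\s*\d+')     # hypothetical: white space allowed after sign
s2 = '123 - 456(( !!+++ 2345 !! -2##'    # hypothetical, shortened sample

matches = list(integer.finditer(s2))
L4 = ['INT_%d' % n for n in range(1, len(matches) + 1)]

# Replace from the last match to the first so that earlier spans stay correct.
for m, new in zip(reversed(matches), reversed(L4)):
    start, end = m.span()
    s2 = s2[:start] + new + s2[end:]

print(s2)   # INT_1 INT_2(( !!++INT_3 !! INT_4##
```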

Python: truly international
Python, Emacs and the Wikiversity editor recognize a very large number of international characters. Some of them look exactly like their English counterparts:

A few well chosen international characters can simplify the creation of a complicated regular expression. Let's revisit the matching of floats.

Simplify the pattern?
Under "Compiling regular expressions" above the expression for integer is: Why not simplify the expression and use:

Because this expression produces the following matches:

Matching a float?
The reference offers regular expression  to match a float. Does this expression provide a good match? Yes, but it also matches

Floats with extra zeroes
In the section "Floats" above there are several examples of regular expressions that match floats. However, they do not consider the possibility of extra leading and trailing zeroes.

How would you rewrite the expressions to remove unnecessary zeroes?
