Python Programming/RegEx

This lesson introduces Python regular expression processing.

Objectives and Skills
Objectives and skills for this lesson include:
 * Standard Library
 * Regular expression operations

Readings

 * 1) Regular expression
 * 2) Python for Everyone: Regular expressions

Multimedia

 * 1) YouTube: Python for Informatics - Chapter 11 - Regular Expressions
 * 2) YouTube: Python - Regular Expressions
 * 3) YouTube: Python3 - Regular Expressions

The match Method
The match method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.
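A minimal sketch of the match method follows. The sample string and pattern are assumptions, chosen so that the reported positions agree with the output shown for this section; note that group() returns the whole match, tags included.

```python
import re

# Assumed sample data for illustration only.
string = "<p>HTML text.</p>"
pattern = "<p>.*</p>"

# re.match succeeds only if the pattern matches starting at position 0.
match = re.match(pattern, string)
if match:
    print("start:", match.start(), "end:", match.end(), "group:", match.group())
else:
    print("No match")
```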

Output: start: 0 end: 17 group: HTML text.

The search Method
The search method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.
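A minimal sketch of the search method follows. The sample string and pattern are assumptions. The same pattern is also passed to match to show the difference: match fails here because the paragraph does not begin at position 0, while search finds it later in the string.

```python
import re

# Assumed sample data for illustration only.
string = "<h1>Heading</h1><p>HTML text.</p>"
pattern = "<p>.*</p>"

# re.match returns None because the string starts with <h1>, not <p>.
print("match result:", re.match(pattern, string))

# re.search scans forward and returns the first match anywhere in the string.
match = re.search(pattern, string)
if match:
    print("start:", match.start(), "end:", match.end(), "group:", match.group())
else:
    print("No match")
```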

Output: start: 16 end: 33 group: HTML text.

Greedy vs. Non-greedy
The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
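The difference can be sketched by running a greedy and a non-greedy pattern against the same text. The sample string and both patterns are assumptions made for illustration.

```python
import re

# Assumed sample data for illustration only.
string = "<h1>Heading</h1><p>HTML text.</p>"

# Greedy: .* consumes as much as possible, so the match runs to the last '>'.
greedy = re.match("<.*>", string)
print("Greedy start:", greedy.start(), "end:", greedy.end(), "group:", greedy.group())

# Non-greedy: .*? consumes as little as possible, so the match stops at the first '>'.
lazy = re.match("<.*?>", string)
print("Non-greedy start:", lazy.start(), "end:", lazy.end(), "group:", lazy.group())
```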

Output: Greedy start: 0 end: 33 group: Heading HTML text.

Non-greedy start: 0 end: 4 group:

The findall Method
The findall method matches all occurrences of the given regular expression in the string and returns a list of matching strings.
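A minimal sketch of findall follows, assuming a sample string with four HTML tags and a non-greedy tag pattern. Both are illustrative choices.

```python
import re

# Assumed sample data for illustration only.
string = "<h1>Heading</h1><p>HTML text.</p>"

# Each non-overlapping match of the pattern becomes one list item.
matches = re.findall("<.*?>", string)
print("matches:", matches)
```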

Output: matches: [' ', ' ', ' ', ' ']

The sub Method
The sub method replaces every occurrence of a pattern with a string.
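A minimal sketch of sub follows. It assumes the same kind of tagged sample string and replaces every tag with an empty string, which removes the tags.

```python
import re

# Assumed sample data for illustration only.
tagged = "<h1>Heading</h1><p>HTML text.</p>"

# Replace every match of the tag pattern with the empty string.
string = re.sub("<.*?>", "", tagged)
print("string:", string)
```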

Output: string: HeadingHTML text.

The split Method
The split method splits a string at each occurrence of a pattern and returns the pieces as a list.
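A minimal sketch of split follows. The two patterns are assumptions: one treats each value (with its separators) as the delimiter so that keys remain, and the other treats each key as the delimiter so that values remain. An empty string appears in a result wherever the text starts or ends with a delimiter.

```python
import re

string = "cat: Frisky, dog: Spot, fish: Bubbles"

# Delimiter is ": value" plus an optional ", " so the keys are left over.
keys = re.split(r": \w+,? ?", string)

# Delimiter is an optional ", " plus "key: " so the values are left over.
values = re.split(r",? ?\w+: ", string)

print("string:", string)
print("keys:", keys)
print("values:", values)
```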

Output:
string: cat: Frisky, dog: Spot, fish: Bubbles
keys: ['cat', 'dog', 'fish', '']
values: ['', 'Frisky', 'Spot', 'Bubbles']

The compile Method
The compile method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match and search methods. The expression’s behaviour can be modified by specifying a flags value.
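A minimal sketch of compile follows. The sample string, the pattern, and the choice of the re.IGNORECASE flag are assumptions. The compiled object is reused through its own search method, and the flag lets a lowercase pattern match an uppercase tag.

```python
import re

# Assumed sample data for illustration only.
string = "<p>Lines of<BR>HTML text</p>"

# Compile once with a flag; the resulting object can be reused many times.
regex = re.compile("<br>", re.IGNORECASE)

match = regex.search(string)
if match:
    print("start:", match.start(), "end:", match.end(), "group:", match.group())
else:
    print("No match")
```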

Output: start: 11 end: 15 group:

Match Groups
Match groups match whatever regular expression is inside parentheses, and the parentheses indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.
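A minimal sketch of match groups follows. The sample strings and patterns are assumptions, and the named groups (?P<name>...) in the second pattern are one illustrative way to label each captured value.

```python
import re

# Assumed sample data for illustration only.
html = "<p>HTML text.</p>"

# Group 1 captures only the text between the tags.
match = re.match("<p>(.*)</p>", html)
if match:
    print("start:", match.start(1), "end:", match.end(1), "group:", match.group(1))

pets = "cat: Frisky, dog: Spot, fish: Bubbles"
pet_match = re.match(
    r"cat: (?P<cat>\w+), dog: (?P<dog>\w+), fish: (?P<fish>\w+)", pets
)
if pet_match:
    # groups() returns all captured groups as a tuple.
    print("groups:", *pet_match.groups())
    # groupdict() maps each group name to its captured text.
    for key, value in pet_match.groupdict().items():
        print(key + ":", value)
```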

Output:
start: 3 end: 13 group: HTML text.
groups: Frisky Spot Bubbles
cat: Frisky
dog: Spot
fish: Bubbles

Tutorials

 * 1) Complete one or more of the following tutorials:
   * LearnPython: Regular Expressions
   * TutorialsPoint: Regular Expressions
   * RegexOne: Learn Regular Expressions with simple, interactive exercises

Practice

 * 1) Create a Python program that asks the user to enter a line of comma-separated grade scores. Use RegEx methods to parse the line and add each item to a list. Display the list of entered scores sorted in descending order and then calculate and display the high, low, and average for the entered scores. Include try and except to handle input errors.
 * 2) Create a Python program that asks the user for a line of text that contains HTML tags. Use RegEx methods to search for and remove all HTML tags from the text, saving each removed tag in a list. Print the untagged text and then display the list of removed tags sorted in alphabetical order with duplicate tags removed. Include error handling in case an HTML tag isn't entered correctly (an unmatched < or >). Use a user-defined function for the actual string processing, separate from input and output.
 * 3) Create a Python program that asks the user to enter a line of dictionary keys and values in the form Key-1: Value 1, Key-2: Value 2, Key-3: Value 3. You may assume that keys will never contain spaces, but may contain hyphens. Values may contain spaces, but a comma will always separate one key-value pair from the next key-value pair. Use RegEx functions to parse the string and build a dictionary of key-value pairs. Then display the dictionary sorted in alphabetical order by key. Include input validation and error handling in case a user accidentally enters the same key more than once.

RegEx Concepts

 * A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.
 * Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.
 * In regex, | indicates alternation (either/or).
 * In regex, ? indicates there is zero or one of the preceding element.
 * In regex, * indicates there is zero or more of the preceding element.
 * In regex, + indicates there is one or more of the preceding element.
 * In regex, () is used to group elements.
 * In regex, . matches any single character.
 * In regex, [] matches any single character contained within the brackets.
 * In regex, [^] matches any single character not contained within the brackets.
 * In regex, ^ matches the start of the string.
 * In regex, $ matches the end of the string.
 * In regex, \w matches a word character (a letter, digit, or underscore).
 * In regex, \d matches a digit.
 * In regex, \s matches whitespace.
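The metacharacters listed above can be checked with a short table-driven sketch. The patterns and test strings are illustrative assumptions; each row records whether a match is expected.

```python
import re

# (pattern, text, expected_to_match): illustrative cases only.
checks = [
    (r"cat|dog", "hotdog", True),    # | either/or
    (r"colou?r", "color", True),     # ? zero or one
    (r"ab*c", "ac", True),           # * zero or more
    (r"ab+c", "ac", False),          # + one or more
    (r"(ab)+", "ababab", True),      # () grouping
    (r"c.t", "cut", True),           # . any single character
    (r"[aeiou]", "sky", False),      # [] any character in the set
    (r"[^0-9]", "2024", False),      # [^] any character not in the set
    (r"^Py", "Python", True),        # ^ start of string
    (r"on$", "Python", True),        # $ end of string
    (r"\w\d\s", "a1 ", True),        # word character, digit, whitespace
]

results = []
for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    results.append(found == expected)
    print(repr(pattern), "in", repr(text), "->", found)
```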

Python RegEx

 * The Python regular expression library is the re module, accessed using import re.
 * The match method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.
 * The search method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.
 * The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
 * The findall method matches all occurrences of the given regular expression in the string and returns a list of matching strings.
 * The sub method replaces every occurrence of a pattern with a string.
 * The split method splits a string at each occurrence of a pattern and returns the pieces as a list.
 * The compile method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match and search methods. The expression’s behaviour can be modified by specifying a flags value.
 * The compile method flags include re.IGNORECASE for case-insensitive matching, re.MULTILINE so that ^ and $ match at the start and end of each line, and re.DOTALL so that . also matches newline characters.
 * Match groups match whatever regular expression is inside parentheses, and the parentheses indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.
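The effect of the three flags named above can be sketched as follows. The two-line sample text and the patterns are assumptions made for illustration.

```python
import re

# Assumed two-line sample text for illustration only.
text = "First line\nsecond LINE"

# re.IGNORECASE: letters match regardless of case.
any_case = re.compile("line", re.IGNORECASE).findall(text)

# Without re.MULTILINE, ^ matches only at the start of the whole string.
first_only = re.compile(r"^\w+").findall(text)

# With re.MULTILINE, ^ also matches at the start of every line.
each_line = re.compile(r"^\w+", re.MULTILINE).findall(text)

# Without re.DOTALL, . does not match the newline, so this search fails.
no_dotall = re.compile("First.*LINE").search(text)

# With re.DOTALL, . matches the newline and the search spans both lines.
with_dotall = re.compile("First.*LINE", re.DOTALL).search(text)

print("IGNORECASE:", any_case)
print("no MULTILINE:", first_only)
print("MULTILINE:", each_line)
print("no DOTALL:", no_dotall)
print("DOTALL:", with_dotall is not None)
```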

Key Terms

 * brittle code
 * Code that works when the input data is in a particular format but is prone to breakage if there is some deviation from the correct format. We call this “brittle code” because it is easily broken.


 * greedy matching
 * The notion that the “+” and “*” characters in a regular expression expand outward to match the largest possible string.


 * grep
 * A command available in most Unix systems that searches through text files looking for lines that match regular expressions. The command name comes from the ed editor command g/re/p, which globally searches for a regular expression and prints the matching lines.


 * regular expression
 * A language for expressing more complex search strings. A regular expression may contain special characters that indicate that a search only matches at the beginning or end of a line or many other similar capabilities.


 * wild card
 * A special character that matches any character. In regular expressions the wild-card character is the period.

Assessments

 * Flashcards: Quizlet: Python Regular Expressions
 * Quiz: Quizlet: Python Regular Expressions