Advanced Features

Alternations

Let’s write a regular expression that matches valid 24-hour times.

We could start by matching two digits followed by a colon and two more digits:

>>> import re
>>> re.search(r'\d{2}:\d{2}', "00:13")
<_sre.SRE_Match object; span=(0, 5), match='00:13'>

This does match all valid 24-hour times, but it also matches a lot of invalid times:

>>> re.search(r'\d{2}:\d{2}', "24:13")
<_sre.SRE_Match object; span=(0, 5), match='24:13'>
>>> re.search(r'\d{2}:\d{2}', "24:60")
<_sre.SRE_Match object; span=(0, 5), match='24:60'>

Let’s fix the minutes first. We only want to match minutes where the first digit is 0 through 5. The second digit can be anything:

>>> re.search(r'\d{2}:[0-5]\d', "24:60")
>>> re.search(r'\d{2}:[0-5]\d', "24:13")
<_sre.SRE_Match object; span=(0, 5), match='24:13'>

Now let’s try fixing the hours. We definitely want to narrow our hours down to starting with 0, 1, or 2. This isn’t quite enough though:

>>> re.search(r'[0-2]\d:[0-5]\d', "23:00")
<_sre.SRE_Match object; span=(0, 5), match='23:00'>
>>> re.search(r'[0-2]\d:[0-5]\d', "33:00")
>>> re.search(r'[0-2]\d:[0-5]\d', "24:00")
<_sre.SRE_Match object; span=(0, 5), match='24:00'>

What we really need is a way to combine these two regular expressions:

>>> re.search(r'[01]\d:[0-5]\d', "24:00")
>>> re.search(r'2[0-3]:[0-5]\d', "24:00")

We can combine regular expressions in Python by using the | operator to provide alternatives:

>>> re.search(r'[01]\d:[0-5]\d|2[0-3]:[0-5]\d', "24:00")

Character classes allow us to provide multiple options for a single character match.

The | character allows us to give multiple options for a collection of characters.

We can also use | in groups, so we could simplify that regular expression even further:

>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "24:00")
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:00")
<_sre.SRE_Match object; span=(0, 5), match='23:00'>
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:59")
<_sre.SRE_Match object; span=(0, 5), match='23:59'>
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:60")
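
One detail matters once we start anchoring a pattern like this: | has the lowest precedence of any regular expression operator, so ^ and $ attach to a single alternative unless the alternatives are grouped. The following is a minimal sketch of that pitfall. The names UNGROUPED, GROUPED, HOUR_MINUTE, and is_24_hour_time are made up for illustration, and the sketch uses a non-capturing group (?:...) and re.fullmatch, neither of which appears in the examples above.

```python
import re

# Without a group, ^ applies only to the first alternative
# and $ applies only to the second one.
UNGROUPED = re.compile(r'^[01]\d:[0-5]\d|2[0-3]:[0-5]\d$')

# With a (non-capturing) group, the alternation covers only the hour part.
GROUPED = re.compile(r'^(?:[01]\d|2[0-3]):[0-5]\d$')

# Hypothetical helper: fullmatch requires the whole string to match,
# so no ^ or $ is needed in the pattern itself.
HOUR_MINUTE = r'(?:[01]\d|2[0-3]):[0-5]\d'


def is_24_hour_time(text):
    """Return True if text is exactly a valid HH:MM 24-hour time."""
    return re.fullmatch(HOUR_MINUTE, text) is not None


print(bool(UNGROUPED.search("09:30 and more")))  # True
print(bool(GROUPED.search("09:30 and more")))    # False
print(is_24_hour_time("23:59"))                  # True
print(is_24_hour_time("24:00"))                  # False
```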

Split

Let’s say we have a string of values that are delimited by commas with optional spaces after them for readability. For example:

>>> row = "column 1,column 2, column 3"

We could do something like this to match values separated by a comma and optional spaces:

>>> row = "column 1,column 2, column 3"
>>> re.findall(r'(.*),\s*', row)
['column 1,column 2']

That doesn’t work because that . is matching everything including commas. Let’s match non-commas:

>>> re.findall(r'([^,]*),\s*', row)
['column 1', 'column 2']

This doesn’t match column 3 because there’s no comma after it. We can use alternation to match either a comma and optional spaces or the end of the string, which gives us pretty much what we want (apart from an extra empty match at the end):

>>> re.findall(r'([^,]*)(?:,\s*|$)', row)
['column 1', 'column 2', 'column 3', '']

But there’s a simpler way to do this.

Python’s re module has a split function we can use to split a string based on a delimiter specified by a regular expression:

>>> re.split(r',\s*', row)
['column 1', 'column 2', 'column 3']

That’s a lot easier to read.

Note that this is different from regular string splitting because we’re defining a regular expression:

>>> row.split(', ')
['column 1,column 2', 'column 3']
>>> row.split(',')
['column 1', 'column 2', ' column 3']
>>> row.split(r',\s*')
['column 1,column 2, column 3']
>>> re.split(r',\s*', row)
['column 1', 'column 2', 'column 3']
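
re.split also accepts a maxsplit argument, and it behaves differently when the pattern contains a capturing group. The sketch below illustrates both; the variable names are arbitrary and the row string is the one used above.

```python
import re

row = "column 1,column 2, column 3"

# maxsplit limits the number of splits; the remainder is left intact.
first_and_rest = re.split(r',\s*', row, maxsplit=1)
print(first_and_rest)  # ['column 1', 'column 2, column 3']

# A capturing group in the pattern makes split return the delimiters too.
with_delimiters = re.split(r'(,\s*)', row)
print(with_delimiters)  # ['column 1', ',', 'column 2', ', ', 'column 3']

# Allowing spaces on both sides of the comma handles input like "a , b,c".
trimmed = re.split(r'\s*,\s*', "a , b,c")
print(trimmed)  # ['a', 'b', 'c']
```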

Compiled

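Repeating the same regular expression string in every call can make code harder to read, and it adds a small lookup cost on each call (the re module caches recently used patterns, so the pattern is not recompiled every time, but it still has to be looked up).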

Python’s re module has a compile function that allows us to compile a regular expression for later use.

We could use it for searching, even searching multiple times:

>>> TIME_RE = re.compile(r'^([01]\d|2[0-3]):[0-5]\d$')
>>> TIME_RE.search("00:00")
<_sre.SRE_Match object; span=(0, 5), match='00:00'>
>>> TIME_RE.search("00:90")
>>> TIME_RE.search("23:59")
<_sre.SRE_Match object; span=(0, 5), match='23:59'>
>>> TIME_RE.search("29:00")

We can also use it for splitting:

>>> row = "column 1,column 2, column 3"
>>> COMMA_RE = re.compile(r',\s*')
>>> COMMA_RE.split(row)
['column 1', 'column 2', 'column 3']

The object returned from re.compile represents a compiled regular expression pattern:

>>> TIME_RE
re.compile('^([01]\\d|2[0-3]):[0-5]\\d$')
>>> COMMA_RE
re.compile(',\\s*')
>>> type(TIME_RE)
<class '_sre.SRE_Pattern'>

Pretty much all of the regular expression functions in the re module have an equivalent method on this compiled regular expression object.
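
As a quick illustration of that equivalence, the sketch below calls a few module-level functions and the matching methods on a compiled pattern and checks that the results agree. It reuses COMMA_RE and row from above; the ';' replacement string is just an example.

```python
import re

COMMA_RE = re.compile(r',\s*')
row = "column 1,column 2, column 3"

# Each pair calls the module-level function and the matching method.
assert re.split(r',\s*', row) == COMMA_RE.split(row)
assert re.findall(r',\s*', row) == COMMA_RE.findall(row)
assert re.sub(r',\s*', ';', row) == COMMA_RE.sub(';', row)

print(COMMA_RE.findall(row))   # [',', ', ']
print(COMMA_RE.sub(';', row))  # column 1;column 2;column 3

# The original pattern string is still available on the compiled object.
print(COMMA_RE.pattern)        # ,\s*
```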

Greediness

What if we want to match all quoted phrases in a string?

We could do something like this:

>>> re.search(r'".*"', 'Maya Angelou said "nothing will work unless you do"')
<_sre.SRE_Match object; span=(18, 51), match='"nothing will work unless you do"'>

This works, but it matches too much when there are multiple quoted phrases:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'"(.*)"', sentence)
['why?" and I say "I don\'t know']

The problem is that repetition operators are greedy by default.

Whenever we use the *, +, ?, or {n,m} operators to repeat something, the regular expression engine repeats the match as many times as possible and backtracks to fewer repetitions only when the rest of the pattern fails to match.

For example:

>>> re.findall('hi*', 'hiiiii')
['hiiiii']
>>> re.findall('hi?', 'hiiiii')
['hi']
>>> re.findall('hi+', 'hiiiii')
['hiiiii']
>>> re.findall('hi{2,}', 'hiiiii')
['hiiiii']
>>> re.findall('hi{1,3}', 'hiiiii')
['hiii']

We can make each of these operators non-greedy by putting a question mark after it:

>>> re.findall('hi*?', 'hiiiii')
['h']
>>> re.findall('hi??', 'hiiiii')
['h']
>>> re.findall('hi+?', 'hiiiii')
['hi']
>>> re.findall('hi{2,}?', 'hiiiii')
['hii']
>>> re.findall('hi{1,3}?', 'hiiiii')
['hi']

That ? might seem a little confusing since we already use a ? to match something 0 or 1 times. This ? is different though: we’re using it to modify these repetitions to be non-greedy so they match as few times as possible.

Let’s use a non-greedy pattern to match only until the next quote character:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'"(.*?)"', sentence)
['why?', "I don't know"]
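
A common alternative to a non-greedy repetition is a negated character class that cannot run past the closing quote. The sketch below compares the two on the sentence above; the multiline string is an invented example that shows one place where they differ.

```python
import re

sentence = """You said "why?" and I say "I don't know"."""

lazy = re.findall(r'"(.*?)"', sentence)
negated = re.findall(r'"([^"]*)"', sentence)

print(lazy)     # ['why?', "I don't know"]
print(negated)  # ['why?', "I don't know"]

# The two differ on multi-line input: by default . does not match a
# newline, while [^"] matches any character except a double quote.
multiline = 'He said "one\ntwo" today'
print(re.findall(r'"(.*?)"', multiline))    # []
print(re.findall(r'"([^"]*)"', multiline))  # ['one\ntwo']
```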

More Regular Expression Exercises

Decimal Numbers

Write a function to match decimal numbers.

We want to allow an optional - and we want to match numbers with or without one decimal point.

Tip

Modify the is_number function in the validation module.

Example usage:

>>> is_number("5")
True
>>> is_number("5.")
True
>>> is_number(".5.")
False
>>> is_number(".5")
True
>>> is_number("01.5")
True
>>> is_number("-123.859")
True
>>> is_number("-123.859.")
False
>>> is_number(".")
False

Abbreviate

Make a function that creates an acronym from a phrase.

Tip

Modify the abbreviate function in the search module.

Example usage:

>>> abbreviate('Graphics Interchange Format')
'GIF'
>>> abbreviate('frequently asked questions')
'FAQ'
>>> abbreviate('cascading style sheets')
'CSS'
>>> abbreviate('Joint Photographic Experts Group')
'JPEG'
>>> abbreviate('content management system')
'CMS'
>>> abbreviate('JavaScript Object Notation')
'JSON'
>>> abbreviate('HyperText Markup Language')
'HTML'

Hex Colors

Write a function to match hexadecimal color codes. Hex color codes consist of an octothorpe symbol (#) followed by either 3 or 6 hexadecimal digits (that’s 0 to 9 or a to f, in upper or lower case).

Tip

Modify the is_hex_color function in the validation module.

Example usage:

>>> is_hex_color("#639")
True
>>> is_hex_color("#6349")
False
>>> is_hex_color("#63459")
False
>>> is_hex_color("#634569")
True
>>> is_hex_color("#663399")
True
>>> is_hex_color("#000000")
True
>>> is_hex_color("#00")
False
>>> is_hex_color("#FFffFF")
True
>>> is_hex_color("#decaff")
True
>>> is_hex_color("#decafz")
False

Valid Date

Create a function that returns True if given a date in YYYY-MM-DD format.

For this exercise we’re more worried about accepting valid dates than we are about excluding invalid dates.

A regular expression is often used as a first wave of validation. Complete validation of dates should be done in our code outside of regular expressions.
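
To illustrate that division of labor, here is a minimal sketch in which a deliberately loose regular expression checks only the shape of the string and datetime.date does the real calendar validation. The names DATE_SHAPE_RE and parse_date are hypothetical and are not part of the validation module; the pattern is also looser than the one this exercise asks for.

```python
import re
from datetime import date

# Loose shape check only: four digits, two digits, two digits.
DATE_SHAPE_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')


def parse_date(text):
    """Return a date for a real YYYY-MM-DD calendar date, or None."""
    match = DATE_SHAPE_RE.fullmatch(text)
    if match is None:
        return None  # wrong shape, so there is nothing more to check
    year, month, day = (int(group) for group in match.groups())
    try:
        # date() raises ValueError for month 13, February 30, and so on.
        return date(year, month, day)
    except ValueError:
        return None


print(parse_date("2016-02-29"))  # 2016-02-29
print(parse_date("2015-02-29"))  # None
print(parse_date("20-02-20"))    # None
```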

Tip

Create the is_valid_date function in the validation module.

Example usage:

>>> is_valid_date("2016-01-02")
True
>>> is_valid_date("1900-01-01")
True
>>> is_valid_date("2016-02-99")
False
>>> is_valid_date("20-02-20")
False
>>> is_valid_date("1980-30-05")
False