The Basics¶

Regular expressions are a mini programming language used for searching through text.

You can use regular expressions to:

validate text
search for things in text
normalize text

Today we’re going to learn how to use regular expressions to validate text and to search for things in text.

Raw Strings¶

Python strings use backslashes as escape characters:

>>> file_name = "C:\projects\nathan"
>>> file_name
'C:\\projects\nathan'
>>> print(file_name)
C:\projects
athan

That \n represents a newline character. If we want to use a literal \ followed by a literal n we have to escape the backslash with another backslash.

>>> file_name = "C:\\projects\\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan

We can turn off character escaping completely by using “raw” strings, which you can make by prefixing your string with an r character:

>>> file_name = r"C:\projects\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan

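When making regular expressions it’s a very good idea to always use raw strings. This is because some of the special sequences that regular expressions use are also escape sequences in ordinary Python strings. For example, \b means a backspace character in a normal string but a word boundary in a regular expression.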

Because of this, we will exclusively use raw strings when creating regular expressions.
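One quick way to see what the r prefix changes is to compare string lengths. This is only a small sketch, and the variable names are made up for illustration:

```python
# "\n" is one character (a newline); r"\n" is two (a backslash and an n).
escaped = "\n"
raw = r"\n"

print(len(escaped))  # 1
print(len(raw))      # 2

# A raw string gives the same result as doubling every backslash by hand.
same = r"C:\projects\nathan" == "C:\\projects\\nathan"
print(same)  # True
```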

Searching¶

The re module in Python’s standard library includes tools for using regular expressions.

Let’s import the re module like this:

>>> import re

Let’s make a string:

>>> greeting = "hello world"

We’re going to use this string for testing our regular expressions.

Let’s ask whether our string includes the letter x. We can use the search function for this:

>>> re.search(r"x", greeting)

Nothing was printed. That’s because None was returned, and the Python REPL doesn’t display None:

>>> print(re.search(r"x", greeting))
None

This means greeting does not include the letter x.

Let’s try a string that does have the letter x:

>>> re.search(r"x", "exit")
<_sre.SRE_Match object; span=(1, 2), match='x'>

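We got a match object back. That means we got a match! (On Python 3.7 and later the same object is displayed as <re.Match object; span=(1, 2), match='x'>.)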

This first example wasn’t particularly interesting because we could do this same thing with the in operator on strings:

>>> 'x' in greeting
False
>>> 'x' in 'exit'
True
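One thing a match object gives us that the in operator does not is the location and text of the match. The sketch below uses the standard group, start, end, and span methods of match objects; the variable name match is just for illustration:

```python
import re

match = re.search(r"x", "exit")

print(match.group())  # the matched text: x
print(match.start())  # index where the match begins: 1
print(match.end())    # index just past the match: 2
print(match.span())   # start and end together: (1, 2)
```

The span here is the same pair of numbers shown in the match object's printed form.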

Character Class¶

Let’s say we want to ask whether our string includes a vowel.

Without regular expressions this could get quite verbose:

>>> 'a' in greeting or 'e' in greeting or 'i' in greeting or 'o' in greeting or 'u' in greeting
True

We could make that shorter with any and a generator expression, but this is still a little verbose:

>>> any(c in greeting for c in 'aeiou')
True

With our regular expression search function, we can do this:

>>> re.search(r'[aeiou]', greeting)
<_sre.SRE_Match object; span=(1, 2), match='e'>

If we provide a word without vowels (note that we’re not counting y as a vowel here) we’ll get None:

>>> re.search(r'[aeiou]', 'rhythm')

This is called a character class. We can make character classes with square brackets. A character class matches any single character we put inside it.

We could match any digit like this:

>>> re.search(r'[0123456789]', 'rhythm')
>>> re.search(r'[0123456789]', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>

Character classes also support ranges of characters. We can denote a range of characters with a -:

>>> re.search(r'[0-9]', 'rhythm')
>>> re.search(r'[0-9]', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>

Ranges allow us to match a number of ASCII-betically consecutive characters (this means it’s like looking at an ASCII table and fetching every character between two others).

Ranges can get pretty advanced because of that, but you’ll usually only see ranges for digits and uppercase or lowercase characters:

>>> re.search(r'[0-9]', greeting)
>>> re.search(r'[a-z]', greeting)
<_sre.SRE_Match object; span=(0, 1), match='h'>
>>> re.search(r'[A-Z]', greeting)

Note that you can put multiple ranges in a character class and you can even mix and match ranges and other characters in character classes.

This matches a letter, digit, or underscore character:

>>> re.search(r'[a-zA-Z0-9_]', greeting)
<_sre.SRE_Match object; span=(0, 1), match='h'>
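Because ranges follow ASCII order, a single range that tries to cover both cases picks up more than letters. The following sketch (variable names are illustrative) shows why two separate ranges are probably what you want:

```python
import re

# In ASCII, 'Z' is 90 and 'a' is 97, so the range A-z also covers
# the six characters in between: [ \ ] ^ _ `
print(ord("Z"), ord("_"), ord("a"))  # 90 95 97

sloppy = re.search(r"[A-z]", "_")     # matches, probably by accident
strict = re.search(r"[A-Za-z]", "_")  # no match, so this is None

print(sloppy is not None)  # True
print(strict is None)      # True
```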

We can also invert a character class by starting it with a ^ (a caret):

>>> re.search(r'[^0-9]', 'rhythm')
<_sre.SRE_Match object; span=(0, 1), match='r'>
>>> re.search(r'[^0-9]', '$100')
<_sre.SRE_Match object; span=(0, 1), match='$'>

Why did that second one match?

We’re asking whether our string includes a non-digit character. That dollar sign is a non-digit character.

If we remove the dollar sign, we won’t get a match:

>>> re.search(r'[^0-9]', '100')

Anchors¶

What if we want to match strings that start with an a character?

So far we haven’t seen a way to match at the start of a string. We only know how to look for any characters in a string:

>>> re.search(r'a', 'hiya')
<_sre.SRE_Match object; span=(3, 4), match='a'>

We can use ^ to match the beginning of a string:

>>> re.search(r'^a', 'hiya')
>>> re.search(r'^a', 'abcd')
<_sre.SRE_Match object; span=(0, 1), match='a'>

Notice that ^ doesn’t actually match a character. This is an anchor character. It matches a location, not a character.

The other popular anchor character is $, which matches the end of the string (or just before a newline at the very end of the string):

>>> re.search(r'a$', 'hiya')
<_sre.SRE_Match object; span=(3, 4), match='a'>
>>> re.search(r'a$', 'abcd')
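The two anchors can be combined so that the pattern has to account for the entire string. A minimal sketch, where whole_string_a is a hypothetical name chosen for this example:

```python
import re

whole_string_a = r"^a$"  # hypothetical name: the string is exactly one a

for text in ["a", "aa", "ba", "ab"]:
    found = re.search(whole_string_a, text) is not None
    print(repr(text), found)

# 'a' True
# 'aa' False
# 'ba' False
# 'ab' False
```

Anchoring at both ends like this is the usual starting point for validation, since it rules out extra characters before or after the part we care about.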

Metacharacters¶

Most characters in a regular expression just match themselves. Metacharacters are characters that have a special meaning.

So far we’ve seen that square brackets ([ and ]), caret (^), and dollar sign ($) have special meaning. These are all metacharacters.

If you want to represent a metacharacter literally, you can use a backslash to escape the character:

>>> re.search(r"\[hello\]", "h")
>>> re.search(r"\[hello\]", "[hello]")
<_sre.SRE_Match object; span=(0, 7), match='[hello]'>

If we want to match a single dollar sign, we’ll want to escape it like this:

>>> re.search(r"$", "100")
<_sre.SRE_Match object; span=(3, 3), match=''>
>>> re.search(r"\$", "100")
>>> re.search(r"\$", "$100")
<_sre.SRE_Match object; span=(0, 1), match='$'>

You can find a list of regular expression metacharacters in the documentation.
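When the literal text we want to find comes from somewhere else, escaping it by hand is error-prone. The standard library's re.escape function adds the backslashes for us. The sketch below assumes a made-up price string:

```python
import re

price = "$100"              # literal text we want to find
pattern = re.escape(price)  # backslash-escapes the $ for us

print(pattern)  # \$100

found = re.search(pattern, "cost: $100")
missing = re.search(pattern, "cost: 100")

print(found.span())  # (6, 10)
print(missing)       # None
```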

One of the most common metacharacters is . (a period). This matches any single character (except for a newline character by default).

>>> re.search(r'.', greeting)
<_sre.SRE_Match object; span=(0, 1), match='h'>
>>> re.search(r'.', 'a')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search(r'.', '')

We can use this to match any three-character sequence that starts with an a and ends with a z:

>>> re.search(r'a.z', 'abz')
<_sre.SRE_Match object; span=(0, 3), match='abz'>
>>> re.search(r'a.z', 'wa zo')
<_sre.SRE_Match object; span=(1, 4), match='a z'>
>>> re.search(r'a.z', 'wazo')
>>> re.search(r'a.z', 'wa  zo')
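The "except for a newline" rule can be switched off with the re.DOTALL flag from the standard library. A short sketch with illustrative variable names:

```python
import re

text = "a\nz"  # an a, a newline, then a z

default = re.search(r"a.z", text)           # . will not match the newline
dotall = re.search(r"a.z", text, re.DOTALL) # with DOTALL, . matches anything

print(default)        # None
print(dotall.span())  # (0, 3)
```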

Quantifiers¶

What if we want to match any string that starts with an a and ends with a z?

We haven’t learned a way to do this so far. The problem is we need to match strings that are any number of characters long:

>>> re.search(r'^az$', 'abz')
>>> re.search(r'^a.z$', 'abz')
<_sre.SRE_Match object; span=(0, 3), match='abz'>
>>> re.search(r'^a..z$', 'abz')
>>> re.search(r'^a...z$', 'abz')

We can use * for this:

>>> re.search(r'^a.*z$', 'abz')
<_sre.SRE_Match object; span=(0, 3), match='abz'>
>>> re.search(r'^a.*z$', 'az')
<_sre.SRE_Match object; span=(0, 2), match='az'>
>>> re.search(r'^a.*z$', 'a and z')
<_sre.SRE_Match object; span=(0, 7), match='a and z'>
>>> re.search(r'^a.*z$', 'a and c')

The * character makes the item just before it (the . character in this case) match 0 or more times.

So this matches strings that consist of exclusively digit characters (it also matches the empty string):

>>> re.search(r'^[0-9]*$', greeting)
>>> re.search(r'^[0-9]*$', '$100')
>>> re.search(r'^[0-9]*$', '100')
<_sre.SRE_Match object; span=(0, 3), match='100'>
>>> re.search(r'^[0-9]*$', '')
<_sre.SRE_Match object; span=(0, 0), match=''>

This kind of metacharacter is often called a quantifier or modifier character. Instead of matching a character, it modifies the match before it.

Here’s another quantifier character:

>>> re.search(r'^[a-z]+$', greeting)
>>> re.search(r'^[a-z]+$', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^[a-z]+$', '')

The + character modifies the previous match to match 1 or more times. Unlike * this cannot match zero times.

There’s also ? which matches zero or 1 times. We can use this for matching something that’s optional. For example we could look for the word color spelled with or without a “u”:

>>> re.search(r'colou?r', 'what a nice color')
<_sre.SRE_Match object; span=(12, 17), match='color'>
>>> re.search(r'colou?r', 'what a nice colour')
<_sre.SRE_Match object; span=(12, 18), match='colour'>
>>> re.search(r'colou?r', 'what a nice shade')
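To compare the three quantifiers side by side, we can run the same small word list through each one. This is only a sketch; matching_words is a hypothetical helper written for this comparison:

```python
import re

def matching_words(pattern, words):
    """Return the words that the pattern matches (illustrative helper)."""
    return [word for word in words if re.search(pattern, word)]

words = ["ac", "abc", "abbc"]

print(matching_words(r"^ab*c$", words))  # ['ac', 'abc', 'abbc']
print(matching_words(r"^ab+c$", words))  # ['abc', 'abbc']
print(matching_words(r"^ab?c$", words))  # ['ac', 'abc']
```

Each quantifier applies only to the b immediately before it, so the a and c are still required exactly once.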

Validation Exercises¶

Hint

Match objects are always “truthy” and None is always “falsey”. Truthy means that when you convert something to a boolean, it’ll be True.

You can convert the result of re.search to a boolean to get True or False for a match or non-match like this:

>>> bool(re.search(r'hello', greeting))
True
>>> bool(re.search(r'hi', greeting))
False
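Because of this truthiness, the result of re.search can also be used directly in an if statement without calling bool. A minimal sketch, where describe is a hypothetical helper and not part of the validation module:

```python
import re

def describe(pattern, text):
    """Report whether pattern occurs in text (illustrative helper)."""
    if re.search(pattern, text):  # a match object is truthy, None is falsey
        return "match"
    return "no match"

print(describe(r"hello", "hello world"))  # match
print(describe(r"hi", "hello world"))     # no match
```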

Has Vowels¶

Create a function has_vowel that accepts a string and returns True if the string contains a vowel (a, e, i, o, or u) and False otherwise.

Tip

Modify the has_vowel function in the validation module.

Your function should work like this:

>>> has_vowel("rhythm")
False
>>> has_vowel("exit")
True

Is Integer¶

Create a function is_integer that accepts a string and returns True if the string represents an integer.

By our definition, an integer:

Consists of 1 or more digits
May optionally begin with -
Does not contain any other non-digit characters.

Tip

Modify the is_integer function in the validation module.

Your function should work like this:

>>> is_integer("")
False
>>> is_integer(" 5")
False
>>> is_integer("5000")
True
>>> is_integer("-999")
True
>>> is_integer("+999")
False
>>> is_integer("00")
True
>>> is_integer("0.0")
False

Is Fraction¶

Create a function is_fraction that accepts a string and returns True if the string represents a fraction.

By our definition a fraction consists of:

An optional - character
Followed by 1 or more digits
Followed by a /
Followed by 1 or more digits, at least one of which is non-zero (the denominator cannot be the number 0).

Tip

Modify the is_fraction function in the validation module.

Your function should work like this:

>>> is_fraction("")
False
>>> is_fraction("5000")
False
>>> is_fraction("-999/1")
True
>>> is_fraction("+999/1")
False
>>> is_fraction("00/1")
True
>>> is_fraction("/5")
False
>>> is_fraction("5/0")
False
>>> is_fraction("5/010")
True
>>> is_fraction("5/105")
True
>>> is_fraction("5 / 1")
False