Substitutions

Basic Substitution

Let’s say we have some text written by a LaTeX user who uses two backticks (``) to represent a left double quote and two apostrophes ('') to represent a right double quote.

>>> sentence = "This string uses ``smart'' quotes."

We want to convert all of these sets of double backticks and double apostrophes to double quote characters.

We could do something like this:

>>> sentence.replace("``", '"').replace("''", '"')
'This string uses "smart" quotes.'

But we can also use regular expressions to accomplish the same task. For this we’ll use the sub function, whose name is short for “substitution”:

>>> re.sub(r"``|''", '"', sentence)
'This string uses "smart" quotes.'

The sub function takes three arguments:

  1. The regular expression to match
  2. The replacement string
  3. The string to operate on
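As a quick sketch of those three arguments (plus the optional count parameter, which caps the number of replacements made):

```python
import re

sentence = "This string uses ``smart'' quotes."

# pattern, replacement, string to operate on
print(re.sub(r"``|''", '"', sentence))

# the optional count argument limits how many matches are replaced
print(re.sub(r"``|''", '"', sentence, count=1))
```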

Normalization

Let’s look at a task that the string replace method isn’t well-suited to.

Let’s make a regular expression that removes spaces after any commas.

We can do this by looking for commas with optional whitespace after them and replacing each match with just a comma:

>>> row = "column 1,column 2, column 3"
>>> re.sub(r',\s*', ',', row)
'column 1,column 2,column 3'
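The same idea extends to whitespace on either side of each comma. As a sketch:

```python
import re

row = "column 1 , column 2,column 3"

# strip optional whitespace on both sides of each comma
print(re.sub(r'\s*,\s*', ',', row))
```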

Using Captures in Substitutions

Let’s say we’re working with a document that was created in the US using MM/DD/YYYY format and we want to convert it to YYYY-MM-DD.

This isn’t just a simple replacement of / with - because the order of the numbers changes.

We can solve this by referencing our capturing groups in substitutions. Each group can be referenced with a backslash and the group number (\N).

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', sentence)
'from 1629-12-22 to 1643-11-14'

These references to capture groups are called back-references.

Using Captures in Patterns

We can actually use back-references in the regular expression pattern also. Let’s look at an example.

Let’s modify our quotation matcher from earlier to search for either double- or single-quoted strings. Let’s try doing it this way:

>>> re.findall(r'["\'](.*?)["\']', "she said 'not really'")
['not really']

This pattern would also match mismatched quotes though:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'["\'](.*?)["\']', sentence)
['why?', 'I don']

We need the end quote to be the same as the beginning quote. We can do this with a back-reference:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'(["\'])(.*?)\1', sentence)
[('"', 'why?'), ('"', "I don't know")]

The result from this isn’t exactly what we wanted. We’re getting both the quote character and the matching quotation.

Unfortunately we can’t make that first group non-capturing, because we need to reference it later in our pattern.

We can retrieve just the quotation by using a list comprehension with findall:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> matches = re.findall(r'(["\'])(.*?)\1', sentence)
>>> [q for _, q in matches]
['why?', "I don't know"]

We could instead use a list comprehension with finditer:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> matches = re.finditer(r'(["\'])(.*?)\1', sentence)
>>> [m.group(2) for m in matches]
['why?', "I don't know"]

Note

We could have also used zip:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> _, matches = zip(*re.findall(r'(["\'])(.*?)\1', sentence))
>>> matches
('why?', "I don't know")

Capture Exercises

Palindromes

Using the dictionary file, find all five-letter palindromes.

Tip

Modify the palindrome5 function in the search module.
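As a sketch of the kind of pattern that could work here (shown against a few sample words rather than the dictionary file, which is part of the exercise setup):

```python
import re

# a five-letter palindrome reads the same backward: 1st==5th, 2nd==4th
PALINDROME5_RE = re.compile(r'^(\w)(\w)\w\2\1$')

for word in ["radar", "level", "dared"]:
    print(word, bool(PALINDROME5_RE.search(word)))
```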

Double Double

Find all words containing a doubled letter that appears twice, with exactly one other letter between the two pairs.

Tip

Modify the double_double function in the search module.

For example, these words should be matched:

  • freebee
  • assessed
  • voodoo
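One possible pattern for this (a sketch using back-references, tried against sample words rather than the dictionary file):

```python
import re

# a doubled letter, one other letter, then the same doubled letter again
DOUBLE_DOUBLE_RE = re.compile(r'(\w)\1\w\1\1')

for word in ["freebee", "assessed", "voodoo", "coffee"]:
    print(word, bool(DOUBLE_DOUBLE_RE.search(word)))
```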

Repetitive Words

Find all words that consist of a sequence of letters repeated twice.

Tip

Modify the repeaters function in the search module.

Examples:

  • tutu
  • cancan
  • murmur
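A back-reference can match the whole first half of the word against the second half. A sketch:

```python
import re

# the whole word is some sequence of letters followed by itself
REPEATER_RE = re.compile(r'^(\w+)\1$')

for word in ["tutu", "cancan", "murmur", "banana"]:
    print(word, bool(REPEATER_RE.search(word)))
```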

Named Capture Groups

Capture groups are neat, but it can be a little confusing to figure out what the group numbers are.

It’s also easy to lose track of which numeric back-reference is which when you’re rearranging them.

Named capture groups can help us here.

Let’s use these on our date substitution:

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> re.sub(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', r'\g<year>-\g<month>-\g<day>', sentence)
'from 1629-12-22 to 1643-11-14'

That syntax is a little weird. The ?P after the opening parenthesis allows us to specify a group name in angle brackets (< ... >). That group name can be referenced later using \g and angle brackets.

We can also use named groups without substitutions.

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> m = re.search(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
>>> m.groups()
('12', '22', '1629')
>>> m.groupdict()
{'month': '12', 'day': '22', 'year': '1629'}

The groups act just like before, but we can also use groupdict to get a dictionary of the named groups.

Unfortunately, re.findall doesn’t act any differently with named groups:

>>> re.findall(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
[('12', '22', '1629'), ('11', '14', '1643')]

We could use re.finditer to get match objects and use groupdict to get the dictionary for each one though:

>>> matches = re.finditer(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
>>> [m.groupdict() for m in matches]
[{'month': '12', 'day': '22', 'year': '1629'}, {'month': '11', 'day': '14', 'year': '1643'}]

Substitution Functions

What if we want to allow our month/day/year substitution to support two-digit years?

As humans we’re pretty good at this conversion, but our code needs some conditional logic to decide which century a two-digit year belongs to.

The sub function actually allows us to specify a function instead of a replacement string. If a function is specified, it’ll be called to create the replacement string for each match.

import re

def replace_date(match):
    month, day, year = match.groups()
    if len(year) == 2:
        # guess the century: 00-49 means 2000s, 50-99 means 1900s
        if year < '50':
            year = '20' + year
        else:
            year = '19' + year
    return '-'.join((year, month, day))

DATE_RE = re.compile(r'\b(\d{2})/(\d{2})/(\d{2}|\d{4})\b')

We can now test this out like this:

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> DATE_RE.sub(replace_date, sentence)
'from 1629-12-22 to 1643-11-14'
>>> DATE_RE.sub(replace_date, "Nevermind (09/24/91) and Lemonade (04/23/16)")
'Nevermind (1991-09-24) and Lemonade (2016-04-23)'

Substitutions don’t usually need functions, but when you need a complex substitution they can come in handy.

Substitution Exercises

Normalize JPEG Extension

Make a function that accepts a JPEG filename and returns a new filename with the extension normalized to lowercase jpg (with no e).

Tip

Modify the normalize_jpeg function in the substitution module.

Hint

Look up how to pass flags to the re.sub function.

Example usage:

>>> normalize_jpeg('avatar.jpeg')
'avatar.jpg'
>>> normalize_jpeg('Avatar.JPEG')
'Avatar.jpg'
>>> normalize_jpeg('AVATAR.Jpg')
'AVATAR.jpg'
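One possible sketch, using the flags parameter the hint refers to:

```python
import re

def normalize_jpeg(filename):
    # case-insensitively replace a trailing .jpeg/.jpg with lowercase .jpg
    return re.sub(r'\.jpe?g$', '.jpg', filename, flags=re.IGNORECASE)

print(normalize_jpeg('Avatar.JPEG'))
```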

Normalize Whitespace

Make a function that replaces all instances of one or more whitespace characters with a single space.

Tip

Modify the normalize_whitespace function in the substitution module.

Example usage:

>>> normalize_whitespace("hello  there")
'hello there'
>>> normalize_whitespace("""Hold fast to dreams
... For if dreams die
... Life is a broken-winged bird
... That cannot fly.
...
... Hold fast to dreams
... For when dreams go
... Life is a barren field
... Frozen with snow.""")
'Hold fast to dreams For if dreams die Life is a broken-winged bird That cannot fly. Hold fast to dreams For when dreams go Life is a barren field Frozen with snow.'
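The whitespace character class with a quantifier handles this in one substitution. A sketch:

```python
import re

def normalize_whitespace(text):
    # collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r'\s+', ' ', text)

print(normalize_whitespace("hello  there"))
```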

Compress blank lines

Write a function that accepts a string and an integer N and compresses runs of N or more consecutive empty lines into just N empty lines.

Tip

Modify the compress_blank_lines function in the substitution module.

Example usage:

>>> compress_blank_lines("a\n\nb", max_blanks=1)
'a\n\nb'
>>> compress_blank_lines("a\n\nb", max_blanks=0)
'ab'
>>> compress_blank_lines("a\n\nb", max_blanks=2)
'a\n\nb'
>>> compress_blank_lines("a\n\n\n\nb\n\n\nc", max_blanks=2)
'a\n\n\nb\n\n\nc'

Normalize URL

I own the domain treyhunner.com. I prefer to link to my website as https://treyhunner.com, but I have some links that use http or use a www subdomain.

Write a function that normalizes all www.treyhunner.com and treyhunner.com links to use HTTPS and remove the www subdomain.

Tip

Modify the normalize_domain function in the substitution module.

Example usage:

>>> normalize_domain("http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/")
'https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/'
>>> normalize_domain("https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/")
'https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/'
>>> normalize_domain("http://www.treyhunner.com/2015/11/counting-things-in-python/")
'https://treyhunner.com/2015/11/counting-things-in-python/'
>>> normalize_domain("http://www.treyhunner.com")
'https://treyhunner.com'
>>> normalize_domain("http://trey.in/give-a-talk")
'http://trey.in/give-a-talk'
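A sketch of one way to do this, matching either scheme and an optional www. subdomain and replacing both with the canonical prefix:

```python
import re

def normalize_domain(url):
    # match http or https, an optional www. subdomain, then the bare domain
    return re.sub(r'https?://(?:www\.)?treyhunner\.com',
                  'https://treyhunner.com', url)

print(normalize_domain("http://www.treyhunner.com"))
```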

Linebreaks

Write a function that accepts a string and converts linebreaks to HTML in the following way:

  • text is wrapped in paragraph tags (<p> and </p>)
  • text separated by two or more line breaks becomes two separate paragraphs
  • text separated by a single line break is joined by a <br>

Tip

Modify the convert_linebreaks function in the substitution module.

Example usage:

>>> convert_linebreaks("hello")
'<p>hello</p>'
>>> convert_linebreaks("hello\nthere")
'<p>hello<br>there</p>'
>>> convert_linebreaks("hello\n\nthere")
'<p>hello</p><p>there</p>'
>>> convert_linebreaks("hello\nthere\n\nworld")
'<p>hello<br>there</p><p>world</p>'
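One possible sketch, splitting on runs of blank lines with re.split rather than using re.sub directly:

```python
import re

def convert_linebreaks(text):
    # two or more newlines separate paragraphs; single newlines become <br>
    paragraphs = re.split(r'\n{2,}', text)
    return ''.join('<p>' + p.replace('\n', '<br>') + '</p>'
                   for p in paragraphs)

print(convert_linebreaks("hello\nthere\n\nworld"))
```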

Lookahead

Let’s make a regular expression that finds all words that appear more than once in a string.

For our purposes, we’ll treat a word as one or more “word” characters surrounded by word boundaries:

>>> sentence = "Oh what a day, what a lovely day!"
>>> re.findall(r'\b\w+\b', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']

To find words that appear twice we could try doing this:

>>> re.findall(r'\b(\w+)\b.*\b\1\b', sentence)
['what']

That finds “what” but not “a” or “day”. The reason is that the match for “what” consumes every character up through the second “what”, including the “a” and “day” between them.

Regular expressions only run through a string one time when searching.

We need a way to check that the word occurs a second time without actually consuming any more characters. For this we can use a lookahead.

>>> re.findall(r'\b(\w+)\b(?=.*\b\1\b)', sentence)
['what', 'a', 'day']

We’ve used a positive lookahead here: the match succeeds only if our word is followed, somewhere later on, by the same word again. The (?=...) doesn’t actually consume any characters though. Let’s talk about what that means.

When we match a character, we consume it, meaning the next match must start after that character. Here we can see that matching a character followed by an x consumes the x as well:

>>> re.findall(r'(.)x', 'axxx')
['a', 'x']

So this repeatedly matches any character followed by an x. Notice that because the x is consumed along with the character before it, a consumed x can’t also serve as the “any character” of the next match.

If we use a lookahead for the letter x instead, the x won’t be consumed, so we’ll properly match every character that’s followed by an x (including the other x's):

>>> re.findall(r'(.)(?=x)', 'axxx')
['a', 'x', 'x']

Note that anchors like ^, $, and \b do not consume characters either.

Negative Lookahead

What if we want to write a regular expression that makes sure our string contains at least two different letters?

>>> re.search(r'[a-z].*[a-z]', 'aa', re.IGNORECASE)
<re.Match object; span=(0, 2), match='aa'>

That doesn’t work because it doesn’t make sure the letters are different.

We need some way to tell the regular expression engine that the second letter should not be the same as the first.

We already know how to write a regular expression that makes sure the two letters are the same:

>>> re.search(r'([a-z]).*\1', 'aa', re.IGNORECASE)
<re.Match object; span=(0, 2), match='aa'>
>>> re.search(r'([a-z]).*\1', 'a a', re.IGNORECASE)
<re.Match object; span=(0, 3), match='a a'>
>>> re.search(r'([a-z]).*\1', 'a b', re.IGNORECASE)

We can use a negative lookahead to make sure the second letter found is different from the first:

>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'aa', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'ab', re.IGNORECASE)
<re.Match object; span=(0, 2), match='ab'>
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a b', re.IGNORECASE)
<re.Match object; span=(0, 3), match='a b'>
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a a', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a ab', re.IGNORECASE)
<re.Match object; span=(0, 4), match='a ab'>

Lookahead Exercises

All Vowels

Find all words that are at most 9 letters long and contain every vowel (a, e, i, o, u) in any order.

Tip

Modify the have_all_vowels function in the lookahead module.
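One possible sketch: a chain of lookaheads, one per vowel, each checked from the start of the word without consuming anything, followed by a length check (shown against sample words rather than the dictionary file):

```python
import re

# each lookahead checks for one vowel; the final \w{1,9} enforces length
ALL_VOWELS_RE = re.compile(
    r'^(?=\w*a)(?=\w*e)(?=\w*i)(?=\w*o)(?=\w*u)\w{1,9}$')

for word in ["education", "sequoia", "questionably"]:
    print(word, bool(ALL_VOWELS_RE.search(word)))
```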

Unique Letters

Find all words that have at least 10 letters and do not have any repeating letters.

Tip

Modify the no_repeats function in the lookahead module.
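A sketch combining a negative lookahead with a back-reference (again shown against sample words):

```python
import re

# the negative lookahead rejects any word where some letter appears twice
NO_REPEATS_RE = re.compile(r'^(?!\w*(\w)\w*\1)\w{10,}$')

for word in ["background", "uncopyrightable", "bookkeeper"]:
    print(word, bool(NO_REPEATS_RE.search(word)))
```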

HTML Encode Ampersands

Replace all & characters that are not part of HTML escape sequences with an HTML-encoded ampersand (&amp;).

Tip

Modify the encode_ampersands function in the lookahead module.

Example usage:

>>> encode_ampersands("This &amp; that & that &#38; this.")
'This &amp; that &amp; that &#38; this.'
>>> encode_ampersands("A&W")
'A&amp;W'
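A sketch of one approach: a negative lookahead that rejects ampersands starting a named or numeric entity:

```python
import re

def encode_ampersands(text):
    # replace & unless it begins an entity like &amp; or &#38;
    return re.sub(r'&(?!\w+;|#\d+;)', '&amp;', text)

print(encode_ampersands("A&W"))
```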

Pig Latin

Create a function that translates English words to Pig Latin.

Tip

Modify the to_pig_latin function in the lookahead module.

Example usage:

>>> to_pig_latin("pig")
'igpay'
>>> to_pig_latin("trust")
'usttray'
>>> to_pig_latin("quack")
'ackquay'
>>> to_pig_latin("squeak")
'eaksquay'
>>> to_pig_latin("enqueue")
'enqueueay'
>>> to_pig_latin("sequoia")
'equoiasay'
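Judging from the examples, the leading consonant cluster (treating qu as a single unit) moves to the end of the word, followed by ay. A sketch of that rule as a capture-group substitution (it doesn’t strictly need a lookahead):

```python
import re

def to_pig_latin(word):
    # move the leading consonant cluster (qu counts as one unit)
    # to the end of the word, then append "ay"
    return re.sub(r'^((?:qu|[^aeiou])*)(\w+)$', r'\2\1ay', word)

print(to_pig_latin("squeak"))
```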

Camel Case to Underscore

Make a function that converts camelCase strings to under_score strings.

Tip

Modify the camel_to_underscore function in the lookahead module.
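No example usage is given for this one, but a sketch assuming word boundaries fall wherever a lowercase letter is followed by an uppercase letter might look like this (a lookbehind plus a lookahead, so nothing is consumed):

```python
import re

def camel_to_underscore(name):
    # insert _ at each lowercase-to-uppercase boundary, then lowercase
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name).lower()

print(camel_to_underscore("camelCase"))
```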