Substitutions¶
Basic Substitution¶
Let’s say we have some text that was written by a LaTeX user who uses two backticks (``) and two apostrophe characters ('') to represent left and right double quotes.
>>> sentence = "This string uses ``smart'' quotes."
We want to convert all of these sets of double backticks and double apostrophes to double quote characters.
We could do something like this:
>>> sentence.replace("``", '"').replace("''", '"')
'This string uses "smart" quotes.'
But we can also use regular expressions to accomplish the same task. For this we’ll use the sub function, whose name stands for “substitution”:
>>> import re
>>> re.sub(r"``|''", '"', sentence)
'This string uses "smart" quotes.'
The sub function takes three arguments:
- The regular expression to match
- The replacement string
- The string to operate on
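Beyond these three arguments, re.sub also accepts an optional count that limits how many replacements are made. A minimal sketch reusing the sentence from above (the names everything and first_only are only for illustration):

```python
import re

sentence = "This string uses ``smart'' quotes."

# With no count, every match is replaced.
everything = re.sub(r"``|''", '"', sentence)

# count=1 stops after the first (leftmost) match.
first_only = re.sub(r"``|''", '"', sentence, count=1)

print(everything)  # This string uses "smart" quotes.
print(first_only)  # This string uses "smart'' quotes.
```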
Normalization¶
Let’s look at a task that wouldn’t have been well-suited to a string replacement.
Let’s make a regular expression that removes spaces after any commas.
We can do this by looking for commas with optional whitespace after them and replacing each match with just a comma:
>>> row = "column 1,column 2, column 3"
>>> re.sub(r',\s*', ',', row)
'column 1,column 2,column 3'
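The same idea extends to whitespace on either side of a comma. This is a small sketch under the assumption that stray whitespace before a comma should be dropped too; the sample row and the variable names are made up for illustration:

```python
import re

row = "column 1 ,column 2,   column 3"

# \s* on both sides removes whitespace before and after each comma.
normalized = re.sub(r'\s*,\s*', ',', row)

# Once normalized, a plain split gives clean column values.
columns = normalized.split(',')

print(normalized)  # column 1,column 2,column 3
print(columns)     # ['column 1', 'column 2', 'column 3']
```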
Using Captures in Substitutions¶
Let’s say we’re working with a document that was created in the US using MM/DD/YYYY format and we want to convert it to YYYY-MM-DD.
This isn’t just a simple replacement of / with - because the order of the numbers changes.
We can solve this by referencing our capturing groups in substitutions. Each group can be referenced with a backslash and the group number (\N).
>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', sentence)
'from 1629-12-22 to 1643-11-14'
These references to capture groups are called back-references.
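One wrinkle with numbered references is a replacement where a digit must directly follow the group. The re module also supports the \g<N> form for that case. A short sketch, using a made-up input string:

```python
import re

text = "item7 and item12"

# Goal: append a literal 0 to each number (7 -> 70, 12 -> 120).
# r'\10' would be read as a reference to group 10, not group 1
# followed by "0", so the unambiguous \g<1> form is used instead.
result = re.sub(r'(\d+)', r'\g<1>0', text)

print(result)  # item70 and item120
```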
Using Captures¶
We can actually use back-references in the regular expression pattern also. Let’s look at an example.
Let’s modify our quotation matcher from earlier to search for either double- or single-quoted strings. Let’s try doing it this way:
>>> re.findall(r'["\'](.*?)["\']', "she said 'not really'")
['not really']
This would also match mismatched quotes though:
>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'["\'](.*?)["\']', sentence)
['why?', 'I don']
We need the end quote to be the same as the beginning quote. We can do this with a backreference:
>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'(["\'])(.*?)\1', sentence)
[('"', 'why?'), ('"', "I don't know")]
The result from this isn’t exactly what we wanted. We’re getting both the quote character and the matching quotation.
Unfortunately we can’t make that first group that matches our quote non-capturing because we need to reference it later in our pattern.
We can retrieve just the quotation by using a list comprehension with findall:
>>> sentence = """You said "why?" and I say "I don't know"."""
>>> matches = re.findall(r'(["\'])(.*?)\1', sentence)
>>> [q for _, q in matches]
['why?', "I don't know"]
We could instead use a list comprehension with finditer:
>>> sentence = """You said "why?" and I say "I don't know"."""
>>> matches = re.finditer(r'(["\'])(.*?)\1', sentence)
>>> [m.group(2) for m in matches]
['why?', "I don't know"]
Note
We could have also used zip:
>>> sentence = """You said "why?" and I say "I don't know"."""
>>> _, matches = zip(*re.findall(r'(["\'])(.*?)\1', sentence))
>>> matches
('why?', "I don't know")
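Back-references in a pattern are useful for more than quotes. As a further illustration (the sample text is invented), this sketch finds words that are accidentally typed twice in a row:

```python
import re

text = "Paris in the the spring is is lovely"

# (\w+) captures a word; \1 then requires that exact text again
# after some whitespace. The \b anchors keep us on whole words.
doubled = re.findall(r'\b(\w+)\s+\1\b', text)

print(doubled)  # ['the', 'is']
```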
Capture Exercises¶
Palindromes¶
Using the dictionary file, find all five-letter palindromes.
Tip
Modify the palindrome5 function in the search module.
Double Double¶
Find all words that contain two double letters (a letter repeated consecutively) with exactly one other letter between them.
Tip
Modify the double_double function in the search module.
For example, these words should be matched:
- freebee
- assessed
- voodoo
Repetitive Words¶
Find all words that consist of a sequence of letters repeated exactly twice.
Tip
Modify the repeaters function in the search module.
Examples:
- tutu
- cancan
- murmur
Named Capture Groups¶
Capture groups are neat but sometimes it can be a little confusing figuring out what the group numbers are.
Sometimes it’s also a little confusing when you’re switching around numeric backreferences and trying to figure out which one is which.
Named capture groups can help us here.
Let’s use these on our date substitution:
>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> re.sub(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', r'\g<year>-\g<month>-\g<day>', sentence)
'from 1629-12-22 to 1643-11-14'
That syntax is a little weird. The ?P after the parenthesis allows us to specify a group name in angle brackets (<...>). That group name can be referenced later using \g and angle brackets.
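Named groups can also be referenced inside the pattern itself, using (?P=name). A small sketch that revisits the quotation matcher from earlier; the group names quote and text and the constant QUOTED_RE are chosen here for illustration:

```python
import re

sentence = """You said "why?" and I say "I don't know"."""

# (?P<quote>...) names the opening quote character;
# (?P=quote) requires that same character again at the end.
QUOTED_RE = re.compile(r'''(?P<quote>["'])(?P<text>.*?)(?P=quote)''')

# group('text') fetches a group by name instead of by number.
quotations = [m.group('text') for m in QUOTED_RE.finditer(sentence)]

print(quotations)  # ['why?', "I don't know"]
```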
We can also use named groups without substitutions.
>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> m = re.search(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
>>> m.groups()
('12', '22', '1629')
>>> m.groupdict()
{'month': '12', 'day': '22', 'year': '1629'}
The groups act just like before, but we can also use groupdict to get a dictionary containing the named groups.
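Individual named groups can be fetched without building the whole dictionary. A brief sketch using the same match; the subscript form assumes Python 3.6 or later:

```python
import re

sentence = "from 12/22/1629 to 11/14/1643"
m = re.search(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)

year = m.group('year')                # look up one group by name
month, day = m.group('month', 'day')  # several names at once give a tuple
same_year = m['year']                 # subscript shorthand (Python 3.6+)

print(year, month, day)  # 1629 12 22
```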
Unfortunately, re.findall doesn’t act any differently with named groups:
>>> re.findall(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
[('12', '22', '1629'), ('11', '14', '1643')]
We could use re.finditer to get match objects and use groupdict to get the dictionary for each one though:
>>> matches = re.finditer(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
>>> [m.groupdict() for m in matches]
[{'month': '12', 'day': '22', 'year': '1629'}, {'month': '11', 'day': '14', 'year': '1643'}]
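Because the group names here happen to match the keyword arguments of datetime.date, each dictionary can be turned into a date object. This is only a sketch of one possible use; NAMED_DATE_RE is a name introduced for this example:

```python
import re
from datetime import date

NAMED_DATE_RE = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})')
sentence = "from 12/22/1629 to 11/14/1643"

# groupdict() values are strings, so convert each to int before
# passing them to date() as year=..., month=..., day=...
dates = [
    date(**{name: int(value) for name, value in m.groupdict().items()})
    for m in NAMED_DATE_RE.finditer(sentence)
]

print(dates)  # [datetime.date(1629, 12, 22), datetime.date(1643, 11, 14)]
```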
Substitution Functions¶
What if we want to allow our month/day/year substitution to support two-digit years?
As humans we are pretty good at knowing how to do this conversion, but we’d need to do some kind of conditional algorithm to determine how to handle the conversion.
The sub function actually allows us to specify a function instead of a replacement string. If a function is specified, it’ll be called to create the replacement string for each match.
def replace_date(match):
    """Return the matched date in YYYY-MM-DD form."""
    month, day, year = match.groups()
    if len(year) == 2:
        # Expand two-digit years: 00-49 -> 20xx, 50-99 -> 19xx.
        if '00' <= year < '50':
            year = '20' + year
        else:
            year = '19' + year
    return '-'.join((year, month, day))

DATE_RE = re.compile(r'\b(\d{2})/(\d{2})/(\d{2}|\d{4})\b')
We can now test this out like this:
>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> DATE_RE.sub(replace_date, sentence)
'from 1629-12-22 to 1643-11-14'
>>> DATE_RE.sub(replace_date, "Nevermind (09/24/91) and Lemonade (04/23/16)")
'Nevermind (1991-09-24) and Lemonade (2016-04-23)'
Substitutions don’t usually need functions, but if you need to do a complex substitution it can come in handy.
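Any computation on the matched text is a candidate for a replacement function. As a smaller illustration (the function name double and the sample string are invented), this sketch doubles every number in a string:

```python
import re

def double(match):
    """Return the matched number doubled, as a string."""
    # match.group() with no arguments is the entire matched text.
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, "3 apples and 10 pears")

print(doubled)  # 6 apples and 20 pears
```

The function must return a string, which is why the result is passed through str.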
Substitution Exercises¶
Normalize JPEG Extension¶
Make a function that accepts a JPEG filename and returns a new filename with the extension normalized to lowercase jpg (without an e).
Tip
Modify the normalize_jpeg function in the substitution module.
Hint
Look up how to pass flags to the re.sub function.
Example usage:
>>> normalize_jpeg('avatar.jpeg')
'avatar.jpg'
>>> normalize_jpeg('Avatar.JPEG')
'Avatar.jpg'
>>> normalize_jpeg('AVATAR.Jpg')
'AVATAR.jpg'
Normalize Whitespace¶
Make a function that replaces all instances of one or more whitespace characters with a single space.
Tip
Modify the normalize_whitespace function in the substitution module.
Example usage:
>>> normalize_whitespace("hello  there")
'hello there'
>>> normalize_whitespace("""Hold fast to dreams
... For if dreams die
... Life is a broken-winged bird
... That cannot fly.
...
... Hold fast to dreams
... For when dreams go
... Life is a barren field
... Frozen with snow.""")
'Hold fast to dreams For if dreams die Life is a broken-winged bird That cannot fly. Hold fast to dreams For when dreams go Life is a barren field Frozen with snow.'
Compress blank lines¶
Write a function that accepts a string and an integer N and compresses runs of N or more consecutive empty lines into just N empty lines.
Tip
Modify the compress_blank_lines function in the substitution module.
Example usage:
>>> compress_blank_lines("a\n\nb", max_blanks=1)
'a\n\nb'
>>> compress_blank_lines("a\n\nb", max_blanks=0)
'ab'
>>> compress_blank_lines("a\n\nb", max_blanks=2)
'a\n\nb'
>>> compress_blank_lines("a\n\n\n\nb\n\n\nc", max_blanks=2)
'a\n\n\nb\n\n\nc'
Normalize URL¶
I own the domain treyhunner.com. I prefer to link to my website as https://treyhunner.com, but I have some links that use http or use a www subdomain.
Write a function that normalizes all www.treyhunner.com and treyhunner.com links to use HTTPS and remove the www subdomain.
Tip
Modify the normalize_domain function in the substitution module.
Example usage:
>>> normalize_domain("http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/")
'https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/'
>>> normalize_domain("https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/")
'https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/'
>>> normalize_domain("http://www.treyhunner.com/2015/11/counting-things-in-python/")
'https://treyhunner.com/2015/11/counting-things-in-python/'
>>> normalize_domain("http://www.treyhunner.com")
'https://treyhunner.com'
>>> normalize_domain("http://trey.in/give-a-talk")
'http://trey.in/give-a-talk'
Linebreaks¶
Write a function that accepts a string and converts linebreaks to HTML in the following way:
- text is surrounded by paragraphs
- text with two or more line breaks between is considered two separate paragraphs
- text with a single line break between is separated by a <br>
Tip
Modify the convert_linebreaks function in the substitution module.
Example usage:
>>> convert_linebreaks("hello")
'<p>hello</p>'
>>> convert_linebreaks("hello\nthere")
'<p>hello<br>there</p>'
>>> convert_linebreaks("hello\n\nthere")
'<p>hello</p><p>there</p>'
>>> convert_linebreaks("hello\nthere\n\nworld")
'<p>hello<br>there</p><p>world</p>'
Lookahead¶
Let’s make a regular expression that finds all words that appear more than once in a string.
For our purposes, we’ll treat a word as one or more “word” characters surrounded by word breaks:
>>> sentence = "Oh what a day, what a lovely day!"
>>> re.findall(r'\b\w+\b', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']
To find words that appear twice we could try doing this:
>>> re.findall(r'\b(\w+)\b.*\b\1\b', sentence)
['what']
That finds “what” but it doesn’t find “a” or “day”. The reason for this is that this match consumes every character between the first two “what”s.
Regular expressions only run through a string one time when searching.
We need a way to find out that the word occurs a second time without actually consuming any more characters. For this we can use a lookahead.
>>> re.findall(r'\b(\w+)\b(?=.*\b\1\b)', sentence)
['what', 'a', 'day']
We’ve used a positive lookahead here. That means that it’ll match successfully if our word is followed by any characters as well as itself later on. The (?=...) doesn’t actually consume any characters though. Let’s talk about what that means.
When we match a character, we consume it, meaning we restart our matching after that character. Here we can see that finding characters followed by x actually consumes the x as well:
>>> re.findall(r'(.)x', 'axxx')
['a', 'x']
So this is repeatedly matching any character followed by the letter x. Notice that because both characters are consumed, when an x is followed by another x, only one of them is captured because both are consumed by the same match.
If we use a lookahead for the letter x, it won’t be consumed, so we’ll properly match each character that is followed by an x (including other x’s) this way:
>>> re.findall(r'(.)(?=x)', 'axxx')
['a', 'x', 'x']
Note that anchors like ^, $, and \b do not consume characters either.
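One consequence of lookaheads not consuming anything is that they can be used to collect overlapping matches. A minimal sketch, assuming we want every adjacent pair of characters in a made-up string:

```python
import re

# The whole pattern is a lookahead, so each match is zero characters
# wide and the search advances one position at a time. The capture
# group inside the lookahead still records the pair it saw.
pairs = re.findall(r'(?=(\w\w))', 'abcd')
print(pairs)  # ['ab', 'bc', 'cd']

# Without the lookahead, each match consumes both letters.
non_overlapping = re.findall(r'\w\w', 'abcd')
print(non_overlapping)  # ['ab', 'cd']
```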
Negative Lookahead¶
What if we want to write a regular expression that makes sure our string contains at least two different letters?
>>> re.search(r'[a-z].*[a-z]', 'aa', re.IGNORECASE)
<re.Match object; span=(0, 2), match='aa'>
That doesn’t work because it doesn’t make sure the letters are different.
We need some way to tell the regular expression engine that the second letter should not be the same as the first.
We already know how to write a regular expression that makes sure the two letters are the same:
>>> re.search(r'([a-z]).*\1', 'aa', re.IGNORECASE)
<re.Match object; span=(0, 2), match='aa'>
>>> re.search(r'([a-z]).*\1', 'a a', re.IGNORECASE)
<re.Match object; span=(0, 3), match='a a'>
>>> re.search(r'([a-z]).*\1', 'a b', re.IGNORECASE)
We can use a negative lookahead to make sure the two letters found are different.
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'aa', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'ab', re.IGNORECASE)
<re.Match object; span=(0, 2), match='ab'>
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a b', re.IGNORECASE)
<re.Match object; span=(0, 3), match='a b'>
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a a', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a ab', re.IGNORECASE)
<re.Match object; span=(0, 4), match='a ab'>
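Negative lookaheads do not have to involve a back-reference. As a simpler illustration with an invented word list, this sketch finds words containing a q that is not immediately followed by a u:

```python
import re

words = "Iraq quit qat queue"

# q(?!u) matches a q only when the next character is not u.
# The surrounding \w* and \b expand the match to the whole word.
q_without_u = re.findall(r'\b\w*q(?!u)\w*\b', words)

print(q_without_u)  # ['Iraq', 'qat']
```

Note that the lookahead also succeeds at the end of a word, where there is no next letter at all.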
Lookahead Exercises¶
All Vowels¶
Find all words that are at most 9 letters long and contain every vowel (a, e, i, o, u) in any order.
Tip
Modify the have_all_vowels function in the lookahead module.
Unique Letters¶
Find all words that have at least 10 letters and do not have any repeating letters.
Tip
Modify the no_repeats function in the lookahead module.
HTML Encode Ampersands¶
Replace all & characters which are not part of HTML escape sequences with an HTML-encoded ampersand (&amp;).
Tip
Modify the encode_ampersands function in the lookahead module.
Example usage:
>>> encode_ampersands("This & that & that & this.")
'This &amp; that &amp; that &amp; this.'
>>> encode_ampersands("A&W")
'A&amp;W'
Pig Latin¶
Create a function that translates English phrases to Pig Latin.
Tip
Modify the to_pig_latin function in the lookahead module.
Example usage:
>>> to_pig_latin("pig")
'igpay'
>>> to_pig_latin("trust")
'usttray'
>>> to_pig_latin("quack")
'ackquay'
>>> to_pig_latin("squeak")
'eaksquay'
>>> to_pig_latin("enqueue")
'enqueueay'
>>> to_pig_latin("sequoia")
'equoiasay'
Camel Case to Underscore¶
Make a function that converts camelCase strings to under_score strings.
Tip
Modify the camel_to_underscore function in the lookahead module.
Get Inline Markdown Links¶
Make a function that accepts a string and returns a list of all inline markdown links in the given string.
Inline markdown links look like this:
[text here](http://example.com)
Tip
Modify the get_inline_links function in the lookahead module.
Example usage:
>>> get_inline_links("""
... [Python](https://www.python.org)
... [Google](https://www.google.com)""")
[('Python', 'https://www.python.org'), ('Google', 'https://www.google.com')]
Broken Markdown Links¶
Make a function that accepts a string and returns a list of all reference-style markdown links that do not have a corresponding link definition.
Tip
Modify the find_broken_links function in the lookahead module.
Example usage:
>>> find_broken_links("""
... [working link][Python]
... [broken link][Google]
... [python]: https://www.python.org/""")
[('broken link', 'Google')]
As a bonus, make your function also work with implicit link names. For example:
>>> find_broken_links("""
... [Python][]
... [Google][]
... [python]: https://www.python.org/""")
[('Google', 'Google')]
Get All Markdown Links¶
Modify your get_inline_links function from the earlier exercise to make a get_markdown_links function which finds all markdown links.
This function should work for inline links as well as reference links (including reference links with implicit link names).
Tip
Modify the get_markdown_links function in the lookahead module.
Example usage:
>>> get_markdown_links("""
... [Python](https://www.python.org)
... [Google][]
... [Another link][example]
... [google]: https://www.google.com
... [example]: http://example.com""")
[('Python', 'https://www.python.org'), ('Google', 'https://www.google.com'), ('Another link', 'http://example.com')]