Escaping characters
Suppose you have a file of students and their grades on a recent exam.
If you want to find all students who received an A+, you might try using the regular expression A+, but this also matches the A in Allison: C+ 🤔
That's because + is a special regular expression metacharacter meaning "one or more repetitions". In other words, A+ means match one or more repetitions of A.
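The pitfall can be reproduced with a short sketch. The grade records below are hypothetical, made up for illustration, and `re.search` is just one of several functions that could be used to test for a match.

```python
import re

# Hypothetical grade records, made up for illustration
records = ["Allison: C+", "Ben: A+", "Cara: B"]

# Unescaped pattern: + means "one or more of the preceding A"
unescaped = [r for r in records if re.search("A+", r)]
print(unescaped)  # ['Allison: C+', 'Ben: A+']

# The + never matches a literal plus sign here; it only repeats the A
repeated = re.findall("A+", "AAA+")
print(repeated)  # ['AAA']
```

Allison is selected because her name contains a capital A, not because of her grade.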
In order to match a literal plus sign (ignoring the default interpretation of +), you must escape it with a backslash.
For example, A\+ means match the string A+.
Similarly, \\ matches a single backslash, \. matches a period, \( matches an open parenthesis, and so on.
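A minimal sketch of escaping in practice follows. The records and sample strings are hypothetical, and the patterns are written with an r prefix so that Python passes the backslashes through to the regex engine unchanged. `re.escape` is shown as an optional convenience from the standard library.

```python
import re

# Hypothetical grade records, made up for illustration
records = ["Allison: C+", "Ben: A+", "Cara: B"]

# Escaped pattern: \+ matches a literal plus sign
a_plus = [r for r in records if re.search(r"A\+", r)]
print(a_plus)  # ['Ben: A+']

# Other escaped metacharacters
backslashes = re.findall(r"\\", r"C:\temp")  # one literal backslash
periods = re.findall(r"\.", "v1.2.3")        # ['.', '.']
parens = re.findall(r"\(", "f(x)")           # ['(']
print(len(backslashes), periods, parens)

# re.escape() adds the backslashes for you
escaped = re.escape("A+")
print(escaped)  # A\+
```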
Raw string literals ('r' prefix)
Python doesn't always interpret strings exactly as you write them. For example, in the string "hello\nworld", Python interprets \n as a newline character. This becomes evident when you print() the string.
If you want to avoid this special interpretation of \n, you can prefix the string with the letter r.
A string prefixed with the letter r like this is called a raw string literal.
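The difference is easy to see by comparing the two forms side by side. This is a small sketch; the variable names `normal` and `raw` are chosen here for illustration.

```python
normal = "hello\nworld"   # \n is interpreted as one newline character
raw = r"hello\nworld"     # \n stays as a backslash followed by the letter n

print(normal)  # prints hello and world on two separate lines
print(raw)     # prints hello\nworld on one line

print(len(normal))  # 11
print(len(raw))     # 12

# A raw string is equivalent to doubling each backslash yourself
print(raw == "hello\\nworld")  # True
```

The raw form is one character longer because the backslash and the n are kept as two separate characters.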
!!! tip
!!! note "Raw string literals vs escaping characters"
Capture Groups
Capture groups let you select parts of a regular expression match. For example, given the string,
The expression "\d+ \w+" matches these substrings
By wrapping \d+ and \w+ in parentheses, each match suddenly has two nested sub-matches. These are called "capture groups".
In Python, if we identify the first match using re.search(), we can access its groups using the .groups() method, or we can access each group individually using the .group() method.
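As a rough illustration of the steps above, the sketch below uses a hypothetical sentence (not the original example string) to show the plain matches first and then the capture groups on the first match.

```python
import re

# Hypothetical sentence, made up for illustration
text = "I bought 3 apples and 12 pears"

# Without parentheses: each match is one whole substring
whole = re.findall(r"\d+ \w+", text)
print(whole)  # ['3 apples', '12 pears']

# With parentheses: the first match carries two capture groups
match = re.search(r"(\d+) (\w+)", text)
print(match.group(0))  # 3 apples (the whole match)
print(match.groups())  # ('3', 'apples')
print(match.group(1))  # 3
print(match.group(2))  # apples
```

Group 0 always refers to the entire match; the numbered groups start at 1 and follow the order of the opening parentheses.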
Nested and Non Capture Groups
Consider the pattern (Al|Bi)(\w+), which matches a capitalized Al or Bi followed by one or more word characters.
There are two capture groups in this example: (Al|Bi) and (\w+). What if we wanted to make the entire pattern a single capture group? You might try ((Al|Bi)\w+), but this returns two strings for each match:
- one representing the outer capture group, identified by the outermost parentheses {==(==}(Al|Bi)\w+{==)==}
- another representing the inner capture group, identified by the innermost parentheses ({==(==}Al|Bi{==)==}\w+).
This is known as a nested capture group.
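The behavior can be sketched with `re.findall`, assuming that is the function used to collect the matches. The list of names is hypothetical.

```python
import re

# Hypothetical list of names, made up for illustration
text = "Alice, Bilbo, Carl, Alfred"

# Two sibling groups: findall returns one tuple per match
siblings = re.findall(r"(Al|Bi)(\w+)", text)
print(siblings)
# [('Al', 'ice'), ('Bi', 'lbo'), ('Al', 'fred')]

# Nested groups: still one tuple per match, outer group first
nested = re.findall(r"((Al|Bi)\w+)", text)
print(nested)
# [('Alice', 'Al'), ('Bilbo', 'Bi'), ('Alfred', 'Al')]
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group comes first in each tuple.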
The issue of selecting a single capture group remains! The problem stems from the Or operator, because we cannot drop the parentheses surrounding (Al|Bi) without changing the pattern's meaning. Without them, Al|Bi\w+ would match either Al on its own or Bi followed by word characters.
To get around this issue, we can change (Al|Bi) to (?:Al|Bi). The ?: bit marks the group as a non-capturing group.
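A minimal sketch of the fix, again assuming `re.findall` and a hypothetical list of names:

```python
import re

# Hypothetical list of names, made up for illustration
text = "Alice, Bilbo, Carl, Alfred"

# (?:...) groups Al|Bi without capturing it, so only the outer group is returned
single = re.findall(r"((?:Al|Bi)\w+)", text)
print(single)  # ['Alice', 'Bilbo', 'Alfred']

# Dropping the inner parentheses changes the meaning:
# the pattern now matches "Al" alone, or "Bi" followed by word characters
changed = re.findall(r"(Al|Bi\w+)", text)
print(changed)  # ['Al', 'Bilbo', 'Al']
```

With the non-capturing group in place, each match yields a single string instead of a tuple.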