Python regex
Ben Gorman

What's a regular expression?

A regular expression (aka regex) is a special syntax for describing a pattern of characters, which you can then use to find matching text inside strings. For example, the regular expression \d+\s[a-z]+ matches text that has

  • one or more digits (\d+)
  • followed by a single whitespace character (\s)
  • followed by one or more lowercase letters between a and z ([a-z]+)

Example

==20 quick== brown foxes jumped over ==2 lazy== dogs, ==8 sleepy== cats, and ==4 loud== crickets.
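
The highlighted matches above can be reproduced with Python's built-in re module. This is a minimal sketch; the variable names `text` and `matches` are only illustrative.

```python
import re

text = "20 quick brown foxes jumped over 2 lazy dogs, 8 sleepy cats, and 4 loud crickets."

# A raw string (r"...") keeps the backslashes intact for the regex engine
matches = re.findall(r"\d+\s[a-z]+", text)
print(matches)
# ['20 quick', '2 lazy', '8 sleepy', '4 loud']
```

Each match stops at the first character that is not a lowercase letter, which is why "20 quick" does not extend into "brown".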

Table of regular expression patterns

| Pattern | Description |
| --- | --- |
| `[abc]` | a or b or c |
| `[^abc]` | not (a or b or c) |
| `[a-z]` | a or b ... or y or z |
| `[1-9]` | 1 or 2 ... or 8 or 9 |
| `\d` | digit `[0-9]` |
| `\D` | non-digit `[^0-9]` |
| `\s` | whitespace `[ \t\n\r\f\v]` |
| `\S` | non-whitespace `[^ \t\n\r\f\v]` |
| `\w` | word character (alphanumeric or underscore) `[a-zA-Z0-9_]` |
| `\W` | non-word character `[^a-zA-Z0-9_]` |
| `.` | any character except a newline (by default) |
| `x*` | zero or more repetitions of x |
| `x+` | one or more repetitions of x |
| `x?` | zero or one occurrence of x |
| `x{m}` | exactly m repetitions of x |
| `x{m,n}` | m to n repetitions of x |
| `\\`, `\.`, `\*` | literal backslash, period, asterisk |
| `\b` | word boundary |
| `^hello` | starts with hello |
| `bye$` | ends with bye |
| `(...)` | capture group |
| `(po\|go)` | po or go |
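
To make a few of these patterns concrete, here is a short sketch that tries them with `re.findall()` and `re.search()`. The sample strings and variable names are made up for illustration.

```python
import re

# Character classes: [abc] matches a, b, or c; [^abc] matches anything else
in_class = re.findall(r"[abc]", "cabbage")       # ['c', 'a', 'b', 'b', 'a']
not_in_class = re.findall(r"[^abc]", "cabbage")  # ['g', 'e']

# Quantifier: \d{2,3} matches runs of two to three digits
digit_runs = re.findall(r"\d{2,3}", "7 42 123")  # ['42', '123']

# Word boundary: \bcat\b matches "cat" only as a whole word
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")  # ['cat', 'cat']

# Anchors: ^ ties the match to the start of the string, $ to the end
starts_with_hello = bool(re.search(r"^hello", "hello world"))  # True
ends_with_bye = bool(re.search(r"bye$", "goodbye"))            # True

# Escaping: \. matches a literal period rather than "any character"
periods = re.findall(r"\.", "a.b.c")  # ['.', '.']

# Capture groups: (...) extracts the parts of the match you care about
page_range = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()  # ('10', '20')
```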

How do regular expressions work in Python?

In Python, regular expressions are provided by the built-in re module.

Table of regular expression functions in Python

| Function | Description | Return Value |
| --- | --- | --- |
| `re.findall(pattern, string, flags=0)` | Find all non-overlapping occurrences of pattern in string | list of strings, or list of tuples if the pattern has more than one capture group |
| `re.finditer(pattern, string, flags=0)` | Find all non-overlapping occurrences of pattern in string | iterator yielding match objects |
| `re.search(pattern, string, flags=0)` | Find first occurrence of pattern in string | match object or None |
| `re.split(pattern, string, maxsplit=0, flags=0)` | Split string by occurrences of pattern | list of strings |
| `re.sub(pattern, repl, string, count=0, flags=0)` | Replace pattern with repl | new string with the replacement(s) |
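
The sketch below calls each of these functions once so the return values can be compared side by side. The sample text and variable names are hypothetical and chosen only for illustration.

```python
import re

text = "Order 66 shipped 3 boxes and 12 crates"
pattern = r"\d+"

# findall: list of every matching substring
all_numbers = re.findall(pattern, text)  # ['66', '3', '12']

# findall with two capture groups: list of tuples, one tuple per match
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")  # [('x', '1'), ('y', '22')]

# finditer: iterator of match objects, which also carry positions
starts = [m.start() for m in re.finditer(pattern, text)]  # [6, 17, 29]

# search: match object for the first occurrence, or None if there is none
first = re.search(pattern, text).group()      # '66'
missing = re.search(pattern, "no digits here")  # None

# split: break the string wherever the pattern occurs
parts = re.split(r",\s*", "a, b,c,   d")  # ['a', 'b', 'c', 'd']

# sub: return a new string with every match replaced
masked = re.sub(pattern, "#", text)  # 'Order # shipped # boxes and # crates'
```

Because `re.search()` returns None when nothing matches, check the result before calling `.group()` on input you do not control.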

What about re.compile()?

The following regular expression searches have equivalent logic...

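import re

# Version 1: compile the pattern once, then call methods on the Pattern object
pat = re.compile(r"[A-Z][a-z]+")
pat.findall("Hi, I'm Bob.")
# ['Hi', 'Bob']

# Version 2: pass the pattern string directly to the module-level function
re.findall(pattern=r"[A-Z][a-z]+", string="Hi, I'm Bob.")
# ['Hi', 'Bob']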

but the first version compiles the regular expression into a re.Pattern object.

type(pat)  # <class 're.Pattern'>

This can improve performance when you use the same regular expression repeatedly, although the gain is often small because the re module caches recently used patterns.
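
As a rough sketch of that reuse, the example below compiles a pattern once and applies it to several strings. The names `capitalized` and `sentences` are illustrative assumptions, not part of the re module.

```python
import re

# Compile once, outside the loop
capitalized = re.compile(r"[A-Z][a-z]+")

sentences = ["Hi, I'm Bob.", "Meet Alice and Carol.", "no names here"]

# The Pattern object offers the same methods as the module-level functions
names = [capitalized.findall(s) for s in sentences]
print(names)
# [['Hi', 'Bob'], ['Meet', 'Alice', 'Carol'], []]

# The original pattern string stays available on the compiled object
print(capitalized.pattern)
# [A-Z][a-z]+
```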