Working with Python

Regular Expressions

Regular expressions in Python use the standard module re.

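Use raw strings (r"...") for regex patterns to avoid problems with backslash escapes. In an ordinary string, Python processes the backslashes before the regex engine sees them, and since Python 3.12 an unrecognized escape such as "\d" in a non-raw string triggers a SyntaxWarning (earlier versions issued a DeprecationWarning).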
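
As a small illustration of why this matters (the sample text is made up for the example), \b means a word boundary to the regex engine but a backspace character in an ordinary Python string literal:

```python
import re

text = "a cat sat on the catalog"

# Raw string: \b reaches the regex engine as a word boundary
raw_hits = re.findall(r"\bcat\b", text)

# Ordinary string: "\b" becomes a backspace character before the
# regex engine ever sees it, so nothing in the text matches
plain_hits = re.findall("\bcat\b", text)

print(raw_hits)
# Output: ['cat']
print(plain_hits)
# Output: []
```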

Basic regular expression matching

The two main regex functions are re.search() and re.match(). The difference is that re.search() scans the entire string to find a match anywhere, while re.match() only checks for a match at the beginning of the string. Most of the time you’ll want re.search().

import re

s = "Hello world"
m = re.match("Hello", s)
if m:
    print("Match found.")
else:
    print("No match.")

# Output: Match found.

Trying to match “world” highlights the difference between .search() and .match():

import re

s = "Hello world"
m = re.match("world", s)
if m:
    print("Match found.")
else:
    print("No match.")

# Output: No match.


m = re.search("world", s)
if m:
    print("Match found.")
else:
    print("No match.")

# Output: Match found.

Capturing Matches

Often you don’t want to just match a pattern, but also capture a value. Use parentheses () to group parts of a pattern and capture the value:

import re

s = "There are 13 dogs outside"
m = re.search(r"(\d+)", s)
if m is not None:
    print(f"Search found: {m.group(1)} dogs")

# Output: Search found: 13 dogs

Both re.match() and re.search() return a Match object if a match is found, or None otherwise. Use m.group() (or m.group(0)) to get the entire matched substring; m.group(1) returns the substring matched by the first group, m.group(2) the second, and so on.
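
A Match object also exposes the captured groups as a tuple and the position of the match. A minimal sketch, using a made-up sentence and a single capturing group:

```python
import re

s = "There are 13 dogs outside"
m = re.search(r"(\d+) dogs", s)

if m is not None:
    print(m.group())    # the whole match
    # Output: 13 dogs
    print(m.group(1))   # the first capturing group
    # Output: 13
    print(m.groups())   # all capturing groups as a tuple
    # Output: ('13',)
    print(m.span())     # (start, end) indexes of the whole match
    # Output: (10, 17)
```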

Here’s an example with multiple capturing groups:

import re

s = "There are 13 dogs outside and 2 cats"

m = re.search(r"(\d+).*(\d+)", s)
if m is not None:
    print(f"Matched String: {m.group(0)}")
    print(f"First : {m.group(1)}")
    print(f"Second: {m.group(2)}")

# Output: Matched String: 13 dogs outside and 2
# Output: First : 13
# Output: Second: 2
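
One caveat about this pattern: .* is greedy, so it consumes as much as it can and leaves the second group only the last digit it needs. That happens to work above because the second number has a single digit. A minimal sketch with a hypothetical two-digit count shows the difference, with the non-greedy .*? as one possible fix:

```python
import re

s = "There are 13 dogs outside and 25 cats"  # made-up variation of the example

greedy = re.search(r"(\d+).*(\d+)", s)
lazy = re.search(r"(\d+).*?(\d+)", s)

print(greedy.group(2))
# Output: 5
print(lazy.group(2))
# Output: 25
```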

Find All Matches

Use re.findall(pattern, string) to return a list of all non-overlapping substrings in string that match pattern.

s = "There are 13 dogs outside and 2 cats"
m = re.findall(r"\d+", s)
print(m)
# Output: ['13', '2']
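
When the pattern contains capturing groups, re.findall() returns the group contents rather than the whole match; with more than one group, each result is a tuple. A short sketch using the same sentence:

```python
import re

s = "There are 13 dogs outside and 2 cats"

# Two groups: the number and the word that follows it
pairs = re.findall(r"(\d+) (\w+)", s)
print(pairs)
# Output: [('13', 'dogs'), ('2', 'cats')]
```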

👉 What do you think will match using re.findall(r"\d", s) without the + ?

Regular Expression Substitution

Use the re.sub() function to replace text based on a regular expression. re.sub() will replace all occurrences of the pattern unless a count argument is specified. It returns a new string with the replacements made.

import re

s = "There are 13 dogs outside and 2 cats."

# Replace every run of digits with "many"
s_replaced = re.sub(r"\d+", "many", s)
print(s_replaced)
# Output: There are many dogs outside and many cats.

# Replace only the first occurrence
s_replaced_once = re.sub(r"\d+", "several", s, count=1)
print(s_replaced_once)
# Output: There are several dogs outside and 2 cats.
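
The replacement string can also refer back to captured groups with \1, \2, and so on. A brief sketch reusing the same sentence (the output format is just for illustration):

```python
import re

s = "There are 13 dogs outside and 2 cats."

# Rewrite each "<number> <word>" pair as "<word> x<number>"
swapped = re.sub(r"(\d+) (\w+)", r"\2 x\1", s)
print(swapped)
# Output: There are dogs x13 outside and cats x2.

# re.sub() returns a new string; the original is unchanged
print(s)
# Output: There are 13 dogs outside and 2 cats.
```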

Split Strings Using Regular Expressions

Use re.split() to split a string based on regular expressions.

import re

s = "apple,orange;banana:grape"
fruits = re.split(r"[,;:]", s)
print(fruits)

# Output: ['apple', 'orange', 'banana', 'grape']
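
Adding + after the character class treats a run of separators as a single split point, which avoids empty strings in the result. A small sketch with made-up, messier input:

```python
import re

s = "apple, orange;  banana :grape"

# Split on any run of commas, semicolons, colons, or whitespace
fruits = re.split(r"[,;:\s]+", s)
print(fruits)
# Output: ['apple', 'orange', 'banana', 'grape']
```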

Regular Expression Patterns

The examples so far have used the \d+ pattern to match one or more digits. Here are some of the other common regular expression patterns that Python supports.

Pattern	Description
`.`	Any character except newline
`\d`	Digits (0-9)
`\w`	Word character (letters, digits, underscore)
`\s`	Whitespace (spaces, tabs, newlines)
`^`	Start of string
`$`	End of string
`*`	0 or more
`+`	1 or more
`?`	0 or 1
`{n}`	Exactly n times
`a|b`	Use pipe to match a or b
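
A few of the table entries in action, as a quick sketch with made-up strings:

```python
import re

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)
# Output: ['color', 'colour']

# \s+ matches a run of whitespace
squeezed = re.sub(r"\s+", " ", "too   many\t spaces")
print(squeezed)
# Output: too many spaces

# . matches any single character except a newline
dots = re.findall(r"c.t", "cat cot c\nt")
print(dots)
# Output: ['cat', 'cot']
```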

Characters

Examples matching against digit and word characters.

text = "User 42 logged in at 10:45"
digits = re.findall(r"\d", text)
print(digits)
# Output: ['4', '2', '1', '0', '4', '5']

text = "User 42 logged in at 10:45"
words = re.findall(r"\w+", text)
print(words)
# Output: ['User', '42', 'logged', 'in', 'at', '10', '45']

Anchors

Examples using anchors to match at the start and end of a string.

text = "python is fun"
if re.match(r"^python", text):
    print("Starts with python")

if re.search(r"fun$", text):
    print("Ends with fun")
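
Using both anchors together requires the whole string to fit the pattern, which is handy for simple validation. A minimal sketch (the helper name is_all_digits is made up for this example):

```python
import re

def is_all_digits(value):
    """Return True if the pattern ^\\d+$ matches value."""
    return re.search(r"^\d+$", value) is not None

print(is_all_digits("12345"))
# Output: True
print(is_all_digits("123abc"))
# Output: False
print(is_all_digits(""))
# Output: False
```

Note that $ also matches just before a trailing newline, so "123\n" would pass this check; re.fullmatch(r"\d+", value) is a stricter alternative.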

Quantifiers

Examples using quantifiers to match a specific number of characters.

text = "aaaab"
m = re.match(r"a{4}b", text)  # exactly four a's followed by b
print(m.group())
# Output: aaaab

The . is a special character that matches any character except a newline. Because * means 0 or more, .* on its own matches any string, including an empty one, so it is normally combined with surrounding text. Here’s an example matching an HTML header that might contain various characters.

html = "<h1>My Page Title</h1>"
m = re.search(r"<h1>(.*)</h1>", html)
if m:
    print(m.group())
    print(m.group(1))
# Output: <h1>My Page Title</h1>
# Output: My Page Title

Use | to match either pattern:

text = "I like cats and dogs"
matches = re.findall(r"cats|dogs", text)
print(matches)
# Output: ['cats', 'dogs']
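
The | operator has lower precedence than everything around it, so anchors and neighboring text bind to only one side unless the alternatives are grouped with parentheses. A short sketch with a made-up string:

```python
import re

text = "cats and more"

# Reads as "^cats" OR "dogs$": the string starts with "cats", so it matches
ungrouped = re.search(r"^cats|dogs$", text)

# Reads as "^", then "cats" or "dogs", then "$": the whole string must be one of them
grouped = re.search(r"^(cats|dogs)$", text)

print(ungrouped is not None)
# Output: True
print(grouped is not None)
# Output: False
```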