Python Regex Patterns — Core Concepts

Regular expressions let you define text-matching rules using a compact pattern language. Python’s re module compiles these patterns into efficient search engines that scan strings in a single pass.

While Python Regular Expressions covers the re module’s API, this article focuses on the patterns themselves — the syntax you actually write inside re.compile().

The Pattern Alphabet

Every regex pattern is built from a small set of symbols.

Literal characters match themselves. The pattern cat matches the exact sequence c-a-t.

Metacharacters have special meaning:

SymbolMeaningExample match
.Any character except newlinea.c → “abc”, “a3c”
\dAny digit (0-9)\d\d → “42”
\wWord character (letter, digit, underscore)\w+ → “hello_2”
\sWhitespace (space, tab, newline)\s+ → ” “
\bWord boundary\bcat\b → “cat” but not “catalog”

Capital versions negate: \D means non-digit, \W means non-word character, \S means non-whitespace.

Character Classes

Square brackets define custom character sets.

  • [aeiou] matches any single vowel
  • [0-9] matches any digit (same as \d)
  • [^abc] matches anything except a, b, or c
  • [a-zA-Z] matches any English letter

You can combine ranges: [a-zA-Z0-9_] is equivalent to \w for ASCII text.

Quantifiers

Quantifiers control how many times a piece must repeat.

QuantifierMeaning
*Zero or more
+One or more
?Zero or one
{3}Exactly 3
{2,5}Between 2 and 5
{3,}3 or more

By default, quantifiers are greedy — they grab as much text as possible. Add ? after a quantifier to make it lazy (minimal matching): .*? stops at the first opportunity.

Anchors

Anchors don’t match characters — they match positions.

  • ^ matches the start of a string (or line in multiline mode)
  • $ matches the end
  • \b matches a word boundary

Use anchors when you need the pattern to match the whole input, not just a substring. ^\d{5}$ ensures the entire string is exactly five digits.

Groups and Alternation

Parentheses create groups that let you:

  1. Extract parts of a match: (\d{3})-(\d{4}) captures area code and number separately
  2. Apply quantifiers to sequences: (ha)+ matches “ha”, “haha”, “hahaha”
  3. Alternate with the pipe |: (cat|dog) matches either word

Named groups make extracted data clearer: (?P<year>\d{4})-(?P<month>\d{2}) gives each capture a label.

Non-capturing groups (?:...) group without capturing, which is useful when you need grouping for alternation but don’t care about extracting the match.

Lookahead and Lookbehind

These assert what comes before or after your match without consuming characters.

  • (?=...) positive lookahead: \d+(?= dollars) matches digits followed by ” dollars”
  • (?!...) negative lookahead: \d+(?! cents) matches digits NOT followed by ” cents”
  • (?<=...) positive lookbehind: (?<=\$)\d+ matches digits preceded by ”$”
  • (?<!...) negative lookbehind: (?<!\\)\n matches newlines not preceded by backslash

Common Misconception

“Regex can parse any structured format.” Regex works on flat text patterns. It cannot reliably handle nested structures like HTML or JSON. For those, use a proper parser. Regex is the right tool for extracting simple, predictable patterns — not for replacing a full grammar.

One Thing to Remember

Regex patterns are assembled from a small toolkit — character classes, quantifiers, anchors, and groups — and nearly every pattern you’ll ever need is a combination of these four pieces.

pythonregexpatternstext-processing

See Also