Python Unicode Categories — Core Concepts

Every Unicode character has a General Category — a two-letter code that classifies it as a letter, number, punctuation, symbol, separator, or control character. Python exposes this through unicodedata.category(), enabling script-aware text processing that works across all writing systems.

The Category System

Unicode defines 30 categories grouped into 7 major classes:

| Major class | Code prefix | Examples |
| --- | --- | --- |
| Letter | L | Lu (uppercase), Ll (lowercase), Lo (other letter) |
| Mark | M | Mn (nonspacing), Mc (spacing combining), Me (enclosing) |
| Number | N | Nd (decimal digit), Nl (letter number like Ⅳ) |
| Punctuation | P | Ps (open bracket), Pe (close bracket), Po (other) |
| Symbol | S | Sc (currency $€), Sm (math +×), So (other ♪) |
| Separator | Z | Zs (space), Zl (line sep), Zp (paragraph sep) |
| Other | C | Cc (control), Cf (format), Co (private use) |

The two-letter code is always a major class letter followed by a subclass letter.
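The counts above can be checked from Python itself. The following is a minimal sketch that scans every code point and collects the codes that unicodedata.category() reports; the variable names are illustrative only.

```python
import sys
import unicodedata

# Collect every General Category code reported across the full code point range.
categories = {unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)}

# The first letter of each code is its major class.
major_classes = {code[0] for code in categories}

print(len(categories))        # 30
print(sorted(major_classes))  # ['C', 'L', 'M', 'N', 'P', 'S', 'Z']
```

The scan covers about 1.1 million code points, so it takes a moment to run. Unassigned code points report the category Cn, which is one of the 30.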

Accessing Categories in Python

The unicodedata.category() function takes a one-character string and returns its two-letter code (passing a longer string raises TypeError):

import unicodedata

unicodedata.category('A')   # 'Lu' — uppercase letter
unicodedata.category('a')   # 'Ll' — lowercase letter
unicodedata.category('5')   # 'Nd' — decimal digit
unicodedata.category('!')   # 'Po' — other punctuation
unicodedata.category(' ')   # 'Zs' — space separator
unicodedata.category('€')   # 'Sc' — currency symbol

You can test just the first letter for broad classification: category(c)[0] == 'L' checks if a character is any kind of letter.
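As a small sketch of the first-letter test, the helper below maps a character to the name of its major class. The names MAJOR_CLASS_NAMES and major_class are hypothetical, chosen for this example.

```python
import unicodedata

# Hypothetical lookup table: first letter of the category code -> major class name.
MAJOR_CLASS_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def major_class(char: str) -> str:
    """Return the major class name for a single character."""
    return MAJOR_CLASS_NAMES[unicodedata.category(char)[0]]

for ch in ["A", "\u0436", "\u4e2d", "5", "!", "\u20ac", " "]:
    print(repr(ch), unicodedata.category(ch), major_class(ch))
```

The sample characters are a Latin, a Cyrillic, and a CJK letter followed by a digit, a punctuation mark, the euro sign, and a space.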

Practical Applications

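Filtering by type. Keep only letters and spaces from user input by checking if each character’s category starts with ‘L’ or equals ‘Zs’. This works for all scripts (Latin, Cyrillic, CJK, Arabic) without hardcoding character ranges. For scripts that write vowels or accents as combining marks, such as Devanagari vowel signs or Arabic vowel points, also keep categories that start with ‘M’, or the words will be damaged.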
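One way to write such a filter is sketched below. The function name keep_letters_and_spaces and its keep_marks flag are assumptions made for this example, not part of any library.

```python
import unicodedata

def keep_letters_and_spaces(text: str, keep_marks: bool = False) -> str:
    """Keep letters (L*) and space separators (Zs); optionally keep marks (M*)."""
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L") or cat == "Zs":
            kept.append(ch)
        elif keep_marks and cat.startswith("M"):
            kept.append(ch)
    return "".join(kept)

print(keep_letters_and_spaces("Hello, \u043c\u0438\u0440! 123"))
```

The keep_marks flag exists because some scripts store vowels and accents as separate combining characters. With the default setting those marks are dropped along with digits and punctuation.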

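Removing diacritics. Combining marks are the accents that attach to base characters; most of them are nonspacing marks (category ‘Mn’). Decompose with NFD, filter out ‘Mn’ characters, and recompose with NFC. “café” becomes “cafe”. Letters that have no canonical decomposition, such as “ø” or “ł”, pass through unchanged.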
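The decompose, filter, recompose steps can be sketched as follows. The function name strip_diacritics is hypothetical; unicodedata.normalize() and unicodedata.category() are the standard-library calls doing the work.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove nonspacing marks (Mn) after canonical decomposition."""
    # NFD splits precomposed characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the nonspacing marks, keeping the base characters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # NFC recomposes whatever can still be composed.
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("caf\u00e9"))        # cafe
print(strip_diacritics("\u00c5ngstr\u00f6m"))  # Angstrom
```

This only removes marks that NFD separates out. A letter such as U+00F8 (o with stroke) has no decomposition, so it is returned as-is.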

Identifying invisible characters. Control characters (Cc) and format characters (Cf), which include zero-width characters such as U+200B ZERO WIDTH SPACE, hide in pasted text. Checking categories reveals them.
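A minimal sketch of such a check is shown below. The function name find_invisible and the tuple layout it returns are assumptions for this example.

```python
import unicodedata

def find_invisible(text: str) -> list:
    """Return (index, code point, category, name) for each Cc or Cf character."""
    hits = []
    for index, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in ("Cc", "Cf"):
            # Control characters have no Unicode name, so supply a fallback.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{ord(ch):04X}", cat, name))
    return hits

sample = "pass\u200bword\u00ad\x07"
for hit in find_invisible(sample):
    print(hit)
```

Tab, newline, and carriage return are also category Cc, so a real check would probably allow those explicitly before reporting the rest.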

Script-independent digit detection. The ‘Nd’ category includes decimal digits from every numeral system — Arabic-Indic (٠-٩), Devanagari (०-९), Thai (๐-๙) — not just ASCII 0-9.
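As an illustration, the sketch below tests the digit five from four numeral systems and reads its numeric value with unicodedata.decimal(). The helper name is_decimal_digit is hypothetical.

```python
import unicodedata

def is_decimal_digit(ch: str) -> bool:
    """True when the character belongs to the Nd (decimal digit) category."""
    return unicodedata.category(ch) == "Nd"

# The digit five in ASCII, Arabic-Indic, Devanagari, and Thai.
fives = ["5", "\u0665", "\u096b", "\u0e55"]
for ch in fives:
    print(ch, unicodedata.category(ch), unicodedata.decimal(ch))

# int() accepts any Nd digits, so non-ASCII numerals convert directly.
print(int("\u0663\u0664"))  # 34
```

Characters that look numeric but are not decimal digits, such as the Roman numeral Ⅳ (Nl) or the superscript ² (No), fail this test.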

Categories vs Python String Methods

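Python’s built-in methods like str.isalpha(), str.isdigit(), and str.isspace() are defined in terms of Unicode character properties, mostly the General Category:

| Method | Roughly equivalent to |
| --- | --- |
| c.isalpha() | Category starts with ‘L’ |
| c.isdigit() | Category is ‘Nd’, plus other digit characters such as superscripts (category ‘No’); c.isdecimal() matches ‘Nd’ exactly |
| c.isspace() | Category starts with ‘Z’, or is a Cc whitespace control such as tab or newline |
| c.isprintable() | Category does not start with ‘C’ or ‘Z’, except that the ASCII space counts as printable |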
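The equivalences are only approximate, and the edge cases are easy to inspect. The sketch below prints the category next to the string-method results for a few characters where they diverge; the samples dictionary is illustrative.

```python
import unicodedata

# Characters chosen because string methods and categories disagree in interesting ways.
samples = {
    "superscript two": "\u00b2",
    "roman numeral four": "\u2163",
    "line separator": "\u2028",
    "no-break space": "\u00a0",
    "tab": "\t",
}

for label, ch in samples.items():
    print(
        f"{label:20}",
        unicodedata.category(ch),
        "alpha" if ch.isalpha() else "-",
        "decimal" if ch.isdecimal() else "-",
        "digit" if ch.isdigit() else "-",
        "space" if ch.isspace() else "-",
        "printable" if ch.isprintable() else "-",
    )
```

The superscript two is a digit according to str.isdigit() even though its category is No rather than Nd, and the no-break space is whitespace but not printable.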

The key difference: unicodedata.category() gives you the exact classification, while string methods give you a boolean answer to a broader question. When you need to distinguish uppercase from lowercase from titlecase, or currency symbols from math symbols, you need the category code.
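A short sketch of that finer distinction follows. The FINE_GRAINED table and the describe function are hypothetical names for this example.

```python
import unicodedata

# Hypothetical labels for a handful of category codes.
FINE_GRAINED = {
    "Lu": "uppercase letter",
    "Ll": "lowercase letter",
    "Lt": "titlecase letter",
    "Sc": "currency symbol",
    "Sm": "math symbol",
}

def describe(ch: str) -> str:
    """Return a label for the character's exact category, if it is in the table."""
    return FINE_GRAINED.get(unicodedata.category(ch), "something else")

# U+01C5 is a titlecase digraph: neither isupper() nor islower() is true for it.
for ch in ["A", "a", "\u01c5", "\u20ac", "\u00d7"]:
    print(repr(ch), describe(ch), ch.isupper(), ch.islower())
```

The titlecase letter is the case the boolean methods cannot express on their own: both str.isupper() and str.islower() return False for it, while the category reports Lt.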

Common Misconception

“ASCII ranges are enough for character classification.” Checking 'a' <= c <= 'z' only covers the 26 lowercase English letters. Unicode has over 130,000 letters across more than 150 scripts. Category-based checking handles all of them with one consistent API. If your code uses hardcoded ASCII ranges for character classification, it will misclassify international text.
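The gap is easy to demonstrate. In this sketch, is_lower_ascii and is_lower_unicode are hypothetical helpers that compare the range check with a category check on the same characters.

```python
import unicodedata

def is_lower_ascii(ch: str) -> bool:
    """Range check: only matches the 26 unaccented lowercase English letters."""
    return "a" <= ch <= "z"

def is_lower_unicode(ch: str) -> bool:
    """Category check: matches lowercase letters from any script."""
    return unicodedata.category(ch) == "Ll"

# Latin a, e with acute, n with tilde, Cyrillic zhe, Greek alpha.
for ch in ["a", "\u00e9", "\u00f1", "\u0436", "\u03b1"]:
    print(repr(ch), is_lower_ascii(ch), is_lower_unicode(ch))
```

Only the first character passes the range check, while all five are lowercase letters by category.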

One Thing to Remember

Unicode categories give every character a precise two-letter classification that Python’s unicodedata module exposes — use them instead of ASCII range checks for text processing that works in every language.

