Python Unicode Categories — Deep Dive
Unicode’s General Category property is the backbone of script-aware text processing. This deep dive explores every category in detail, builds practical tools that leverage categories, handles real-world edge cases, and optimizes for production workloads.
The Complete Category Table
Python’s unicodedata.category() returns one of 30 two-letter codes:
import unicodedata
from collections import Counter

# Count categories in the entire Basic Multilingual Plane.
# chr() accepts every value in this range (surrogates included),
# so no exception handling is needed.
category_counts = Counter()
for codepoint in range(0x10000):
    category_counts[unicodedata.category(chr(codepoint))] += 1

for cat, count in category_counts.most_common(10):
    print(f"{cat}: {count:,} characters")
# Lo (other letter): by far the largest group, tens of thousands
#   (CJK ideographs, Hangul syllables)
# Co (private use): 6,400
# Cs (surrogate): 2,048
# Cn (unassigned): varies by Unicode version
# ... (exact counts and ordering depend on the Unicode version)
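To list all 30 codes rather than only the ten most frequent, one option is to scan the full code space and group the two-letter codes by their first letter. This is a minimal sketch; the names by_major and all_categories are illustrative and not part of any library.

```python
import sys
import unicodedata
from collections import defaultdict

# Group every two-letter category by its major class (the first letter).
by_major = defaultdict(set)
for cp in range(sys.maxunicode + 1):
    cat = unicodedata.category(chr(cp))
    by_major[cat[0]].add(cat)

for major in sorted(by_major):
    print(major, sorted(by_major[major]))

# Flatten the groups to confirm how many distinct codes exist.
all_categories = set().union(*by_major.values())
print(len(all_categories))
```

The scan touches about 1.1 million codepoints, so it is better suited to a one-off exploration than to a hot path.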
Letter Categories (L*)
examples = {
    'Lu': 'A',       # Uppercase letter
    'Ll': 'a',       # Lowercase letter
    'Lt': '\u01c5',  # Titlecase letter (the single-codepoint digraph Dž)
    'Lm': 'ʰ',       # Modifier letter (superscript h)
    'Lo': 'あ',      # Other letter (CJK, Hiragana, etc.)
}
for cat, char in examples.items():
    assert unicodedata.category(char) == cat
    print(f"{cat}: {char} ({unicodedata.name(char)})")
Important: CJK characters are ‘Lo’ (other letter), not ‘Lu’ or ‘Ll’. Case-based logic doesn’t apply to most of the world’s writing systems.
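A short check makes the point concrete: for ‘Lo’ characters, str.isupper() and str.islower() are both False, while str.isalpha() is still True. The helper has_case below is a hypothetical name, and it is a simplification, since a few cased characters live outside the letter categories.

```python
import unicodedata

samples = ['A', 'a', '\u4e16', '\u3042', '\u05d0', '\u0e01']  # A, a, CJK, Hiragana, Hebrew, Thai

for ch in samples:
    print(ch, unicodedata.category(ch), ch.isupper(), ch.islower(), ch.isalpha())

def has_case(ch: str) -> bool:
    """Rough test: only Lu, Ll and Lt letters are treated as cased."""
    return unicodedata.category(ch) in {'Lu', 'Ll', 'Lt'}

# Case conversion leaves caseless letters unchanged.
print('\u4e16'.upper() == '\u4e16')
```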
Mark Categories (M*)
Marks attach to preceding base characters:
# Combining acute accent
combining = '\u0301'
print(unicodedata.category(combining)) # 'Mn' (non-spacing mark)
# The mark modifies the preceding character
print('e' + combining) # é (visually)
print(len('e' + combining)) # 2 (two codepoints)
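The mark class has three subcategories: Mn (non-spacing), Mc (spacing combining), and Me (enclosing). The sketch below picks one sample character for each and also shows that NFC normalization can fold a base letter plus mark into a single codepoint when a precomposed form exists. The sample characters are just convenient picks.

```python
import unicodedata

marks = {
    '\u0301': 'Mn',  # COMBINING ACUTE ACCENT
    '\u093e': 'Mc',  # DEVANAGARI VOWEL SIGN AA
    '\u20dd': 'Me',  # COMBINING ENCLOSING CIRCLE
}
for ch, expected in marks.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
    assert unicodedata.category(ch) == expected

# NFC composes 'e' + U+0301 into the single codepoint U+00E9.
composed = unicodedata.normalize('NFC', 'e\u0301')
print(len(composed))
```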
Number Categories (N*)
# Nd: decimal digits (positional value 0-9 in any script)
digits = '0٠०๐𝟎' # ASCII, Arabic-Indic, Devanagari, Thai, Math bold
for d in digits:
    cat = unicodedata.category(d)
    val = unicodedata.decimal(d, -1)
    print(f"{d} → {cat}, decimal value: {val}")
# Nl: letter-like numbers
print(unicodedata.category('Ⅳ')) # 'Nl' (Roman numeral four)
print(unicodedata.numeric('Ⅳ')) # 4.0
# No: other numbers (fractions, superscripts)
print(unicodedata.category('½')) # 'No'
print(unicodedata.numeric('½')) # 0.5
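The distinction matters when parsing: Python's built-in int() accepts ‘Nd’ digits from any script, but rejects ‘Nl’ and ‘No’ characters. The helper below, parse_decimal (a hypothetical name), is one way to guard the conversion with a category check.

```python
import unicodedata

def parse_decimal(text: str):
    """Return an int if every character is a decimal digit (Nd), else None."""
    if text and all(unicodedata.category(c) == 'Nd' for c in text):
        return int(text)
    return None

print(parse_decimal('\u0664\u0662\u0660'))  # Arabic-Indic digits 4, 2, 0
print(parse_decimal('\u0967\u0968\u0969'))  # Devanagari digits 1, 2, 3
print(parse_decimal('\u2163'))              # Roman numeral four is Nl, so None
print(parse_decimal('\u00bd'))              # Vulgar fraction one half is No, so None
```

For ‘Nl’ and ‘No’ characters, unicodedata.numeric() remains the way to get a value.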
Building a Category-Based Text Processor
import unicodedata

class CategoryFilter:
    """Filter and transform text based on Unicode categories."""

    def __init__(self, allowed_categories: set[str] | None = None,
                 allowed_major: set[str] | None = None):
        # Two-letter codes such as 'Zs', and one-letter major classes such as 'L'
        self.allowed_categories = allowed_categories or set()
        self.allowed_major = allowed_major or set()

    def filter(self, text: str) -> str:
        """Keep only characters whose category or major class is allowed."""
        result = []
        for c in text:
            cat = unicodedata.category(c)
            if cat in self.allowed_categories or cat[0] in self.allowed_major:
                result.append(c)
        return ''.join(result)

    @staticmethod
    def replace_category(text: str, category: str, replacement: str) -> str:
        """Replace all characters of a given category."""
        return ''.join(
            replacement if unicodedata.category(c) == category else c
            for c in text
        )

# Keep only letters and spaces
letters_spaces = CategoryFilter(
    allowed_categories={'Zs'},
    allowed_major={'L'}
)
print(letters_spaces.filter("Hello, World! 123 café"))
# "Hello World  café" (two spaces remain where "123" was removed)

# Replace all currency symbols with $
text = "€100, £50, ¥1000"
print(CategoryFilter.replace_category(text, 'Sc', '$'))
# "$100, $50, $1000"
Script Detection and Mixed-Script Analysis
from collections import Counter

def get_script(char: str) -> str:
    """Guess a character's script from its Unicode name (a rough heuristic)."""
    try:
        name = unicodedata.name(char)
    except ValueError:
        return 'UNKNOWN'
    # Common script names that appear in character names
    for script in ['LATIN', 'CYRILLIC', 'GREEK', 'ARABIC', 'HEBREW',
                   'CJK', 'HIRAGANA', 'KATAKANA', 'HANGUL', 'DEVANAGARI',
                   'THAI', 'GEORGIAN', 'ARMENIAN', 'ETHIOPIC']:
        if script in name:
            return script
    # Not in the list above: a script-neutral character or an unlisted script
    return 'COMMON'

def analyze_scripts(text: str) -> dict[str, int]:
    """Count letters per script in a text."""
    scripts = Counter()
    for c in text:
        if unicodedata.category(c)[0] == 'L':
            scripts[get_script(c)] += 1
    return dict(scripts)

# Detect mixed scripts (potential homoglyph attack)
print(analyze_scripts("admin"))
# {'LATIN': 5}
print(analyze_scripts("\u0430dmin"))  # Cyrillic 'а' + Latin 'dmin'
# {'CYRILLIC': 1, 'LATIN': 4}  (mixed scripts: treat as a warning)
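The per-script counts can be reduced to a yes/no check. The sketch below is self-contained and uses a slightly different heuristic from the one above: it takes the first word of the character name as the script label. Both letter_scripts and is_mixed_script are hypothetical helper names, and neither consults the real Unicode Script property.

```python
import unicodedata

def letter_scripts(text: str) -> set[str]:
    """Collect a rough script label for every letter in the text.

    Assumption: the first word of the Unicode character name stands in
    for the script. This is a heuristic, not the Script property.
    """
    labels = set()
    for ch in text:
        if unicodedata.category(ch).startswith('L'):
            name = unicodedata.name(ch, '')
            labels.add(name.split(' ')[0] if name else 'UNKNOWN')
    return labels

def is_mixed_script(text: str) -> bool:
    """True when the letters in the text carry more than one script label."""
    return len(letter_scripts(text)) > 1

print(is_mixed_script("admin"))
print(is_mixed_script("\u0430dmin"))  # Cyrillic 'а' followed by Latin 'dmin'
```

Mixed scripts are not always suspicious: ordinary Japanese text combines Hiragana, Katakana, and CJK ideographs, so a check like this works best on short identifiers such as usernames or domain labels.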
Handling Combining Characters Correctly
def grapheme_length(text: str) -> int:
    """Approximate the number of user-perceived characters.

    Counts every codepoint that is not a combining mark, which is only
    a rough stand-in for real grapheme cluster segmentation.
    """
    count = 0
    for c in text:
        cat = unicodedata.category(c)
        if cat[0] != 'M':  # Not a combining mark
            count += 1
    return count

# "é" as e + combining accent
decomposed = "e\u0301"
print(len(decomposed))              # 2 codepoints
print(grapheme_length(decomposed))  # 1 grapheme

# Flag emoji: two regional indicator symbols
flag = "🇺🇸"
print(len(flag))              # 2 codepoints (4 UTF-16 code units, 8 UTF-8 bytes)
print(grapheme_length(flag))  # 2, although readers see a single flag
# Note: proper grapheme segmentation needs the regex or icu libraries
Stripping Marks for Normalization
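def strip_marks(text: str) -> str:
    """Remove non-spacing marks (Mn), such as accents, from text."""
    # NFKD also applies compatibility mappings (for example 'ﬁ' becomes 'fi');
    # use 'NFD' if only canonical decomposition is wanted.
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if unicodedata.category(c) != 'Mn')

print(strip_marks("Ñoño résumé naïve"))
# "Nono resume naive"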
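The choice of normalization form changes what else gets rewritten. A minimal sketch, assuming a hypothetical helper strip_accents with a selectable form: NFD only splits canonical compositions, while NFKD also replaces compatibility characters such as ligatures and superscripts.

```python
import unicodedata

def strip_accents(text: str, form: str = 'NFD') -> str:
    """Decompose with the given form, drop Mn marks, then recompose with NFC."""
    decomposed = unicodedata.normalize(form, text)
    kept = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return unicodedata.normalize('NFC', kept)

sample = "r\u00e9sum\u00e9 \ufb01x\u00b2"  # résumé, then the fi ligature, x, superscript two

print(strip_accents(sample, 'NFD'))   # ligature and superscript are preserved
print(strip_accents(sample, 'NFKD'))  # ligature and superscript become plain ASCII
```

Letters such as 'ø' have no decomposition at all, so neither form turns them into 'o'; a separate mapping is needed if that is the goal.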
Emoji Detection via Categories
Emojis span multiple Unicode categories, making detection tricky:
def is_emoji_like(c: str) -> bool:
    """Check if a character is likely an emoji or pictograph."""
    cat = unicodedata.category(c)
    if cat == 'So':  # Symbol, Other: includes many emojis
        cp = ord(c)
        # Common emoji ranges
        return (0x1F300 <= cp <= 0x1F9FF or  # Misc Symbols & Pictographs through Supplemental Symbols & Pictographs
                0x2600 <= cp <= 0x26FF or    # Misc Symbols
                0x2700 <= cp <= 0x27BF or    # Dingbats
                0x1FA00 <= cp <= 0x1FA6F or  # Chess Symbols
                0x1FA70 <= cp <= 0x1FAFF)    # Symbols and Pictographs Extended-A
    return False

# For production emoji detection, use the `emoji` library
text = "Hello 🌍 World 🎉"
emojis = [c for c in text if is_emoji_like(c)]
print(emojis)  # ['🌍', '🎉']
Performance: Category Lookups at Scale
unicodedata.category() is implemented in C and is fast, but for processing millions of characters, caching or batch approaches help:
import timeit

# Pre-build a set of all characters in desired categories
# (a one-time cost: this scans all 1,114,112 codepoints)
_LETTER_CHARS = frozenset(
    chr(cp) for cp in range(0x110000)
    if unicodedata.category(chr(cp))[0] == 'L'
)

def is_letter_cached(c: str) -> bool:
    return c in _LETTER_CHARS

def is_letter_lookup(c: str) -> bool:
    return unicodedata.category(c)[0] == 'L'

# Compare the cached set against a direct lookup on repeated checks
text = "Hello世界مرحبا" * 10000
t_cached = timeit.timeit(lambda: [c for c in text if is_letter_cached(c)], number=10)
t_lookup = timeit.timeit(lambda: [c for c in text if is_letter_lookup(c)], number=10)
print(f"Cached: {t_cached:.3f}s, Lookup: {t_lookup:.3f}s")
# Cached set membership was ~2-3x faster here; results vary by Python version and input
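If building a set over the whole code space is too much up-front work or memory, a lazier option is to memoize lookups only for the characters that actually occur. The sketch below wraps the lookup in functools.lru_cache; major_category and count_letters are hypothetical names, and whether this beats a direct call is something to measure rather than assume.

```python
import unicodedata
from functools import lru_cache

@lru_cache(maxsize=None)
def major_category(ch: str) -> str:
    """Return the one-letter major class, remembering each character seen."""
    return unicodedata.category(ch)[0]

def count_letters(text: str) -> int:
    return sum(1 for ch in text if major_category(ch) == 'L')

text = "Hello\u4e16\u754c" * 1000
print(count_letters(text))
print(major_category.cache_info())  # hits, misses, and current cache size
```

The cache grows with the number of distinct characters in the input, not with the size of Unicode, which suits text drawn from a small alphabet.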
Regex Unicode Categories
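Python’s built-in re module is Unicode-aware for classes such as \w, \d, and \s, but it does not support \p{} property escapes. The third-party regex module does:
# In re, \w matches Unicode word characters (letters, digits, and underscore)
# For specific categories, use unicodedata or the regex module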
import regex # pip install regex
# The regex module supports \p{} Unicode property escapes
letters_only = regex.compile(r'[\p{L}\p{Zs}]+')
print(letters_only.findall("Hello, 世界! 42 café"))
# ['Hello', ' 世界', ' ', ' café']
# Filter by specific category
digits = regex.compile(r'\p{Nd}+')
print(digits.findall("Price: ¥100 or ٤٢٠"))
# ['100', '٤٢٠']
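When installing a third-party package is not an option, similar runs can be extracted with the standard library alone. This is a sketch, not a drop-in replacement for \p{}: it groups consecutive characters with itertools.groupby using a category test. The function name runs_in_categories and its parameters are assumptions made for illustration.

```python
import unicodedata
from itertools import groupby

def runs_in_categories(text: str, majors=frozenset(), exact=frozenset()) -> list[str]:
    """Return maximal runs of characters whose category is allowed.

    majors holds one-letter classes such as 'L'; exact holds two-letter
    codes such as 'Zs' or 'Nd'.
    """
    def keep(ch: str) -> bool:
        cat = unicodedata.category(ch)
        return cat in exact or cat[0] in majors

    return [''.join(group) for matched, group in groupby(text, key=keep) if matched]

print(runs_in_categories("Hello, \u4e16\u754c! 42 caf\u00e9", majors={'L'}, exact={'Zs'}))
print(runs_in_categories("Price: \u00a5100 or \u0664\u0662\u0660", exact={'Nd'}))
```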
Edge Cases and Gotchas
Private Use Area (Co). Codepoints U+E000–U+F8FF, along with the two supplementary private use planes (U+F0000–U+FFFFD and U+100000–U+10FFFD), have no standard meaning. Some fonts map custom glyphs here. Category filtering should explicitly handle or reject ‘Co’ characters.
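One way to handle ‘Co’ explicitly is to report where such characters occur so the caller can decide whether to reject the input. The helper names below are hypothetical.

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    return unicodedata.category(ch) == 'Co'

def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, codepoint label) for every private use character."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_private_use(ch)]

print(find_private_use("ok\ue000\U000f0000"))
# [(2, 'U+E000'), (3, 'U+F0000')]
```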
Unassigned (Cn). Future Unicode versions may assign meaning to currently unassigned codepoints. Code that rejects ‘Cn’ today may need updating.
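Because the answer depends on the Unicode data bundled with the interpreter, it helps to log unicodedata.unidata_version next to any ‘Cn’ check. The sketch below uses U+FFFF, a noncharacter that is permanently ‘Cn’, as a stable sample; unassigned_positions is a hypothetical helper name.

```python
import unicodedata

# Version of the Unicode Character Database this interpreter ships with.
print(unicodedata.unidata_version)

def unassigned_positions(text: str) -> list[int]:
    """Return the indexes of characters that this Unicode version leaves unassigned."""
    return [i for i, ch in enumerate(text) if unicodedata.category(ch) == 'Cn']

print(unassigned_positions("a\uffffb"))
```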
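Surrogates (Cs). Narrow Python builds, in which supplementary characters appeared as surrogate pairs, were removed in Python 3.3, so a str now stores every codepoint directly. Lone surrogates can still enter a string through escapes such as '\ud800' or through the surrogateescape error handler. unicodedata.category() returns ‘Cs’ for them, and they cannot be encoded as well-formed UTF-8.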
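A short sketch of how lone surrogates show up in practice and how a category check can catch them before encoding fails. The helper names are hypothetical; the surrogateescape handler is part of the standard codec machinery.

```python
import unicodedata

lone = '\ud800'
print(unicodedata.category(lone))

def has_lone_surrogate(text: str) -> bool:
    return any(unicodedata.category(ch) == 'Cs' for ch in text)

def is_encodable_utf8(text: str) -> bool:
    """True when the text can be encoded as strict UTF-8."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True

# Undecodable bytes become lone surrogates under surrogateescape.
raw = b'caf\xff'
decoded = raw.decode('utf-8', errors='surrogateescape')
print(has_lone_surrogate(decoded), is_encodable_utf8(decoded))

# The same handler restores the original bytes on the way out.
print(decoded.encode('utf-8', errors='surrogateescape') == raw)
```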
Format characters (Cf). Characters like the soft hyphen (U+00AD) and bidirectional marks (U+200E, U+200F) are ‘Cf’ and invisible in most contexts but affect rendering. Don’t strip them blindly from bidirectional text.
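One cautious approach is to strip ‘Cf’ characters while keeping an explicit allowlist. The sketch below assumes a small allowlist of directional marks and joiners; which characters belong on it depends on the text being handled, so the KEEP_FORMAT set and the strip_format_chars name are illustrative only.

```python
import unicodedata

# Assumed allowlist: left-to-right mark, right-to-left mark,
# zero width non-joiner, zero width joiner.
KEEP_FORMAT = frozenset({'\u200e', '\u200f', '\u200c', '\u200d'})

def strip_format_chars(text: str, keep=KEEP_FORMAT) -> str:
    """Drop Cf characters except those on the allowlist."""
    return ''.join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != 'Cf'
    )

print(strip_format_chars('co\u00adop'))              # soft hyphen removed
print(strip_format_chars('a\u200fb') == 'a\u200fb')  # right-to-left mark kept
```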
One Thing to Remember
Unicode categories are the most reliable way to classify characters across all writing systems — use unicodedata.category() for precise control, the regex module’s \p{} for pattern matching, and cached sets for high-throughput processing.