Python Unicode Categories — Deep Dive

Master Unicode General Categories in Python — build category-aware text processors, detect script mixing, handle edge cases across writing systems, and optimize for performance.

Unicode’s General Category property is the backbone of script-aware text processing. This deep dive explores every category in detail, builds practical tools that leverage categories, handles real-world edge cases, and optimizes for production workloads.

The Complete Category Table

Python’s unicodedata.category() returns one of 30 two-letter codes:

import unicodedata
from collections import Counter

# Count categories across the Basic Multilingual Plane (U+0000–U+FFFF)
category_counts = Counter()
for codepoint in range(0x10000):
    # chr() accepts every value in this range, including surrogates,
    # so no exception handling is needed
    category_counts[unicodedata.category(chr(codepoint))] += 1

for cat, count in category_counts.most_common(10):
    print(f"{cat}: {count:,} characters")
# Lo (other letter): the largest group by far (CJK ideographs, Hangul syllables)
# Co (private use): 6,400
# Cs (surrogate): 2,048
# Cn (unassigned): varies by Unicode version
# ...
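
The first letter of each code names a major class, so the same scan can be rolled up into seven buckets. The following is a minimal sketch; the MAJOR_CLASSES table and the major_class_counts helper are illustrative names, not part of unicodedata.

```python
import unicodedata
from collections import Counter

# Readable names for the first letter of each two-letter category code
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def major_class_counts(start: int = 0, stop: int = 0x10000) -> Counter:
    """Count code points in [start, stop) by major category letter."""
    counts = Counter()
    for codepoint in range(start, stop):
        counts[unicodedata.category(chr(codepoint))[0]] += 1
    return counts

bmp = major_class_counts()
for letter, count in bmp.most_common():
    print(f"{letter} ({MAJOR_CLASSES[letter]}): {count:,}")
```

The exact totals depend on the Unicode version bundled with the interpreter, which unicodedata.unidata_version reports.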

Letter Categories (L*)

examples = {
    'Lu': 'A',         # Uppercase letter
    'Ll': 'a',         # Lowercase letter
    'Lt': 'ǅ',        # Titlecase letter (digraphs like Dž)
    'Lm': 'ʰ',        # Modifier letter (superscript h)
    'Lo': 'あ',        # Other letter (CJK, Hiragana, etc.)
}
for cat, char in examples.items():
    assert unicodedata.category(char) == cat
    print(f"{cat}: {char} ({unicodedata.name(char)})")

Important: CJK characters are ‘Lo’ (other letter), not ‘Lu’ or ‘Ll’. Case-based logic doesn’t apply to most of the world’s writing systems.
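
A quick way to see this is to ask the built-in case methods about caseless letters. The sketch below uses a hypothetical describe_case helper; for ‘Lo’ characters every answer should come back negative.

```python
import unicodedata

def describe_case(char: str) -> dict:
    """Report what the built-in case methods say about one character."""
    return {
        "category": unicodedata.category(char),
        "isupper": char.isupper(),
        "islower": char.islower(),
        "upper_changes": char.upper() != char,
        "lower_changes": char.lower() != char,
    }

# Latin letters have case; the CJK, Hiragana, and Devanagari letters do not
for ch in ["A", "a", "世", "あ", "क"]:
    print(ch, describe_case(ch))
```

Code that validates input with isupper() or islower() therefore rejects caseless scripts entirely; checking the major class ‘L’ is usually the safer test for "is this a letter".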

Mark Categories (M*)

Marks attach to the preceding base character. The three subcategories are Mn (non-spacing), Mc (spacing combining), and Me (enclosing):

# Combining acute accent
combining = '\u0301'
print(unicodedata.category(combining))  # 'Mn' (non-spacing mark)

# The mark modifies the preceding character
print('e' + combining)  # é (visually)
print(len('e' + combining))  # 2 (two codepoints)
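
As a small illustration, the sketch below prints one example from each mark subcategory and shows how NFC normalization folds a base letter plus mark into a single precomposed codepoint. The three sample characters are chosen for illustration only.

```python
import unicodedata

# One example per mark subcategory
marks = {
    "\u0301": "Mn",  # COMBINING ACUTE ACCENT (non-spacing)
    "\u093e": "Mc",  # DEVANAGARI VOWEL SIGN AA (spacing combining)
    "\u20dd": "Me",  # COMBINING ENCLOSING CIRCLE (enclosing)
}
for mark in marks:
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}: {unicodedata.category(mark)}")

decomposed = "e\u0301"                               # two codepoints
composed = unicodedata.normalize("NFC", decomposed)  # one codepoint, U+00E9
print(len(decomposed), len(composed))  # 2 1
print(decomposed == composed)          # False: normalize both sides before comparing
```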

Number Categories (N*)

# Nd: decimal digits (positional value 0-9 in any script)
digits = '0٠०๐𝟎'  # ASCII, Arabic-Indic, Devanagari, Thai, Math bold
for d in digits:
    cat = unicodedata.category(d)
    val = unicodedata.decimal(d, -1)
    print(f"{d} → {cat}, decimal value: {val}")

# Nl: letter-like numbers
print(unicodedata.category('Ⅳ'))   # 'Nl' (Roman numeral four)
print(unicodedata.numeric('Ⅳ'))    # 4.0

# No: other numbers (fractions, superscripts)
print(unicodedata.category('½'))   # 'No'
print(unicodedata.numeric('½'))    # 0.5
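
The three number categories line up closely with the str predicates: isdecimal() is true for Nd only, while isdigit() and isnumeric() accept progressively more. The sketch below prints the predicates for a few samples and adds a hypothetical parse_int_any_script helper that relies on int() accepting Nd digits from any script.

```python
import unicodedata

samples = ["7", "٧", "²", "½", "Ⅳ"]
for ch in samples:
    print(ch, unicodedata.category(ch),
          ch.isdecimal(), ch.isdigit(), ch.isnumeric())

def parse_int_any_script(text: str) -> int:
    """Parse an integer written with Nd digits from any script."""
    if not text.isdecimal():
        raise ValueError(f"not a run of decimal digits: {text!r}")
    return int(text)  # int() understands every Nd digit

print(parse_int_any_script("٤٢"))  # 42
```

Superscripts, fractions, and Roman numerals fail the isdecimal() guard, which matches what int() itself can parse.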

Building a Category-Based Text Processor

import unicodedata

class CategoryFilter:
    """Filter and transform text based on Unicode categories."""

    def __init__(self, allowed_categories: set[str] | None = None,
                 allowed_major: set[str] | None = None):
        self.allowed_categories = allowed_categories
        self.allowed_major = allowed_major or set()

    def filter(self, text: str) -> str:
        result = []
        for c in text:
            cat = unicodedata.category(c)
            if self.allowed_categories and cat in self.allowed_categories:
                result.append(c)
            elif cat[0] in self.allowed_major:
                result.append(c)
        return ''.join(result)

    @staticmethod
    def replace_category(text: str, category: str, replacement: str) -> str:
        """Replace all characters of a given category."""
        return ''.join(
            replacement if unicodedata.category(c) == category else c
            for c in text
        )

# Keep only letters and spaces
letters_spaces = CategoryFilter(
    allowed_categories={'Zs'},
    allowed_major={'L'}
)
print(letters_spaces.filter("Hello, World! 123 café"))
# "Hello World  café"

# Replace all currency symbols with $
text = "€100, £50, ¥1000"
print(CategoryFilter.replace_category(text, 'Sc', '$'))
# "$100, $50, $1000"

Script Detection and Mixed-Script Analysis

from collections import Counter

def get_script(char: str) -> str:
    """Guess a character's script from its Unicode name (a rough heuristic, not the Script property)."""
    try:
        name = unicodedata.name(char)
    except ValueError:
        return 'UNKNOWN'

    # Common script prefixes in character names
    for script in ['LATIN', 'CYRILLIC', 'GREEK', 'ARABIC', 'HEBREW',
                   'CJK', 'HIRAGANA', 'KATAKANA', 'HANGUL', 'DEVANAGARI',
                   'THAI', 'GEORGIAN', 'ARMENIAN', 'ETHIOPIC']:
        if script in name:
            return script
    return 'COMMON'

def analyze_scripts(text: str) -> dict[str, int]:
    """Count characters per script in a text."""
    scripts = Counter()
    for c in text:
        if unicodedata.category(c)[0] == 'L':
            scripts[get_script(c)] += 1
    return dict(scripts)

# Detect mixed scripts (potential homoglyph attack)
print(analyze_scripts("admin"))
# {'LATIN': 5}

print(analyze_scripts("\u0430dmin"))  # Cyrillic 'а' + Latin 'dmin'
# {'CYRILLIC': 1, 'LATIN': 4} — mixed script warning!
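
Turning the counts into a yes/no check is a small step. The sketch below is self-contained and uses hypothetical helpers (script_of, letter_scripts, is_mixed_script); it treats the first word of the character name as the script, which is an assumption that works for common scripts but is not the real Unicode Script property.

```python
import unicodedata

def script_of(char: str) -> str:
    """Hypothetical helper: use the first word of the character name as the script."""
    try:
        return unicodedata.name(char).split()[0]
    except ValueError:  # unnamed code point
        return "UNKNOWN"

def letter_scripts(text: str) -> set[str]:
    """Collect the scripts of all letters, ignoring digits and punctuation."""
    return {script_of(c) for c in text if unicodedata.category(c)[0] == "L"}

def is_mixed_script(text: str) -> bool:
    return len(letter_scripts(text)) > 1

print(is_mixed_script("admin"))        # False
print(is_mixed_script("\u0430dmin"))   # True: Cyrillic 'а' followed by Latin 'dmin'
```

Legitimate text mixes scripts all the time (for example a Latin brand name inside a Japanese sentence), so a check like this is better suited to short identifiers such as usernames and domain labels than to free text.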

Handling Combining Characters Correctly

def grapheme_length(text: str) -> int:
    """Approximate the number of user-perceived characters by skipping combining marks."""
    count = 0
    for c in text:
        cat = unicodedata.category(c)
        if cat[0] != 'M':  # Not a combining mark
            count += 1
    return count

# "é" as e + combining accent
decomposed = "e\u0301"
print(len(decomposed))              # 2 codepoints
print(grapheme_length(decomposed))  # 1 grapheme

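# Flag emoji: two regional indicator symbols (category 'So')
flag = "🇺🇸"
print(len(flag))              # 2 codepoints (4 code units in UTF-16)
print(grapheme_length(flag))  # 2, although it renders as a single flag
# Proper grapheme segmentation needs the third-party regex module (\X) or ICU bindings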
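
When a third-party dependency is not an option, the mark-skipping idea can be stretched a little further. The sketch below is an approximation under stated assumptions, not an implementation of the Unicode segmentation rules: approx_grapheme_count and its helpers are hypothetical names, and the function only knows about combining marks, zero width joiners, skin tone modifiers, and pairs of regional indicators.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, glues emoji into one cluster

def _is_regional_indicator(c: str) -> bool:
    return 0x1F1E6 <= ord(c) <= 0x1F1FF

def _is_skin_tone(c: str) -> bool:
    return 0x1F3FB <= ord(c) <= 0x1F3FF  # emoji modifiers, category 'Sk'

def approx_grapheme_count(text: str) -> int:
    """Rough count of user-perceived characters (assumption-laden heuristic)."""
    count = 0
    prev = ""
    ri_run = 0  # regional indicators seen so far in the current run
    for c in text:
        extends = (
            unicodedata.category(c)[0] == "M"   # combining mark
            or c == ZWJ                         # joiner itself
            or _is_skin_tone(c)                 # modifier on previous emoji
            or prev == ZWJ                      # joined to previous cluster
            or (_is_regional_indicator(c) and ri_run % 2 == 1)  # second half of a flag
        )
        if not extends or not prev:  # the first codepoint always starts a cluster
            count += 1
        ri_run = ri_run + 1 if _is_regional_indicator(c) else 0
        prev = c
    return count

print(approx_grapheme_count("e\u0301"))                  # 1
print(approx_grapheme_count("\U0001F1FA\U0001F1F8"))     # 1 (US flag)
```

Hangul jamo sequences, Indic conjuncts, and other cases covered by the full segmentation rules are not handled, so treat the result as an estimate.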

Stripping Marks for Normalization

def strip_marks(text: str) -> str:
    """Remove non-spacing marks (Mn) after NFKD decomposition."""
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if unicodedata.category(c) != 'Mn')

print(strip_marks("Ñoño résumé naïve"))
# "Nono resume naive"
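
The choice of normalization form matters here. NFKD also applies compatibility mappings, so ligatures and circled digits are rewritten along with the accents, while NFD leaves them alone. The sketch below compares the two with a hypothetical strip_marks_with helper; it also shows that letters without a decomposition, such as Ł, survive either way.

```python
import unicodedata

def strip_marks_with(form: str, text: str) -> str:
    """Decompose with the given normalization form, then drop Mn marks."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

sample = "ﬁancé ② Łódź"
print(strip_marks_with("NFD", sample))   # ﬁance ② Łodz
print(strip_marks_with("NFKD", sample))  # fiance 2 Łodz
```

NFKD suits search keys and slugs, where folding look-alike forms is the goal; NFD is the gentler choice when the rest of the text should stay untouched.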

Emoji Detection via Categories

Emojis span multiple Unicode categories, making detection tricky:

def is_emoji_like(c: str) -> bool:
    """Check if a character is likely an emoji or pictograph."""
    cat = unicodedata.category(c)
    if cat == 'So':  # Symbol, Other: the category of most pictographic emoji
        cp = ord(c)
        # Common emoji ranges (approximate; some non-emoji symbols fall inside)
        return (0x1F300 <= cp <= 0x1F9FF or  # Misc Symbols & Pictographs through Supplemental Symbols
                0x2600 <= cp <= 0x26FF or     # Miscellaneous Symbols
                0x2700 <= cp <= 0x27BF or     # Dingbats
                0x1FA00 <= cp <= 0x1FA6F or   # Chess Symbols
                0x1FA70 <= cp <= 0x1FAFF)     # Symbols and Pictographs Extended-A
    return False

# For production emoji detection, use the `emoji` library
text = "Hello 🌍 World 🎉"
emojis = [c for c in text if is_emoji_like(c)]
print(emojis)  # ['🌍', '🎉']
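
To see why a single-category test falls short, it helps to list the categories of the pieces that make up real emoji sequences. The sketch below is illustrative only; categories_of is a hypothetical helper and the sample sequences are written as escapes so the components are visible.

```python
import unicodedata

def categories_of(sequence: str) -> list[str]:
    """Return the General Category of every codepoint in a sequence."""
    return [unicodedata.category(c) for c in sequence]

# Thumbs up + skin tone modifier: a symbol followed by a modifier symbol
print(categories_of("\U0001F44D\U0001F3FD"))   # ['So', 'Sk']

# Keycap '#': punctuation + variation selector + enclosing mark
print(categories_of("#\ufe0f\u20e3"))          # ['Po', 'Mn', 'Me']

# Family: three pictographs glued together with zero width joiners
print(categories_of("\U0001F468\u200d\U0001F469\u200d\U0001F467"))
# ['So', 'Cf', 'So', 'Cf', 'So']
```

A per-character ‘So’ check finds the pictographs but silently drops the modifiers, joiners, and selectors that belong to the same emoji.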

Performance: Category Lookups at Scale

unicodedata.category() is implemented in C and is fast, but for processing millions of characters, caching or batch approaches help:

import timeit
import unicodedata

# Pre-build a set of all characters in the desired categories
# (a one-time cost: this scans all 1,114,112 code points)
_LETTER_CHARS = frozenset(
    chr(cp) for cp in range(0x110000)
    if unicodedata.category(chr(cp))[0] == 'L'
)

def is_letter_cached(c: str) -> bool:
    return c in _LETTER_CHARS

def is_letter_lookup(c: str) -> bool:
    return unicodedata.category(c)[0] == 'L'

# Compare the two approaches on repeated checks
text = "Hello世界مرحبا" * 10000
t_cached = timeit.timeit(lambda: [c for c in text if is_letter_cached(c)], number=10)
t_lookup = timeit.timeit(lambda: [c for c in text if is_letter_lookup(c)], number=10)
print(f"Cached: {t_cached:.3f}s, Lookup: {t_lookup:.3f}s")
# Set membership is often faster in hot loops, but the ratio depends on the
# Python version and the text, so measure on your own workload

Regex Unicode Categories

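Python’s built-in re module offers Unicode-aware classes such as \w and \d, but it has no \p{...} property escapes; the third-party regex module adds them:

# In re, \w matches Unicode word characters (roughly letters, numbers, and underscore)
# For specific categories, use unicodedata or the regex module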

import regex  # pip install regex

# The regex module supports \p{} Unicode property escapes
letters_only = regex.compile(r'[\p{L}\p{Zs}]+')
print(letters_only.findall("Hello, 世界! 42 café"))
# ['Hello', ' 世界', ' ', ' café']

# Filter by specific category
digits = regex.compile(r'\p{Nd}+')
print(digits.findall("Price: ¥100 or ٤٢٠"))
# ['100', '٤٢٠']
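
If installing regex is not possible, the standard library covers a fair amount of ground. The sketch below shows what re's \w and \d already match, plus a hypothetical category_runs helper that emulates a pattern like \p{Nd}+ for any set of categories using itertools.groupby.

```python
import re
import unicodedata
from itertools import groupby

def category_runs(text: str, categories: set[str]) -> list[str]:
    """Return maximal runs of characters whose category is in `categories`."""
    runs = []
    in_set = lambda c: unicodedata.category(c) in categories
    for matches, group in groupby(text, key=in_set):
        if matches:
            runs.append("".join(group))
    return runs

price = "Price: ¥100 or ٤٢٠"
print(category_runs(price, {"Nd"}))          # ['100', '٤٢٠']
print(re.findall(r"\d+", price))             # ['100', '٤٢٠'] since \d means Nd for str patterns
print(re.findall(r"\w+", "café_42 世界"))    # ['café_42', '世界']
print(category_runs("€100, £50", {"Sc"}))    # ['€', '£']
```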

Edge Cases and Gotchas

Private Use Area (Co). Codepoints U+E000–U+F8FF, along with planes 15 and 16 (U+F0000–U+FFFFD and U+100000–U+10FFFD), have no standard meaning. Some fonts map custom glyphs here. Category filtering should explicitly handle or reject ‘Co’ characters.

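Unassigned (Cn). Future Unicode versions may assign meaning to currently unassigned codepoints, and the answer you get depends on the Unicode database bundled with your Python (see unicodedata.unidata_version). Code that rejects ‘Cn’ today may need updating.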

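Surrogates (Cs). Since Python 3.3 (PEP 393) there are no narrow builds, so supplementary characters are always single codepoints in a str. Lone surrogates (U+D800–U+DFFF) can still end up in a string, for example through the surrogateescape error handler or escaped JSON input. unicodedata.category() returns ‘Cs’ for them, and they cannot be encoded as well-formed UTF-8.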

Format characters (Cf). Characters like the soft hyphen (U+00AD) and bidirectional marks (U+200E, U+200F) are ‘Cf’ and invisible in most contexts but affect rendering. Don’t strip them blindly from bidirectional text.
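
A lightweight way to act on these cases is to report such characters instead of silently dropping them. The sketch below is one possible shape for that; REVIEW_CATEGORIES and audit_other_categories are hypothetical names, and what to do with each finding is left to the caller.

```python
import unicodedata

# Categories from the 'Other' class that usually deserve a second look
REVIEW_CATEGORIES = {
    "Co": "private use",
    "Cn": "unassigned",
    "Cs": "surrogate",
    "Cf": "format",
}

def audit_other_categories(text: str) -> list[tuple[int, str, str]]:
    """Return (index, codepoint label, category) for each character to review."""
    findings = []
    for index, char in enumerate(text):
        cat = unicodedata.category(char)
        if cat in REVIEW_CATEGORIES:
            findings.append((index, f"U+{ord(char):04X}", cat))
    return findings

# LEFT-TO-RIGHT MARK, a private use codepoint, a lone surrogate, a noncharacter
sample = "a\u200eb\ue000c\ud800\uffff"
for index, label, cat in audit_other_categories(sample):
    print(index, label, cat, REVIEW_CATEGORIES[cat])
```

Reporting positions and codepoint labels rather than the characters themselves also avoids printing lone surrogates, which cannot be encoded to UTF-8.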

One Thing to Remember

Unicode categories are the most reliable way to classify characters across all writing systems — use unicodedata.category() for precise control, the regex module’s \p{} for pattern matching, and cached sets for high-throughput processing.
