Python Unicode and Encoding — Deep Dive

Dive into CPython's internal string representation, codec machinery, normalization forms, and strategies for bulletproof Unicode handling in production.

Unicode handling in Python 3 is clean at the surface, but the internals reveal careful engineering decisions about memory, performance, and compatibility. This deep dive covers CPython’s string representation, the codec system, normalization, and production-grade patterns.

CPython’s Internal String Representation

Since PEP 393 (Python 3.3), CPython uses a flexible string representation that adapts storage per string based on the highest code point it contains:

Kind	Bytes per char	Used when
Latin-1	1	All chars ≤ U+00FF
UCS-2	2	Any char in U+0100 – U+FFFF
UCS-4	4	Any char > U+FFFF (emojis, rare scripts)

import sys

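# Exact totals depend on the CPython version and build; the relative order is what matters.
sys.getsizeof("hello")     # smallest: compact ASCII, 1 byte/char + a small header
sys.getsizeof("héllo")     # larger: still 1 byte/char (é is U+00E9), but non-ASCII strings carry a bigger header
sys.getsizeof("中ello")    # larger again: UCS-2, 2 bytes/char
sys.getsizeof("🐍ello")   # largest: UCS-4, 4 bytes/char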

Key implication: A single emoji in a long string forces the entire string to UCS-4 representation, quadrupling memory compared to ASCII-only text. When processing large text corpora, stripping or separating emoji-containing strings can significantly reduce memory usage.
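
One way to check the per-character cost on your own interpreter, without relying on absolute sizes, is to compare two strings of different lengths built from the same character. This is a minimal sketch: the helper name `bytes_per_char` is an illustrative assumption, and the result only reflects CPython's PEP 393 layout.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character by differencing two string sizes.

    Subtracting the sizes cancels the fixed object header, so the result
    does not depend on the header size of a particular CPython version.
    """
    short = ch * n
    long = ch * (2 * n)
    return (sys.getsizeof(long) - sys.getsizeof(short)) / n

for ch in ("a", "é", "中", "🐍"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch):.0f} byte(s) per char")
```

The widest character in a string sets the width for every character in it, which is why one emoji in a long ASCII string is so costly.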

The Compact ASCII Optimization

Strings containing only ASCII characters get a special compact representation where the UTF-8 view is the same as the internal data. This avoids double storage for the most common case.

The Codec System

Python’s encoding/decoding infrastructure is built on a pluggable codec registry.

How Codecs Are Registered

import codecs

# Look up a codec by name
info = codecs.lookup("utf-8")
print(info.name)         # 'utf-8'
print(type(info.encode)) # <class 'builtin_function_or_method'>

Python normalizes codec names: "utf-8", "UTF8", "utf_8" all resolve to the same codec.
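
The name normalization can be checked directly. The sketch below looks up several spellings and collects the canonical names the registry reports; the list of spellings is just an illustration.

```python
import codecs

# Different spellings of the same encoding name.
spellings = ["utf-8", "UTF8", "utf_8", "UTF-8"]

# CodecInfo.name holds the canonical name chosen by the codec itself.
canonical = {codecs.lookup(s).name for s in spellings}
print(canonical)

# An unknown name raises LookupError rather than returning None.
try:
    codecs.lookup("no-such-codec")
except LookupError as exc:
    print("lookup failed:", exc)
```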

Custom Codecs

You can register your own codecs:

import codecs

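def rot13_encode(text, errors="strict"):
    # Codec functions are called as (input, errors) and must return (output, length consumed).
    # str.encode() only accepts codecs that produce bytes, so encode the rotated text as ASCII.
    rotated = codecs.encode(text, "rot_13")
    return (rotated.encode("ascii", errors), len(text))

def rot13_decode(data, errors="strict"):
    text = bytes(data).decode("ascii", errors)
    return (codecs.decode(text, "rot_13"), len(data))

class Rot13Codec(codecs.Codec):
    encode = staticmethod(rot13_encode)
    decode = staticmethod(rot13_decode)

def find_rot13(name):
    # The registry passes a normalized name: lowercased, and since Python 3.9
    # with hyphens converted to underscores.
    if name in ("rot13-custom", "rot13_custom"):
        return codecs.CodecInfo(
            encode=Rot13Codec.encode,
            decode=Rot13Codec.decode,
            name="rot13-custom",
        )
    return None

codecs.register(find_rot13)
"hello".encode("rot13-custom")     # b'uryyb'
b"uryyb".decode("rot13-custom")    # 'hello'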

Incremental Codecs

For streaming, use incremental encoders/decoders that handle partial input:

import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed partial UTF-8 bytes (é = \xc3\xa9)
result = decoder.decode(b"caf\xc3")   # "caf" — waits for more bytes
result += decoder.decode(b"\xa9!")     # "é!" — completes the character
# Total: "café!"

This is essential for network protocols and streaming file readers where you receive data in arbitrary chunks.
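
The same idea can be wrapped in a small generator. This is a sketch under the assumption that chunks arrive as an iterable of `bytes`; the name `decode_stream` is hypothetical. The final call with `final=True` matters: it makes a stream that ends in the middle of a character raise instead of silently dropping the incomplete bytes.

```python
import codecs
from typing import Iterable, Iterator

def decode_stream(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode arbitrarily split byte chunks into text pieces."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:  # a chunk may contain only part of a character
            yield text
    # Flush: raises UnicodeDecodeError if bytes are still pending.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

print("".join(decode_stream([b"caf\xc3", b"\xa9!"])))
```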

Error Handling Strategies

Beyond the standard error handlers, Python provides several specialized options:

text = "Price: ¥500 — item #①"

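# surrogateescape: lossless roundtrip for undecodable bytes
# (used by os.fsdecode/fsencode for filenames).
# On decode, each undecodable byte becomes a lone surrogate (U+DC80 to U+DCFF);
# on encode, only those surrogates are turned back into bytes. It does not
# rescue ordinary non-ASCII characters such as ¥, which still raise UnicodeEncodeError.
raw = b"caf\xe9"
s = raw.decode("ascii", errors="surrogateescape")   # 'caf\udce9'
s.encode("ascii", errors="surrogateescape")          # b'caf\xe9'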

# backslashreplace: Python escape sequences
text.encode("ascii", errors="backslashreplace")
# b'Price: \\xa5500 \\u2014 item #\\u2460'

# namereplace: Unicode character names
text.encode("ascii", errors="namereplace")
# b'Price: \\N{YEN SIGN}500 \\N{EM DASH} item #\\N{CIRCLED DIGIT ONE}'

The surrogateescape Strategy

Unix filenames are bytes, not text. A filename might contain bytes that aren’t valid UTF-8. Python handles this using surrogate escaping:

import os

# If a filename contains byte 0xFF (invalid UTF-8),
# Python represents it as U+DCFF (a surrogate character)
# This allows lossless roundtrip through str:
filename_str = os.fsdecode(b"file\xffname")  # Uses surrogateescape on POSIX (Windows uses a different handler)
os.fsencode(filename_str)  # Back to b"file\xffname" exactly
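
A platform-independent sketch of the same roundtrip, using the UTF-8 codec directly instead of `os.fsdecode`. It also shows the main pitfall: a string carrying escaped bytes cannot be encoded strictly, so it needs a lenient handler before it is logged or displayed. The variable names are illustrative.

```python
raw = b"file\xffname"

# Decode: the invalid byte 0xFF becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encode with the same handler: the original bytes come back unchanged.
restored = name.encode("utf-8", errors="surrogateescape")

# Strict encoding fails, because lone surrogates are not valid UTF-8.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display or logging, escape the surrogate instead of failing.
display = name.encode("utf-8", errors="backslashreplace").decode("utf-8")
print(display)
```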

Unicode Normalization Deep Dive

Unicode defines four normalization forms:

Form	Name	Behavior
NFC	Composed	Combines characters where possible
NFD	Decomposed	Splits characters into base + combining marks
NFKC	Compatibility Composed	NFC + replaces compatibility characters
NFKD	Compatibility Decomposed	NFD + replaces compatibility characters

import unicodedata

# NFC vs NFD
text = "e\u0301"  # e + combining acute accent
unicodedata.normalize("NFC", text)   # "é" (single char U+00E9)
unicodedata.normalize("NFD", text)   # "e\u0301" (stays decomposed)

# NFKC: compatibility normalization
unicodedata.normalize("NFKC", "①")   # "1"
unicodedata.normalize("NFKC", "ﬁ")   # "fi" (ligature expanded)
unicodedata.normalize("NFKC", "½")   # "1⁄2"

When to Use Which Form

NFC for storage and comparison (W3C recommendation)
NFD when you need to inspect combining marks
NFKC for search and identifier matching (the older SASLprep profile, RFC 4013, uses it; the PRECIS username profiles in RFC 8265 specify NFC)
NFKD for full-text search indexing
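
The difference between comparing raw strings and comparing normalized ones can be sketched as follows. The helper names `canonically_equal` and `search_key` are assumptions made for illustration, and combining NFKC with `casefold()` is one reasonable choice for a search key, not the only one.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def search_key(text: str) -> str:
    """Build a lenient matching key: fold compatibility forms, then case."""
    return unicodedata.normalize("NFKC", text).casefold()

composed = "caf\u00e9"      # é as one code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(composed == decomposed)                    # raw comparison
print(len(composed), len(decomposed))            # different lengths
print(canonically_equal(composed, decomposed))   # comparison after NFC
print(search_key("\ufb01le") == search_key("FILE"))  # ligature fi vs plain text
```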

Unicode Security: Confusable Characters

Attackers use characters that look identical to ASCII but are different code points:

# These look the same but aren't
latin_a = "a"           # U+0061
cyrillic_a = "а"        # U+0430

latin_a == cyrillic_a   # False!

# Confusable detection (use the 'confusables' data from Unicode)
# In practice, use the `confusables` or `icu` library

For user-facing identifiers (usernames, domain names), always normalize with NFKC and check against Unicode confusable tables.
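
The standard library does not expose the Unicode script property or the confusables table, so the following is only a rough heuristic, not a replacement for real confusable data. It assumes that the first word of a character's Unicode name (for example `LATIN` or `CYRILLIC`) is a usable stand-in for its script, and flags identifiers that mix more than one. The function names are hypothetical.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in a string from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" for unnamed characters
            if name:
                scripts.add(name.split()[0])
    return scripts

def looks_mixed_script(text: str) -> bool:
    """Flag strings whose letters appear to come from several scripts."""
    return len(scripts_used(text)) > 1

print(looks_mixed_script("paypal"))        # all Latin letters
print(looks_mixed_script("p\u0430ypal"))   # second letter is Cyrillic U+0430
```

A check like this catches the simplest spoofing attempts but misses confusables within a single script, so it works best as an early filter.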

Production Patterns

Bulletproof File Reading

import chardet  # pip install chardet

def read_file_safely(path: str) -> str:
    """Read a text file, detecting encoding if necessary."""
    with open(path, "rb") as f:
        raw = f.read()

    # Check for a UTF-8 BOM first: plain "utf-8" would decode it
    # successfully and leave a stray U+FEFF at the start of the text.
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw.decode("utf-8-sig")

    # UTF-16 with BOM: the "utf-16" codec reads the BOM to pick the byte order
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return raw.decode("utf-16")

    # Try UTF-8 (most common)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Fall back to detection
    detected = chardet.detect(raw)
    encoding = detected["encoding"] or "utf-8"
    return raw.decode(encoding, errors="replace")

Database Unicode Gotchas

MySQL’s utf8 charset only supports 3-byte UTF-8 (no emojis). Use utf8mb4 for full Unicode:

-- MySQL: this breaks on emojis
CREATE TABLE bad (text VARCHAR(255) CHARACTER SET utf8);

-- MySQL: this works for all Unicode
CREATE TABLE good (text VARCHAR(255) CHARACTER SET utf8mb4);

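PostgreSQL's UTF8 server encoding covers all of Unicode, including 4-byte characters such as emojis. The one exception is the NUL character (U+0000), which text columns reject.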
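
On the Python side, a quick pre-flight check can tell whether a string would fit in a legacy 3-byte `utf8` column. This is a sketch with a hypothetical helper name; it relies only on the fact that code points above U+FFFF need four bytes in UTF-8.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character needs a 4-byte UTF-8 sequence."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("東京"))   # BMP characters, 3 bytes each in UTF-8
print(needs_utf8mb4("🗼"))    # U+1F5FC, 4 bytes in UTF-8
```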

JSON and Unicode

Python’s json module defaults to escaping non-ASCII characters:

import json

data = {"city": "東京", "emoji": "🗼"}

# Default: ASCII-safe output
json.dumps(data)
# '{"city": "\\u6771\\u4eac", "emoji": "\\ud83d\\uddfc"}'

# Allow non-ASCII (smaller, more readable)
json.dumps(data, ensure_ascii=False)
# '{"city": "東京", "emoji": "🗼"}'
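
Both forms describe the same data; they differ only in how the text is written out. The sketch below checks the roundtrip and compares the encoded sizes.

```python
import json

data = {"city": "東京", "emoji": "🗼"}

escaped = json.dumps(data)                        # pure ASCII with \uXXXX escapes
readable = json.dumps(data, ensure_ascii=False)   # characters written as-is

# Either form parses back to the original data.
assert json.loads(escaped) == json.loads(readable) == data

# Size on the wire once encoded as UTF-8.
escaped_size = len(escaped.encode("utf-8"))
readable_size = len(readable.encode("utf-8"))
print(escaped_size, readable_size)
```

When writing the `ensure_ascii=False` form to a file, open the file with an explicit `encoding="utf-8"` so the output does not depend on the platform default.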

String Comparison and Sorting

For locale-aware sorting (e.g., German ä sorts near a, not after z):

import locale
# Raises locale.Error if the de_DE.UTF-8 locale is not installed on the system
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")

words = ["Zug", "Ärger", "Apfel"]
sorted(words, key=locale.strxfrm)
# ['Apfel', 'Ärger', 'Zug']  — Ä sorts near A

For more robust Unicode-aware sorting, use the PyICU library which implements the Unicode Collation Algorithm.
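
When neither the locale nor PyICU is available, a rough approximation is to sort on an accent-stripped, case-folded key. This is not the Unicode Collation Algorithm and ignores language-specific rules; it is a fallback sketch, and `rough_sort_key` is a hypothetical name.

```python
import unicodedata

def rough_sort_key(word: str) -> tuple[str, str]:
    """Sort key that ignores accents and case, with the word as tie-breaker."""
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop combining marks (accents) left behind by the decomposition.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

words = ["Zug", "Ärger", "Apfel"]
print(sorted(words))                      # plain code point order
print(sorted(words, key=rough_sort_key))  # accent-insensitive order
```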

Performance Tips

Avoid Repeated Encoding/Decoding

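# BAD: one encode call and one send per line
for line in lines:
    network.send(line.encode("utf-8"))

# BETTER: join first, then encode and send once
# (assumes all lines fit in memory and the receiver expects newline separators)
encoded = "\n".join(lines).encode("utf-8")
network.send(encoded)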

Use memoryview for Large Binary Data

When processing large byte buffers, avoid creating copies:

data = b"..." * 1_000_000
view = memoryview(data)

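# Slice without copying
chunk = view[1000:2000]
# str() decodes directly from the buffer; chunk.tobytes() would copy it first.
# Byte offsets must fall on character boundaries for multi-byte encodings.
str(chunk, "utf-8")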

String Interning

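CPython automatically interns identifiers and many identifier-like string constants. Strings built at runtime are not interned, but you can intern them manually when they are used repeatedly as dictionary keys:

import sys
key = sys.intern("frequently_used_key")
# Equal interned strings are the same object, so `is` comparison works
# and dict lookups can succeed on the identity check without comparing characters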
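
The effect is easiest to see with strings built at runtime, which CPython does not intern on its own. A small sketch with illustrative variable names:

```python
import sys

# Two equal strings assembled at runtime are distinct objects.
a = "".join(["frequently", "_used_key"])
b = "".join(["frequently_used", "_key"])
print(a == b)

# sys.intern returns one shared object for all equal strings.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

Interning pays off mainly when the same long keys are compared or hashed many times; for one-off strings the table lookup is wasted work.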

One Thing to Remember

CPython’s flexible string representation, combined with the pluggable codec system and four normalization forms, gives you complete control over Unicode — but production robustness requires always normalizing input (NFC), explicitly specifying encodings, and using utf8mb4 in MySQL.
