Python Unicode and Encoding — Deep Dive
Unicode handling in Python 3 is clean at the surface, but the internals reveal careful engineering decisions about memory, performance, and compatibility. This deep dive covers CPython’s string representation, the codec system, normalization, and production-grade patterns.
CPython’s Internal String Representation
Since PEP 393 (Python 3.3), CPython uses a flexible string representation that adapts storage per string based on the highest code point it contains:
| Kind | Bytes per char | Used when |
|---|---|---|
| Latin-1 | 1 | All chars ≤ U+00FF |
| UCS-2 | 2 | Any char in U+0100 – U+FFFF |
| UCS-4 | 4 | Any char > U+FFFF (emojis, rare scripts) |
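import sys
# Totals include a fixed object header whose size depends on the CPython version,
# so only the per-character width is stated here.
sys.getsizeof("hello") # compact ASCII: smallest header, 1 byte/char
sys.getsizeof("héllo") # Latin-1 (é is U+00E9): 1 byte/char, but a larger header than pure ASCII
sys.getsizeof("中ello") # UCS-2: 2 bytes/char
sys.getsizeof("🐍ello") # UCS-4: 4 bytes/char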
Key implication: A single emoji in a long string forces the entire string to UCS-4 representation, quadrupling memory compared to ASCII-only text. When processing large text corpora, stripping or separating emoji-containing strings can significantly reduce memory usage.
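The per-character cost can be observed by differencing the sizes of two strings of the same kind, which cancels the fixed header. This is a minimal sketch that assumes CPython 3.3 or later; `bytes_per_char` is a hypothetical helper name, not part of any library.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character for a string made only of `ch`.

    Differencing two lengths cancels the fixed object header, whose
    size varies between CPython versions.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

print(bytes_per_char("a"))           # ASCII
print(bytes_per_char("\u00e9"))      # Latin-1 range (é)
print(bytes_per_char("\u4e2d"))      # BMP character (中)
print(bytes_per_char("\U0001F40D"))  # outside the BMP (snake emoji)
```

Because the width is chosen per string, keeping emoji-bearing strings separate from plain text is what preserves the narrower storage for the rest.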
The Compact ASCII Optimization
Strings containing only ASCII characters get a special compact representation where the UTF-8 view is the same as the internal data. This avoids double storage for the most common case.
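One visible effect of the compact ASCII layout is that a pure-ASCII string has a smaller fixed header than a non-ASCII Latin-1 string of the same length. The sketch below assumes CPython 3.7 or later (for `str.isascii`); the variable names are illustrative and the exact byte counts are deliberately not stated because they differ between versions.

```python
import sys

ascii_text = "e" * 100
latin1_text = "\u00e9" * 100   # é: still 1 byte per character

print(ascii_text.isascii(), latin1_text.isascii())

# Same length and same per-character width, so any size difference
# comes from the object header alone.
header_difference = sys.getsizeof(latin1_text) - sys.getsizeof(ascii_text)
print(header_difference)
```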
The Codec System
Python’s encoding/decoding infrastructure is built on a pluggable codec registry.
How Codecs Are Registered
import codecs
# Look up a codec by name
info = codecs.lookup("utf-8")
print(info.name) # 'utf-8'
print(type(info.encode)) # <class 'builtin_function_or_method'>
Python normalizes codec names: "utf-8", "UTF8", "utf_8" all resolve to the same codec.
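A quick way to confirm the name normalization is to look up each spelling and collect the canonical names the registry reports. The list of spellings is only an example.

```python
import codecs

spellings = ["utf-8", "UTF8", "utf_8", "UTF-8"]

# Every spelling should resolve to one CodecInfo name
canonical = {codecs.lookup(s).name for s in spellings}
print(canonical)
```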
Custom Codecs
You can register your own codecs:
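import codecs

def rot13_encode(text, errors="strict"):
    # Codec functions take (input, errors) and return (output, length_consumed)
    return (codecs.encode(text, "rot_13"), len(text))

def rot13_decode(data, errors="strict"):
    return (codecs.decode(data, "rot_13"), len(data))

class Rot13Codec(codecs.Codec):
    encode = staticmethod(rot13_encode)
    decode = staticmethod(rot13_decode)

def find_rot13(name):
    # Search functions receive the lowercased name; since Python 3.9,
    # hyphens and spaces are also converted to underscores.
    if name.replace("-", "_") == "rot13_custom":
        return codecs.CodecInfo(
            encode=Rot13Codec.encode,
            decode=Rot13Codec.decode,
            name="rot13-custom",
        )
    return None

codecs.register(find_rot13)

# rot13 maps str to str, so call it through codecs.encode();
# str.encode() rejects codecs that do not return bytes.
codecs.encode("hello", "rot13-custom") # 'uryyb'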
Incremental Codecs
For streaming, use incremental encoders/decoders that handle partial input:
import codecs
decoder = codecs.getincrementaldecoder("utf-8")()
# Feed partial UTF-8 bytes (é = \xc3\xa9)
result = decoder.decode(b"caf\xc3") # "caf" — waits for more bytes
result += decoder.decode(b"\xa9!") # "é!" — completes the character
# Total: "café!"
This is essential for network protocols and streaming file readers where you receive data in arbitrary chunks.
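A chunked stream can be wrapped in a small generator built on the incremental decoder. This is a sketch, and `iter_decode` is a hypothetical helper name; the standard library's `codecs.iterdecode` offers similar behavior out of the box.

```python
import codecs
from typing import Iterable, Iterator

def iter_decode(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode an iterable of byte chunks, tolerating characters split across chunks."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True flushes the decoder and raises UnicodeDecodeError
    # if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# é (b"\xc3\xa9") is split across two chunks
pieces = list(iter_decode([b"caf", b"\xc3", b"\xa9!"]))
print("".join(pieces))
```

Passing `final=True` at the end matters: without it, a truncated stream would silently drop the trailing partial character instead of reporting it.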
Error Handling Strategies
Beyond the standard error handlers, Python provides several specialized options:
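text = "Price: ¥500 — item #①"
# surrogateescape: lossless roundtrip for undecodable bytes
# (used by os.fsdecode/fsencode for filenames). On encode it only restores
# bytes that were escaped during decoding, so it cannot encode ordinary
# characters such as ¥ and would raise UnicodeEncodeError for `text`.
escaped = b"caf\xe9".decode("ascii", errors="surrogateescape") # 'caf\udce9'
escaped.encode("ascii", errors="surrogateescape") # b'caf\xe9'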
# backslashreplace: Python escape sequences
text.encode("ascii", errors="backslashreplace")
# b'Price: \\xa5500 \\u2014 item #\\u2460'
# namereplace: Unicode character names
text.encode("ascii", errors="namereplace")
# b'Price: \\N{YEN SIGN}500 \\N{EM DASH} item #\\N{CIRCLED DIGIT ONE}'
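For contrast, the standard handlers behave as follows on the same kind of input. This is a short sketch with a shortened sample string chosen for illustration.

```python
text = "Price: \u00a5500"   # ¥ is U+00A5, which ASCII cannot represent

ignored = text.encode("ascii", errors="ignore")              # drops the character
replaced = text.encode("ascii", errors="replace")            # substitutes "?"
xml_ref = text.encode("ascii", errors="xmlcharrefreplace")   # numeric character reference
print(ignored, replaced, xml_ref)

try:
    text.encode("ascii")    # errors="strict" is the default
except UnicodeEncodeError as exc:
    # The exception records where the problem is
    print(exc.start, exc.end, exc.reason)
```

`ignore` and `replace` lose information, so they suit display or logging better than storage.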
The surrogateescape Strategy
Unix filenames are bytes, not text. A filename might contain bytes that aren’t valid UTF-8. Python handles this using surrogate escaping:
import os
# If a filename contains byte 0xFF (invalid UTF-8),
# Python represents it as U+DCFF (a surrogate character)
# This allows lossless roundtrip through str:
filename_str = os.fsdecode(b"file\xffname") # Uses surrogateescape
os.fsencode(filename_str) # Back to b"file\xffname" exactly
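The same mechanism can be exercised without touching the filesystem, which also shows a practical catch: a string carrying escaped bytes cannot be encoded strictly, so it needs sanitizing before it is logged or displayed. A minimal sketch with illustrative variable names:

```python
raw_name = b"file\xffname"          # 0xFF never appears in valid UTF-8

name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))                  # the bad byte appears as a lone surrogate

# Lossless round trip back to the original bytes
assert name.encode("utf-8", errors="surrogateescape") == raw_name

# A strict encode refuses lone surrogates, so escape them for display
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    printable = name.encode("utf-8", errors="backslashreplace").decode("utf-8")
    print(printable)
```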
Unicode Normalization Deep Dive
Unicode defines four normalization forms:
| Form | Name | Behavior |
|---|---|---|
| NFC | Composed | Combines characters where possible |
| NFD | Decomposed | Splits characters into base + combining marks |
| NFKC | Compatibility Composed | NFC + replaces compatibility characters |
| NFKD | Compatibility Decomposed | NFD + replaces compatibility characters |
import unicodedata
# NFC vs NFD
text = "e\u0301" # e + combining acute accent
unicodedata.normalize("NFC", text) # "é" (single char U+00E9)
unicodedata.normalize("NFD", text) # "e\u0301" (stays decomposed)
# NFKC: compatibility normalization
unicodedata.normalize("NFKC", "①") # "1"
unicodedata.normalize("NFKC", "\ufb01") # "fi" (the U+FB01 ligature expands to two letters)
unicodedata.normalize("NFKC", "½") # "1⁄2"
When to Use Which Form
- NFC for storage and comparison (W3C recommendation)
- NFD when you need to inspect combining marks
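- NFKC for search and for matching identifiers where compatibility variants should compare equal (Python normalizes its own identifiers with NFKC per PEP 3131; note that the PRECIS username profiles in RFC 8265 specify NFC)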
- NFKD for full-text search indexing
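The choices above map onto a few small helpers. The function names here are hypothetical, and `search_key` is a simplification: Unicode defines a dedicated NFKC_Casefold operation that this sketch only approximates.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (storage/comparison form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def strip_marks(text: str) -> str:
    """Use NFD to expose combining marks, then drop them."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_key(text: str) -> str:
    """Fold compatibility variants and case for loose matching."""
    return unicodedata.normalize("NFKC", text).casefold()

print(nfc_equal("e\u0301", "\u00e9"))
print(strip_marks("caf\u00e9"))
print(search_key("\ufb01le"))
```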
Unicode Security: Confusable Characters
Attackers use characters that look identical to ASCII but are different code points:
# These look the same but aren't
latin_a = "a" # U+0061
cyrillic_a = "а" # U+0430
latin_a == cyrillic_a # False!
# Confusable detection (use the 'confusables' data from Unicode)
# In practice, use the `confusables` or `icu` library
For user-facing identifiers (usernames, domain names), always normalize with NFKC and check against Unicode confusable tables.
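As a rough first filter, mixed-script identifiers can be flagged with the standard library alone. This sketch is a heuristic and not a substitute for the Unicode confusables data: it assumes the first word of a character's Unicode name identifies its script, which holds for Latin and Cyrillic letters but is not a real script property lookup. `scripts_used` is a hypothetical helper name.

```python
import unicodedata
from typing import Set

def scripts_used(text: str) -> Set[str]:
    """Return the leading word of each letter's Unicode name (e.g. 'LATIN')."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

honest = "paypal"
spoofed = "p\u0430ypal"   # second letter is Cyrillic U+0430

print(scripts_used(honest))
print(scripts_used(spoofed))
```

An identifier that mixes scripts is not necessarily malicious, so a check like this is better used to trigger review than to reject input outright.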
Production Patterns
Bulletproof File Reading
import chardet # pip install chardet

def read_file_safely(path: str) -> str:
    """Read a text file, detecting encoding if necessary."""
    # Use a context manager so the file handle is always closed
    with open(path, "rb") as f:
        raw = f.read()
    # Try UTF-8 first (most common); utf-8-sig also strips a leading BOM
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        pass
    # Try UTF-16 (has BOM)
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        try:
            return raw.decode("utf-16")
        except UnicodeDecodeError:
            pass
    # Fall back to detection; chardet may report None for the encoding
    detected = chardet.detect(raw)
    encoding = detected["encoding"] or "utf-8"
    return raw.decode(encoding, errors="replace")
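The BOM checks can also be made table-driven with constants from the `codecs` module, which avoids hand-typed byte literals and covers UTF-32 as well. This is a stdlib-only sketch; `sniff_bom` is a hypothetical helper, and it assumes a BOM is present and trustworthy (a UTF-16-LE file whose first character is U+0000 would be misread as UTF-32).

```python
import codecs
from typing import Optional

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a codec name that consumes the BOM, or None if no BOM is found."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding
    return None

sample = "hi".encode("utf-16")        # the utf-16 codec writes a BOM
print(sniff_bom(sample), sample.decode(sniff_bom(sample)))
```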
Database Unicode Gotchas
MySQL’s utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold code points above U+FFFF such as emojis. Use utf8mb4 for full Unicode:
-- MySQL: this breaks on emojis
CREATE TABLE bad (text VARCHAR(255) CHARACTER SET utf8);
-- MySQL: this works for all Unicode
CREATE TABLE good (text VARCHAR(255) CHARACTER SET utf8mb4);
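PostgreSQL, when the database encoding is UTF8, stores the full Unicode range, including 4-byte characters. The one exception is U+0000, which text values cannot contain.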
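On the application side, a string needs 4-byte UTF-8 storage exactly when it contains a code point above U+FFFF. A minimal sketch for validating input before it reaches a 3-byte column; `needs_utf8mb4` is a hypothetical helper name.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)

city = "\u6771\u4eac"      # 東京: BMP characters, 3 bytes each in UTF-8
tower = "\U0001F5FC"       # 🗼: outside the BMP, 4 bytes in UTF-8

print(needs_utf8mb4(city), len(city.encode("utf-8")))
print(needs_utf8mb4(tower), len(tower.encode("utf-8")))
```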
JSON and Unicode
Python’s json module defaults to escaping non-ASCII characters:
import json
data = {"city": "東京", "emoji": "🗼"}
# Default: ASCII-safe output
json.dumps(data)
# '{"city": "\\u6771\\u4eac", "emoji": "\\ud83d\\uddfc"}'
# Allow non-ASCII (smaller, more readable)
json.dumps(data, ensure_ascii=False)
# '{"city": "東京", "emoji": "🗼"}'
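Both forms decode to the same data, so the choice is about size and readability rather than correctness. The sketch below compares the encoded sizes for the same example payload; the variable names are illustrative.

```python
import json

data = {"city": "\u6771\u4eac", "emoji": "\U0001F5FC"}

escaped = json.dumps(data)                       # pure ASCII output
readable = json.dumps(data, ensure_ascii=False)  # raw characters kept

# Either form parses back to the original dictionary
assert json.loads(escaped) == json.loads(readable) == data

escaped_size = len(escaped.encode("utf-8"))
readable_size = len(readable.encode("utf-8"))
print(escaped_size, readable_size)
```

When writing the `ensure_ascii=False` form to a file, pass `encoding="utf-8"` to `open()` explicitly so the platform default encoding does not interfere.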
String Comparison and Sorting
For locale-aware sorting (e.g., German ä sorts near a, not after z):
import locale
# Raises locale.Error if the de_DE.UTF-8 locale is not installed on the system
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
words = ["Zug", "Ärger", "Apfel"]
sorted(words, key=locale.strxfrm)
# ['Apfel', 'Ärger', 'Zug'] — Ä sorts near A
For more robust Unicode-aware sorting, use the PyICU library which implements the Unicode Collation Algorithm.
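Where neither the locale nor PyICU is available, a crude approximation is to sort on the text with diacritics removed. This sketch is an assumption-laden stand-in and does not implement the Unicode Collation Algorithm; `rough_sort_key` is a hypothetical helper name.

```python
import unicodedata

def rough_sort_key(word: str):
    """Sort by base letters, ignoring case and diacritics; ties fall back to the word."""
    decomposed = unicodedata.normalize("NFKD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

words = ["Zug", "\u00c4rger", "Apfel"]   # Ärger written with an escape

by_code_point = sorted(words)            # Ä (U+00C4) lands after Z
ordered = sorted(words, key=rough_sort_key)
print(by_code_point)
print(ordered)
```

This treats Ä exactly like A, which matches German dictionary ordering for this example but is wrong for languages where accented letters sort as distinct letters.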
Performance Tips
Avoid Repeated Encoding/Decoding
# BAD: encode inside a loop
for line in lines:
    network.send(line.encode("utf-8"))
# BETTER: encode once
encoded = "\n".join(lines).encode("utf-8")
network.send(encoded)
Use memoryview for Large Binary Data
When processing large byte buffers, avoid creating copies:
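data = b"..." * 1_000_000
view = memoryview(data)
# Slice without copying
chunk = view[1000:2000]
# str() decodes straight from the buffer; chunk.tobytes() would copy the slice first.
# Slice boundaries must not fall inside a multi-byte character.
str(chunk, "utf-8")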
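The no-copy claim can be checked through the `obj` attribute of a memoryview, which refers to the underlying buffer. In this sketch the sample data is chosen so that slice boundaries fall on character boundaries; the names are illustrative.

```python
data = "h\u00e9llo ".encode("utf-8") * 1000   # 7 bytes per repetition
view = memoryview(data)

chunk = view[7:14]             # exactly one repetition
shares_buffer = chunk.obj is data
print(shares_buffer)           # the slice refers to the original bytes object

# Decode directly from the view without an intermediate bytes copy
text = str(chunk, "utf-8")
print(text)
```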
String Interning
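CPython automatically interns identifier-like string constants and names. Strings built at runtime are not interned, but you can intern them manually when they are used repeatedly as dictionary keys:
import sys
key = sys.intern("frequently_used_key")
# Equal interned strings are the same object, so `is` comparison works
# and dict lookups can match on identity before comparing contents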
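The effect is easiest to see with strings assembled at runtime. A small sketch with illustrative names, relying on the documented behavior that `sys.intern` returns one canonical object per distinct string value:

```python
import sys

parts = ["frequently", "used", "key"]

# Two equal strings built separately at runtime
first = "_".join(parts)
second = "_".join(parts)

a = sys.intern(first)
b = sys.intern(second)

print(a == b)   # equal by value
print(a is b)   # and the same object after interning
```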
One Thing to Remember
CPython’s flexible string representation, combined with the pluggable codec system and four normalization forms, gives you complete control over Unicode — but production robustness requires always normalizing input (NFC), explicitly specifying encodings, and using utf8mb4 in MySQL.