Python Unicode and Encoding — Core Concepts
Text handling is one of the areas where Python 3 made a clean break from Python 2. In Python 3, all strings are Unicode by default, and the distinction between text and binary data is enforced at the type level.
The Two Types: str and bytes
Python 3 has a firm boundary:
| Type | Contains | Created with |
|---|---|---|
| str | Unicode text (characters) | "hello", 'café' |
| bytes | Raw byte sequences | b"hello", b'\xc3\xa9' |
You cannot mix them. "hello" + b"world" raises a TypeError. This strictness prevents an entire category of encoding bugs.
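A minimal sketch of the boundary in practice. The variable names are illustrative only; the point is that one side has to be converted explicitly before the two can be combined.

```python
text = "hello"
data = b"world"

# str + bytes is rejected outright instead of guessing an encoding.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Bring both sides to the same type explicitly.
joined_text = text + " " + data.decode("utf-8")    # str result
joined_bytes = text.encode("utf-8") + b" " + data  # bytes result
```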
What Unicode Actually Is
Unicode is a standard that assigns a unique number (called a code point) to every character in every writing system. As of Unicode 15.1, there are over 149,000 characters.
Code points are written as U+XXXX:
- U+0041 → A
- U+00E9 → é
- U+4E2D → 中
- U+1F40D → 🐍
Python lets you use code points directly:
"\u0041" # 'A'
"\u00e9" # 'é'
"\U0001F40D" # '🐍' (note capital U for codes above FFFF)
What Encoding Does
Encoding is the process of converting Unicode code points into bytes. Decoding is the reverse.
text = "café"
encoded = text.encode("utf-8") # b'caf\xc3\xa9'
decoded = encoded.decode("utf-8") # 'café'
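Decoding only reverses encoding when both steps use the same codec. The sketch below decodes the same bytes twice, once correctly and once with Latin-1, to show the kind of garbled text a mismatch produces:

```python
text = "café"
encoded = text.encode("utf-8")

# Same codec on both sides: the original text comes back.
round_trip = encoded.decode("utf-8")

# Wrong codec: each UTF-8 byte is read as its own Latin-1 character.
mojibake = encoded.decode("latin-1")   # 'cafÃ©'
```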
UTF-8: The Default Standard
UTF-8 is a variable-width encoding:
| Code point range | Bytes needed | Examples |
|---|---|---|
| U+0000 – U+007F | 1 byte | ASCII characters |
| U+0080 – U+07FF | 2 bytes | Latin accents, Greek, Hebrew |
| U+0800 – U+FFFF | 3 bytes | Chinese, Japanese, Korean |
| U+10000 – U+10FFFF | 4 bytes | Emojis, rare scripts |
UTF-8 is backward-compatible with ASCII — any valid ASCII file is also valid UTF-8.
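The byte widths in the table can be checked directly. This is a small sketch using one sample character per range:

```python
# One sample character from each row of the table, with its expected width.
expected = {"A": 1, "\u00e9": 2, "\u4e2d": 3, "\U0001F40D": 4}

# Measure how many bytes each character takes in UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in expected}

# Pure ASCII text produces identical bytes under both encodings.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")
```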
Other Common Encodings
- UTF-16: Used internally by Windows and Java. 2 or 4 bytes per character.
- Latin-1 (ISO-8859-1): 1 byte per character, covers Western European languages. Cannot represent Chinese, Arabic, etc.
- ASCII: 1 byte, only 128 characters. English letters, digits, basic punctuation.
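As a rough illustration of those limits, the sketch below encodes the same short string with several codecs and then tries a character that Latin-1 cannot represent. The "utf-16-le" codec name is used here so that no byte order mark is added to the count.

```python
text = "café"

# Same text, different byte counts depending on the codec.
sizes = {codec: len(text.encode(codec)) for codec in ("utf-8", "utf-16-le", "latin-1")}

# Latin-1 has no mapping for characters outside its 256-character repertoire.
try:
    "中".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```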
Encoding Errors
When encoding or decoding fails, Python raises an exception (UnicodeEncodeError or UnicodeDecodeError) by default. You can change this behavior with the errors parameter:
# Strict (default) — raises error
"café".encode("ascii") # UnicodeEncodeError
# Replace unknown chars
"café".encode("ascii", errors="replace") # b'caf?'
# Ignore unknown chars
"café".encode("ascii", errors="ignore") # b'caf'
# XML escape
"café".encode("ascii", errors="xmlcharrefreplace") # b'caf&#233;'
Files and Encoding
When you open a text file, Python decodes bytes into str. The default encoding depends on your system, but you should always specify it:
# Explicit encoding — always recommended
with open("data.txt", encoding="utf-8") as f:
    text = f.read()
# Binary mode — no encoding/decoding
with open("image.png", "rb") as f:
    raw = f.read()  # Returns bytes
Starting with Python 3.15 (upcoming), UTF-8 is planned to become the default encoding on all platforms (PEP 686). Until then, Windows may default to a locale-specific encoding like cp1252.
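To see what the current system would pick when no encoding is given, and to confirm what an explicit encoding writes to disk, something like the following can be used. The file name and temporary directory are placeholders for illustration.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given (outside UTF-8 mode).
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")  # hypothetical file name

    # Write with an explicit encoding so the bytes on disk are predictable.
    with open(path, "w", encoding="utf-8") as f:
        f.write("café")

    # Binary mode shows the bytes that were actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading with the same encoding restores the text.
    with open(path, encoding="utf-8") as f:
        text = f.read()
```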
String Length vs. Byte Length
A string’s len() counts characters (code points), not bytes:
text = "café"
len(text) # 4 characters
len(text.encode("utf-8")) # 5 bytes (é is 2 bytes in UTF-8)
len(text.encode("utf-16")) # 10 bytes (2 bytes per char + BOM)
This distinction matters for databases, network protocols, and file formats that specify size limits in bytes.
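One way to respect a byte limit without cutting a multi-byte character in half is to slice the encoded bytes and discard any incomplete tail. The helper name truncate_utf8 is hypothetical, and this is only a sketch that assumes a non-negative limit:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text that fits in max_bytes of UTF-8."""
    clipped = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so the only possible invalid bytes are a
    # partial character at the end; errors="ignore" drops exactly that.
    return clipped.decode("utf-8", errors="ignore")


short = truncate_utf8("café", 4)   # the 2-byte é does not fit
full = truncate_utf8("café", 5)    # everything fits
```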
Normalization
Some characters can be represented in multiple ways:
- é can be a single code point (U+00E9, “precomposed”)
- Or the letter e (U+0065) followed by a combining accent (U+0301, “decomposed”)
Both look identical but compare as different strings:
import unicodedata
s1 = "\u00e9" # precomposed é
s2 = "e\u0301" # e + combining accent
s1 == s2 # False!
unicodedata.normalize("NFC", s2) == s1 # True
Always normalize before comparing or storing text, especially user input.
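A small comparison helper along these lines might look as follows. The name same_text is hypothetical, and NFC is just one reasonable choice of normalization form:

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"   # single code point
decomposed = "e\u0301"   # letter plus combining accent

# The two forms differ in length as well as in equality.
lengths = (len(precomposed), len(decomposed))

# NFD goes the other way and splits the precomposed form apart.
nfd = unicodedata.normalize("NFD", precomposed)
```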
Common Misconception
“UTF-8 and Unicode are the same thing.” Unicode is the character set (the catalog of code points). UTF-8 is one of several encodings (the way those code points get stored as bytes). UTF-16 and UTF-32 are other encodings of the same Unicode characters.
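The difference is easy to see by encoding a single code point three ways. In this sketch the big-endian codec variants are used so that no byte order mark is prepended:

```python
ch = "\u00e9"   # one code point: U+00E9

# Three different byte representations of the same code point.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

# Each decodes back to the same character.
all_same = (
    utf8.decode("utf-8") == utf16.decode("utf-16-be") == utf32.decode("utf-32-be") == ch
)
```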
One Thing to Remember
In Python 3, str is always Unicode and bytes is always raw data — encoding bridges the two, and explicitly choosing UTF-8 everywhere prevents the vast majority of text bugs.