Python Unicode and Encoding — Core Concepts

Understand how Python 3 represents text internally as Unicode and how encoding converts between strings and bytes.

Text handling is one of the areas where Python 3 made a clean break from Python 2. In Python 3, all strings are Unicode by default, and the distinction between text and binary data is enforced at the type level.

The Two Types: str and bytes

Python 3 draws a firm boundary between two types:

Type	Contains	Created with
`str`	Unicode text (characters)	`"hello"`, `'café'`
`bytes`	Raw byte sequences	`b"hello"`, `b'\xc3\xa9'`

You cannot mix them. "hello" + b"world" raises a TypeError. This strictness prevents an entire category of encoding bugs.
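
As a minimal sketch of crossing that boundary deliberately, the snippet below first confirms that mixing the two types fails, then decodes the bytes before concatenating. The helper name `join_text_and_bytes` is hypothetical and exists only for this illustration.

```python
def join_text_and_bytes(text: str, raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the bytes to str first, then concatenate (hypothetical helper)."""
    return text + raw.decode(encoding)


# Mixing str and bytes directly is rejected at the type level.
try:
    "hello" + b"world"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

greeting = join_text_and_bytes("hello ", b"world")

print(mixing_allowed)  # False
print(greeting)        # hello world
```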

What Unicode Actually Is

Unicode is a standard that assigns a unique number (called a code point) to every character in every writing system. As of Unicode 15.1, there are over 149,000 characters.

Code points are written as U+XXXX:

U+0041 → A
U+00E9 → é
U+4E2D → 中
U+1F40D → 🐍

Python lets you use code points directly:

"\u0041"       # 'A'
"\u00e9"       # 'é'
"\U0001F40D"   # '🐍' (capital \U takes 8 hex digits, needed for code points above U+FFFF)
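
The built-ins `ord()` and `chr()` convert between a character and its code point number. A small sketch, where `code_point_label` is a hypothetical helper name chosen for this example:

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


labels = [code_point_label(ch) for ch in "Aé中🐍"]
snake = chr(0x1F40D)  # build a character from its code point number

print(labels)  # ['U+0041', 'U+00E9', 'U+4E2D', 'U+1F40D']
```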

What Encoding Does

Encoding is the process of converting Unicode code points into bytes. Decoding is the reverse.

text = "café"
encoded = text.encode("utf-8")   # b'caf\xc3\xa9'
decoded = encoded.decode("utf-8") # 'café'

UTF-8: The Default Standard

UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point:

Code point range	Bytes needed	Examples
U+0000 – U+007F	1 byte	ASCII characters
U+0080 – U+07FF	2 bytes	Latin accents, Greek, Hebrew
U+0800 – U+FFFF	3 bytes	Most Chinese, Japanese, and Korean characters
U+10000 – U+10FFFF	4 bytes	Emojis, rare scripts

UTF-8 is backward-compatible with ASCII — any valid ASCII file is also valid UTF-8.
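
The byte widths in the table can be checked directly. This sketch picks one sample character from each range and also confirms that pure ASCII text produces the same bytes under both encodings:

```python
# One sample character from each row of the table above.
samples = "Aé中🐍"
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")

print(widths)            # {'A': 1, 'é': 2, '中': 3, '🐍': 4}
print(ascii_compatible)  # True
```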

Other Common Encodings

UTF-16: Used internally by Windows and Java. 2 or 4 bytes per character.
Latin-1 (ISO-8859-1): 1 byte per character, covers Western European languages. Cannot represent Chinese, Arabic, etc.
ASCII: 1 byte, only 128 characters. English letters, digits, basic punctuation.
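
To see these limits in practice, the sketch below tries the same characters under several encodings. `try_encode` is a hypothetical helper that returns `None` when the encoding cannot represent the text, and `"utf-16-le"` is used so that no byte order mark is added to the output.

```python
def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


latin1_e = try_encode("é", "latin-1")         # one byte: b'\xe9'
latin1_zh = try_encode("中", "latin-1")        # None: outside Latin-1
ascii_e = try_encode("é", "ascii")            # None: outside ASCII
utf16_e = try_encode("é", "utf-16-le")        # 2 bytes
utf16_snake = try_encode("🐍", "utf-16-le")    # 4 bytes (a surrogate pair)
```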

Encoding Errors

When encoding or decoding fails, Python raises UnicodeEncodeError or UnicodeDecodeError by default. You can control this with the errors parameter:

# Strict (default) — raises error
"café".encode("ascii")  # UnicodeEncodeError

# Replace unknown chars
"café".encode("ascii", errors="replace")  # b'caf?'

# Ignore unknown chars
"café".encode("ascii", errors="ignore")  # b'caf'

# XML escape
"café".encode("ascii", errors="xmlcharrefreplace")  # b'caf&#233;'
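
The errors parameter works the same way when decoding. As a sketch, the input here is assumed to be "café" saved as Latin-1 and then mistakenly read as UTF-8:

```python
raw = b"caf\xe9"  # 'café' in Latin-1; the lone 0xE9 byte is not valid UTF-8

# Strict (default): decoding fails.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = raw.decode("utf-8", errors="replace")          # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")            # bad byte is dropped
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte kept as the text \xe9
```

Of these handlers, "ignore" loses data silently, so it is usually the riskiest choice for input you need to keep intact.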

Files and Encoding

When you open a text file, Python decodes bytes into str. The default encoding depends on your system, but you should always specify it:

# Explicit encoding — always recommended
with open("data.txt", encoding="utf-8") as f:
    text = f.read()

# Binary mode — no encoding/decoding
with open("image.png", "rb") as f:
    raw = f.read()  # Returns bytes

Starting with Python 3.15 (planned, per PEP 686), UTF-8 is set to become the default encoding on all platforms. Until then, Windows may default to a locale-specific encoding such as cp1252.
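
A small round trip shows why the explicit encoding matters. This sketch writes a file to a temporary directory (the file name is arbitrary), then reads it back three ways: as raw bytes, as UTF-8, and as cp1252 to simulate a mismatched locale default.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    path.write_text("café", encoding="utf-8")

    raw = path.read_bytes()                       # the bytes actually on disk
    text = path.read_text(encoding="utf-8")       # decoded with the right encoding
    misread = path.read_text(encoding="cp1252")   # decoded with the wrong one

print(raw)      # b'caf\xc3\xa9'
print(text)     # café
print(misread)  # cafÃ©
```

The wrong encoding raises no error here; it quietly produces garbled text, which is why relying on the platform default is risky.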

String Length vs. Byte Length

A string’s len() counts characters (code points), not bytes:

text = "café"
len(text)                    # 4 characters
len(text.encode("utf-8"))    # 5 bytes (é is 2 bytes in UTF-8)
len(text.encode("utf-16"))   # 10 bytes (2 bytes per char + BOM)

This distinction matters for databases, network protocols, and file formats that specify size limits in bytes.
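
When a limit is expressed in bytes, slicing the string by character count is not enough. One possible approach, sketched here with a hypothetical `truncate_utf8` helper, is to cut the encoded bytes and discard any partial character left at the end:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes (hypothetical helper)."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The cut may land inside a multi-byte sequence; errors="ignore"
    # drops the incomplete trailing bytes instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


fits = truncate_utf8("café", 5)     # 'café' (exactly 5 bytes)
clipped = truncate_utf8("café", 4)  # 'caf' (é would need bytes 4 and 5)
```

This sketch keeps whole code points only. It can still separate a base letter from a following combining accent.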

Normalization

Some characters can be represented in multiple ways:

é can be a single code point (U+00E9, “precomposed”)
Or the letter e (U+0065) followed by a combining accent (U+0301, “decomposed”)

Both look identical but compare as different strings:

import unicodedata
s1 = "\u00e9"          # precomposed é
s2 = "e\u0301"         # e + combining accent
s1 == s2               # False!

unicodedata.normalize("NFC", s2) == s1  # True

Always normalize before comparing or storing text, especially user input.
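
A minimal sketch of a normalized comparison follows. The helper name `same_text` is hypothetical, and NFC is assumed as the target form; whichever form you choose, apply it consistently.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"   # é as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

raw_equal = precomposed == decomposed                       # False
normalized_equal = same_text(precomposed, decomposed)       # True
nfd_length = len(unicodedata.normalize("NFD", precomposed)) # 2 code points
```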

Common Misconception

“UTF-8 and Unicode are the same thing.” Unicode is the character set (the catalog of code points). UTF-8 is one of several encodings (the way those code points get stored as bytes). UTF-16 and UTF-32 are other encodings of the same Unicode characters.
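
The difference is easy to see by encoding one code point three ways. In this sketch, the big-endian codec names are used only so that no byte order mark is added to the output:

```python
ch = "中"  # a single code point, U+4E2D
encodings = ("utf-8", "utf-16-be", "utf-32-be")

encoded = {name: ch.encode(name) for name in encodings}
sizes = {name: len(data) for name, data in encoded.items()}

# Different bytes, same character: every encoding decodes back to ch.
round_trips = all(data.decode(name) == ch for name, data in encoded.items())

print(sizes)        # {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
print(round_trips)  # True
```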

One Thing to Remember

In Python 3, str is always Unicode and bytes is always raw data — encoding bridges the two, and explicitly choosing UTF-8 everywhere prevents the vast majority of text bugs.
