Python Unicode and Encoding — Core Concepts

Text handling is one of the areas where Python 3 made a clean break from Python 2. In Python 3, all strings are Unicode by default, and the distinction between text and binary data is enforced at the type level.

The Two Types: str and bytes

Python 3 has a firm boundary:

Type    Contains                     Created with
-----   --------------------------   ----------------------
str     Unicode text (characters)    "hello", 'café'
bytes   Raw byte sequences           b"hello", b'\xc3\xa9'

You cannot mix them. "hello" + b"world" raises a TypeError. This strictness prevents an entire category of encoding bugs.
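
As a small illustration of that boundary, the sketch below triggers the TypeError and then shows the two explicit ways to combine the values. The variable names are only illustrative, and UTF-8 is assumed as the codec.

```python
greeting = "hello "
payload = b"world"

try:
    greeting + payload          # str + bytes is not allowed
    mixed_ok = True
except TypeError:
    mixed_ok = False            # this branch runs

# Convert one side explicitly so both operands have the same type.
as_text = greeting + payload.decode("utf-8")    # 'hello world'
as_bytes = greeting.encode("utf-8") + payload   # b'hello world'
```

Which direction to convert depends on what the result is for: decode when you want text to process, encode when you want bytes to send or store.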

What Unicode Actually Is

Unicode is a standard that assigns a unique number (called a code point) to every character in every writing system. As of Unicode 15.1, there are over 149,000 characters.

Code points are written as U+XXXX:

  • U+0041 → A
  • U+00E9 → é
  • U+4E2D → 中
  • U+1F40D → 🐍

Python lets you use code points directly:

"\u0041"       # 'A'
"\u00e9"       # 'é'
"\U0001F40D"   # '🐍' (note capital U for codes above FFFF)

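Going the other way, the built-in ord() and chr() convert between a character and its code point number, and unicodedata.name() looks up the official character name. The helper name describe below is hypothetical, chosen only for this sketch.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

snake = chr(0x1F40D)            # build a character from its code point

print(describe("A"))            # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00e9"))       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(describe(snake))          # U+1F40D SNAKE
```
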
What Encoding Does

Encoding is the process of converting Unicode code points into bytes. Decoding is the reverse.

text = "café"
encoded = text.encode("utf-8")   # b'caf\xc3\xa9'
decoded = encoded.decode("utf-8") # 'café'
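
Decoding only gives back the original text if you use the codec the bytes were written with. The sketch below, which assumes Latin-1 as the mistaken codec, shows that a wrong choice may not raise at all and can silently produce garbled text instead.

```python
text = "café"
encoded = text.encode("utf-8")            # b'caf\xc3\xa9'

# Every byte is a valid Latin-1 character, so this does not fail;
# the two UTF-8 bytes of 'é' simply become two separate characters.
wrong = encoded.decode("latin-1")         # 'cafÃ©'

# The damage can be undone only if both codecs involved are known.
repaired = wrong.encode("latin-1").decode("utf-8")   # 'café'
```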

UTF-8: The Default Standard

UTF-8 is a variable-width encoding:

Code point range      Bytes needed   Examples
-------------------   ------------   ----------------------------
U+0000 – U+007F       1 byte         ASCII characters
U+0080 – U+07FF       2 bytes        Latin accents, Greek, Hebrew
U+0800 – U+FFFF       3 bytes        Chinese, Japanese, Korean
U+10000 – U+10FFFF    4 bytes        Emojis, rare scripts

UTF-8 is backward-compatible with ASCII — any valid ASCII file is also valid UTF-8.
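
The widths in the table can be checked directly by encoding one character from each range. This is a quick sketch using the same four sample characters as the code point list earlier in the article.

```python
samples = ["A", "\u00e9", "\u4e2d", "\U0001F40D"]   # A, é, 中, 🐍

# Map each code point to the number of bytes UTF-8 needs for it.
widths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}
print(widths)   # {'U+0041': 1, 'U+00E9': 2, 'U+4E2D': 3, 'U+1F40D': 4}

# ASCII compatibility: the same bytes decode identically under both codecs.
ascii_bytes = b"plain ASCII"
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")   # True
```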

Other Common Encodings

  • UTF-16: Used internally by Windows and Java. 2 or 4 bytes per character.
  • Latin-1 (ISO-8859-1): 1 byte per character, covers Western European languages. Cannot represent Chinese, Arabic, etc.
  • ASCII: 1 byte, only 128 characters. English letters, digits, basic punctuation.
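
To see these limits in practice, the sketch below tries each codec on two sample strings. The helper try_encode is hypothetical and simply returns None when a codec cannot represent the text; "utf-16-le" is used so the output has no byte order mark.

```python
def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

for codec in ("ascii", "latin-1", "utf-16-le", "utf-8"):
    print(codec, try_encode("café", codec), try_encode("中", codec))

# ascii      None                          None
# latin-1    b'caf\xe9'                    None
# utf-16-le  b'c\x00a\x00f\x00\xe9\x00'    b'-N'
# utf-8      b'caf\xc3\xa9'                b'\xe4\xb8\xad'
```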

Encoding Errors

When encoding or decoding fails, Python raises a UnicodeEncodeError or UnicodeDecodeError by default. You can change this behavior with the errors parameter:

# Strict (default) — raises error
"café".encode("ascii")  # UnicodeEncodeError

# Replace unknown chars
"café".encode("ascii", errors="replace")  # b'caf?'

# Ignore unknown chars
"café".encode("ascii", errors="ignore")  # b'caf'

# XML escape
"café".encode("ascii", errors="xmlcharrefreplace")  # b'caf&#233;'

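The errors parameter works the same way when decoding. As a sketch, the bytes below are assumed to be 'café' written as Latin-1 and then mistakenly read as UTF-8, which is a common way to hit a UnicodeDecodeError.

```python
data = b"caf\xe9"   # 'café' in Latin-1; not valid UTF-8

try:
    data.decode("utf-8")            # strict is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True            # this branch runs

# Substitute U+FFFD (the replacement character) for undecodable bytes.
replaced = data.decode("utf-8", errors="replace")            # 'caf\ufffd'

# Drop undecodable bytes entirely.
ignored = data.decode("utf-8", errors="ignore")              # 'caf'

# Keep the raw byte visible as an escape sequence.
escaped = data.decode("utf-8", errors="backslashreplace")    # 'caf\\xe9'
```

All three lenient handlers lose or alter information, so they fit logging and diagnostics better than data you intend to keep.
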
Files and Encoding

When you open a text file, Python decodes bytes into str. The default encoding depends on your system, but you should always specify it:

# Explicit encoding — always recommended
with open("data.txt", encoding="utf-8") as f:
    text = f.read()

# Binary mode — no encoding/decoding
with open("image.png", "rb") as f:
    raw = f.read()  # Returns bytes

Starting with Python 3.15 (upcoming), UTF-8 is planned to become the default encoding on all platforms (PEP 686). Until then, the default follows the system locale, and Windows in particular may use a legacy code page such as cp1252.
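
To check what your own system would use when encoding is omitted, and to confirm that an explicit encoding round-trips, a sketch like the following can help. It writes to a temporary directory, and the file name is arbitrary.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when encoding= is omitted.
# The value is platform dependent, so no expected output is shown.
print(locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")

    # Write and read text with an explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write("café")
    with open(path, encoding="utf-8") as f:
        text = f.read()             # 'café'

    # Binary mode shows the bytes that actually reached the disk.
    with open(path, "rb") as f:
        raw = f.read()              # b'caf\xc3\xa9'
```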

String Length vs. Byte Length

A string’s len() counts characters (code points), not bytes:

text = "café"
len(text)                    # 4 characters
len(text.encode("utf-8"))    # 5 bytes (é is 2 bytes in UTF-8)
len(text.encode("utf-16"))   # 10 bytes (4 chars × 2 bytes + 2-byte BOM)

This distinction matters for databases, network protocols, and file formats that specify size limits in bytes.
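
When a limit is expressed in bytes, slicing the string by character count is not enough. One possible approach, sketched here with a hypothetical helper named truncate_utf8, is to cut the encoded bytes and discard any character that was split at the boundary.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit; errors="ignore" drops the incomplete
    # trailing sequence left behind if the cut landed mid-character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("café", 5))   # 'café' (exactly 5 bytes)
print(truncate_utf8("café", 4))   # 'caf'  (é would need bytes 4 and 5)
```

This keeps code points intact but can still separate a base letter from a following combining accent, so it is a starting point rather than a complete solution.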

Normalization

Some characters can be represented in multiple ways:

  • é can be a single code point (U+00E9, “precomposed”)
  • Or the letter e (U+0065) followed by a combining accent (U+0301, “decomposed”)

Both look identical but compare as different strings:

import unicodedata
s1 = "\u00e9"          # precomposed é
s2 = "e\u0301"         # e + combining accent
s1 == s2               # False!

unicodedata.normalize("NFC", s2) == s1  # True

Always normalize before comparing or storing text, especially user input.
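
One way to apply this advice is to wrap the comparison in a small helper. The name same_text is hypothetical, and NFC is assumed as the default form; the important part is that both sides use the same one.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s1 = "\u00e9"      # precomposed é, 1 code point
s2 = "e\u0301"     # e + combining accent, 2 code points

print(len(s1), len(s2))      # 1 2
print(same_text(s1, s2))     # True

# NFD goes the other way and decomposes the precomposed form.
print(unicodedata.normalize("NFD", s1) == s2)   # True
```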

Common Misconception

“UTF-8 and Unicode are the same thing.” Unicode is the character set (the catalog of code points). UTF-8 is one of several encodings (the way those code points get stored as bytes). UTF-16 and UTF-32 are other encodings of the same Unicode characters.
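
To make the distinction concrete, the sketch below encodes a single code point three ways. The big-endian variants are used so that no byte order mark is added to the output.

```python
snake = "\U0001F40D"   # one code point, U+1F40D

# Same character, three different byte representations (shown as hex).
encodings = {
    codec: snake.encode(codec).hex()
    for codec in ("utf-8", "utf-16-be", "utf-32-be")
}
print(encodings)
# {'utf-8': 'f09f908d', 'utf-16-be': 'd83ddc0d', 'utf-32-be': '0001f40d'}

# Decoding each byte sequence with its own codec recovers the same str.
round_trips = all(
    bytes.fromhex(hexed).decode(codec) == snake
    for codec, hexed in encodings.items()
)
print(round_trips)   # True
```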

One Thing to Remember

In Python 3, str is always Unicode and bytes is always raw data — encoding bridges the two, and explicitly choosing UTF-8 everywhere prevents the vast majority of text bugs.

