Python String Interning — Deep Dive
CPython’s Intern Table
CPython maintains a global dictionary called interned (located in Objects/unicodeobject.c) that maps string values to their canonical objects. When sys.intern() is called:
- Python checks if an equal string already exists in the interned dictionary.
- If yes, the existing object is returned and the new string can be garbage collected.
- If no, the new string is added to interned and returned.
The interned dictionary uses the string itself as both the key and the value. In the default build, CPython does not count these two references for strings interned through sys.intern(), so such a string is not immortal: once the last outside reference disappears, the string is freed and its entry is removed from the table. Keep a reference to the value returned by sys.intern() for as long as you want the benefit.
import sys
# Build the string at runtime, then add it to the intern table
a = sys.intern("session_" + str(42))
# The entry lasts only while something (here, the name a) still references the string
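The lookup-or-add behaviour described above can be checked with a small sketch. It assumes CPython and builds the strings with str.join at runtime so the compiler cannot merge them; the names a, b, and canonical are illustrative only.

```python
import sys

# Build two equal strings at runtime so they start as separate objects.
a = "".join(["session_", "42"])
b = "".join(["session_", "42"])

print(a == b)   # the values are equal
print(a is b)   # but on CPython they are separate objects

# sys.intern() hands back one canonical object for both values.
canonical = sys.intern(a)
print(sys.intern(b) is canonical)
```

The second call finds the entry added by the first and returns it, so b itself can be discarded.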
Compiler-Level Interning: The Peephole Optimizer
CPython’s compiler automatically interns certain strings before your code even runs:
Constant Folding
The peephole optimizer (and, since Python 3.7, the AST optimizer) folds constant expressions:
# These become identical at compile time:
x = "hello" + "_" + "world" # Folded to "hello_world"
y = "hello_world"
print(x is y) # True
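One way to see the folding directly is to compile a snippet and look at the constants stored in the resulting code object. This is a sketch that assumes a recent CPython; the filename "<folding-demo>" and the variable names are placeholders.

```python
# Compile a two-line module and inspect the constants it stores.
source = 'x = "hello" + "_" + "world"\ny = "hello_world"\n'
code = compile(source, "<folding-demo>", "exec")

# The concatenation has already been replaced by its result.
print(code.co_consts)
folded = "hello_world" in code.co_consts

# Run the compiled code and compare the two resulting objects.
namespace = {}
exec(code, namespace)
print(namespace["x"] is namespace["y"])
```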
However, the optimizer has limits. It won’t fold a string repetition whose result would be longer than 4096 characters (as of CPython 3.12), to avoid bloating .pyc files:
long_a = "x" * 5000 # NOT folded — computed at runtime
long_b = "x" * 5000
print(long_a is long_b) # False
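The size limit can be observed the same way, by checking which string constants the compiler stores for a short and a long repetition. A sketch assuming CPython; folded_string_lengths is a hypothetical helper written for this illustration.

```python
def folded_string_lengths(expression):
    """Return the lengths of the string constants stored for one assignment."""
    code = compile(f"value = {expression}", "<limit-demo>", "exec")
    return sorted(len(c) for c in code.co_consts if isinstance(c, str))

# A short repetition is folded into a single 10-character constant.
short = folded_string_lengths('"x" * 10')

# A 5000-character result is left for runtime, so no constant of that length exists.
long = folded_string_lengths('"x" * 5000')

print(short)
print(long)
```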
Name Interning
All identifiers in Python bytecode are interned: variable names, function names, attribute names, module names. The co_names tuple of every code object contains interned strings. This is why getattr(obj, "method_name") is fast — the string "method_name" written as a literal is already interned.
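A short sketch of name interning, assuming CPython: the attribute name stored in a function's co_names is the same object that sys.intern() returns for an equal string built at runtime. The function call_method and the attribute method_name are placeholders.

```python
import sys

def call_method(obj):
    return obj.method_name

# The attribute name lives in the function's co_names tuple.
compiled_name = call_method.__code__.co_names[0]

# An equal string built at runtime starts out as a separate object.
runtime_name = "".join(["method_", "name"])
print(runtime_name is compiled_name)

# Interning it yields the object the compiler already interned.
print(sys.intern(runtime_name) is compiled_name)
```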
Memory Analysis
To understand the memory impact, consider a log parser processing 10 million lines where each line contains one of 5 log levels:
import sys
import tracemalloc
tracemalloc.start()
# Without interning
labels_raw = []
for level in ["ERROR", "WARNING", "INFO", "DEBUG", "TRACE"] * 2_000_000:
    labels_raw.append(level)
snapshot1 = tracemalloc.take_snapshot()
# With interning
labels_interned = []
for level in ["ERROR", "WARNING", "INFO", "DEBUG", "TRACE"] * 2_000_000:
    labels_interned.append(sys.intern(level))
snapshot2 = tracemalloc.take_snapshot()
# Show the largest allocation differences between the two snapshots
for stat in snapshot2.compare_to(snapshot1, "lineno")[:3]:
    print(stat)
In the raw case, Python may create multiple string objects for the same value when they come from runtime operations (like reading from files). With interning, all 2 million "ERROR" entries point to the same object.
The memory difference depends on how the strings are created. In the literal repetition above, every list entry already refers to one of the same five string objects, so interning changes almost nothing. For strings read from I/O (files, network), each read produces a new object, and interning can save on the order of 40–60% of string memory in applications with repetitive vocabularies.
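To see a difference, the strings have to be created at runtime. The sketch below simulates I/O by decoding bytes, which on CPython produces a fresh str object each time. The helpers load and measure, and the sample size of 100,000, are assumptions made for this illustration; absolute numbers will vary by Python version.

```python
import sys
import tracemalloc

LEVELS = [b"ERROR", b"WARNING", b"INFO", b"DEBUG", b"TRACE"]

def load(n, use_intern):
    """Decode n level names from bytes, optionally interning each one."""
    out = []
    for i in range(n):
        s = LEVELS[i % 5].decode("ascii")  # new str object on every decode
        out.append(sys.intern(s) if use_intern else s)
    return out

def measure(use_intern, n=100_000):
    """Return (traced bytes, number of distinct string objects)."""
    tracemalloc.start()
    labels = load(n, use_intern)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current, len({id(s) for s in labels})

raw_bytes, raw_objects = measure(False)
interned_bytes, interned_objects = measure(True)
print(f"raw:      {raw_bytes:>10,} bytes, {raw_objects} objects")
print(f"interned: {interned_bytes:>10,} bytes, {interned_objects} objects")
```

Both runs pay for the list itself; the gap comes from the string objects, which collapse to five in the interned run.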
Dictionary Key Optimization
CPython’s dictionary implementation has a fast path for interned string keys. During key lookup:
1. Compute the hash of the lookup key.
2. Find the hash table slot.
3. If the slot’s key is the lookup key (pointer comparison), return immediately.
4. Only if identity fails, fall back to __eq__ comparison.
Step 3 is why interned keys are faster: when both the stored key and the lookup key are interned, the identity check succeeds on the first comparison. If the lookup key is an equal but distinct object, step 3 fails and Python must compare the string contents through __eq__.
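String comparisons cannot be instrumented directly, so the sketch below uses a hypothetical CountingKey class as a stand-in to show the identity shortcut in CPython's dict lookup: the same object is found without calling __eq__, while an equal but distinct object needs one call.

```python
class CountingKey:
    """Stand-in key type that counts how often __eq__ is called."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value

stored = CountingKey("level")
table = {stored: 1}

# Lookup with the very same object: the pointer comparison is enough.
CountingKey.eq_calls = 0
table[stored]
same_object_calls = CountingKey.eq_calls

# Lookup with an equal but distinct object: identity fails, __eq__ runs.
twin = CountingKey("level")
CountingKey.eq_calls = 0
table[twin]
distinct_object_calls = CountingKey.eq_calls

print(same_object_calls, distinct_object_calls)
```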
This optimization is particularly impactful for **kwargs processing, getattr calls, and JSON deserialization where the same keys appear in every object:
import sys
import json
def intern_keys(obj):
    """Recursively intern all dictionary keys in a parsed JSON structure."""
    if isinstance(obj, dict):
        return {sys.intern(k): intern_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [intern_keys(item) for item in obj]
    return obj
# After parsing millions of JSON records with the same schema:
raw_json = '[{"level": "ERROR", "message": "disk full"}]'  # placeholder input
data = json.loads(raw_json)
data = intern_keys(data)
This pattern is used in production at companies processing large JSON datasets — Sentry’s event ingestion pipeline, for example, benefits from interning repetitive field names across millions of error events.
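An alternative sketch interns keys while parsing instead of in a second pass, using the object_pairs_hook parameter of json.loads. The hook name interned_object and the two sample records are made up for this example; whether this beats the recursive version depends on the data and should be measured.

```python
import json
import sys

def interned_object(pairs):
    """Build a dict from decoded (key, value) pairs, interning each key."""
    return {sys.intern(key): value for key, value in pairs}

# Line-delimited JSON: each record is parsed by a separate json.loads call.
lines = [
    '{"trace_id": "a1", "level": "ERROR"}',
    '{"trace_id": "b2", "level": "INFO"}',
]
records = [json.loads(line, object_pairs_hook=interned_object) for line in lines]

# With interning, both records share the same two key objects.
distinct_key_objects = len({id(key) for record in records for key in record})
print(distinct_key_objects)
```

The hook is called for every JSON object that is decoded, so nested objects are covered as well.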
Implementation Across Python Runtimes
String interning behavior varies significantly:
| Runtime | Automatic Interning | Manual API |
|---|---|---|
| CPython 3.12 | Identifier-like literals, dict keys | sys.intern() |
| PyPy | More aggressive (JIT-guided) | sys.intern() |
| GraalPython | Similar to CPython | sys.intern() |
| Jython (legacy) | Delegates to JVM string pool | sys.intern() → String.intern() |
PyPy’s JIT compiler can identify hot string comparisons and apply interning dynamically, even for strings that CPython wouldn’t intern automatically.
Interning and the GIL
The intern table is a global shared resource. In CPython’s current GIL-protected model, concurrent access to sys.intern() is safe. However, with the experimental free-threaded Python (PEP 703, --disable-gil), the intern table requires its own lock.
In Python 3.13’s free-threaded build, sys.intern() uses fine-grained locking around the intern dictionary. This means:
- Calling sys.intern() in a tight loop from multiple threads incurs lock contention
- Pre-interning strings during single-threaded initialization is preferred
- Read-only access to already-interned strings remains fast
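Pre-interning during single-threaded startup, as recommended above, might look like the sketch below. The vocabulary VOCAB, the lookup table CANON, and the helpers canonical and worker are all hypothetical names; worker threads only read the table and never call sys.intern() themselves.

```python
import sys
from concurrent.futures import ThreadPoolExecutor

# Single-threaded initialization: intern the known vocabulary once.
VOCAB = ("ERROR", "WARNING", "INFO", "DEBUG", "TRACE")
CANON = {level: sys.intern(level) for level in VOCAB}

def canonical(raw):
    """Return the pre-interned object for known values, else the input itself."""
    return CANON.get(raw, raw)

def worker(chunk):
    # Decoding simulates strings arriving from I/O as fresh objects.
    return [canonical(item.decode("ascii")) for item in chunk]

chunks = [[b"ERROR", b"INFO"], [b"DEBUG", b"ERROR", b"custom"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(worker, chunks))

print(results)
```

Unknown values such as "custom" pass through without being interned, which also keeps the table from growing with unexpected input.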
Production Patterns
Pattern 1: Intern During Deserialization
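import sys
import csv
# Expected column names (for reference; not enforced below)
KNOWN_FIELDS = {"timestamp", "level", "message", "source", "trace_id"}
def read_logs(filepath):
    # newline="" lets the csv module handle line endings itself
    with open(filepath, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Intern every key, but only the low-cardinality "level" value
            interned_row = {
                sys.intern(k): sys.intern(v) if k == "level" else v
                for k, v in row.items()
            }
            yield interned_row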
Pattern 2: Interned Enum-Like Constants
import sys
class LogLevel:
    ERROR = sys.intern("ERROR")
    WARNING = sys.intern("WARNING")
    INFO = sys.intern("INFO")
    DEBUG = sys.intern("DEBUG")
    @classmethod
    def normalize(cls, raw: str) -> str:
        return sys.intern(raw.upper().strip())
Pattern 3: Measuring Interning Impact
import sys
import time
strings = [f"key_{i % 100}" for i in range(1_000_000)]
interned = [sys.intern(s) for s in strings]
lookup_key = sys.intern("key_42")
# Benchmark identity vs equality
start = time.perf_counter()
for s in interned:
    _ = s is lookup_key
identity_time = time.perf_counter() - start
start = time.perf_counter()
for s in strings:
    _ = s == lookup_key
equality_time = time.perf_counter() - start
print(f"Identity: {identity_time:.3f}s, Equality: {equality_time:.3f}s")
# Reported typical result: identity 2-4x faster; the gap varies with machine and Python version
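A single timed loop is noisy, so a variant built on timeit.repeat, taking the best of several runs, may give steadier numbers. This is a sketch with a smaller list (100,000 entries) and hypothetical helper names; it makes no claim about the size of the gap on any particular machine.

```python
import sys
import timeit

strings = [f"key_{i % 100}" for i in range(100_000)]
interned = [sys.intern(s) for s in strings]
lookup_key = sys.intern("key_42")

def count_identity():
    # Identity test: valid only because both sides are interned.
    return sum(1 for s in interned if s is lookup_key)

def count_equality():
    # Equality test against the original, non-interned objects.
    return sum(1 for s in strings if s == lookup_key)

# Take the best of five repeats to reduce noise from other processes.
identity_best = min(timeit.repeat(count_identity, number=5, repeat=5))
equality_best = min(timeit.repeat(count_equality, number=5, repeat=5))
print(f"Identity: {identity_best:.4f}s, Equality: {equality_best:.4f}s")
```

Both functions must return the same count; if they differ, some string on one side was not interned and the identity test is unsafe.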
When Not to Intern
- Unique strings — Interning strings that appear only once wastes memory (the intern table entry has overhead).
- User-generated content — Interning arbitrary user input is a potential memory DoS vector.
- Short-lived strings — If strings are created and discarded quickly, the overhead of interning exceeds the benefit.
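One way to respect these limits is to intern only values from a fixed vocabulary and pass everything else through untouched. A minimal sketch; KNOWN_LEVELS and intern_if_known are hypothetical names, and a real vocabulary would come from your own schema.

```python
import sys

KNOWN_LEVELS = frozenset({"ERROR", "WARNING", "INFO", "DEBUG", "TRACE"})

def intern_if_known(value):
    """Intern only values from a fixed vocabulary; return others unchanged."""
    if value in KNOWN_LEVELS:
        return sys.intern(value)
    return value

# A known value arriving from I/O is mapped to the canonical object.
level = intern_if_known(b"ERROR".decode("ascii"))
print(level is sys.intern("ERROR"))

# Arbitrary input is never added to the intern table.
user_value = "".join(["user-", "123"])
print(intern_if_known(user_value) is user_value)
```

Because the vocabulary is bounded, untrusted input cannot make the intern table grow.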
The one thing to remember: String interning converts O(n) equality comparisons into O(1) identity checks by ensuring duplicate strings share one memory object — apply it deliberately at deserialization boundaries for the biggest wins in memory and speed.