Python String Interning — Deep Dive

CPython's interning table internals, peephole optimization interactions, memory profiling of interned strings, and production patterns for high-throughput parsing.

CPython’s Intern Table

CPython maintains a global dictionary called interned (located in Objects/unicodeobject.c) that maps string values to their canonical objects. When sys.intern() is called:

Python checks if an equal string already exists in the interned dictionary.
If yes, the existing object is returned and the new string can be garbage collected.
If no, the new string is added to interned and returned.

The interned dictionary uses the string itself as both the key and the value. This means interned strings hold at least two references from the intern table alone, preventing garbage collection until sys.intern is explicitly managed or the interpreter shuts down.

import sys

# The intern table grows monotonically during runtime
a = sys.intern("session_" + str(42))  # Adds to intern table
# This entry persists until interpreter shutdown

Compiler-Level Interning: The Peephole Optimizer

CPython’s compiler automatically interns certain strings before your code even runs:

Constant Folding

The peephole optimizer (and since Python 3.8, the AST optimizer) folds constant expressions:

# These become identical at compile time:
x = "hello" + "_" + "world"  # Folded to "hello_world"
y = "hello_world"
print(x is y)  # True

However, the optimizer has limits. It won’t fold strings longer than 4096 characters (as of CPython 3.12) to avoid bloating .pyc files:

long_a = "x" * 5000  # NOT folded — computed at runtime
long_b = "x" * 5000
print(long_a is long_b)  # False

Name Interning

All identifiers in Python bytecode are interned: variable names, function names, attribute names, module names. The co_names tuple of every code object contains interned strings. This is why getattr(obj, "method_name") is fast — the string "method_name" written as a literal is already interned.

Memory Analysis

To understand the memory impact, consider a log parser processing 10 million lines where each line contains one of 5 log levels:

import sys
import tracemalloc

tracemalloc.start()

# Without interning
labels_raw = []
for level in ["ERROR", "WARNING", "INFO", "DEBUG", "TRACE"] * 2_000_000:
    labels_raw.append(level)

snapshot1 = tracemalloc.take_snapshot()

# With interning
labels_interned = []
for level in ["ERROR", "WARNING", "INFO", "DEBUG", "TRACE"] * 2_000_000:
    labels_interned.append(sys.intern(level))

snapshot2 = tracemalloc.take_snapshot()

In the raw case, Python may create multiple string objects for the same value when they come from runtime operations (like reading from files). With interning, all 2 million "ERROR" entries point to the same object.

The memory difference depends on how the strings are created. For literal repetitions (as above), CPython may already optimize. For strings read from I/O (files, network), interning typically saves 40–60% of string memory in applications with repetitive vocabularies.

Dictionary Key Optimization

CPython’s dictionary implementation has a fast path for interned string keys. During key lookup:

Compute the hash of the lookup key.
Find the hash table slot.
If the slot’s key is the lookup key (pointer comparison), return immediately.
Only if identity fails, fall back to __eq__ comparison.

Step 3 is why interned keys are faster — the identity check succeeds on the first comparison. Without interning, step 3 fails and Python must do a full __eq__ string comparison.

This optimization is particularly impactful for **kwargs processing, getattr calls, and JSON deserialization where the same keys appear in every object:

import sys
import json

def intern_keys(obj):
    """Recursively intern all dictionary keys in a parsed JSON structure."""
    if isinstance(obj, dict):
        return {sys.intern(k): intern_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [intern_keys(item) for item in obj]
    return obj

# After parsing millions of JSON records with the same schema:
data = json.loads(raw_json)
data = intern_keys(data)

This pattern is used in production at companies processing large JSON datasets — Sentry’s event ingestion pipeline, for example, benefits from interning repetitive field names across millions of error events.

Implementation Across Python Runtimes

String interning behavior varies significantly:

Runtime	Automatic Interning	Manual API
CPython 3.12	Identifier-like literals, dict keys	`sys.intern()`
PyPy	More aggressive (JIT-guided)	`sys.intern()`
GraalPython	Similar to CPython	`sys.intern()`
Jython (legacy)	Delegates to JVM string pool	`sys.intern()` → `String.intern()`

PyPy’s JIT compiler can identify hot string comparisons and apply interning dynamically, even for strings that CPython wouldn’t intern automatically.

Interning and the GIL

The intern table is a global shared resource. In CPython’s current GIL-protected model, concurrent access to sys.intern() is safe. However, with the experimental free-threaded Python (PEP 703, --disable-gil), the intern table requires its own lock.

In Python 3.13’s free-threaded build, sys.intern() uses fine-grained locking around the intern dictionary. This means:

Calling sys.intern() in a tight loop from multiple threads incurs lock contention
Pre-interning strings during single-threaded initialization is preferred
Read-only access to already-interned strings remains fast

Production Patterns

Pattern 1: Intern During Deserialization

import sys
import csv

KNOWN_FIELDS = {"timestamp", "level", "message", "source", "trace_id"}

def read_logs(filepath):
    with open(filepath) as f:
        reader = csv.DictReader(f)
        for row in reader:
            interned_row = {
                sys.intern(k): sys.intern(v) if k == "level" else v
                for k, v in row.items()
            }
            yield interned_row

Pattern 2: Interned Enum-Like Constants

import sys

class LogLevel:
    ERROR = sys.intern("ERROR")
    WARNING = sys.intern("WARNING")
    INFO = sys.intern("INFO")
    DEBUG = sys.intern("DEBUG")
    
    @classmethod
    def normalize(cls, raw: str) -> str:
        return sys.intern(raw.upper().strip())

Pattern 3: Measuring Interning Impact

import sys
import time

strings = [f"key_{i % 100}" for i in range(1_000_000)]
interned = [sys.intern(s) for s in strings]

lookup_key = sys.intern("key_42")

# Benchmark identity vs equality
start = time.perf_counter()
for s in interned:
    _ = s is lookup_key
identity_time = time.perf_counter() - start

start = time.perf_counter()
for s in strings:
    _ = s == lookup_key
equality_time = time.perf_counter() - start

print(f"Identity: {identity_time:.3f}s, Equality: {equality_time:.3f}s")
# Typical result: Identity is 2-4x faster

When Not to Intern

Unique strings — Interning strings that appear only once wastes memory (the intern table entry has overhead).
User-generated content — Interning arbitrary user input is a potential memory DoS vector.
Short-lived strings — If strings are created and discarded quickly, the overhead of interning exceeds the benefit.

The one thing to remember: String interning converts O(n) equality comparisons into O(1) identity checks by ensuring duplicate strings share one memory object — apply it deliberately at deserialization boundaries for the biggest wins in memory and speed.

pythonperformancememoryoptimizationcpython-internals