Python String Interning Internals — Deep Dive

Explore CPython's string interning implementation — the intern dictionary, peephole optimizer interactions, memory analysis, and production patterns for high-throughput string processing.

String interning is one of CPython’s most impactful optimizations, yet its mechanics are poorly understood. This deep dive traces the interning process through CPython’s C source, examines when and why the compiler interns strings, measures real memory savings, and demonstrates production patterns for high-throughput applications.

The Intern Table

CPython maintains a dictionary (named interned in Objects/unicodeobject.c) that maps each interned string value to its canonical object:

import sys

# sys.intern() adds to and retrieves from this global dict
a = sys.intern("hello world")
b = sys.intern("hello world")

print(a is b)          # True — same object
print(id(a) == id(b))  # True

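# Without interning, strings built at runtime are separate objects.
# (A literal expression like "hello" + " " + "world" is folded by the
# compiler into one shared constant, so build the strings with join.)
c = " ".join(["hello", "world"])
d = " ".join(["hello", "world"])
print(c is d)  # False in CPython: different objects, same value
print(c == d)  # True: same value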

What the C Code Does

When sys.intern(s) is called:

1. Check whether a string equal to s is already in the interned dict.
2. If yes, return the existing object and drop the reference to the new one.
3. If no, add s to the dict and mark it as interned (setting its state to SSTATE_INTERNED_MORTAL).
4. Return s.

The dict is keyed by the string object itself, not by its hash. The lookup uses the string's cached hash, so it is O(1) on average.

Compiler-Level Interning

CPython’s compiler interns strings before your code even runs:

Constant Folding and the Peephole Optimizer

import dis

def example():
    a = "hello"
    b = "hello"
    return a is b

dis.dis(example)
# Both 'a' and 'b' load the SAME constant from co_consts
# The compiler deduplicates identical constant values

The compiler’s constant table (co_consts) stores each unique value once. Two identical string literals in the same code object share the same constant — this is deduplication, not interning per se, but the effect is similar.
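
The deduplication can be checked directly by inspecting the function's constant table. A minimal sketch, assuming the same `example` function as above:

```python
def example():
    a = "hello"
    b = "hello"
    return a is b

consts = example.__code__.co_consts
# Count how many entries in the constant table equal "hello".
hello_count = sum(1 for c in consts if c == "hello")
print(hello_count)  # 1: both assignments load the same constant
print(example())    # True
```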

What the Compiler Interns

# ✅ Identifier-like strings: automatically interned
a = "hello"          # Interned (looks like an identifier)
b = "MAX_SIZE"       # Interned

# ❌ Non-identifier strings: NOT automatically interned
c = "hello world"    # Not interned (contains space)
d = "hello!"         # Not interned (contains punctuation)

# ✅ But short strings may be cached anyway
e = "a"              # Single-character strings in the Latin-1 range (0-255) are cached
f = chr(97)
print(e is f)        # True in CPython: from the single-character cache

# ⚠️ Compile-time constant folding can create interned results
g = "hello" + "_" + "world"  # Compiler may fold to "hello_world" → interned
h = "hello_world"
print(g is h)        # Often True (compiler optimization)

Memory Analysis

import sys

# Measure memory impact of interning
def memory_without_interning(n: int) -> int:
    strings = ["status_code_" + str(i % 10) for i in range(n)]
    return sum(sys.getsizeof(s) for s in strings)

def memory_with_interning(n: int) -> int:
    strings = [sys.intern("status_code_" + str(i % 10)) for i in range(n)]
    # Only 10 unique objects exist; deduplicate by identity, not by value
    unique = {id(s): s for s in strings}
    return sum(sys.getsizeof(s) for s in unique.values())

n = 1_000_000
print(f"Without interning: {memory_without_interning(n):,} bytes for objects")
print(f"With interning: {memory_with_interning(n):,} bytes for unique objects")
# Without: ~60MB of string objects (roughly 60 bytes each on 64-bit CPython)
# With: ~600 bytes for 10 unique strings (plus the list of references in both cases)

Measuring with tracemalloc

import tracemalloc
import sys

tracemalloc.start()

# Scenario 1: No interning
labels = ["category_" + str(i % 100) for i in range(500_000)]
snapshot1 = tracemalloc.take_snapshot()
stats1 = snapshot1.statistics('lineno')
total1 = sum(s.size for s in stats1)

# Clear and measure with interning
labels.clear()
tracemalloc.clear_traces()

labels = [sys.intern("category_" + str(i % 100)) for i in range(500_000)]
snapshot2 = tracemalloc.take_snapshot()
stats2 = snapshot2.statistics('lineno')
total2 = sum(s.size for s in stats2)

print(f"Without interning: {total1:,} bytes")
print(f"With interning: {total2:,} bytes")

The Single-Character Cache

CPython maintains a cache of single-character strings for the Latin-1 range (code points 0-255), not just ASCII:

# All single Latin-1 characters are pre-cached
a = chr(65)  # 'A'
b = chr(65)  # 'A'
print(a is b)  # True: from the cache

# This is why single-char `is` comparison appears to work
# But DON'T rely on it: it's an implementation detail
c = chr(200)  # 'È': outside ASCII but still inside the Latin-1 cache
d = chr(200)
print(c is d)  # True in CPython

e = chr(0x20AC)  # '€': outside the cache
f = chr(0x20AC)
print(e is f)  # False in CPython: two separate objects

Interning in Dictionary Operations

CPython interns many of the strings that end up as dictionary keys, such as identifier-like literals and attribute names:

# Identifier-like keys in a dict literal are interned at compile time
d = {"name": "Alice", "age": 30}

# A lookup such as d["name"] first hashes the key (strings cache their hash),
# then checks each candidate entry in this order:
#   1. Compare identity (pointer): O(1)
#   2. If the objects differ, compare hashes: O(1)
#   3. If the hashes match, compare characters: O(n)
#
# With interned keys, step 1 succeeds for most lookups
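
The comparison order described above can be sketched as a small function. This is a hypothetical model of how one candidate entry is checked, not the actual C code; `keys_match` and its parameters are illustrative names.

```python
import sys

def keys_match(stored_key, stored_hash, lookup_key, lookup_hash):
    """Hypothetical model of comparing one candidate dict entry."""
    if stored_key is lookup_key:      # identity: a pointer comparison
        return True
    if stored_hash != lookup_hash:    # different hashes: cannot be equal
        return False
    return stored_key == lookup_key   # full comparison as a last resort

k1 = sys.intern("name")
k2 = sys.intern("".join(["na", "me"]))  # built at runtime, then interned
print(keys_match(k1, hash(k1), k2, hash(k2)))  # True, via the identity check
```

Because both keys were interned, they are the same object and the function returns at the first test without touching the characters.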

The Impact on Dict Performance

import timeit
import sys

# Interned key
key_interned = sys.intern("frequently_used_key")
d = {key_interned: 42}

# Equal but non-interned key, built once at runtime
# (a literal "frequently" + "_used_key" would be folded at compile time)
key_plain = "".join(["frequently", "_used_key"])

def lookup_non_interned():
    return d[key_plain]

def lookup_interned():
    return d[key_interned]

t_non = timeit.timeit(lookup_non_interned, number=1_000_000)
t_int = timeit.timeit(lookup_interned, number=1_000_000)
print(f"Non-interned: {t_non:.3f}s, Interned: {t_int:.3f}s")
# The interned lookup skips the character comparison; for short keys the gap
# is usually small, so measure on your own workload

Production Patterns

Log Level Interning

import sys

class LogParser:
    """Parser that interns repeated field values for memory efficiency."""

    _INTERN_FIELDS = {'level', 'source', 'host'}

    def parse_line(self, line: str) -> dict:
        parts = line.split('\t')
        record = {
            'timestamp': parts[0],
            'level': parts[1],
            'source': parts[2],
            'host': parts[3],
            'message': parts[4],
        }
        # Intern low-cardinality, highly repetitive fields
        for field in self._INTERN_FIELDS:
            record[field] = sys.intern(record[field])
        return record

    def parse_file(self, path: str) -> list[dict]:
        records = []
        with open(path) as f:
            for line in f:
                # Strip only the newline so an empty trailing field keeps its tab
                records.append(self.parse_line(line.rstrip('\n')))
        return records

# For 10M log lines with 5 log levels and 100 hosts:
# Without interning: 20M separate level + host string objects, on the order of 1 GB
# With interning: ~105 unique strings, and each record holds only references to them

DataFrame Column Optimization

import pandas as pd
import sys

def intern_categorical_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Intern string values in low-cardinality columns."""
    for col in columns:
        if df[col].dtype == object:
            # For truly low-cardinality, use pd.Categorical instead
            if df[col].nunique() < 1000:
                df[col] = pd.Categorical(df[col])
            else:
                # For medium cardinality, interning still helps
                df[col] = df[col].map(sys.intern)
    return df

# pandas Categorical is usually better than interning for DataFrames
# But interning shines when strings flow between dicts/sets/custom objects

Symbol Tables in Parsers

import sys

class Tokenizer:
    """Tokenizer that interns all identifiers for fast comparison."""

    def __init__(self):
        self._keywords = {sys.intern(kw) for kw in [
            'if', 'else', 'while', 'for', 'return', 'def', 'class',
        ]}

    def tokenize(self, source: str) -> list[tuple[str, str]]:
        tokens = []
        for word in source.split():
            interned = sys.intern(word)
            if interned in self._keywords:
                tokens.append(('KEYWORD', interned))
            else:
                tokens.append(('IDENT', interned))
        return tokens

    def is_keyword(self, token: str) -> bool:
        # With interning, this `in` check uses identity first
        return sys.intern(token) in self._keywords

Interning Lifetime and Cleanup

Interned strings have special reference counting:

import sys

# SSTATE_INTERNED_MORTAL: removed from the table when refcount drops to 0
# SSTATE_INTERNED_IMMORTAL: lives forever (used for built-in names)

# sys.intern() has traditionally created MORTAL entries
s = sys.intern("temporary_value")
# When all references to s are deleted, it can be removed from the intern table
# (version-dependent: don't rely on this for memory management)

# Built-in names like 'None', 'True', '__init__' are IMMORTAL
# They persist for the interpreter's lifetime

Gotchas

Don’t use is for string comparison. Even with interning, is is not guaranteed to work for all equal strings. Always use ==.

Interning non-identifier-like strings is fine. sys.intern("hello world") works; it's just not done automatically.

Interned strings increase interpreter memory. Every distinct value you intern adds an entry to the intern table, and that entry stays for as long as anything still references the string (immortal entries stay for the interpreter's lifetime). Don't intern unbounded user input.

PyPy handles interning differently. PyPy’s JIT compiler may optimize string operations differently. Always benchmark on your target runtime.
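
One way to act on the warning about unbounded input is to canonicalize strings through a private, size-capped table instead of the interpreter-wide one. The sketch below is a hypothetical helper, not a standard-library API: `BoundedInterner`, `max_size`, and the sample host names are all assumptions made for illustration.

```python
class BoundedInterner:
    """Hypothetical helper: canonicalize strings, up to max_size distinct values."""

    def __init__(self, max_size=10_000):
        self._table = {}
        self._max_size = max_size

    def intern(self, s):
        cached = self._table.get(s)
        if cached is not None:
            return cached
        if len(self._table) < self._max_size:
            self._table[s] = s
        # Over the cap: hand the string back unchanged, without storing it.
        return s

interner = BoundedInterner(max_size=2)
a = interner.intern("".join(["ho", "st-1"]))
b = interner.intern("".join(["ho", "st-1"]))
print(a is b)                # True: the second call returns the cached object
interner.intern("host-2")
interner.intern("host-3")    # table is full, so this one is not stored
print(len(interner._table))  # 2
```

Dropping the interner object releases every string it holds, so its memory cost is bounded and its lifetime is under the application's control.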

One Thing to Remember

CPython’s string interning converts duplicate string objects into shared references via a global dictionary — use sys.intern() for high-repetition string fields in data pipelines, but never rely on is for comparison since interning behavior is an implementation detail.
