Compiler Pipeline — Deep Dive
Architecture Overview
CPython’s compilation pipeline lives primarily in these source files:
Parser/tokenize.c— TokenizerParser/pegen.c,Parser/parser.c— PEG parser (auto-generated fromGrammar/python.gram)Python/ast.c— AST construction and validationPython/ast_opt.c— AST-level optimizationsPython/symtable.c— Symbol table constructionPython/compile.c— Bytecode compilationPython/flowgraph.c— CFG construction and optimization (Python 3.12+)Python/ceval.c— The evaluation loop
Understanding these files gives you a map of the entire journey from source to execution.
Stage 1: Tokenization in Detail
The tokenizer is the first consumer of source bytes. It handles encoding detection (reading # -*- coding: utf-8 -*- or the BOM), converts bytes to characters, and produces a token stream.
Key behaviors:
- Indentation tracking: The tokenizer maintains an indentation stack. When a line’s indentation increases, it emits
INDENT. When it decreases, it emits one or moreDEDENTtokens to match. This is how Python’s significant whitespace is encoded at the grammar level. - Implicit line joining: Expressions inside brackets (
(),[],{}) can span multiple lines without backslash continuation. The tokenizer tracks bracket depth to handle this. - F-string tokenization: Since Python 3.12, f-strings are tokenized properly at the tokenizer level (previously they were partially handled by the parser).
import token, tokenize, io
code = """
def greet(name):
return f"Hello, {name}!"
"""
for tok in tokenize.generate_tokens(io.StringIO(code).readline):
print(f"{token.tok_name[tok.type]:10} {tok.string!r:20} "
f"line {tok.start[0]}")
Stage 2: The PEG Parser
Python 3.9 replaced the LL(1) parser with a PEG parser. The grammar is defined in Grammar/python.gram using a notation that supports:
- Ordered alternatives: Try each option in order, use the first that matches
- Lookahead:
&expr(positive) and!expr(negative) without consuming input - Left recursion: Handled automatically (LL(1) cannot do this)
- Soft keywords:
matchandcaseare keywords only in pattern-matching contexts
The parser generator (Tools/peg_generator/) reads this grammar and produces C code that builds AST nodes directly. There is no intermediate parse tree — tokens go straight to AST.
import ast
source = """
match command:
case "quit":
exit()
case ("go", direction):
move(direction)
"""
tree = ast.parse(source)
match_node = tree.body[0]
print(type(match_node).__name__) # Match
print(len(match_node.cases)) # 2
Stage 3: AST Validation and Optimization
After parsing, ast.c validates the tree (checking for invalid constructs that the grammar allows but semantics forbid, like del f()). Then ast_opt.c runs optimizations:
Constant folding evaluates constant expressions recursively. The folder handles:
# All folded at compile time:
x = 1 + 2 * 3 # → 7
y = "ab" + "cd" # → "abcd"
z = (1, 2) + (3,) # → (1, 2, 3)
w = not True # → False
Membership test optimization converts constant containers:
# list → tuple (O(n) but less overhead):
x in [1, 2, 3] → x in (1, 2, 3)
# set literal → frozenset (O(1) lookup):
x in {1, 2, 3} → x in frozenset({1, 2, 3})
Format string simplification handles simple cases where f-string segments can be merged.
Stage 4: Symbol Table Construction
Python/symtable.c performs a critical pre-pass before compilation. It walks the entire AST and builds a symbol table that maps every name to its scope:
- LOCAL — Assigned in this scope (uses
LOAD_FAST) - GLOBAL_EXPLICIT — Declared with
globalstatement - GLOBAL_IMPLICIT — Referenced but not assigned locally
- FREE — Referenced from an enclosing scope (
nonlocalor implicit closure) - CELL — Local but also captured by an inner scope
This pass detects errors:
def broken():
print(x) # SyntaxError or UnboundLocalError?
x = 10 # This makes x local, so the print above fails
# The symbol table marks x as LOCAL because of the assignment,
# so LOAD_FAST is used — but x hasn't been assigned yet at the print
The symbol table also determines whether a function needs a closure (any CELL or FREE variables) and sets appropriate flags in the code object.
Stage 5: Bytecode Compilation
Python/compile.c is the core compiler. It walks the AST depth-first and emits bytecode instructions. The compilation is scope-driven: entering a function definition starts a new compilation unit with its own code object.
Key responsibilities:
- Expression compilation: Recursive descent through expression nodes, emitting stack-machine instructions
- Control flow: Translating
if/elif/else,for,while,try/except/finallyinto jump instructions - Comprehension compilation: Each comprehension creates its own code object (a hidden nested function)
- Class compilation: Class bodies are compiled as separate code objects and executed at class creation time
The compiler produces an intermediate representation — a sequence of instruction objects with symbolic labels for jump targets.
Stage 6: CFG Construction and Optimization
In Python 3.12+, the instruction sequence is converted to a Control Flow Graph (CFG) in Python/flowgraph.c. Each basic block is a sequence of non-branching instructions. Edges represent jumps.
The CFG optimizer performs:
- Dead block elimination: Blocks unreachable from the entry point are removed
- Jump threading: Chains of jumps are collapsed
- Instruction peephole: NOP removal, redundant load/store elimination
- Stack depth calculation: The maximum evaluation stack depth is computed by walking all CFG paths
After optimization, the CFG is linearized back into a flat instruction sequence, labels are resolved to concrete offsets, and the final bytecode bytes are assembled.
Stage 7: Code Object Assembly
The final step packages everything into a PyCodeObject:
- Bytecode bytes (
co_code) - Constants pool (
co_consts) - Variable name tables (
co_varnames,co_names,co_cellvars,co_freevars) - Source mapping (
co_linetable, exception table) - Metadata (argument counts, flags, stack size)
Nested scopes produce nested code objects — inner function code objects appear in the outer’s co_consts.
The Eval Loop
Python/ceval.c contains _PyEval_EvalFrameDefault, a massive function that implements the bytecode interpreter. It is essentially a giant switch statement over opcodes:
// Simplified pseudocode
for (;;) {
opcode = next_instruction();
switch (opcode) {
case LOAD_FAST:
value = fastlocals[oparg];
PUSH(value);
break;
case BINARY_OP:
right = POP();
left = POP();
result = binary_op(oparg, left, right);
PUSH(result);
break;
case RETURN_VALUE:
retval = POP();
goto exit_frame;
// ... hundreds more cases
}
}
Python 3.11+ uses computed gotos (on supported compilers) instead of a switch, and Python 3.13 experiments with a register-based interpreter for further speedup.
Specializing Adaptive Interpreter
Python 3.11 added a runtime optimization layer on top of the static compilation. After a bytecode instruction executes a few times, the interpreter replaces it with a specialized version:
BINARY_OP (+) → BINARY_OP_ADD_INT (when both operands are ints)
LOAD_ATTR → LOAD_ATTR_INSTANCE_VALUE (when object layout is known)
CALL → CALL_PY_EXACT_ARGS (when target is a Python function)
Specialization is per-instruction and uses inline caches stored after the instruction. If the type assumption fails, the instruction de-specializes back to the generic form. This is not JIT compilation — it is still interpreted — but it eliminates type-checking overhead for common patterns.
Hooking into the Pipeline
You can intercept compilation at several points:
Import hooks (importlib.abc.SourceLoader.source_to_code) — Replace the entire compilation process for imported modules.
AST transformers — Modify the tree between parsing and compilation:
import ast
source = open("module.py").read()
tree = ast.parse(source)
tree = MyTransformer().visit(tree)
code = compile(tree, "module.py", "exec")
exec(code)
compile() built-in — Accepts source strings, AST objects, or (with flags) returns an AST instead of a code object.
sys.settrace() / sys.monitoring — Hook into the eval loop at runtime.
One thing to remember: CPython’s compiler pipeline is a sophisticated multi-stage system that transforms human-readable source into optimized bytecode through tokenization, PEG parsing, AST optimization, symbol resolution, CFG-based compilation, and bytecode optimization. Each stage is well-defined and hookable, giving tool authors and curious developers clear entry points to understand, analyze, and extend how Python processes code.
See Also
- Python Abstract Syntax Trees How Python reads your code like a recipe before cooking it — the hidden tree structure behind every program.
- Python Bytecode Manipulation How Python secretly translates your code into tiny instructions — and how you can peek at and change those instructions yourself.
- Python Code Objects Internals What Python actually creates when it reads your function — the hidden blueprint that tells the computer what to do.
- Python Frame Objects Why Python keeps a notepad for every running function — and how it remembers where it left off.
- Python Peephole Optimizer How Python quietly tidies up your code behind the scenes — making it faster without you lifting a finger.