Python JIT Compiler (Copy and Patch) — Deep Dive

Stencil generation pipeline, Tier 2 trace formation, micro-op optimisation passes, runtime patching mechanics, and comparison with the LuaJIT, V8, and PyPy architectures.

Technical overview

PEP 744 (Python 3.13) introduced an experimental copy-and-patch JIT compiler for CPython. The technique, originally described by Xu and Kjolstad (2021), generates native machine code by stitching together pre-compiled templates. This deep dive covers the full pipeline from bytecode to machine code execution.

The execution pipeline in detail

Source (.py)
    │
    ▼
Compiler → Bytecode (Tier 0)
    │
    ▼
Adaptive interpreter → Specialised bytecodes (Tier 1)
    │  (after 8 executions with consistent types)
    ▼
Trace recorder → Micro-op trace (Tier 2 cold)
    │
    ▼
Trace optimiser → Optimised micro-ops (Tier 2 hot)
    │
    ▼
JIT compiler → Native machine code (Tier 2 JIT)

Tier transition triggers

Tier 0 → Tier 1: Each bytecode instruction has a counter. After 8 type-stable executions, it specialises.
Tier 1 → Tier 2: A backward jump (loop header) or frequently called function triggers trace recording. The threshold is configurable.
Tier 2 → JIT: Currently, all Tier 2 traces are JIT-compiled if the JIT is enabled. Future versions may add a hotness threshold.
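
The counter-driven transitions above can be modelled in a few lines. This is a toy sketch, not CPython's implementation: the `Instruction` and `LoopHeader` classes are hypothetical, `SPECIALISE_AFTER` mirrors the figure of 8 quoted above, and `TRACE_AFTER` is an assumed value chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; real CPython values vary by version.
SPECIALISE_AFTER = 8    # executions before an instruction specialises
TRACE_AFTER = 16        # backward jumps before a trace is recorded (assumed)


@dataclass
class Instruction:
    name: str
    executions: int = 0
    specialised: bool = False

    def execute(self) -> None:
        """Count one execution; specialise once the threshold is reached."""
        self.executions += 1
        if not self.specialised and self.executions >= SPECIALISE_AFTER:
            self.specialised = True


@dataclass
class LoopHeader:
    backward_jumps: int = 0

    def jump_backward(self, jit_enabled: bool) -> str:
        """Count one backward jump and report which tier runs the loop next."""
        self.backward_jumps += 1
        if self.backward_jumps < TRACE_AFTER:
            return "tier1"
        # Every Tier 2 trace is JIT-compiled when the JIT is enabled.
        return "jit" if jit_enabled else "tier2-interpreter"
```

In this model a loop stays in Tier 1 until its header has seen `TRACE_AFTER` backward jumps, after which the only remaining decision is whether the JIT is switched on.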

Micro-operations (uops)

The Tier 2 optimiser works on micro-operations (uops), a lower-level IR than bytecode:

# Python code
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

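# Tier 1 bytecodes (simplified; local slots: n=0, total=1, i=2)
LOAD_GLOBAL      range
LOAD_FAST        0 (n)
CALL             1
GET_ITER
FOR_ITER
STORE_FAST       2 (i)
LOAD_FAST        1 (total)
LOAD_FAST        2 (i)
LOAD_FAST        2 (i)
BINARY_OP_MULTIPLY_INT
BINARY_OP_ADD_INT
STORE_FAST       1 (total)
JUMP_BACKWARD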

# Tier 2 micro-ops (simplified; guard names are schematic)
_SET_IP          loop_header
_CHECK_VALIDITY
_GUARD_TYPE_VERSION  i, int
_LOAD_FAST       2 (i)
_GUARD_TYPE_VERSION  i, int  ← may be eliminated
_BINARY_OP_MULTIPLY_INT
_GUARD_TYPE_VERSION  total, int
_BINARY_OP_ADD_INT
_STORE_FAST      1 (total)
_JUMP_TO_TOP
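
To compare the simplified listings with what a given interpreter actually produces, the standard `dis` module can print specialised instructions. The sketch below assumes Python 3.11 or later for the `adaptive=True` argument and falls back to plain disassembly otherwise; the warm-up count of 100 is an arbitrary choice, and the exact opcode names printed depend on the CPython version.

```python
import dis
import io


def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


# Warm the function up so the adaptive interpreter has a chance to specialise.
for _ in range(100):
    sum_squares(50)


def disassemble(func) -> str:
    """Return the disassembly, with specialised opcodes where supported."""
    buf = io.StringIO()
    try:
        dis.dis(func, file=buf, adaptive=True)   # Python 3.11+
    except TypeError:
        dis.dis(func, file=buf)                  # older versions: generic bytecode
    return buf.getvalue()


listing = disassemble(sum_squares)
print(listing)
```

This shows Tier 1 only; `dis` does not display Tier 2 micro-ops.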

Optimisation passes

The Tier 2 optimiser runs several passes:

Guard elimination: If _GUARD_TYPE_VERSION for the same variable appears multiple times without an intervening store, duplicates are removed.
Dead code elimination: Uops whose results are never read are removed.
Constant folding: If operands are known constants (e.g., from LOAD_CONST), the operation is computed at trace time.
Type propagation: Type information flows forward through the trace, enabling more guard elimination.

# After optimisation
_SET_IP          loop_header
_CHECK_VALIDITY
_GUARD_TYPE_VERSION  i, int    ← kept (first check)
_LOAD_FAST       2 (i)
_BINARY_OP_MULTIPLY_INT        ← guard removed (already checked)
_BINARY_OP_ADD_INT             ← guard removed (int*int=int guaranteed)
_STORE_FAST      1 (total)
_JUMP_TO_TOP
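
The guard-elimination rule can be sketched as a single forward pass over the trace. This is a minimal model under stated assumptions: uops are plain tuples, variables are tracked by name rather than by stack slot, and only the duplicate-guard rule is implemented (type propagation and constant folding are left out). The function name `eliminate_redundant_guards` is hypothetical.

```python
from typing import Dict, List, Tuple

Uop = Tuple[str, ...]   # e.g. ("_GUARD_TYPE_VERSION", "i", "int")


def eliminate_redundant_guards(trace: List[Uop]) -> List[Uop]:
    """Drop a guard when the same (variable, type) fact is already proven."""
    known: Dict[str, str] = {}    # variable -> type established by a guard
    out: List[Uop] = []
    for uop in trace:
        op = uop[0]
        if op == "_GUARD_TYPE_VERSION":
            _, var, typ = uop
            if known.get(var) == typ:
                continue                  # duplicate guard: skip it
            known[var] = typ
        elif op == "_STORE_FAST":
            known.pop(uop[1], None)       # a store invalidates the known type
        out.append(uop)
    return out


trace = [
    ("_GUARD_TYPE_VERSION", "i", "int"),
    ("_LOAD_FAST", "i"),
    ("_GUARD_TYPE_VERSION", "i", "int"),      # redundant
    ("_BINARY_OP_MULTIPLY_INT",),
    ("_STORE_FAST", "total"),
    ("_GUARD_TYPE_VERSION", "total", "int"),  # kept: follows a store
]
optimised = eliminate_redundant_guards(trace)
```

Forgetting a variable's type on every store is deliberately conservative; a pass with type propagation could instead record the type of the stored value and remove more guards.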

Stencil generation pipeline

Build-time process

┌─────────────────────────────────────────────────┐
│ For each micro-op:                              │
│                                                 │
│ 1. C implementation (Tools/jit/template.c)      │
│    ↓                                            │
│ 2. Clang -emit-llvm → LLVM bitcode              │
│    ↓                                            │
│ 3. LLVM → relocatable object (.o)               │
│    ↓                                            │
│ 4. Python script extracts machine code +        │
│    relocation entries → stencil data            │
│    ↓                                            │
│ 5. Generated C header with stencil arrays       │
│    (jit_stencils.h)                             │
└─────────────────────────────────────────────────┘

Each stencil consists of:

Body: Raw machine code bytes
Holes: A list of (offset, kind, value) tuples describing what needs patching
Metadata: Size, alignment requirements
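
As a rough model of the extraction step, a stencil can be represented as plain data built from machine code plus relocation records. The sketch below is illustrative only: the `Hole` and `Stencil` classes, the `make_stencil` helper, the dictionary shape of a relocation record, and the assumption that every hole is a 4-byte field are all hypothetical and do not reflect the layout CPython generates.

```python
from dataclasses import dataclass
from typing import List, Tuple

HOLE_WIDTH = 4  # assumption: every hole is a 32-bit field


@dataclass(frozen=True)
class Hole:
    offset: int   # byte offset of the field inside the body
    kind: str     # e.g. "HOLE_oparg", "HOLE_continue"
    value: int    # extra value interpreted according to `kind`


@dataclass(frozen=True)
class Stencil:
    body: bytes
    holes: Tuple[Hole, ...]
    alignment: int = 1

    @property
    def size(self) -> int:
        return len(self.body)


def make_stencil(code: bytes, relocations: List[dict], alignment: int = 1) -> Stencil:
    """Turn machine code plus relocation records into a stencil."""
    body = bytearray(code)
    holes = []
    for rel in sorted(relocations, key=lambda r: r["offset"]):
        off = rel["offset"]
        if off < 0 or off + HOLE_WIDTH > len(body):
            raise ValueError(f"relocation at {off} falls outside the body")
        body[off:off + HOLE_WIDTH] = bytes(HOLE_WIDTH)   # zero the placeholder
        holes.append(Hole(off, rel["kind"], rel.get("addend", 0)))
    return Stencil(bytes(body), tuple(holes), alignment)
```

Zeroing the placeholder bytes makes it obvious at runtime which fields still need patching, and sorting holes by offset keeps patching order deterministic.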

Example stencil structure

// Illustrative stencil for _BINARY_OP_ADD_INT (not the literal generated output)
static const StencilGroup BINARY_OP_ADD_INT_stencil = {
    .body = {
        0x48, 0x8b, 0x45, 0x00,  // mov rax, [rbp+HOLE]  (displacement is byte 3)
        0x48, 0x8b, 0x5d, 0x00,  // mov rbx, [rbp+HOLE]  (displacement is byte 7)
        0x48, 0x01, 0xd8,        // add rax, rbx
        // ... overflow check, result storage, etc.
    },
    .body_size = 64,
    .holes = {
        {3, HOLE_base, offsetof(Frame, stack[0])},
        {7, HOLE_base, offsetof(Frame, stack[1])},
        // ...
    },
    .holes_size = 3,
};

Runtime patching

Memory management

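// Sketch of the allocate/copy/patch/protect sequence, shown as a fragment of a
// function that returns -1 on failure (needs <sys/mman.h> and <string.h>).
// get_stencil, patch_hole and resolve_value are placeholder helpers.
unsigned char *memory = mmap(NULL, trace_size,
    PROT_READ | PROT_WRITE,  // Initially writable, not executable
    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (memory == MAP_FAILED) {
    return -1;  // Allocation failed: keep running in the interpreter
}

// Copy and patch each stencil
size_t offset = 0;
for (int i = 0; i < trace->length; i++) {
    const Stencil *s = get_stencil(trace->uops[i].opcode);
    memcpy(memory + offset, s->body, s->body_size);

    for (int j = 0; j < s->holes_size; j++) {
        patch_hole(memory + offset + s->holes[j].offset,
                   s->holes[j].kind,
                   resolve_value(trace->uops[i], s->holes[j].value));
    }
    offset += s->body_size;
}

// Make executable (W^X policy: never writable and executable at the same time)
if (mprotect(memory, trace_size, PROT_READ | PROT_EXEC) != 0) {
    munmap(memory, trace_size);
    return -1;
}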
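
The same copy-then-patch loop can be simulated without executable memory. The sketch below is a toy model, not CPython's code: the `STENCILS` table, its placeholder byte values, the 32-bit little-endian hole width, and the use of absolute addresses for `HOLE_continue` are all assumptions made for illustration.

```python
import struct

# Toy stencils: (body bytes, [(hole offset, hole kind)]).
# The byte values are placeholders, not real machine code.
STENCILS = {
    "_LOAD_FAST":  (bytes([0x10, 0, 0, 0, 0, 0xE9, 0, 0, 0, 0]),
                    [(1, "HOLE_oparg"), (6, "HOLE_continue")]),
    "_STORE_FAST": (bytes([0x20, 0, 0, 0, 0, 0xE9, 0, 0, 0, 0]),
                    [(1, "HOLE_oparg"), (6, "HOLE_continue")]),
}


def compile_trace(trace, base_address=0x1000):
    """Copy each uop's stencil into one buffer, then fill in its holes."""
    total = sum(len(STENCILS[op][0]) for op, _ in trace)
    memory = bytearray(total)
    offset = 0
    for op, oparg in trace:
        body, holes = STENCILS[op]
        memory[offset:offset + len(body)] = body              # copy
        next_addr = base_address + offset + len(body)         # next stencil
        for hole_offset, kind in holes:                       # patch
            value = oparg if kind == "HOLE_oparg" else next_addr
            struct.pack_into("<I", memory, offset + hole_offset, value)
        offset += len(body)
    return bytes(memory)


code = compile_trace([("_LOAD_FAST", 2), ("_STORE_FAST", 1)])
```

Because each stencil's continuation hole points at the byte immediately after it, laying stencils out back to back is enough to chain them into a trace.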

Hole types

Hole type	Resolution
`HOLE_base`	Frame pointer offset
`HOLE_oparg`	Instruction argument value
`HOLE_operand`	64-bit operand (object pointer, jump target)
`HOLE_continue`	Address of the next stencil in the trace
`HOLE_deopt`	Address to jump to on guard failure (back to interpreter)

Deoptimisation

When a guard fails (type changed, counter overflowed), execution transfers back to the Tier 1 interpreter:

JIT code: _GUARD_TYPE_VERSION
    mov type, [obj+ob_type]           ; the version tag lives on the type object
    cmp [type+tp_version_tag], expected_version
    jne HOLE_deopt    ← patched to interpreter entry point
    ; continue fast path...

The interpreter resumes at the correct bytecode offset, and the trace may be marked for re-compilation if types stabilise to new values.
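
The control flow of a guard failure can be imitated in pure Python. The sketch below is an analogy under stated assumptions: an exception stands in for the jump to the deopt target, `generic_square_sum` stands in for the interpreter, and the resume offsets 4 and 8 are made-up values.

```python
class Deopt(Exception):
    """Raised when a guard fails; carries the bytecode offset to resume at."""

    def __init__(self, resume_offset):
        super().__init__(resume_offset)
        self.resume_offset = resume_offset


def guard_type(value, expected_type, resume_offset):
    """Fail over to the slow path unless the value has exactly the expected type."""
    if type(value) is not expected_type:
        raise Deopt(resume_offset)


def fast_square_sum(total, i):
    """Specialised path: valid only while both operands are ints."""
    guard_type(i, int, resume_offset=4)
    guard_type(total, int, resume_offset=8)
    return total + i * i


def generic_square_sum(total, i, resume_offset):
    """Stand-in for the interpreter, which handles any operand types."""
    return total + i * i


def run(total, i):
    """Try the fast path first; fall back to the generic path on guard failure."""
    try:
        return fast_square_sum(total, i), "jit"
    except Deopt as d:
        return generic_square_sum(total, i, d.resume_offset), f"deopt@{d.resume_offset}"
```

The point carried over from the real design is that the fast path never has to handle unexpected types itself: each guard only needs to know where the general-purpose code should pick up.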

Comparison with other JIT approaches

vs. LuaJIT (tracing JIT)

Aspect	CPython copy-and-patch	LuaJIT
IR	Micro-ops (CPython-specific)	SSA IR (custom)
Register allocation	None (stack-based)	Linear scan
Compilation speed	~10µs per trace	~100µs per trace
Code quality	Good	Excellent
Maintenance burden	Low (C stencils)	High (hand-tuned assembler)
Hot loop speedup	1.02-1.10×	5-50×

vs. V8 TurboFan (method JIT)

Aspect	CPython copy-and-patch	V8 TurboFan
Compilation unit	Trace (linear path)	Function (full graph)
Inlining	Not yet	Aggressive
Escape analysis	No	Yes
Compilation speed	~10µs	~10ms
Team size needed	2-3 engineers	20+ engineers

vs. PyPy (meta-tracing JIT)

Aspect	CPython copy-and-patch	PyPy
Approach	Trace + stitch	Meta-tracing RPython
C extension compat	Full	Limited (cpyext overhead)
Startup overhead	Minimal	Significant
Steady-state speed	Modest improvement	2-10× faster
Memory overhead	Low	Higher

Platform support

Platform	Status
x86-64 Linux	Supported
x86-64 macOS	Supported
x86-64 Windows	Supported
AArch64 Linux	Supported
AArch64 macOS (Apple Silicon)	Supported
32-bit platforms	Not supported, except 32-bit Windows (i686), which PEP 744 lists among the supported targets
RISC-V, s390x	Not yet

What’s planned for 3.14 and beyond

Function inlining: Inline small callees directly into the trace, eliminating call/return overhead
Better register usage: Allocate frequently-used values to registers instead of stack slots
Loop unrolling: Repeat loop bodies to reduce branch overhead
Superblock formation: Extend traces through branches, not just linear paths
Profile-guided optimisation: Use runtime profiles to guide trace selection

The goal is to achieve 2-5× speedup on pyperformance by Python 3.16-3.17.

Debugging the JIT

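# Check JIT status (sys._jit exists from Python 3.14; it is absent in 3.13)
import sys
if hasattr(sys, '_jit'):
    print(sys._jit.is_available())  # Was this build compiled with JIT support?
    print(sys._jit.is_enabled())    # Is the JIT enabled for this process?

# Environment variables (JIT-capable builds only)
# PYTHON_JIT=1          Enable JIT
# PYTHON_JIT=0          Disable JIT
# PYTHON_LLTRACE=2      Print Tier 2 trace debug output (debug builds only)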

# Profile JIT output (requires perf or lldb)
perf record -g python3.13 script.py
perf annotate  # JIT-compiled code typically shows up as anonymous, unsymbolised regions

# Or with LLDB
lldb python3.13 -- script.py
# Set breakpoint on _PyJIT_Compile to inspect generated code

The one thing to remember: Copy-and-patch achieves the seemingly impossible — a production-quality JIT compiler maintainable by a small team — by outsourcing code generation to LLVM at build time and doing only simple copy-and-fill at runtime.
