Python JIT Compiler (Copy and Patch) — Deep Dive

Technical overview

PEP 744 (Python 3.13) introduced an experimental copy-and-patch JIT compiler for CPython. The technique, originally described by Xu and Kjolstad (2021), generates native machine code by stitching together pre-compiled templates. This deep dive covers the full pipeline from bytecode to machine code execution.

The execution pipeline in detail

Source (.py)
    │
    ▼
Compiler → Bytecode (Tier 0)
    │
    ▼
Adaptive interpreter → Specialised bytecodes (Tier 1)
    │  (after 8 executions with consistent types)
    ▼
Trace recorder → Micro-op trace (Tier 2 cold)
    │
    ▼
Trace optimiser → Optimised micro-ops (Tier 2 hot)
    │
    ▼
JIT compiler → Native machine code (Tier 2 JIT)

Tier transition triggers

  • Tier 0 → Tier 1: Each bytecode instruction has a counter. After 8 type-stable executions, it specialises.
  • Tier 1 → Tier 2: A backward jump (loop header) or frequently called function triggers trace recording. The threshold is configurable.
  • Tier 2 → JIT: Currently, all Tier 2 traces are JIT-compiled if the JIT is enabled. Future versions may add a hotness threshold.
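The first transition can be observed from Python itself. The standard `dis` module accepts `adaptive=True` (Python 3.11 and later), which shows specialised instructions once the interpreter has rewritten them. The sketch below is only an illustration: the warm-up count of 1000 is arbitrary, `opnames` is a hypothetical helper, and which specialised names appear (if any) depends on the CPython version and build.

```python
import dis


def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def opnames(func, adaptive):
    """Return the instruction names for func, specialised or generic."""
    try:
        instructions = dis.get_instructions(func, adaptive=adaptive)
    except TypeError:
        # Python < 3.11 has no adaptive interpreter and no such keyword.
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]


# Warm the function up so the adaptive interpreter has a chance to specialise.
for _ in range(1000):
    sum_squares(10)

generic = opnames(sum_squares, adaptive=False)
specialised = opnames(sum_squares, adaptive=True)

# Names that appear only in the adaptive view are the specialised forms.
new_names = sorted(set(specialised) - set(generic))
print("specialised instructions:", new_names or "none observed")
```

Comparing the two listings rather than hard-coding expected names keeps the check valid across versions, since specialised opcode names change between releases.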

Micro-operations (uops)

The Tier 2 optimiser works with micro-operations (uops), a lower-level IR than bytecode:

# Python code
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

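# Tier 1 bytecodes (simplified; locals are n=0, total=1, i=2)
LOAD_GLOBAL      range
LOAD_FAST        0 (n)
CALL             1
GET_ITER
FOR_ITER_RANGE
STORE_FAST       2 (i)
LOAD_FAST        1 (total)
LOAD_FAST        2 (i)
LOAD_FAST        2 (i)
BINARY_OP_MULTIPLY_INT
BINARY_OP_ADD_INT
STORE_FAST       1 (total)
JUMP_BACKWARD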

# Tier 2 micro-ops (simplified)
_SET_IP          loop_header
_CHECK_VALIDITY
_GUARD_TYPE_VERSION  i, int
_LOAD_FAST       2 (i)
_GUARD_TYPE_VERSION  i, int  ← may be eliminated
_BINARY_OP_MULTIPLY_INT
_GUARD_TYPE_VERSION  total, int
_BINARY_OP_ADD_INT
_STORE_FAST      1 (total)
_JUMP_TO_TOP

Optimisation passes

The Tier 2 optimiser runs several passes:

  1. Guard elimination: If _GUARD_TYPE_VERSION for the same variable appears multiple times without an intervening store, duplicates are removed.

  2. Dead code elimination: Uops whose results are never read are removed.

  3. Constant folding: If operands are known constants (e.g., from LOAD_CONST), the operation is computed at trace time.

  4. Type propagation: Type information flows forward through the trace, enabling more guard elimination.

# After optimisation
_SET_IP          loop_header
_CHECK_VALIDITY
_GUARD_TYPE_VERSION  i, int    ← kept (first check)
_LOAD_FAST       2 (i)
_BINARY_OP_MULTIPLY_INT        ← guard removed (already checked)
_BINARY_OP_ADD_INT             ← guard removed (int*int=int guaranteed)
_STORE_FAST      1 (total)
_JUMP_TO_TOP
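The first pass in the list above can be sketched in a few lines of Python. This is a toy model, not CPython's optimiser: the trace is a hypothetical list of (uop name, variable) pairs, and `eliminate_redundant_guards` is an illustrative name. It handles duplicate guards only; removing a guard because a result type is already known would also need type propagation.

```python
from typing import List, Optional, Tuple

# Hypothetical trace representation: (uop name, variable it refers to, or None).
Uop = Tuple[str, Optional[str]]


def eliminate_redundant_guards(trace: List[Uop]) -> List[Uop]:
    """Drop a guard if the same variable was guarded earlier with no store since."""
    guarded = set()
    optimised = []
    for name, var in trace:
        if name == "_GUARD_TYPE_VERSION":
            if var in guarded:
                continue          # duplicate guard: type already established
            guarded.add(var)
        elif name == "_STORE_FAST":
            guarded.discard(var)  # a store may change the variable's type
        optimised.append((name, var))
    return optimised


trace = [
    ("_GUARD_TYPE_VERSION", "i"),
    ("_LOAD_FAST", "i"),
    ("_GUARD_TYPE_VERSION", "i"),   # duplicate, removed
    ("_BINARY_OP_MULTIPLY_INT", None),
    ("_STORE_FAST", "i"),
    ("_GUARD_TYPE_VERSION", "i"),   # kept: a store intervened
]
optimised_trace = eliminate_redundant_guards(trace)
for uop in optimised_trace:
    print(uop)
```

The store handling is the important detail: without it, the pass would remove a guard that is still needed after the variable has been reassigned.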

Stencil generation pipeline

Build-time process

┌─────────────────────────────────────────────────┐
│ For each micro-op:                              │
│                                                 │
│ 1. C implementation (Tools/jit/template.c)      │
│    ↓                                            │
│ 2. Clang -emit-llvm → LLVM bitcode              │
│    ↓                                            │
│ 3. LLVM → relocatable object (.o)               │
│    ↓                                            │
│ 4. Python script extracts machine code +        │
│    relocation entries → stencil data            │
│    ↓                                            │
│ 5. Generated C header with stencil arrays       │
│    (jit_stencils.h)                             │
└─────────────────────────────────────────────────┘

Each stencil consists of:

  • Body: Raw machine code bytes
  • Holes: A list of (offset, kind, value) tuples describing what needs patching
  • Metadata: Size, alignment requirements
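To make the body and holes split concrete, here is a minimal Python sketch of the extraction step. It assumes a simplified input (raw section bytes plus a list of relocations) and is not the actual build script: `Relocation`, `build_stencil`, and `to_c_array` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Tuple


class Relocation(NamedTuple):
    """Assumed shape of one relocation entry from the object file."""
    offset: int   # byte offset of the placeholder inside the code
    size: int     # width of the placeholder in bytes
    symbol: str   # what must be filled in at run time


def build_stencil(code: bytes, relocations: List[Relocation]) -> Tuple[bytes, List[Tuple[int, str]]]:
    """Zero every relocated placeholder and record it as a hole."""
    body = bytearray(code)
    holes = []
    for reloc in sorted(relocations, key=lambda r: r.offset):
        if reloc.offset + reloc.size > len(body):
            raise ValueError(f"relocation at {reloc.offset} is outside the code")
        body[reloc.offset:reloc.offset + reloc.size] = bytes(reloc.size)
        holes.append((reloc.offset, reloc.symbol))
    return bytes(body), holes


def to_c_array(name: str, body: bytes) -> str:
    """Render the body as a C array, as a generated header might."""
    items = ", ".join(f"0x{b:02x}" for b in body)
    return f"static const unsigned char {name}[{len(body)}] = {{{items}}};"


code = bytes([0x48, 0x8B, 0x45, 0xFF, 0x48, 0x01, 0xD8])
body, holes = build_stencil(code, [Relocation(offset=3, size=1, symbol="HOLE_base")])
print(to_c_array("demo_body", body))
print(holes)
```

Zeroing the placeholder bytes is a choice made for this sketch; it makes an unpatched hole easy to spot when inspecting the output.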

Example stencil structure

// Stencil for _BINARY_OP_ADD_INT (illustrative, not actual generator output)
static const StencilGroup BINARY_OP_ADD_INT_stencil = {
    .body = {
        0x48, 0x8b, 0x45, 0x00,  // mov rax, [rbp+HOLE]  (hole is byte 3)
        0x48, 0x8b, 0x5d, 0x00,  // mov rbx, [rbp+HOLE]  (hole is byte 7)
        0x48, 0x01, 0xd8,        // add rax, rbx
        // ... overflow check, result storage, etc.
    },
    .body_size = 64,
    .holes = {
        // {offset into body, kind, value}
        {3, HOLE_base, offsetof(Frame, stack[0])},
        {7, HOLE_base, offsetof(Frame, stack[1])},
        // ...
    },
    .holes_size = 3,
};

Runtime patching

Memory management

#include <string.h>    // memcpy
#include <sys/mman.h>  // mmap, mprotect, munmap

// Allocate memory for the trace: writable now, executable later
char *memory = mmap(NULL, trace_size,
    PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (memory == MAP_FAILED) {
    return -1;
}

// Copy and patch each stencil
size_t offset = 0;
for (int i = 0; i < trace->length; i++) {
    const Stencil *s = get_stencil(trace->uops[i].opcode);
    memcpy(memory + offset, s->body, s->body_size);

    for (int j = 0; j < s->holes_size; j++) {
        patch_hole(memory + offset + s->holes[j].offset,
                   s->holes[j].kind,
                   resolve_value(trace->uops[i], s->holes[j].value));
    }
    offset += s->body_size;
}

// Drop write access and make the code executable (W^X policy)
if (mprotect(memory, trace_size, PROT_READ | PROT_EXEC) != 0) {
    munmap(memory, trace_size);
    return -1;
}
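The same copy-then-patch loop can be modelled in Python on a plain `bytearray`, which makes the hole resolution easy to inspect without executable memory. Everything here is an assumption made for illustration: the `Stencil` class, the marker bytes (not real instructions), the fixed 8-byte hole width, and the pretend load address.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stencil:
    body: bytes                         # template bytes with zeroed holes
    holes: Tuple[Tuple[int, str], ...]  # (offset into body, hole kind)


# Placeholder stencils: the marker bytes are made up, not real machine code.
STENCILS = {
    "_LOAD_FAST": Stencil(b"\xAA" + bytes(8) + b"\xBB" + bytes(8),
                          ((1, "oparg"), (10, "continue"))),
    "_JUMP_TO_TOP": Stencil(b"\xCC" + bytes(8), ((1, "operand"),)),
}


def copy_and_patch(trace: List[Tuple[str, int]], base_address: int) -> bytearray:
    """Append each stencil body, then fill its 8-byte holes in place."""
    code = bytearray()
    for name, value in trace:
        stencil = STENCILS[name]
        start = len(code)
        code += stencil.body                     # copy
        for hole_offset, kind in stencil.holes:  # patch
            if kind in ("oparg", "operand"):
                resolved = value
            elif kind == "continue":
                # Address of whatever follows this stencil in the trace.
                resolved = base_address + start + len(stencil.body)
            else:
                raise ValueError(f"unknown hole kind: {kind}")
            struct.pack_into("<Q", code, start + hole_offset, resolved)
    return code


BASE = 0x1000  # pretend load address of the trace
code = copy_and_patch([("_LOAD_FAST", 2), ("_JUMP_TO_TOP", BASE)], BASE)
print(code.hex())
```

Note that a continue hole can only be resolved once the position of the current stencil is known, which is why patching happens after the copy rather than at build time.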

Hole types

Hole type      Resolution
-------------  ---------------------------------------------------------
HOLE_base      Frame pointer offset
HOLE_oparg     Instruction argument value
HOLE_operand   64-bit operand (object pointer, jump target)
HOLE_continue  Address of the next stencil in the trace
HOLE_deopt     Address to jump to on guard failure (back to interpreter)

Deoptimisation

When a guard fails (type changed, counter overflowed), execution transfers back to the Tier 1 interpreter:

JIT code: _GUARD_TYPE_VERSION
    cmp [obj+tp_version_tag], expected_version
    jne HOLE_deopt    ← patched to interpreter entry point
    ; continue fast path...

The interpreter resumes at the correct bytecode offset, and the trace may be marked for re-compilation if types stabilise to new values.
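The control flow of a guard failure can be mimicked in plain Python. This is a conceptual sketch only: `Deopt`, `fast_add_int`, `generic_add`, and `run` are hypothetical names, and an exception stands in for the jump that the patched deopt hole performs in real JIT code.

```python
class Deopt(Exception):
    """Signals that a guard failed and the generic path must take over."""


def fast_add_int(a, b):
    # Guard: the specialised path is valid only for exact ints.
    if type(a) is not int or type(b) is not int:
        raise Deopt
    return a + b  # stands in for the specialised machine code


def generic_add(a, b):
    return a + b  # stands in for the interpreter's generic BINARY_OP


def run(a, b, stats):
    """Try the specialised path first and fall back when its guard fails."""
    try:
        result = fast_add_int(a, b)
        stats["fast"] += 1
    except Deopt:
        result = generic_add(a, b)
        stats["deopt"] += 1
    return result


stats = {"fast": 0, "deopt": 0}
print(run(2, 3, stats))    # ints: stays on the fast path
print(run(1.5, 2, stats))  # float operand: guard fails, generic path runs
print(stats)
```

The result is the same on both paths; only the cost differs, which is why a failed guard affects performance but never correctness.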

Comparison with other JIT approaches

vs. LuaJIT (tracing JIT)

Aspect               CPython copy-and-patch        LuaJIT
-------------------  ----------------------------  ---------------------------
IR                   Micro-ops (CPython-specific)  SSA IR (custom)
Register allocation  None (stack-based)            Linear scan
Compilation speed    ~10µs per trace               ~100µs per trace
Code quality         Good                          Excellent
Maintenance burden   Low (C stencils)              High (hand-tuned assembler)
Hot loop speedup     1.02-1.10×                    5-50×

vs. V8 TurboFan (method JIT)

Aspect             CPython copy-and-patch  V8 TurboFan
-----------------  ----------------------  ---------------------
Compilation unit   Trace (basic block)     Function (full graph)
Inlining           Not yet                 Aggressive
Escape analysis    No                      Yes
Compilation speed  ~10µs                   ~10ms
Team size needed   2-3 engineers           20+ engineers

vs. PyPy (meta-tracing JIT)

Aspect              CPython copy-and-patch  PyPy
------------------  ----------------------  -------------------------
Approach            Trace + stitch          Meta-tracing RPython
C extension compat  Full                    Limited (cpyext overhead)
Startup overhead    Minimal                 Significant
Steady-state speed  Modest improvement      2-10× faster
Memory overhead     Low                     Higher

Platform support

Platform                       Status
-----------------------------  -------------
x86-64 Linux                   Supported
x86-64 macOS                   Supported
x86-64 Windows                 Supported
AArch64 Linux                  Supported
AArch64 macOS (Apple Silicon)  Supported
32-bit Windows (x86)           Supported
Other 32-bit platforms         Not supported
RISC-V, s390x                  Not yet

What’s planned for 3.14 and beyond

  1. Function inlining: Inline small callees directly into the trace, eliminating call/return overhead
  2. Better register usage: Allocate frequently-used values to registers instead of stack slots
  3. Loop unrolling: Repeat loop bodies to reduce branch overhead
  4. Superblock formation: Extend traces through branches, not just linear paths
  5. Profile-guided optimisation: Use runtime profiles to guide trace selection

The goal is to achieve 2-5× speedup on pyperformance by Python 3.16-3.17.

Debugging the JIT

# Check JIT state (sys._jit exists from Python 3.14; 3.13 does not have it)
import sys
if hasattr(sys, '_jit'):
    print(sys._jit.is_available())  # was this build compiled with JIT support?
    print(sys._jit.is_enabled())    # is the JIT switched on for this process?

# Environment variables
# PYTHON_JIT=1          Enable JIT
# PYTHON_JIT=0          Disable JIT
# PYTHON_JIT_DEBUG=1    Print JIT compilation events

# Disassemble JIT output (requires perf or lldb)
perf record -g python3.13 script.py
perf annotate  # Shows JIT-compiled regions

# Or with LLDB
lldb python3.13 -- script.py
# Set breakpoint on _PyJIT_Compile to inspect generated code

The one thing to remember: copy-and-patch puts a JIT compiler within reach of a small team, because code generation is outsourced to LLVM at build time and the runtime only copies templates and fills in holes.
