Python JIT Compiler (Copy and Patch) — Deep Dive
Technical overview
PEP 744 (Python 3.13) introduced an experimental copy-and-patch JIT compiler for CPython. The technique, originally described by Xu and Kjolstad (2021), generates native machine code by stitching together pre-compiled templates. This deep dive covers the full pipeline from bytecode to machine code execution.
The execution pipeline in detail
Source (.py)
│
▼
Compiler → Bytecode (Tier 0)
│
▼
Adaptive interpreter → Specialised bytecodes (Tier 1)
│ (after 8 executions with consistent types)
▼
Trace recorder → Micro-op trace (Tier 2 cold)
│
▼
Trace optimiser → Optimised micro-ops (Tier 2 hot)
│
▼
JIT compiler → Native machine code (Tier 2 JIT)
Tier transition triggers
- Tier 0 → Tier 1: Each bytecode instruction has a counter. After 8 type-stable executions, it specialises.
- Tier 1 → Tier 2: A backward jump (loop header) or frequently called function triggers trace recording. The threshold is configurable.
- Tier 2 → JIT: Currently, all Tier 2 traces are JIT-compiled if the JIT is enabled. Future versions may add a hotness threshold.
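The Tier 0 → Tier 1 trigger above can be pictured as a small per-instruction counter. The following is a toy model only: the class name `AdaptiveCounter`, its fields, and the reset-on-type-change policy are assumptions made for illustration, and CPython's real counters are stored in inline caches and are more involved.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SPECIALISE_AFTER = 8  # threshold quoted in the text above


@dataclass
class AdaptiveCounter:
    """Toy per-instruction counter (hypothetical, not CPython's layout)."""
    seen_types: Optional[Tuple[type, ...]] = None
    stable_runs: int = 0
    specialised: bool = False

    def record(self, *operands: object) -> bool:
        """Record one execution; return True once the instruction is specialised."""
        types = tuple(type(op) for op in operands)
        if types == self.seen_types:
            self.stable_runs += 1
        else:
            # Types changed: start counting again and drop any specialisation.
            self.seen_types = types
            self.stable_runs = 1
            self.specialised = False
        if self.stable_runs >= SPECIALISE_AFTER:
            self.specialised = True
        return self.specialised


counter = AdaptiveCounter()
history = [counter.record(i, i) for i in range(8)]
print(history)
```

In this model the eighth type-stable execution flips the instruction to its specialised form, and a later call with different operand types resets the count.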
Micro-operations (uops)
The Tier 2 optimiser works with micro-operations (uops), a lower-level IR than bytecode:
# Python code
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total
# Tier 1 bytecodes (simplified)
LOAD_FAST 0 (n)
CALL_INTRINSIC RANGE
GET_ITER
FOR_ITER
LOAD_FAST 1 (i)
BINARY_OP_MUL_INT
BINARY_OP_ADD_INT_INPLACE
JUMP_BACKWARD
# Tier 2 micro-ops (simplified)
_SET_IP loop_header
_CHECK_VALIDITY
_GUARD_TYPE_VERSION i, int
_LOAD_FAST 1
_GUARD_TYPE_VERSION i, int ← may be eliminated
_BINARY_OP_MUL_INT
_GUARD_TYPE_VERSION total, int
_BINARY_OP_ADD_INT
_STORE_FAST 0
_JUMP_TO_TOP
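To look at the specialised bytecode on your own interpreter, the standard `dis` module accepts `adaptive=True` on Python 3.11 and later. The sketch below warms the function up first; the warm-up count of 1000 is an arbitrary choice, and the exact instruction names printed will vary between Python versions. It shows bytecode only, not the Tier 2 micro-ops.

```python
import dis
import io
import sys


def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


# Warm the function up so the adaptive interpreter has a chance to specialise.
for _ in range(1000):
    sum_squares(100)

# dis.dis() gained the `adaptive` keyword in Python 3.11; fall back otherwise.
options = {"adaptive": True} if sys.version_info >= (3, 11) else {}
buffer = io.StringIO()
dis.dis(sum_squares, file=buffer, **options)
listing = buffer.getvalue()
print(listing)
```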
Optimisation passes
The Tier 2 optimiser runs several passes:
- Guard elimination: If _GUARD_TYPE_VERSION for the same variable appears multiple times without an intervening store, duplicates are removed.
- Dead code elimination: Uops whose results are never read are removed.
- Constant folding: If operands are known constants (e.g., from LOAD_CONST), the operation is computed at trace time.
- Type propagation: Type information flows forward through the trace, enabling more guard elimination.
# After optimisation
_SET_IP loop_header
_CHECK_VALIDITY
_GUARD_TYPE_VERSION i, int ← kept (first check)
_LOAD_FAST 1
_BINARY_OP_MUL_INT ← guard removed (already checked)
_BINARY_OP_ADD_INT ← guard removed (int*int=int guaranteed)
_STORE_FAST 0
_JUMP_TO_TOP
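The guard-elimination pass can be sketched in a few lines. This is a simplified model, not CPython's optimiser: uops are represented as `(opcode, operand)` tuples, the function name `eliminate_redundant_guards` is made up for illustration, and only duplicate guards are handled (the type-propagation step that removes the guard on `total` is not modelled).

```python
from typing import List, Set, Tuple

Uop = Tuple[str, str]  # (opcode, operand); a simplified stand-in for a real uop


def eliminate_redundant_guards(trace: List[Uop]) -> List[Uop]:
    """Drop a type guard when the same variable was already guarded
    and has not been stored to since."""
    guarded: Set[str] = set()
    optimised: List[Uop] = []
    for opcode, operand in trace:
        if opcode == "_GUARD_TYPE_VERSION":
            if operand in guarded:
                continue  # duplicate guard: the earlier check still holds
            guarded.add(operand)
        elif opcode == "_STORE_FAST":
            # A store may change the variable's type, so forget what we knew.
            guarded.discard(operand)
        optimised.append((opcode, operand))
    return optimised


trace = [
    ("_GUARD_TYPE_VERSION", "i"),
    ("_LOAD_FAST", "i"),
    ("_GUARD_TYPE_VERSION", "i"),   # redundant: nothing stored to i since
    ("_BINARY_OP_MUL_INT", ""),
    ("_STORE_FAST", "i"),
    ("_GUARD_TYPE_VERSION", "i"),   # needed again after the store
]
result = eliminate_redundant_guards(trace)
for uop in result:
    print(uop)
```

The second guard disappears, while the guard after the store survives because the store invalidates what was known about `i`.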
Stencil generation pipeline
Build-time process
┌─────────────────────────────────────────────────┐
│ For each micro-op: │
│ │
│ 1. C implementation (Tools/jit/template.c) │
│ ↓ │
│ 2. Clang -emit-llvm → LLVM bitcode │
│ ↓ │
│ 3. LLVM → relocatable object (.o) │
│ ↓ │
│ 4. Python script extracts machine code + │
│ relocation entries → stencil data │
│ ↓ │
│ 5. Generated C header with stencil arrays │
│ (jit_stencils.h) │
└─────────────────────────────────────────────────┘
Each stencil consists of:
- Body: Raw machine code bytes
- Holes: A list of (offset, kind, value) tuples describing what needs patching
- Metadata: Size, alignment requirements
Example stencil structure
// Generated stencil for _BINARY_OP_ADD_INT (illustrative, not real generated output)
static const StencilGroup BINARY_OP_ADD_INT_stencil = {
    .body = {
        0x48, 0x8b, 0x45, 0x00,  // mov rax, [rbp+HOLE]  (hole is the byte at offset 3)
        0x48, 0x8b, 0x5d, 0x00,  // mov rbx, [rbp+HOLE]  (hole is the byte at offset 7)
        0x48, 0x01, 0xd8,        // add rax, rbx
        // ... overflow check, result storage, etc.
    },
    .body_size = 64,
    .holes = {
        {3, HOLE_base, offsetof(Frame, stack[0])},
        {7, HOLE_base, offsetof(Frame, stack[1])},
        // ...
    },
    .holes_size = 3,
};
Runtime patching
Memory management
// Allocate memory for the trace (simplified sketch)
char *memory = mmap(NULL, trace_size,
                    PROT_READ | PROT_WRITE,  // Initially writable, not executable
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (memory == MAP_FAILED) {
    return -1;  // Allocation failed; keep running in the interpreter
}
// Copy and patch each stencil
size_t offset = 0;
for (int i = 0; i < trace->length; i++) {
    const Stencil *s = get_stencil(trace->uops[i].opcode);
    memcpy(memory + offset, s->body, s->body_size);
    for (int j = 0; j < s->holes_size; j++) {
        patch_hole(memory + offset + s->holes[j].offset,
                   s->holes[j].kind,
                   resolve_value(trace->uops[i], s->holes[j].value));
    }
    offset += s->body_size;
}
// Make executable and no longer writable (W^X policy)
if (mprotect(memory, trace_size, PROT_READ | PROT_EXEC) != 0) {
    munmap(memory, trace_size);
    return -1;
}
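The copy-and-patch loop itself is easy to model without executing anything. The sketch below assumes a hypothetical `Stencil` dataclass with 4-byte little-endian holes, and the byte values are made up rather than real machine code. It mirrors only the copy and patch steps; the `mmap`/`mprotect` handling is left out.

```python
import struct
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stencil:
    body: bytes                          # pre-compiled code with zeroed holes
    holes: Tuple[Tuple[int, str], ...]   # (byte offset into body, value name)


def copy_and_patch(stencils: List[Stencil], values: List[Dict[str, int]]) -> bytearray:
    """Concatenate stencil bodies and fill each 4-byte hole, little-endian."""
    code = bytearray()
    for stencil, operands in zip(stencils, values):
        start = len(code)
        code += stencil.body                       # copy
        for offset, name in stencil.holes:         # patch
            struct.pack_into("<I", code, start + offset, operands[name])
    return code


# Two made-up 6-byte stencils: one opcode byte, a 4-byte hole, one trailer byte.
load = Stencil(body=bytes([0xA1, 0, 0, 0, 0, 0x90]), holes=((1, "oparg"),))
add = Stencil(body=bytes([0xB2, 0, 0, 0, 0, 0x90]), holes=((1, "operand"),))

code = copy_and_patch([load, add], [{"oparg": 1}, {"operand": 0x1234}])
print(code.hex())  # a10100000090b23412000090
```

Because the runtime work is a `memcpy` plus a handful of fixed-width writes, the cost per uop stays small and roughly constant.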
Hole types
| Hole type | Resolution |
|---|---|
| HOLE_base | Frame pointer offset |
| HOLE_oparg | Instruction argument value |
| HOLE_operand | 64-bit operand (object pointer, jump target) |
| HOLE_continue | Address of the next stencil in the trace |
| HOLE_deopt | Address to jump to on guard failure (back to interpreter) |
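Address-valued holes such as HOLE_continue can only be filled once the position of every stencil is known, which suggests a two-pass scheme: lay out the trace, then resolve. The sketch below is one way to picture that. The addresses `BASE_ADDRESS` and `DEOPT_STUB` and the function names are hypothetical, and HOLE_base is omitted for brevity.

```python
from typing import Dict, List

BASE_ADDRESS = 0x1000   # hypothetical load address of the trace
DEOPT_STUB = 0x9000     # hypothetical address of the interpreter re-entry stub


def layout(sizes: List[int], base: int = BASE_ADDRESS) -> List[int]:
    """Pass 1: compute the start address of every stencil in the trace."""
    addresses = []
    cursor = base
    for size in sizes:
        addresses.append(cursor)
        cursor += size
    return addresses


def resolve(kind: str, index: int, addresses: List[int], sizes: List[int],
            uop: Dict[str, int]) -> int:
    """Pass 2: turn a hole kind into the concrete value to write."""
    if kind == "HOLE_continue":
        # The next stencil starts where this one ends.
        return addresses[index] + sizes[index]
    if kind == "HOLE_deopt":
        return DEOPT_STUB
    if kind in ("HOLE_oparg", "HOLE_operand"):
        return uop[kind]
    raise ValueError(f"unknown hole kind: {kind}")


sizes = [16, 24, 8]
addresses = layout(sizes)
next_after_first = resolve("HOLE_continue", 0, addresses, sizes, {})
print([hex(a) for a in addresses], hex(next_after_first))
```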
Deoptimisation
When a guard fails (type changed, counter overflowed), execution transfers back to the Tier 1 interpreter:
JIT code: _GUARD_TYPE_VERSION
cmp [obj+tp_version_tag], expected_version
jne HOLE_deopt ← patched to interpreter entry point
; continue fast path...
The interpreter resumes at the correct bytecode offset, and the trace may be marked for re-compilation if types stabilise to new values.
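The control flow of a deoptimisation can be imitated in plain Python. In this sketch an exception stands in for the patched jump to the interpreter, and the names `fast_add_int`, `generic_add`, and `GuardFailed` are invented for illustration. Real JIT code performs a direct jump instead of raising.

```python
from typing import List


class GuardFailed(Exception):
    """Raised by the fast path when an assumption no longer holds."""


def fast_add_int(a: object, b: object) -> int:
    # Guard: the compiled trace assumed both operands are exactly int.
    if type(a) is not int or type(b) is not int:
        raise GuardFailed
    return a + b


def generic_add(a, b):
    # Stand-in for the interpreter's fully general binary-op handling.
    return a + b


deopt_log: List[str] = []


def run(a, b):
    try:
        return fast_add_int(a, b)
    except GuardFailed:
        # Guard failed: record it and fall back to the slow, general path.
        deopt_log.append(f"deopt on ({type(a).__name__}, {type(b).__name__})")
        return generic_add(a, b)


results = [run(2, 3), run(2.5, 3), run("a", "b")]
print(results, deopt_log)
```

The first call stays on the fast path; the other two fail the guard and still produce correct results through the general path, which is the property deoptimisation must preserve.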
Comparison with other JIT approaches
vs. LuaJIT (tracing JIT)
| Aspect | CPython copy-and-patch | LuaJIT |
|---|---|---|
| IR | Micro-ops (CPython-specific) | SSA IR (custom) |
| Register allocation | None (stack-based) | Linear scan |
| Compilation speed | ~10µs per trace | ~100µs per trace |
| Code quality | Good | Excellent |
| Maintenance burden | Low (C stencils) | High (hand-tuned assembler) |
| Hot loop speedup | 1.02-1.10× | 5-50× |
vs. V8 TurboFan (method JIT)
| Aspect | CPython copy-and-patch | V8 TurboFan |
|---|---|---|
| Compilation unit | Trace (linear uop sequence) | Function (full graph) |
| Inlining | Not yet | Aggressive |
| Escape analysis | No | Yes |
| Compilation speed | ~10µs | ~10ms |
| Team size needed | 2-3 engineers | 20+ engineers |
vs. PyPy (meta-tracing JIT)
| Aspect | CPython copy-and-patch | PyPy |
|---|---|---|
| Approach | Trace + stitch | Meta-tracing RPython |
| C extension compat | Full | Limited (cpyext overhead) |
| Startup overhead | Minimal | Significant |
| Steady-state speed | Modest improvement | 2-10× faster |
| Memory overhead | Low | Higher |
Platform support
| Platform | Status |
|---|---|
| x86-64 Linux | Supported |
| x86-64 macOS | Supported |
| x86-64 Windows | Supported |
| AArch64 Linux | Supported |
| AArch64 macOS (Apple Silicon) | Supported |
| 32-bit x86 Windows (i686) | Supported |
| Other 32-bit platforms | Not supported |
| RISC-V, s390x | Not yet |
What’s planned for 3.14 and beyond
- Function inlining: Inline small callees directly into the trace, eliminating call/return overhead
- Better register usage: Allocate frequently-used values to registers instead of stack slots
- Loop unrolling: Repeat loop bodies to reduce branch overhead
- Superblock formation: Extend traces through branches, not just linear paths
- Profile-guided optimisation: Use runtime profiles to guide trace selection
The goal is to achieve 2-5× speedup on pyperformance by Python 3.16-3.17.
Debugging the JIT
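# Check JIT status (the sys._jit namespace was added in Python 3.14)
import sys
if hasattr(sys, '_jit'):
    print(sys._jit.is_available())  # Was this build compiled with JIT support?
    print(sys._jit.is_enabled())    # Is the JIT switched on for this process?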
# Environment variables
# PYTHON_JIT=1 Enable JIT
# PYTHON_JIT=0 Disable JIT
# PYTHON_JIT_DEBUG=1 Print JIT compilation events
# Disassemble JIT output (requires perf or lldb)
perf record -g python3.13 script.py
perf annotate # Shows JIT-compiled regions
# Or with LLDB
lldb python3.13 -- script.py
# Set breakpoint on _PyJIT_Compile to inspect generated code
The one thing to remember: Copy-and-patch achieves the seemingly impossible — a production-quality JIT compiler maintainable by a small team — by outsourcing code generation to LLVM at build time and doing only simple copy-and-fill at runtime.