Python 3.13 New Features — Deep Dive
Technical overview
Python 3.13 (October 2024) shipped two experimental features with deep architectural implications: free threading (PEP 703) and a copy-and-patch JIT compiler (PEP 744). Both are disabled by default, but they represent the most fundamental changes to CPython’s execution model since the GIL was introduced in 1992.
Free-threaded CPython — architecture
The problem with removing the GIL
The GIL protected CPython from data races on:
- Reference counts: every object has ob_refcnt, incremented and decremented constantly
- Container internals: dicts, lists, and sets are not thread-safe data structures
- Memory allocation: pymalloc is not thread-safe
- Global state: interpreter state, module registries, import machinery
Simply removing the GIL without replacing these protections would cause crashes and data corruption.
Biased reference counting
The most critical change. Each object has a thread-local reference count and a shared reference count:
// Simplified structure (free-threaded build; other header fields omitted)
struct _object {
    uintptr_t ob_tid;          // Owning thread ID
    uint16_t ob_flags;
    uint32_t ob_ref_local;     // Thread-local refcount (uncontended)
    Py_ssize_t ob_ref_shared;  // Shared refcount (atomic operations)
};
- The owning thread updates ob_ref_local without atomics (fast path)
- Other threads update ob_ref_shared with atomic operations (slow path)
- When ob_ref_local drops to zero, the shared count is checked atomically
- Object deallocation happens when both counts reach zero
This is “biased” because most reference count operations happen on the owning thread, keeping the fast path truly fast.
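The fast-path/slow-path split can be modelled in a few lines of Python. This is a toy sketch, not CPython's implementation: the class name `BiasedRefCount` and its attributes are hypothetical, and a `threading.Lock` stands in for the atomic instructions the real runtime uses.

```python
import threading


class BiasedRefCount:
    """Toy model of biased reference counting (not CPython's real layout)."""

    def __init__(self):
        self.owner = threading.get_ident()    # plays the role of ob_tid
        self.local = 1                        # plays the role of ob_ref_local
        self.shared = 0                       # plays the role of ob_ref_shared
        self._shared_lock = threading.Lock()  # stand-in for atomic operations
        self.deallocated = False

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                   # fast path: owner only, no lock
        else:
            with self._shared_lock:           # slow path: any other thread
                self.shared += 1

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
            if self.local == 0:
                # The owner is done: consult the shared count under the lock.
                with self._shared_lock:
                    self._maybe_free()
        else:
            with self._shared_lock:
                self.shared -= 1
                self._maybe_free()

    def _maybe_free(self):
        # Deallocate only when both counts have reached zero.
        if self.local == 0 and self.shared == 0:
            self.deallocated = True


def demo():
    obj = BiasedRefCount()                    # local=1, shared=0
    t = threading.Thread(target=obj.incref)   # another thread takes a reference
    t.start(); t.join()                       # local=1, shared=1
    obj.decref()                              # owner drops its reference
    alive_after_owner = not obj.deallocated
    t = threading.Thread(target=obj.decref)   # the other thread drops its own
    t.start(); t.join()
    return alive_after_owner, obj.deallocated
```

In `demo()` the object survives the owner's final decrement because another thread still holds a shared reference, and it is freed only once that reference is released too.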
Per-object locking
Critical section locks protect container mutations:
// Internal API for container operations
Py_BEGIN_CRITICAL_SECTION(dict);
// ... mutate dict internals ...
Py_END_CRITICAL_SECTION();
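These are lightweight mutexes that:
- Use the object header (no separate allocation)
- Avoid deadlock rather than detect it: a thread’s active critical sections are released whenever it blocks, and two-object sections acquire their locks in a fixed order
- Rely on stop-the-world pauses for cases that per-object locks cannot cover, such as garbage collection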
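The idea of a lock that lives on the object, and of taking two such locks in a fixed order, can be sketched in Python. This is only an analogy under stated assumptions: `LockedList`, `critical_section2` and `move_all` are hypothetical names, and ordering by `id()` stands in for ordering by memory address.

```python
import threading
from contextlib import contextmanager


class LockedList:
    """A list that carries its own lock, like a mutex stored in the object header."""

    def __init__(self, items=()):
        self.items = list(items)
        self.lock = threading.Lock()


@contextmanager
def critical_section2(a, b):
    """Lock two objects in one global order (here: by id) so no cycle can form."""
    first, second = sorted((a, b), key=id)
    with first.lock:
        if second is first:
            yield                      # same object passed twice: one lock only
        else:
            with second.lock:
                yield


def move_all(src, dst):
    """Atomically move every item from src to dst."""
    with critical_section2(src, dst):
        dst.items.extend(src.items)
        src.items.clear()


def demo(rounds=1000):
    a, b = LockedList(range(10)), LockedList()
    # The two threads name the objects in opposite order, the classic
    # deadlock recipe when each lock is simply taken as written.
    t1 = threading.Thread(target=lambda: [move_all(a, b) for _ in range(rounds)])
    t2 = threading.Thread(target=lambda: [move_all(b, a) for _ in range(rounds)])
    t1.start(); t2.start()
    t1.join(); t2.join()
    return sorted(a.items + b.items)
```

Because both threads always lock the lower-ordered object first, `demo()` finishes and no item is lost or duplicated.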
Immortal objects
Common objects (None, True, False, small integers, interned strings) are marked immortal (a mechanism introduced in Python 3.12 by PEP 683), so their reference count never changes:
// Immortal objects have a special refcount value (the exact constant varies by build and version)
#define _Py_IMMORTAL_REFCNT ((Py_ssize_t)(UINT32_MAX >> 1))
This eliminates contention on the most frequently shared objects.
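One way to observe immortality from Python is to compare how the reported reference count reacts to new references. The helper below is a small sketch; `refcount_delta` is a hypothetical name, and the result for `None` depends on the interpreter version.

```python
import sys


def refcount_delta(obj, n=1000):
    """Return how much sys.getrefcount(obj) changes after taking n new references."""
    before = sys.getrefcount(obj)
    refs = [obj] * n              # n additional references to the same object
    after = sys.getrefcount(obj)
    del refs
    return after - before


print(refcount_delta(object()))   # an ordinary object: the count grows by n
print(refcount_delta(None))       # 0 where None is immortal (CPython 3.12+)
```

On interpreters that predate immortal objects, the second call reports the same growth as the first.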
Deferred reference counting
Some objects (module globals, type objects) use deferred reference counting — their refcount decrements are batched and processed during GC pauses rather than inline. This reduces atomic operations for long-lived objects.
Performance characteristics
| Workload | GIL build | Free-threaded (1 thread) | Free-threaded (4 threads) |
|---|---|---|---|
| pyperformance avg | 1.00× | 0.92× | N/A (single-threaded) |
| CPU-bound parallel | 1.00× | 0.90× | 3.2× |
| I/O-bound parallel | 1.00× | 0.95× | ~1.0× (I/O bound) |
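Figures are throughput relative to the GIL build. In this table, single-threaded code is 5-10% slower because of atomic reference counting and per-object locking, while multi-threaded CPU-bound work scales close to linearly, reaching 3.2× on 4 threads. Treat the numbers as indicative: CPython’s own free-threading documentation reports a considerably larger single-threaded overhead for 3.13 (around 40% on pyperformance), because the specializing adaptive interpreter is disabled in that build.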
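To check scaling on your own machine, a small harness along these lines can help. It is a sketch, not a rigorous benchmark: `count_primes` and `compare` are hypothetical helpers, and the workload is deliberately naive so that it stays CPU-bound.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound: naive count of primes below limit."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def compare(limit=20_000, workers=4):
    """Run the same jobs sequentially and in a thread pool.

    Returns (results_match, sequential_time / threaded_time).
    """
    jobs = [limit] * workers

    start = time.perf_counter()
    sequential = [count_primes(j) for j in jobs]
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        threaded = list(pool.map(count_primes, jobs))
    t_thr = time.perf_counter() - start

    return sequential == threaded, t_seq / t_thr
```

On a GIL build the ratio should stay close to 1, since only one thread executes bytecode at a time. On a free-threaded build with enough cores it should move toward the worker count.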
Copy-and-patch JIT compiler
Architecture
The JIT uses a technique called “copy and patch”, introduced by Xu and Kjolstad (2021):
- Stencil generation (build time): Each bytecode instruction is compiled to native machine code by Clang/LLVM, producing a “stencil” — a template with holes for operands
- Patching (runtime): When a hot trace is detected, stencils are copied into a buffer and holes are filled with concrete values (object pointers, offsets)
- Execution: The patched buffer is marked executable and called directly
Build time:                      Runtime:
┌─────────┐     ┌──────────┐     ┌──────────────────┐
│ C code  │ ──→ │ Stencils │ ──→ │ Copy + Patch     │
│ per     │     │ (.h data)│     │ concrete values  │
│ opcode  │     └──────────┘     └────────┬─────────┘
└─────────┘                               │
                                   ┌──────▼──────┐
                                   │ Executable  │
                                   │ native code │
                                   └─────────────┘
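The copy and patch steps can be illustrated with a toy model in Python. Everything here is made up for illustration: the `STENCILS` table, its byte values and the little `run` interpreter are hypothetical stand-ins for real machine code and the CPU.

```python
import struct

HOLE = b"\x00" * 8  # 8-byte placeholder, patched later with a concrete value

# Hypothetical stencils: fixed bytes plus the offsets of their holes.
# Real stencils are machine code emitted by Clang; these bytes are invented.
STENCILS = {
    "LOAD_CONST": (b"\x01" + HOLE, [1]),
    "ADD_CONST": (b"\x02" + HOLE, [1]),
    "RETURN": (b"\xff", []),
}


def compile_trace(trace):
    """Copy each stencil into one buffer, then patch its holes with operands."""
    buf = bytearray()
    for opname, operands in trace:
        template, holes = STENCILS[opname]
        base = len(buf)
        buf += template                               # copy
        for offset, value in zip(holes, operands):    # patch
            struct.pack_into("<q", buf, base + offset, value)
    return bytes(buf)


def run(code):
    """Tiny interpreter for the invented byte format, standing in for the CPU."""
    acc, pc = 0, 0
    while True:
        op = code[pc]
        if op == 0xFF:                                # RETURN
            return acc
        (value,) = struct.unpack_from("<q", code, pc + 1)
        acc = value if op == 0x01 else acc + value    # LOAD_CONST or ADD_CONST
        pc += 9                                       # 1 opcode byte + 8-byte hole


code = compile_trace([("LOAD_CONST", [40]), ("ADD_CONST", [2]), ("RETURN", [])])
print(run(code))
```

Note that `compile_trace` does no analysis at all: it only concatenates templates and writes operands into known offsets, which is why this style of compilation is so cheap at runtime.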
Why “copy and patch” instead of a traditional JIT?
- Simpler: No IR, no register allocator, no instruction scheduler at runtime
- Faster compilation: Copying and patching takes microseconds vs. milliseconds for LLVM-based JITs
- Correct by construction: Each stencil is verified by Clang at build time
- Maintainable: CPython developers write C, not assembly
Current limitations
- Only traces of Tier 2 (optimised) bytecodes are JIT-compiled
- No inlining across function boundaries (planned for 3.14)
- No loop unrolling or constant folding beyond what the Tier 2 optimiser provides
- Platform support: x86-64, AArch64 (ARM64)
Enabling and measuring
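# The JIT must be compiled in: configure CPython with --enable-experimental-jit
# Enable JIT
PYTHON_JIT=1 python3.13 script.py
# Check for JIT introspection (sys._jit was added in 3.14; on 3.13 this prints None)
python3.13 -c "import sys; print(getattr(sys, '_jit', None))"
# Disable for comparison
PYTHON_JIT=0 python3.13 script.py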
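The same on/off comparison can be scripted. The sketch below assumes an interpreter built with JIT support (elsewhere the variable has no effect); `jit_env` and `time_snippet` are hypothetical helper names.

```python
import os
import subprocess
import sys


def jit_env(enabled):
    """Copy the current environment and set PYTHON_JIT to "1" or "0"."""
    env = dict(os.environ)
    env["PYTHON_JIT"] = "1" if enabled else "0"
    return env


def time_snippet(stmt, enabled, number=5):
    """Run `python -m timeit` on stmt in a child interpreter; return its report."""
    result = subprocess.run(
        [sys.executable, "-m", "timeit", "-n", str(number), stmt],
        env=jit_env(enabled),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `time_snippet("sum(i * i for i in range(100_000))", enabled=False)` and then the same statement with `enabled=True` gives two timeit reports to compare. A child process is used because the variable is read at interpreter startup.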
New REPL implementation
_pyrepl internals
The new REPL is based on PyPy’s pyrepl, adapted for CPython:
Input → _pyrepl.reader → _pyrepl.commands → _pyrepl.console
              │
              ▼
  _pyrepl.completing_reader (tab completion)
  _pyrepl.historical_reader (history management)
Key differences from the old readline-based REPL:
- Block-aware: Knows about Python indentation and multi-line constructs
- Customisable: Supports custom key bindings via ~/.pyrepl_config (undocumented, may change)
- No C dependency: Pure Python, no libreadline or libedit needed
Compatibility
The new REPL detects when stdin is not a terminal (piped input) and falls back to the classic REPL. It respects PYTHONSTARTUP. IPython-style magic commands are not supported (use IPython for those).
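The fallback rule can be written as a small predicate. This is a rough sketch of the behaviour described here, not CPython's startup code: `expects_new_repl` is a hypothetical helper, and the `PYTHON_BASIC_REPL` opt-out it checks is an addition from the 3.13 documentation rather than from the text above.

```python
import os
import sys


def expects_new_repl(is_tty, environ):
    """Rough guess at whether an interactive session would get the new REPL.

    Assumption: mirrors the documented rules, not the interpreter's real logic.
    """
    if environ.get("PYTHON_BASIC_REPL"):
        return False          # user explicitly asked for the classic REPL
    return bool(is_tty)       # piped or redirected stdin gets the classic REPL


print(expects_new_repl(sys.stdin.isatty(), os.environ))
```

Scripts that feed code to `python` through a pipe therefore keep the classic behaviour without any extra configuration.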
Extension module compatibility
Free-threaded builds
Extension modules must declare thread-safety:
static struct PyModuleDef_Slot module_slots[] = {
    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},  // Declares GIL-free safety
    {0, NULL}
};
Without Py_MOD_GIL_NOT_USED, the interpreter re-enables the GIL when the module is imported.
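Because an import can silently turn the GIL back on, it is worth checking the runtime state after loading your extensions. A minimal sketch, assuming only `sysconfig` and the `sys._is_gil_enabled()` function available on 3.13+; `gil_status` is a hypothetical helper name.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older interpreters always have a GIL.
    is_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": bool(is_enabled()),
    }


print(gil_status())
```

Calling `gil_status()` before and after importing a C extension shows whether that import re-enabled the GIL on a free-threaded build.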
Impact on package ecosystem
As of early 2026:
- NumPy 2.1+ supports free-threaded builds
- Cython 3.1+ can generate free-threaded compatible code
- pybind11 2.13+ has experimental free-threaded support
- Many smaller packages still need updates
locals() semantics (PEP 667)
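PEP 667 replaces “sometimes a view, sometimes a copy” with defined behaviour: inside a function, every locals() call returns an independent snapshot, while frame.f_locals becomes a write-through proxy onto the frame’s variables:
import sys

def example():
    x = 1
    snapshot = locals()        # Independent snapshot: {'x': 1}
    frame = sys._getframe()
    x = 2
    snapshot["x"]              # Still 1: a snapshot is never updated
    frame.f_locals["x"]        # 2: the proxy reads the live variable
    frame.f_locals["x"] = 3    # Writes through: x is now 3
Debuggers and profilers that relied on mutating locals() to change variables must now write through frame.f_locals instead, which does update the frame.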
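For application code, the portable fix is usually to stop relying on implicit local namespaces and pass an explicit one. The sketch below contrasts the two patterns; `fragile` and `evaluate` are hypothetical names.

```python
def fragile():
    x = 1
    exec("x = 2")      # writes into a separate namespace, not the real local x
    return x           # still 1


def evaluate(source, **variables):
    """Run source against an explicit namespace and return that namespace."""
    namespace = dict(variables)
    exec(source, {}, namespace)   # results land in a dict the caller controls
    return namespace


print(fragile())
print(evaluate("y = x + 1", x=1)["y"])
```

The explicit-namespace version behaves the same before and after PEP 667, because it never depends on how locals() is implemented.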
Migration strategy
- Test with 3.13 default build — most code works unchanged
- Test with free-threaded build only if you have CPU-bound threading workloads
- Audit C extensions for global state and thread safety if targeting free-threaded
- Remove deprecated stdlib imports — 19 modules were removed
- Don’t depend on locals() mutation: ensure code works with snapshot semantics
- Try the JIT on benchmarks: measure, don’t assume improvement
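The stdlib step in the checklist above is easy to automate. The sketch below scans source text for imports of the 19 modules removed in 3.13 under PEP 594; `find_removed_imports` is a hypothetical helper, and it only sees static import statements, not dynamic ones such as importlib.import_module.

```python
import ast

# The 19 modules removed in Python 3.13 under PEP 594.
REMOVED_IN_313 = frozenset({
    "aifc", "audioop", "cgi", "cgitb", "chunk", "crypt", "imghdr", "mailcap",
    "msilib", "nis", "nntplib", "ossaudiodev", "pipes", "sndhdr", "spwd",
    "sunau", "telnetlib", "uu", "xdrlib",
})


def find_removed_imports(source):
    """Return sorted (line, module) pairs for imports of removed modules."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]          # absolute "from X import ..." only
        else:
            continue
        for name in names:
            top = name.split(".")[0]       # "a.b.c" is judged by its top package
            if top in REMOVED_IN_313:
                hits.add((node.lineno, top))
    return sorted(hits)


sample = "import os, cgi\nfrom telnetlib import Telnet\n"
print(find_removed_imports(sample))
```

Running it over each file in a project (for example with pathlib.Path.rglob("*.py")) gives a quick list of imports to replace before moving to 3.13.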
The one thing to remember: Python 3.13 is the inflection point — biased reference counting and copy-and-patch JIT are the technical foundations for a Python that runs on all cores and approaches compiled-language speed over the next several releases.