Python 3.13 New Features — Deep Dive

Free-threaded CPython's locking architecture, copy-and-patch JIT internals, the new REPL implementation, and what extension authors need to know.

Technical overview

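Python 3.13 (October 2024) shipped two experimental features with deep architectural implications: free threading (PEP 703) and a copy-and-patch JIT compiler (PEP 744). Neither is active in the default build: free threading requires a separate build (configured with --disable-gil, typically installed as python3.13t), and the JIT must be compiled in with --enable-experimental-jit. Together they represent the most fundamental changes to CPython’s execution model since the GIL was introduced in 1992.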

Free-threaded CPython — architecture

The problem with removing the GIL

The GIL protected CPython from data races on:

Reference counts — every object has ob_refcnt, incremented and decremented constantly
Container internals — dicts, lists, and sets are not thread-safe data structures
Memory allocation — pymalloc is not thread-safe
Global state — interpreter state, module registries, import machinery

Simply removing the GIL without replacing these protections would cause crashes and data corruption.

Biased reference counting

Biased reference counting is the most critical change. Each object carries two reference counts: a local count that only the owning thread touches, and a shared count used by every other thread:

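// Simplified object header in the free-threaded build (some fields omitted)
struct _object {
    uintptr_t ob_tid;          // ID of the owning thread
    PyMutex ob_mutex;          // Per-object lock used by critical sections
    uint8_t ob_gc_bits;        // Garbage collector bookkeeping bits
    uint32_t ob_ref_local;     // Local refcount, owner thread only (no atomics)
    Py_ssize_t ob_ref_shared;  // Shared refcount (atomic operations)
    PyTypeObject *ob_type;     // Type pointer, as in the default build
};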

The owning thread updates ob_ref_local without atomics — fast path
Other threads update ob_ref_shared with atomic operations — slow path
When ob_ref_local drops to zero, the shared count is checked atomically
Object deallocation happens when both counts reach zero

This is “biased” because most reference count operations happen on the owning thread, keeping the fast path truly fast.
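
The fast path and slow path above can be modelled in a few lines of Python. This is only a toy sketch, not CPython's implementation: the class name `BiasedRefCount` is hypothetical, a `threading.Lock` stands in for atomic instructions, and the object is treated as freeable once the local and shared counts sum to zero.

```python
import threading


class BiasedRefCount:
    """Toy model of biased reference counting (hypothetical, not CPython code)."""

    def __init__(self):
        self.owner = threading.get_ident()    # thread that created the object
        self.local = 1                        # touched only by the owner
        self.shared = 0                       # touched by every other thread
        self._shared_lock = threading.Lock()  # stands in for atomic operations

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                   # fast path: no synchronisation
        else:
            with self._shared_lock:           # slow path: "atomic" update
                self.shared += 1

    def decref(self):
        """Drop one reference; return True when the object could be freed."""
        if threading.get_ident() == self.owner:
            self.local -= 1                   # fast path
        else:
            with self._shared_lock:           # slow path
                self.shared -= 1
        with self._shared_lock:
            # Freeable only when no thread holds a reference any more.
            return self.local + self.shared == 0
```

In this model the owner never takes the lock to adjust its own count, which is the whole point of the bias: the common case stays a plain integer update.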

Per-object locking

Critical section locks protect container mutations:

// Critical section C API (added in 3.13) for container operations
Py_BEGIN_CRITICAL_SECTION(dict);
// ... mutate dict internals ...
Py_END_CRITICAL_SECTION();

These are lightweight mutexes that:

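Live in the object header (a one-byte mutex, no separate allocation)
Avoid deadlock rather than detect it: a thread that blocks implicitly releases its active critical sections, and the two-object variant acquires both locks in address order
Are complemented by stop-the-world pauses for operations that need a globally consistent view, such as garbage collection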
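
A minimal Python sketch of the two-object case, assuming each container carries its own lock: both locks are always taken in one consistent order (here by `id()`), so two threads that name the same pair in opposite order cannot deadlock. The names `Locked`, `critical_section2`, and `move_all` are hypothetical; CPython does the equivalent in C.

```python
import threading
from contextlib import contextmanager


class Locked:
    """Hypothetical container wrapper that carries its own lock."""

    def __init__(self, items=()):
        self.items = list(items)
        self.mutex = threading.Lock()


@contextmanager
def critical_section2(a, b):
    """Lock two objects in a consistent, id-based order."""
    first, second = sorted((a, b), key=id)
    with first.mutex:
        if second is first:
            yield                 # same object passed twice: lock it once
        else:
            with second.mutex:
                yield


def move_all(src, dst):
    """Move every item from src to dst while holding both locks."""
    with critical_section2(src, dst):
        dst.items.extend(src.items)
        src.items.clear()
```

Two threads calling `move_all(a, b)` and `move_all(b, a)` concurrently both lock the lower-id object first, so neither can hold one lock while waiting for the other.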

Immortal objects

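Common objects (None, True, False, small integers, interned strings) are marked immortal, a mechanism introduced by PEP 683 in Python 3.12. Their reference count is a fixed sentinel that incref and decref leave untouched:

// Immortal objects carry a sentinel refcount; the exact value is build-specific.
// In the free-threaded build the local count is pinned to:
#define _Py_IMMORTAL_REFCNT_LOCAL UINT32_MAX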

This eliminates contention on the most frequently shared objects.
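
Immortality can be observed from Python without touching the C API. The sketch below uses the hypothetical helper `refcount_delta`, which measures how much `sys.getrefcount` moves when extra references are taken; on 3.12 and later, None should not move at all.

```python
import sys


def refcount_delta(obj, extra_refs=1000):
    """Return how much sys.getrefcount(obj) changes after taking extra references."""
    before = sys.getrefcount(obj)
    holders = [obj] * extra_refs   # each list slot holds one more reference
    after = sys.getrefcount(obj)
    del holders
    return after - before


if __name__ == "__main__":
    print("fresh object:", refcount_delta(object()))
    print("None:", refcount_delta(None))
```

An ordinary object gains one count per extra reference, while an immortal object reports the same value before and after.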

Deferred reference counting

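Some long-lived objects that are accessed from many threads (PEP 703 names top-level functions, code objects, modules, and methods) use deferred reference counting: the interpreter skips most reference count updates for them, and the true count is only reconciled while the garbage collector has the threads paused. This removes contended atomic operations on long-lived objects, at the cost of freeing them only during a GC pass.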

Performance characteristics

Workload	GIL build	Free-threaded (1 thread)	Free-threaded (4 threads)
pyperformance avg	1.00×	0.92×	N/A (single-threaded)
CPU-bound parallel	1.00×	0.90×	3.2×
I/O-bound parallel	1.00×	0.95×	~1.0× (I/O bound)

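In these figures, single-threaded code is 5-10% slower on the free-threaded build, the cost of atomic reference counting and per-object locking, while CPU-bound work on four threads reaches about 3.2×, roughly 80% of linear scaling. Treat the single-threaded numbers as optimistic for 3.13 itself: the CPython 3.13 free-threading documentation puts the overhead at about 40% on pyperformance, largely because the specialising adaptive interpreter is disabled in that build.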
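
Numbers like these are easy to check on your own machine. The following is a rough sketch rather than a rigorous benchmark: `count_primes` and `compare` are hypothetical helpers that time the same CPU-bound work serially and across a thread pool, and report whether the GIL is active via `sys._is_gil_enabled()` (available from 3.13; older versions are assumed to have a GIL).

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound: trial-division prime count below limit."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def compare(limit=20_000, workers=4):
    """Time the same total work run serially and across a thread pool."""
    start = time.perf_counter()
    serial = [count_primes(limit) for _ in range(workers)]
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        threaded = list(pool.map(count_primes, [limit] * workers))
    threaded_s = time.perf_counter() - start

    # sys._is_gil_enabled() exists on 3.13+; older versions always have a GIL.
    gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    return {
        "gil_enabled": gil_enabled,
        "serial_s": serial_s,
        "threaded_s": threaded_s,
        "results_match": serial == threaded,
    }


if __name__ == "__main__":
    print(compare())
```

On a GIL build the two timings should be close; on a free-threaded build the threaded run is where any scaling would show up.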

Copy-and-patch JIT compiler

Architecture

The JIT uses a technique called “copy and patch”, introduced by Xu and Kjolstad (2021):

Stencil generation (build time): Each bytecode instruction is compiled to native machine code by Clang/LLVM, producing a “stencil” — a template with holes for operands
Patching (runtime): When a hot trace is detected, stencils are copied into a buffer and holes are filled with concrete values (object pointers, offsets)
Execution: The patched buffer is marked executable and called directly

Build time:                        Runtime:
┌─────────┐     ┌──────────┐      ┌──────────────────┐
│ C code  │ ──→ │ Stencils │ ──→  │ Copy + Patch     │
│ per     │     │ (.h data)│      │ concrete values  │
│ opcode  │     └──────────┘      └────────┬─────────┘
└─────────┘                                │
                                    ┌──────▼──────┐
                                    │ Executable  │
                                    │ native code │
                                    └─────────────┘
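
The copy and patch steps can be imitated in pure Python. This is a toy model under loud assumptions: the "stencils" are made-up byte templates (one opcode byte plus an 8-byte hole) rather than real machine code, and `run` is a stand-in for the CPU. It only illustrates the mechanics of copying a prebuilt template and filling its hole.

```python
import struct

HOLE = b"\x00" * 8  # 8-byte placeholder left open at "build time"

# Hypothetical stencils: (template bytes, offset of the hole in the template).
STENCILS = {
    "LOAD": (b"\x01" + HOLE, 1),
    "ADD": (b"\x02" + HOLE, 1),
    "MUL": (b"\x03" + HOLE, 1),
}


def compile_trace(trace):
    """Copy each stencil into one buffer, then patch its hole with the operand."""
    buf = bytearray()
    for name, operand in trace:
        template, hole_offset = STENCILS[name]
        start = len(buf)
        buf += template                                            # copy
        struct.pack_into("<q", buf, start + hole_offset, operand)  # patch
    return bytes(buf)


def run(code):
    """Stand-in for the CPU: execute the patched buffer on an accumulator."""
    acc = 0
    for pos in range(0, len(code), 9):
        opcode = code[pos]
        (operand,) = struct.unpack_from("<q", code, pos + 1)
        if opcode == 1:
            acc = operand
        elif opcode == 2:
            acc += operand
        elif opcode == 3:
            acc *= operand
    return acc
```

For example, `run(compile_trace([("LOAD", 5), ("ADD", 3), ("MUL", 4)]))` evaluates (5 + 3) * 4. Note that "compilation" here is nothing more than a byte copy and an integer write per instruction, which is why the real technique is so cheap at runtime.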

Why “copy and patch” instead of a traditional JIT?

Simpler: No IR, no register allocator, no instruction scheduler at runtime
Faster compilation: Copying and patching takes microseconds vs. milliseconds for LLVM-based JITs
Hard to get out of sync: Stencils are generated at build time from the same C definitions as the interpreter, so the JIT and the interpreter cannot drift apart
Maintainable: CPython developers write C, not assembly

Current limitations

Only traces of Tier 2 (optimised) bytecodes are JIT-compiled
No inlining across function boundaries (planned for 3.14)
No loop unrolling or constant folding beyond what the Tier 2 optimiser provides
Platform support: x86-64, AArch64 (ARM64)

Enabling and measuring

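# The JIT must be compiled in first (configure with --enable-experimental-jit)

# Enable the JIT at runtime
PYTHON_JIT=1 python3.13 script.py

# Check whether this build was configured with the JIT
# (sys._jit only exists from 3.14; on 3.13 inspect the configure arguments, Unix builds only)
python3.13 -c "import sysconfig; print('--enable-experimental-jit' in (sysconfig.get_config_var('CONFIG_ARGS') or ''))"

# Disable for comparison
PYTHON_JIT=0 python3.13 script.py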
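
The same comparison can be scripted. The sketch below is an assumption-laden harness, not an official tool: `time_with_jit` and `compare_jit` are hypothetical helpers that run a small hot loop in a child interpreter with PYTHON_JIT set to 0 and 1 and keep the best timing. On a build without the JIT the variable has no effect, so both timings should be about the same.

```python
import os
import subprocess
import sys

# Small hot loop executed in the child interpreter; it prints elapsed seconds.
SNIPPET = """
import time
start = time.perf_counter()
total = 0
for i in range(1_000_000):
    total += i * i % 7
print(time.perf_counter() - start)
"""


def time_with_jit(setting, python=sys.executable):
    """Run SNIPPET in a child interpreter with PYTHON_JIT set; return seconds."""
    env = dict(os.environ, PYTHON_JIT=setting)
    out = subprocess.run(
        [python, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def compare_jit(runs=3):
    """Best-of-N timings for PYTHON_JIT=0 and PYTHON_JIT=1."""
    return {s: min(time_with_jit(s) for _ in range(runs)) for s in ("0", "1")}


if __name__ == "__main__":
    print(compare_jit())
```

Taking the minimum over several runs reduces noise from other processes; for anything beyond a quick check, a dedicated benchmarking tool is a better choice.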

New REPL implementation

`_pyrepl` internals

The new REPL is based on PyPy’s pyrepl, adapted for CPython:

Input → _pyrepl.reader → _pyrepl.commands → _pyrepl.console
                            │
                            ▼
                    _pyrepl.completing_reader (tab completion)
                    _pyrepl.historical_reader (history management)

Key differences from the old readline-based REPL:

Block-aware: Knows about Python indentation and multi-line constructs
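Customisable: Key bindings are defined in Python inside _pyrepl itself, but 3.13 documents no user-facing configuration file, so any customisation relies on private internals that may change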
No C dependency: Pure Python, no libreadline or libedit needed

Compatibility

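The new REPL is used only when stdin is a terminal; with piped input the interpreter falls back to the classic REPL. Setting the PYTHON_BASIC_REPL environment variable forces the classic REPL even in a terminal. PYTHONSTARTUP is still honoured. IPython-style magic commands are not supported (use IPython for those).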
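
As a rough approximation of that decision, the checks can be written out in Python. The helper name `likely_uses_new_repl` is hypothetical and this is not CPython's actual logic, which also performs terminal capability checks; it only mirrors the conditions described above and the PYTHON_BASIC_REPL opt-out.

```python
import os
import sys


def likely_uses_new_repl(stdin=None, environ=None):
    """Approximate the conditions under which 3.13 starts the new REPL."""
    stdin = sys.stdin if stdin is None else stdin
    environ = os.environ if environ is None else environ
    if sys.version_info < (3, 13):
        return False                  # no _pyrepl before 3.13
    if environ.get("PYTHON_BASIC_REPL"):
        return False                  # classic REPL explicitly requested
    # Piped or missing stdin falls back to the classic REPL.
    return stdin is not None and stdin.isatty()


if __name__ == "__main__":
    print(likely_uses_new_repl())
```

Passing `stdin` and `environ` explicitly makes the helper easy to exercise in tests without a real terminal.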

Extension module compatibility

Free-threaded builds

Extension modules must declare thread-safety:

static struct PyModuleDef_Slot module_slots[] = {
    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},  // Declares GIL-free safety
    {0, NULL}
};

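Without Py_MOD_GIL_NOT_USED, a free-threaded interpreter re-enables the GIL (and emits a warning) when the module is imported, unless the GIL was explicitly disabled with PYTHON_GIL=0 or -X gil=0.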
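
From the Python side, you can check whether an import flipped the GIL back on. This is a small sketch with hypothetical helper names (`gil_status`, `import_reenabled_gil`); it relies on `sysconfig.get_config_var("Py_GIL_DISABLED")` to identify a free-threaded build and on `sys._is_gil_enabled()`, which exists from 3.13 (older versions are assumed to always run with the GIL).

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in 3.13; earlier versions always have a GIL.
    gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


def import_reenabled_gil(module_name):
    """Import a module and report whether the GIL went from off to on."""
    before = gil_status()["gil_enabled"]
    __import__(module_name)
    after = gil_status()["gil_enabled"]
    return (not before) and after


if __name__ == "__main__":
    print(gil_status())
```

Running `import_reenabled_gil("some_extension")` in a fresh free-threaded interpreter is a quick way to spot a dependency that has not yet declared GIL-free safety; on a default build it always returns False.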

Impact on package ecosystem

As of early 2026:

NumPy 2.1+ supports free-threaded builds
Cython 3.1+ can generate free-threaded compatible code
pybind11 2.13+ has experimental free-threaded support
Many smaller packages still need updates

`locals()` semantics (PEP 667)

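PEP 667 replaces the old “sometimes a view, sometimes a copy” behaviour with two clear rules for function scopes: locals() always returns an independent snapshot, and frame.f_locals is a write-through proxy onto the live frame:

import sys

def example():
    x = 1
    snapshot = locals()       # independent snapshot taken here: {'x': 1}
    frame = sys._getframe()
    proxy = frame.f_locals    # write-through proxy, not a copy (3.13+)
    x = 2
    print(snapshot["x"])      # 1: the snapshot does not track later assignments
    print(proxy["x"])         # 2: the proxy reads the live variable
    proxy["x"] = 3            # writes through to the frame
    print(x)                  # 3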

Debuggers and profilers that relied on mutating locals() to change variables must now use frame.f_locals writes directly (which does affect the frame).
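
Application code that used exec() together with locals() is usually better served by an explicit namespace, which behaves the same before and after PEP 667. A minimal sketch, with hypothetical helper names `run_snippet` and `doubled`:

```python
def run_snippet(source, **variables):
    """Execute source against an explicit namespace and return that namespace."""
    namespace = dict(variables)   # our own dict, independent of any frame
    exec(source, {}, namespace)   # assignments land in namespace, not in locals()
    return namespace


def doubled(x):
    """Read the result back from the namespace instead of from locals()."""
    result = run_snippet("y = x * 2", x=x)
    return result["y"]


if __name__ == "__main__":
    print(doubled(21))
```

Because the dictionary is owned by the caller, it does not matter whether locals() is a view or a snapshot on the running interpreter.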

Migration strategy

Test with 3.13 default build — most code works unchanged
Test with free-threaded build only if you have CPU-bound threading workloads
Audit C extensions for global state and thread safety if targeting free-threaded
Remove deprecated stdlib imports — 19 modules were removed
Don’t depend on locals() mutation — ensure code works with snapshot semantics
Try the JIT on benchmarks — measure, don’t assume improvement
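
The removed-module step can be partly automated. The sketch below uses the hypothetical helper `removed_imports` to parse a source string with `ast` and report top-level imports of the 19 modules removed by PEP 594; it does not catch dynamic imports such as `importlib.import_module("cgi")`.

```python
import ast

# The 19 "dead batteries" removed in Python 3.13 by PEP 594.
REMOVED_IN_313 = {
    "aifc", "audioop", "cgi", "cgitb", "chunk", "crypt", "imghdr", "mailcap",
    "msilib", "nis", "nntplib", "ossaudiodev", "pipes", "sndhdr", "spwd",
    "sunau", "telnetlib", "uu", "xdrlib",
}


def removed_imports(source):
    """Return the sorted list of removed top-level modules imported by source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue  # not an absolute import statement
        found.update(name.split(".")[0] for name in names)
    return sorted(found & REMOVED_IN_313)
```

Calling it on each file's text, for example `removed_imports(path.read_text())`, gives a quick inventory of imports that will fail on 3.13.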

The one thing to remember: Python 3.13 is the inflection point — biased reference counting and copy-and-patch JIT are the technical foundations for a Python that runs on all cores and approaches compiled-language speed over the next several releases.
