Python Free Threading (No-GIL) — Deep Dive
Technical overview
PEP 703 (accepted October 2023, shipped experimentally in Python 3.13) makes the GIL optional in CPython by adding a separate free-threaded build. This required re-engineering every system that the GIL previously protected: reference counting, memory allocation, container thread safety, the garbage collector, and the import system. This deep dive covers the internal mechanisms and what extension authors need to know.
Biased reference counting — implementation
Object header layout
Every Python object in free-threaded builds has an expanded header:
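struct _object {
    uintptr_t ob_tid;        // Thread ID of the owning thread (0 if none)
    uint16_t _padding;
    PyMutex ob_mutex;        // Per-object lock used by critical sections
    uint8_t ob_gc_bits;      // GC-related state
    // Biased refcount fields. Shown nested for readability; CPython 3.13
    // declares them as flat fields named ob_ref_local and ob_ref_shared.
    struct {
        uint32_t local;      // Thread-local refcount (written by the owner only)
        Py_ssize_t shared;   // Shared refcount (atomic)
    } ob_ref;
    PyTypeObject *ob_type;
};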
Fast path (owning thread)
When the owning thread (matching ob_tid) increments or decrements the refcount:
static inline void Py_INCREF(PyObject *op) {
    if (_Py_IsOwnedByCurrentThread(op)) {
        op->ob_ref.local++; // Plain load and store, cache-line local
    } else {
        // Atomic add of one reference; the count sits above two flag bits
        _Py_atomic_add_ssize(&op->ob_ref.shared, 1 << _Py_REF_SHARED_SHIFT);
    }
}
The local increment costs about the same as the old GIL-protected ob_refcnt++: no atomic read-modify-write and no memory barriers.
Slow path (other threads)
The shared refcount uses atomic operations. The count is shifted left by two bits (_Py_REF_SHARED_SHIFT is 2); the two lowest bits hold state flags that record, for example, whether the local count has already been merged into the shared one or whether the object is queued for merging.
Deallocation
When ob_ref.local reaches zero:
- If ob_ref.shared is also zero (checked atomically) → deallocate
- If ob_ref.shared is positive → transfer ownership by setting ob_tid = 0 and merging local into shared
- If ob_ref.shared has the merge flag set → defer to GC cycle
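The fast path, slow path, and hand-off on deallocation can be modelled in a few lines of Python. This is only a toy sketch, not CPython's implementation: the class name `BiasedRefcount`, the `freed` flag, and the helper `run_in_thread` are made up for illustration, a `threading.Lock` stands in for atomic instructions, and the flag bits, negative shared counts, and merge queue are left out.

```python
import threading


class BiasedRefcount:
    """Toy model of biased reference counting; not CPython's real code."""

    def __init__(self):
        self.owner = threading.get_ident()   # plays the role of ob_tid
        self.local = 1                        # written only by the owner
        self.shared = 0                       # written by all other threads
        self.freed = False                    # hypothetical "deallocated" marker
        self._atomic = threading.Lock()       # stands in for atomic instructions

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                   # fast path, no synchronisation
        else:
            with self._atomic:                # slow path
                self.shared += 1

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
            if self.local == 0:
                self._local_hit_zero()
        else:
            with self._atomic:
                self.shared -= 1
                # Once ownership is given up, shared is the only count left.
                if self.owner == 0 and self.shared == 0:
                    self.freed = True

    def _local_hit_zero(self):
        with self._atomic:
            if self.shared == 0:
                self.freed = True             # no other references: deallocate
            else:
                self.owner = 0                # give up ownership, object stays alive


def run_in_thread(fn):
    """Run fn on a fresh thread and wait for it (illustrative helper)."""
    t = threading.Thread(target=fn)
    t.start()
    t.join()
```

Calling `incref` from a second thread bumps `shared` rather than `local`; when the owner then drops its last reference, the object survives with `owner` set to 0 until the other thread releases its reference too.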
Performance analysis
On a microbenchmark that just increments/decrements refcounts:
- Owning thread: identical speed to GIL build
- Non-owning thread: ~3× slower due to atomic operations
- In real workloads, 85-95% of refcount operations hit the fast path
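Taking the figures above at face value, a quick back-of-the-envelope calculation shows what they imply for the average refcount operation. The function name `blended_refcount_cost` is made up for this sketch, and the model assumes the only two costs are 1× (fast path) and 3× (slow path); it says nothing about whole-program speed.

```python
def blended_refcount_cost(fast_fraction, slow_penalty=3.0):
    """Average refcount cost relative to the fast path (fast path = 1.0)."""
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    return fast_fraction * 1.0 + (1.0 - fast_fraction) * slow_penalty


if __name__ == "__main__":
    for fraction in (0.85, 0.95):
        cost = blended_refcount_cost(fraction)
        print(f"{fraction:.0%} fast path -> {cost:.2f}x")
```

Under these assumptions the average refcount operation costs between roughly 1.1× and 1.3× the GIL-build cost.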
Per-object critical sections
Implementation
Critical sections use a lightweight mutex stored in the object header:
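typedef struct PyCriticalSection {
    uintptr_t _cs_prev;   // Link to the enclosing critical section, if any
    PyMutex *_cs_mutex;   // Points at the locked object's mutex
} PyCriticalSection;
// Usage in dict operations
Py_BEGIN_CRITICAL_SECTION(dict_obj);
// ... mutate dict hash table ...
Py_END_CRITICAL_SECTION();
PyMutex is a 1-byte mutex that uses:
- Fast path: A single atomic compare-and-swap (uncontended case)
- Slow path: A brief spin, then the thread parks in CPython's internal parking lot until the mutex is released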
Two-object critical sections
Some operations need to lock two objects atomically (e.g., dict.update(other_dict)):
Py_BEGIN_CRITICAL_SECTION2(dict1, dict2);
// Both dicts locked, guaranteed deadlock-free
Py_END_CRITICAL_SECTION2();
Deadlock prevention uses pointer ordering — always lock the object at the lower memory address first.
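The same ordering trick works in pure Python. The sketch below is an analogy rather than CPython's code: `Locked`, `critical_section2`, and `update` are hypothetical names, a `threading.Lock` stands in for the per-object mutex, and `id()` (the object's address in CPython) supplies the ordering.

```python
import threading
from contextlib import contextmanager


class Locked:
    """A dict paired with its own lock (stand-in for a per-object mutex)."""

    def __init__(self, data):
        self.data = dict(data)
        self.lock = threading.Lock()


@contextmanager
def critical_section2(a, b):
    """Hold both locks, always acquiring the lower id() first."""
    first, second = sorted((a, b), key=id)
    if first is second:              # same object passed twice: lock once
        with first.lock:
            yield
        return
    with first.lock, second.lock:
        yield


def update(dst, src):
    """Rough analogue of dst.update(src) with both objects locked."""
    with critical_section2(dst, src):
        dst.data.update(src.data)
```

Because every thread agrees on the acquisition order, `update(a, b)` and `update(b, a)` running concurrently cannot each hold one lock while waiting for the other.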
Where critical sections are used
| Object type | Protected operations |
|---|---|
| dict | Insert, delete, resize, iteration |
| list | Append, insert, sort, slice assignment |
| set | Add, discard, union, intersection |
| type | MRO modification, descriptor cache |
| frame | f_locals access, stack manipulation |
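From Python code, the visible effect of these per-object locks is that single container operations stay intact under concurrency. A minimal sketch, assuming nothing beyond the standard library (the function name `append_from_threads` is made up here): several threads append to one shared list and no append is lost.

```python
import threading


def append_from_threads(n_threads=4, per_thread=10_000):
    """Append from several threads to one list and return the list."""
    shared = []

    def worker(tid):
        for i in range(per_thread):
            shared.append((tid, i))   # a single protected operation

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    print(len(append_from_threads()))
```

This only covers individual operations. A compound sequence such as "check the length, then append" is still a race and needs its own lock.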
Immortal objects
Mechanism
Objects can be marked immortal by setting a special refcount value:
#define _Py_IMMORTAL_REFCNT_LOCAL UINT32_MAX // Sentinel stored in the local refcount
static inline void Py_INCREF(PyObject *op) {
    if (_Py_IsImmortal(op)) {
        return; // No-op
    }
    // ... normal biased refcounting ...
}
Which objects are immortal
- None, True, False
- Small integers (-5 to 256)
- Interned strings (module names, attribute names)
- Type objects (builtin types)
- Code objects loaded from .pyc files
- The empty tuple ()
Impact
Immortalisation eliminates the most contended refcount operations. In a typical Python program, None and True/False account for ~15% of all Py_INCREF calls.
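Immortality can be observed from Python through `sys.getrefcount`. The sketch below assumes CPython 3.12 or later, where immortal objects were introduced; the helper name `refcount_is_stable` is made up for illustration.

```python
import sys


def refcount_is_stable(obj, copies=1000):
    """Return True if creating many new references leaves the refcount unchanged."""
    before = sys.getrefcount(obj)
    extra_refs = [obj] * copies      # creates `copies` new references to obj
    after = sys.getrefcount(obj)
    del extra_refs
    return before == after


if __name__ == "__main__":
    print("None:    ", refcount_is_stable(None))
    print("object():", refcount_is_stable(object()))
```

For an ordinary object the count grows by the number of new references, so the function returns False; for `None` on an interpreter with immortal objects the reported count does not move.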
Garbage collector changes
Stop-the-world GC
The cyclic garbage collector now uses stop-the-world pauses:
- Signal all threads to pause at safe points
- Run the mark phase (traverse reference graph)
- Run the sweep phase (collect unreachable objects)
- Resume all threads
Safe points are inserted at:
- Backward jumps (loop iterations)
- Function calls
- Py_BEGIN_CRITICAL_SECTION boundaries
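The pause-and-resume protocol can be sketched with a condition variable. This is a simplified model and not CPython's mechanism: the `World` class, its method names, and `run_demo` are hypothetical, and worker threads poll a flag at an explicit `safepoint()` call that plays the role of a backward jump.

```python
import threading


class World:
    """Toy stop-the-world coordinator (illustrative only)."""

    def __init__(self, n_workers):
        self._cond = threading.Condition()
        self._stop_requested = False
        self._parked = 0              # workers currently waiting at a safe point
        self._active = n_workers      # workers that have not exited yet

    def safepoint(self):
        with self._cond:
            if not self._stop_requested:
                return
            self._parked += 1
            self._cond.notify_all()   # tell the collector we are parked
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def worker_exit(self):
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def stop_the_world(self, collect):
        with self._cond:
            self._stop_requested = True
            while self._parked < self._active:
                self._cond.wait()     # wait until every live worker is parked
            try:
                return collect()      # runs while all workers are paused
            finally:
                self._stop_requested = False
                self._cond.notify_all()


def run_demo(n_workers=3, iterations=2000, pauses=5):
    world = World(n_workers)
    a = [0] * n_workers
    b = [0] * n_workers

    def worker(i):
        try:
            for _ in range(iterations):
                a[i] += 1             # between safe points a[i] and b[i]
                b[i] += 1             # may briefly disagree
                world.safepoint()
        finally:
            world.worker_exit()

    def snapshot_is_consistent():
        return all(x == y for x, y in zip(a, b))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    results = [world.stop_the_world(snapshot_is_consistent) for _ in range(pauses)]
    for t in threads:
        t.join()
    return results, a, b
```

Because the collector callback only runs once every live worker is parked at a safe point, it never observes a worker halfway between the two updates.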
GC overhead
Stop-the-world pauses are typically <1ms for most applications. Programs with millions of cyclic objects may see longer pauses. The GC runs less frequently than in GIL builds because deferred refcounting handles most cleanup.
Memory allocator changes
pymalloc was replaced with mimalloc in free-threaded builds:
- Thread-safe by design — per-thread heaps with lock-free fast paths
- Better scalability — no global allocator lock
- Comparable performance to pymalloc for single-threaded code
- ~5% more memory due to per-thread arena overhead
Making C extensions thread-safe
Step 1: Remove global state
// BAD: global state
static PyObject *module_cache = NULL;
// GOOD: per-module state
typedef struct {
    PyObject *cache;
} module_state;
Step 2: Declare GIL-free compatibility
static PyModuleDef_Slot module_slots[] = {
    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
    {0, NULL}
};
Step 3: Protect shared mutable state
// Use critical sections for object mutations
Py_BEGIN_CRITICAL_SECTION(self);
self->counter++;
Py_END_CRITICAL_SECTION();
// Or use atomic operations for simple counters
_Py_atomic_add_int(&self->counter, 1);
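The same rule applies to Python-level code that shares objects between threads: an in-place `+=` on an attribute is a read-modify-write and is not atomic. A minimal sketch of the lock-based fix, with the class name `Counter` and the helper `hammer` made up for illustration:

```python
import threading


class Counter:
    """Thread-safe counter guarded by an explicit lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:              # the read-modify-write runs as one unit
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, n_threads=8, per_thread=10_000):
    """Increment the counter from several threads and return the final value."""

    def worker():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the final value always equals `n_threads * per_thread`; without it, updates can be lost when threads truly run in parallel.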
Step 4: Test with ThreadSanitizer
# Build with TSan
CFLAGS="-fsanitize=thread" python3.13t setup.py build
# Run tests
TSAN_OPTIONS="suppressions=tsan.supp" python3.13t -m pytest
GIL re-enablement safety net
When a C extension without Py_MOD_GIL_NOT_USED is imported, CPython re-enables the GIL for all threads:
import sys
print(sys._is_gil_enabled()) # False
import some_old_extension # Triggers GIL re-enablement
print(sys._is_gil_enabled()) # True — GIL is back
You can force GIL-off mode with PYTHON_GIL=0 (or the -X gil=0 command-line option), but this risks crashes with unsafe extensions.
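Since `sys._is_gil_enabled()` only exists on Python 3.13 and later, code that has to run on older interpreters needs a guarded check. A small sketch, with the function name `gil_status` made up for illustration; it assumes that a missing probe means a conventional build with the GIL on.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is on now."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    probe = getattr(sys, "_is_gil_enabled", None)    # absent before 3.13
    gil_enabled = probe() if probe is not None else True
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


if __name__ == "__main__":
    print(gil_status())
```

Calling this once at start-up and again after importing third-party extensions shows whether one of them switched the GIL back on.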
Benchmarks: real-world impact
Web server (gunicorn + Flask)
| Configuration | Requests/sec |
|---|---|
| GIL build, 4 workers (processes) | 12,400 |
| Free-threaded, 4 threads, 1 worker | 11,800 |
| Free-threaded, 4 threads, 4 workers | 15,200 |
Web servers benefit modestly because they’re mostly I/O-bound.
Data processing (pure Python)
| Configuration | Time (seconds) |
|---|---|
| GIL build, 1 thread | 10.2 |
| GIL build, 4 threads | 10.5 (GIL serialises) |
| Free-threaded, 1 thread | 11.1 (overhead) |
| Free-threaded, 4 threads | 3.1 (true parallelism) |
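CPU-bound work scales well with cores: four free-threaded threads finish about 3.3× faster than the single-threaded GIL build (10.2 s versus 3.1 s), even though a single free-threaded thread is roughly 9% slower.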
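A workload of the kind measured above can be reproduced with a thread pool over a pure-Python CPU-bound function. This is a generic sketch, not the benchmark behind the table: `count_primes`, `count_primes_threaded`, and `timed` are made-up names, and timings will depend on the machine and build.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(start, stop):
    """Count primes in [start, stop) by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def count_primes_threaded(limit, n_threads=4):
    """Split [0, limit) into chunks and count each chunk on its own thread."""
    if limit <= 0:
        return 0
    step = -(-limit // n_threads)     # ceiling division
    ranges = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(lambda r: count_primes(*r), ranges))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    for label, fn in (("1 thread", count_primes_threaded_1 := lambda n: count_primes(0, n)),
                      ("4 threads", count_primes_threaded)):
        result, seconds = timed(fn, 200_000)
        print(f"{label}: {result} primes in {seconds:.2f} s")
```

On a GIL build the two timings should be close to each other; on a free-threaded build the threaded version has room to use several cores.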
Scientific computing (NumPy)
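NumPy has long released the GIL inside many large array operations, and NumPy 2.1 added preliminary support for free-threaded builds. Combined with free threading, parallel NumPy workflows can also run the Python code between array calls in parallel, without switching to multiprocessing.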
Timeline and what’s next
- 3.13: Experimental, separate build
- 3.14: Officially supported rather than experimental (PEP 779), performance improvements, broader library support
- 3.15-3.16: Stabilisation, potential default
- 3.17+: GIL build may become the non-default option
The Steering Council committed to maintaining backward compatibility — the GIL build won’t disappear until the ecosystem is ready.
The one thing to remember: Free threading replaces one big lock with a handful of fine-grained mechanisms (biased refcounting, per-object mutexes, immortal objects, and stop-the-world GC), each optimised for the common case where objects are owned by a single thread.