Python Free Threading (No-GIL) — Deep Dive

Biased refcounting internals, per-object critical sections, immortalisation mechanics, and how to make your C extensions free-threading safe.

Technical overview

PEP 703 (accepted October 2023, shipped experimentally in Python 3.13) makes the GIL optional in CPython by adding a separate free-threaded build. This required re-engineering every system that the GIL previously protected: reference counting, memory allocation, container thread safety, the garbage collector, and the import system. This deep dive covers the internal mechanisms and what extension authors need to know.

Biased reference counting — implementation

Object header layout

Every Python object in free-threaded builds has an expanded header (simplified here from CPython 3.13's Include/object.h):

struct _object {
    uintptr_t ob_tid;           // Thread ID of the owning thread (0 if none)
    uint16_t _padding;
    PyMutex ob_mutex;           // Per-object lock used by critical sections
    uint8_t ob_gc_bits;         // GC-related state
    uint32_t ob_ref_local;      // Thread-local refcount (owner only)
    Py_ssize_t ob_ref_shared;   // Shared refcount (atomic), low bits hold flags
    PyTypeObject *ob_type;
};

Fast path (owning thread)

When the owning thread (the one whose ID matches ob_tid) increments or decrements the refcount, it touches only the local field. A simplified Py_INCREF, with the immortality check omitted:

static inline void Py_INCREF(PyObject *op) {
    if (_Py_IsOwnedByCurrentThread(op)) {
        op->ob_ref_local++;     // Owner only: plain increment, cache-line local
    } else {
        // Atomic add; the count sits above two flag bits, so one reference is 1 << 2
        _Py_atomic_add_ssize(&op->ob_ref_shared, 1 << _Py_REF_SHARED_SHIFT);
    }
}

The local increment costs about the same as the old GIL-protected ob_refcnt++, plus the ownership check: no atomic read-modify-write and no memory barriers.

Slow path (other threads)

The shared refcount uses atomic operations. The count is shifted left by 2 bits: the two lowest bits hold state flags that record, among other things, whether the object has been queued for merging and whether the local count has already been merged into the shared one.

Deallocation

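When the local count reaches zero on the owning thread:

If the shared count is also zero (checked atomically) → deallocate immediately
If the shared count is positive → give up ownership by setting ob_tid = 0 and merging the local count into the shared one; whichever thread later drops the last shared reference deallocates the object

Separately, if a non-owning thread's decrement drives the shared count negative while the owner still holds references, the object is queued so that the owning thread can merge the two counts.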
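The fast path, slow path, and merge step above can be modelled in a few lines of Python. This is a toy sketch, not CPython's code: the class name, the `freed` attribute, and the use of a `threading.Lock` in place of atomic instructions are all assumptions made for illustration, and the two-flag-bit layout follows the description in PEP 703. Queueing on a negative shared count is left out.

```python
import threading

SHARED_SHIFT = 2      # assumed: two low bits of the shared field hold state flags
FLAG_MERGED = 0b11    # assumed flag value: local count has been merged into shared


class BiasedRefcount:
    """Toy model of one object's refcount fields (hypothetical, for illustration)."""

    def __init__(self):
        self.owner = threading.get_ident()  # plays the role of ob_tid
        self.local = 1                      # owner's count, plain integer
        self.shared = 0                     # other threads' count, shifted left
        self.lock = threading.Lock()        # stands in for atomic instructions
        self.freed = False

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                          # fast path: owner only
        else:
            with self.lock:                          # slow path: "atomic" add
                self.shared += 1 << SHARED_SHIFT

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
            if self.local == 0:
                self._merge_zero_local()
        else:
            with self.lock:
                self.shared -= 1 << SHARED_SHIFT
                # Once ownership was given up, the shared count is the only count.
                if self.owner == 0 and self.shared >> SHARED_SHIFT == 0:
                    self.freed = True

    def _merge_zero_local(self):
        with self.lock:
            if self.shared >> SHARED_SHIFT == 0:
                self.freed = True                    # no references anywhere
            else:
                self.owner = 0                       # give up ownership
                self.shared |= FLAG_MERGED           # mark the counts as merged
```

After the owner's count drops to zero while another thread still holds a reference, `owner` becomes 0 and every later operation takes the slow path, which is why the last non-owning `decref` is the one that frees the object.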

Performance analysis

On a microbenchmark that just increments/decrements refcounts:

Owning thread: identical speed to GIL build
Non-owning thread: ~3× slower due to atomic operations
In real workloads, 85-95% of refcount operations hit the fast path

Per-object critical sections

Implementation

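Critical sections use the lightweight mutex stored in each object's header (ob_mutex). The critical-section record itself lives on the C stack and points at that mutex (simplified from CPython's critical_section.h):

struct PyCriticalSection {
    uintptr_t _cs_prev;     // Tagged pointer to the enclosing critical section, if any
    PyMutex *_cs_mutex;     // The object's mutex held by this section
};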

// Usage in dict operations
Py_BEGIN_CRITICAL_SECTION(dict_obj);
// ... mutate dict hash table ...
Py_END_CRITICAL_SECTION();

PyMutex is a 1-byte mutex that uses:

Fast path: A single atomic compare-and-swap (uncontended case)
Slow path: After a brief spin, the thread is parked in CPython's internal "parking lot" until the lock is released (contended case)

Two-object critical sections

Some operations need to lock two objects atomically (e.g., dict.update(other_dict)):

Py_BEGIN_CRITICAL_SECTION2(dict1, dict2);
// Both dicts locked, guaranteed deadlock-free
Py_END_CRITICAL_SECTION2();

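Deadlock prevention uses pointer ordering: the mutex at the lower memory address is always locked first. In addition, a thread's active critical sections are suspended whenever it blocks and are reacquired afterwards, so nested critical sections cannot deadlock either.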
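The ordering rule can be illustrated at the Python level. The sketch below is an analogy rather than CPython's implementation: `Locked`, `critical_section2`, and `update` are hypothetical names, and `id()` is used as the ordering key because in CPython it is the object's address.

```python
import threading
from contextlib import contextmanager


class Locked:
    """Hypothetical object carrying its own lock, like ob_mutex in the header."""

    def __init__(self, data):
        self.data = data
        self.mutex = threading.Lock()


@contextmanager
def critical_section2(a, b):
    """Lock two objects in one global order (by id) so opposite callers cannot deadlock."""
    first, second = (a, b) if id(a) <= id(b) else (b, a)
    first.mutex.acquire()
    if second is not first:          # same object passed twice: lock it only once
        second.mutex.acquire()
    try:
        yield
    finally:
        if second is not first:
            second.mutex.release()
        first.mutex.release()


def update(dst, src):
    """Merge src into dst while both objects are locked."""
    with critical_section2(dst, src):
        dst.data.update(src.data)
```

Because every caller sorts the pair the same way, one thread running `update(a, b)` and another running `update(b, a)` both try to take the same lock first, so neither can hold one lock while waiting for the other.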

Where critical sections are used

Object type	Protected operations
`dict`	Insert, delete, resize, iteration
`list`	Append, insert, sort, slice assignment
`set`	Add, discard, union, intersection
`type`	MRO modification, descriptor cache
`frame`	f_locals access, stack manipulation

Immortal objects

Mechanism

Objects can be marked immortal by setting a special refcount value:

#define _Py_IMMORTAL_REFCNT_LOCAL  UINT32_MAX   // Stored in the local refcount field

static inline void Py_INCREF(PyObject *op) {
    if (_Py_IsImmortal(op)) {
        return;  // No-op
    }
    // ... normal biased refcounting ...
}

Which objects are immortal

None, True, False
Small integers (-5 to 256)
Interned strings (module names, attribute names)
Type objects (builtin types)
Code objects loaded from .pyc files
The empty tuple ()

Impact

Immortalisation eliminates the most contended refcount operations. In a typical Python program, None and True/False account for ~15% of all Py_INCREF calls.
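Immortality can be observed from Python, at least roughly. The helper below is a small sketch (`refcount_delta` is a hypothetical name) that measures how much `sys.getrefcount` moves while extra references exist. On interpreters with immortal objects (CPython 3.12 and later) the reported count for `None` is expected not to move, while an ordinary object's count rises by the number of new references.

```python
import sys


def refcount_delta(obj, n=1000):
    """Return how much sys.getrefcount(obj) changes while n extra references exist."""
    before = sys.getrefcount(obj)
    refs = [obj for _ in range(n)]   # n new references to the same object
    after = sys.getrefcount(obj)
    del refs
    return after - before


print("fresh object:", refcount_delta(object()))
print("None:        ", refcount_delta(None))
```

The absolute value returned by `sys.getrefcount` for an immortal object is an implementation detail and differs between versions, which is why the sketch compares differences rather than printing raw counts.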

Garbage collector changes

Stop-the-world GC

The cyclic garbage collector now uses stop-the-world pauses:

Signal all threads to pause at safe points
Run the mark phase (traverse reference graph)
Run the sweep phase (collect unreachable objects)
Resume all threads

Safe points are inserted at:

Backward jumps (loop iterations)
Function calls
Py_BEGIN_CRITICAL_SECTION boundaries
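The pause protocol can be sketched with ordinary threads. This is a simulation under stated assumptions, not the interpreter's mechanism: `StopTheWorld`, `safe_point`, `run_exclusive`, and `demo` are hypothetical names, and a `threading.Condition` stands in for the interpreter's internal signalling. Workers poll for a pause request at each loop iteration, and the "collector" runs only once every worker has parked.

```python
import threading
import time


class StopTheWorld:
    """Toy pause protocol: workers park at safe points while one thread runs alone."""

    def __init__(self, n_workers):
        self._cond = threading.Condition()
        self._n_workers = n_workers
        self._pause_requested = False
        self._parked = 0

    def safe_point(self):
        """Called by workers at loop back-edges; blocks while a pause is requested."""
        with self._cond:
            if not self._pause_requested:
                return
            self._parked += 1
            self._cond.notify_all()          # tell the collector one more thread parked
            while self._pause_requested:
                self._cond.wait()
            self._parked -= 1

    def run_exclusive(self, fn):
        """Request a pause, wait until every worker has parked, run fn, then resume."""
        with self._cond:
            self._pause_requested = True
            while self._parked < self._n_workers:
                self._cond.wait()
            try:
                return fn()
            finally:
                self._pause_requested = False
                self._cond.notify_all()


def demo(n_workers=4):
    stw = StopTheWorld(n_workers)
    counters = [0] * n_workers
    stop = threading.Event()

    def worker(i):
        while not stop.is_set():
            counters[i] += 1      # "mutator" work
            stw.safe_point()      # loop back-edge

    def collect():
        snapshot = list(counters)
        time.sleep(0.01)
        return snapshot == list(counters)   # True if nothing moved during the pause

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    world_was_stopped = stw.run_exclusive(collect)
    stop.set()
    for t in threads:
        t.join()
    return world_was_stopped
```

The design point the model captures is that a pause costs each thread at most the time it takes to reach its next safe point, so the density of safe points bounds how long the collector waits before it can start.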

GC overhead

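Stop-the-world pauses are typically <1ms for most applications. Programs with millions of cyclic objects may see longer pauses, partly because the free-threaded collector in 3.13 is non-generational and scans the whole heap on each run. Objects that use deferred reference counting (such as functions and modules) are reclaimed only by the cyclic collector.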

Memory allocator changes

pymalloc was replaced with mimalloc in free-threaded builds:

Thread-safe by design — per-thread heaps with lock-free fast paths
Better scalability — no global allocator lock
Comparable performance to pymalloc for single-threaded code
~5% more memory due to per-thread arena overhead

Making C extensions thread-safe

Step 1: Remove global state

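// BAD: process-wide global, shared by every thread and interpreter
static PyObject *module_cache = NULL;

// GOOD: per-module state, allocated by the runtime when m_size is set
typedef struct {
    PyObject *cache;
} module_state;

// Fetch it with PyModule_GetState(module); mutations of cache still need a lock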

Step 2: Declare GIL-free compatibility

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
#ifdef Py_GIL_DISABLED
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},  // Slot exists from 3.13; only matters on free-threaded builds
#endif
    {0, NULL}
};

Step 3: Protect shared mutable state

// Use critical sections for object mutations
Py_BEGIN_CRITICAL_SECTION(self);
self->counter++;
Py_END_CRITICAL_SECTION();

// Or use atomic operations for simple counters
// (_Py_atomic_* is CPython-private API; C11 <stdatomic.h> is the portable alternative)
_Py_atomic_add_int(&self->counter, 1);
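The same discipline applies to pure-Python code running on a free-threaded build: `+=` on a shared attribute is a read-modify-write, so it needs a lock. The sketch below uses hypothetical names (`Counter`, `hammer`) and only the standard `threading` module.

```python
import threading


class Counter:
    """Shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:          # load, add, and store must not interleave
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, n_threads=8, n_iters=10_000):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(n_iters):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the result is always `n_threads * n_iters`; without it, lost updates become possible as soon as threads genuinely run in parallel.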

Step 4: Test with ThreadSanitizer

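# Build the extension with TSan (ideally against an interpreter that was itself
# configured with --with-thread-sanitizer, so reports cover both sides)
CFLAGS="-fsanitize=thread" python3.13t setup.py build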

# Run tests
TSAN_OPTIONS="suppressions=tsan.supp" python3.13t -m pytest
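TSan only reports races on code paths that actually execute concurrently, so the test suite needs tests that call the extension from several threads at once. A minimal harness might look like the following; `run_concurrently` is a hypothetical helper, and the barrier is there so that all threads enter the function under test at nearly the same moment.

```python
import threading


def run_concurrently(fn, n_threads=8, n_iters=1000):
    """Call fn(thread_index) from n_threads threads that start together.

    Returns the list of exceptions raised by the workers (empty on success).
    """
    barrier = threading.Barrier(n_threads)
    errors = []
    errors_lock = threading.Lock()

    def worker(index):
        barrier.wait()                      # release all threads at once
        try:
            for _ in range(n_iters):
                fn(index)
        except Exception as exc:            # collect instead of losing it in the thread
            with errors_lock:
                errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

A test would then pass the extension call as `fn`, for example `run_concurrently(lambda i: obj.method(i))` with `obj` being an instance from the module under test, and assert that the returned list is empty.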

GIL re-enablement safety net

When a C extension without Py_MOD_GIL_NOT_USED is imported, CPython re-enables the GIL for the whole process and emits a RuntimeWarning naming the module:

import sys
print(sys._is_gil_enabled())  # False

import some_old_extension      # Triggers GIL re-enablement
print(sys._is_gil_enabled())  # True — GIL is back

You can force GIL-off mode with PYTHON_GIL=0 (or the equivalent -X gil=0 command-line option), but this risks crashes with unsafe extensions.
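Because `sys._is_gil_enabled()` only exists on Python 3.13 and later, code that has to run on older interpreters needs a fallback. The sketch below (`gil_status` is a hypothetical helper) separates two questions: whether the interpreter is a free-threaded build at all, and whether the GIL is active right now.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is currently on."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older interpreters always have a GIL.
    is_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": bool(is_enabled()),
    }


print(gil_status())
```

Calling this before and after importing third-party extensions is a quick way to find out which import switched the GIL back on.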

Benchmarks: real-world impact

Web server (gunicorn + Flask)

Configuration	Requests/sec
GIL build, 4 workers (processes)	12,400
Free-threaded, 4 threads, 1 worker	11,800
Free-threaded, 4 threads, 4 workers	15,200

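Web servers benefit modestly because they’re mostly I/O-bound. In this table, one free-threaded process with four threads is about 5% slower than four GIL processes (11,800 vs 12,400 requests/sec), while combining four workers with four threads each is about 23% faster (15,200 vs 12,400).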

Data processing (pure Python)

Configuration	Time (seconds)
GIL build, 1 thread	10.2
GIL build, 4 threads	10.5 (GIL serialises)
Free-threaded, 1 thread	11.1 (overhead)
Free-threaded, 4 threads	3.1 (true parallelism)

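CPU-bound work scales well with cores: here four threads give roughly a 3.3× speedup over the single-threaded GIL build (10.2 s vs 3.1 s), after paying about 9% single-threaded overhead (11.1 s vs 10.2 s).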
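A comparison of this kind can be reproduced with a small harness. The sketch below is not the benchmark behind the table above: `count_primes` and `timed_run` are hypothetical names, and the workload is an arbitrary CPU-bound loop chosen for illustration. Running it under both a GIL build and a free-threaded build shows whether adding threads shortens the wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound pure-Python work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_run(n_threads, n_tasks=4, limit=20_000):
    """Run n_tasks identical jobs on a pool of n_threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(count_primes, [limit] * n_tasks))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    t1, r1 = timed_run(1)
    t4, r4 = timed_run(4)
    assert r1 == r4                      # same answers regardless of thread count
    print(f"1 thread: {t1:.2f}s, 4 threads: {t4:.2f}s, speedup: {t1 / t4:.2f}x")
```

The total amount of work is the same in both runs, so any difference in elapsed time comes from how many tasks the interpreter can execute at once.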

Scientific computing (NumPy)

NumPy has long released the GIL during many large array operations, and NumPy 2.1 added preliminary support for free-threaded builds. Combined with free threading, parallel NumPy workflows can see significant speedups without switching to multiprocessing.

Timeline and what’s next

3.13: Experimental, separate build
3.14: Officially supported rather than experimental (PEP 779), performance improvements, broader library support
3.15-3.16: Stabilisation, potential default
3.17+: GIL build may become the non-default option

The Steering Council committed to maintaining backward compatibility — the GIL build won’t disappear until the ecosystem is ready.

The one thing to remember: Free threading replaces one big lock with a handful of fine-grained mechanisms (biased refcounting, per-object mutexes, immortal objects, and stop-the-world GC), each optimised for the common case where objects are owned by a single thread.
