Python Free Threading (No-GIL) — Deep Dive

Technical overview

PEP 703 (accepted in October 2023, shipped experimentally in Python 3.13) makes the GIL optional in CPython. A separate free-threaded build (python3.13t) runs without it. This required re-engineering every subsystem that the GIL previously protected: reference counting, memory allocation, container thread safety, the garbage collector, and the import system. This deep dive covers the internal mechanisms and what extension authors need to know.

Biased reference counting — implementation

Object header layout

Every Python object in free-threaded builds has an expanded header:

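struct _object {
    // Biased refcount fields (free-threaded build only)
    uintptr_t ob_tid;           // Thread ID of the owning thread (0 once merged)
    uint16_t _padding;          // Reserved
    PyMutex ob_mutex;           // Per-object lock used by critical sections (1 byte)
    uint8_t ob_gc_bits;         // GC state bits (1 byte)
    uint32_t ob_ref_local;      // Local refcount, written only by the owning thread
    Py_ssize_t ob_ref_shared;   // Shared refcount (atomic); low 2 bits are flags
    PyTypeObject *ob_type;
};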

Fast path (owning thread)

When the owning thread (matching ob_tid) increments or decrements the refcount:

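// Simplified: the immortality check is omitted here
static inline void Py_INCREF(PyObject *op) {
    if (_Py_IsOwnedByCurrentThread(op)) {
        // Owner: plain increment, no atomic read-modify-write
        op->ob_ref_local++;
    } else {
        // Other threads: atomic add; the count sits above 2 flag bits
        _Py_atomic_add_ssize(&op->ob_ref_shared, 1 << _Py_REF_SHARED_SHIFT);
    }
}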

The local increment costs about the same as the old GIL-protected ob_refcnt++, plus a thread-ID comparison. It needs no atomic read-modify-write and no memory barrier.

Slow path (other threads)

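The shared refcount uses atomic operations. The count is shifted left by 2 bits (_Py_REF_SHARED_SHIFT): the two lowest bits hold state flags that record whether the object may have weak references, has been queued for its owning thread to merge, or has already had its local count merged into the shared field.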

Deallocation

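When the owning thread drops the local count (ob_ref_local) to zero:

  1. If the shared field (ob_ref_shared) is also zero (checked atomically) → deallocate immediately
  2. If the shared field is non-zero → give up ownership by setting ob_tid = 0 and mark the shared field as merged; if the merged count is zero, deallocate
  3. Otherwise the object stays alive, and whichever thread later drops the merged count to zero deallocates it

Non-owning threads can also trigger a merge: if one of their decrements would make the shared count negative, the object is queued so that its owner merges the local count at a later safe point. This is separate from the cyclic GC.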
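
The owner and non-owner paths can be modelled in a few lines of Python. This is a toy, single-threaded simulation for illustration only, not CPython's implementation. The names `BiasedRef`, `incref`, and `decref` are made up, plain integers stand in for atomics, and the 2-bit shift and the `MERGED` flag value are assumed to match CPython 3.13. Where CPython would queue an object for its owner to merge, the model merges immediately.

```python
SHARED_SHIFT = 2   # assumed: the shared count sits above two flag bits
FLAG_MASK = 0b11
MERGED = 0b11      # assumed flag value: local count folded into shared


class BiasedRef:
    """Toy model of one object's biased reference count (not thread-safe)."""

    def __init__(self, owner):
        self.owner = owner   # stands in for ob_tid
        self.local = 1       # the creating thread holds the first reference
        self.shared = 0      # (count << SHARED_SHIFT) | flags
        self.freed = False

    def incref(self, thread):
        if thread == self.owner:
            self.local += 1                    # fast path: plain add
        else:
            self.shared += 1 << SHARED_SHIFT   # slow path: atomic add in C

    def decref(self, thread):
        if thread == self.owner:
            self.local -= 1
            if self.local == 0:
                if self.shared == 0:
                    self.freed = True          # never shared: free right away
                else:
                    self._merge()
        else:
            self.shared -= 1 << SHARED_SHIFT
            merged = (self.shared & FLAG_MASK) == MERGED
            if merged and self.shared >> SHARED_SHIFT == 0:
                self.freed = True
            elif not merged and self.shared < 0:
                # CPython queues the object for its owner to merge later;
                # the toy model merges straight away.
                self._merge()

    def _merge(self):
        """Fold the local count into the shared field and drop ownership."""
        count = (self.shared >> SHARED_SHIFT) + self.local
        self.local = 0
        self.owner = None
        self.shared = (count << SHARED_SHIFT) | MERGED
        if count == 0:
            self.freed = True
```

For example, if thread "A" creates the object, thread "B" takes a reference, and "A" then drops its own, the model hands the object over to the shared count and frees it only when "B" releases its reference.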

Performance analysis

On a microbenchmark that just increments/decrements refcounts:

  • Owning thread: identical speed to GIL build
  • Non-owning thread: ~3× slower due to atomic operations
  • In real workloads, 85-95% of refcount operations hit the fast path

Per-object critical sections

Implementation

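Critical sections are built on the lightweight one-byte mutex stored in each object's header. The critical-section record itself lives on the C stack and only points at that mutex:

typedef struct {
    uintptr_t _cs_prev;     // Enclosing critical section, if any (tagged pointer)
    PyMutex *_cs_mutex;     // Points at the locked object's mutex
} PyCriticalSection;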

// Usage in dict operations
Py_BEGIN_CRITICAL_SECTION(dict_obj);
// ... mutate dict hash table ...
Py_END_CRITICAL_SECTION();

PyMutex is a 1-byte mutex that uses:

  • Fast path: A single atomic compare-and-swap (uncontended case)
  • Slow path: A brief spin, after which the thread parks in a wait queue until the holder unlocks (contended case)

Two-object critical sections

Some operations need to lock two objects atomically (e.g., dict.update(other_dict)):

Py_BEGIN_CRITICAL_SECTION2(dict1, dict2);
// Both dicts locked, guaranteed deadlock-free
Py_END_CRITICAL_SECTION2();

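Deadlock prevention uses address ordering: the mutex at the lower memory address is always locked first, so two threads that name the same pair of objects in opposite order still acquire the locks in the same sequence.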
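
The same ordering idea can be sketched at the Python level. This is an analogy using `threading.Lock`, not the C macro: `lock_both` and `demo` are hypothetical helper names, and `id()` is used as the ordering key on the assumption that it behaves like an address, as it does in CPython.

```python
import threading
from contextlib import contextmanager


@contextmanager
def lock_both(a, b):
    """Acquire two locks in a globally consistent order (lowest id() first)."""
    first, second = (a, b) if id(a) <= id(b) else (b, a)
    first.acquire()
    try:
        if second is not first:      # same lock passed twice: lock it once
            second.acquire()
        try:
            yield
        finally:
            if second is not first:
                second.release()
    finally:
        first.release()


def demo(rounds=2000):
    """Two threads lock the same pair in opposite argument order."""
    x, y = threading.Lock(), threading.Lock()
    totals = {"n": 0}

    def worker(a, b):
        for _ in range(rounds):
            with lock_both(a, b):
                totals["n"] += 1     # protected: both locks are held

    threads = [
        threading.Thread(target=worker, args=(x, y)),
        threading.Thread(target=worker, args=(y, x)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals["n"]
```

Without the sort, the two workers could each hold one lock and wait forever for the other. With it, both always contend for the same lock first.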

Where critical sections are used

Object type | Protected operations
dict        | Insert, delete, resize, iteration
list        | Append, insert, sort, slice assignment
set         | Add, discard, union, intersection
type        | MRO modification, descriptor cache
frame       | f_locals access, stack manipulation
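
From Python code, the practical consequence is that a single container operation stays safe without an explicit lock, while a compound read-modify-write does not become atomic. The sketch below illustrates both cases; `parallel_appends` and `LockedCounter` are hypothetical names chosen for this example.

```python
import threading


def parallel_appends(n_threads=4, per_thread=10_000):
    """Append to one shared list from several threads without an explicit lock."""
    shared = []

    def worker(tag):
        for i in range(per_thread):
            shared.append((tag, i))  # one container operation, locked internally

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


class LockedCounter:
    """A read-modify-write spans several operations, so it needs its own lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:
            self.value += 1
```

No append is lost in `parallel_appends`, although the interleaving of items from different threads is unspecified.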

Immortal objects

Mechanism

Objects can be marked immortal by setting a special refcount value:

#define _Py_IMMORTAL_REFCNT_LOCAL  UINT32_MAX   // stored in the local refcount field

static inline void Py_INCREF(PyObject *op) {
    if (_Py_IsImmortal(op)) {
        return;  // No-op
    }
    // ... normal biased refcounting ...
}

Which objects are immortal

  • None, True, False
  • Small integers (-5 to 256)
  • Interned strings (module names, attribute names)
  • Type objects (builtin types)
  • Code objects loaded from .pyc files
  • The empty tuple ()

Impact

Immortalisation eliminates the most contended refcount operations. In a typical Python program, None and True/False account for ~15% of all Py_INCREF calls.
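
Immortality can be observed from Python by checking whether new references change the reported count. This is a rough probe, assuming CPython 3.12 or later (where None is immortal); `refcount_delta` is a hypothetical helper, and the absolute values returned by `sys.getrefcount` are an implementation detail.

```python
import sys


def refcount_delta(obj, n=1000):
    """Return how much sys.getrefcount(obj) changes after taking n new references."""
    before = sys.getrefcount(obj)
    extra = [obj] * n            # n additional references held by the list
    after = sys.getrefcount(obj)
    del extra
    return after - before


if __name__ == "__main__":
    print("fresh object:", refcount_delta(object()))
    print("None:        ", refcount_delta(None))
```

An ordinary object should report a change equal to the number of new references, while an immortal object such as None is expected to report no change.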

Garbage collector changes

Stop-the-world GC

The cyclic garbage collector now uses stop-the-world pauses:

  1. Signal all threads to pause at safe points
  2. Run the mark phase (traverse reference graph)
  3. Run the sweep phase (collect unreachable objects)
  4. Resume all threads

Safe points are inserted at:

  • Backward jumps (loop iterations)
  • Function calls
  • Py_BEGIN_CRITICAL_SECTION boundaries
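
What the collector is for can be seen with a small reference cycle, which reference counting alone never frees. The sketch below uses only the standard `gc` and `weakref` modules; `Node` and `collect_cycle` are names invented for this example, and automatic collection is switched off briefly so the "before" observation is reliable.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


def collect_cycle():
    """Build an unreachable cycle and report whether it survives a collection."""
    was_enabled = gc.isenabled()
    gc.disable()                      # keep automatic GC out of the measurement
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a       # cycle: neither refcount can reach zero
        probe = weakref.ref(a)
        del a, b                      # unreachable, but kept alive by the cycle
        alive_before = probe() is not None
        gc.collect()                  # in free-threaded builds, other threads pause here
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()
```

The function returns a pair of booleans: whether the cycle was still alive before the explicit collection, and whether it was alive afterwards.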

GC overhead

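Stop-the-world pauses are typically under 1 ms for most applications, though programs with millions of cyclic objects may see longer pauses. The free-threaded collector is non-generational, so each collection scans the whole heap. Objects that use deferred reference counting (per PEP 703: functions, code objects, modules) can only be reclaimed during a GC cycle, so deferred refcounting shifts some cleanup onto the collector rather than away from it.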

Memory allocator changes

pymalloc was replaced with mimalloc in free-threaded builds:

  • Thread-safe by design — per-thread heaps with lock-free fast paths
  • Better scalability — no global allocator lock
  • Comparable performance to pymalloc for single-threaded code
  • ~5% more memory due to per-thread arena overhead

Making C extensions thread-safe

Step 1: Remove global state

// BAD: global state
static PyObject *module_cache = NULL;

// GOOD: per-module state
typedef struct {
    PyObject *cache;
} module_state;

Step 2: Declare GIL-free compatibility

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
#ifdef Py_GIL_DISABLED
    // Only meaningful on free-threaded builds (3.13+)
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
#endif
    {0, NULL}
};

Step 3: Protect shared mutable state

// Use critical sections for object mutations
Py_BEGIN_CRITICAL_SECTION(self);
self->counter++;
Py_END_CRITICAL_SECTION();

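// Or use atomic operations for simple counters.
// _Py_atomic_add_int is a private CPython helper; C11 <stdatomic.h>
// (atomic_fetch_add) is the portable alternative for extension code.
_Py_atomic_add_int(&self->counter, 1);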

Step 4: Test with ThreadSanitizer

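# Build the extension with TSan. For useful reports the interpreter itself
# should also be a TSan build (configured with --disable-gil --with-thread-sanitizer)
CFLAGS="-fsanitize=thread" python3.13t setup.py build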

# Run tests
TSAN_OPTIONS="suppressions=tsan.supp" python3.13t -m pytest

GIL re-enablement safety net

When a C extension without Py_MOD_GIL_NOT_USED is imported, CPython re-enables the GIL for all threads and emits a RuntimeWarning naming the module:

import sys
print(sys._is_gil_enabled())  # False

import some_old_extension      # Triggers GIL re-enablement
print(sys._is_gil_enabled())  # True — GIL is back

You can force GIL-off mode with PYTHON_GIL=0 (or -X gil=0 on the command line), but this risks crashes with unsafe extensions.
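
Because `sys._is_gil_enabled()` only exists on Python 3.13 and later, portable code may want to probe for it. The following is a small sketch with a hypothetical helper name (`gil_status`). It treats a missing function as "GIL enabled", which is an assumption that holds for older CPython releases.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    probe = getattr(sys, "_is_gil_enabled", None)   # added in Python 3.13
    gil_enabled = bool(probe()) if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    print(gil_status())
```

Calling it again after importing third-party extensions shows whether one of them caused the GIL to be switched back on.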

Benchmarks: real-world impact

Web server (gunicorn + Flask)

Configuration                       | Requests/sec
GIL build, 4 workers (processes)    | 12,400
Free-threaded, 4 threads, 1 worker  | 11,800
Free-threaded, 4 threads, 4 workers | 15,200

Web servers benefit modestly because they’re mostly I/O-bound.

Data processing (pure Python)

Configuration            | Time (seconds)
GIL build, 1 thread      | 10.2
GIL build, 4 threads     | 10.5 (GIL serialises)
Free-threaded, 1 thread  | 11.1 (overhead)
Free-threaded, 4 threads | 3.1 (true parallelism)

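CPU-bound work scales well with cores: in this run, 4 free-threaded threads finish about 3.3× faster than the GIL build (10.2 s vs 3.1 s) and about 3.6× faster than a single free-threaded thread, while the single-threaded overhead is roughly 9% (11.1 s vs 10.2 s).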
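
A comparison of this kind can be reproduced with a short script. This is a minimal sketch, not the benchmark behind the table: the workload (`count_primes`), the helper `timed`, and the chosen sizes are all assumptions made for illustration. Run it under both a GIL build and a free-threaded build to compare.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound pure-Python work: count primes below limit."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed(n_threads, chunks, limit):
    """Run `chunks` identical tasks on a pool of n_threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(count_primes, [limit] * chunks))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    t1, r1 = timed(1, 4, 30_000)
    t4, r4 = timed(4, 4, 30_000)
    assert r1 == r4                      # same answers either way
    print(f"1 thread: {t1:.2f}s  4 threads: {t4:.2f}s  speedup: {t1 / t4:.2f}x")
```

On a GIL build the two timings should be close to each other. Any speedup on a free-threaded build depends on core count and on how much the threads contend for shared objects.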

Scientific computing (NumPy)

NumPy has long released the GIL inside many large array operations, and NumPy 2.1 added preliminary support for free-threaded builds. Combined with free threading, parallel NumPy workflows can see significant speedups without switching to multiprocessing.

Timeline and what’s next

  • 3.13: Experimental, separate build
  • 3.14: Officially supported but still optional (PEP 779), with performance improvements and broader library support
  • 3.15-3.16: Stabilisation, potential default
  • 3.17+: GIL build may become the non-default option

The Steering Council committed to maintaining backward compatibility — the GIL build won’t disappear until the ecosystem is ready.

The one thing to remember: Free threading replaces one big lock with a handful of fine-grained mechanisms (biased refcounting, per-object mutexes, immortal objects, and stop-the-world GC), each optimised for the common case where objects are owned by a single thread.
