Python Threading Locks and Semaphores — Deep Dive
Lock Implementation Under the Hood
CPython’s threading.Lock wraps a _thread.lock object, which is backed by an OS-level synchronization primitive (the exact primitive varies by platform and CPython version). A simplified view of the object:
// Simplified CPython lock (Modules/_threadmodule.c)
typedef struct {
    PyObject_HEAD
    char locked;            // 0 or 1
    unsigned long owner;    // thread id (RLock only)
    unsigned long count;    // recursion depth (RLock only)
} lockobject;
acquire() calls PyThread_acquire_lock(), which uses:
- Linux: sem_wait() on a POSIX semaphore (or futex on newer kernels)
- macOS: pthread_mutex_lock()
- Windows: WaitForSingleObject() on a kernel event
When a lock is contended, the waiting thread is put to sleep by the OS — no spinning. This makes locks efficient for long waits but adds syscall overhead for short critical sections.
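A rough way to see the fixed cost a lock adds to a very short critical section is to time the same trivial operation with and without it. This is only a sketch: it measures the uncontended path, so it is a lower bound, and the numbers depend on the machine and Python version. The function names are illustrative.

```python
import threading
import timeit

lock = threading.Lock()

def with_lock(counter):
    # Acquire and release around a deliberately tiny critical section.
    with lock:
        counter[0] += 1

def without_lock(counter):
    # Same work, no synchronization (safe here: only one thread runs it).
    counter[0] += 1

def per_call_ns(n=100_000):
    """Return (locked, unlocked) cost in nanoseconds per call."""
    counter = [0]
    locked = timeit.timeit(lambda: with_lock(counter), number=n)
    unlocked = timeit.timeit(lambda: without_lock(counter), number=n)
    return locked / n * 1e9, unlocked / n * 1e9

if __name__ == "__main__":
    locked_ns, unlocked_ns = per_call_ns()
    print(f"with lock: {locked_ns:.0f} ns/call, without: {unlocked_ns:.0f} ns/call")
```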
The GIL Interaction
The GIL and user-level locks serve different purposes but interact subtly:
- Thread A holds the GIL and acquires Lock X.
- Thread A releases the GIL (I/O operation or periodic check interval).
- Thread B gets the GIL and tries to acquire Lock X — blocks.
- Thread B releases the GIL while blocked on Lock X.
- Thread A gets the GIL back and continues inside Lock X.
The key insight: while a thread waits on a user-level lock, it releases the GIL, allowing other threads to run. This means locks don’t cause GIL starvation.
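The sequence above can be observed with a small experiment. In this sketch (all names are illustrative), one thread blocks on a held lock while another runs a pure-Python loop that can only advance when it holds the GIL. If the blocked thread kept the GIL, the loop would make no progress.

```python
import threading
import time

lock_x = threading.Lock()
progress = {"ticks": 0}
stop = threading.Event()

def blocked_waiter():
    # Blocks here until lock_x is released; the GIL is given up while waiting.
    with lock_x:
        pass

def ticker():
    # Pure-Python loop: it only advances while this thread holds the GIL.
    while not stop.is_set():
        progress["ticks"] += 1

def run_demo(hold_seconds=0.2):
    progress["ticks"] = 0
    stop.clear()
    lock_x.acquire()                  # main thread holds Lock X
    waiter = threading.Thread(target=blocked_waiter)
    tick = threading.Thread(target=ticker)
    waiter.start()
    tick.start()
    time.sleep(hold_seconds)          # waiter stays blocked this whole time
    ticks_while_blocked = progress["ticks"]
    lock_x.release()                  # let the waiter finish
    stop.set()
    waiter.join()
    tick.join()
    return ticks_while_blocked

if __name__ == "__main__":
    print("ticks while another thread was blocked:", run_demo())
```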
Python 3.12 added a per-interpreter GIL (PEP 684). Threads in the same interpreter still share one GIL, but threads in different interpreters do not, so any resource they share (for example, shared memory) needs its own explicit synchronization.
Contention Profiling
High lock contention is a silent performance killer. Measure it:
import threading
import time

class ProfiledLock:
    def __init__(self, name=""):
        self._lock = threading.Lock()
        self.name = name
        self.wait_time = 0.0
        self.hold_time = 0.0
        self.acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        self.wait_time += time.perf_counter() - start
        self.acquisitions += 1
        self._acquired_at = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.hold_time += time.perf_counter() - self._acquired_at
        self._lock.release()

    def stats(self):
        return {
            "name": self.name,
            "acquisitions": self.acquisitions,
            "total_wait_s": round(self.wait_time, 4),
            "total_hold_s": round(self.hold_time, 4),
            "avg_wait_ms": round(self.wait_time / max(1, self.acquisitions) * 1000, 3),
        }
If avg_wait_ms is high, your threads spend more time waiting than working — consider lock-free data structures or reducing the critical section.
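One way to reduce the critical section is to do the expensive work before taking the lock and hold it only for the shared update. The following is a minimal sketch; expensive_transform is a hypothetical stand-in for real work, and the totals dictionary is just an example of shared state.

```python
import threading

totals = {"sum": 0, "count": 0}
totals_lock = threading.Lock()

def expensive_transform(x):
    # Hypothetical stand-in for real work; touches no shared state.
    return sum(i * i for i in range(x))

def worker(values):
    for v in values:
        computed = expensive_transform(v)   # slow part runs outside the lock
        with totals_lock:                   # lock covers only the shared update
            totals["sum"] += computed
            totals["count"] += 1

def run(num_threads=4, per_thread=50):
    totals.update(sum=0, count=0)
    threads = [
        threading.Thread(target=worker, args=(range(per_thread),))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(totals)

if __name__ == "__main__":
    print(run())
```

Had expensive_transform been called inside the with block, every thread would have queued behind it and the hold time per acquisition would have grown accordingly.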
Advanced Synchronization Primitives
Event: One-Time Signal
ready = threading.Event()

def worker():
    ready.wait()  # blocks until set
    do_work()

def main():
    initialize_resources()
    ready.set()  # all waiting workers proceed
Unlike Condition, Event doesn’t require a lock. It’s ideal for “start gun” patterns where many threads wait for a single signal.
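A runnable version of the start-gun pattern might look like the sketch below. The order list and its lock exist only to record what happened so the behavior can be checked; they are not part of the pattern itself.

```python
import threading

def run_start_gun(num_workers=3):
    ready = threading.Event()
    order = []                       # records events in the order they occur
    order_lock = threading.Lock()

    def worker(n):
        ready.wait()                 # every worker parks here until set()
        with order_lock:
            order.append(f"worker-{n}")

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(num_workers)]
    for t in threads:
        t.start()

    with order_lock:
        order.append("initialized")  # setup finishes before the signal
    ready.set()                      # release all waiting workers at once

    for t in threads:
        t.join()
    return order

if __name__ == "__main__":
    print(run_start_gun())
```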
Barrier: Rendezvous Point
barrier = threading.Barrier(4)  # wait for 4 threads

def phase_worker(phase_data):
    result_1 = process_phase_1(phase_data)
    barrier.wait()  # all 4 threads must reach here
    # Now all phase_1 results exist
    result_2 = process_phase_2()
Barriers synchronize threads at a specific point. All threads must call wait() before any can proceed. Useful for phased computation where each phase depends on all threads completing the previous phase.
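As a concrete illustration, the sketch below has each thread write a phase-1 value into its own slot, wait at the barrier, and then read every slot in phase 2. The arithmetic is a placeholder for real per-phase work.

```python
import threading

def run_phases(num_threads=4):
    barrier = threading.Barrier(num_threads)
    phase_1 = [None] * num_threads
    phase_2 = [None] * num_threads

    def phase_worker(i):
        phase_1[i] = (i + 1) * 10      # phase 1: each thread fills its own slot
        barrier.wait()                 # nobody continues until all slots are filled
        phase_2[i] = sum(phase_1)      # phase 2: safe to read every phase-1 result

    threads = [threading.Thread(target=phase_worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return phase_1, phase_2

if __name__ == "__main__":
    print(run_phases())
```

Without the barrier, a fast thread could reach sum(phase_1) while another slot still held None.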
Reader-Writer Lock (Custom)
Python doesn’t ship one, but it’s a common need — many readers, exclusive writer:
class RWLock:
    def __init__(self):
        self._readers = 0
        self._lock = threading.Lock()
        self._writers = threading.Lock()

    def read_acquire(self):
        with self._lock:
            self._readers += 1
            if self._readers == 1:
                self._writers.acquire()

    def read_release(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._writers.release()

    def write_acquire(self):
        self._writers.acquire()

    def write_release(self):
        self._writers.release()
This allows unlimited concurrent readers but exclusive writer access. Caveat: this simple implementation can starve writers if readers never drain to zero. Production implementations add writer-priority queuing.
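One possible way to give writers priority is to track waiting writers and make new readers yield to them. The class below is a sketch built on threading.Condition, not a reference implementation; its name and fields are illustrative, and it trades writer starvation for possible reader starvation under a steady stream of writers.

```python
import threading

class WriterPreferringRWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    def read_acquire(self):
        with self._cond:
            # New readers yield to any writer that is active or queued.
            while self._writer_active or self._waiting_writers > 0:
                self._cond.wait()
            self._active_readers += 1

    def read_release(self):
        with self._cond:
            self._active_readers -= 1
            if self._active_readers == 0:
                self._cond.notify_all()   # a queued writer may now proceed

    def write_acquire(self):
        with self._cond:
            self._waiting_writers += 1    # announce intent so readers back off
            try:
                while self._writer_active or self._active_readers > 0:
                    self._cond.wait()
            finally:
                self._waiting_writers -= 1
            self._writer_active = True

    def write_release(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()       # wake both readers and writers
```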
Deadlock Detection
Python doesn’t detect deadlocks automatically, but you can identify them:
Timeout-Based Detection
import logging
import sys
import traceback

acquired = lock.acquire(timeout=10)
if not acquired:
    logging.error("Potential deadlock detected!")
    traceback.print_stack()
    # Dump all thread stacks
    for thread_id, frame in sys._current_frames().items():
        print(f"\nThread {thread_id}:")
        traceback.print_stack(frame)
Thread Dump on Signal (Unix)
import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Map thread ids to names once, not once per thread
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, stack in sys._current_frames().items():
        name = names.get(thread_id, "?")
        print(f"\n--- Thread {name} ({thread_id}) ---")
        traceback.print_stack(stack)

signal.signal(signal.SIGUSR1, dump_threads)
# Send: kill -USR1 <pid>
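The standard library's faulthandler module can produce a comparable dump without a hand-written handler. The sketch below shows two uses: registering a signal-triggered dump (faulthandler.register is not available on Windows, hence the guard) and dumping all thread stacks on demand into a temporary file. The helper names are illustrative.

```python
import faulthandler
import signal
import sys
import tempfile

def install_thread_dump(file=sys.stderr):
    # Register a dump of all thread stacks on SIGUSR1 where supported.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
        return True
    return False

def dump_now():
    # faulthandler writes straight to the file descriptor, so a real file is needed.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()

if __name__ == "__main__":
    install_thread_dump()
    print(dump_now())
```

For a watchdog-style check, faulthandler.dump_traceback_later(timeout) dumps the stacks if the program is still running after the given number of seconds.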
Lock-Free Alternatives
Sometimes you can avoid locks entirely:
queue.Queue (Thread-Safe by Design)
import queue

q = queue.Queue(maxsize=100)

def producer(item):
    q.put(item)  # thread-safe, blocks if full

def consumer():
    item = q.get()  # thread-safe, blocks if empty
    q.task_done()
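A fuller producer/consumer sketch needs a way to tell the consumer to stop. One common convention, assumed here, is a sentinel object placed on the queue after the last item. The small maxsize is chosen only to exercise the blocking behavior of put().

```python
import queue
import threading

SENTINEL = object()   # marks the end of the stream (a convention, not a Queue feature)

def run_pipeline(items, maxsize=2):
    q = queue.Queue(maxsize=maxsize)
    consumed = []

    def producer():
        for item in items:
            q.put(item)           # blocks while the queue is full
        q.put(SENTINEL)

    def consumer():
        while True:
            item = q.get()        # blocks while the queue is empty
            try:
                if item is SENTINEL:
                    return
                consumed.append(item * 2)
            finally:
                q.task_done()     # one task_done() per get()

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed

if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4, 5]))
```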
threading.local (Per-Thread State)
local = threading.local()

def worker():
    local.connection = create_connection()  # each thread gets its own
    local.connection.query(...)
No lock needed because each thread has its own copy.
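The isolation can be checked directly. In this sketch every thread assigns a different value to the same attribute name, waits until all threads have done so, and then reads it back. The barrier is there only to force the assignments to overlap.

```python
import threading

local = threading.local()

def run(num_threads=3):
    seen = [None] * num_threads
    all_assigned = threading.Barrier(num_threads)

    def worker(n):
        local.value = n          # stored on this thread's private copy
        all_assigned.wait()      # every thread has assigned before anyone reads
        seen[n] = local.value    # still this thread's own value

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

if __name__ == "__main__":
    print(run())
```

The calling thread never assigns local.value, so the attribute does not exist there at all.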
Atomic Operations
Some operations are atomic under the GIL and don’t need locks:
- list.append(x): atomic
- dict[key] = value: atomic
- x = shared_list.pop(): atomic
But compound operations (if key in dict: dict[key] += 1) are never atomic. When in doubt, use a lock.
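A read-modify-write on a shared dictionary is the typical case. The sketch below wraps the whole compound update in one lock so that no increment is lost; the record helper and the "hits" key are illustrative.

```python
import threading

counts = {}
counts_lock = threading.Lock()

def record(key):
    # Read, add, and write back as a single step from other threads' view.
    with counts_lock:
        counts[key] = counts.get(key, 0) + 1

def run(num_threads=8, per_thread=1000):
    counts.clear()

    def worker():
        for _ in range(per_thread):
            record("hits")

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts["hits"]

if __name__ == "__main__":
    print(run())
```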
Production Pitfalls
- Lock in __del__: Finalizers run in unpredictable threads. Acquiring locks in __del__ can deadlock.
- Lock during import: Module-level code runs under the import lock. Acquiring user locks there can deadlock with other threads importing.
- Daemon threads with locks: Daemon threads are killed abruptly at interpreter shutdown. If they hold locks, atexit handlers or other threads may hang.
- Over-locking: A single global lock (the “big lock” anti-pattern) serializes all work. Profile contention and use fine-grained locks:
# BAD: one lock for everything
global_lock = threading.Lock()

# GOOD: per-resource locks
class UserCache:
    def __init__(self):
        self._locks = {}  # user_id → Lock
        self._meta_lock = threading.Lock()

    def get_lock(self, user_id):
        with self._meta_lock:
            if user_id not in self._locks:
                self._locks[user_id] = threading.Lock()
            return self._locks[user_id]
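Fine-grained locks bring their own hazard: code that needs two of them at once can deadlock if different threads take them in different orders. A common remedy is to pick a global ordering and always acquire in that order. The sketch below orders by a hypothetical account_id and assumes the two accounts are distinct, since threading.Lock is not reentrant.

```python
import threading

class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock the account with the smaller id first, whatever the direction.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

def run(rounds=1000):
    a = Account(1, 1000)
    b = Account(2, 1000)

    def a_to_b():
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance

if __name__ == "__main__":
    print(run())
```

If transfer instead locked src first and dst second, the two threads could each hold one lock and wait forever for the other.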
One thing to remember: Locks are OS-level primitives in CPython — they’re efficient when uncontended but deadly when overused. Profile your lock wait times, minimize critical sections, prefer queue.Queue and threading.local when possible, and always acquire multiple locks in a consistent order to prevent deadlocks.