Python Shared Memory Multiprocessing — Deep Dive
OS-Level Shared Memory
Python’s SharedMemory maps to different OS primitives depending on the platform:
Linux: POSIX Shared Memory
// What Python does under the hood:
int fd = shm_open("/psm_abc123", O_CREAT | O_RDWR, 0600);
ftruncate(fd, size);
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
The shared memory object appears as a file in the tmpfs filesystem at /dev/shm/. It lives in RAM (backed by swap if needed) and persists until explicitly unlinked or the system reboots.
macOS: POSIX Shared Memory
Same API, but implemented differently. macOS limits shared memory segment names and sizes more aggressively. The default maximum is controlled by kern.sysv.shmmax sysctl.
Windows: Named File Mappings
HANDLE hMap = CreateFileMappingW(
INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, size, L"psm_abc123");
void *ptr = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size);
Windows shared memory is backed by the paging file and persists until all handles are closed.
Memory Layout for Structured Data
Raw shared memory is unstructured bytes. For complex data, you need to define a layout:
import struct
import numpy as np
from multiprocessing import shared_memory
class SharedDataFrame:
"""Fixed-schema shared data structure."""
HEADER = struct.Struct('QQQ') # n_rows, n_cols, version
def __init__(self, name=None, n_rows=0, n_cols=0, create=False):
header_size = self.HEADER.size
data_size = n_rows * n_cols * 8 # float64
if create:
total = header_size + data_size
self.shm = shared_memory.SharedMemory(
create=True, size=total, name=name
)
self.HEADER.pack_into(self.shm.buf, 0, n_rows, n_cols, 0)
else:
self.shm = shared_memory.SharedMemory(name=name)
n_rows, n_cols, _ = self.HEADER.unpack_from(self.shm.buf, 0)
self.data = np.ndarray(
(n_rows, n_cols),
dtype=np.float64,
buffer=self.shm.buf,
offset=header_size
)
@property
def version(self):
return self.HEADER.unpack_from(self.shm.buf, 0)[2]
def bump_version(self):
n_rows, n_cols, ver = self.HEADER.unpack_from(self.shm.buf, 0)
self.HEADER.pack_into(self.shm.buf, 0, n_rows, n_cols, ver + 1)
def close(self):
self.shm.close()
def unlink(self):
self.shm.unlink()
The version field enables optimistic concurrency: readers check the version before and after reading to detect concurrent writes.
Synchronization Patterns
Pattern 1: Lock-Protected Writes
from multiprocessing import Lock, Process, shared_memory
import numpy as np
def writer(shm_name, shape, lock, n_updates):
shm = shared_memory.SharedMemory(name=shm_name)
arr = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
for i in range(n_updates):
with lock:
arr += 0.001 # Atomic update under lock
shm.close()
lock = Lock()
shm = shared_memory.SharedMemory(create=True, size=80_000)
arr = np.ndarray((10_000,), dtype=np.float64, buffer=shm.buf)
arr[:] = 0.0
workers = [Process(target=writer, args=(shm.name, (10_000,), lock, 100))
for _ in range(4)]
for w in workers:
w.start()
for w in workers:
w.join()
print(f"Final sum: {arr.sum():.1f}") # Should be ~4000.0
shm.close()
shm.unlink()
Pattern 2: Reader-Writer with Double Buffering
For low-latency systems where readers shouldn’t block writers:
import numpy as np
from multiprocessing import shared_memory, Value
import ctypes
class DoubleBuffer:
"""Lock-free double buffering for one writer, multiple readers."""
def __init__(self, shape, dtype=np.float64, name_prefix="dbuf"):
nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
self.shm_a = shared_memory.SharedMemory(
create=True, size=nbytes, name=f"{name_prefix}_a"
)
self.shm_b = shared_memory.SharedMemory(
create=True, size=nbytes, name=f"{name_prefix}_b"
)
self.buf_a = np.ndarray(shape, dtype=dtype, buffer=self.shm_a.buf)
self.buf_b = np.ndarray(shape, dtype=dtype, buffer=self.shm_b.buf)
# Shared flag: 0 = readers use A, 1 = readers use B
self.active = Value(ctypes.c_int, 0)
def write(self, data):
"""Writer fills the inactive buffer, then swaps."""
if self.active.value == 0:
self.buf_b[:] = data
self.active.value = 1
else:
self.buf_a[:] = data
self.active.value = 0
def read(self):
"""Readers always read the active buffer."""
if self.active.value == 0:
return self.buf_a.copy()
return self.buf_b.copy()
def cleanup(self):
self.shm_a.close()
self.shm_b.close()
self.shm_a.unlink()
self.shm_b.unlink()
The writer never touches the buffer that readers are using. The atomic swap of active ensures readers always see a complete, consistent snapshot.
Performance: Shared Memory vs Pickle Transfer
Benchmark passing a 100 MB NumPy array between processes:
import numpy as np
import time
from multiprocessing import Process, Queue, shared_memory
data = np.random.randn(12_500_000).astype(np.float64) # ~100 MB
# Method 1: Queue (pickle)
def queue_consumer(q):
arr = q.get()
_ = arr.sum()
q = Queue()
start = time.perf_counter()
q.put(data)
p = Process(target=queue_consumer, args=(q,))
p.start()
p.join()
queue_time = time.perf_counter() - start
# Method 2: Shared Memory
def shm_consumer(name, shape, dtype):
shm = shared_memory.SharedMemory(name=name)
arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
_ = arr.sum()
shm.close()
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
start = time.perf_counter()
shared[:] = data
p = Process(target=shm_consumer, args=(shm.name, data.shape, data.dtype))
p.start()
p.join()
shm_time = time.perf_counter() - start
shm.close()
shm.unlink()
Typical results on a modern Linux system:
| Method | Transfer Time | Peak Extra Memory |
|---|---|---|
| Queue (pickle) | 1.2s | 200 MB (serialized + deserialized) |
| SharedMemory | 0.08s | 0 MB (shared pages) |
| Speedup | 15x | ∞ (no extra copy) |
The shared memory approach is 15x faster and uses no additional memory for the transfer.
Production Architecture: Shared Memory Data Server
A common pattern for data-intensive applications:
import signal
import numpy as np
from multiprocessing import shared_memory, Process, Event
class DataServer:
"""Central process that manages shared data."""
def __init__(self, config):
self.segments = {}
self.running = Event()
self.running.set()
def create_segment(self, name, shape, dtype=np.float64):
nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
shm = shared_memory.SharedMemory(create=True, size=nbytes, name=name)
arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
self.segments[name] = (shm, arr, shape, dtype)
return name
def update_segment(self, name, data):
shm, arr, shape, dtype = self.segments[name]
arr[:] = data
def cleanup(self):
for name, (shm, arr, shape, dtype) in self.segments.items():
shm.close()
shm.unlink()
def analytics_worker(segment_name, shape, dtype, stop_event):
"""Worker that reads shared data for analytics."""
shm = shared_memory.SharedMemory(name=segment_name)
arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
while not stop_event.is_set():
# Read shared data — no copy
stats = {
'mean': float(arr.mean()),
'std': float(arr.std()),
'max': float(arr.max()),
}
publish_metrics(stats)
stop_event.wait(1.0)
shm.close()
This architecture is used in real-time analytics systems, financial market data distribution, and scientific instrument data processing.
Gotchas and Edge Cases
Resource Leak on Crash
If a process crashes without calling unlink(), the shared memory block persists. On Linux, you can clean up manually:
import os
import glob
# Find and clean orphaned shared memory
for path in glob.glob("/dev/shm/psm_*"):
print(f"Removing orphan: {path}")
os.unlink(path)
In production, use atexit handlers and signal handlers:
import atexit
import signal
def cleanup():
for shm in all_shared_segments:
try:
shm.close()
shm.unlink()
except FileNotFoundError:
pass
atexit.register(cleanup)
signal.signal(signal.SIGTERM, lambda *_: (cleanup(), exit(0)))
Fork vs Spawn
With fork (Linux default), child processes inherit the parent’s memory mappings. With spawn (Windows default, macOS default in 3.8+), children start fresh and must explicitly attach to shared memory by name.
Always pass the shared memory name explicitly rather than relying on fork inheritance — it’s portable and avoids subtle bugs when switching start methods.
Size Limitations
Shared memory size is limited by available RAM + swap. On Linux, /dev/shm is typically mounted as a tmpfs with a default size of 50% of RAM. Docker containers often have a much smaller /dev/shm (64 MB by default) — use --shm-size to increase it:
docker run --shm-size=2g myimage
The one thing to remember: Shared memory eliminates the serialization bottleneck in multiprocessing by giving all processes direct access to the same physical memory pages — but it requires explicit lifecycle management (create/close/unlink), synchronization for concurrent writes, and awareness of platform-specific limits like Docker’s /dev/shm default size.
See Also
- Python Algorithmic Complexity Understand Algorithmic Complexity through a practical analogy so your Python decisions become faster and clearer.
- Python Async Performance Tuning Making your async Python faster is like organizing a busy restaurant kitchen — it's all about flow.
- Python Benchmark Methodology Why timing Python code once means nothing, and how fair testing works like a science experiment.
- Python C Extension Performance How Python borrows C's speed for the hard parts — like hiring a specialist for the toughest job on the worksite.
- Python Caching Strategies Understand Python caching strategies with a shortcut-road analogy so your app gets faster without taking wrong turns.