Python Shared Memory Multiprocessing — Deep Dive

POSIX shared memory internals, lock-free coordination patterns, NumPy zero-copy workflows, and production architectures for shared-memory parallel Python.

OS-Level Shared Memory

Python’s SharedMemory maps to different OS primitives depending on the platform:

Linux: POSIX Shared Memory

// Roughly what Python does under the hood when create=True:
int fd = shm_open("/psm_abc123", O_CREAT | O_EXCL | O_RDWR, 0600);
ftruncate(fd, size);
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

The shared memory object appears as a file in the tmpfs filesystem at /dev/shm/. It lives in RAM (backed by swap if needed) and persists until explicitly unlinked or the system reboots.
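One way to observe this from Python is to create a segment and look for its backing file. The sketch below assumes a Linux host with /dev/shm mounted; `shm_backing_path` is a hypothetical helper, not part of the standard library, and on other platforms the check is skipped.

```python
import os
import sys
from multiprocessing import shared_memory


def shm_backing_path(shm):
    """Return the tmpfs path of a segment (assumption: Linux with /dev/shm)."""
    return os.path.join("/dev/shm", shm.name.lstrip("/"))


on_linux = sys.platform == "linux"

shm = shared_memory.SharedMemory(create=True, size=4096)
path = shm_backing_path(shm)

# While the segment is linked, it shows up as a regular file in tmpfs.
visible_while_linked = os.path.exists(path) if on_linux else None

shm.close()
shm.unlink()

# After unlink() the name is gone, even though existing mappings stay valid.
visible_after_unlink = os.path.exists(path) if on_linux else None

print(path, visible_while_linked, visible_after_unlink)
```

Listing /dev/shm this way is also a convenient check for leaked segments during development.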

macOS: POSIX Shared Memory

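Same API, but implemented differently. macOS limits segment names more aggressively (about 31 characters, versus 255 on Linux), and there is no /dev/shm directory to inspect. Note that the kern.sysv.shmmax sysctl governs System V segments (shmget), not POSIX shm_open objects, so raising it does not change what SharedMemory can allocate.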

Windows: Named File Mappings

HANDLE hMap = CreateFileMappingW(
    INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, size, L"wnsm_abc123");
void *ptr = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size);

Windows shared memory is backed by the paging file and persists until all handles are closed.

Memory Layout for Structured Data

Raw shared memory is unstructured bytes. For complex data, you need to define a layout:

import struct
import numpy as np
from multiprocessing import shared_memory

class SharedDataFrame:
    """Fixed-schema shared data structure."""

    HEADER = struct.Struct('QQQ')  # n_rows, n_cols, version

    def __init__(self, name=None, n_rows=0, n_cols=0, create=False):
        header_size = self.HEADER.size
        data_size = n_rows * n_cols * 8  # float64

        if create:
            total = header_size + data_size
            self.shm = shared_memory.SharedMemory(
                create=True, size=total, name=name
            )
            self.HEADER.pack_into(self.shm.buf, 0, n_rows, n_cols, 0)
        else:
            self.shm = shared_memory.SharedMemory(name=name)
            n_rows, n_cols, _ = self.HEADER.unpack_from(self.shm.buf, 0)

        self.data = np.ndarray(
            (n_rows, n_cols),
            dtype=np.float64,
            buffer=self.shm.buf,
            offset=header_size
        )

    @property
    def version(self):
        return self.HEADER.unpack_from(self.shm.buf, 0)[2]

    def bump_version(self):
        n_rows, n_cols, ver = self.HEADER.unpack_from(self.shm.buf, 0)
        self.HEADER.pack_into(self.shm.buf, 0, n_rows, n_cols, ver + 1)

    def close(self):
        self.shm.close()

    def unlink(self):
        self.shm.unlink()

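The version field enables optimistic concurrency: a reader records the version, copies the data, and re-reads the version; if the two values differ, a write overlapped the read and the reader retries. To catch a write that is still in progress, the writer should bump the version both before and after modifying the data, so that an odd value means "write in progress" (the seqlock convention). Calling bump_version() once per update only detects writes that have already completed.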
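A minimal sketch of that read protocol follows. It assumes a simplified layout (one 8-byte counter at offset 0 followed by float64 data) rather than the three-field header above, a single writer, and hypothetical helper names (`write_values`, `read_consistent`). Python offers no memory-ordering guarantees, so treat this as an illustration of the idea rather than a hardened implementation.

```python
import struct
import numpy as np
from multiprocessing import shared_memory

# Assumption: one counter at offset 0; odd values mean "write in progress".
VERSION = struct.Struct("Q")


def _bump(buf):
    (ver,) = VERSION.unpack_from(buf, 0)
    VERSION.pack_into(buf, 0, ver + 1)


def write_values(buf, arr, values):
    """Single writer: mark the update as in progress, write, mark it done."""
    _bump(buf)  # counter becomes odd
    arr[:] = values
    _bump(buf)  # counter becomes even again


def read_consistent(buf, arr, max_retries=100):
    """Return a copy of arr that no write overlapped, or raise."""
    for _ in range(max_retries):
        (before,) = VERSION.unpack_from(buf, 0)
        if before % 2:  # a write is in progress; try again
            continue
        snapshot = arr.copy()
        (after,) = VERSION.unpack_from(buf, 0)
        if before == after:
            return snapshot
    raise RuntimeError("could not obtain a consistent snapshot")


shm = shared_memory.SharedMemory(create=True, size=VERSION.size + 4 * 8)
VERSION.pack_into(shm.buf, 0, 0)
arr = np.ndarray((4,), dtype=np.float64, buffer=shm.buf, offset=VERSION.size)

write_values(shm.buf, arr, 1.5)
snapshot = read_consistent(shm.buf, arr)
final_version = VERSION.unpack_from(shm.buf, 0)[0]

# Simulate a reader arriving mid-write: the counter is odd, so it gives up.
_bump(shm.buf)
try:
    read_consistent(shm.buf, arr, max_retries=3)
    saw_in_progress = False
except RuntimeError:
    saw_in_progress = True
_bump(shm.buf)

shm.close()
shm.unlink()
```

In a real reader loop, a short sleep between retries keeps a busy writer from starving the reader of CPU time.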

Synchronization Patterns

Pattern 1: Lock-Protected Writes

from multiprocessing import Lock, Process, shared_memory
import numpy as np

def writer(shm_name, shape, lock, n_updates):
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)

    for i in range(n_updates):
        with lock:
            arr += 0.001  # Atomic update under lock

    shm.close()

if __name__ == "__main__":  # Required under spawn (Windows, macOS)
    lock = Lock()
    shm = shared_memory.SharedMemory(create=True, size=80_000)
    arr = np.ndarray((10_000,), dtype=np.float64, buffer=shm.buf)
    arr[:] = 0.0

    workers = [Process(target=writer, args=(shm.name, (10_000,), lock, 100))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # 10,000 elements x 4 workers x 100 updates x 0.001
    print(f"Final sum: {arr.sum():.1f}")  # Should be ~4000.0
    shm.close()
    shm.unlink()

Pattern 2: Reader-Writer with Double Buffering

For low-latency systems where readers shouldn’t block writers:

import numpy as np
from multiprocessing import shared_memory, Value
import ctypes

class DoubleBuffer:
    """Lock-free double buffering for one writer, multiple readers."""

    def __init__(self, shape, dtype=np.float64, name_prefix="dbuf"):
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize

        self.shm_a = shared_memory.SharedMemory(
            create=True, size=nbytes, name=f"{name_prefix}_a"
        )
        self.shm_b = shared_memory.SharedMemory(
            create=True, size=nbytes, name=f"{name_prefix}_b"
        )

        self.buf_a = np.ndarray(shape, dtype=dtype, buffer=self.shm_a.buf)
        self.buf_b = np.ndarray(shape, dtype=dtype, buffer=self.shm_b.buf)

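        # Shared flag: 0 = readers use A, 1 = readers use B.
        # lock=False returns a raw ctypes int; the default Value wraps
        # every access in a lock, which would make this not lock-free.
        self.active = Value(ctypes.c_int, 0, lock=False)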

    def write(self, data):
        """Writer fills the inactive buffer, then swaps."""
        if self.active.value == 0:
            self.buf_b[:] = data
            self.active.value = 1
        else:
            self.buf_a[:] = data
            self.active.value = 0

    def read(self):
        """Readers always read the active buffer."""
        if self.active.value == 0:
            return self.buf_a.copy()
        return self.buf_b.copy()

    def cleanup(self):
        self.shm_a.close()
        self.shm_b.close()
        self.shm_a.unlink()
        self.shm_b.unlink()

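The writer only ever fills the inactive buffer and then flips active, so a reader that starts after the flip sees a complete snapshot. The guarantee is not unconditional: a reader that is still copying when the writer begins its next write, which targets the buffer that reader is using, can observe torn data. The pattern therefore assumes a single writer and reads that finish within one write interval; if that cannot be guaranteed, add a version check or a third buffer. The flip is a single integer store, so readers see either the old or the new index, never a mix.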
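The class above always creates its segments and keeps the flag in a Value, so readers can only reach it through inheritance. A variant that readers can attach to by name is sketched below. It assumes a single segment laid out as [active flag: int64][buffer A][buffer B]; `NamedDoubleBuffer` and the segment name are hypothetical, and the same single-writer and timing caveats apply.

```python
import os
import numpy as np
from multiprocessing import shared_memory


class NamedDoubleBuffer:
    """Double buffer in one named segment: [flag: int64][buf A][buf B]."""

    FLAG_BYTES = 8

    def __init__(self, name, shape, dtype=np.float64, create=False):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)
        nbytes = int(np.prod(self.shape)) * self.dtype.itemsize
        if create:
            self.shm = shared_memory.SharedMemory(
                create=True, size=self.FLAG_BYTES + 2 * nbytes, name=name
            )
        else:
            self.shm = shared_memory.SharedMemory(name=name)
        # Index of the buffer that readers should use (0 or 1).
        self.active = np.ndarray((1,), dtype=np.int64, buffer=self.shm.buf)
        self.bufs = [
            np.ndarray(self.shape, dtype=self.dtype, buffer=self.shm.buf,
                       offset=self.FLAG_BYTES + i * nbytes)
            for i in range(2)
        ]
        if create:
            self.active[0] = 0

    def write(self, data):
        """Single writer: fill the inactive buffer, then flip the flag."""
        inactive = 1 - int(self.active[0])
        self.bufs[inactive][:] = data
        self.active[0] = inactive

    def read(self):
        """Copy out whichever buffer is currently active."""
        return self.bufs[int(self.active[0])].copy()

    def close(self):
        self.shm.close()


# Both handles live in one process here; in practice the reader would be
# constructed in a different process using the same name and shape.
name = f"dbuf_demo_{os.getpid()}"
writer = NamedDoubleBuffer(name, (4,), create=True)
reader = NamedDoubleBuffer(name, (4,))

writer.write(np.full(4, 1.0))
first = reader.read()
writer.write(np.full(4, 2.0))
second = reader.read()

reader.close()
writer.close()
writer.shm.unlink()
```

Keeping the flag inside the segment means a reader needs only the name, shape, and dtype to attach, which works under any start method.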

Performance: Shared Memory vs Pickle Transfer

Benchmark passing a 100 MB NumPy array between processes:

import numpy as np
import time
from multiprocessing import Process, Queue, shared_memory

# Method 1: Queue (pickle)
def queue_consumer(q):
    arr = q.get()
    _ = arr.sum()

# Method 2: Shared Memory
def shm_consumer(name, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    _ = arr.sum()
    shm.close()

if __name__ == "__main__":  # Required under spawn (Windows, macOS)
    data = np.random.randn(12_500_000).astype(np.float64)  # 100 MB

    q = Queue()
    start = time.perf_counter()
    q.put(data)
    p = Process(target=queue_consumer, args=(q,))
    p.start()
    p.join()
    queue_time = time.perf_counter() - start

    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)

    start = time.perf_counter()
    shared[:] = data  # One copy into the segment, included in the timing
    p = Process(target=shm_consumer, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()
    shm_time = time.perf_counter() - start

    shm.close()
    shm.unlink()

    # Both timings include process startup
    print(f"Queue: {queue_time:.2f}s, SharedMemory: {shm_time:.2f}s")

Typical results on a modern Linux system:

Method	Transfer Time	Peak Extra Memory
Queue (pickle)	1.2s	200 MB (serialized + deserialized)
SharedMemory	0.08s	0 MB (shared pages)
Speedup	15x	∞ (no extra copy)

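In these results, the shared memory approach is about 15x faster and allocates no memory beyond the shared segment itself. Both timings include process startup, so treat the ratio as indicative rather than exact.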

Production Architecture: Shared Memory Data Server

A common pattern for data-intensive applications:

import numpy as np
from multiprocessing import shared_memory, Process, Event

def publish_metrics(stats):
    """Placeholder sink (hypothetical); swap in a real metrics client."""
    print(stats)

class DataServer:
    """Central process that manages shared data."""

    def __init__(self, config):
        self.config = config
        self.segments = {}
        self.running = Event()
        self.running.set()

    def create_segment(self, name, shape, dtype=np.float64):
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        shm = shared_memory.SharedMemory(create=True, size=nbytes, name=name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        self.segments[name] = (shm, arr, shape, dtype)
        return name

    def update_segment(self, name, data):
        shm, arr, shape, dtype = self.segments[name]
        arr[:] = data

    def cleanup(self):
        for name, (shm, arr, shape, dtype) in self.segments.items():
            shm.close()
            shm.unlink()


def analytics_worker(segment_name, shape, dtype, stop_event):
    """Worker that reads shared data for analytics."""
    shm = shared_memory.SharedMemory(name=segment_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

    while not stop_event.is_set():
        # Read shared data — no copy
        stats = {
            'mean': float(arr.mean()),
            'std': float(arr.std()),
            'max': float(arr.max()),
        }
        publish_metrics(stats)
        stop_event.wait(1.0)

    shm.close()

This architecture is used in real-time analytics systems, financial market data distribution, and scientific instrument data processing.

Gotchas and Edge Cases

Resource Leak on Crash

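If the creating process dies before calling unlink() (for example, after SIGKILL or an out-of-memory kill), the shared memory block can persist. Python’s resource tracker removes leaked segments in many cases, but not all. On Linux, you can clean up manually: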

import os
import glob

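# Find and clean orphaned shared memory.
# Caution: this matches every auto-named segment, including ones that
# other running Python processes still use.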
for path in glob.glob("/dev/shm/psm_*"):
    print(f"Removing orphan: {path}")
    os.unlink(path)

In production, use atexit handlers and signal handlers:

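import atexit
import signal
import sys

all_shared_segments = []  # Append every SharedMemory this process creates

def cleanup():
    for shm in all_shared_segments:
        try:
            shm.close()
            shm.unlink()
        except FileNotFoundError:
            pass

atexit.register(cleanup)
# sys.exit() raises SystemExit, which lets the atexit handler run cleanup()
signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))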
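For segments whose lifetime matches a block of code, a context manager keeps creation and cleanup together. This is a minimal sketch; `owned_segment` is a hypothetical helper, and it covers normal exits and exceptions but not hard kills such as SIGKILL.

```python
from contextlib import contextmanager
from multiprocessing import shared_memory


@contextmanager
def owned_segment(size, name=None):
    """Create a segment and guarantee close() and unlink() on exit."""
    shm = shared_memory.SharedMemory(create=True, size=size, name=name)
    try:
        yield shm
    finally:
        shm.close()
        try:
            shm.unlink()
        except FileNotFoundError:
            pass  # Someone else already removed it


with owned_segment(1024) as shm:
    shm.buf[0] = 7
    seg_name = shm.name
    first_byte = shm.buf[0]

# Outside the block the name no longer exists, so attaching would fail.
print(seg_name, first_byte)
```

Only the process that owns the segment should use this; workers that attach by name should call close() and leave unlink() to the owner.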

Fork vs Spawn

With fork (the Linux default through Python 3.13), child processes inherit the parent’s memory mappings. With spawn (the default on Windows, and on macOS since 3.8) or forkserver (the Linux default from 3.14), children start fresh and must explicitly attach to shared memory by name.

Always pass the shared memory name explicitly rather than relying on fork inheritance — it’s portable and avoids subtle bugs when switching start methods.
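One way to make this habitual is to bundle everything a child needs into a small picklable handle. The sketch below assumes hypothetical names (`ArrayRef`, `share_array`, `attach`) and simulates the hand-off with a pickle round trip in a single process, since that is what passing arguments to a spawned child amounts to.

```python
import pickle
from dataclasses import dataclass

import numpy as np
from multiprocessing import shared_memory


@dataclass(frozen=True)
class ArrayRef:
    """Picklable description of a shared array (hypothetical helper)."""
    name: str
    shape: tuple
    dtype: str


def share_array(arr):
    """Copy arr into a new segment; return the owner handle and its ref."""
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr
    return shm, ArrayRef(shm.name, tuple(arr.shape), arr.dtype.str)


def attach(ref):
    """Attach by name, as a child process would under any start method."""
    shm = shared_memory.SharedMemory(name=ref.name)
    view = np.ndarray(ref.shape, dtype=np.dtype(ref.dtype), buffer=shm.buf)
    return shm, view


source = np.arange(6, dtype=np.float64).reshape(2, 3)
owner_shm, ref = share_array(source)

# Arguments to a spawned Process are pickled; simulate that hand-off here.
restored = pickle.loads(pickle.dumps(ref))
child_shm, child_view = attach(restored)
seen = child_view.copy()

child_shm.close()
owner_shm.close()
owner_shm.unlink()
```

Because the handle carries only a name, shape, and dtype string, it can also travel through a Queue or a config file.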

Size Limitations

Shared memory size is limited by available RAM + swap. On Linux, /dev/shm is typically mounted as a tmpfs with a default size of 50% of RAM. Docker containers often have a much smaller /dev/shm (64 MB by default) — use --shm-size to increase it:

docker run --shm-size=2g myimage
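Checking free space before allocating can turn a confusing failure into a clear one. On Linux, exceeding the tmpfs limit typically shows up as a bus error on first write rather than as a Python exception. The sketch below assumes /dev/shm is the backing mount; `shm_free_bytes`, `fits_in_shm`, and the 10% headroom are hypothetical choices.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Free bytes on the shared-memory tmpfs, or None if it is not present."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def fits_in_shm(nbytes, path="/dev/shm", headroom=0.10):
    """True if nbytes fits while leaving a fraction of free space unused."""
    free = shm_free_bytes(path)
    if free is None:
        return True  # Cannot tell on this platform; let the allocation decide
    return nbytes <= free * (1.0 - headroom)


free_now = shm_free_bytes()
wanted = 100 * 1024 * 1024  # 100 MB
print(f"free: {free_now}, 100 MB fits: {fits_in_shm(wanted)}")
```

Inside a default Docker container this check reports roughly 64 MB free, which makes the need for --shm-size obvious before any allocation is attempted.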

The one thing to remember: Shared memory eliminates the serialization bottleneck in multiprocessing by giving all processes direct access to the same physical memory pages — but it requires explicit lifecycle management (create/close/unlink), synchronization for concurrent writes, and awareness of platform-specific limits like Docker’s /dev/shm default size.
