Python Multiprocessing Pool — Deep Dive

Pool internals, chunk sizing strategies, fork vs spawn, error handling, memory optimization, and when to use ProcessPoolExecutor instead.

Pool Architecture

A multiprocessing.Pool consists of:

Worker processes — spawned at pool creation, each running an internal event loop that pulls tasks from a shared input queue.
Task queue (SimpleQueue) — where the parent process puts (func, args) tuples.
Result queue (SimpleQueue) — where workers put (task_id, result_or_exception) tuples.
Task handler thread — in the parent, takes tasks from the user’s call and puts them on the task queue.
Result handler thread — in the parent, reads the result queue and maps results back to AsyncResult objects.
Pool manager thread — monitors workers and restarts any that die unexpectedly.

This means the parent process runs three background threads to manage the pool. Understanding this helps debug issues where the parent’s main thread is blocked but pool operations still proceed.

Fork vs Spawn vs Forkserver

The start method determines how worker processes are created:

import multiprocessing as mp
mp.set_start_method("spawn")  # must be called before creating Pool

Method	How it works	Platform	Gotchas
`fork`	Copy-on-write clone of parent	Unix	Unsafe with threads, locks can deadlock
`spawn`	Fresh interpreter, imports module	All	Slower startup, requires picklable args
`forkserver`	Fork from a clean server process	Unix	Compromise: clean + fast-ish

macOS warning: Since Python 3.8, spawn is the default on macOS because fork can cause crashes with macOS frameworks (Cocoa, Core Foundation). On Linux, fork remains the default.

Critical rule: If your parent process uses threads (logging, async I/O, database connections), use spawn or forkserver. Forking a multi-threaded process copies only the calling thread — other threads’ locks may be in an acquired state, causing deadlocks.

Chunk Size Optimization

pool.map(func, iterable, chunksize=1) controls how many items are sent to each worker at once:

# Default: chunksize=1 (calculated internally as len(iterable) // (4 * pool_size))
pool.map(process, items, chunksize=100)

The tradeoff:

Small chunks (1-10): Better load balancing when tasks have variable duration, but more IPC overhead.
Large chunks (100-1000): Less IPC overhead, but workers with fast chunks sit idle while slow chunks finish.

A practical formula for uniform-duration tasks:

chunksize = max(1, len(items) // (pool_size * 4))

For variable-duration tasks, use imap_unordered with small chunks:

with Pool(4) as pool:
    for result in pool.imap_unordered(process, items, chunksize=10):
        handle(result)

Error Handling

Exceptions in Workers

If a worker function raises, the exception is pickled and re-raised in the parent:

def risky(n):
    if n == 42:
        raise ValueError("bad number")
    return n * 2

with Pool(4) as pool:
    try:
        results = pool.map(risky, range(100))
    except ValueError as e:
        print(f"Worker error: {e}")

However, only the first exception is raised with map. Other workers may have also failed silently. For per-item error handling, use apply_async:

with Pool(4) as pool:
    futures = [pool.apply_async(risky, (n,)) for n in range(100)]
    for i, future in enumerate(futures):
        try:
            result = future.get(timeout=30)
        except ValueError:
            print(f"Item {i} failed")

Worker Crashes

If a worker process segfaults or is killed by the OS (OOM killer), the pool replaces it automatically. The task that was running is lost — its AsyncResult.get() raises multiprocessing.context.TimeoutError or the pool marks it as failed.

Use maxtasksperchild to periodically recycle workers (useful for tasks that leak memory):

with Pool(4, maxtasksperchild=100) as pool:
    # Each worker restarts after processing 100 tasks
    results = pool.map(leaky_function, big_dataset)

Memory Optimization

Shared Memory (Python 3.8+)

For large read-only data (arrays, lookup tables), use multiprocessing.shared_memory instead of copying to each worker:

from multiprocessing import shared_memory
import numpy as np

# Parent creates shared array
data = np.random.rand(10_000_000)
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
shared_array = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
shared_array[:] = data

def worker(shm_name, shape, dtype):
    existing_shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=existing_shm.buf)
    # Read arr without copying
    result = arr.sum()
    existing_shm.close()
    return result

Global Variables with fork

On Unix with fork, global variables in the parent are available in workers via copy-on-write. This is efficient for read-only data:

LOOKUP_TABLE = load_big_table()  # loaded once in parent

def process(item):
    return LOOKUP_TABLE[item]  # reads copy-on-write memory

with Pool(4) as pool:
    results = pool.map(process, items)

Copy-on-write breaks if the worker modifies the data — the OS then allocates a full copy for that process.

Pool vs ProcessPoolExecutor

concurrent.futures.ProcessPoolExecutor is the newer, simpler API:

from concurrent.futures import ProcessPoolExecutor, as_completed

with ProcessPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(process, item): item for item in items}
    for future in as_completed(futures):
        item = futures[future]
        try:
            result = future.result()
        except Exception as e:
            print(f"{item} failed: {e}")

Key differences:

Feature	Pool	ProcessPoolExecutor
`map` variants	map, imap, imap_unordered, starmap	map (ordered only)
Individual tasks	apply_async	submit (returns Future)
Callbacks	AsyncResult.callback	Future.add_done_callback
maxtasksperchild	Yes	Yes (3.11+)
Initializer	Yes	Yes
Integration	Standalone	Works with asyncio via `run_in_executor`

When to choose ProcessPoolExecutor: Integration with asyncio, simpler future-based API, as_completed() for earliest-result processing.

When to choose Pool: You need imap_unordered, starmap, or fine-grained chunksize control.

Profiling Pool Performance

Measure whether multiprocessing actually helps:

import time
from multiprocessing import Pool

def benchmark(pool_size, data):
    start = time.perf_counter()
    with Pool(pool_size) as pool:
        pool.map(process, data)
    return time.perf_counter() - start

# Compare sequential vs parallel
sequential = benchmark(1, data)
parallel = benchmark(4, data)
speedup = sequential / parallel
print(f"Speedup: {speedup:.1f}x")  # ideal: 4.0x, realistic: 2-3.5x

If speedup is less than 1.5x, the overhead of process creation and IPC outweighs the parallelism benefit. Consider batching more work per task or using threads instead.

One thing to remember: multiprocessing.Pool shines for embarrassingly parallel CPU work — choose the right start method for your platform, size your chunks based on task duration variance, and monitor memory to avoid the OOM killer eating your workers.

pythonconcurrencymultiprocessing