Python Multiprocessing Pool — Deep Dive

Pool Architecture

A multiprocessing.Pool consists of:

  1. Worker processes — spawned at pool creation, each running an internal event loop that pulls tasks from a shared input queue.
  2. Task queue (SimpleQueue) — where the parent process puts (func, args) tuples.
  3. Result queue (SimpleQueue) — where workers put (task_id, result_or_exception) tuples.
  4. Task handler thread — in the parent, takes tasks from the user’s call and puts them on the task queue.
  5. Result handler thread — in the parent, reads the result queue and maps results back to AsyncResult objects.
  6. Pool manager thread — monitors workers and restarts any that die unexpectedly.

This means the parent process runs three background threads to manage the pool. Understanding this helps debug issues where the parent’s main thread is blocked but pool operations still proceed.

Fork vs Spawn vs Forkserver

The start method determines how worker processes are created:

import multiprocessing as mp
mp.set_start_method("spawn")  # must be called before creating Pool
MethodHow it worksPlatformGotchas
forkCopy-on-write clone of parentUnixUnsafe with threads, locks can deadlock
spawnFresh interpreter, imports moduleAllSlower startup, requires picklable args
forkserverFork from a clean server processUnixCompromise: clean + fast-ish

macOS warning: Since Python 3.8, spawn is the default on macOS because fork can cause crashes with macOS frameworks (Cocoa, Core Foundation). On Linux, fork remains the default.

Critical rule: If your parent process uses threads (logging, async I/O, database connections), use spawn or forkserver. Forking a multi-threaded process copies only the calling thread — other threads’ locks may be in an acquired state, causing deadlocks.

Chunk Size Optimization

pool.map(func, iterable, chunksize=1) controls how many items are sent to each worker at once:

# Default: chunksize=1 (calculated internally as len(iterable) // (4 * pool_size))
pool.map(process, items, chunksize=100)

The tradeoff:

  • Small chunks (1-10): Better load balancing when tasks have variable duration, but more IPC overhead.
  • Large chunks (100-1000): Less IPC overhead, but workers with fast chunks sit idle while slow chunks finish.

A practical formula for uniform-duration tasks:

chunksize = max(1, len(items) // (pool_size * 4))

For variable-duration tasks, use imap_unordered with small chunks:

with Pool(4) as pool:
    for result in pool.imap_unordered(process, items, chunksize=10):
        handle(result)

Error Handling

Exceptions in Workers

If a worker function raises, the exception is pickled and re-raised in the parent:

def risky(n):
    if n == 42:
        raise ValueError("bad number")
    return n * 2

with Pool(4) as pool:
    try:
        results = pool.map(risky, range(100))
    except ValueError as e:
        print(f"Worker error: {e}")

However, only the first exception is raised with map. Other workers may have also failed silently. For per-item error handling, use apply_async:

with Pool(4) as pool:
    futures = [pool.apply_async(risky, (n,)) for n in range(100)]
    for i, future in enumerate(futures):
        try:
            result = future.get(timeout=30)
        except ValueError:
            print(f"Item {i} failed")

Worker Crashes

If a worker process segfaults or is killed by the OS (OOM killer), the pool replaces it automatically. The task that was running is lost — its AsyncResult.get() raises multiprocessing.context.TimeoutError or the pool marks it as failed.

Use maxtasksperchild to periodically recycle workers (useful for tasks that leak memory):

with Pool(4, maxtasksperchild=100) as pool:
    # Each worker restarts after processing 100 tasks
    results = pool.map(leaky_function, big_dataset)

Memory Optimization

Shared Memory (Python 3.8+)

For large read-only data (arrays, lookup tables), use multiprocessing.shared_memory instead of copying to each worker:

from multiprocessing import shared_memory
import numpy as np

# Parent creates shared array
data = np.random.rand(10_000_000)
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
shared_array = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
shared_array[:] = data

def worker(shm_name, shape, dtype):
    existing_shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=existing_shm.buf)
    # Read arr without copying
    result = arr.sum()
    existing_shm.close()
    return result

Global Variables with fork

On Unix with fork, global variables in the parent are available in workers via copy-on-write. This is efficient for read-only data:

LOOKUP_TABLE = load_big_table()  # loaded once in parent

def process(item):
    return LOOKUP_TABLE[item]  # reads copy-on-write memory

with Pool(4) as pool:
    results = pool.map(process, items)

Copy-on-write breaks if the worker modifies the data — the OS then allocates a full copy for that process.

Pool vs ProcessPoolExecutor

concurrent.futures.ProcessPoolExecutor is the newer, simpler API:

from concurrent.futures import ProcessPoolExecutor, as_completed

with ProcessPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(process, item): item for item in items}
    for future in as_completed(futures):
        item = futures[future]
        try:
            result = future.result()
        except Exception as e:
            print(f"{item} failed: {e}")

Key differences:

FeaturePoolProcessPoolExecutor
map variantsmap, imap, imap_unordered, starmapmap (ordered only)
Individual tasksapply_asyncsubmit (returns Future)
CallbacksAsyncResult.callbackFuture.add_done_callback
maxtasksperchildYesYes (3.11+)
InitializerYesYes
IntegrationStandaloneWorks with asyncio via run_in_executor

When to choose ProcessPoolExecutor: Integration with asyncio, simpler future-based API, as_completed() for earliest-result processing.

When to choose Pool: You need imap_unordered, starmap, or fine-grained chunksize control.

Profiling Pool Performance

Measure whether multiprocessing actually helps:

import time
from multiprocessing import Pool

def benchmark(pool_size, data):
    start = time.perf_counter()
    with Pool(pool_size) as pool:
        pool.map(process, data)
    return time.perf_counter() - start

# Compare sequential vs parallel
sequential = benchmark(1, data)
parallel = benchmark(4, data)
speedup = sequential / parallel
print(f"Speedup: {speedup:.1f}x")  # ideal: 4.0x, realistic: 2-3.5x

If speedup is less than 1.5x, the overhead of process creation and IPC outweighs the parallelism benefit. Consider batching more work per task or using threads instead.

One thing to remember: multiprocessing.Pool shines for embarrassingly parallel CPU work — choose the right start method for your platform, size your chunks based on task duration variance, and monitor memory to avoid the OOM killer eating your workers.

pythonconcurrencymultiprocessing

See Also

  • Python Actor Model Why treating each piece of your program like a person with their own mailbox makes concurrency way less scary.
  • Python Aiocache Caching aiocache remembers expensive answers so your async Python app doesn't waste time asking the same question twice.
  • Python Aiofiles Async Io aiofiles lets your async Python program read and write files without freezing — because normal file operations secretly block everything.
  • Python Aiohttp Understand Aiohttp through an everyday analogy so Python behavior feels intuitive, not random.
  • Python Anyio Portability AnyIO lets your async Python code work with any async library — write once, run on asyncio or Trio without changes.