Python Multiprocessing — Deep Dive

Multiprocessing is Python’s standard route to true CPU parallelism in CPython. It works by distributing work across isolated processes, each with its own interpreter and GIL. The engineering challenge is no longer “can it run in parallel?” but “can I keep IPC and orchestration overhead below compute gains?”

Cost Model First

Before APIs, understand cost components:

  • process startup cost
  • task dispatch overhead
  • serialization (pickle) cost
  • inter-process data transfer
  • result aggregation
  • memory footprint per worker

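Speedup appears when per-task compute time dominates those costs. If each task takes only about 200 microseconds, dispatch and pickling overhead is often comparable to the work itself, so multiprocessing usually loses unless tasks are batched.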
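To make the trade-off concrete, here is a deliberately crude back-of-envelope model. It assumes compute divides evenly across workers while per-task overhead (dispatch plus pickling) is paid one task at a time in the parent. The function name and the overhead figures are illustrative assumptions, not measurements.

```python
def estimated_speedup(n_tasks, compute_s, overhead_s, workers, startup_s=0.0):
    """Rough serial/parallel wall-time ratio under a simplified cost model.

    Assumes compute divides evenly across workers while per-task overhead
    (dispatch + pickling) is paid one task at a time in the parent.
    """
    serial = n_tasks * compute_s
    parallel = startup_s + n_tasks * (compute_s / workers + overhead_s)
    return serial / parallel


if __name__ == "__main__":
    # 200-microsecond tasks with an assumed 200 microseconds of overhead each
    print(round(estimated_speedup(10_000, 200e-6, 200e-6, workers=4), 2))
    # 50-millisecond tasks with the same assumed overhead
    print(round(estimated_speedup(10_000, 50e-3, 200e-6, workers=4), 2))
```

Under these assumptions the first case comes out below 1 (a slowdown) and the second lands close to the worker count. Real overheads should be measured rather than guessed.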

Start Methods: spawn, fork, forkserver

spawn

  • starts a fresh interpreter for each child
  • re-imports the main module in the child
  • safest cross-platform behavior (the default on Windows and macOS)
  • requires the guard: if __name__ == "__main__":

fork (Unix)

  • copies parent process state via copy-on-write
  • fast startup
  • can inherit unsafe runtime state (locks held by threads that do not exist in the child, open sockets and file descriptors)

forkserver

  • avoids some fork hazards by forking from a clean server process

For production portability and fewer heisenbugs, designing for spawn constraints is usually the best default.
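A minimal sketch of designing for spawn: request the spawn context explicitly and keep all process creation behind the main guard. The names square and main are illustrative only.

```python
import multiprocessing as mp


def square(x):
    # Module-level, so a spawned child can import it by name.
    return x * x


def main():
    # Request spawn explicitly instead of relying on the platform default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, range(5)))


if __name__ == "__main__":
    # Without this guard, each spawned child would re-run main() on import.
    main()
```

Using get_context keeps the choice local to this code path instead of changing the start method for the whole program.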

Pickle Boundaries and Function Design

Pool workers need importable callables. Top-level functions are safest.

Bad candidates:

  • nested functions
  • lambdas
  • closures capturing non-picklable objects

Good pattern:

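# module-level, so workers can import it by name

def transform(record):
    # expensive_cpu_step is a placeholder for the real CPU-bound function
    return record.id, expensive_cpu_step(record.payload)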

Then pass plain serializable inputs, not heavy runtime objects with hidden resources.
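One quick way to check a callable or argument before handing it to a pool is to try pickling it directly. The helper below is a hypothetical convenience, not part of multiprocessing. It uses the same pickle module that a Pool relies on to send tasks.

```python
import pickle


def double(x):
    return 2 * x


def make_nested():
    def inner(x):
        return x + 1
    return inner


def is_picklable(obj):
    """Return True if obj survives pickle.dumps."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    print("top-level function:", is_picklable(double))
    print("lambda:", is_picklable(lambda x: x))
    print("nested function:", is_picklable(make_nested()))
```

Running a check like this in a unit test can surface pickling problems before they appear as confusing errors inside a worker.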

Pool APIs and Workload Shapes

map

  • ordered results
  • waits for full completion

imap

  • iterator of ordered results
  • streaming consumption

imap_unordered

  • yields as tasks finish
  • ideal when task durations vary

apply_async

  • explicit async submit + callbacks
  • useful for custom orchestration

For long tails in task time, imap_unordered often improves total throughput and latency of first useful results.
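The sketch below illustrates the completion-order behavior of imap_unordered. The sleep is only a stand-in for CPU work of varying duration, and the function names are illustrative.

```python
import time
from multiprocessing import Pool


def slow_square(n):
    # Sleep stands in for CPU work whose duration varies by input.
    time.sleep(0.05 * n)
    return n, n * n


def main():
    inputs = [4, 1, 3, 2]
    with Pool(processes=4) as pool:
        # Results arrive in completion order, so short tasks are not held
        # back behind the long one submitted first.
        for n, sq in pool.imap_unordered(slow_square, inputs):
            print(f"{n}^2 = {sq}")


if __name__ == "__main__":
    main()
```

Returning the input alongside the result, as slow_square does, lets the consumer match results to inputs once ordering is no longer guaranteed.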

Chunking Strategy

Chunking amortizes IPC overhead by sending groups of items per dispatch.

Heuristic:

  • homogeneous tasks: larger chunks
  • heterogeneous tasks: smaller chunks

Measure with realistic distributions, not only averages. If p99 task time is much larger than median, too-large chunks create stragglers that delay pool completion.
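As a starting point for tuning, one possible heuristic is to aim for several chunks per worker so that a single slow chunk cannot dominate the run. The helper pick_chunksize and its default of four chunks per worker are assumptions for illustration. The right value depends on the measured task-time distribution.

```python
import math
from multiprocessing import Pool


def pick_chunksize(n_items, workers, chunks_per_worker=4):
    """Aim for several chunks per worker so one slow chunk cannot dominate."""
    return max(1, math.ceil(n_items / (workers * chunks_per_worker)))


def cube(x):
    return x ** 3


def main():
    items = list(range(1_000))
    workers = 4
    size = pick_chunksize(len(items), workers)
    with Pool(processes=workers) as pool:
        total = sum(pool.imap_unordered(cube, items, chunksize=size))
    print(size, total)


if __name__ == "__main__":
    main()
```

For skewed workloads, raising chunks_per_worker shrinks each chunk and reduces straggler risk at the price of more dispatches.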

Shared Memory Options

Sometimes copying data to each worker is too expensive.

Options:

  • multiprocessing.shared_memory (Python 3.8+)
  • multiprocessing.Array / Value for shared ctypes primitives
  • memory-mapped files (mmap, NumPy memmap)

Shared memory can dramatically reduce copy overhead for large arrays, but synchronization and lifecycle management become harder.
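A minimal lifecycle sketch for multiprocessing.shared_memory follows. To stay short it attaches the second handle in the same process. In a real pool, only the block's name (a short string) would be sent to the worker, which would attach the same way. The function name roundtrip is illustrative.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write payload into a shared block, then read it back via its name."""
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[:len(payload)] = payload
        # A worker would receive only owner.name and attach like this.
        view = shared_memory.SharedMemory(name=owner.name)
        try:
            # Slice explicitly: the block may be larger than requested.
            data = bytes(view.buf[:len(payload)])
        finally:
            view.close()      # every handle closes its own mapping
        return data
    finally:
        owner.close()
        owner.unlink()        # exactly one process frees the block


if __name__ == "__main__":
    print(roundtrip(b"large array bytes"))
```

The nested try/finally blocks reflect the lifecycle burden mentioned above: each attached handle must be closed, and the block must be unlinked once by whoever owns it.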

Manager Objects: Convenience vs Throughput

multiprocessing.Manager() provides proxy objects (dict, list, etc.) accessible across processes. This is convenient but slower because every operation is remote IPC via a manager server process.

Use manager objects for coordination metadata, not high-frequency hot-path data operations.
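The sketch below shows one way to keep proxy traffic low: the computation stays in local variables and the manager dict is touched once per task for status only. The function and variable names are illustrative.

```python
from multiprocessing import Manager, Pool


def work(args):
    status, task_id = args
    result = task_id ** 2       # the hot path stays in local variables
    status[task_id] = "done"    # one proxy call per task, for metadata only
    return task_id, result


def main():
    with Manager() as manager:
        status = manager.dict()
        with Pool(processes=2) as pool:
            results = dict(pool.map(work, [(status, i) for i in range(4)]))
        # copy() pulls the proxy contents back as a plain dict.
        print(results, status.copy())


if __name__ == "__main__":
    main()
```

Results still travel back through the pool's normal return path, and the manager carries only coordination state.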

Failure Handling and Worker Health

Production pools need robust error policy:

  • capture input payload identifiers with exceptions
  • decide retry strategy (idempotent tasks only)
  • recycle workers if memory leaks are suspected (maxtasksperchild)
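  • enforce per-task timeouts where possible

from multiprocessing import Pool

# maxtasksperchild=500 replaces each worker after it has run 500 tasks
with Pool(processes=8, maxtasksperchild=500) as pool:
    ...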

Worker recycling is valuable when third-party C extensions fragment memory over long runs.
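One way to attach input identifiers to failures is to wrap the task so it returns a tagged outcome instead of raising. The names risky_sqrt and safe_task and the (id, ok, value) tuple shape are assumptions for this sketch.

```python
from multiprocessing import Pool


def risky_sqrt(value):
    if value < 0:
        raise ValueError("negative input")
    return value ** 0.5


def safe_task(args):
    """Run one task and tag the outcome with its record id instead of raising."""
    record_id, value = args
    try:
        return record_id, True, risky_sqrt(value)
    except Exception as exc:
        return record_id, False, repr(exc)


def main():
    tasks = [("a", 4.0), ("b", -1.0), ("c", 9.0)]
    with Pool(processes=2, maxtasksperchild=500) as pool:
        outcomes = list(pool.imap_unordered(safe_task, tasks))
    failed = sorted(rid for rid, ok, _ in outcomes if not ok)
    print("retry candidates:", failed)


if __name__ == "__main__":
    main()
```

Because failures come back as ordinary results, one bad record does not abort the iteration, and the retry decision can be made in the parent.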

Cancellation and Shutdown

Common lifecycle methods:

  • close(): no more tasks
  • terminate(): hard stop workers
  • join(): wait for worker exit

Prefer graceful close/join for normal paths, terminate for incident recovery or deadlines.
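A sketch of that policy with an explicitly managed pool: close on the normal path, terminate on any failure, and join in both cases. The helper name run_all is illustrative.

```python
from multiprocessing import Pool


def cube(x):
    return x ** 3


def run_all(items, workers=2):
    pool = Pool(processes=workers)
    try:
        results = pool.map(cube, items)
        pool.close()        # normal path: no more tasks, let workers exit
    except BaseException:
        pool.terminate()    # failure or interrupt: stop workers immediately
        raise
    finally:
        pool.join()         # valid after either close() or terminate()
    return results


if __name__ == "__main__":
    print(run_all([1, 2, 3]))
```

Note that using Pool as a context manager calls terminate() on exit, so the explicit form is the one to reach for when a graceful close/join is wanted.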

NUMA and CPU Affinity (Advanced)

On high-core servers, NUMA topology and CPU affinity can impact performance for memory-intensive jobs. The standard library offers little here beyond os.sched_setaffinity on Linux, but OS-level pinning and process placement (for example with numactl) can help advanced workloads.

For most teams, better gains come earlier from reducing serialization volume and tuning chunks.
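For workloads where placement does matter, the following is a rough sketch of CPU pinning from inside Python. It covers affinity only, not NUMA memory placement, and os.sched_setaffinity is available only on some platforms (notably Linux), hence the hasattr checks. The function names are illustrative.

```python
import os
from multiprocessing import Process, Queue


def available_cpus():
    """CPUs this process may run on; falls back to a plain count elsewhere."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def pinned_worker(cpu, out):
    # Pin this worker to one CPU where the platform supports it.
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})
    out.put((cpu, os.getpid()))


def main():
    out = Queue()
    procs = [Process(target=pinned_worker, args=(cpu, out))
             for cpu in available_cpus()[:2]]
    for p in procs:
        p.start()
    # Drain the queue before joining so workers are never blocked on put().
    placements = [out.get() for _ in procs]
    for p in procs:
        p.join()
    print(placements)


if __name__ == "__main__":
    main()
```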

Multiprocessing in Data Pipelines

A common architecture in ETL/ML preprocessing:

  1. one reader process streams raw data chunks
  2. worker pool performs CPU-heavy parsing/feature extraction
  3. writer process batches outputs to storage

This keeps responsibilities clear and avoids every worker opening separate DB connections.
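The shape of that pipeline can be sketched in a simplified form. Here the reader is a generator and the writer is a batching loop, both living in the parent rather than in separate processes, which is an assumption made to keep the example short. The property it preserves is that workers only compute and never touch storage. All names are illustrative.

```python
from itertools import islice
from multiprocessing import Pool


def read_records(n):
    """Stand-in reader: yields raw lines one at a time."""
    for i in range(n):
        yield f"{i},{i * 2}"


def extract_features(line):
    """CPU-side parsing step run inside pool workers."""
    key, value = line.split(",")
    return int(key), int(value) ** 2


def batched(iterable, size):
    """Yield lists of up to `size` items so the writer can flush in bulk."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def main():
    with Pool(processes=4) as pool:
        features = pool.imap(extract_features, read_records(100), chunksize=10)
        for batch in batched(features, 25):
            # A real writer would do one bulk insert per batch here.
            print(f"writing {len(batch)} rows, first={batch[0]}")


if __name__ == "__main__":
    main()
```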

Interaction with Async Systems

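In async web services, CPU-heavy sections can be offloaded to a process pool. run_in_executor expects a concurrent.futures executor, so process_pool here is a ProcessPoolExecutor, not a multiprocessing.Pool: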

loop = asyncio.get_running_loop()
result = await loop.run_in_executor(process_pool, cpu_fn, payload)

This preserves event-loop responsiveness while still using multi-core compute.
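A self-contained version of the same idea, assuming a concurrent.futures.ProcessPoolExecutor as the pool. The handler name handle_request and the body of cpu_fn are placeholders.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_fn(payload):
    # Stand-in for CPU-heavy work; must be a module-level function.
    return sum(i * i for i in range(payload))


async def handle_request(process_pool, payload):
    loop = asyncio.get_running_loop()
    # The event loop keeps serving other coroutines while a worker computes.
    return await loop.run_in_executor(process_pool, cpu_fn, payload)


async def main():
    with ProcessPoolExecutor(max_workers=2) as process_pool:
        results = await asyncio.gather(
            handle_request(process_pool, 10_000),
            handle_request(process_pool, 20_000),
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

In a long-running service the executor would typically be created once at startup and shared across requests rather than built per call.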

Benchmarking Correctly

Benchmark end-to-end wall time and resource usage:

  • total runtime
  • CPU utilization per core
  • peak RSS memory
  • serialization time fraction
  • task failure/retry rate

Real-world datasets often include skew, malformed records, and non-uniform compute costs. Synthetic benchmarks with perfect uniform tasks can produce misleading scaling curves.
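A small harness along these lines can compare serial and pooled wall time on a deliberately skewed workload and time input pickling separately. The task sizes, worker count, and helper names are arbitrary choices for illustration. Real benchmarks should use representative data.

```python
import pickle
import time
from multiprocessing import Pool


def timed(fn, *args):
    """Return (result, wall_seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def work(n):
    return sum(i * i for i in range(n))


def run_serial(sizes):
    return [work(n) for n in sizes]


def run_pool(sizes, workers):
    # Pool startup is inside the timed region on purpose: end-to-end cost.
    with Pool(processes=workers) as pool:
        return pool.map(work, sizes)


def main():
    # Skewed sizes on purpose: a few tasks are far heavier than the rest.
    sizes = [20_000] * 100 + [1_000_000] * 4
    _, serial_s = timed(run_serial, sizes)
    _, pool_s = timed(run_pool, sizes, 4)
    _, pickle_s = timed(pickle.dumps, sizes)
    print(f"serial={serial_s:.2f}s pool={pool_s:.2f}s pickle_inputs={pickle_s:.4f}s")


if __name__ == "__main__":
    main()
```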

Real-World Example Categories

  • media transcoding farms splitting frame ranges across workers
  • geospatial tile rendering across CPU cores
  • fraud scoring where expensive feature extraction dominates
  • scientific Monte Carlo simulations with independent trials

These workloads share a property: each unit has substantial compute compared with IPC overhead.

Pitfalls Checklist

  1. forgetting the __main__ guard, so each spawned child re-runs the pool-creation code on import (current CPython aborts this with a RuntimeError)
  2. sending huge objects repeatedly instead of preloading or shared memory
  3. using manager proxy structures in tight loops
  4. assuming pool size should always equal CPU count
  5. ignoring backpressure and flooding pool submit queue
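For the backpressure pitfall, one possible approach is to cap the number of in-flight tasks with a semaphore that is released from the apply_async callbacks. The helper bounded_submit is hypothetical, not a Pool feature.

```python
import threading
from multiprocessing import Pool


def double(x):
    return x * 2


def bounded_submit(pool, fn, items, max_in_flight):
    """Submit items one by one, never exceeding max_in_flight unfinished tasks."""
    slots = threading.BoundedSemaphore(max_in_flight)

    def release(_):
        # Called in the parent when a task finishes, successfully or not.
        slots.release()

    pending = []
    for item in items:
        slots.acquire()  # blocks while max_in_flight tasks are outstanding
        pending.append(pool.apply_async(
            fn, (item,), callback=release, error_callback=release))
    return [p.get() for p in pending]


def main():
    with Pool(processes=2) as pool:
        print(bounded_submit(pool, double, range(10), max_in_flight=4))


if __name__ == "__main__":
    main()
```

This sketch still keeps every result handle in memory. For very large inputs the results would need to be consumed incrementally as well.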

One Thing to Remember

Multiprocessing performance comes from balancing compute per task against orchestration cost; design around serialization boundaries, chunking, and worker lifecycle to unlock real multi-core gains.
