Python Multiprocessing — Deep Dive
Multiprocessing is Python’s standard route to true CPU parallelism in CPython. It works by distributing work across isolated processes, each with its own interpreter and GIL. The engineering challenge is no longer “can it run in parallel?” but “can I keep IPC and orchestration overhead below compute gains?”
Cost Model First
Before APIs, understand cost components:
- process startup cost
- task dispatch overhead
- serialization (pickle) cost
- inter-process data transfer
- result aggregation
- memory footprint per worker
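Speedup appears when per-task compute time dominates those costs. If each task takes only about 200 microseconds, dispatch and pickling overhead is often comparable to the work itself, so multiprocessing usually loses unless tasks are batched.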
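To make the trade-off concrete, here is a deliberately crude back-of-the-envelope model. It assumes the parent pays a fixed dispatch cost per task one after another, while workers split the compute evenly. The function name, the 150 microsecond dispatch cost, and the 0.2 second startup cost are all hypothetical values chosen for illustration, not measurements.

```python
def estimated_speedup(n_tasks, task_s, dispatch_s, workers, startup_s=0.2):
    """Toy model: parent pays dispatch_s per task serially, workers share the compute."""
    serial = n_tasks * task_s
    parallel = startup_s + n_tasks * dispatch_s + (n_tasks * task_s) / workers
    return serial / parallel


if __name__ == "__main__":
    # Hypothetical overhead figure: 150 microseconds of dispatch + pickle per task.
    for task_s in (200e-6, 2e-3, 20e-3):
        s = estimated_speedup(10_000, task_s, dispatch_s=150e-6, workers=4)
        print(f"task={task_s * 1e3:.1f} ms -> estimated speedup {s:.2f}x")
```

Under these assumed numbers the 200 microsecond case comes out below 1x, while the 20 millisecond case approaches the worker count. Real figures should come from measuring your own workload.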
Start Methods: spawn, fork, forkserver
spawn
- starts a fresh interpreter
- re-imports the main module in each child
- safest cross-platform behavior (the default on Windows and macOS)
- requires the entry-point guard:
if __name__ == "__main__":
fork (Unix)
- copies parent process state via copy-on-write
- fast startup
- can inherit unsafe runtime state (threads, locks, open sockets)
forkserver
- avoids some fork hazards by forking from a clean server process
For production portability and fewer heisenbugs, designing for spawn constraints is usually the best default.
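A minimal sketch of a spawn-safe entry point follows. The function names are illustrative; the only library calls used are multiprocessing.get_context, Process, and current_process. Everything a child needs is defined at module level, and process creation happens only behind the guard.

```python
import multiprocessing as mp


def describe(label):
    """Module-level, so a spawned child can find it by importing this module."""
    return f"{label} ran in process {mp.current_process().name}"


def child_main(label):
    print(describe(label))


def main():
    # Request spawn explicitly so behavior matches across Linux, macOS, and Windows.
    ctx = mp.get_context("spawn")
    child = ctx.Process(target=child_main, args=("child",))
    child.start()
    child.join()
    return child.exitcode


if __name__ == "__main__":
    # Without this guard, the child's re-import of this module would reach
    # main() again and try to start another process.
    print("child exit code:", main())
```

Code written this way also runs unchanged under fork and forkserver, which is why designing for spawn tends to be the portable choice.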
Pickle Boundaries and Function Design
Pool workers need importable callables. Top-level functions are safest.
Bad candidates:
- nested functions
- lambdas
- closures capturing non-picklable objects
Good pattern:
# module-level
def transform(record):
    return record.id, expensive_cpu_step(record.payload)
Then pass plain serializable inputs, not heavy runtime objects with hidden resources.
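One way to check a callable or payload before handing it to a pool is to try pickling it yourself. The sketch below fills in the pattern above with hypothetical stand-ins: the Record dataclass, the body of expensive_cpu_step, and the is_picklable helper are assumptions made for illustration.

```python
import pickle
from dataclasses import dataclass


@dataclass
class Record:  # hypothetical stand-in for the record objects above
    id: int
    payload: bytes


def expensive_cpu_step(payload):
    # Placeholder for real CPU-bound work.
    return sum(payload) % 251


def transform(record):
    return record.id, expensive_cpu_step(record.payload)


def is_picklable(obj):
    """Return True if pickle.dumps accepts obj; a pool must pickle tasks the same way."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


if __name__ == "__main__":
    print("transform:", is_picklable(transform))        # module-level function
    print("lambda:", is_picklable(lambda r: r.id))      # cannot be looked up by name
    print(transform(Record(7, b"abc")))
```

Functions are pickled by reference (module plus qualified name), which is why lambdas and nested functions fail: the receiving process has no name to import.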
Pool APIs and Workload Shapes
map
- ordered results
- waits for full completion
imap
- iterator of ordered results
- streaming consumption
imap_unordered
- yields as tasks finish
- ideal when task durations vary
apply_async
- explicit async submit + callbacks
- useful for custom orchestration
For long tails in task time, imap_unordered often improves total throughput and latency of first useful results.
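The four calls can be compared side by side in a short sketch. The simulate function is a hypothetical task that just sleeps, so the ordering effects are visible without real CPU load; durations and pool size are arbitrary.

```python
import time
from multiprocessing import Pool


def simulate(duration_s):
    """Stand-in task whose cost depends on its input (hypothetical workload)."""
    time.sleep(duration_s)
    return duration_s


def main():
    durations = [0.4, 0.05, 0.05, 0.05]  # one slow task at the front
    with Pool(processes=2) as pool:
        # map: blocks until everything is done, results in input order.
        print("map:", pool.map(simulate, durations))

        # imap: ordered iterator; the slow first task holds back later results.
        print("imap:", list(pool.imap(simulate, durations)))

        # imap_unordered: results arrive as workers finish them.
        print("imap_unordered:", list(pool.imap_unordered(simulate, durations)))

        # apply_async: one task, an explicit handle, and an optional callback
        # that runs in the parent process.
        handle = pool.apply_async(
            simulate, (0.05,), callback=lambda r: print("callback got", r)
        )
        print("apply_async:", handle.get(timeout=5))


if __name__ == "__main__":
    main()
```

With the slow task first, imap_unordered will typically yield the short tasks before the long one, whereas imap cannot yield anything until the first result is ready.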
Chunking Strategy
Chunking amortizes IPC overhead by sending groups of items per dispatch.
Heuristic:
- homogeneous tasks: larger chunks
- heterogeneous tasks: smaller chunks
Measure with realistic distributions, not only averages. If p99 task time is much larger than median, too-large chunks create stragglers that delay pool completion.
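As a sketch of how chunk size is passed, the example below sets chunksize explicitly on map and imap_unordered. The helper default_map_chunksize mirrors the heuristic CPython's Pool.map applies when chunksize is left as None; that is an implementation detail rather than a documented guarantee, so treat it as an assumption. The chunk sizes 125 and 8 are arbitrary starting points, not recommendations.

```python
from multiprocessing import Pool


def default_map_chunksize(n_items, n_workers):
    """Mirror of the heuristic CPython's Pool.map uses when chunksize is None.

    Implementation detail, not a documented guarantee.
    """
    chunksize, extra = divmod(n_items, n_workers * 4)
    return chunksize + 1 if extra else chunksize


def square(x):
    return x * x


def main():
    items = range(1_000)
    with Pool(processes=4) as pool:
        # Uniform tasks: large chunks keep dispatch overhead low.
        uniform = pool.map(square, items, chunksize=125)
        # Skewed tasks: small chunks let idle workers pick up the slack.
        skewed = list(pool.imap_unordered(square, items, chunksize=8))
    print(len(uniform), len(skewed))


if __name__ == "__main__":
    print("default for 1000 items, 4 workers:", default_map_chunksize(1_000, 4))
    main()
```

Note that imap and imap_unordered default to chunksize=1, so for large inputs of cheap tasks an explicit value there often matters more than it does for map.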
Shared Memory Options
Sometimes copying data to each worker is too expensive.
Options:
- multiprocessing.shared_memory (Python 3.8+)
- Array/Value for primitive shared objects
- memory-mapped files (mmap, NumPy memmap)
Shared memory can dramatically reduce copy overhead for large arrays, but synchronization and lifecycle management become harder.
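A minimal sketch of the shared_memory option, using plain bytes so it needs no third-party packages. The helper names are illustrative. The lifecycle rule it tries to show: every process calls close() on its own handle, and only the creator calls unlink().

```python
from multiprocessing import Process, Queue, shared_memory


def sum_shared(name, length):
    """Attach to an existing block by name and sum its first `length` bytes."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # Release the view before close(); an exported buffer blocks closing.
        with shm.buf[:length] as view:
            return sum(view)
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()


def worker(name, length, out):
    out.put(sum_shared(name, length))


def main():
    data = bytes(range(100))
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    try:
        shm.buf[:len(data)] = data  # one copy in; workers then read in place
        out = Queue()
        p = Process(target=worker, args=(shm.name, len(data), out))
        p.start()
        total = out.get(timeout=10)
        p.join()
        return total
    finally:
        shm.close()
        shm.unlink()  # free the segment exactly once


if __name__ == "__main__":
    print(main())
```

Only the block's name and length cross the process boundary here. The length is passed explicitly because the allocated size may be rounded up beyond what was requested on some platforms.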
Manager Objects: Convenience vs Throughput
multiprocessing.Manager() provides proxy objects (dict, list, etc.) accessible across processes. This is convenient but slower because every operation is remote IPC via a manager server process.
Use manager objects for coordination metadata, not high-frequency hot-path data operations.
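The sketch below illustrates that split: heavy work happens on local objects, and the manager dict is touched once per shard to record status. The shard layout and function names are hypothetical.

```python
from multiprocessing import Manager, Pool


def process_shard(args):
    """Do the heavy work on local data, then touch the proxy once."""
    shard_id, numbers, status = args
    total = sum(n * n for n in numbers)   # hot path: plain local objects
    status[shard_id] = "done"             # one IPC round trip per shard
    return total


def main():
    shards = [(i, range(i * 1000, (i + 1) * 1000)) for i in range(4)]
    with Manager() as manager:
        status = manager.dict()
        with Pool(processes=2) as pool:
            totals = pool.map(
                process_shard, [(i, nums, status) for i, nums in shards]
            )
        # Copy out of the proxy before the manager shuts down.
        return totals, dict(status)


if __name__ == "__main__":
    totals, status = main()
    print(status)
```

Had the loop written to status on every element instead of once per shard, each write would have been a separate round trip to the manager process.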
Failure Handling and Worker Health
Production pools need robust error policy:
- capture input payload identifiers with exceptions
- decide retry strategy (idempotent tasks only)
- recycle workers if memory leaks are suspected (maxtasksperchild)
- enforce per-task timeouts where possible
from multiprocessing import Pool

with Pool(processes=8, maxtasksperchild=500) as pool:
    ...
Worker recycling is valuable when third-party C extensions fragment memory over long runs.
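One possible shape for such an error policy is sketched here. The task parse_amount and the wrapper guarded are hypothetical; the idea is that each result carries its input identifier, so a failure can be traced and retried without guessing which payload caused it.

```python
import multiprocessing as mp
from multiprocessing import Pool


def parse_amount(text):
    """Hypothetical task: raises ValueError on malformed input."""
    return round(float(text) * 100)


def guarded(task):
    """Return (task_id, ok, value_or_error) so failures stay tied to their input."""
    task_id, text = task
    try:
        return task_id, True, parse_amount(text)
    except Exception as exc:  # report the failure instead of losing the context
        return task_id, False, f"{type(exc).__name__}: {exc}"


def main():
    tasks = [(1, "3.50"), (2, "oops"), (3, "12")]
    with Pool(processes=2, maxtasksperchild=500) as pool:
        handles = [(t[0], pool.apply_async(guarded, (t,))) for t in tasks]
        for task_id, handle in handles:
            try:
                # The timeout bounds how long we wait; it does not stop the worker.
                print(handle.get(timeout=30))
            except mp.TimeoutError:
                print(task_id, "timed out; consider pool.terminate() and a retry")


if __name__ == "__main__":
    main()
```

The timeout on get() only limits the parent's wait. A task that is truly stuck keeps its worker busy until the pool is terminated, which is one reason per-task timeouts are listed as "where possible".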
Cancellation and Shutdown
Common lifecycle methods:
- close(): no more tasks
- terminate(): hard stop workers
- join(): wait for worker exit
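Prefer a graceful close() followed by join() on normal paths; reserve terminate() for incident recovery or missed deadlines. Leaving a with Pool(...) block calls terminate(), so collect all results before the block exits.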
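A sketch of both paths in one helper, under the assumption that a single deadline applies to the whole batch. The names run_batch and cube are illustrative.

```python
from multiprocessing import Pool


def cube(x):
    return x ** 3


def run_batch(items, deadline_s=30.0):
    pool = Pool(processes=2)
    try:
        pending = pool.map_async(cube, items)
        pool.close()                               # normal path: no more tasks
        results = pending.get(timeout=deadline_s)  # raises on a missed deadline
    except BaseException:
        pool.terminate()                           # deadline, error, or Ctrl-C
        raise
    finally:
        pool.join()                                # valid after close() or terminate()
    return results


if __name__ == "__main__":
    print(run_batch(range(5)))
```

join() must be preceded by close() or terminate(); the structure above guarantees one of them has run on every path.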
NUMA and CPU Affinity (Advanced)
On high-core servers, NUMA topology and CPU affinity can impact performance for memory-intensive jobs. Python stdlib doesn’t abstract this deeply, but OS-level pinning and process placement can help advanced workloads.
For most teams, better gains come earlier from reducing serialization volume and tuning chunks.
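For completeness, here is a rough sketch of OS-level pinning from Python using os.sched_setaffinity, which exists only on some platforms (notably Linux). The placement policy, pinning every worker to the first half of the allowed CPUs, is purely hypothetical. Real NUMA-aware placement needs topology information the standard library does not provide.

```python
import os
from multiprocessing import Pool


def allowed_cpus():
    """CPUs this process may run on, or None where the API is unavailable."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return None


def pin_worker(cpus):
    """Pool initializer: restrict the calling worker to the given CPUs."""
    if cpus and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)


def where_am_i(_):
    return os.getpid(), allowed_cpus()


def main():
    cpus = allowed_cpus()
    # Hypothetical policy: keep workers on the first half of the allowed CPUs.
    subset = cpus[: max(1, len(cpus) // 2)] if cpus else None
    with Pool(processes=2, initializer=pin_worker, initargs=(subset,)) as pool:
        for pid, seen in pool.map(where_am_i, range(2)):
            print(pid, seen)


if __name__ == "__main__":
    main()
```

On platforms without the affinity API the initializer does nothing and the pool runs unpinned.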
Multiprocessing in Data Pipelines
A common architecture in ETL/ML preprocessing:
- one reader process streams raw data chunks
- worker pool performs CPU-heavy parsing/feature extraction
- writer process batches outputs to storage
This keeps responsibilities clear and avoids every worker opening separate DB connections.
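A compressed sketch of that architecture follows. As a simplification, the reader and writer stages run in the parent process rather than in dedicated processes, and the data source and sink are hypothetical in-memory stand-ins.

```python
from itertools import islice
from multiprocessing import Pool


def read_lines():
    """Reader stage (hypothetical source): yields raw records lazily."""
    for i in range(20):
        yield f"{i},{i * 1.5}"


def parse(line):
    """CPU stage, run in workers: parse a record and derive a feature."""
    key, value = line.split(",")
    return int(key), float(value) ** 2


def batched(iterable, size):
    """Group an iterator into lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def write_batch(batch, sink):
    """Writer stage: the only place that touches storage (here, a list)."""
    sink.append(batch)


def main():
    sink = []
    with Pool(processes=2) as pool:
        parsed = pool.imap(parse, read_lines(), chunksize=4)
        for batch in batched(parsed, 8):
            write_batch(batch, sink)
    return sink


if __name__ == "__main__":
    print([len(b) for b in main()])
```

Because only write_batch touches the sink, a real version would need a single storage connection regardless of how many workers parse records.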
Interaction with Async Systems
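In async web services, CPU-heavy sections can be offloaded to a process pool. Note that run_in_executor expects a concurrent.futures executor such as ProcessPoolExecutor, not a multiprocessing.Pool: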
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(process_pool, cpu_fn, payload)
This preserves event-loop responsiveness while still using multi-core compute.
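A self-contained version of the snippet above might look like the following. The body of cpu_fn and the handler name are hypothetical; in a real service the executor would usually be created once at startup rather than per call to main().

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_fn(payload):
    """Stand-in for a CPU-bound function (hypothetical)."""
    return sum(i * i for i in range(payload))


async def handle_request(process_pool, payload):
    loop = asyncio.get_running_loop()
    # The event loop keeps serving other coroutines while a worker computes.
    return await loop.run_in_executor(process_pool, cpu_fn, payload)


async def main():
    with ProcessPoolExecutor(max_workers=2) as process_pool:
        return await asyncio.gather(
            handle_request(process_pool, 10_000),
            handle_request(process_pool, 20_000),
        )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The same pickling rules apply here: cpu_fn must be a module-level function and payload must be picklable.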
Benchmarking Correctly
Benchmark end-to-end wall time and resource usage:
- total runtime
- CPU utilization per core
- peak RSS memory
- serialization time fraction
- task failure/retry rate
Real-world datasets often include skew, malformed records, and non-uniform compute costs. Synthetic benchmarks with perfect uniform tasks can produce misleading scaling curves.
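A small measuring harness along these lines can cover two of the metrics above, wall time and serialization cost. The workload mix is a hypothetical skewed distribution, and the helper names are illustrative. It does not measure memory or per-core utilization, which are better taken from OS tools.

```python
import pickle
import time
from multiprocessing import Pool


def work(n):
    return sum(i * i for i in range(n))


def pickle_seconds(obj, repeats=5):
    """Best-of-N time for one dumps + loads round trip of obj."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        pickle.loads(pickle.dumps(obj))
        best = min(best, time.perf_counter() - start)
    return best


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def main():
    sizes = [50_000] * 30 + [2_000_000]  # skewed on purpose (hypothetical mix)
    serial, t_serial = timed(lambda: [work(n) for n in sizes])
    with Pool(processes=4) as pool:
        parallel, t_pool = timed(pool.map, work, sizes)
    assert serial == parallel  # a benchmark is only useful if results match
    print(f"serial {t_serial:.2f}s, pool {t_pool:.2f}s, "
          f"pickle per input {pickle_seconds(sizes[0]) * 1e6:.1f} us")


if __name__ == "__main__":
    main()
```

With one task far larger than the rest, the pool's wall time is bounded below by that single task, which is the kind of effect a uniform synthetic benchmark would hide.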
Real-World Example Categories
- media transcoding farms splitting frame ranges across workers
- geospatial tile rendering across CPU cores
- fraud scoring where expensive feature extraction dominates
- scientific Monte Carlo simulations with independent trials
These workloads share a property: each unit has substantial compute compared with IPC overhead.
Pitfalls Checklist
- forgetting the __main__ guard and causing recursive child spawning
- sending huge objects repeatedly instead of preloading or shared memory
- using manager proxy structures in tight loops
- assuming pool size should always equal CPU count
- ignoring backpressure and flooding pool submit queue
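The backpressure item is the least obvious of these, so here is one way it could be handled: a semaphore caps the number of unfinished apply_async submissions, and the completion callbacks free slots. The helper run_bounded and its parameters are hypothetical.

```python
import threading
from multiprocessing import Pool


def work(n):
    return n * 2


def run_bounded(items, max_in_flight=8, processes=2):
    """Submit via apply_async, never holding more than max_in_flight unfinished tasks."""
    slots = threading.BoundedSemaphore(max_in_flight)
    results = []

    def on_done(value):
        results.append(value)
        slots.release()

    def on_error(exc):
        slots.release()  # free the slot even when the task failed

    pool = Pool(processes=processes)
    try:
        for item in items:
            slots.acquire()  # blocks the producer while the pool is saturated
            pool.apply_async(
                work, (item,), callback=on_done, error_callback=on_error
            )
        pool.close()
        pool.join()  # waits for outstanding tasks and their callbacks
    except BaseException:
        pool.terminate()
        raise
    return sorted(results)


if __name__ == "__main__":
    print(run_bounded(range(100)))
```

Callbacks run in a helper thread of the parent process, so they should stay short; here they only append a value and release a slot.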
One Thing to Remember
Multiprocessing performance comes from balancing compute per task against orchestration cost; design around serialization boundaries, chunking, and worker lifecycle to unlock real multi-core gains.