Python Benchmark Methodology — Core Concepts

Design Python benchmarks that produce trustworthy numbers by controlling warmup, repetition, and environmental noise.

Why methodology matters

Many performance claims about Python code are wrong, not because the code is wrong but because the measurement is. A benchmark without methodology produces random numbers presented with confidence.

The three enemies of reliable benchmarks

1. System noise

Your OS runs hundreds of background processes. CPU frequency scales up and down. The disk cache state changes between runs. These factors introduce variance that has nothing to do with your code.

Mitigation: Close unnecessary applications, pin CPU frequency where possible, and run enough iterations to average out noise.

2. Python runtime warmup

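CPython pays several one-time costs the first time code runs: imports are executed, internal caches (like method lookup caches) are populated, and since 3.11 the specializing adaptive interpreter optimizes bytecode only after it has run a few times. The experimental JIT added in Python 3.13 is off by default and, where it is enabled, also needs time to activate. The first few runs are therefore usually slower than steady state.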

Mitigation: Include explicit warmup iterations that you discard from results.
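
A minimal sketch of discarding warmup iterations. The helper name `time_with_warmup`, the `target` callable, and the default counts are illustrative assumptions, not part of any library.

```python
import time


def time_with_warmup(target, warmup=5, runs=20):
    """Return per-run timings in seconds, excluding the warmup calls."""
    # Warmup: call the target and ignore the cost, so that imports, caches,
    # and interpreter specialization settle before anything is recorded.
    for _ in range(warmup):
        target()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        target()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    samples = time_with_warmup(lambda: sorted(range(10_000, 0, -1)))
    print(f"{len(samples)} measured runs, fastest {min(samples):.6f}s")
```

Printing the warmup timings once next to the measured ones is a quick way to check whether 5 warmup calls is enough for a given function.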

3. Garbage collection interference

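Python’s cyclic garbage collector is triggered by allocation counts rather than on a schedule, so from inside a timing loop its collections land at hard-to-predict moments. A GC pause during your measured code inflates that particular timing.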

Mitigation: Either disable GC during micro-benchmarks (gc.disable()) or include enough repetitions that GC pauses become statistical noise.
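
One way to scope `gc.disable()` to the timed region is a small context manager. This is a sketch: `gc_paused` and `time_once` are hypothetical helper names, and collecting once before disabling is an added design choice so that earlier garbage is not carried into the measurement.

```python
import gc
import time
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic GC inside the block, then restore the prior state."""
    was_enabled = gc.isenabled()
    gc.collect()  # clear existing garbage before the timed region starts
    gc.disable()
    try:
        yield
    finally:
        # Only re-enable if the collector was on to begin with.
        if was_enabled:
            gc.enable()


def time_once(target):
    """Time a single call of target with the cyclic GC switched off."""
    with gc_paused():
        start = time.perf_counter()
        target()
        return time.perf_counter() - start
```

Reference counting still frees most objects immediately while the cyclic collector is off, so this removes collection pauses without changing how ordinary deallocation behaves.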

Anatomy of a sound benchmark

A well-structured benchmark follows this pattern:

Setup — prepare data and state outside the timed region
Warmup — run the target function 3-10 times, discard results
Measurement — run N iterations, record each timing individually
Analysis — report median and percentiles, not just mean

The mean is misleading when outliers exist. One GC pause can double your mean while the median stays stable. Always report at least the median, the 95th percentile, and the standard deviation.
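
The four stages can be sketched with the standard library alone. The function names `run_benchmark` and `summarize`, and the defaults of 5 warmup calls and 50 runs, are assumptions made for illustration.

```python
import statistics
import time


def run_benchmark(setup, target, warmup=5, runs=50):
    """Run target(data) repeatedly and return one timing per measured run."""
    data = setup()                      # Setup: outside the timed region
    for _ in range(warmup):             # Warmup: results discarded
        target(data)
    timings = []
    for _ in range(runs):               # Measurement: each run timed on its own
        start = time.perf_counter()
        target(data)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Analysis: report median and spread alongside the mean."""
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(timings, n=20)[-1]
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "p95": p95,
        "stdev": statistics.stdev(timings),
    }
```

For 19 timings of 1.0 and a single outlier of 20.0, `summarize` reports a mean of 1.95 while the median stays at 1.0, which is the effect described above.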

Common misconception: bigger N is always better

Running a million iterations sounds rigorous but can mask real-world behavior. If your function works on the same small input every time, a million calls in a tight loop may keep that data hot in the CPU caches, a condition that production calls, interleaved with other work, rarely enjoy. Choose N large enough for statistical stability but small enough to reflect realistic access patterns.

A/B comparison protocol

When comparing two implementations:

Run them interleaved, not sequentially (ABABAB, not AAABBB)
Use the same input data for both
Run in the same process to control import and startup variance
Apply a statistical test instead of eyeballing: a paired t-test or Wilcoxon signed-rank test for interleaved (paired) timings, or Mann-Whitney U when the samples are not paired
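
A sketch of the protocol in one process, using only the standard library. The helper names are hypothetical. An exact sign test stands in for the statistical test here, a deliberately simple substitute for the tests named above, which in practice would usually come from a statistics package such as SciPy.

```python
import math
import time


def time_call(func, data):
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


def interleaved_compare(impl_a, impl_b, data, rounds=30, warmup=3):
    """Time A and B alternately (ABAB...) on the same input, in one process."""
    for _ in range(warmup):
        impl_a(data)
        impl_b(data)
    pairs = []
    for _ in range(rounds):
        a = time_call(impl_a, data)
        b = time_call(impl_b, data)
        pairs.append((a, b))
    return pairs


def sign_test_p_value(pairs):
    """Two-sided exact sign test on paired timings (ties are dropped)."""
    a_wins = sum(1 for a, b in pairs if a < b)
    b_wins = sum(1 for a, b in pairs if b < a)
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    # Chance of a split at least this lopsided if A and B were equally fast.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The sign test only asks how often one implementation beat the other, so it ignores the size of the difference. Report the median of the paired differences next to the p-value to show how large the effect is.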

Environment checklist

Factor	What to control
Python version	Pin exact version (3.12.3, not “3.12”)
CPU governor	Set to `performance` mode
Background load	Minimize; check with `top` or `htop`
Power state	Plugged in, not battery (laptops throttle)
Thermal state	Let the machine cool between heavy runs
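
Recording the environment with each result makes the checklist auditable later. The sketch below assumes a Linux-style sysfs path for the CPU governor. That file is absent on other operating systems and on many virtual machines, in which case the field is `None`. The function names are illustrative.

```python
import os
import platform
from pathlib import Path

# Linux-specific location (assumption); missing on macOS, Windows, and many VMs.
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")


def read_cpu_governor(path=GOVERNOR_PATH):
    """Return the governor name for CPU 0, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_load_average():
    """Return the 1-minute load average, or None where it is unavailable."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


def describe_environment():
    return {
        "python_version": platform.python_version(),   # exact, e.g. 3.12.3
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "cpu_governor": read_cpu_governor(),
        "load_average_1m": read_load_average(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Power and thermal state are not covered here because there is no portable standard-library call for them.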

Tools that enforce methodology

timeit module: handles repetition and disables garbage collection during timing by default, but does not warm up for you
pyperf: full statistical benchmark toolkit that spawns multiple worker processes, runs warmups, and warns about unstable results
pytest-benchmark: integrates benchmarks into your test suite with comparison reports
asv (airspeed velocity): tracks performance across git commits
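
As a small illustration of the first tool, the sketch below uses `timeit.repeat` with a callable. `DATA` and `target` are placeholder names. The untimed call before measurement is a precaution, since `timeit.repeat` itself only times the calls it is asked to make.

```python
import timeit

# Setup lives outside the timed callable so it is not measured.
DATA = list(range(1_000, 0, -1))


def target():
    return sorted(DATA)


# One untimed call as an explicit warmup.
target()

# repeat=5 yields five samples; each sample is the total time for `number` calls.
totals = timeit.repeat(target, repeat=5, number=1_000)
per_call = [total / 1_000 for total in totals]

print(f"best of 5: {min(per_call) * 1e6:.2f} us per call")
```

The minimum is printed here because system noise almost always makes a run slower, not faster. For anything beyond a quick check, summarize all five samples instead of only the best.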

The one thing to remember: a benchmark is an experiment, and experiments need controls — warmup, repetition, interleaved comparison, and statistical analysis.
