Python Benchmark Methodology — Core Concepts

Why methodology matters

Many performance claims about Python code are wrong, not because the code is wrong but because the measurement is. A benchmark without methodology is just random numbers presented with confidence.

The three enemies of reliable benchmarks

1. System noise

Your OS runs hundreds of background processes. CPU frequency scales up and down. The disk cache state changes between runs. These factors introduce variance that has nothing to do with your code.

Mitigation: Close unnecessary applications, pin CPU frequency where possible, and run enough iterations to average out noise.

2. Python runtime warmup

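CPython pays several one-time costs the first time code runs: modules are loaded when they are first imported, and internal caches (like method lookup caches) start out empty. Since Python 3.11 the specializing adaptive interpreter optimizes bytecode only after it has executed several times, and the JIT in Python 3.13+ (experimental and off by default) likewise needs warm code before it activates. The first few runs are therefore usually slower than steady state.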

Mitigation: Include explicit warmup iterations that you discard from results.
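
As a rough illustration of discarded warmup iterations, the sketch below times a function only after it has run a few times. The helper name time_with_warmup, the target workload, and the default counts are all made up for this example; treat it as one possible shape rather than a reference implementation.

```python
import time


def time_with_warmup(func, warmup=5, iterations=20):
    """Return per-call timings in seconds, excluding the warmup calls."""
    # Warmup: call the function and throw the results away, so one-time
    # costs (imports, cold interpreter caches) are paid before measuring.
    for _ in range(warmup):
        func()

    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def target():
    # Hypothetical workload, used only for illustration.
    return sorted(range(1000, 0, -1))


timings = time_with_warmup(target, warmup=5, iterations=20)
print(f"collected {len(timings)} timings, fastest {min(timings):.6f}s")
```

Only the calls inside the second loop contribute to the reported numbers; the warmup calls are never timed.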

3. Garbage collection interference

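Python’s cyclic garbage collector is triggered by allocation counts rather than by a clock, so from inside a timing loop its collections land at hard-to-predict moments. A collection that happens during your measured code inflates that particular timing.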

Mitigation: Either disable GC during micro-benchmarks (gc.disable()) or include enough repetitions that GC pauses become statistical noise.
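
One way to apply gc.disable() without leaving the collector off by accident is a small context manager that restores the previous state. This is a minimal sketch: gc_paused and build_pairs are hypothetical names invented here. Note that gc.disable() only turns off the cyclic collector; reference counting still frees objects as usual.

```python
import gc
import time
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector, restoring its prior state on exit."""
    was_enabled = gc.isenabled()
    gc.collect()  # collect existing garbage first so it is not carried into the run
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def build_pairs():
    # Hypothetical allocation-heavy workload, used only for illustration.
    return [(i, str(i)) for i in range(10_000)]


gc_state_before = gc.isenabled()

with gc_paused():
    start = time.perf_counter()
    build_pairs()
    elapsed = time.perf_counter() - start

print(f"elapsed with GC paused: {elapsed:.6f}s")
```

The try/finally ensures the collector is re-enabled even if the measured code raises.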

Anatomy of a sound benchmark

A well-structured benchmark follows this pattern:

  1. Setup — prepare data and state outside the timed region
  2. Warmup — run the target function 3-10 times, discard results
  3. Measurement — run N iterations, record each timing individually
  4. Analysis — report median and percentiles, not just mean

The mean is misleading when outliers exist. One GC pause can double your mean while the median stays stable. Always report at least the median, the 95th percentile, and the standard deviation.
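
The four steps and the summary statistics can be sketched with the standard library alone. The names run_benchmark and summarize, the input data, and the iteration counts are assumptions made for this example. The final lines use synthetic numbers (not real measurements) to show how a single outlier moves the mean but not the median.

```python
import statistics
import time


def run_benchmark(func, data, warmup=5, iterations=200):
    """Return one timing (in seconds) per measured call."""
    # Warmup: results are discarded.
    for _ in range(warmup):
        func(data)

    # Measurement: record each call individually.
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Analysis: median, 95th percentile, mean and standard deviation."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cut_points = statistics.quantiles(timings, n=100)
    return {
        "median": statistics.median(timings),
        "p95": cut_points[94],
        "mean": statistics.fmean(timings),
        "stdev": statistics.stdev(timings),
    }


# Setup: build the input outside the timed region (hypothetical data).
data = list(range(5000, 0, -1))

timings = run_benchmark(sorted, data)
summary = summarize(timings)
print({name: f"{value:.6f}" for name, value in summary.items()})

# Synthetic illustration: 19 samples of 1.0 plus one outlier of 21.0.
skewed_summary = summarize([1.0] * 19 + [21.0])
print(skewed_summary["mean"], skewed_summary["median"])  # mean 2.0, median 1.0
```

In the synthetic sample the single outlier doubles the mean from 1.0 to 2.0, while the median stays at 1.0.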

Common misconception: bigger N is always better

Running a million iterations sounds rigorous but can mask real-world behavior. If your function touches the same data on every call, a million calls in a tight loop may keep that data hot in the CPU caches, something that rarely happens in production where other work runs in between. Choose N large enough for statistical stability but small enough to reflect realistic access patterns.

A/B comparison protocol

When comparing two implementations:

  • Run them interleaved, not sequentially (ABABAB, not AAABBB)
  • Use the same input data for both
  • Run in the same process to control import and startup variance
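  • Apply a statistical test instead of eyeballing: a paired t-test or Wilcoxon signed-rank test when timings are paired round by round, or Mann-Whitney U when the two samples are independent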
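
The protocol above could look roughly like the sketch below. To stay within the standard library it swaps in a sign-flip permutation test on the paired differences rather than a t-test or Mann-Whitney U; with SciPy available, a library test would be the more usual choice. All function names, the two implementations being compared, and the round counts are invented for illustration.

```python
import random
import statistics
import time


def time_once(func, data):
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


def compare_interleaved(func_a, func_b, data, rounds=30, warmup=3):
    """Time A and B alternately (ABAB...) on the same data, in one process.

    Returns the per-round differences (time of A minus time of B).
    """
    for _ in range(warmup):
        func_a(data)
        func_b(data)

    diffs = []
    for _ in range(rounds):
        a = time_once(func_a, data)
        b = time_once(func_b, data)
        diffs.append(a - b)
    return diffs


def sign_flip_p_value(diffs, resamples=2000, seed=0):
    """Two-sided p-value for 'the mean paired difference is zero'.

    Under that hypothesis each difference is equally likely to be positive
    or negative, so signs are flipped at random and the resulting means are
    compared with the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (resamples + 1)


# Two hypothetical implementations of the same task.
def sort_with_sorted(values):
    return sorted(values)


def sort_copy_in_place(values):
    copy = list(values)
    copy.sort()
    return copy


data = list(range(2000, 0, -1))
diffs = compare_interleaved(sort_with_sorted, sort_copy_in_place, data)
print(f"median difference (A - B): {statistics.median(diffs):.8f}s")
print(f"sign-flip p-value: {sign_flip_p_value(diffs):.4f}")
```

Working on per-round differences is what makes the comparison paired: noise that affects both implementations in the same round largely cancels out.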

Environment checklist

Factor            What to control
Python version    Pin exact version (3.12.3, not “3.12”)
CPU governor      Set to performance mode
Background load   Minimize; check with top or htop
Power state       Plugged in, not battery (laptops throttle)
Thermal state     Let the machine cool between heavy runs
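
Part of this checklist can be recorded automatically next to the results, so that two runs can later be checked for comparable conditions. The sketch below is an assumption-heavy example: describe_environment and read_cpu_governor are hypothetical helpers, and the governor path is the Linux sysfs location for the first CPU, so the function returns None on other systems or when cpufreq is not exposed.

```python
import os
import platform
import sys


def read_cpu_governor(
    path="/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
):
    """Return the CPU frequency governor on Linux, or None if unavailable."""
    try:
        with open(path, encoding="ascii") as handle:
            return handle.read().strip()
    except OSError:
        return None


def describe_environment():
    """Collect facts worth storing alongside benchmark results."""
    return {
        "python_version": platform.python_version(),  # exact, e.g. 3.12.3
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "executable": sys.executable,
        "cpu_governor": read_cpu_governor(),
    }


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Power and thermal state are harder to read portably and are left out here; they still need a manual check.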

Tools that enforce methodology

  • timeit module — handles repetition and, by default, disables garbage collection during timing; warmup runs are left to you
  • pyperf — full statistical benchmark suite with outlier detection
  • pytest-benchmark — integrates benchmarks into your test suite with comparison reports
  • asv (airspeed velocity) — tracks performance across git commits
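
As a small example of the first tool, timeit.repeat from the standard library runs a statement in several batches. The statement, setup string, and counts below are arbitrary choices for illustration. Each returned value is the total time for one batch of number calls, so it is divided by number to get a per-call figure.

```python
import timeit

NUMBER = 1000  # calls per batch
REPEAT = 5     # number of batches

batch_totals = timeit.repeat(
    stmt="sorted(data)",
    setup="data = list(range(1000, 0, -1))",  # runs once per batch, untimed
    repeat=REPEAT,
    number=NUMBER,
)

per_call = [total / NUMBER for total in batch_totals]
print(f"fastest batch: {min(per_call) * 1e6:.2f} microseconds per call")
print(f"slowest batch: {max(per_call) * 1e6:.2f} microseconds per call")
```

A large gap between the fastest and slowest batch is a hint that something else on the machine interfered with the run.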

The one thing to remember: a benchmark is an experiment, and experiments need controls — warmup, repetition, interleaved comparison, and statistical analysis.
