Python Performance Regression Testing — Deep Dive
A mature performance regression testing pipeline does more than flag slowdowns — it pinpoints the exact commit, quantifies the impact, and correlates with production metrics. This guide covers building that pipeline from benchmark design through automated bisection.
Statistical change detection
Why simple thresholds fail
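A “fail if >10% slower” rule is easy to implement, but a single fixed cutoff misjudges both ends. On noisy benchmarks, run-to-run variation alone can exceed 10% and trigger false positives. On very stable benchmarks, a real and consistent 5% slowdown never trips the rule, which is a false negative. Statistical tests that account for each benchmark’s own variance are more robust.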
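To make both failure modes concrete, the following is a small simulation sketch rather than a real benchmark. It assumes Gaussian timing noise, five samples per run, and a 100 ms baseline; the function names and all numbers are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_run(rng, true_mean_ms, noise_sd_ms, samples=5):
    """Return simulated timings in ms (assumption: Gaussian noise)."""
    return [rng.gauss(true_mean_ms, noise_sd_ms) for _ in range(samples)]


def threshold_flags(baseline, current, threshold_pct=10.0):
    """Naive rule: flag when the mean is more than threshold_pct slower."""
    base_mean = statistics.mean(baseline)
    change_pct = (statistics.mean(current) - base_mean) / base_mean * 100
    return change_pct > threshold_pct


def flag_rate(true_slowdown_pct, noise_sd_ms, trials=2000, seed=0):
    """Fraction of simulated comparisons that the naive rule flags."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    flagged = 0
    for _ in range(trials):
        baseline = simulate_run(rng, 100.0, noise_sd_ms)
        current = simulate_run(rng, 100.0 + true_slowdown_pct, noise_sd_ms)
        flagged += threshold_flags(baseline, current)
    return flagged / trials


if __name__ == "__main__":
    # Identical code, noisy benchmark: every flag here is a false positive.
    print("noisy, no change:", flag_rate(0.0, noise_sd_ms=30.0))
    # Real 8% slowdown, stable benchmark: every miss is a false negative.
    print("stable, 8% slower:", flag_rate(8.0, noise_sd_ms=1.0))
```

Under these assumptions the noisy benchmark is flagged in a substantial share of trials even though nothing changed, while the stable benchmark's real 8% slowdown is almost never flagged.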
Implementing proper comparison
import json
import statistics

from scipy import stats


def load_benchmark(path):
    """Map each benchmark name to its raw timing samples."""
    with open(path) as f:
        data = json.load(f)
    return {
        b['name']: b['stats']['data']
        for b in data['benchmarks']
    }


def compare_benchmarks(baseline_path, current_path, alpha=0.05):
    baseline = load_benchmark(baseline_path)
    current = load_benchmark(current_path)
    results = []
    for name in baseline:
        if name not in current:
            continue  # benchmark removed or renamed in the current run
        b_data = baseline[name]
        c_data = current[name]
        # Mann-Whitney U test (non-parametric, no normality assumption)
        statistic, p_value = stats.mannwhitneyu(
            b_data, c_data, alternative='two-sided'
        )
        b_median = statistics.median(b_data)
        c_median = statistics.median(c_data)
        change_pct = (c_median - b_median) / b_median * 100
        is_regression = (
            p_value < alpha and  # statistically significant
            change_pct > 5       # and meaningfully slower (>5%)
        )
        results.append({
            'name': name,
            'baseline_median': b_median,
            'current_median': c_median,
            'change_pct': change_pct,
            'p_value': p_value,
            'significant': p_value < alpha,
            'regression': is_regression,
        })
    return results
The Mann-Whitney U test doesn’t assume a normal distribution, which makes it robust for benchmark data, where a few slow outlier runs often produce a long right tail.
Effect size with Cohen’s d
Statistical significance alone isn’t enough: with enough samples, even a tiny difference becomes “significant”. Use effect size to measure practical importance:
def cohens_d(group1, group2):
    # Positive d means group2 has the larger mean (slower, if values are timings)
    n1, n2 = len(group1), len(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    pooled_std = ((var1 * (n1-1) + var2 * (n2-1)) / (n1+n2-2)) ** 0.5
    return (statistics.mean(group2) - statistics.mean(group1)) / pooled_std

# Interpretation:
# |d| < 0.2: negligible
# 0.2 <= |d| < 0.5: small
# 0.5 <= |d| < 0.8: medium
# |d| >= 0.8: large
Only flag regressions where both p-value is significant AND effect size is at least “small” (|d| >= 0.2).
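The combined rule can be sketched as a small decision helper. The function names below are hypothetical, and the sketch assumes `d` was computed with the baseline as the first group, so a positive `d` means the current run is slower; speedups are deliberately not flagged.

```python
def effect_size_label(d):
    """Translate Cohen's d into the conventional magnitude labels."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


def should_flag(p_value, d, alpha=0.05, min_effect=0.2):
    """Flag only slowdowns that are both significant and at least 'small'.

    Assumption: d > 0 means the current run is slower than the baseline.
    """
    return p_value < alpha and d >= min_effect


if __name__ == "__main__":
    for p_value, d in [(0.01, 0.5), (0.01, 0.05), (0.01, -0.9), (0.2, 1.5)]:
        print(p_value, d, effect_size_label(d), should_flag(p_value, d))
```

Keeping the two checks in one place makes it harder for a CI script to apply one criterion and silently drop the other.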
Benchmark design for regression detection
Isolating what you’re testing
import requests

# create_test_user, serialize_users, populate_test_users, app, and TestClient
# are assumed to come from the project and its web framework's test utilities.

# BAD: tests the database + network + serialization + business logic
def test_api_endpoint_speed(benchmark):
    response = benchmark(requests.get, "http://localhost:8000/api/users")
    assert response.status_code == 200

# GOOD: tests only the business logic
def test_user_serialization_speed(benchmark):
    users = [create_test_user() for _ in range(100)]
    result = benchmark(serialize_users, users)
    assert len(result) == 100

# ALSO GOOD: integration benchmark with controlled dependencies
def test_user_endpoint_speed(benchmark, test_db, mock_cache):
    """Full endpoint with deterministic DB and cache"""
    populate_test_users(test_db, count=1000)
    client = TestClient(app)
    result = benchmark(client.get, "/api/users?limit=100")
    assert result.status_code == 200
Fixture-based data generation
import random

import pytest


@pytest.fixture
def large_dataset():
    """Deterministic dataset for reproducible benchmarks"""
    rng = random.Random(42)  # fixed seed
    return [
        {
            'id': i,
            'name': f'user_{i}',
            'score': rng.random(),
            'tags': [f'tag_{rng.randint(0,99)}' for _ in range(5)],
        }
        for i in range(10_000)
    ]


def test_filter_speed(benchmark, large_dataset):
    result = benchmark(filter_high_scores, large_dataset, threshold=0.8)
    assert all(r['score'] >= 0.8 for r in result)
Fixed random seeds ensure the same data across runs, eliminating one source of variance.
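The reproducibility claim is easy to check directly. This sketch uses a hypothetical `make_dataset` helper, a cut-down stand-in for the fixture's generator, to show that a private `random.Random` instance with a fixed seed yields identical data on every call, independent of the global `random` state.

```python
import random


def make_dataset(seed=42, size=1000):
    """Build a small dataset from a private, seeded generator."""
    rng = random.Random(seed)  # private instance: unaffected by random.seed()
    return [{'id': i, 'score': rng.random()} for i in range(size)]


if __name__ == "__main__":
    first = make_dataset()
    random.seed(12345)  # disturbing the global generator changes nothing
    second = make_dataset()
    print("same seed, same data:", first == second)
    print("different seed, same data:", first == make_dataset(seed=43))
```

Using a private generator rather than calling `random.seed()` globally also keeps other tests from altering the benchmark input.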
Automated git bisect for regression hunting
When a regression is detected, automatically find the guilty commit:
#!/usr/bin/env python3
"""auto_bisect.py — Find the commit that caused a performance regression."""
import subprocess
import json
import sys

BENCHMARK_CMD = "pytest tests/test_performance.py --benchmark-json=/tmp/bench.json -q"
BENCHMARK_NAME = "test_search_speed"
THRESHOLD_MS = 50.0  # max acceptable median time


def run_benchmark():
    """Run benchmark and return median time in ms"""
    result = subprocess.run(BENCHMARK_CMD, shell=True, capture_output=True)
    if result.returncode != 0:
        return None  # build failure, skip
    with open('/tmp/bench.json') as f:
        data = json.load(f)
    for b in data['benchmarks']:
        if b['name'] == BENCHMARK_NAME:
            return b['stats']['median'] * 1000  # to ms
    return None


def main():
    median = run_benchmark()
    if median is None:
        sys.exit(125)  # skip this commit
    if median > THRESHOLD_MS:
        print(f"SLOW: {median:.1f}ms > {THRESHOLD_MS}ms")
        sys.exit(1)  # bad commit
    else:
        print(f"OK: {median:.1f}ms <= {THRESHOLD_MS}ms")
        sys.exit(0)  # good commit


if __name__ == '__main__':
    main()
Usage:
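# Find the first bad commit between v1.0 (good) and HEAD (bad)
git bisect start HEAD v1.0
# auto_bisect.py must be untracked (or live outside the repository)
# so that it is present at every commit git checks out
git bisect run python auto_bisect.py
# Git reports the first bad commit; restore the original checkout afterwards
git bisect reset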
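A single noisy measurement can send the bisection down the wrong branch. One possible hardening, sketched here as an assumption rather than as part of the script above, is to run the benchmark several times per commit and decide on the median of the per-run medians. The `classify_commit` helper is hypothetical; the exit codes follow the `git bisect run` convention (0 for good, 1 for bad, 125 for skip).

```python
import statistics

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify_commit(medians_ms, threshold_ms, min_valid=1):
    """Map repeated benchmark medians to a git bisect exit code.

    medians_ms may contain None for runs that failed to build or execute.
    """
    valid = [m for m in medians_ms if m is not None]
    if len(valid) < min_valid:
        return SKIP  # not enough usable measurements for this commit
    return BAD if statistics.median(valid) > threshold_ms else GOOD


if __name__ == "__main__":
    # One slow outlier among three runs does not mark the commit as bad.
    print(classify_commit([40.0, 45.0, 90.0], threshold_ms=50.0))
```

The trade-off is runtime: every extra repetition multiplies the cost of each bisection step.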
CI pipeline architecture
Multi-stage performance pipeline
# .github/workflows/performance.yml
name: Performance Regression Tests
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  # Stage 1: Quick smoke test (runs on every push)
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -e ".[test]"
      - run: |
          pytest tests/test_performance.py \
            -k "smoke" \
            --benchmark-min-rounds=10 \
            --benchmark-disable-gc
  # Stage 2: Full benchmark (runs on self-hosted for consistency)
  full-benchmark:
    needs: smoke
    runs-on: [self-hosted, benchmark]
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Checkout base branch for baseline
        run: |
          git checkout ${{ github.base_ref }}
          pip install -e ".[test]"
          pytest tests/test_performance.py \
            --benchmark-json=baseline.json \
            --benchmark-min-rounds=50
      - name: Checkout PR branch
        run: |
          git checkout ${{ github.sha }}
          pip install -e ".[test]"
          pytest tests/test_performance.py \
            --benchmark-json=current.json \
            --benchmark-min-rounds=50
      - name: Compare
        run: |
          python scripts/compare_benchmarks.py \
            baseline.json current.json \
            --threshold-pct=10 \
            --significance=0.05 \
            --output=comparison.md
      - name: Post results
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: comparison.md
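The workflow posts `comparison.md` as a PR comment, but the script that writes it is not shown. As a rough sketch of the rendering step only, the hypothetical `render_markdown` function below assumes result dictionaries shaped like those returned by `compare_benchmarks`, with medians in seconds, and turns them into a Markdown table sorted by largest slowdown first.

```python
def render_markdown(results):
    """Render comparison results as a Markdown table, worst change first."""
    lines = [
        "| Benchmark | Baseline (ms) | Current (ms) | Change | p-value | Status |",
        "|---|---:|---:|---:|---:|---|",
    ]
    for r in sorted(results, key=lambda r: r['change_pct'], reverse=True):
        status = "REGRESSION" if r['regression'] else "ok"
        lines.append(
            f"| {r['name']} "
            f"| {r['baseline_median'] * 1000:.3f} "  # seconds -> ms
            f"| {r['current_median'] * 1000:.3f} "
            f"| {r['change_pct']:+.1f}% "
            f"| {r['p_value']:.4f} "
            f"| {status} |"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        {'name': 'test_fast', 'baseline_median': 0.002, 'current_median': 0.002,
         'change_pct': 0.0, 'p_value': 0.8, 'regression': False},
        {'name': 'test_slow', 'baseline_median': 0.010, 'current_median': 0.0125,
         'change_pct': 25.0, 'p_value': 0.001, 'regression': True},
    ]
    print(render_markdown(sample))
```

Sorting by change puts the regressions at the top of the comment, where reviewers will see them first.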
Self-hosted runner preparation
#!/bin/bash
# prepare-benchmark-runner.sh
# Run on dedicated benchmark machine
# Pin CPU frequency
sudo cpupower frequency-set -g performance
# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Isolate CPUs 2-3 for benchmarks
# Add to kernel cmdline: isolcpus=2,3 nohz_full=2,3
# Disable hyperthreading
echo off | sudo tee /sys/devices/system/cpu/smt/control
# Set swappiness to 0
sudo sysctl vm.swappiness=0
Production correlation
Benchmarks in CI approximate production but don’t match it exactly. Correlate CI benchmarks with production metrics:
# Emit benchmark results as metrics
import datadog


def report_benchmark_results(results, commit_sha):
    for bench in results:
        datadog.statsd.gauge(
            f'benchmark.{bench["name"]}.median_ms',
            bench['current_median'] * 1000,  # seconds to ms
            tags=[f'commit:{commit_sha}', 'env:ci']
        )


# In production, track the same operations
@datadog.statsd.timed('api.search.duration')
def search_products(query, limit):
    ...
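By plotting CI benchmark times alongside production latency, you can calibrate your CI thresholds. If a 10% CI regression typically corresponds to only a 5% production regression, CI exaggerates the impact and you can relax the CI threshold. If instead a 5% CI regression shows up as 10% in production, tighten it.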
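One simple way to turn that plot into a number, offered here as a sketch and not as the only reasonable model, is to fit a line through the origin to paired observations of CI change and production change, then scale the production target by the fitted ratio. The function names and the sample pairs are hypothetical.

```python
def fit_ci_to_prod_ratio(pairs):
    """Least-squares slope through the origin: prod_change ~ ratio * ci_change.

    pairs: sequence of (ci_change_pct, prod_change_pct) observations.
    """
    pairs = list(pairs)
    sxx = sum(ci * ci for ci, _ in pairs)
    if sxx == 0:
        raise ValueError("need at least one non-zero CI change")
    sxy = sum(ci * prod for ci, prod in pairs)
    return sxy / sxx


def ci_threshold_for(prod_target_pct, ratio):
    """CI change that corresponds to the tolerated production change."""
    if ratio <= 0:
        raise ValueError("ratio must be positive to derive a threshold")
    return prod_target_pct / ratio


if __name__ == "__main__":
    # Hypothetical history: each CI regression showed up half as large in production.
    history = [(10.0, 5.0), (20.0, 10.0), (4.0, 2.0)]
    ratio = fit_ci_to_prod_ratio(history)
    print("ratio:", ratio)
    print("CI threshold for a 5% production budget:", ci_threshold_for(5.0, ratio))
```

A line through the origin assumes that no change in CI means no change in production; if the plotted data shows an offset or a curve, a different model is needed.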
Performance budget tracking
# performance_budget.yaml
budgets:
  api_search:
    p50_ms: 50
    p99_ms: 200
    note: "Search endpoint, 1000 products"
  json_serialize:
    p50_ms: 5
    p99_ms: 15
    note: "Serialize 100-item response"
  startup_time:
    p50_ms: 2000
    p99_ms: 5000
    note: "Application cold start"
# check_budgets.py
import yaml
import json


def check_budgets(benchmark_results_path, budget_path):
    with open(budget_path) as f:
        budgets = yaml.safe_load(f)['budgets']
    with open(benchmark_results_path) as f:
        results = json.load(f)
    violations = []
    for bench in results['benchmarks']:
        # Strip only the leading "test_" (str.replace would also remove
        # occurrences in the middle of the name); requires Python 3.9+
        name = bench['name'].removeprefix('test_')
        if name in budgets:
            budget = budgets[name]
            median_ms = bench['stats']['median'] * 1000
            p99_ms = sorted(bench['stats']['data'])[
                int(0.99 * len(bench['stats']['data']))
            ] * 1000
            if median_ms > budget['p50_ms']:
                violations.append(
                    f"{name}: p50={median_ms:.1f}ms > budget={budget['p50_ms']}ms"
                )
            if p99_ms > budget['p99_ms']:
                violations.append(
                    f"{name}: p99={p99_ms:.1f}ms > budget={budget['p99_ms']}ms"
                )
    return violations
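The p99 lookup in `check_budgets` indexes the sorted samples at `int(0.99 * n)`, which selects the single largest sample whenever a benchmark has 100 or fewer data points. If that is too sensitive to one outlier, a nearest-rank percentile is one alternative. The helper below is a hypothetical sketch of that definition, not the method the script above uses.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    timings = list(range(1, 101))  # hypothetical samples: 1..100
    print(nearest_rank_percentile(timings, 99))
    print(nearest_rank_percentile(timings, 50))
```

Either way, a p99 taken from a few dozen rounds is close to the maximum, so tail budgets are only meaningful when the benchmark collects enough samples.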
Maintaining benchmark quality
Benchmarks rot just like code. Maintenance tasks:
- Monthly: Review benchmark variance. High-variance benchmarks need more iterations or better isolation.
- Quarterly: Update baselines to reflect intentional performance changes.
- Per release: Archive benchmark results alongside release artifacts.
- When adding features: Add benchmarks for new critical paths before merging.
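The monthly variance review can be partly automated. As a sketch, the hypothetical helpers below compute the coefficient of variation (standard deviation divided by mean) for each benchmark's raw timings and list those above a cutoff. The 5% cutoff is an assumption to tune per project, and `timings_by_name` is assumed to have the same shape as the output of `load_benchmark`.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation relative to the mean (unitless)."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean of samples is zero")
    return statistics.stdev(samples) / mean


def noisy_benchmarks(timings_by_name, max_cv=0.05):
    """Return names of benchmarks whose variation exceeds max_cv, sorted."""
    return sorted(
        name
        for name, samples in timings_by_name.items()
        if len(samples) >= 2 and coefficient_of_variation(samples) > max_cv
    )


if __name__ == "__main__":
    # Hypothetical timings in seconds.
    timings = {
        'test_stable': [1.00, 1.01, 0.99, 1.00],
        'test_noisy': [1.0, 1.5, 0.7, 1.2],
    }
    print(noisy_benchmarks(timings))
```

Benchmarks that show up in this list repeatedly are the candidates for more rounds or better isolation.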
The one thing to remember: effective performance regression testing combines statistical rigor (Mann-Whitney tests, effect size), CI automation (baseline comparison on every PR), and production correlation — together they catch regressions that “it works on my machine” testing never will.