Python Performance Regression Testing — Deep Dive

A mature performance regression testing pipeline does more than flag slowdowns — it pinpoints the exact commit, quantifies the impact, and correlates with production metrics. This guide covers building that pipeline from benchmark design through automated bisection.

Statistical change detection

Why simple thresholds fail

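A “fail if >10% slower” rule breaks down at both ends. On noisy benchmarks, run-to-run variance alone can exceed the threshold and produce false positives. On very stable benchmarks, a real regression smaller than 10% slips through as a false negative. Statistical tests that account for each benchmark’s variance are more robust.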
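To make both failure modes concrete, here is a small simulation. It is a sketch under stated assumptions: timings are drawn from a normal distribution with a fixed seed, each run is a single sample, and the noise levels (20% and 0.2%) and the 8% slowdown are invented for illustration.

```python
import random

def fixed_threshold_flags(baseline, current, pct=10.0):
    """The naive rule: flag when current is more than pct% slower."""
    return (current - baseline) / baseline * 100 > pct

def simulate(noise, slowdown, trials=1000, seed=0):
    """Fraction of trials the naive rule flags, for hypothetical timings.

    noise    -- relative standard deviation of a single measurement
    slowdown -- true relative slowdown of the current build (0.0 = none)
    """
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        baseline = rng.gauss(1.0, noise)
        current = rng.gauss(1.0 + slowdown, noise)
        if fixed_threshold_flags(baseline, current):
            flagged += 1
    return flagged / trials

# Noisy benchmark, no real change: every flag is a false positive.
false_positive_rate = simulate(noise=0.20, slowdown=0.0)

# Very stable benchmark, real 8% slowdown: every non-flag is a miss.
miss_rate = 1.0 - simulate(noise=0.002, slowdown=0.08)

print(f"false positives: {false_positive_rate:.0%}, "
      f"missed regressions: {miss_rate:.0%}")
```

The same 10% bar is too loose for one benchmark and too strict for the other, which is why the comparison should take each benchmark's own variance into account.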

Implementing proper comparison

import json
import statistics
from scipy import stats

def load_benchmark(path):
    with open(path) as f:
        data = json.load(f)
    return {
        b['name']: b['stats']['data']
        for b in data['benchmarks']
    }

def compare_benchmarks(baseline_path, current_path, alpha=0.05):
    baseline = load_benchmark(baseline_path)
    current = load_benchmark(current_path)
    
    results = []
    for name in baseline:
        if name not in current:
            continue
        
        b_data = baseline[name]
        c_data = current[name]
        
        # Mann-Whitney U test (non-parametric, no normality assumption)
        statistic, p_value = stats.mannwhitneyu(
            b_data, c_data, alternative='two-sided'
        )
        
        b_median = statistics.median(b_data)
        c_median = statistics.median(c_data)
        change_pct = (c_median - b_median) / b_median * 100
        
        is_regression = (
            p_value < alpha and  # statistically significant
            change_pct > 5       # and meaningfully slower (>5%)
        )
        
        results.append({
            'name': name,
            'baseline_median': b_median,
            'current_median': c_median,
            'change_pct': change_pct,
            'p_value': p_value,
            'significant': p_value < alpha,
            'regression': is_regression,
        })
    
    return results

The Mann-Whitney U test doesn’t assume normal distribution, making it robust for benchmark data which often has long tails.

Effect size with Cohen’s d

Statistical significance alone isn’t enough: with enough samples, even a tiny change can be “significant”. Use effect size to measure practical importance:

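def cohens_d(group1, group2):
    """Standardized mean difference; positive when group2 has the larger mean."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    pooled_std = ((var1 * (n1-1) + var2 * (n2-1)) / (n1+n2-2)) ** 0.5
    diff = statistics.mean(group2) - statistics.mean(group1)
    if pooled_std == 0:  # every sample identical within each group
        return 0.0 if diff == 0 else float('inf') * (1 if diff > 0 else -1)
    return diff / pooled_std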

# Interpretation:
# |d| < 0.2: negligible
# 0.2 <= |d| < 0.5: small
# 0.5 <= |d| < 0.8: medium
# |d| >= 0.8: large

Flag a regression only when the p-value is significant AND the effect size is at least “small” (|d| >= 0.2).
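The combined rule can be sketched as two small helpers. The names `effect_label` and `should_flag` are hypothetical, and the sketch assumes d was computed as current minus baseline, so only positive values count as slowdowns.

```python
def effect_label(d):
    """Map Cohen's d to the conventional size labels listed above."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

def should_flag(p_value, d, alpha=0.05, min_effect=0.2):
    """Flag only significant slowdowns of at least `min_effect`.

    Assumes d = (mean(current) - mean(baseline)) / pooled_std, so a
    positive value means the current build is slower.
    """
    return p_value < alpha and d >= min_effect

print(effect_label(0.35), should_flag(p_value=0.01, d=0.35))
```

Checking the sign of d rather than its absolute value keeps significant speedups from being reported as regressions.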

Benchmark design for regression detection

Isolating what you’re testing

# BAD: tests the database + network + serialization + business logic
def test_api_endpoint_speed(benchmark):
    response = benchmark(requests.get, "http://localhost:8000/api/users")
    assert response.status_code == 200

# GOOD: tests only the business logic
def test_user_serialization_speed(benchmark):
    users = [create_test_user() for _ in range(100)]
    result = benchmark(serialize_users, users)
    assert len(result) == 100

# ALSO GOOD: integration benchmark with controlled dependencies
def test_user_endpoint_speed(benchmark, test_db, mock_cache):
    """Full endpoint with deterministic DB and cache"""
    populate_test_users(test_db, count=1000)
    client = TestClient(app)
    result = benchmark(client.get, "/api/users?limit=100")
    assert result.status_code == 200

Fixture-based data generation

import pytest

@pytest.fixture
def large_dataset():
    """Deterministic dataset for reproducible benchmarks"""
    import random
    rng = random.Random(42)  # fixed seed
    return [
        {
            'id': i,
            'name': f'user_{i}',
            'score': rng.random(),
            'tags': [f'tag_{rng.randint(0,99)}' for _ in range(5)],
        }
        for i in range(10_000)
    ]

def test_filter_speed(benchmark, large_dataset):
    result = benchmark(filter_high_scores, large_dataset, threshold=0.8)
    assert all(r['score'] >= 0.8 for r in result)

Fixed random seeds ensure the same data across runs, eliminating one source of variance.
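A quick way to confirm this property is to build the data twice and compare. The sketch below uses a hypothetical `make_dataset` helper with a smaller record than the fixture above.

```python
import random

def make_dataset(seed, size=5):
    """Build a small hypothetical dataset from a private, seeded generator."""
    rng = random.Random(seed)
    return [
        {"id": i, "score": rng.random(), "tag": f"tag_{rng.randint(0, 99)}"}
        for i in range(size)
    ]

first = make_dataset(seed=42)
second = make_dataset(seed=42)
other = make_dataset(seed=43)

print(first == second)  # same seed, same data
print(first == other)   # a different seed gives different data
```

Using a private `random.Random` instance instead of calling `random.seed()` keeps the benchmark data independent of any other code that touches the global generator.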

Automated git bisect for regression hunting

When a regression is detected, automatically find the guilty commit:

#!/usr/bin/env python3
"""auto_bisect.py — Find the commit that caused a performance regression."""
import subprocess
import json
import sys

BENCHMARK_CMD = "pytest tests/test_performance.py --benchmark-json=/tmp/bench.json -q"
BENCHMARK_NAME = "test_search_speed"
THRESHOLD_MS = 50.0  # max acceptable median time

def run_benchmark():
    """Run benchmark and return median time in ms"""
    result = subprocess.run(BENCHMARK_CMD, shell=True, capture_output=True)
    if result.returncode != 0:
        return None  # build failure, skip
    
    with open('/tmp/bench.json') as f:
        data = json.load(f)
    
    for b in data['benchmarks']:
        if b['name'] == BENCHMARK_NAME:
            return b['stats']['median'] * 1000  # to ms
    return None

def main():
    median = run_benchmark()
    if median is None:
        sys.exit(125)  # skip this commit
    
    if median > THRESHOLD_MS:
        print(f"SLOW: {median:.1f}ms > {THRESHOLD_MS}ms")
        sys.exit(1)  # bad commit
    else:
        print(f"OK: {median:.1f}ms <= {THRESHOLD_MS}ms")
        sys.exit(0)  # good commit

if __name__ == '__main__':
    main()

Usage:

# Find the commit between v1.0 (good) and HEAD (bad)
git bisect start HEAD v1.0
git bisect run python auto_bisect.py
# Git automatically finds the first bad commit
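A single noisy run near the threshold can send the bisection down the wrong half of the history. One possible mitigation, sketched here with a hypothetical `classify_commit` helper, is to repeat the measurement and decide on the median. The `measure` callable stands in for a function like `run_benchmark` above.

```python
import statistics

def classify_commit(measure, threshold_ms, repeats=5):
    """Return a git-bisect exit code from several benchmark runs.

    measure -- hypothetical callable returning one median in ms, or None
               when the commit cannot be built or benchmarked
    """
    samples = []
    for _ in range(repeats):
        value = measure()
        if value is None:
            return 125  # tell git bisect to skip this commit
        samples.append(value)
    # The median of repeated runs is less sensitive to a single noisy run.
    return 1 if statistics.median(samples) > threshold_ms else 0

# Stand-in measurements; a real script would run the benchmark instead.
fake_runs = iter([48.0, 61.0, 49.5, 47.2, 50.5])
exit_code = classify_commit(lambda: next(fake_runs), threshold_ms=50.0)
print(exit_code)
```

The trade-off is bisection time: each step now costs `repeats` benchmark runs, so the repeat count is worth tuning against how noisy the benchmark is.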

CI pipeline architecture

Multi-stage performance pipeline

# .github/workflows/performance.yml
name: Performance Regression Tests

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  # Stage 1: Quick smoke test (runs on every push to an open PR)
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -e ".[test]"
      - run: |
          pytest tests/test_performance.py \
            -k "smoke" \
            --benchmark-min-rounds=10 \
            --benchmark-disable-gc

  # Stage 2: Full benchmark (runs on self-hosted for consistency)
  full-benchmark:
    needs: smoke
    runs-on: [self-hosted, benchmark]
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }

      - name: Checkout base branch for baseline
        run: |
          git checkout ${{ github.base_ref }}
          pip install -e ".[test]"
          pytest tests/test_performance.py \
            --benchmark-json=baseline.json \
            --benchmark-min-rounds=50

      - name: Checkout PR branch
        run: |
          git checkout ${{ github.sha }}
          pip install -e ".[test]"
          pytest tests/test_performance.py \
            --benchmark-json=current.json \
            --benchmark-min-rounds=50

      - name: Compare
        run: |
          python scripts/compare_benchmarks.py \
            baseline.json current.json \
            --threshold-pct=10 \
            --significance=0.05 \
            --output=comparison.md

      - name: Post results
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: comparison.md

Self-hosted runner preparation

#!/bin/bash
# prepare-benchmark-runner.sh
# Run on dedicated benchmark machine

# Select the performance governor (keeps the CPU from scaling frequency down)
sudo cpupower frequency-set -g performance

# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Isolate CPUs 2-3 for benchmarks
# Add to kernel cmdline: isolcpus=2,3 nohz_full=2,3

# Disable hyperthreading
echo off | sudo tee /sys/devices/system/cpu/smt/control

# Set swappiness to 0
sudo sysctl vm.swappiness=0
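These settings can silently revert after a reboot or kernel update, so it may help to verify them before each benchmark job. The sketch below uses a hypothetical `check_runner` helper. The `no_turbo` and `smt/control` paths come from the script above, while the governor and swappiness paths are the usual Linux locations and are assumed here.

```python
from pathlib import Path

# Expected values mirror the shell script above; the governor and
# swappiness paths are assumptions about a typical Linux host.
EXPECTED = {
    "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor": "performance",
    "/sys/devices/system/cpu/intel_pstate/no_turbo": "1",
    "/sys/devices/system/cpu/smt/control": "off",
    "/proc/sys/vm/swappiness": "0",
}

def check_runner(expected=EXPECTED):
    """Return a list of human-readable mismatches (empty means all good)."""
    problems = []
    for path, wanted in expected.items():
        try:
            actual = Path(path).read_text().strip()
        except OSError:
            problems.append(f"{path}: not readable")
            continue
        if actual != wanted:
            problems.append(f"{path}: expected {wanted!r}, found {actual!r}")
    return problems

if __name__ == "__main__":
    for problem in check_runner():
        print(problem)
```

Failing the job when the list is non-empty avoids recording baselines from a machine that has drifted out of its tuned state.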

Production correlation

Benchmarks in CI approximate production but don’t match it exactly. Correlate CI benchmarks with production metrics:

# Emit benchmark results as metrics
import datadog

def report_benchmark_results(results, commit_sha):
    for bench in results:
        datadog.statsd.gauge(
            f'benchmark.{bench["name"]}.median_ms',
            bench['current_median'] * 1000,
            tags=[f'commit:{commit_sha}', 'env:ci']
        )

# In production, track the same operations
@datadog.statsd.timed('api.search.duration')
def search_products(query, limit):
    ...

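By plotting CI benchmark times alongside production latency, you can calibrate your CI thresholds. For example, if a 10% CI regression typically corresponds to a 5% production regression, then holding production to a 5% budget calls for a CI threshold of about 10%. If the ratio runs the other way, the CI threshold has to be tighter than the production budget.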
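One simple way to turn paired observations into a threshold is a least-squares fit through the origin. This is a sketch, not a full calibration method: the history values are hypothetical, and the linear relationship between CI and production change is an assumption that should be checked against real data.

```python
def fit_slope(pairs):
    """Least-squares slope through the origin for (ci_pct, prod_pct) pairs."""
    numerator = sum(ci * prod for ci, prod in pairs)
    denominator = sum(ci * ci for ci, _ in pairs)
    return numerator / denominator

def ci_threshold_for(prod_budget_pct, pairs):
    """CI regression (in %) that maps to the given production budget."""
    return prod_budget_pct / fit_slope(pairs)

# Hypothetical history: (CI change %, production change %) per past regression.
history = [(10.0, 5.0), (20.0, 10.0), (4.0, 2.0)]

slope = fit_slope(history)
threshold = ci_threshold_for(5.0, history)
print(f"slope={slope:.2f}, CI threshold for a 5% production budget: {threshold:.1f}%")
```

With a slope below 1, CI overstates the production impact and the CI threshold can sit above the production budget. A slope above 1 means the opposite.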

Performance budget tracking

# performance_budget.yaml
budgets:
  api_search:
    p50_ms: 50
    p99_ms: 200
    note: "Search endpoint, 1000 products"
  
  json_serialize:
    p50_ms: 5
    p99_ms: 15
    note: "Serialize 100-item response"
  
  startup_time:
    p50_ms: 2000
    p99_ms: 5000
    note: "Application cold start"

# check_budgets.py
import yaml
import json

def check_budgets(benchmark_results_path, budget_path):
    with open(budget_path) as f:
        budgets = yaml.safe_load(f)['budgets']
    
    with open(benchmark_results_path) as f:
        results = json.load(f)
    
    violations = []
    for bench in results['benchmarks']:
        name = bench['name'].removeprefix('test_')  # strip only the leading prefix (Python 3.9+)
        if name in budgets:
            budget = budgets[name]
            median_ms = bench['stats']['median'] * 1000
            p99_ms = sorted(bench['stats']['data'])[
                int(0.99 * len(bench['stats']['data']))
            ] * 1000
            
            if median_ms > budget['p50_ms']:
                violations.append(
                    f"{name}: p50={median_ms:.1f}ms > budget={budget['p50_ms']}ms"
                )
            if p99_ms > budget['p99_ms']:
                violations.append(
                    f"{name}: p99={p99_ms:.1f}ms > budget={budget['p99_ms']}ms"
                )
    
    return violations
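The p99 value above is read directly from the sorted samples, so how much it means depends on the number of rounds. The sketch below uses a nearest-rank definition in a hypothetical `nearest_rank` helper, a slightly different indexing choice from the code above. It shows that with 50 rounds the 99th percentile is simply the slowest sample.

```python
import math

def nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

hundred = list(range(1, 101))  # stand-in timings 1..100
fifty = list(range(1, 51))     # stand-in timings 1..50

print(nearest_rank(hundred, 99), nearest_rank(fifty, 99))
```

If the p99 budget matters, it is worth collecting enough rounds for the percentile to be distinguishable from the maximum, since a single outlier otherwise decides the check.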

Maintaining benchmark quality

Benchmarks rot just like code. Maintenance tasks:

  1. Monthly: Review benchmark variance. High-variance benchmarks need more iterations or better isolation.
  2. Quarterly: Update baselines to reflect intentional performance changes.
  3. Per release: Archive benchmark results alongside release artifacts.
  4. When adding features: Add benchmarks for new critical paths before merging.
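The monthly variance review can be partly automated. A minimal sketch, assuming the name-to-samples mapping returned by `load_benchmark` earlier: the hypothetical `noisy_benchmarks` helper reports benchmarks whose coefficient of variation exceeds a cutoff. The 5% cutoff is illustrative, not a recommendation from any tool.

```python
import statistics

def noisy_benchmarks(samples_by_name, max_cv=0.05):
    """Return {name: cv} for benchmarks whose coefficient of variation
    (standard deviation divided by mean) exceeds max_cv."""
    flagged = {}
    for name, samples in samples_by_name.items():
        if len(samples) < 2:
            continue  # stdev needs at least two samples
        mean = statistics.mean(samples)
        if mean == 0:
            continue  # avoid dividing by zero on degenerate data
        cv = statistics.stdev(samples) / mean
        if cv > max_cv:
            flagged[name] = cv
    return flagged

# Hypothetical timings in seconds.
timings = {
    "test_filter_speed": [0.0100, 0.0101, 0.0099, 0.0100],
    "test_search_speed": [0.040, 0.055, 0.038, 0.061],
}
print(noisy_benchmarks(timings))
```

Benchmarks that show up here repeatedly are candidates for more rounds or better isolation before their results are trusted for gating.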

The one thing to remember: effective performance regression testing combines statistical rigor (Mann-Whitney tests, effect size), CI automation (baseline comparison on every PR), and production correlation — together they catch regressions that “it works on my machine” testing never will.
