Python Performance Regression Testing — Core Concepts

Set up automated performance gates in CI using pytest-benchmark, CodSpeed, and statistical comparison against baselines.

What performance regression testing catches

Ordinary code changes can introduce performance regressions without breaking a single functional test. Common causes:

Accidental N+1 queries — a new ORM method triggers individual queries inside a loop
Changed data structures — switching from a set lookup to a list scan
Import-time overhead — adding a heavy import to a frequently-used module
Serialization bloat — new fields that increase JSON payload size
Cache invalidation — a refactor that breaks caching logic

Without automated testing, these regressions accumulate silently.
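As a concrete illustration of the data-structure case, the sketch below times the same membership check against a list and against a set using the standard library's timeit. The sizes and names are illustrative assumptions, and absolute timings will differ from machine to machine.

```python
import timeit

N = 10_000
ids_list = list(range(N))      # membership test scans elements one by one
ids_set = set(ids_list)        # membership test is a hash lookup

target = N - 1                 # worst case for the list: the last element


def time_lookup(container, repeats=200):
    """Return total seconds for `repeats` membership checks."""
    return timeit.timeit(lambda: target in container, number=repeats)


list_seconds = time_lookup(ids_list)
set_seconds = time_lookup(ids_set)

print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

Both versions return the same answer, so a functional test would pass either way; only a timing comparison reveals the difference.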

Using pytest-benchmark

The simplest way to add performance tests to an existing pytest suite:

# tests/test_performance.py
import json

def test_json_serialization_speed(benchmark):
    data = {"users": [{"name": f"user_{i}", "age": i} for i in range(1000)]}
    result = benchmark(json.dumps, data)
    assert isinstance(result, str)

def test_search_speed(benchmark):
    from myapp.search import search_products
    results = benchmark(search_products, query="laptop", limit=100)
    assert len(results) <= 100

Run with comparison:

# Save baseline
pytest tests/test_performance.py --benchmark-save=baseline

# After changes, compare against the saved run
# (saved runs get a numeric prefix; 0001 assumes this was the first save)
pytest tests/test_performance.py --benchmark-compare=0001_baseline

pytest-benchmark handles calibration, repetition, and statistical analysis automatically. Warmup rounds are off by default on CPython; enable them with --benchmark-warmup=on.
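To make concrete what the plugin reports, here is a rough standard-library sketch of the same idea: call the function many times and summarize the timings. The `measure` helper is hypothetical and far simpler than pytest-benchmark, which also calibrates how many iterations to run per round.

```python
import json
import statistics
import time


def measure(func, *args, rounds=30, **kwargs):
    """Time `func` over several rounds and summarize the distribution."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {
        "rounds": rounds,
        "min": min(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "stddev": statistics.stdev(timings),
    }


data = {"users": [{"name": f"user_{i}", "age": i} for i in range(1000)]}
stats = measure(json.dumps, data)
print({key: round(value, 6) for key, value in stats.items()})
```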

Setting regression thresholds

The critical question: how much slower is too slow?

# Fail if any benchmark is >10% slower than baseline
pytest tests/test_performance.py \
    --benchmark-compare=0001_baseline \
    --benchmark-compare-fail=mean:10%

Choosing the right threshold:

Threshold	When to use
5%	Latency-critical paths (API response, real-time)
10%	Standard application code (default recommendation)
20%	Infrequently-run code (batch jobs, admin tools)
50%	Very noisy benchmarks or early-stage projects

Too tight a threshold causes false positives, because ordinary benchmark noise is enough to trigger a failure. Too loose a threshold misses real regressions. Start at 10% and adjust based on how stable your benchmarks are.
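The arithmetic behind a percentage gate is simple enough to sketch. The helpers and timings below are illustrative assumptions rather than pytest-benchmark's own code; they show how one measurement can fail a 10% gate and pass a 20% one.

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative slowdown of `current` versus `baseline`, in percent."""
    return (current - baseline) / baseline * 100.0


def is_regression(baseline: float, current: float, threshold_pct: float) -> bool:
    """True when `current` is slower than `baseline` by more than the threshold."""
    return percent_change(baseline, current) > threshold_pct


baseline_mean = 0.0020   # 2.0 ms, illustrative
current_mean = 0.0023    # 2.3 ms, illustrative

change = percent_change(baseline_mean, current_mean)
print(f"{change:.1f}% slower")
print(is_regression(baseline_mean, current_mean, 10))  # exceeds a 10% gate
print(is_regression(baseline_mean, current_mean, 20))  # within a 20% gate
```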

CI integration

GitHub Actions example

name: Performance Tests
on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -r requirements.txt pytest-benchmark

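      # Restores a baseline previously cached for the base branch.
      # A run on that branch has to populate this cache first; on a
      # cache miss there is nothing to compare against.
      - name: Download baseline
        uses: actions/cache@v4
        with:
          path: .benchmarks
          key: benchmarks-${{ github.base_ref }}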

      - name: Run benchmarks
        run: |
          pytest tests/test_performance.py \
            --benchmark-save=current \
            --benchmark-compare=0001_baseline \
            --benchmark-compare-fail=mean:10% \
            --benchmark-json=benchmark-results.json

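      # Needs `pull-requests: write` permission for the job's token.
      - name: Comment results on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            // Post mean timings from the JSON report as a PR comment
            const fs = require('fs');
            if (!fs.existsSync('benchmark-results.json')) return;
            const report = JSON.parse(fs.readFileSync('benchmark-results.json', 'utf8'));
            const rows = report.benchmarks.map(
              b => `| ${b.name} | ${(b.stats.mean * 1e3).toFixed(3)} ms |`);
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: ['| Benchmark | Mean |', '| --- | --- |', ...rows].join('\n'),
            });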
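A PR comment is most useful when it shows baseline and current numbers side by side. The sketch below builds such a table in Python from two reports in the shape written by --benchmark-json (a top-level "benchmarks" list whose entries carry "name" and "stats"). The helper names and the inline sample data are assumptions for illustration; in CI the reports would be loaded from files.

```python
import json


def load_means(report: dict) -> dict:
    """Map benchmark name -> mean seconds from a pytest-benchmark JSON report."""
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


def markdown_table(baseline: dict, current: dict) -> str:
    """Render a Markdown comparison of two name -> mean mappings."""
    lines = [
        "| Benchmark | Baseline (ms) | Current (ms) | Change |",
        "| --- | --- | --- | --- |",
    ]
    for name in sorted(baseline.keys() & current.keys()):
        old, new = baseline[name], current[name]
        change = (new - old) / old * 100
        lines.append(
            f"| {name} | {old * 1e3:.3f} | {new * 1e3:.3f} | {change:+.1f}% |"
        )
    return "\n".join(lines)


# Inline stand-ins for two report files (made-up numbers).
baseline_report = json.loads(
    '{"benchmarks": [{"name": "test_search_speed", "stats": {"mean": 0.0040}}]}'
)
current_report = json.loads(
    '{"benchmarks": [{"name": "test_search_speed", "stats": {"mean": 0.0046}}]}'
)

table = markdown_table(load_means(baseline_report), load_means(current_report))
print(table)
```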

Handling noisy CI environments

Cloud CI runners have variable performance. Mitigate noise:

Use dedicated runners — self-hosted with consistent hardware
Increase repetitions — --benchmark-min-rounds=50
Compare medians, not means — --benchmark-compare-fail=median:15%
Warm up the runner — run benchmarks twice, use second run
Pin CPU frequency — if self-hosted, set governor to performance
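The case for comparing medians can be seen with a few numbers. In this sketch the timings are made up: the two runs are identical except for one round that was slowed by a noisy neighbour on the runner.

```python
import statistics

# Illustrative round timings in milliseconds.
baseline_ms = [10.0, 10.1, 9.9, 10.0, 10.0]
current_ms = [10.0, 10.1, 9.9, 10.0, 25.0]   # same code, one outlier round


def pct_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


mean_change = pct_change(statistics.mean(baseline_ms), statistics.mean(current_ms))
median_change = pct_change(
    statistics.median(baseline_ms), statistics.median(current_ms)
)

print(f"mean changed by {mean_change:.1f}%")
print(f"median changed by {median_change:.1f}%")
```

A single outlier moves the mean far past a 10% gate while the median does not move at all, which is why the median tends to be the safer statistic on shared runners.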

Tracking trends over time

Individual comparisons catch sudden regressions. Trend tracking catches gradual degradation.
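A small worked example shows why both views are needed. The history below is invented: every commit is only slightly slower than the one before it, so a commit-to-commit gate never fires, yet the accumulated drift is well past the threshold.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


# Illustrative mean timings (ms) for one benchmark across five commits.
history_ms = [100.0, 104.0, 108.0, 112.0, 117.0]
THRESHOLD_PCT = 10

# Commit-to-commit comparison: every step stays under the gate.
step_changes = [pct_change(a, b) for a, b in zip(history_ms, history_ms[1:])]

# Comparison against the oldest point: the accumulated drift is visible.
total_change = pct_change(history_ms[0], history_ms[-1])

print([f"{c:.1f}%" for c in step_changes])
print(f"total: {total_change:.1f}%")
```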

Using CodSpeed

# Integrates with GitHub, tracks performance across commits
# (the --codspeed flag comes from the pytest-codspeed plugin, which must be installed)
- uses: CodSpeedHQ/action@v3
  with:
    run: pytest tests/test_performance.py --codspeed

CodSpeed provides dashboards showing performance trends across commits and branches, with automatic regression detection.

Using asv (airspeed velocity)

# Track benchmarks across git history
asv run v1.0..HEAD
asv publish
asv preview  # serves the published site locally; add --browser to open it

asv creates a website showing performance over time, making it easy to identify which commit introduced a regression.

Common misconception: performance tests are flaky

Performance tests become flaky mainly when the environment is uncontrolled. With proper warmup, sufficient iterations, and statistical thresholds, they can approach the reliability of functional tests. The key is accepting that performance is a distribution, not a single number, and setting thresholds based on the measured variance of your specific benchmarks.
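One way to derive a threshold from variance is to re-run the same commit several times and scale the observed spread. The rule below (a multiple of the coefficient of variation, with a lower bound) is an illustrative assumption, not a standard; the multiplier `k`, the floor, and the sample numbers are all made up.

```python
import statistics


def noise_based_threshold(run_means, k=3.0, floor_pct=5.0):
    """Suggest a threshold (%) from the spread of repeated baseline runs.

    Uses k times the coefficient of variation, with a lower bound.
    Both `k` and `floor_pct` are illustrative defaults.
    """
    cv_pct = statistics.stdev(run_means) / statistics.mean(run_means) * 100
    return max(floor_pct, k * cv_pct)


# Means (ms) from re-running the same commit several times.
stable_benchmark = [20.0, 20.2, 19.9, 20.1, 19.8]
noisy_benchmark = [20.0, 23.0, 18.0, 22.0, 17.0]

print(f"stable: {noise_based_threshold(stable_benchmark):.1f}%")
print(f"noisy:  {noise_based_threshold(noisy_benchmark):.1f}%")
```

A stable benchmark ends up at the floor, while a noisy one gets a much looser gate, which is usually a signal to fix the benchmark or its environment before trusting it.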

What to benchmark

Don’t benchmark everything. Focus on:

Critical user paths — login, search, checkout, API response
Data-heavy operations — serialization, database queries, file processing
Known hotspots — functions that profiling has identified as bottlenecks
Algorithmic code — sorting, searching, graph traversal

Skip:

Simple getters/setters
Configuration loading (runs once)
Test utilities
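Profiling is the usual way to find the hotspots worth benchmarking. The sketch below uses the standard library's cProfile and pstats on a deliberately slow, hypothetical function pair (`handle_request` and `slow_lookup` are stand-ins for real application code).

```python
import cProfile
import io
import pstats


def slow_lookup(items, wanted):
    """Deliberately inefficient: scans a list for every wanted value."""
    return [w for w in wanted if w in items]


def handle_request():
    """Stand-in for an application entry point."""
    items = list(range(2_000))
    return slow_lookup(items, range(0, 4_000, 2))


profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Sort by cumulative time and keep the top few entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Functions that dominate the cumulative-time column are candidates for a dedicated benchmark; functions that barely register are not worth gating.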

The one thing to remember: automated performance regression tests turn “it seems slower” into “commit abc123 made search 15% slower” — set up pytest-benchmark in CI with a 10% threshold and you’ll catch most regressions before they ship.
