Python Performance Regression Testing — Core Concepts
What performance regression testing catches
Code changes cause performance regressions more often than you’d expect. Common causes:
- Accidental N+1 queries — a new ORM method triggers individual queries inside a loop
- Changed data structures — switching from a set lookup to a list scan
- Import-time overhead — adding a heavy import to a frequently-used module
- Serialization bloat — new fields that increase JSON payload size
- Cache invalidation — a refactor that breaks caching logic
Without automated testing, these regressions accumulate silently.
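The N+1 case in the list above can be made concrete by counting queries. This is a minimal sketch using only the standard library's sqlite3 module; the schema and the helper names (`make_db`, `count_selects`, `titles_n_plus_one`, `titles_joined`) are made up for illustration and stand in for whatever ORM your application uses.

```python
import sqlite3


def make_db(n_authors=5):
    """Build a throwaway in-memory database with one book per author."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
    )
    for i in range(n_authors):
        conn.execute("INSERT INTO author (id, name) VALUES (?, ?)", (i, f"author_{i}"))
        conn.execute(
            "INSERT INTO book (author_id, title) VALUES (?, ?)", (i, f"book_{i}")
        )
    conn.commit()
    return conn


def count_selects(conn, fn):
    """Run fn(conn) and return how many SELECT statements it executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        fn(conn)
    finally:
        conn.set_trace_callback(None)
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


def titles_n_plus_one(conn):
    # One query for the authors, then one more query per author.
    author_ids = [row[0] for row in conn.execute("SELECT id FROM author ORDER BY id")]
    return [
        conn.execute("SELECT title FROM book WHERE author_id = ?", (a,)).fetchone()[0]
        for a in author_ids
    ]


def titles_joined(conn):
    # The same data fetched with a single JOIN.
    rows = conn.execute(
        "SELECT book.title FROM author JOIN book ON book.author_id = author.id "
        "ORDER BY author.id"
    )
    return [title for (title,) in rows]
```

Both functions return the same titles, but the first issues one query per author on top of the initial one. A query-count assertion like this is deterministic, so it can sit next to timing-based benchmarks without adding noise.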
Using pytest-benchmark
The simplest way to add performance tests to an existing pytest suite:
# tests/test_performance.py
import json


def test_json_serialization_speed(benchmark):
    data = {"users": [{"name": f"user_{i}", "age": i} for i in range(1000)]}
    result = benchmark(json.dumps, data)
    assert isinstance(result, str)


def test_search_speed(benchmark):
    from myapp.search import search_products

    results = benchmark(search_products, query="laptop", limit=100)
    assert len(results) <= 100
Run with comparison:
# Save baseline
pytest tests/test_performance.py --benchmark-save=baseline
# After changes, compare
pytest tests/test_performance.py --benchmark-compare=0001_baseline
pytest-benchmark handles calibration, repetition, and statistical analysis automatically. Warmup is off by default on CPython; enable it with --benchmark-warmup=on.
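For intuition about what a benchmark runner does on each test, here is a simplified sketch built on the standard library. It is not pytest-benchmark's implementation; the `measure` helper and its `rounds` and `warmup_rounds` parameters are hypothetical.

```python
import statistics
import time


def measure(fn, *args, rounds=20, warmup_rounds=3, **kwargs):
    """Time fn over several rounds and summarize the distribution."""
    # Warmup calls are executed but not recorded.
    for _ in range(warmup_rounds):
        fn(*args, **kwargs)

    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)

    return {
        "rounds": len(timings),
        "min": min(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "stddev": statistics.stdev(timings),
    }


stats = measure(sorted, list(range(1000, 0, -1)))
```

The result is a set of summary statistics rather than one number, which is why comparisons are expressed against a chosen statistic such as the mean or median.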
Setting regression thresholds
The critical question: how much slower is too slow?
# Fail if any benchmark's mean is >10% slower than baseline
pytest tests/test_performance.py \
  --benchmark-compare=0001_baseline \
  --benchmark-compare-fail=mean:10%
Choosing the right threshold:
| Threshold | When to use |
|---|---|
| 5% | Latency-critical paths (API response, real-time) |
| 10% | Standard application code (default recommendation) |
| 20% | Infrequently-run code (batch jobs, admin tools) |
| 50% | Very noisy benchmarks or early-stage projects |
Too tight a threshold causes false positives (benchmark noise triggers failures). Too loose misses real regressions. Start at 10% and adjust based on your benchmark stability.
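The arithmetic behind a percentage threshold is simple enough to sketch. The function names below are illustrative and are not part of pytest-benchmark.

```python
def percent_change(baseline, current):
    """Relative change of current versus baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def is_regression(baseline, current, threshold_pct=10.0):
    """True when current is slower than baseline by more than threshold_pct."""
    return percent_change(baseline, current) > threshold_pct
```

With a 100 ms baseline, a 112 ms run fails a 10% threshold and a 105 ms run passes it. A run that gets faster produces a negative change and never fails.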
CI integration
GitHub Actions example
name: Performance Tests
on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt pytest-benchmark
      # Assumes an earlier run on the base branch saved 0001_baseline into
      # .benchmarks; on a cache miss there is nothing to compare against.
      - name: Download baseline
        uses: actions/cache@v4
        with:
          path: .benchmarks
          key: benchmarks-${{ github.base_ref }}
      - name: Run benchmarks
        run: |
          pytest tests/test_performance.py \
            --benchmark-save=current \
            --benchmark-compare=0001_baseline \
            --benchmark-compare-fail=mean:10% \
            --benchmark-json=benchmark-results.json
      - name: Comment results on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            // Post benchmark comparison as PR comment
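The comment step needs something to post. One option is a small script that turns benchmark-results.json into a markdown table. The sketch below assumes the report has a top-level "benchmarks" list whose entries carry a "name" and a "stats" dict with "mean", "median", and "rounds" in seconds; check a real report from your pytest-benchmark version before relying on it. The `format_results` helper and the sample data are made up.

```python
def format_results(report):
    """Render a benchmark report dict as a markdown table."""
    lines = [
        "| Benchmark | Mean (ms) | Median (ms) | Rounds |",
        "|---|---|---|---|",
    ]
    for bench in sorted(report["benchmarks"], key=lambda b: b["name"]):
        stats = bench["stats"]
        lines.append(
            f"| {bench['name']} "
            f"| {stats['mean'] * 1000:.3f} "
            f"| {stats['median'] * 1000:.3f} "
            f"| {stats['rounds']} |"
        )
    return "\n".join(lines)


# Illustrative data only; a real report would come from
# json.load(open("benchmark-results.json")).
sample_report = {
    "benchmarks": [
        {
            "name": "test_search_speed",
            "stats": {"mean": 0.0125, "median": 0.012, "rounds": 80},
        },
        {
            "name": "test_json_serialization_speed",
            "stats": {"mean": 0.0005, "median": 0.0005, "rounds": 1500},
        },
    ]
}

table = format_results(sample_report)
```

The resulting string can be written to a file and read by the github-script step, or posted by any other mechanism your CI already uses.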
Handling noisy CI environments
Cloud CI runners have variable performance. Mitigate noise:
- Use dedicated runners: self-hosted with consistent hardware
- Increase repetitions: --benchmark-min-rounds=50
- Compare medians, not means: --benchmark-compare-fail=median:15%
- Warm up the runner: run benchmarks twice, use second run
- Pin CPU frequency: if self-hosted, set governor to performance
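The advice to compare medians rather than means is easy to check with a small experiment. The timings below are invented: seven quiet rounds, plus one round where a noisy neighbor stalled the runner.

```python
import statistics


def summarize(timings_ms):
    """Return (mean, median) for a list of timings in milliseconds."""
    return statistics.mean(timings_ms), statistics.median(timings_ms)


quiet = [10.1, 10.0, 9.9, 10.2, 10.0, 9.8, 10.1]
noisy = quiet + [45.0]  # one outlier round

quiet_mean, quiet_median = summarize(quiet)
noisy_mean, noisy_median = summarize(noisy)

# How far a single outlier moves each statistic, in percent.
mean_shift_pct = (noisy_mean - quiet_mean) / quiet_mean * 100
median_shift_pct = (noisy_median - quiet_median) / quiet_median * 100
```

A single outlier moves the mean by tens of percent while the median barely changes, so a median-based threshold is far less likely to fail on one bad round.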
Tracking trends over time
Individual comparisons catch sudden regressions. Trend tracking catches gradual degradation.
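To see why pairwise comparison misses gradual degradation, consider this sketch. The commit names and timings are invented, and `find_drift` is a hypothetical helper rather than part of any tool mentioned here.

```python
def find_drift(history, threshold_pct=10.0):
    """history is a list of (commit, mean_seconds), oldest first.

    Returns the commits that fail a commit-to-commit check and the
    total change from the first entry to the last, in percent.
    """
    flagged = []
    for (_, previous), (commit, current) in zip(history, history[1:]):
        if (current - previous) / previous * 100 > threshold_pct:
            flagged.append(commit)

    first, last = history[0][1], history[-1][1]
    total_change_pct = (last - first) / first * 100
    return flagged, total_change_pct


history = [
    ("c1", 0.100),
    ("c2", 0.104),
    ("c3", 0.108),
    ("c4", 0.112),
    ("c5", 0.117),
]
flagged, total_change_pct = find_drift(history)
```

No single step exceeds 10%, so nothing is flagged, yet the benchmark ends up about 17% slower than where it started. Comparing against a fixed or long-lived baseline is what exposes that.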
Using codspeed
# Integrates with GitHub, tracks performance across commits
- uses: CodSpeedHQ/action@v3
  with:
    run: pytest tests/test_performance.py --codspeed
CodSpeed provides dashboards showing performance trends across commits and branches, with automatic regression detection.
Using asv (airspeed velocity)
# Track benchmarks across git history
asv run v1.0..HEAD
asv publish
asv preview # opens browser with performance graphs
asv creates a website showing performance over time, making it easy to identify which commit introduced a regression.
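asv discovers benchmarks by naming convention: methods prefixed with `time_` are timed, and `setup` runs beforehand without being measured. A minimal sketch of a benchmark file follows; the class name and the file path in the comment are assumptions, and a real project also needs an asv.conf.json.

```python
# benchmarks/benchmarks.py (hypothetical location inside an asv project)
import json


class SerializationSuite:
    def setup(self):
        # Build the payload outside the timed region.
        self.data = {
            "users": [{"name": f"user_{i}", "age": i} for i in range(1000)]
        }

    def time_json_dumps(self):
        # asv times methods whose names start with "time_".
        json.dumps(self.data)
```

Keeping data construction in `setup` means the graph reflects serialization cost only, not the cost of building the test payload.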
Common misconception: performance tests are flaky
Performance tests are only flaky when the environment is uncontrolled. With proper warmup, sufficient iterations, and statistical thresholds, they can be as reliable as functional tests. The key is accepting that performance is a distribution, not a single number, and setting thresholds based on the variance of your specific benchmarks.
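One way to base a threshold on measured variance is to scale the coefficient of variation of repeated baseline runs. The multiplier of 3 and the 5% floor below are arbitrary assumptions for illustration, not a recommendation from any tool.

```python
import statistics


def noise_pct(samples):
    """Coefficient of variation of the samples, in percent."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100


def suggest_threshold(samples, k=3.0, floor_pct=5.0):
    """Threshold in percent: k times the observed noise, never below the floor."""
    return max(floor_pct, k * noise_pct(samples))


stable_runs = [1.00, 1.01, 0.99, 1.00, 1.02, 0.98]
noisy_runs = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3]

stable_threshold = suggest_threshold(stable_runs)
noisy_threshold = suggest_threshold(noisy_runs)
```

A stable benchmark can carry a tight threshold, while a noisy one needs either a much looser threshold or, better, a quieter environment.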
What to benchmark
Don’t benchmark everything. Focus on:
- Critical user paths — login, search, checkout, API response
- Data-heavy operations — serialization, database queries, file processing
- Known hotspots — functions that profiling has identified as bottlenecks
- Algorithmic code — sorting, searching, graph traversal
Skip:
- Simple getters/setters
- Configuration loading (runs once)
- Test utilities
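For the "known hotspots" item, the standard library's cProfile is enough to find candidates. In this sketch, `handler` and `slow_lookup` are made-up stand-ins for application code and `profile_top` is a hypothetical helper.

```python
import cProfile
import io
import pstats


def slow_lookup(items, targets):
    # Membership tests against a list scan the list each time.
    return sum(1 for t in targets if t in items)


def handler():
    items = list(range(2000))
    targets = list(range(0, 4000, 2))
    return slow_lookup(items, targets)


def profile_top(fn, limit=5):
    """Profile fn() and return the top entries by cumulative time as text."""
    profiler = cProfile.Profile()
    profiler.runcall(fn)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_top(handler)
```

Functions that dominate the cumulative column are the ones worth wrapping in a benchmark; anything that barely registers belongs on the skip list.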
The one thing to remember: automated performance regression tests turn “it seems slower” into “commit abc123 made search 15% slower” — set up pytest-benchmark in CI with a 10% threshold and you’ll catch most regressions before they ship.