Python Alerting Patterns — Core Concepts
Alerting bridges the gap between metrics dashboards and human action. Good alerts detect real problems fast. Bad alerts create noise that trains teams to ignore pages. The difference is design, not tooling.
Alert on symptoms, not causes
The most common mistake is alerting on infrastructure metrics instead of user-visible symptoms.
| Bad alert (cause) | Good alert (symptom) |
|---|---|
| CPU usage > 80% | p95 latency > 500ms for 5 min |
| Disk usage > 90% | Error rate > 1% for 5 min |
| Database connections > 100 | Order success rate < 99% for 10 min |
| Memory usage > 4GB | Homepage load time > 3s for 5 min |
CPU can spike during a deploy and recover. High latency means users are waiting. Alert on what users experience.
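As a rough illustration, the two symptom metrics used most often in this article can be computed from a window of request records. This is a minimal sketch: `Request`, `error_rate`, and `p95_latency_ms` are hypothetical names, and the nearest-rank percentile is one simple choice among several.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # how long the user waited


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that ended in a 5xx response."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return errors / len(requests)


def p95_latency_ms(requests: list[Request]) -> float:
    """Nearest-rank 95th percentile of request latency."""
    if not requests:
        return 0.0
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# 97 fast successes and 3 slow server errors in the window.
window = [Request(200, 120.0)] * 97 + [Request(500, 900.0)] * 3
print(error_rate(window), p95_latency_ms(window))
```

Both numbers describe what users saw, regardless of what CPU or memory were doing at the time.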
Severity levels
Not all problems need the same response:
| Severity | Meaning | Notification | Example |
|---|---|---|---|
| Critical (P1) | Service down or severely degraded | Page on-call immediately | Error rate > 10% for 2 min |
| Warning (P2) | Degraded but functional | Slack channel + ticket | p95 latency > 1s for 10 min |
| Info (P3) | Worth investigating soon | Slack only | Error rate > 0.5% for 30 min |
Most teams need 3-5 alert rules per service, not 30.
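The severity table can be sketched as data plus a small lookup. The names below (`SeverityRule`, `RULES`, `classify`) are hypothetical, and "minutes sustained" is simplified to a single number per metric; a real system would derive it from the time series.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityRule:
    severity: str
    metric: str
    threshold: float
    sustained_min: int
    notification: str


# Ordered most severe first; thresholds copied from the table above.
RULES = (
    SeverityRule("critical", "error_rate", 0.10, 2, "Page on-call immediately"),
    SeverityRule("warning", "p95_latency_s", 1.0, 10, "Slack channel + ticket"),
    SeverityRule("info", "error_rate", 0.005, 30, "Slack only"),
)


def classify(observations: dict[str, tuple[float, int]]) -> Optional[SeverityRule]:
    """Return the most severe matching rule, or None.

    observations maps a metric name to (current value, minutes it has
    stayed at or above that value).
    """
    for rule in RULES:
        if rule.metric not in observations:
            continue
        value, minutes = observations[rule.metric]
        if value > rule.threshold and minutes >= rule.sustained_min:
            return rule
    return None


match = classify({"error_rate": (0.12, 3)})
print(match.severity if match else "no alert")
```

Keeping the rules in one short, ordered structure makes it obvious when a service has drifted from a handful of rules toward thirty.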
Threshold design
Static thresholds
Fixed numbers based on expected behavior:
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
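The `for: 5m` clause prevents alerting on brief spikes. The expression must stay above the threshold for 5 consecutive minutes before the alert fires; until then the alert is only pending, and it resets if the error rate recovers.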
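The effect of a `for` clause can be modelled as a small state machine. The sketch below is a simplification for illustration, not Prometheus's implementation; `ForClauseAlert` and its state names are hypothetical.

```python
from typing import Optional


class ForClauseAlert:
    """Fires only after the value has stayed above a threshold for a set duration."""

    def __init__(self, threshold: float, for_seconds: float) -> None:
        self.threshold = threshold
        self.for_seconds = for_seconds
        self._breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> str:
        """Record one evaluation and return 'inactive', 'pending', or 'firing'."""
        if value <= self.threshold:
            # Any recovery resets the timer, so brief spikes never fire.
            self._breach_started = None
            return "inactive"
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started >= self.for_seconds:
            return "firing"
        return "pending"


alert = ForClauseAlert(threshold=0.01, for_seconds=300)
samples = [(0, 0.02), (60, 0.02), (120, 0.005), (180, 0.02), (480, 0.02)]
print([alert.observe(value, now) for now, value in samples])
```

The dip at the 120-second mark resets the timer, so the alert fires only once the later breach has lasted a full 300 seconds.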
SLO-based alerts (error budgets)
Instead of arbitrary thresholds, alert when you’re burning through your error budget too fast:
- SLO: 99.9% of requests succeed (0.1% error budget per month).
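- Alert: If the current burn rate would consume about 2% of the monthly budget in 1 hour (a 14.4x burn rate over a 30-day window, which would exhaust the budget in roughly 2 days), page immediately.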
# Fast burn: 14.4x budget consumption → pages immediately
- alert: SLOBudgetFastBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
# Slow burn: 6x budget consumption → ticket
- alert: SLOBudgetSlowBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
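The multipliers 14.4 and 6 are burn rates: how many times faster than "exactly on budget" the service is failing. The arithmetic behind them can be sketched as follows, assuming a 30-day SLO window; the function names are hypothetical.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / error_budget(slo)


def hours_to_exhaust(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if this burn rate continues."""
    return window_days * 24 / rate


def budget_fraction_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Share of the window's budget spent after `hours` at this burn rate."""
    return rate * hours / (window_days * 24)


# A 1.44% error rate against a 99.9% SLO is roughly a 14.4x burn.
rate = burn_rate(0.0144, slo=0.999)
print(round(rate, 1), round(hours_to_exhaust(14.4)), round(budget_fraction_consumed(14.4, 1), 3))
```

At 14.4x, one hour spends about 2% of a 30-day budget and the budget lasts about 50 hours. At 6x the budget lasts about 5 days, which is why that case can go to a ticket rather than a page.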
Implementing alerts in Python
Prometheus Alertmanager (most common)
Define alert rules in Prometheus, route notifications via Alertmanager:
# prometheus/alert_rules.yml
groups:
  - name: python-service
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms"
          dashboard: "https://grafana.internal/d/python-svc"
          runbook: "https://wiki.internal/runbooks/high-latency"
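The `histogram_quantile` call in that rule estimates p95 from cumulative bucket counts rather than from raw request timings. A simplified pure-Python sketch of the interpolation is shown below; the function here is an illustration with made-up bucket data, not the Prometheus source, and it ignores edge cases such as negative bounds.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: le=0.1, 0.25, 0.5, 1.0, +Inf.
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 3), p95 > 0.5)
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries. Placing a boundary at the alert threshold (0.5 seconds here) keeps the comparison meaningful.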
Programmatic alerts from Python
For custom business logic alerts:
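import httpx

# Placeholder from the original; keep the real webhook URL in configuration, not in source.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."


def check_order_health():
    """Run periodically via cron or APScheduler."""
    # get_recent_orders() is assumed to be provided by the application.
    recent_orders = get_recent_orders(minutes=10)
    failed = [o for o in recent_orders if o.status == "failed"]
    # max(..., 1) avoids dividing by zero when there were no orders.
    failure_rate = len(failed) / max(len(recent_orders), 1)
    if failure_rate > 0.05:
        send_alert(
            severity="critical",
            title="Order failure rate above 5%",
            details=f"{len(failed)}/{len(recent_orders)} orders failed in last 10 min",
            runbook="https://wiki.internal/runbooks/order-failures",
        )


def send_alert(severity, title, details, runbook):
    response = httpx.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: *[{severity.upper()}]* {title}\n{details}\nRunbook: {runbook}"},
        timeout=10.0,
    )
    # Raise on a non-2xx response so a failed notification is not silently lost.
    response.raise_for_status()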
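A check that runs every minute will post to Slack on every run for as long as the failure rate stays high. One way to limit that is a per-alert cooldown. The sketch below is a minimal in-memory version; `AlertThrottle` is a hypothetical helper, and its state is lost on restart.

```python
import time
from typing import Callable


class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown period."""

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_sent: dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        """Return True and record the send time if the alert is not in cooldown."""
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[alert_key] = now
        return True


throttle = AlertThrottle(cooldown_seconds=600)
if throttle.should_send("order-failure-rate"):
    print("would call send_alert(...) here")
```

In a scheduled job, the throttle would wrap the call to `send_alert` so that one ongoing incident produces one message per cooldown period instead of one per run.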
Alert routing
Alertmanager routes alerts to the right team based on labels:
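# alertmanager.yml
route:
  receiver: default-slack
  routes:
    # Routes are tried in order; the first match wins unless a route sets continue: true.
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: engineering-slack
      repeat_interval: 30m
    # Reached only by payments alerts that are neither critical nor warning.
    - match:
        team: payments
      receiver: payments-team-slack
receivers:
  # The root route's receiver must be defined; this channel name is a placeholder.
  - name: default-slack
    slack_configs:
      - channel: "#alerts-default"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: "abc123"  # placeholder; load the real key from a secret
  - name: engineering-slack
    slack_configs:
      - channel: "#alerts"
  - name: payments-team-slack
    slack_configs:
      - channel: "#payments-alerts"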
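Route order matters because Alertmanager stops at the first matching child route unless that route sets `continue: true`. The sketch below models that behavior for a single level of routes. `Route` and `resolve_receivers` are hypothetical names, and nested routes, regex matchers, and grouping are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    continue_matching: bool = False  # models Alertmanager's `continue`


def resolve_receivers(labels: dict[str, str], default_receiver: str,
                      routes: list[Route]) -> list[str]:
    """Return the receivers an alert with these labels would be sent to."""
    receivers = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route.match.items()):
            receivers.append(route.receiver)
            if not route.continue_matching:
                break  # first match wins
    # With no matching child route, the parent route's receiver handles the alert.
    return receivers or [default_receiver]


severity_first = [
    Route("pagerduty-oncall", {"severity": "critical"}),
    Route("engineering-slack", {"severity": "warning"}),
    Route("payments-team-slack", {"team": "payments"}),
]
alert_labels = {"severity": "critical", "team": "payments"}
print(resolve_receivers(alert_labels, "default-slack", severity_first))
```

With severity routes listed first, a critical payments alert reaches only the pager. Putting the team route first with `continue_matching=True` would notify the payments channel and then carry on to the severity routes.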
Runbooks
Every alert should link to a runbook — a document that tells the responder what to do:
- What is this alert? One-sentence explanation.
- Who is affected? Users, internal services, batch jobs.
- What to check first? Dashboard link, key queries.
- Common causes and fixes. Step-by-step remediation.
- Escalation path. Who to contact if you can’t resolve it.
Without runbooks, alerts just create panic. With them, even a junior engineer can start diagnosing at 3 AM.
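A simple way to enforce the runbook rule is a check in CI that rejects alert rules without one. The sketch below assumes the rules have already been loaded into plain dictionaries (for example from the YAML rule file) and that the annotation keys are `summary` and `runbook`; the function name is hypothetical.

```python
REQUIRED_ANNOTATIONS = ("summary", "runbook")


def rules_missing_annotations(rules: list[dict]) -> dict[str, list[str]]:
    """Map each alert name to the required annotations it lacks."""
    problems: dict[str, list[str]] = {}
    for rule in rules:
        annotations = rule.get("annotations", {})
        missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if missing:
            problems[rule["alert"]] = missing
    return problems


rules = [
    {
        "alert": "HighLatency",
        "annotations": {
            "summary": "p95 latency above 500ms",
            "runbook": "https://wiki.internal/runbooks/high-latency",
        },
    },
    {"alert": "HighErrorRate", "labels": {"severity": "critical"}},
]
print(rules_missing_annotations(rules))
```

Failing the build when the result is non-empty keeps new alerts from shipping without instructions for the responder.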
Common misconception
“More alerts mean better coverage.” The opposite is usually true. As a rule of thumb, teams with 50+ alert rules see around 80% noise: alerts that fire often but require no action. Each noisy alert erodes trust in the entire system. Start with 3-5 alerts per service, make them meaningful, and add more only when you find a gap.
One thing to remember: a good alert has three properties. It fires when something is actually broken, it tells you enough to start fixing it, and it goes to someone who can fix it. If any of these is missing, the alert is noise.
See Also
- Python Correlation Ids: Correlation IDs are name tags for requests; they let you follow one visitor's journey through a crowded theme park of services.
- Python Grafana Dashboards: Python Grafana turns boring numbers from your Python app into colorful, real-time dashboards, like a car's dashboard but for your code.
- Python Log Aggregation Elk: ELK collects scattered log files from all your services into one searchable place, like gathering every sticky note in the office into a single filing cabinet.
- Python Logging Best Practices: Treat logs like a flight recorder so you can understand failures after they happen, not just during development.
- Python Logging Handlers: Think of logging handlers as mailboxes that decide where your app's messages end up: screen, file, or faraway server.