Python Alerting Patterns — Core Concepts

Design alerts that catch real incidents without drowning your team in noise — covering thresholds, SLOs, routing, and runbooks for Python services.

Alerting bridges the gap between metrics dashboards and human action. Good alerts detect real problems fast. Bad alerts create noise that trains teams to ignore pages. The difference is design, not tooling.

Alert on symptoms, not causes

The most common mistake is alerting on infrastructure metrics instead of user-visible symptoms.

Bad alert (cause)	Good alert (symptom)
CPU usage > 80%	p95 latency > 500ms for 5 min
Disk usage > 90%	Error rate > 1% for 5 min
Database connections > 100	Order success rate < 99% for 10 min
Memory usage > 4GB	Homepage load time > 3s for 5 min

CPU can spike during a deploy and recover. High latency means users are waiting. Alert on what users experience.
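To make "what users experience" concrete, here is a minimal sketch that derives the two symptom signals from the table (p95 latency and error rate) from a window of request records. The `Request` type and the function names are hypothetical and used only for illustration; in a real deployment these values would normally come from Prometheus histograms and counters rather than in-process lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the user waited for a response
    status: int        # HTTP status code returned


def p95_latency(requests: list[Request]) -> float:
    """Nearest-rank 95th percentile of request duration, in seconds."""
    if not requests:
        return 0.0
    durations = sorted(r.duration_s for r in requests)
    rank = (95 * len(durations) + 99) // 100  # integer ceil of 0.95 * n (1-based)
    return durations[rank - 1]


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if 500 <= r.status < 600)
    return errors / len(requests)


def symptoms(requests: list[Request]) -> list[str]:
    """Return the user-visible symptoms present in this window of requests."""
    found = []
    if p95_latency(requests) > 0.5:   # 500 ms threshold from the table
        found.append("p95 latency > 500ms")
    if error_rate(requests) > 0.01:   # 1% threshold from the table
        found.append("error rate > 1%")
    return found
```

Neither function looks at CPU, memory, or disk: a host can be busy while `symptoms()` stays empty, and that is the case where nobody needs to be paged.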

Severity levels

Not all problems need the same response:

Severity	Meaning	Notification	Example
Critical (P1)	Service down or severely degraded	Page on-call immediately	Error rate > 10% for 2 min
Warning (P2)	Degraded but functional	Slack channel + ticket	p95 latency > 1s for 10 min
Info (P3)	Worth investigating soon	Slack only	Error rate > 0.5% for 30 min

Most teams need 3-5 alert rules per service, not 30.

Threshold design

Static thresholds

Fixed numbers based on expected behavior:

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical

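The for: 5m clause prevents alerting on brief spikes. The expression must be true at every rule evaluation for 5 minutes before the alert fires. Until then the alert is pending, and a single evaluation where the expression is false resets the timer.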
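The behavior of the `for:` clause can be modeled in a few lines of Python. This is a simplified sketch of the pending/firing state machine, not Prometheus's actual implementation; `PendingAlert` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Tracks how long a condition has been continuously true."""
    hold_seconds: float                   # `for: 5m` corresponds to 300
    active_since: Optional[float] = None  # timestamp of the first true evaluation

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if not condition_true:
            self.active_since = None      # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"
```

A spike that clears within the hold period never leaves the pending state, so it never reaches a human.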

SLO-based alerts (error budgets)

Instead of arbitrary thresholds, alert when you’re burning through your error budget too fast:

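SLO: 99.9% of requests succeed over a 30-day window, which leaves a 0.1% error budget.
Alert: A burn rate of 1x spends the budget exactly over the 30 days. At 14.4x, 2% of the monthly budget is consumed in 1 hour and all of it in about 2 days, so page immediately.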

# Fast burn: 14.4x budget consumption → pages immediately
- alert: SLOBudgetFastBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical

# Slow burn: 6x budget consumption → ticket
- alert: SLOBudgetSlowBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
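The multipliers 14.4 and 6 in these rules are burn rates: how many times faster than "exactly on budget" the service is failing. The arithmetic behind them is sketched below, assuming a 30-day budget period; the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent at this burn rate over the window."""
    return rate * window_hours / period_hours


def hours_to_exhaustion(rate: float, period_hours: float = 30 * 24) -> float:
    """Hours until the whole budget is gone if the burn rate holds."""
    return period_hours / rate
```

For a 99.9% SLO, a sustained 1.44% error rate is a 14.4x burn: it spends 2% of the 30-day budget per hour and would exhaust it in 50 hours. A 6x burn would exhaust it in 120 hours (5 days), which is why it can wait for a ticket.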

Implementing alerts in Python

Prometheus Alertmanager (most common)

Define alert rules in Prometheus, route notifications via Alertmanager:

# prometheus/alert_rules.yml
groups:
  - name: python-service
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms"
          dashboard: "https://grafana.internal/d/python-svc"
          runbook: "https://wiki.internal/runbooks/high-latency"

Programmatic alerts from Python

For custom business logic alerts:

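import httpx

# Placeholder: substitute your real Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def check_order_health():
    """Run periodically via cron or APScheduler."""
    # get_recent_orders() is application-specific: it should return the
    # orders from the last N minutes, each with a .status attribute.
    recent_orders = get_recent_orders(minutes=10)
    failed = [o for o in recent_orders if o.status == "failed"]
    # max(..., 1) avoids ZeroDivisionError when there were no orders.
    failure_rate = len(failed) / max(len(recent_orders), 1)

    if failure_rate > 0.05:
        send_alert(
            severity="critical",
            title="Order failure rate above 5%",
            details=f"{len(failed)}/{len(recent_orders)} orders failed in last 10 min",
            runbook="https://wiki.internal/runbooks/order-failures",
        )

def send_alert(severity, title, details, runbook):
    """Post the alert to a Slack incoming webhook."""
    response = httpx.post(
        SLACK_WEBHOOK_URL,
        json={
            "text": f":rotating_light: *[{severity.upper()}]* {title}\n{details}\nRunbook: {runbook}"
        },
        timeout=10.0,  # do not let a slow webhook hang the check
    )
    # Fail loudly: a silently dropped alert is worse than a crashed check.
    response.raise_for_status()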
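A ratio-based check like this one is sensitive to low traffic: one failed order out of one is a 100% failure rate. One possible guard is to require a minimum number of orders before the ratio is trusted. The sketch below assumes a hypothetical `should_alert` helper and an arbitrary `min_orders` default of 20; both would need tuning for a real service.

```python
def should_alert(failed: int, total: int,
                 threshold: float = 0.05, min_orders: int = 20) -> bool:
    """Decide whether the failure ratio is both high and statistically meaningful."""
    if total < min_orders:
        # Too few orders to trust the ratio; stay quiet here.
        return False
    return failed / total > threshold
```

An unexpectedly low order count can itself be a symptom, so a guard like this may need a separate low-volume check alongside it.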

Alert routing

Alertmanager routes alerts to the right team based on labels:

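# alertmanager.yml
route:
  receiver: default-slack        # fallback when no child route matches
  routes:
    # Child routes are tried in order and matching stops at the first hit
    # unless `continue: true` is set. The severity routes continue so that
    # the team route below them can also match.
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 5m
      continue: true
    - match:
        severity: warning
      receiver: engineering-slack
      repeat_interval: 30m
      continue: true
    - match:
        team: payments
      receiver: payments-team-slack

receivers:
  # Every receiver named in a route must be defined here.
  - name: default-slack
    slack_configs:
      - channel: "#alerts-unrouted"   # placeholder channel name
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: "abc123"         # placeholder; load real keys from a secret store
  - name: engineering-slack
    slack_configs:
      - channel: "#alerts"
  - name: payments-team-slack
    slack_configs:
      - channel: "#payments-alerts"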
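Route order matters. Alertmanager checks sibling routes top to bottom and stops at the first match unless that route sets `continue: true`. The sketch below models only that rule (flat routes, exact label matching); `Route` and `resolve_receivers` are hypothetical names, not part of any Alertmanager API.

```python
from dataclasses import dataclass


@dataclass
class Route:
    match: dict[str, str]            # labels that must all equal these values
    receiver: str
    continue_matching: bool = False  # mirrors Alertmanager's `continue` flag


def resolve_receivers(labels: dict[str, str], routes: list[Route],
                      default: str) -> list[str]:
    """Return the receivers an alert with these labels would be sent to."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route.match.items()):
            receivers.append(route.receiver)
            if not route.continue_matching:
                break                # first match wins unless `continue` is set
    return receivers or [default]    # nothing matched: use the top-level receiver


# A warning from the payments team, evaluated against two route orders.
alert = {"severity": "warning", "team": "payments"}

first_match_only = [
    Route({"severity": "warning"}, "engineering-slack"),
    Route({"team": "payments"}, "payments-team-slack"),
]
with_continue = [
    Route({"severity": "warning"}, "engineering-slack", continue_matching=True),
    Route({"team": "payments"}, "payments-team-slack"),
]
```

With `first_match_only`, the payments team never sees its own warning because the severity route claims it first. With `with_continue`, both channels receive it.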

Runbooks

Every alert should link to a runbook — a document that tells the responder what to do:

What is this alert? One-sentence explanation.
Who is affected? Users, internal services, batch jobs.
What to check first? Dashboard link, key queries.
Common causes and fixes. Step-by-step remediation.
Escalation path. Who to contact if you can’t resolve it.

Without runbooks, alerts just create panic. With them, even a junior engineer can start diagnosing at 3 AM.
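The "every alert links to a runbook" rule is easy to enforce mechanically, for example in CI. The sketch below assumes the rule files have already been parsed into dictionaries (for instance with a YAML library) and that the link lives in a `runbook` annotation, as in the HighLatency rule earlier; `rules_missing_runbook` is a hypothetical helper.

```python
def rules_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules with no non-empty `runbook` annotation."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        if not str(annotations.get("runbook", "")).strip():
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


rules = [
    {"alert": "HighLatency",
     "annotations": {"runbook": "https://wiki.internal/runbooks/high-latency"}},
    {"alert": "HighErrorRate", "labels": {"severity": "critical"}},
]
```

Here `rules_missing_runbook(rules)` reports `HighErrorRate`, which a CI job could treat as a failure before the rule ever reaches production.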

Common misconception

“More alerts mean better coverage.” The opposite is true. Teams with 50+ alert rules often find that most of them are noise: alerts that fire frequently but require no action. Each noisy alert degrades trust in the entire system. Start with 3-5 alerts per service, make them meaningful, and add more only when you find a gap.

One thing to remember: A good alert has three properties: it fires when something is actually broken, it tells you enough to start fixing it, and it goes to someone who can fix it. If any of these are missing, the alert is noise.
