Python Alerting Patterns — Core Concepts

Alerting bridges the gap between metrics dashboards and human action. Good alerts detect real problems fast. Bad alerts create noise that trains teams to ignore pages. The difference is design, not tooling.

Alert on symptoms, not causes

The most common mistake is alerting on infrastructure metrics instead of user-visible symptoms.

| Bad alert (cause) | Good alert (symptom) |
|---|---|
| CPU usage > 80% | p95 latency > 500ms for 5 min |
| Disk usage > 90% | Error rate > 1% for 5 min |
| Database connections > 100 | Order success rate < 99% for 10 min |
| Memory usage > 4GB | Homepage load time > 3s for 5 min |

CPU can spike during a deploy and recover. High latency means users are waiting. Alert on what users experience.

Severity levels

Not all problems need the same response:

| Severity | Meaning | Notification | Example |
|---|---|---|---|
| Critical (P1) | Service down or severely degraded | Page on-call immediately | Error rate > 10% for 2 min |
| Warning (P2) | Degraded but functional | Slack channel + ticket | p95 latency > 1s for 10 min |
| Info (P3) | Worth investigating soon | Slack only | Error rate > 0.5% for 30 min |

Most teams need 3-5 alert rules per service, not 30.
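
As a rough illustration, the severity table can be kept in code as a small lookup so that every alert path notifies the same way. The names below (SeverityPolicy, POLICIES, channels_for, and the channel labels) are hypothetical and only mirror the table; they are not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    priority: str             # P1 / P2 / P3, as in the table
    notify: tuple[str, ...]   # hypothetical channel labels


# Mirrors the severity table; the channel labels are illustrative only.
POLICIES = {
    "critical": SeverityPolicy("P1", ("pager",)),
    "warning": SeverityPolicy("P2", ("slack", "ticket")),
    "info": SeverityPolicy("P3", ("slack",)),
}


def channels_for(severity: str) -> tuple[str, ...]:
    """Return where to notify; unknown severities fall back to info."""
    return POLICIES.get(severity.lower(), POLICIES["info"]).notify
```

Falling back to the least disruptive policy is one possible design choice; a stricter variant could raise on unknown severities instead.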

Threshold design

Static thresholds

Fixed numbers based on expected behavior:

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical

The for: 5m clause prevents alerting on brief spikes. The error must persist for 5 minutes before firing.
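
The same "must persist" idea can be sketched in Python for alerts evaluated outside Prometheus. This is a minimal model, not Prometheus's implementation: PendingAlert is a hypothetical helper, and timestamps are passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Fires only once the condition has held for hold_seconds."""
    hold_seconds: float
    pending_since: Optional[float] = None

    def update(self, condition: bool, now: float) -> bool:
        if not condition:
            # Any recovery resets the timer, so brief spikes never fire.
            self.pending_since = None
            return False
        if self.pending_since is None:
            self.pending_since = now
        return now - self.pending_since >= self.hold_seconds


alert = PendingAlert(hold_seconds=300)
alert.update(True, now=0)     # condition starts: pending, not firing
alert.update(True, now=299)   # still pending
alert.update(True, now=300)   # held for 5 minutes: fires
```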

SLO-based alerts (error budgets)

Instead of arbitrary thresholds, alert when you’re burning through your error budget too fast:

  • SLO: 99.9% of requests succeed, which leaves a 0.1% error budget per month.
  • Alert: If errors arrive at 14.4x the sustainable rate, fast enough to use about 2% of the monthly budget in 1 hour and all of it in roughly two days, page immediately.

# Fast burn: 14.4x budget consumption → pages immediately
- alert: SLOBudgetFastBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical

# Slow burn: 6x budget consumption → ticket
- alert: SLOBudgetSlowBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
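
The multipliers in these rules come from simple arithmetic. The sketch below assumes a 30-day SLO window and a 0.1% error budget; the function names are illustrative.

```python
ERROR_BUDGET = 0.001      # 99.9% SLO leaves 0.1% of requests
WINDOW_HOURS = 30 * 24    # assumption: 30-day SLO window


def burn_rate(error_rate: float, budget: float = ERROR_BUDGET) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


def budget_consumed(rate: float, hours: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the budget spent after `hours` at the given burn rate."""
    return rate * hours / window_hours
```

Under these assumptions, a 1.44% error rate is a 14.4x burn: it spends about 2% of the budget per hour and exhausts it in about 50 hours. A 6x burn exhausts it in about 5 days, which is why it can go to a ticket rather than a page.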

Implementing alerts in Python

Prometheus Alertmanager (most common)

Define alert rules in Prometheus, route notifications via Alertmanager:

# prometheus/alert_rules.yml
groups:
  - name: python-service
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms"
          dashboard: "https://grafana.internal/d/python-svc"
          runbook: "https://wiki.internal/runbooks/high-latency"
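
To make the HighLatency expression less opaque, here is a simplified Python sketch of the kind of estimate histogram_quantile produces from cumulative buckets: find the bucket that contains the target rank, then interpolate linearly inside it. The function name and sample counts are illustrative, and the edge-case handling is approximate rather than a copy of Prometheus's implementation.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[idx]
    if math.isinf(upper):
        # Rank falls beyond the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    return lower + (upper - lower) * (rank - below) / in_bucket


# Illustrative data: 100 requests with bucket bounds in seconds.
sample = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)   # lands in the 0.5s to 1.0s bucket
```

With this sample the p95 estimate is about 0.81 seconds, so the `> 0.5` condition would hold. The estimate is only as precise as the bucket boundaries, which is a reason to place a boundary near the alert threshold.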

Programmatic alerts from Python

For custom business logic alerts:

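import httpx

# Webhook path is elided here; load the real URL from configuration or a secret store.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def check_order_health():
    """Run periodically via cron or APScheduler."""
    # get_recent_orders() is assumed to be provided by your application.
    recent_orders = get_recent_orders(minutes=10)
    failed = [o for o in recent_orders if o.status == "failed"]
    # max(..., 1) avoids division by zero when there were no orders.
    failure_rate = len(failed) / max(len(recent_orders), 1)

    if failure_rate > 0.05:
        send_alert(
            severity="critical",
            title="Order failure rate above 5%",
            details=f"{len(failed)}/{len(recent_orders)} orders failed in last 10 min",
            runbook="https://wiki.internal/runbooks/order-failures"
        )

def send_alert(severity, title, details, runbook):
    response = httpx.post(SLACK_WEBHOOK_URL, json={
        "text": f":rotating_light: *[{severity.upper()}]* {title}\n{details}\nRunbook: {runbook}"
    }, timeout=10.0)
    # Raise on a non-2xx response so a failed delivery is not silently dropped.
    response.raise_for_status()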
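
Because check_order_health() runs on a schedule, it would post to Slack on every run for as long as the failure rate stays high. One way to limit that is a per-alert cooldown. The sketch below is a hypothetical helper (AlertThrottle is not from any library) and keeps its state in memory, so it assumes a long-running scheduler process rather than a fresh cron process per run.

```python
import time
from typing import Callable


class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock                      # injectable for tests
        self._last_sent: dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False                         # still cooling down
        self._last_sent[alert_key] = now
        return True


throttle = AlertThrottle(cooldown_seconds=1800)
if throttle.should_send("order-failure-rate"):
    pass  # call send_alert(...) here
```

With cron, the last-sent timestamps would need to live somewhere persistent, such as a file or a cache, for the cooldown to have any effect.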

Alert routing

Alertmanager routes alerts to the right team based on labels:

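# alertmanager.yml
route:
  receiver: default-slack
  routes:
    # Sibling routes are tried in order and stop at the first match
    # (unless a route sets continue: true), so the payments route below
    # only sees alerts that are neither critical nor warning.
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: engineering-slack
      repeat_interval: 30m
    - match:
        team: payments
      receiver: payments-team-slack

receivers:
  # Every receiver named in a route must be defined, including the default.
  - name: default-slack
    slack_configs:
      - channel: "#alerts-default"  # placeholder channel name
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: "abc123"
  - name: engineering-slack
    slack_configs:
      - channel: "#alerts"
  - name: payments-team-slack
    slack_configs:
      - channel: "#payments-alerts"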
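
Route order matters more than it looks. The following is a simplified Python model of how sibling routes are evaluated: in order, stopping at the first match unless that route sets continue: true. Route and resolve_receivers are illustrative names, and the model ignores nested routes, regex matchers, and grouping.

```python
from dataclasses import dataclass


@dataclass
class Route:
    match: dict[str, str]
    receiver: str
    continue_matching: bool = False   # models `continue: true`


def resolve_receivers(labels: dict[str, str], routes: list[Route],
                      default: str) -> list[str]:
    """Return the receivers an alert with these labels would reach."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route.match.items()):
            receivers.append(route.receiver)
            if not route.continue_matching:
                break                 # first match wins
    return receivers or [default]


routes = [
    Route({"severity": "critical"}, "pagerduty-oncall"),
    Route({"severity": "warning"}, "engineering-slack"),
    Route({"team": "payments"}, "payments-team-slack"),
]
alert = {"severity": "critical", "team": "payments"}
resolve_receivers(alert, routes, "default-slack")   # ["pagerduty-oncall"]
```

In this ordering a critical payments alert pages on-call and never reaches the payments channel. If both should be notified, the team route can go first with continue enabled.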

Runbooks

Every alert should link to a runbook — a document that tells the responder what to do:

  1. What is this alert? One-sentence explanation.
  2. Who is affected? Users, internal services, batch jobs.
  3. What to check first? Dashboard link, key queries.
  4. Common causes and fixes. Step-by-step remediation.
  5. Escalation path. Who to contact if you can’t resolve it.

Without runbooks, alerts just create panic. With them, even a junior engineer can start diagnosing at 3 AM.
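
A small check in CI can keep this rule from eroding. The sketch below assumes the rule file has already been parsed into Python dictionaries (for example with a YAML library) and that the link lives under a `runbook` annotation, as in the HighLatency rule earlier; rules_missing_runbook is a hypothetical helper name.

```python
def rules_missing_runbook(groups: list[dict]) -> list[str]:
    """Return the names of alert rules that lack a runbook annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules do not need a runbook
            annotations = rule.get("annotations") or {}
            runbook = annotations.get("runbook", "")
            if not str(runbook).strip():
                missing.append(rule["alert"])
    return missing


parsed = [{
    "name": "python-service",
    "rules": [
        {"alert": "HighLatency",
         "annotations": {"runbook": "https://wiki.internal/runbooks/high-latency"}},
        {"alert": "HighErrorRate", "labels": {"severity": "critical"}},
    ],
}]
missing = rules_missing_runbook(parsed)   # ["HighErrorRate"]
```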

Common misconception

“More alerts mean better coverage.” The opposite is true. Teams with 50+ alert rules can easily end up with something like 80% noise: alerts that fire often but require no action. Each noisy alert degrades trust in the entire system. Start with 3-5 alerts per service, make them meaningful, and add more only when an incident reveals a gap that the existing alerts missed.

One thing to remember: A good alert has three properties: it fires when something is actually broken, it tells you enough to start fixing it, and it goes to someone who can fix it. If any of these is missing, the alert is noise.

