Prometheus Metrics in Python — Deep Dive

Design production Prometheus instrumentation in Python with custom collectors, multiprocess mode, exemplars, and cardinality-safe label strategies.

Prometheus instrumentation in Python has subtleties that do not appear in tutorials. Multiprocess deployments break the default in-memory metric storage. High-cardinality labels silently degrade Prometheus performance. And the difference between a histogram and a summary matters more than most teams realize until they try to aggregate percentiles across replicas.

Multiprocess mode

The default prometheus_client stores metrics in process memory. This works for single-process services but breaks with Gunicorn (pre-fork model), where each worker is a separate process with its own counters.

The solution is multiprocess mode, which uses memory-mapped files:

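import os

# Must be set before prometheus_client is imported, and the directory must exist.
# In production, set it in the service environment rather than in code.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = "/tmp/prometheus_multiproc"

from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)

def metrics_app(environ, start_response):
    # Build a fresh registry per scrape; the collector reads every worker's files.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    data = generate_latest(registry)
    start_response("200 OK", [("Content-Type", CONTENT_TYPE_LATEST)])
    return [data]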

Each worker writes its metrics to its own memory-mapped files in the shared directory. The metrics endpoint reads all of the files and aggregates them at scrape time. Critical details:

Clean the directory on startup. Stale files from previous runs cause phantom metrics.
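Gauges need aggregation modes. Use the multiprocess_mode parameter: "all" (one series per pid, the default), "liveall" (only living pids), "livesum", "max", "min". Recent client versions also accept "sum", "livemin", "livemax", and "mostrecent".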
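Summaries give you no percentiles. The Python client's Summary exposes only _count and _sum, in multiprocess mode as in any other, so use histograms when you need quantiles. Custom collectors and Info/Enum metrics are not supported in multiprocess mode.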

from prometheus_client import Gauge

ACTIVE_REQUESTS = Gauge(
    "active_requests", "Currently active requests",
    multiprocess_mode="livesum"
)
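
To make the aggregation modes concrete, here is a toy model in plain Python of what each mode would report for one gauge, given per-process values and the set of processes still alive. It does not use prometheus_client; the aggregate_gauge function and the sample pids are illustrative assumptions, not the library's implementation.

```python
def aggregate_gauge(values_by_pid, live_pids, mode):
    """Toy model of gauge aggregation across worker processes.

    values_by_pid: {pid: last value written by that process}
    live_pids: pids that have not been marked dead
    Returns {pid: value} for per-pid modes, otherwise a single number.
    """
    live = {pid: v for pid, v in values_by_pid.items() if pid in live_pids}
    if mode == "all":        # one series per pid, dead processes included
        return dict(values_by_pid)
    if mode == "liveall":    # one series per living pid
        return live
    if mode == "livesum":    # single series: sum over living pids
        return sum(live.values())
    if mode == "max":        # single series: max over all pids
        return max(values_by_pid.values())
    if mode == "min":        # single series: min over all pids
        return min(values_by_pid.values())
    raise ValueError(f"unknown mode: {mode}")


# Three workers tracked in-flight requests; worker 103 has since exited.
in_flight = {101: 4, 102: 2, 103: 7}
alive = {101, 102}

print(aggregate_gauge(in_flight, alive, "livesum"))  # 6
print(aggregate_gauge(in_flight, alive, "max"))      # 7
```

For a gauge such as active_requests, "livesum" is usually the right choice: without it, the last value written by a dead worker (7 here) would keep contributing to the reported number.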

Gunicorn child exit hook

Tell the client when a worker exits so the files behind its live gauges are removed:

# gunicorn.conf.py
from prometheus_client import multiprocess

def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)
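
The "clean the directory on startup" advice could live in the same config file. The sketch below assumes the client's metric files end in .db and that the directory comes from PROMETHEUS_MULTIPROC_DIR; reset_multiproc_dir is a hypothetical helper, while on_starting is the Gunicorn server hook that runs in the master process before workers are forked.

```python
# gunicorn.conf.py (sketch)
import glob
import os


def reset_multiproc_dir(path):
    """Create the metrics directory if missing and delete stale *.db files.

    Returns the number of files removed.
    """
    os.makedirs(path, exist_ok=True)
    stale = glob.glob(os.path.join(path, "*.db"))
    for db_file in stale:
        os.remove(db_file)
    return len(stale)


def on_starting(server):
    # Runs once in the master process, before any worker exists,
    # so no live worker can be writing to the files being removed.
    reset_multiproc_dir(os.environ["PROMETHEUS_MULTIPROC_DIR"])
```

Doing the cleanup in the master rather than in each worker avoids one worker deleting files that a sibling has already started writing.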

Custom collectors

For metrics that are expensive to compute or come from external sources, custom collectors avoid continuous computation:

from prometheus_client.core import GaugeMetricFamily, REGISTRY

class DatabasePoolCollector:
    def collect(self):
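        # get_pool_stats() stands in for your pool's own stats call;
        # it runs only when the registry is collected (i.e. during a scrape).
        pool_stats = get_pool_stats()
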
        gauge = GaugeMetricFamily(
            "db_pool_connections",
            "Database connection pool stats",
            labels=["state"]
        )
        gauge.add_metric(["active"], pool_stats.active)
        gauge.add_metric(["idle"], pool_stats.idle)
        gauge.add_metric(["waiting"], pool_stats.waiting)
        yield gauge

REGISTRY.register(DatabasePoolCollector())

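Custom collectors run when the registry is collected rather than on a timer, so the expensive computation happens once per scrape (and once per scraping server, if several scrape the same target). The default REGISTRY also calls collect() once at registration to discover metric names, unless the collector defines a describe() method.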

Histogram bucket design

Bucket boundaries determine the accuracy of quantile calculations. Poor bucket choices produce misleading percentiles.
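
To see why, consider how a quantile is estimated from bucket counters. The sketch below is a simplified plain-Python model of the linear interpolation that histogram_quantile() performs; the estimate_quantile name and the sample data are illustrative, and it assumes the lowest bucket starts at 0.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Nothing is known above the last finite boundary.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 requests that all took 0.3 s, recorded with two bucket layouts.
coarse = [(0.1, 0), (1.0, 100), (math.inf, 100)]
fine = [(0.1, 0), (0.25, 0), (0.5, 100), (1.0, 100), (math.inf, 100)]

print(estimate_quantile(0.99, coarse))  # close to 0.99, far from the true 0.3
print(estimate_quantile(0.99, fine))    # close to 0.50, a tighter overestimate
```

The estimate can never be more precise than the width of the bucket the quantile falls into, which is why boundaries belong where the interesting values are.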

Strategy: SLA-driven buckets

Define buckets around your SLA thresholds:

from prometheus_client import Histogram

# If SLA is "99% of requests under 500ms"
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "Request duration",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0]
)

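A boundary placed exactly at the SLA threshold (0.5 s) lets you read the fraction of requests within the SLA directly from that bucket's counter, with no interpolation. The neighbouring boundaries at 0.25 and 0.75 keep quantile estimates near the threshold reasonably tight; add more boundaries there if you need finer resolution.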

Strategy: Exponential buckets

For metrics that span several orders of magnitude:

from prometheus_client import Histogram

# Doubling buckets: 0.01, 0.02, 0.04, 0.08, ..., 10.24
# (prometheus_client has no exponential-bucket helper, so build the list directly)
PROCESS_TIME = Histogram(
    "batch_process_seconds",
    "Batch processing time",
    buckets=[0.01 * 2**i for i in range(11)],
)

Histogram vs Summary tradeoff

Histograms allow server-side quantile calculation via histogram_quantile(). This means you can aggregate across instances:

# P99 across all replicas — works with histograms
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

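Summaries, in the client libraries that support quantiles, compute them inside each process, and precomputed quantiles cannot be aggregated. A p99 from instance A and a p99 from instance B cannot be combined into a meaningful p99. The Python client's Summary exposes only _count and _sum, with no quantiles at all. Use histograms unless you have a specific reason not to.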
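
A small plain-Python experiment illustrates the aggregation problem. The latency samples and helper names below are made up for illustration: two instances with very different traffic produce per-instance p99 values whose average is far from the true fleet-wide p99, while their bucket counts simply add up.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling division
    return ordered[rank - 1]


BOUNDS = [0.1, 0.5, 1.0, float("inf")]


def bucket_counts(values):
    """Cumulative counts per upper bound, like a histogram's _bucket series."""
    return [sum(v <= bound for v in values) for bound in BOUNDS]


instance_a = [0.05] * 990 + [0.2] * 10   # busy instance, mostly fast
instance_b = [0.05] * 95 + [2.0] * 5     # quiet instance with a few slow calls

p99_a = percentile(instance_a, 99)                   # 0.05
p99_b = percentile(instance_b, 99)                   # 2.0
true_p99 = percentile(instance_a + instance_b, 99)   # 0.2

print((p99_a + p99_b) / 2, "vs true", true_p99)

# Bucket counters are additive, so the merged histogram is exact.
summed = [a + b for a, b in zip(bucket_counts(instance_a), bucket_counts(instance_b))]
print(summed == bucket_counts(instance_a + instance_b))  # True
```

This additivity is what sum(...) by (le) relies on in the PromQL query above.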

Exemplars for trace correlation

Exemplars attach trace IDs to metric samples, bridging metrics and traces:

from prometheus_client import Histogram
from opentelemetry import trace

REQUEST_DURATION = Histogram("http_request_duration_seconds", "Request duration")

span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, "032x")

REQUEST_DURATION.observe(0.25, exemplar={"traceID": trace_id})

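Exemplars are emitted only in the OpenMetrics exposition format, so the metrics endpoint must serve the output of prometheus_client.openmetrics.exposition.generate_latest, and Prometheus must run with exemplar storage enabled (--enable-feature=exemplar-storage). In Grafana, clicking on an exemplar point shows the associated trace ID, letting you jump from “p99 latency spiked” to the specific slow trace.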
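
One practical wrinkle: when no span is active, OpenTelemetry reports a trace ID of 0, and attaching an all-zero ID yields exemplars that link nowhere. A small guard, sketched here in plain Python, returns None in that case so the observation can be recorded without an exemplar. exemplar_for is a hypothetical helper, and the traceID label name simply follows the example above.

```python
def exemplar_for(trace_id):
    """Return an exemplar label dict for a valid trace ID, else None.

    trace_id is the integer form used by OpenTelemetry span contexts;
    0 means "no valid trace".
    """
    if not trace_id:
        return None
    # 128-bit trace IDs are conventionally rendered as 32 zero-padded hex digits.
    return {"traceID": format(trace_id, "032x")}


print(exemplar_for(0))                    # None
print(len(exemplar_for(0xABC)["traceID"]))  # 32
```

The result can be passed straight through, as in REQUEST_DURATION.observe(0.25, exemplar=exemplar_for(trace_id)), since observe() treats exemplar=None as "no exemplar".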

Cardinality management

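Each unique combination of metric name and label values creates a time series. Memory use and query latency grow with the number of active series; as a rough rule of thumb, a single Prometheus server needs careful sizing once it holds more than 1-2 million of them.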

Cardinality estimation

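max series per metric = label1_cardinality × label2_cardinality × ...

A metric with 3 labels of cardinality (5, 200, 3) can produce up to 5 × 200 × 3 = 3,000 series. Adding a user_id label with 100K users raises that ceiling to 300 million series, enough to overwhelm a Prometheus server. This is an upper bound: only label combinations that actually occur create series.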
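
The arithmetic is easy to automate as a review-time check. The sketch below is plain Python; max_series, within_budget, and the 10,000-series budget are illustrative assumptions rather than anything Prometheus prescribes.

```python
import math


def max_series(label_cardinalities):
    """Upper bound on series for one metric: product of label cardinalities."""
    return math.prod(label_cardinalities)


def within_budget(label_cardinalities, budget=10_000):
    """True if the worst-case series count stays within the chosen budget."""
    return max_series(label_cardinalities) <= budget


print(max_series([5, 200, 3]))               # 3000
print(max_series([5, 200, 3, 100_000]))      # 300000000
print(within_budget([5, 200, 3, 100_000]))   # False
```

A metric with no labels has an empty list of cardinalities and counts as one series, which math.prod handles by returning 1.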

Defensive patterns

Validate label values before applying them:

ALLOWED_ENDPOINTS = {"/api/orders", "/api/users", "/api/health", "/api/products"}

def safe_endpoint(path):
    return path if path in ALLOWED_ENDPOINTS else "other"

Choose histogram buckets sparingly: each bucket is a separate series. A histogram with 10 explicit buckets exposes 11 _bucket series (including +Inf) plus _sum and _count, so 50 label combinations yield 13 × 50 = 650 series.
Monitor cardinality with Prometheus itself:

# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))
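
An exact-match allowlist like the one above sends /api/orders/123 to "other". One common refinement, sketched here, is to collapse ID-like path segments into a placeholder before the allowlist check. The regular expressions, the route templates, and the function names are assumptions for illustration; many services take the route template from their web framework instead.

```python
import re

# Segments that look like numeric IDs or UUIDs are replaced by a placeholder.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE)

ALLOWED_TEMPLATES = {
    "/api/orders",
    "/api/orders/{id}",
    "/api/users/{id}",
    "/api/health",
}


def normalize_path(path):
    """Replace ID-like path segments with '{id}' to bound label cardinality."""
    segments = [
        "{id}" if _NUMERIC.match(seg) or _UUID.match(seg) else seg
        for seg in path.split("/")
    ]
    return "/".join(segments)


def safe_endpoint_label(path):
    """Return a bounded label value: a known route template or 'other'."""
    template = normalize_path(path)
    return template if template in ALLOWED_TEMPLATES else "other"


print(safe_endpoint_label("/api/orders/123"))   # /api/orders/{id}
print(safe_endpoint_label("/wp-admin/login"))   # other
```

The final allowlist check still matters: normalization alone would let arbitrary scanner paths create new series.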

Alerting rules

Define alerting rules in Prometheus rule files; Alertmanager then handles grouping, routing, and notification of the alerts that fire:

groups:
  - name: python-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 1.0
        for: 10m
        labels:
          severity: warning

The for clause prevents flapping — the condition must persist for the specified duration before firing.
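
The effect of the for clause can be modelled in a few lines of plain Python. This is a simplified sketch of the pending/firing logic, not Prometheus code; alert_states and the one-minute evaluation interval are assumptions for illustration.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, one per evaluation, in time order.
    The alert is 'pending' while the condition has held for less than
    for_s seconds and 'firing' once it has held for at least for_s.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# One evaluation per minute with `for: 5m`.
spike = [True, True, True, False, False, False, False]
sustained = [True] * 7

print(alert_states(spike, 60, 300))      # never reaches 'firing'
print(alert_states(sustained, 60, 300))  # 'firing' from the sixth evaluation on
```

A three-minute spike stays pending and then clears, while a sustained breach starts firing once five full minutes have elapsed.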

Testing metrics

Verify instrumentation in tests:

from prometheus_client import REGISTRY

# `client` is assumed to be your application's test-client fixture.
def test_request_counter_increments(client):
    before = REGISTRY.get_sample_value(
        "http_requests_total",
        {"method": "GET", "endpoint": "/api/orders", "status": "200"}
    ) or 0
    
    client.get("/api/orders")
    
    after = REGISTRY.get_sample_value(
        "http_requests_total",
        {"method": "GET", "endpoint": "/api/orders", "status": "200"}
    )
    
    assert after == before + 1

For integration tests, scrape the /metrics endpoint and parse the output.
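
For that kind of test, the scraped body has to be turned into values you can assert on. prometheus_client ships a full parser (prometheus_client.parser.text_string_to_metric_families); the stdlib-only sketch below handles just simple sample lines and is meant to show the shape of the text format. parse_samples and the sample payload are illustrative assumptions.

```python
import re

_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition text into {(name, frozenset(label pairs)): float}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Escape sequences inside label values are left as-is.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = frozenset(_LABEL.findall(raw_labels or ""))
        samples[(name, labels)] = float(value)
    return samples


body = """\
# HELP http_requests_total Total requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 3.0
http_requests_total{method="POST",endpoint="/api/orders",status="500"} 1.0
"""

samples = parse_samples(body)
key = (
    "http_requests_total",
    frozenset({("method", "GET"), ("endpoint", "/api/orders"), ("status", "200")}),
)
print(samples[key])  # 3.0
```

Keying on a frozenset of label pairs makes the lookup independent of the order in which labels appear on the line.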

Pushgateway for batch jobs

Short-lived batch jobs may terminate before Prometheus scrapes. The Pushgateway accepts pushed metrics:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge("batch_duration_seconds", "Batch job duration", registry=registry)

with duration.time():
    run_batch()

push_to_gateway("localhost:9091", job="nightly_etl", registry=registry)

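Use the Pushgateway sparingly: it is designed for batch jobs, not as a general replacement for the pull model. Pushed series stay on the gateway until they are explicitly deleted, so a job that stops running leaves its last values behind.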

One thing to remember: Production Prometheus in Python demands multiprocess-aware metric storage, cardinality-conscious label design, and histogram buckets aligned to your SLA thresholds — these operational details determine whether your monitoring helps or hinders.
