Python Metrics Collection — Deep Dive

Metrics seem simple — increment a counter, observe a histogram. In production, the details matter: multi-process safety, cardinality control, custom collectors, and bridging between Prometheus and OpenTelemetry ecosystems.

prometheus_client internals

The prometheus_client library stores metrics in a CollectorRegistry. When Prometheus scrapes /metrics, the library iterates all registered collectors and serializes their values to the Prometheus text exposition format.
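To make the scrape path concrete, here is a minimal sketch that builds a private registry and serializes it by hand, which is roughly what a /metrics handler does on each scrape. The metric name jobs_processed and the private registry are illustrative assumptions, not part of any particular setup.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A private registry keeps this example isolated from the global REGISTRY.
registry = CollectorRegistry()

# Hypothetical metric name, used only for illustration.
jobs = Counter("jobs_processed", "Jobs processed", registry=registry)
jobs.inc(3)

# generate_latest() walks every collector in the registry and returns
# the text exposition format as bytes.
payload = generate_latest(registry)
print(payload.decode())
```

The printed payload includes the HELP and TYPE comment lines followed by the sample lines that Prometheus parses.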

Metric storage

Each metric object (Counter, Gauge, Histogram) stores its values in a ValueClass: by default a MutexValue, which is a float protected by a Lock. For labeled metrics, each unique label combination creates a new child metric, and the parent keeps its children in a dict keyed by the label values:

from prometheus_client import Counter

REQUEST_COUNT = Counter("http_requests_total", "...", ["method", "status"])

# Accessing labels creates a child metric stored in a dict
REQUEST_COUNT.labels(method="GET", status="200").inc()
# Internally (roughly): {("GET", "200"): <child Counter, value 1.0>}

Label gotchas

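# These are the SAME time series: keyword order does not matter
REQUEST_COUNT.labels(method="GET", status="200")
REQUEST_COUNT.labels(status="200", method="GET")

# These are DIFFERENT time series: positional args follow declaration order
REQUEST_COUNT.labels("GET", "200")  # method="GET", status="200"
REQUEST_COUNT.labels("200", "GET")  # method="200", status="GET" (a bug)

# Use keyword arguments to avoid this:
REQUEST_COUNT.labels(method="GET", status="200")  # always correct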
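A quick way to check how label order behaves is to increment a counter through different call styles and read the samples back. This is a small sketch on a private registry; the metric name is an assumption chosen for the example.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
requests = Counter(
    "http_requests", "Requests", ["method", "status"], registry=registry
)

# Keyword arguments: order is irrelevant, both calls hit the same child.
requests.labels(method="GET", status="200").inc()
requests.labels(status="200", method="GET").inc()

# Positional arguments: swapped order silently creates a different series.
requests.labels("200", "GET").inc()

same = registry.get_sample_value(
    "http_requests_total", {"method": "GET", "status": "200"}
)
swapped = registry.get_sample_value(
    "http_requests_total", {"method": "200", "status": "GET"}
)
print(same, swapped)  # 2.0 1.0
```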

Multi-process mode (gunicorn)

The default in-memory storage doesn’t work with pre-fork servers like gunicorn, where each worker is a separate process. The library provides a multi-process mode:

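import os

# Must be set before prometheus_client is imported. In production, prefer
# setting it in the start-up script so every worker inherits it, and wipe
# the directory between runs.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = "/tmp/prometheus_multiproc"

from prometheus_client import (
    CONTENT_TYPE_LATEST, CollectorRegistry, generate_latest, multiprocess,
)

def metrics_app(environ, start_response):
    # Fresh registry per scrape; the collector aggregates all worker files
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    data = generate_latest(registry)
    start_response("200 OK", [("Content-Type", CONTENT_TYPE_LATEST)])
    return [data]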

Each worker writes metrics to memory-mapped files in the shared directory. The collector reads all files and aggregates values.

Aggregation rules:

  • Counters: summed across workers
  • Gauges: depends on the multiprocess_mode parameter:
    • "all": expose per-PID values (the default)
    • "liveall": expose per-PID values for live workers only
    • "max" / "min": take the max/min across workers
    • "livesum": sum values from live workers (useful for active connections)

from prometheus_client import Gauge

ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Currently active connections",
    multiprocess_mode="livesum",
)
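The aggregation rules are easier to compare side by side with concrete numbers. The function below is a pure-Python model of the modes described above, not the library's implementation; the PIDs and values are made up for illustration.

```python
def aggregate_gauge(values_by_pid, mode, live_pids):
    """Combine per-worker gauge values according to a multiprocess mode."""
    if mode.startswith("live"):
        # "live*" modes ignore workers that have exited
        values_by_pid = {
            pid: v for pid, v in values_by_pid.items() if pid in live_pids
        }
        mode = mode[len("live"):]
    if mode == "all":
        return dict(values_by_pid)  # one series per PID
    if mode == "sum":
        return sum(values_by_pid.values())
    if mode == "max":
        return max(values_by_pid.values())
    if mode == "min":
        return min(values_by_pid.values())
    raise ValueError(f"unknown mode: {mode}")

# Three workers reported a value; worker 103 has since exited.
values = {101: 4.0, 102: 7.0, 103: 2.0}
live = {101, 102}

print(aggregate_gauge(values, "livesum", live))  # 11.0
print(aggregate_gauge(values, "max", live))      # 7.0
print(aggregate_gauge(values, "liveall", live))  # {101: 4.0, 102: 7.0}
```

The model shows why "livesum" suits active connections: the value left behind by the dead worker would otherwise inflate the total.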

Cleanup

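Files written by dead workers linger in the shared directory. Call mark_process_dead from gunicorn’s child_exit hook so the dead worker’s live gauge files are removed and the "live" modes stop counting it: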

# gunicorn.conf.py
from prometheus_client import multiprocess

def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)
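The library's documentation also asks for the multiprocess directory to be wiped between runs, before the server starts. A minimal sketch of such a start-up step follows; the helper name is hypothetical, and it assumes the metric files use the .db suffix.

```python
import glob
import os
import tempfile

def wipe_multiproc_dir(path):
    """Remove leftover *.db metric files from a previous run."""
    os.makedirs(path, exist_ok=True)
    removed = 0
    for db_file in glob.glob(os.path.join(path, "*.db")):
        os.remove(db_file)
        removed += 1
    return removed

# Demo against a throwaway directory containing one stale file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "counter_1234.db"), "w").close()
    removed_count = wipe_multiproc_dir(demo_dir)
    remaining = os.listdir(demo_dir)

print(removed_count, remaining)  # 1 []
```

Run a step like this once before gunicorn starts, never from inside a worker, or live workers would lose their files.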

Custom collectors

For metrics that come from external sources (database stats, OS metrics, third-party APIs), write a custom collector:

from prometheus_client.core import GaugeMetricFamily, REGISTRY

class DatabasePoolCollector:
    def __init__(self, pool):
        self.pool = pool

    def describe(self):
        yield GaugeMetricFamily("db_pool_size", "Connection pool size")
        yield GaugeMetricFamily("db_pool_checked_out", "Connections in use")

    def collect(self):
        size = GaugeMetricFamily("db_pool_size", "Connection pool size")
        size.add_metric([], self.pool.size())
        yield size

        in_use = GaugeMetricFamily("db_pool_checked_out", "Connections in use")
        in_use.add_metric([], self.pool.checkedout())
        yield in_use

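# db_pool is assumed to be an existing pool exposing size() and checkedout(),
# such as SQLAlchemy's QueuePool
REGISTRY.register(DatabasePoolCollector(db_pool))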

collect() is called on every scrape, so keep it fast: the Prometheus scrape timeout defaults to 10 seconds, and a slow collector delays the whole /metrics response.
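One way to keep collect() fast when the underlying source is slow is to serve a cached value and refresh it only after a time-to-live expires. The sketch below assumes a hypothetical slow_fetch callable and a queue_depth metric; it is not thread-safe as written, so add a lock if scrapes can overlap.

```python
import time

from prometheus_client import CollectorRegistry
from prometheus_client.core import GaugeMetricFamily

class CachedQueueDepthCollector:
    """Serves a cached value so a slow backend call cannot stall scrapes."""

    def __init__(self, fetch_depth, ttl_seconds=30.0):
        self._fetch_depth = fetch_depth  # hypothetical slow callable
        self._ttl = ttl_seconds
        self._cached = None
        self._expires_at = 0.0

    def collect(self):
        now = time.monotonic()
        if self._cached is None or now >= self._expires_at:
            self._cached = self._fetch_depth()
            self._expires_at = now + self._ttl
        gauge = GaugeMetricFamily("queue_depth", "Jobs waiting in the queue")
        gauge.add_metric([], self._cached)
        yield gauge

calls = []

def slow_fetch():
    calls.append(1)  # stands in for an expensive query
    return 42.0

registry = CollectorRegistry()
registry.register(CachedQueueDepthCollector(slow_fetch))

first = registry.get_sample_value("queue_depth")
second = registry.get_sample_value("queue_depth")
print(first, second, len(calls))  # 42.0 42.0 1
```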

Histogram bucket design

Default buckets (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10, plus +Inf) are designed for HTTP latencies measured in seconds. For other use cases, choose buckets carefully:

# For database query times (mostly 1-100ms)
DB_LATENCY = Histogram(
    "db_query_duration_seconds",
    "Database query latency",
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

# For file processing (seconds to minutes)
PROCESSING_TIME = Histogram(
    "file_processing_seconds",
    "File processing duration",
    buckets=[1, 5, 10, 30, 60, 120, 300, 600]
)

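Each bucket is a separate time series, and every histogram also exports an implicit +Inf bucket plus _sum and _count series. With 10 explicit buckets that is at least 13 series per label combination, so 5 label values yield at least 65 time series. Balance granularity against storage costs.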
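Counting the series a histogram produces is simple arithmetic once the implicit +Inf bucket and the _sum and _count series are included. The helper below is a hypothetical sketch for estimating that total before adding a label; it ignores any extra series a client may export, such as _created.

```python
def histogram_series(explicit_buckets, label_combinations):
    """Series exported by one histogram: buckets, +Inf, _sum and _count."""
    per_combination = explicit_buckets + 1 + 2
    return per_combination * label_combinations

print(histogram_series(10, 5))    # 65
print(histogram_series(10, 500))  # 6500
```

Running the estimate for a candidate label makes the cost of that label visible before it ships.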

Exponential buckets helper

from prometheus_client import Histogram

# Built with a plain comprehension (start=0.01, factor=2, count=10).
# Generates: [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12]
LATENCY = Histogram(
    "request_latency_seconds",
    "Request latency",
    buckets=[0.01 * 2**i for i in range(10)],
)

Cardinality control

High cardinality is one of the most common causes of metrics system outages. Rough practical limits:

Backend                        Safe cardinality per metric
Prometheus (single instance)   ~10,000 series
Prometheus + Thanos            ~100,000 series
Datadog                        Billed per custom metric
VictoriaMetrics                ~1,000,000 series

Strategies to control cardinality

  1. Bucket URL paths: Replace /users/12345 with /users/{id}.
  2. Drop low-value labels: user_agent has thousands of values — rarely useful in metrics (use logs instead).
  3. Use exemplars instead of labels: Attach a single trace ID to a histogram observation instead of high-cardinality labels.

from prometheus_client import Histogram

REQUEST_LATENCY = Histogram("http_request_seconds", "Latency")

# Exemplar links this metric observation to a specific trace
REQUEST_LATENCY.observe(0.25, exemplar={"traceID": "abc123"})

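Grafana can link from the metric graph to the specific trace via the exemplar. Note that prometheus_client only renders exemplars in the OpenMetrics exposition format, so the scrape has to use that format for them to show up.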
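The first strategy, bucketing URL paths, can be sketched with two regular expressions. The function name and the {id} / {uuid} placeholders are assumptions for illustration; if your framework exposes the matched route template, prefer that.

```python
import re

# A path segment made only of digits, e.g. /12345
_NUMERIC_ID = re.compile(r"/\d+(?=/|$)")
# A path segment that is a canonical UUID
_UUID = re.compile(
    r"/[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(?=/|$)"
)

def normalize_path(path):
    """Collapse variable path segments so the label stays low-cardinality."""
    path = _UUID.sub("/{uuid}", path)  # UUIDs first: they may start with digits
    path = _NUMERIC_ID.sub("/{id}", path)
    return path

print(normalize_path("/users/12345"))           # /users/{id}
print(normalize_path("/users/12345/orders/7"))  # /users/{id}/orders/{id}
print(normalize_path("/health"))                # /health
```

Use the normalized value as the label, and keep the raw path in logs or traces where high cardinality is cheap.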

OpenTelemetry metrics integration

Using OTel SDK with Prometheus exporter

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

meter = metrics.get_meter("my-service")
counter = meter.create_counter("http_requests", description="Total requests")

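# The reader registers with prometheus_client's default registry, so
# Prometheus scrapes the same /metrics endpoint that prometheus_client exposes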

Bridging prometheus_client and OTel

If your codebase uses prometheus_client but your infrastructure expects OTLP:

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Export OTel metrics via OTLP every 15 seconds
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=15000,
)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

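This only exports metrics created through the OTel API; existing prometheus_client metrics are not forwarded by it. For those, keep exposing /metrics, use the target_info collector to attach resource attributes, and run both systems side by side during migration.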

Framework-specific patterns

FastAPI with prometheus-fastapi-instrumentator

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app, endpoint="/metrics")

This auto-creates http_request_duration_seconds, http_requests_total, and http_request_size_bytes with sensible labels.

Django with django-prometheus

# settings.py
INSTALLED_APPS = [..., "django_prometheus"]
MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    ...,
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("django_prometheus.urls")),
]

It can also instrument database connections and cache backends once their ENGINE and BACKEND settings point at the django_prometheus wrappers.

Testing metrics

from prometheus_client import CollectorRegistry, Counter

def test_request_counter_increments():
    # A fresh registry per test avoids duplicate-registration errors
    registry = CollectorRegistry()
    counter = Counter("test_requests", "Test", ["status"], registry=registry)
    counter.labels(status="200").inc()
    counter.labels(status="200").inc()

    # Read the current value
    sample = registry.get_sample_value(
        "test_requests_total", {"status": "200"}
    )
    assert sample == 2.0

Use a separate CollectorRegistry() in tests to avoid pollution between test cases.
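Histograms can be asserted on the same way, as long as the test reads the derived sample names. The sketch below uses a hypothetical job_seconds metric and shows the _bucket, _count and _sum samples; bucket counts are cumulative and the le label value is a string.

```python
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
job_seconds = Histogram(
    "job_seconds", "Job duration", buckets=[0.1, 0.5, 1.0], registry=registry
)
job_seconds.observe(0.25)
job_seconds.observe(0.75)

# Each bucket counts every observation less than or equal to its bound.
le_half = registry.get_sample_value("job_seconds_bucket", {"le": "0.5"})
le_inf = registry.get_sample_value("job_seconds_bucket", {"le": "+Inf"})
count = registry.get_sample_value("job_seconds_count")
total = registry.get_sample_value("job_seconds_sum")
print(le_half, le_inf, count, total)  # 1.0 2.0 2.0 1.0
```

get_sample_value returns None when no sample matches, so a typo in the name or labels fails the assertion instead of raising.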

Operational recommendations

  1. Four golden signals per service: Request rate, error rate, latency (histogram), and saturation (active connections / queue depth).
  2. Scrape interval: Prometheus defaults to 1 minute, and 15 seconds is a common choice. Go to 5 seconds only for critical services: compared with 15 seconds, it triples the number of samples stored.
  3. Retention: Keep raw metrics for 15 days, downsample to 5-minute intervals for 90 days, 1-hour intervals for 1 year. Prometheus itself stores only raw samples, so downsampling needs a long-term store such as Thanos.
  4. Alert on symptoms, not causes: Alert on “error rate > 1%” not “database CPU > 80%.” Symptom-based alerts reduce noise.
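The scrape-interval trade-off in item 2 is plain arithmetic: sample volume scales inversely with the interval. A small sketch, assuming a hypothetical service that exposes 1,000 series:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(scrape_interval_seconds, series):
    """Samples ingested per day for a given interval and series count."""
    return series * SECONDS_PER_DAY // scrape_interval_seconds

at_15s = samples_per_day(15, series=1000)
at_5s = samples_per_day(5, series=1000)
print(at_15s, at_5s, at_5s / at_15s)  # 5760000 17280000 3.0
```

Actual disk usage also depends on compression, so treat the ratio as an upper-level estimate rather than an exact storage figure.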

One thing to remember: Good metrics start with the four golden signals. Get those right with proper labels and histogram buckets, and you’ll catch the large majority of production issues. Everything else is optimization.

