Python Metrics Collection — Deep Dive
Metrics seem simple — increment a counter, observe a histogram. In production, the details matter: multi-process safety, cardinality control, custom collectors, and bridging between Prometheus and OpenTelemetry ecosystems.
prometheus_client internals
The prometheus_client library stores metrics in a CollectorRegistry. When Prometheus scrapes /metrics, the library iterates all registered collectors and serializes their values to the Prometheus text exposition format.
Metric storage
Each metric object (Counter, Gauge, Histogram) stores its values through a ValueClass, which by default is a float guarded by a Lock. For labeled metrics, each unique label combination creates a new child metric, and the parent keeps its children in a dict keyed by the label values:
REQUEST_COUNT = Counter("http_requests_total", "...", ["method", "status"])
# Accessing labels creates a child metric stored in a dict
REQUEST_COUNT.labels(method="GET", status="200").inc()
# Internally: {("GET", "200"): CounterValue(1.0)}
Label gotchas
# Positional arguments are matched to label names in declaration order,
# so swapping them silently records a DIFFERENT time series:
REQUEST_COUNT.labels("GET", "200")  # method="GET", status="200"
REQUEST_COUNT.labels("200", "GET")  # method="200", status="GET" (wrong!)
# Keyword arguments are order-independent, so prefer them:
REQUEST_COUNT.labels(status="200", method="GET")  # same series as the first line
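The effect of argument order on labels can be checked directly. The following is a minimal sketch that uses a private CollectorRegistry and a hypothetical metric name (demo_http_requests) so it does not touch the global registry:

```python
from prometheus_client import CollectorRegistry, Counter

# Private registry and demo metric name are assumptions for illustration only.
registry = CollectorRegistry()
requests = Counter(
    "demo_http_requests",
    "Demo request counter",
    ["method", "status"],
    registry=registry,
)

requests.labels(method="GET", status="200").inc()
requests.labels(status="200", method="GET").inc()  # keyword order is irrelevant: same child
requests.labels("200", "GET").inc()                # swapped positionals: a different child

same = registry.get_sample_value(
    "demo_http_requests_total", {"method": "GET", "status": "200"}
)
swapped = registry.get_sample_value(
    "demo_http_requests_total", {"method": "200", "status": "GET"}
)
print(same, swapped)
```

The two keyword calls land on one series, while the swapped positional call creates a second series with nonsensical label values.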
Multi-process mode (gunicorn)
The default in-memory storage doesn’t work with pre-fork servers like gunicorn, where each worker is a separate process. The library provides a multi-process mode:
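import os
# Set this before prometheus_client is imported; the directory must already exist.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = "/tmp/prometheus_multiproc"
from prometheus_client import CONTENT_TYPE_LATEST, CollectorRegistry, generate_latest, multiprocess
def metrics_app(environ, start_response):
    # Build a fresh registry per scrape; the collector aggregates every worker's files.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    data = generate_latest(registry)
    start_response("200 OK", [("Content-Type", CONTENT_TYPE_LATEST)])
    return [data]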
Each worker writes metrics to memory-mapped files in the shared directory. The collector reads all files and aggregates values.
Aggregation rules:
- Counters: summed across workers
- Gauges: depends on the multiprocess_mode parameter:
  - "all": expose per-PID values
  - "liveall": expose per-PID values for live workers
  - "max" / "min": take the max/min across workers
  - "livesum": sum values from live workers (useful for active connections)
ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Currently active connections",
    multiprocess_mode="livesum",
)
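To make the gauge aggregation modes concrete, here is a small pure-Python sketch of how per-worker values could be combined. It is an illustration of the rules listed above, not the library's implementation; the function name aggregate_gauge and its arguments are hypothetical.

```python
def aggregate_gauge(values_by_pid, mode, live_pids):
    """Combine per-worker gauge values according to a multiprocess mode.

    values_by_pid: dict mapping worker PID -> that worker's gauge value.
    live_pids: set of PIDs whose workers are still running.
    """
    if mode.startswith("live"):
        # "live*" modes ignore values left behind by dead workers.
        values_by_pid = {
            pid: value for pid, value in values_by_pid.items() if pid in live_pids
        }
        mode = mode[len("live"):]
    if mode == "all":
        return dict(values_by_pid)  # one value per PID
    if mode == "sum":
        return sum(values_by_pid.values())
    if mode == "max":
        return max(values_by_pid.values(), default=0.0)
    if mode == "min":
        return min(values_by_pid.values(), default=0.0)
    raise ValueError(f"unsupported mode: {mode!r}")


workers = {101: 3.0, 102: 5.0, 103: 7.0}  # assume PID 103 has exited
live = {101, 102}
print(aggregate_gauge(workers, "livesum", live))  # 8.0
print(aggregate_gauge(workers, "max", live))      # 7.0
```

Note how "max" still sees the dead worker's value, which is why the live variants matter for gauges such as active connections.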
Cleanup
Dead worker files linger. Add cleanup to gunicorn’s child_exit hook:
# gunicorn.conf.py
from prometheus_client import multiprocess
def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)
Custom collectors
For metrics that come from external sources (database stats, OS metrics, third-party APIs), write a custom collector:
from prometheus_client.core import GaugeMetricFamily, REGISTRY
class DatabasePoolCollector:
    def __init__(self, pool):
        self.pool = pool
    def describe(self):
        yield GaugeMetricFamily("db_pool_size", "Connection pool size")
        yield GaugeMetricFamily("db_pool_checked_out", "Connections in use")
    def collect(self):
        size = GaugeMetricFamily("db_pool_size", "Connection pool size")
        size.add_metric([], self.pool.size())
        yield size
        in_use = GaugeMetricFamily("db_pool_checked_out", "Connections in use")
        in_use.add_metric([], self.pool.checkedout())
        yield in_use
REGISTRY.register(DatabasePoolCollector(db_pool))
collect() is called on every scrape, so keep it fast: Prometheus’s default scrape timeout is 10 seconds.
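One way to keep collect() cheap when the underlying read is slow is to cache the last reading for a few seconds. The sketch below assumes a hypothetical read_value callable (for example a database query) and an injectable clock for testing; the class name CachedGaugeCollector is illustrative, not part of prometheus_client.

```python
import time

from prometheus_client.core import GaugeMetricFamily


class CachedGaugeCollector:
    """Serve a cached reading so collect() stays cheap on every scrape."""

    def __init__(self, name, documentation, read_value,
                 ttl_seconds=5.0, clock=time.monotonic):
        self._name = name
        self._documentation = documentation
        self._read_value = read_value  # hypothetical slow callable
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def collect(self):
        now = self._clock()
        # Refresh only when there is no reading yet or the cached one has expired.
        if self._value is None or now >= self._expires_at:
            self._value = float(self._read_value())
            self._expires_at = now + self._ttl
        family = GaugeMetricFamily(self._name, self._documentation)
        family.add_metric([], self._value)
        yield family
```

Such a collector would be registered like any other, for example with REGISTRY.register(CachedGaugeCollector(...)). The trade-off is that scraped values can be up to ttl_seconds stale.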
Histogram bucket design
Default buckets (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10, plus +Inf) are designed for HTTP latencies. For other use cases, choose buckets carefully:
# For database query times (mostly 1-100ms)
DB_LATENCY = Histogram(
    "db_query_duration_seconds",
    "Database query latency",
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)
# For file processing (seconds to minutes)
PROCESSING_TIME = Histogram(
    "file_processing_seconds",
    "File processing duration",
    buckets=[1, 5, 10, 30, 60, 120, 300, 600],
)
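Each bucket is a separate time series, and every label combination also gets a +Inf bucket plus _sum and _count series. With 10 buckets and 5 label values, that is 50 bucket series and at least 65 series in total. Balance granularity against storage costs.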
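The series count for a histogram is easy to estimate before deploying it. This is a rough sketch with a hypothetical helper name; it counts the finite buckets, the implicit +Inf bucket, and the _sum and _count series, and ignores any optional extras such as _created series.

```python
def histogram_series(bucket_count, label_combinations):
    """Estimate the time series one histogram produces."""
    # Finite buckets, plus the +Inf bucket, plus _sum and _count.
    per_combination = bucket_count + 1 + 2
    return per_combination * label_combinations


print(histogram_series(10, 5))    # 65
print(histogram_series(10, 500))  # 6500
```

Multiplying by the number of label combinations is what makes an innocent-looking extra label expensive.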
Exponential buckets helper
from prometheus_client import Histogram
# Start at 0.01 and double each step (a plain comprehension; no library helper is assumed).
# Generates: [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12]
LATENCY = Histogram(
    "request_latency_seconds",
    "Request latency",
    buckets=[0.01 * 2**i for i in range(10)],
)
Cardinality control
High cardinality is one of the most common causes of metrics system outages. Rough rules of thumb:
| Backend | Safe cardinality per metric |
|---|---|
| Prometheus (single instance) | ~10,000 series |
| Prometheus + Thanos | ~100,000 series |
| Datadog | Billed per custom metric |
| VictoriaMetrics | ~1,000,000 series |
Strategies to control cardinality
- Bucket URL paths: Replace /users/12345 with /users/{id}.
- Drop low-value labels: user_agent has thousands of values and is rarely useful in metrics (use logs instead).
- Use exemplars instead of labels: Attach a single trace ID to a histogram observation instead of high-cardinality labels.
from prometheus_client import Histogram
REQUEST_LATENCY = Histogram("http_request_seconds", "Latency")
# Exemplar links this metric observation to a specific trace
REQUEST_LATENCY.observe(0.25, exemplar={"traceID": "abc123"})
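Grafana can link from the metric graph to the specific trace via the exemplar. Note that prometheus_client only emits exemplars in the OpenMetrics exposition format, and Prometheus has to run with exemplar storage enabled to keep them.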
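The first strategy above, bucketing URL paths, can be sketched with two regular expressions. The function name normalize_path and the {uuid} placeholder are assumptions for illustration; when your framework exposes the matched route template, using that as the label value is usually more robust.

```python
import re

# A path segment that is purely digits, e.g. /users/12345
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")
# A path segment that is a canonical UUID
_UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)


def normalize_path(path):
    """Collapse ID-like path segments so the label stays low-cardinality."""
    path = _UUID_SEGMENT.sub("/{uuid}", path)
    return _NUMERIC_SEGMENT.sub("/{id}", path)


print(normalize_path("/users/12345"))           # /users/{id}
print(normalize_path("/users/12345/orders/7"))  # /users/{id}/orders/{id}
```

Segments that merely contain digits, such as /v2/, are replaced only when the whole segment is numeric, so versioned prefixes like /api/v2 are left alone.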
OpenTelemetry metrics integration
Using OTel SDK with Prometheus exporter
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("my-service")
counter = meter.create_counter("http_requests", description="Total requests")
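# PrometheusMetricReader registers with prometheus_client's default registry,
# so these metrics appear on the same /metrics endpoint that Prometheus already scrapes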
Bridging prometheus_client and OTel
If your codebase uses prometheus_client but your infrastructure expects OTLP:
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
# Export OTel metrics via OTLP
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=15000,
)
provider = MeterProvider(metric_readers=[reader])
For the existing prometheus_client metrics, keep the /metrics endpoint in place and run both systems side by side during migration. A target_info metric carrying the resource attributes makes it easier to match up series from the two pipelines.
Framework-specific patterns
FastAPI with prometheus-fastapi-instrumentator
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
app = FastAPI()
Instrumentator().instrument(app).expose(app, endpoint="/metrics")
This auto-creates http_request_duration_seconds, http_requests_total, and http_request_size_bytes with sensible labels.
Django with django-prometheus
# settings.py
INSTALLED_APPS = [..., "django_prometheus"]
MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    ...,
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]
# urls.py
from django.urls import include, path
urlpatterns = [
    path("", include("django_prometheus.urls")),
]
django-prometheus can also instrument database connections and cache backends once the ENGINE and BACKEND settings point at its wrapper backends.
Testing metrics
from prometheus_client import CollectorRegistry, Counter
def test_request_counter_increments():
    # A fresh registry per test keeps state from leaking between test cases
    registry = CollectorRegistry()
    counter = Counter("test_requests", "Test", ["status"], registry=registry)
    counter.labels(status="200").inc()
    counter.labels(status="200").inc()
    # Read the current value; counters are exposed with a _total suffix
    sample = registry.get_sample_value(
        "test_requests_total", {"status": "200"}
    )
    assert sample == 2.0
Use a separate CollectorRegistry() in tests to avoid pollution between test cases.
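When the code under test registers its metrics at import time, a fresh registry is not always an option, and counters cannot be reset. In that situation it can help to assert on the change in a sample rather than its absolute value. The helper below is a hypothetical sketch (assert_sample_delta is not part of prometheus_client) built only on get_sample_value:

```python
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Counter


@contextmanager
def assert_sample_delta(registry, name, labels, expected):
    """Assert that a sample changes by `expected` while the block runs."""
    # get_sample_value returns None for a series that does not exist yet.
    before = registry.get_sample_value(name, labels) or 0.0
    yield
    after = registry.get_sample_value(name, labels) or 0.0
    assert after - before == expected, (
        f"{name}{labels}: expected +{expected}, got +{after - before}"
    )


# Demo registry and metric name are assumptions for illustration.
registry = CollectorRegistry()
jobs = Counter("demo_jobs", "Jobs processed", ["queue"], registry=registry)

with assert_sample_delta(registry, "demo_jobs_total", {"queue": "email"}, 2.0):
    jobs.labels(queue="email").inc()
    jobs.labels(queue="email").inc()
```

Because only the difference is checked, the test passes regardless of what earlier tests did to the same counter.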
Operational recommendations
- Four golden signals per service: Request rate, error rate, latency (histogram), and saturation (active connections / queue depth).
- Scrape interval: 15 seconds is a common choice (Prometheus’s built-in default is 1 minute). Go to 5 seconds only for critical services, since that triples the number of samples stored compared with 15 seconds.
- Retention: Keep raw metrics for 15 days, downsample to 5-minute intervals for 90 days, 1-hour intervals for 1 year.
- Alert on symptoms, not causes: Alert on “error rate > 1%” not “database CPU > 80%.” Symptom-based alerts reduce noise.
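As a starting point for the four golden signals listed above, a single decorator can record rate, errors, latency, and a simple saturation proxy. This is a minimal sketch under stated assumptions: the metric names, the outcome label, and the instrumented decorator are all hypothetical, and a private registry is used to keep the example self-contained.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()
# Rate and errors: one counter, split by outcome.
REQUESTS = Counter(
    "app_requests", "Requests handled", ["outcome"], registry=registry
)
# Latency: a histogram with the default buckets.
LATENCY = Histogram(
    "app_request_duration_seconds", "Request latency", registry=registry
)
# Saturation proxy: requests currently in flight.
IN_FLIGHT = Gauge(
    "app_requests_in_flight", "Requests currently being handled", registry=registry
)


def instrumented(handler):
    """Wrap a handler so every call updates the four signals."""
    def wrapper(*args, **kwargs):
        IN_FLIGHT.inc()
        start = time.perf_counter()
        try:
            result = handler(*args, **kwargs)
        except Exception:
            REQUESTS.labels(outcome="error").inc()
            raise
        else:
            REQUESTS.labels(outcome="ok").inc()
            return result
        finally:
            # Runs on both success and failure.
            LATENCY.observe(time.perf_counter() - start)
            IN_FLIGHT.dec()
    return wrapper
```

Request rate and error rate then come from the rate of app_requests_total by outcome, latency percentiles from the histogram buckets, and saturation from the in-flight gauge.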
One thing to remember: Good metrics start with the four golden signals. Get those right with proper labels and histogram buckets, and you’ll catch most production issues. Everything else is optimization.