Python Grafana Dashboards — Deep Dive
Production Grafana goes beyond clicking panels in the UI. Teams managing dozens of Python services need dashboard-as-code, automated provisioning, consistent panel design, and alert integration. This guide covers the engineering side of Grafana for Python teams.
Dashboard-as-code with JSON models
Every Grafana dashboard is a JSON document. You can export, version-control, and import them:
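# Export via API (jq keeps only the dashboard model and drops the "meta" wrapper)
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "$GRAFANA_URL/api/dashboards/uid/my-python-svc" | jq .dashboard > dashboard.json
# Import via API (the endpoint expects the model wrapped in a "dashboard" key;
# "id" is instance-specific, so clear it and let the uid identify the dashboard)
jq '{dashboard: (. + {id: null}), overwrite: true}' dashboard.json | \
  curl -X POST -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- \
  "$GRAFANA_URL/api/dashboards/db"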
Store dashboards in your service’s Git repository alongside the code that generates the metrics they display.
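The export call above saves only the dashboard model, while `POST /api/dashboards/db` expects that model wrapped in a `dashboard` key. A minimal Python sketch of the wrapping step follows. `to_import_payload` is a hypothetical helper name, and clearing `id` assumes the `uid` is what should identify the dashboard across Grafana instances.

```python
import copy
import json


def to_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap an exported dashboard model for POST /api/dashboards/db."""
    model = copy.deepcopy(dashboard)  # leave the caller's dict untouched
    # The numeric id is local to one Grafana instance; null lets the
    # target instance match on uid instead (assumption stated above).
    model["id"] = None
    return {"dashboard": model, "overwrite": overwrite}


# Sample model standing in for the contents of dashboard.json
exported = {"id": 42, "uid": "my-python-svc", "title": "My Python Service", "panels": []}
payload = to_import_payload(exported)
print(json.dumps(payload, indent=2))
```

Keeping the wrapper out of the committed file means the same JSON can also be used for file-based provisioning, which reads the bare model.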
Generating dashboards with Python
For teams managing many services, generating dashboard JSON programmatically avoids copy-paste drift:
import json


def make_panel(title, expr, panel_id, grid_pos, unit):
    """Build one timeseries panel; the unit is passed in, not guessed from the query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": grid_pos,
        "targets": [{
            "expr": expr,
            "legendFormat": "{{instance}}",
            "refId": "A"
        }],
        "fieldConfig": {
            "defaults": {
                "unit": unit
            }
        }
    }


def generate_service_dashboard(service_name: str) -> dict:
    # Prometheus metric names cannot contain hyphens
    prefix = service_name.replace("-", "_")
    panels = [
        make_panel(
            "Request Rate",
            f'rate({prefix}_http_requests_total[5m])',
            1, {"x": 0, "y": 0, "w": 12, "h": 8}, "reqps"
        ),
        make_panel(
            "Error Rate",
            f'rate({prefix}_http_requests_total{{status=~"5.."}}[5m])',
            2, {"x": 12, "y": 0, "w": 12, "h": 8}, "reqps"
        ),
        make_panel(
            # The query contains rate(), but the result is a duration in seconds
            "p95 Latency",
            f'histogram_quantile(0.95, rate({prefix}_http_request_duration_seconds_bucket[5m]))',
            3, {"x": 0, "y": 8, "w": 12, "h": 8}, "s"
        ),
        make_panel(
            # A plain gauge: a count, not seconds
            "Active Connections",
            f'{prefix}_active_connections',
            4, {"x": 12, "y": 8, "w": 12, "h": 8}, "short"
        ),
    ]
    return {
        "dashboard": {
            "uid": f"{service_name}-overview",
            "title": f"{service_name} Overview",
            "tags": ["python", "auto-generated"],
            "timezone": "utc",
            "panels": panels,
            "templating": {
                "list": [{
                    "name": "environment",
                    "type": "query",
                    "query": f'label_values({prefix}_http_requests_total, environment)',
                    "current": {"text": "production", "value": "production"}
                }]
            },
            "time": {"from": "now-6h", "to": "now"},
            "refresh": "30s"
        },
        "overwrite": True
    }


if __name__ == "__main__":
    print(json.dumps(generate_service_dashboard("my-python-svc"), indent=2))
Run this in CI/CD to regenerate dashboards whenever metric names change.
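One way to wire this into CI is to render every dashboard with stable formatting and fail the build when a committed file differs from the regenerated one. The sketch below assumes dashboards are stored as `<uid>.json`. `render` and `check_drift` are hypothetical helper names, and the sample dict stands in for real generator output.

```python
import json
import tempfile
from pathlib import Path


def render(dashboard: dict) -> str:
    """Serialize with sorted keys so regenerated files diff cleanly in Git."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


def check_drift(dashboards: dict[str, dict], out_dir: Path) -> list[str]:
    """Return the uids whose committed file is missing or out of date."""
    stale = []
    for uid, dashboard in dashboards.items():
        path = out_dir / f"{uid}.json"
        if not path.exists() or path.read_text() != render(dashboard):
            stale.append(uid)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    generated = {"orders-overview": {"title": "orders Overview", "panels": []}}
    first_run = check_drift(generated, out_dir)  # nothing committed yet
    (out_dir / "orders-overview.json").write_text(render(generated["orders-overview"]))
    second_run = check_drift(generated, out_dir)  # file now matches

print(first_run, second_run)
```

A CI job could exit non-zero whenever `check_drift` returns a non-empty list, which forces the regenerated JSON to be committed together with the metric rename.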
Grafonnet (Jsonnet-based generation)
For larger organizations, Grafonnet provides a Jsonnet library for dashboard generation:
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Python Service',
  tags=['python'],
  time_from='now-6h',
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target('rate(http_requests_total[5m])')
  ),
  gridPos={x: 0, y: 0, w: 12, h: 8}
)
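Grafonnet generates the same JSON from reusable, composable components. Jsonnet is not statically typed, so mistakes surface when the file is evaluated rather than at a compile step. The `grafonnet/grafana.libsonnet` import shown here belongs to the original grafonnet-lib, which Grafana Labs has deprecated in favor of a newer Grafonnet library generated from the dashboard schema, so check which one your organization has vendored.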
Provisioning with Grafana provisioning files
For Docker/Kubernetes deployments, Grafana reads provisioning YAML at startup:
# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: 'python-services'
    orgId: 1
    folder: 'Python Services'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards/python
      # foldersFromFilesStructure: true would mirror subdirectories as Grafana
      # folders instead, but it only works when `folder` above is left unset.
Mount your JSON dashboards at /var/lib/grafana/dashboards/python/ and Grafana loads them automatically on boot.
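Because Grafana picks these files up without review, a pre-deploy check can catch broken files before they are mounted. The sketch below is one possible check, not a Grafana feature: `validate_dashboard_dir` is a hypothetical helper, and the rules it enforces (every file has a `uid` and a `title`, and no two files share a `uid`) are assumptions about what a team would want. Provisioned files hold the bare dashboard model, so the fields are read from the top level.

```python
import json
import tempfile
from pathlib import Path


def validate_dashboard_dir(root: Path) -> list[str]:
    """Return human-readable problems found in the *.json files under root."""
    problems = []
    seen_uids: dict[str, Path] = {}
    for path in sorted(root.rglob("*.json")):
        try:
            dashboard = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            problems.append(f"{path.name}: top level is not a JSON object")
            continue
        uid = dashboard.get("uid")
        if not uid:
            problems.append(f"{path.name}: missing uid")
        elif uid in seen_uids:
            problems.append(f"{path.name}: uid {uid!r} also used by {seen_uids[uid].name}")
        else:
            seen_uids[uid] = path
        if not dashboard.get("title"):
            problems.append(f"{path.name}: missing title")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.json").write_text(json.dumps({"uid": "svc-a", "title": "A"}))
    (root / "b.json").write_text(json.dumps({"uid": "svc-a", "title": "B"}))
    (root / "c.json").write_text("{not json")
    problems = validate_dashboard_dir(root)

for problem in problems:
    print(problem)
```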
Advanced PromQL patterns for Python services
Apdex score
Application Performance Index — a single number (0-1) summarizing user satisfaction:
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
+
sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
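Requests under 500ms are “satisfied,” under 2s are “tolerating,” above 2s are “frustrated.” Histogram buckets are cumulative, so the le="2.0" bucket already contains the satisfied requests; adding both buckets and halving the sum counts satisfied requests in full and tolerating requests at half weight. The histogram must define buckets at exactly 0.5 and 2.0 for these selectors to match. The default buckets in prometheus_client include 0.5 but not 2.0, so declare custom buckets or use the nearest default boundary (2.5).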
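The arithmetic behind the query can be checked by hand. The sketch below applies the same formula to made-up bucket counts. The `apdex` function name and the sample numbers are illustrative, and returning 1.0 for zero traffic is an assumption rather than part of the Apdex definition.

```python
def apdex(satisfied_bucket: float, tolerating_bucket: float, total: float) -> float:
    """Apdex from cumulative histogram buckets.

    satisfied_bucket: requests with duration <= 0.5s (le="0.5")
    tolerating_bucket: requests with duration <= 2.0s (le="2.0"); this
        already includes the satisfied ones because buckets are cumulative
    total: all requests (the _count series)
    """
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully satisfied
    return (satisfied_bucket + tolerating_bucket) / 2 / total


# 1000 requests: 800 under 0.5s, 150 more between 0.5s and 2s, 50 slower.
# The le="2.0" bucket therefore holds 800 + 150 = 950.
score = apdex(satisfied_bucket=800, tolerating_bucket=950, total=1000)
print(score)  # 0.875, the same as (800 + 150 / 2) / 1000
```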
Error budget burn rate
For SLO-driven dashboards:
# 99.9% SLO = 0.1% error budget
# Burn rate = how fast are we consuming the budget?
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) / 0.001
A burn rate of 1.0 means you’re consuming budget at exactly the allowed pace. Above 1.0, you’ll exhaust the budget before the window ends.
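To make the burn rate concrete, the sketch below converts an observed error ratio into a burn rate and then into time remaining. The 30-day (720-hour) SLO window and the function names are assumptions for illustration; the source does not state a window length.

```python
def burn_rate(error_ratio: float, error_budget: float = 0.001) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / error_budget


def hours_to_exhaustion(rate: float, window_hours: float = 720.0) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never consumed
    return window_hours / rate


# 0.5% of requests failing against a 99.9% SLO (0.1% budget)
rate = burn_rate(error_ratio=0.005)
remaining = hours_to_exhaustion(rate)
print(f"burn rate {rate:.1f}, budget lasts {remaining:.0f} hours")
```

At a burn rate of about 5, a budget sized for 30 days lasts roughly 6 days, which is why sustained values well above 1.0 usually justify paging someone.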
Multi-service dependency graph
# Show which downstream services are causing errors
sum by (downstream_service) (
rate(outbound_requests_total{status=~"5.."}[5m])
)
Requires your Python service to label outbound request metrics with the target service name.
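A rough sketch of how a service might derive those label values for each outbound call is shown below. `outbound_labels` is a hypothetical helper, and treating the first DNS label of the host as the service name is an assumption that will not hold for IP addresses or shared gateways.

```python
from urllib.parse import urlsplit


def outbound_labels(url: str, status_code: int) -> dict[str, str]:
    """Label values for one outbound call, keyed as the query above expects."""
    host = urlsplit(url).hostname or "unknown"
    # Assumption: the first DNS label is the service name,
    # e.g. "payments.internal.example" -> "payments".
    return {"downstream_service": host.split(".")[0], "status": str(status_code)}


labels = outbound_labels("http://payments.internal.example/charge", 503)
print(labels)

# With prometheus_client, these values would feed a Counter declared with
# the same label names (metric name and help text are illustrative):
#   OUTBOUND = Counter("outbound_requests", "Outbound HTTP calls",
#                      ["downstream_service", "status"])
#   OUTBOUND.labels(**labels).inc()
```

Keeping the set of downstream names small and fixed matters here, since every distinct label value creates a new time series.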
Alert rules in Grafana
Unified alerting (Grafana 9+)
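# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: python-service-alerts
    folder: Python Services
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "High Error Rate"
        condition: C
        data:
          - refId: A
            queryType: ""
            # Placeholder: replace with the UID of your Prometheus data source
            datasourceUid: PROMETHEUS_DS_UID
            relativeTimeRange: {from: 600, to: 0}
            model:
              # Group by service so that $labels.service exists in the annotation
              expr: 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m]))'
              instant: true
              refId: A
          - refId: C
            queryType: ""
            # Expressions run on Grafana's built-in expression engine
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: {type: gt, params: [0.01]}
              refId: C
        for: 5m
        annotations:
          summary: "Error rate exceeds 1% for {{ $labels.service }}"
        labels:
          severity: critical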
Contact points
Route alerts to Slack, PagerDuty, or email:
# provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: engineering-oncall
    receivers:
      - uid: slack-alerts
        type: slack
        settings:
          url: "https://hooks.slack.com/services/..."
          # Grafana's Slack integration names the channel setting "recipient"
          recipient: "#alerts"
Panel design patterns
RED method dashboard
Rate, Errors, Duration — the standard for request-driven services:
Row 1: [Request Rate] [Error Rate %] [p50/p95/p99 Latency]
Row 2: [Request Rate by Endpoint] [Error Rate by Endpoint]
Row 3: [Latency Heatmap] [Latency by Endpoint]
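A row plan like this can be turned into `gridPos` values mechanically. The sketch below assumes each row is split evenly across Grafana's 24-column grid and that every row is 8 units tall. `layout_rows` is a hypothetical helper name.

```python
GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def layout_rows(rows: list[list[str]], row_height: int = 8) -> dict[str, dict]:
    """Map each panel title to a gridPos, splitting every row evenly."""
    positions = {}
    for row_index, titles in enumerate(rows):
        if not titles:
            continue  # skip empty rows rather than dividing by zero
        width = GRID_WIDTH // len(titles)
        for col_index, title in enumerate(titles):
            positions[title] = {
                "x": col_index * width,
                "y": row_index * row_height,
                "w": width,
                "h": row_height,
            }
    return positions


red_rows = [
    ["Request Rate", "Error Rate %", "p50/p95/p99 Latency"],
    ["Request Rate by Endpoint", "Error Rate by Endpoint"],
    ["Latency Heatmap", "Latency by Endpoint"],
]
grid = layout_rows(red_rows)
for title, pos in grid.items():
    print(title, pos)
```

Deriving positions from the row plan keeps every service's dashboard visually identical, so on-call engineers find the same panel in the same place.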
USE method dashboard
Utilization, Saturation, Errors — for infrastructure components:
Row 1: [CPU Utilization] [Memory Utilization] [Disk I/O]
Row 2: [Connection Pool Saturation] [Thread Pool Saturation]
Row 3: [OOM Errors] [Connection Timeouts] [Disk Full Events]
Python-specific panels
Add panels for Python runtime metrics exposed by prometheus_client:
# GC collection frequency (collections per second, per generation)
rate(python_gc_collections_total[5m])
# Process memory (RSS)
process_resident_memory_bytes
# File descriptor usage (fraction of the process limit)
process_open_fds / process_max_fds
Grafana API automation from Python
import time

import httpx


class GrafanaClient:
    def __init__(self, url: str, api_key: str):
        # One shared client reuses connections and sends the auth header on every call
        self.client = httpx.Client(
            base_url=url,
            headers={"Authorization": f"Bearer {api_key}"}
        )

    def create_or_update_dashboard(self, dashboard_json: dict):
        # Expects the wrapped form: {"dashboard": {...}, "overwrite": True}
        response = self.client.post("/api/dashboards/db", json=dashboard_json)
        response.raise_for_status()
        return response.json()

    def create_annotation(self, text: str, tags: list[str]):
        response = self.client.post("/api/annotations", json={
            "text": text,
            "tags": tags,
            "time": int(time.time() * 1000)  # epoch milliseconds
        })
        response.raise_for_status()
        return response.json()

    def get_dashboard(self, uid: str) -> dict:
        response = self.client.get(f"/api/dashboards/uid/{uid}")
        response.raise_for_status()
        return response.json()
Use this in CI/CD pipelines to update dashboards, create deployment annotations, and validate that dashboards compile correctly.
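For deployment annotations in particular, building the request body in a pure function makes the pipeline step easy to test without a running Grafana. The sketch below is one way to do that. `deployment_annotation`, the tag names, and the text format are assumptions, not a Grafana convention.

```python
import time
from typing import Optional


def deployment_annotation(service: str, version: str, now: Optional[float] = None) -> dict:
    """Body for POST /api/annotations marking a deployment."""
    timestamp = time.time() if now is None else now
    return {
        "text": f"Deployed {service} {version}",
        "tags": ["deployment", service],
        "time": int(timestamp * 1000),  # the API takes epoch milliseconds
    }


# A fixed timestamp keeps the example deterministic
body = deployment_annotation("orders-api", "v1.4.2", now=1_700_000_000.0)
print(body)
```

Tagging by service lets each dashboard filter annotations down to its own deployments.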
Performance and operational tips
- Query caching: Enable Grafana’s query caching for dashboards viewed by many users. Set a min interval on panels to prevent unnecessary sub-second queries.
- Recording rules: Pre-compute expensive PromQL in Prometheus to speed up dashboard loads:
  groups:
    - name: python-service-recording
      rules:
        - record: service:http_request_rate:5m
          expr: sum(rate(http_requests_total[5m])) by (service)
- Dashboard loading time: Each panel fires a separate query. Dashboards with 20+ panels can take 10+ seconds to load. Use rows with collapse, and keep the default view to 6-10 panels.
- Version control: Use Grafana’s built-in dashboard versioning for UI changes, and Git for provisioned dashboards. Never edit provisioned dashboards in the UI, because the files on disk overwrite those changes the next time Grafana reloads them.
One thing to remember: The best Grafana dashboards are generated from code, provisioned automatically, and designed around the RED or USE methodology. Manual dashboard creation doesn’t scale past a handful of services — automate early.