Distributed Tracing with OpenTelemetry in Python — Core Concepts
When a single user action triggers work across multiple services, diagnosing performance problems or errors becomes difficult. Logs tell you what happened in one service; metrics tell you aggregate statistics. Distributed tracing fills the gap by connecting events across services into a single timeline.
OpenTelemetry (OTel) is the industry-standard framework for producing traces, metrics, and logs. For Python developers, it provides libraries that instrument your code and export telemetry to backends like Jaeger, Zipkin, or Grafana Tempo.
Traces and spans
A trace represents the full journey of a request through your system. It is identified by a unique trace ID.
A span represents one unit of work within that trace. Each span has:
- A name (e.g., “process_payment”)
- Start and end timestamps
- A parent span, except for the root span (this is what creates the tree structure)
- Attributes (key-value metadata like user.id=42)
- Status (OK, ERROR)
When service A calls service B, the span that service B creates becomes a child of the calling span in service A. These parent-child relationships build the tree that visualization tools render as a waterfall diagram.
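To make the tree idea concrete, here is a minimal standard-library sketch of how parent links turn a flat list of spans into a waterfall. The SpanRecord class, the span names, and the timings are hypothetical stand-ins for illustration, not OpenTelemetry types or real measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical, simplified stand-in for an exported span."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def render_waterfall(spans):
    """Return one indented line per span, with children nested under parents."""
    # Group spans by the ID of their parent.
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in start-time order, then recurse into each subtree.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} [{span.start_ms}-{span.end_ms} ms]")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Illustrative data: one trace, four spans, two services.
spans = [
    SpanRecord("a1", None, "GET /checkout", 0, 200),
    SpanRecord("b1", "a1", "validate_order", 5, 15),
    SpanRecord("b2", "a1", "POST /payments", 20, 190),
    SpanRecord("c1", "b2", "process_payment", 30, 180),
]
waterfall = render_waterfall(spans)
print("\n".join(waterfall))
```

A tracing backend does essentially this grouping on the spans that share a trace ID, then draws each line as a bar positioned by its timestamps.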
Context propagation
The trace ID must travel between services. This happens through context propagation — typically via HTTP headers.
When service A makes an HTTP request to service B, OpenTelemetry injects the trace ID and span ID into headers (usually traceparent). Service B extracts these headers and creates its spans as children of the incoming span.
This works automatically with OpenTelemetry’s instrumentation libraries for common frameworks.
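The header exchange described above can be sketched with the standard library alone. This assumes the W3C Trace Context layout for traceparent (version, trace ID, parent span ID, and flags, separated by hyphens); the helper names build_traceparent and parse_traceparent are hypothetical, and in a real service the OpenTelemetry propagators do this work for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Hypothetical helper: format the header that service A would send."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Hypothetical helper: return the parsed fields, or None if malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Service A: attach its trace ID and current span ID to the outgoing request.
headers = {
    "traceparent": build_traceparent(
        "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"
    )
}

# Service B: read the header and start a span in the same trace,
# parented to the span that made the call.
incoming = parse_traceparent(headers["traceparent"])
child_span = {
    "trace_id": incoming["trace_id"],
    "parent_span_id": incoming["parent_span_id"],
    "span_id": secrets.token_hex(8),  # fresh 16-hex-character span ID
}
print(child_span)
```

The key point is that service B never generates a new trace ID: it reuses the incoming one and only mints a new span ID.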
Instrumenting Python services
OpenTelemetry provides auto-instrumentation for popular frameworks:
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-requests \
opentelemetry-exporter-otlp
Basic setup:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Configure the tracer
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
With Flask auto-instrumentation:
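from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # create a span for each incoming request
RequestsInstrumentor().instrument()  # propagate trace context on outgoing requests calls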
Every incoming request now automatically creates a span. Outgoing HTTP calls via requests or httpx (with their instrumentors) automatically propagate the trace context.
Adding custom spans
Auto-instrumentation covers HTTP boundaries. For internal logic, add manual spans:
# order_id, items, valid, and ValidationError come from your application code
tracer = trace.get_tracer("order-service")
with tracer.start_as_current_span("validate_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items", len(items))
    if not valid:
        # Mark the span as failed and attach the exception as a span event
        span.set_status(trace.StatusCode.ERROR, "Invalid order")
        span.record_exception(ValidationError("Missing address"))
Custom spans add granularity. Without them, you see “service A called service B in 200ms.” With them, you see “service A spent 10ms validating, 180ms querying the database, and 10ms formatting the response.”
The collector
The OpenTelemetry Collector is a separate process that receives telemetry, processes it (sampling, enrichment), and exports it to backends. Python services send data to the collector rather than directly to Jaeger or Grafana Tempo.
This architecture means you can switch backends without changing application code. It also centralizes sampling decisions and reduces the number of connections each service maintains.
Common misconception
“Tracing adds significant overhead to every request.” With proper sampling (e.g., tracing 1% of requests in high-traffic services), the overhead is negligible, typically under 1ms of added latency. Head-based sampling decides at the start of a trace whether to record it, so unsampled requests skip span recording and export and pay only for propagating the trace context.
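The idea behind ratio-based head sampling can be illustrated with a short standard-library sketch. The OpenTelemetry SDK ships its own samplers (for example TraceIdRatioBased in opentelemetry.sdk.trace.sampling); the should_sample function below is a hypothetical simplification of the concept, not the SDK's implementation. It derives the decision from the trace ID alone, so any service using the same ratio reaches the same verdict for a given trace.

```python
import random

TRACE_ID_LOW_64_MASK = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Hypothetical head-based sampler: keep roughly `ratio` of all traces.

    The verdict depends only on the trace ID, so it is stable for the
    whole lifetime of a trace.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Keep the trace if the low 64 bits of its ID fall below the threshold.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LOW_64_MASK) < bound


# Simulate 100,000 random 128-bit trace IDs at a 1% sampling ratio.
rng = random.Random(0)  # fixed seed so the simulation is repeatable
total = 100_000
trace_ids = [rng.getrandbits(128) for _ in range(total)]
kept = sum(should_sample(trace_id, 0.01) for trace_id in trace_ids)
print(f"kept {kept} of {total} traces")
```

Because the decision is made once, before any span work happens, the roughly 99% of requests that are not kept never build or export span data.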
When distributed tracing matters most
Tracing pays for itself when you have more than two or three services interacting per request. A monolith rarely needs it — application profiling tools work better. But once requests fan out across services, tracing becomes the only reliable way to understand end-to-end latency and failure cascades.
One thing to remember: OpenTelemetry connects the dots between services by propagating trace context — install the auto-instrumentors, point them at a collector, and you get request-level visibility across your entire Python architecture.