Distributed Tracing with OpenTelemetry in Python — Deep Dive

Production OpenTelemetry tracing in Python: custom propagators, tail-based sampling, span links, baggage, and performance-conscious instrumentation.

Getting OpenTelemetry traces working in a demo takes an afternoon. Making them useful in production — where sampling decisions affect costs, context propagation crosses async boundaries, and instrumentation must not degrade latency — takes deliberate engineering.

SDK architecture

The Python OpenTelemetry SDK has a layered design:

API layer (opentelemetry-api): Defines interfaces. Library and business-logic code imports only this; the SDK is wired up once at application startup.
SDK layer (opentelemetry-sdk): Implements the API. Configures providers, processors, and exporters.
Instrumentation libraries: Automatically wrap frameworks and libraries.
Exporters: Send data to backends via OTLP, Zipkin, or custom protocols. Jaeger ingests OTLP directly, so it needs no dedicated exporter.

This separation means library authors can instrument their code against the API without forcing SDK dependencies on users.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.semconv.resource import ResourceAttributes

resource = Resource.create({
    ResourceAttributes.SERVICE_NAME: "order-service",
    ResourceAttributes.SERVICE_VERSION: "2.4.1",
    ResourceAttributes.DEPLOYMENT_ENVIRONMENT: "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317"),
        max_queue_size=2048,
        max_export_batch_size=512,
        schedule_delay_millis=5000,
    )
)
trace.set_tracer_provider(provider)

The Resource attaches metadata to every span. The BatchSpanProcessor buffers spans and exports them in batches, reducing network overhead.

Context propagation in depth

W3C Trace Context

The default propagator uses W3C traceparent and tracestate headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             version-trace_id-parent_id-flags

The tracestate header carries vendor-specific data. If you use multiple tracing systems during migration, both can coexist via tracestate.
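To make the header layout concrete, here is a small standard-library sketch that splits a traceparent value into its four fields. It handles only the version-00 shape shown above, and parse_traceparent is a hypothetical helper, not part of the OpenTelemetry API (the SDK's propagator does this for you).

```python
import re
from typing import NamedTuple, Optional

# version-trace_id-parent_id-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: int
    trace_id: int
    parent_id: int
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = (int(g, 16) for g in match.groups())
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == 0xFF or trace_id == 0 or parent_id == 0:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(flags & 0x01))


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```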

Propagation across async boundaries

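In asyncio applications, context is stored in contextvars and flows automatically through await chains. Tasks created with asyncio.create_task copy the current context at creation time, so they inherit the active span without extra work. Explicit propagation is only needed when the work leaves that context, for example through loop.run_in_executor, or when a task should run under a context other than the one it was created in:

import asyncio
from opentelemetry import context, trace

tracer = trace.get_tracer(__name__)

async def background_work():
    # Child of whichever span is current in this task's context
    with tracer.start_as_current_span("background"):
        await do_work()

async def run_under(ctx, coro_fn):
    # Attach a captured context inside the task, then restore the old one
    token = context.attach(ctx)
    try:
        await coro_fn()
    finally:
        context.detach(token)

async def handler():
    with tracer.start_as_current_span("handler"):
        # Inherits the "handler" span automatically
        inherited = asyncio.create_task(background_work())
        # Explicit form, for a context captured elsewhere
        ctx = context.get_current()
        explicit = asyncio.create_task(run_under(ctx, background_work))
        await asyncio.gather(inherited, explicit)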

For thread pools, use opentelemetry.context.attach explicitly or use the opentelemetry-instrumentation-threading package.
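The difference between tasks and thread pools can be seen with the standard library alone. The sketch below uses a plain ContextVar as a stand-in for the variable OpenTelemetry keeps its context in; current_trace and the helper names are hypothetical.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for OpenTelemetry's context storage.
current_trace = contextvars.ContextVar("current_trace", default=None)


def read_trace():
    return current_trace.get()


async def read_trace_async():
    return current_trace.get()


async def main():
    current_trace.set("trace-abc")
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # create_task copies the caller's context automatically.
        from_task = await asyncio.create_task(read_trace_async())
        # A plain executor call runs in the worker thread's own context.
        from_pool = await loop.run_in_executor(pool, read_trace)
        # Copying the context and running inside it carries the value across.
        ctx = contextvars.copy_context()
        from_pool_with_ctx = await loop.run_in_executor(pool, ctx.run, read_trace)
    return from_task, from_pool, from_pool_with_ctx


results = asyncio.run(main())
```

The task and the ctx.run call both see the value, while the bare executor call sees the default. Attaching an OpenTelemetry context inside the worker function addresses the same gap.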

Propagation through message queues

When publishing to Kafka, RabbitMQ, or NATS, inject trace context into message headers:

from opentelemetry.propagate import inject

headers = {}
inject(headers)
# With an active span, headers now contains {"traceparent": "00-..."},
# plus "tracestate" when the trace carries vendor state
# Include these headers in your message

On the consumer side:

from opentelemetry.propagate import extract

ctx = extract(carrier=message.headers)
with tracer.start_as_current_span("process_message", context=ctx):
    handle(message)

This creates a causal chain from producer to consumer spans, even across different services and languages.

Span links and events

Span links

When a span is causally related to another span but is not a direct child, use links:

# Batch processor that handles messages from multiple traces
link1 = trace.Link(msg1_span_context)
link2 = trace.Link(msg2_span_context)

with tracer.start_as_current_span("process_batch", links=[link1, link2]):
    process([msg1, msg2])

Links are useful for batch operations, fan-in patterns, and retry relationships where the new attempt relates to the original but is not a child.

Span events

Events are timestamped annotations within a span:

with tracer.start_as_current_span("checkout") as span:
    span.add_event("inventory_checked", {"items": 3})
    # ... processing ...
    span.add_event("payment_authorized", {"amount": 42.50})

Events appear as markers on the span timeline. They are lighter than child spans when you want to annotate without creating new timing units.

Baggage

Baggage propagates key-value pairs across all services in a trace without adding them to every span:

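from opentelemetry import baggage, context

# set_baggage returns a new Context; attach it to make it current
ctx = baggage.set_baggage("tenant.id", "acme-corp")
token = context.attach(ctx)
try:
    # Readable here, and injected into outgoing requests for downstream services
    tenant = baggage.get_baggage("tenant.id")
finally:
    context.detach(token)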

Use baggage sparingly — it adds to every outgoing request header. Good for tenant ID, experiment cohort, or priority level. Bad for large payloads.

Sampling strategies

Head-based sampling

The decision is made when the root span is created, before anything is known about how the trace will turn out. The TraceIdRatioBased sampler is the simplest:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # Sample 10% of traces
provider = TracerProvider(sampler=sampler, resource=resource)

The ParentBased sampler respects the parent’s sampling decision, ensuring consistency across services:

from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.1))
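Ratio sampling stays consistent because the decision is a pure function of the trace ID rather than a random draw per service. The standard-library sketch below illustrates that idea by comparing the low 64 bits of the ID against a bound; it is an illustration of the technique, not the SDK's code, and ratio_sampled is a hypothetical helper.

```python
import random

ID_SPACE = 2**64  # the sketch looks at the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of all trace IDs."""
    bound = round(ratio * ID_SPACE)
    return (trace_id % ID_SPACE) < bound


# Random 128-bit trace IDs from a seeded generator, for repeatability.
rng = random.Random(7)
trace_ids = [rng.getrandbits(128) for _ in range(20_000)]

# Two "services" evaluating the same IDs reach identical decisions.
decisions_a = [ratio_sampled(t, 0.1) for t in trace_ids]
decisions_b = [ratio_sampled(t, 0.1) for t in trace_ids]
kept_fraction = sum(decisions_a) / len(trace_ids)
```

Because the outcome depends only on the ID, two services configured with the same ratio agree on which traces to keep even without a parent decision to follow.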

Tail-based sampling

Head-based sampling misses interesting traces (errors, high latency) that happen to fall in the unsampled 90%. Tail-based sampling defers the decision until the trace is complete.

This is implemented in the OpenTelemetry Collector, not in the application:

# Collector config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

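This keeps all error traces, all traces over 1 second, and 5% of everything else. The collector buffers spans in memory until decision_wait expires, so every span of a trace must reach the same collector instance; with several collectors, route by trace ID in front of them, for example with the load-balancing exporter.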
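The combined effect of the three policies can be sketched in a few lines of Python. This only models the decision logic: SpanSummary and keep_trace are hypothetical names, and trace duration is approximated by the longest span, whereas the collector measures the whole trace.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SpanSummary:
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: Sequence[SpanSummary],
    threshold_ms: float = 1000.0,
    baseline_pct: float = 5.0,
    rng: Optional[random.Random] = None,
) -> bool:
    """Keep the trace if any policy matches (policies are OR-ed)."""
    rng = rng or random.Random()
    # Policy 1: any span finished with an error status.
    if any(span.is_error for span in spans):
        return True
    # Policy 2: the trace was slow (approximated by its longest span).
    if max((span.duration_ms for span in spans), default=0.0) > threshold_ms:
        return True
    # Policy 3: probabilistic baseline for everything else.
    return rng.random() * 100 < baseline_pct


error_kept = keep_trace([SpanSummary(12.0, True)], baseline_pct=0)
slow_kept = keep_trace([SpanSummary(1500.0, False)], baseline_pct=0)
fast_ok_kept = keep_trace([SpanSummary(12.0, False)], baseline_pct=0)
```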

Instrumentation for specific frameworks

FastAPI

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

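This creates a server span for each request, records HTTP method, route, and status code attributes, and extracts incoming trace context so the span joins the caller's trace. Propagating context to downstream calls additionally requires instrumenting the HTTP client in use, for example requests or httpx.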

SQLAlchemy

from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

SQLAlchemyInstrumentor().instrument(engine=engine)

Every SQL query becomes a span with db.statement, db.system, and timing information. Slow queries become immediately visible in traces.

Celery

from opentelemetry.instrumentation.celery import CeleryInstrumentor

CeleryInstrumentor().instrument()

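Task enqueue creates a producer span; task execution creates a consumer span. The trace flows from the web request through the task queue to the worker. In prefork workers, call instrument() from a worker_process_init signal handler so that tracing is set up in each child process after the fork.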

Performance considerations

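OpenTelemetry overhead in Python is measurable but manageable. Treat the figures below as rough orders of magnitude that vary with interpreter, hardware, and attribute count, and benchmark on your own workload: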

Span creation: ~1-5 μs per span (without export)
Context propagation: ~0.5 μs per inject/extract
BatchSpanProcessor: Exports asynchronously, minimal impact on request latency
Memory: Each buffered span uses roughly 1-2 KB

To minimize impact:

Use BatchSpanProcessor (not SimpleSpanProcessor) in production.
Set a reasonable max_queue_size: once the queue is full, spans are dropped.
Sample aggressively in high-throughput services (1-10%).
Avoid adding large attributes to spans — they increase memory and export size.
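Set the OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables to change sampling without a code change. The SDK reads them at startup, so a process restart is required, and they only apply when no sampler is passed to TracerProvider explicitly.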
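For the point about large attributes, one application-level option is to clip string values before they are set on a span. The helper below is a standard-library sketch; clip_attributes and the 256-character limit are assumptions for illustration, not SDK defaults.

```python
from typing import Any, Dict


def clip_attributes(attributes: Dict[str, Any], max_len: int = 256) -> Dict[str, Any]:
    """Return a copy with over-long string values truncated to max_len."""
    clipped = {}
    for key, value in attributes.items():
        if isinstance(value, str) and len(value) > max_len:
            value = value[:max_len]
        clipped[key] = value
    return clipped


raw = {"http.route": "/orders", "request.body": "x" * 10_000, "items": 3}
safe = clip_attributes(raw, max_len=64)
```

The result can be passed to span.set_attributes so that a single oversized value does not inflate every exported batch.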

Correlating traces with logs

Inject trace context into log records for trace-log correlation:

import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s [trace=%(trace_id)s span=%(span_id)s] %(message)s"
))

With trace IDs in logs, you can jump from a log line in Grafana directly to the full trace in Tempo or Jaeger.

One thing to remember: Production-grade OpenTelemetry requires tail-based sampling in the collector, explicit context propagation across async and message-queue boundaries, and trace-log correlation — the auto-instrumentors get you started, but these details make traces genuinely useful for debugging.
