Python perf_counter Timing — Deep Dive

Build production timing infrastructure with perf_counter: middleware latency tracking, percentile analysis, flame graph integration, and async-aware measurement.

Technical perspective

Precise timing is the foundation of performance engineering. Without reliable measurements, optimisation is guesswork. Python’s time.perf_counter() provides access to the highest-resolution monotonic clock available on the platform — QueryPerformanceCounter on Windows, clock_gettime(CLOCK_MONOTONIC) on Linux, mach_absolute_time on macOS. Understanding the timer’s capabilities, limitations, and integration patterns is essential for building production observability.

Platform-specific clock sources

Platform	Clock source	Typical resolution
Linux	`CLOCK_MONOTONIC`	~1 nanosecond
macOS	`mach_absolute_time`	~1 nanosecond
Windows	`QueryPerformanceCounter`	~100 nanoseconds

You can check your system’s resolution:

import time
print(f"perf_counter resolution: {time.get_clock_info('perf_counter')}")
# namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)',
#           monotonic=True, resolution=1e-09)

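The adjustable=False and monotonic=True fields mean the clock cannot be set or stepped, so it never jumps backward, even when NTP corrects the system time or local time shifts for daylight saving. On Linux, NTP may still slew the rate of CLOCK_MONOTONIC slightly, which matters only for very long intervals.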
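The reported resolution is what the platform advertises, not necessarily the step you will observe. As a rough check, the sketch below records the smallest non-zero gap between consecutive readings; the helper name observed_tick_ns is hypothetical, and the result also reflects the cost of the Python call itself.

```python
import time

def observed_tick_ns(samples: int = 100_000) -> int:
    """Smallest non-zero gap between consecutive perf_counter_ns() readings."""
    smallest = None
    previous = time.perf_counter_ns()
    for _ in range(samples):
        current = time.perf_counter_ns()
        delta = current - previous
        if delta > 0 and (smallest is None or delta < smallest):
            smallest = delta
        previous = current
    # If no reading ever advanced (not expected in practice), report 0
    return smallest or 0

info = time.get_clock_info("perf_counter")
print(f"reported resolution: {info.resolution * 1e9:.0f} ns")
print(f"smallest observed step: {observed_tick_ns()} ns")
```

On most machines the observed step is larger than the reported resolution, because each reading includes interpreter and system-call overhead.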

Nanosecond precision and float limitations

IEEE 754 doubles (Python’s float) carry a 53-bit significand, roughly 15 to 17 significant decimal digits. The reference point of perf_counter() is undefined, but on Linux the value typically counts seconds since boot, so the spacing between representable floats widens as uptime grows:

import time

# After 7 days of uptime:
# perf_counter() ≈ 604800.0
# Float spacing: ~0.12 nanoseconds (still finer than a 1 ns clock tick)

# After 115 days:
# perf_counter() ≈ 10000000.0
# Float spacing: ~1.9 nanoseconds (coarser than a 1 ns clock tick)

For nanosecond-level measurements on long-running systems, use perf_counter_ns():

start = time.perf_counter_ns()
operation()
elapsed_ns = time.perf_counter_ns() - start
# Integer subtraction: no precision loss in the elapsed value
elapsed_us = elapsed_ns / 1000        # float conversion, for reporting only
elapsed_ms = elapsed_ns / 1_000_000

The nanosecond variant returns int, avoiding float precision issues entirely.
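The float spacing at a given uptime can be checked directly with math.ulp (Python 3.9+), which returns the gap between a float and the next representable value. This is a small sketch that assumes the counter holds seconds since boot, as is typical on Linux; the chosen uptimes are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400

for days in (1, 7, 115, 365):
    value = float(days * SECONDS_PER_DAY)
    # math.ulp gives the distance to the next representable float, in seconds
    spacing_ns = math.ulp(value) * 1e9
    print(f"{days:>3} days: perf_counter() ~ {value:>12.1f} s, "
          f"float spacing ~ {spacing_ns:.3f} ns")
```

The spacing doubles each time the value crosses a power of two, which is why precision degrades in steps rather than smoothly.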

Production timing middleware

A FastAPI middleware that tracks request latency with percentile reporting:

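import threading
import time
from collections import deque
from dataclasses import dataclass, field
from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware

@dataclass
class LatencyTracker:
    """Latency tracking over a fixed-size sliding window, guarded by a lock."""
    window_size: int = 10000
    _samples: deque = field(init=False, repr=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)

    def __post_init__(self) -> None:
        # Size the window from window_size rather than a hard-coded maxlen
        self._samples = deque(maxlen=self.window_size)

    def record(self, duration_ms: float) -> None:
        with self._lock:
            self._samples.append(duration_ms)

    def _snapshot(self) -> list[float]:
        # Copy and sort under the lock so concurrent appends cannot
        # mutate the deque while it is being iterated
        with self._lock:
            return sorted(self._samples)

    @staticmethod
    def _nearest_rank(ordered: list[float], p: float) -> float:
        idx = int(len(ordered) * p / 100)
        return ordered[min(idx, len(ordered) - 1)]

    def percentile(self, p: float) -> float | None:
        ordered = self._snapshot()
        return self._nearest_rank(ordered, p) if ordered else None

    def stats(self) -> dict:
        ordered = self._snapshot()  # sort once, reuse for every statistic
        if not ordered:
            return {"count": 0}
        return {
            "count": len(ordered),
            "p50": self._nearest_rank(ordered, 50),
            "p95": self._nearest_rank(ordered, 95),
            "p99": self._nearest_rank(ordered, 99),
            "min": ordered[0],
            "max": ordered[-1],
        }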

latency = LatencyTracker()

class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration_ms = (time.perf_counter() - start) * 1000
        latency.record(duration_ms)
        response.headers["X-Response-Time"] = f"{duration_ms:.2f}ms"
        return response

app = FastAPI()
app.add_middleware(TimingMiddleware)

@app.get("/metrics/latency")
async def get_latency():
    return latency.stats()
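To see why the endpoint reports percentiles rather than an average, the following sketch applies the same index rule to a made-up sample. The helper nearest_rank_percentile and the latency values are hypothetical, chosen only to show how a few slow requests distort the mean.

```python
import statistics

def nearest_rank_percentile(samples: list[float], p: float) -> float:
    """Value at the p-th percentile, using the tracker's index rule."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    idx = min(int(len(ordered) * p / 100), len(ordered) - 1)
    return ordered[idx]

# Hypothetical latencies in ms: 98 fast requests and 2 slow outliers
latencies_ms = [10.0] * 98 + [2000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.1f}ms p99={p99_ms:.1f}ms")
```

The mean lands near 50 ms, a latency no request actually experienced, while p50 and p99 describe the typical request and the tail separately.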

Async-aware timing

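Timing async code requires care: await suspends the coroutine, and other coroutines run during the suspension. perf_counter measures wall time (including suspension), while process_time counts only CPU time consumed by the whole process, so it barely advances while the event loop sits idle but does include CPU work done by other coroutines in the meantime: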

import asyncio
import time

async def example():
    # Wall time includes the sleep
    wall_start = time.perf_counter()
    cpu_start = time.process_time()

    await asyncio.sleep(1.0)  # Simulating I/O

    wall_elapsed = time.perf_counter() - wall_start  # ≈ 1.0s
    cpu_elapsed = time.process_time() - cpu_start      # ≈ 0.0s

For async applications, you typically want wall time (perf_counter) because users experience latency, not CPU time.
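One consequence is that per-task wall times overlap when tasks run concurrently, so they do not add up to the total. The sketch below illustrates this with asyncio.gather; the task names and delays are arbitrary, and asyncio.sleep stands in for real I/O.

```python
import asyncio
import time

async def timed_sleep(name: str, delay: float) -> tuple[str, float]:
    """Sleep for `delay` seconds and report this task's own wall time."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return name, time.perf_counter() - start

async def main() -> tuple[list[tuple[str, float]], float]:
    start = time.perf_counter()
    # The three tasks wait concurrently on the same event loop
    results = await asyncio.gather(
        timed_sleep("a", 0.05),
        timed_sleep("b", 0.05),
        timed_sleep("c", 0.05),
    )
    total = time.perf_counter() - start
    return list(results), total

per_task, total = asyncio.run(main())
for name, elapsed in per_task:
    print(f"task {name}: {elapsed:.3f}s")
print(f"total: {total:.3f}s")
```

Each task reports roughly its own delay, while the total stays close to the longest single task rather than the sum.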

Async timing decorator

import functools
import time

import httpx  # third-party HTTP client used by fetch_data below

def async_timed(func):
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return await func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__}: {elapsed:.4f}s")
    return wrapper

@async_timed
async def fetch_data(url: str):
    async with httpx.AsyncClient() as client:
        return await client.get(url)

Comparative benchmarking framework

A reusable framework for comparing implementations:

import time
import statistics
from typing import Callable, Any

def benchmark(
    functions: dict[str, Callable],
    args: tuple = (),
    kwargs: dict | None = None,
    iterations: int = 1000,
    warmup: int = 100,
) -> dict[str, dict]:
    kwargs = kwargs or {}
    results = {}

    for name, func in functions.items():
        # Warmup — populate caches, trigger JIT (if using PyPy)
        for _ in range(warmup):
            func(*args, **kwargs)

        # Measure
        times = []
        for _ in range(iterations):
            start = time.perf_counter_ns()
            func(*args, **kwargs)
            times.append(time.perf_counter_ns() - start)

        results[name] = {
            "median_ns": statistics.median(times),
            "mean_ns": statistics.mean(times),
            "stdev_ns": statistics.stdev(times) if len(times) > 1 else 0,
            "min_ns": min(times),
            "max_ns": max(times),
            "p95_ns": sorted(times)[int(len(times) * 0.95)],
        }

    return results

# Usage
import heapq

data = list(range(10000))

results = benchmark({
    "sorted()": lambda: sorted(data),
    "list.sort()": lambda: data.copy().sort(),
    # Import heapq once up front so the lookup is not part of the timed call
    "heapq": lambda: heapq.nsmallest(len(data), data),
}, iterations=500)

for name, stats in results.items():
    print(f"{name}: median={stats['median_ns']/1e6:.3f}ms p95={stats['p95_ns']/1e6:.3f}ms")

Integration with cProfile and line_profiler

perf_counter gives you targeted timing. For holistic profiling, combine it with Python’s profiling tools:

import cProfile
import pstats
import time

def profile_with_timing(func, *args, **kwargs):
    """Run cProfile and wall-clock timing together."""
    profiler = cProfile.Profile()

    wall_start = time.perf_counter()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    wall_elapsed = time.perf_counter() - wall_start

    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    stats.print_stats(20)

    print(f"\nWall time: {wall_elapsed:.4f}s")
    return result

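Because the profiler is enabled inside the timed region, the perf_counter wall time here includes cProfile’s own instrumentation overhead, which can be substantial for call-heavy code. cProfile attributes time to individual calls, including built-in and C-extension functions and time spent blocked on I/O, but it cannot show where time goes inside C code. Treat the wall time as the end-to-end figure and the profile as a guide to where it went.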
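To gauge how much the profiler itself costs, one option is to time the same call with and without it, and capture the report as a string instead of printing it. This is a sketch: busy and timed are hypothetical helpers, and the workload size is arbitrary.

```python
import cProfile
import io
import pstats
import time

def busy(n: int) -> int:
    """Hypothetical CPU-bound workload with many generator steps."""
    return sum(i * i for i in range(n))

def timed(func, *args):
    """Return (result, wall seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Wall time without the profiler
_, plain_s = timed(busy, 200_000)

# Wall time with the profiler attached for the duration of the call
profiler = cProfile.Profile()
_, profiled_s = timed(profiler.runcall, busy, 200_000)

# Write the report into a buffer so it can be logged or stored
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()

print(f"plain: {plain_s:.4f}s, profiled: {profiled_s:.4f}s")
```

The gap between the two figures is a rough estimate of instrumentation overhead for this workload; it varies with how many function calls the code makes.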

Avoiding timing pitfalls

Garbage collection interference

GC pauses can spike measurements. For micro-benchmarks, disable GC:

import gc
import time

gc.disable()
try:
    start = time.perf_counter_ns()
    operation()
    elapsed = time.perf_counter_ns() - start
finally:
    gc.enable()

But only for isolated benchmarks — disabling GC in production is dangerous.
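A context manager keeps the disable/enable pair together and restores whatever state the collector was in beforehand. The name gc_paused is hypothetical, and the up-front gc.collect() is an optional choice so that garbage left over from earlier work is not collected right after the benchmark.

```python
import gc
import time
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Disable the cyclic GC inside the block, then restore its prior state."""
    was_enabled = gc.isenabled()
    gc.collect()   # clear existing garbage before measuring
    gc.disable()
    try:
        yield
    finally:
        # Only re-enable if the collector was on when we entered
        if was_enabled:
            gc.enable()

with gc_paused():
    start = time.perf_counter_ns()
    total = sum(range(100_000))
    elapsed_ns = time.perf_counter_ns() - start

print(f"elapsed: {elapsed_ns} ns")
```

Reference counting still frees most objects while the collector is off; only cycle detection is paused.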

CPU frequency scaling

Modern CPUs dynamically adjust clock speed. A function that runs in 1ms when the CPU is at full speed might take 3ms when throttled. For consistent benchmarks:

Run multiple iterations and report percentiles
Warm up the CPU before measuring (a few seconds of computation)
On Linux, consider setting the CPU governor to performance for benchmarks

Measuring too little

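Operations that take less than roughly 100 ns cost about as much as the perf_counter call itself, so a single measurement is mostly timer overhead. For such fine-grained measurements, time a batch and divide (the result still includes the loop’s own overhead):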

start = time.perf_counter_ns()
for _ in range(1_000_000):
    operation()
per_op_ns = (time.perf_counter_ns() - start) / 1_000_000
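The standard library’s timeit module wraps the same batching idea and uses perf_counter as its default timer. As a sketch, with operation as a hypothetical stand-in for the code under test and the batch size chosen arbitrarily:

```python
import timeit

def operation() -> int:
    """Hypothetical stand-in for a very cheap operation."""
    return 1 + 2

number = 100_000
# Five independent batches; each entry is the total seconds for `number` calls
runs = timeit.repeat(operation, number=number, repeat=5)

# The minimum is the run least disturbed by other activity on the machine
per_op_ns = min(runs) / number * 1e9
print(f"best of 5: {per_op_ns:.1f} ns per call")
```

timeit also disables garbage collection while timing by default, which removes one source of noise from the batch.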

Compiler/interpreter optimisation

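CPython’s compiler folds constant expressions at compile time, and JIT-based interpreters such as PyPy can remove far more. CPython does not otherwise eliminate unused computations, but it is still good practice to ensure the timed code produces a result you actually use:

# Bad: 1 + 1 is folded to a constant at compile time, so no addition is timed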
start = time.perf_counter()
1 + 1
elapsed = time.perf_counter() - start

# Better — store and use the result
start = time.perf_counter()
result = sum(range(10000))
elapsed = time.perf_counter() - start
assert result  # Ensure result is used
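The constant folding can be observed by disassembling the compiled code. This sketch assumes CPython, where binary operations appear as opcodes whose names start with BINARY_; the exact opcode names differ between versions, and other interpreters may behave differently.

```python
import dis

def binary_ops(source: str) -> list[str]:
    """Names of binary-operation opcodes in the compiled source."""
    code = compile(source, "<example>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)
            if ins.opname.startswith("BINARY_")]

# Both operands are literals, so the compiler folds the addition away
folded = binary_ops("x = 1 + 1")

# The operands are names, so the addition must happen at run time
kept = binary_ops("x = a + b")

print(f"constant expression: {folded}")
print(f"variable expression: {kept}")
```

An empty list for the first case means there is no addition left to time, which is why a benchmark of a literal expression measures only the timer calls.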

Structured logging with timing

For production observability, emit timing data as structured logs:

import functools
import json
import logging
import time

logger = logging.getLogger(__name__)

def timed_operation(operation_name: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                duration = time.perf_counter() - start
                logger.info(json.dumps({
                    "event": "operation_complete",
                    "operation": operation_name,
                    "duration_ms": round(duration * 1000, 2),
                    "status": "success",
                }))
                return result
            except Exception as e:
                duration = time.perf_counter() - start
                logger.error(json.dumps({
                    "event": "operation_failed",
                    "operation": operation_name,
                    "duration_ms": round(duration * 1000, 2),
                    "status": "error",
                    "error": str(e),
                }))
                raise
        return wrapper
    return decorator

This feeds naturally into log aggregation systems (ELK, Datadog, Grafana Loki) for dashboarding and alerting on latency regressions.

The one thing to remember: perf_counter is the foundation — but production timing needs statistical rigour (percentiles over averages), awareness of measurement pitfalls (GC, CPU scaling, float precision), and integration with observability infrastructure to turn numbers into actionable insights.
