Python perf_counter Timing — Deep Dive
Technical perspective
Precise timing is the foundation of performance engineering. Without reliable measurements, optimisation is guesswork. Python’s time.perf_counter() provides access to the highest-resolution monotonic clock available on the platform — QueryPerformanceCounter on Windows, clock_gettime(CLOCK_MONOTONIC) on Linux, mach_absolute_time on macOS. Understanding the timer’s capabilities, limitations, and integration patterns is essential for building production observability.
Platform-specific clock sources
| Platform | Clock source | Typical resolution |
|---|---|---|
| Linux | CLOCK_MONOTONIC | ~1 nanosecond |
| macOS | mach_absolute_time | ~1 nanosecond |
| Windows | QueryPerformanceCounter | ~100 nanoseconds |
You can check your system’s resolution:
import time
print(f"perf_counter resolution: {time.get_clock_info('perf_counter')}")
# namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)',
# monotonic=True, resolution=1e-09)
The adjustable=False and monotonic=True guarantees mean the clock never jumps backward, even during NTP adjustments or daylight saving transitions.
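To put those flags in context, it can help to print the same metadata for the other clocks in the time module. The sketch below is illustrative only: the clock names are the standard ones accepted by time.get_clock_info, while describe_clocks is a hypothetical helper name.

```python
import time


def describe_clocks(names=("time", "monotonic", "perf_counter", "process_time")):
    """Return {clock name: (monotonic, adjustable, resolution)} for each clock."""
    summary = {}
    for name in names:
        info = time.get_clock_info(name)
        summary[name] = (info.monotonic, info.adjustable, info.resolution)
    return summary


if __name__ == "__main__":
    for name, (monotonic, adjustable, resolution) in describe_clocks().items():
        print(f"{name:>13}: monotonic={monotonic} adjustable={adjustable} "
              f"resolution={resolution:g}s")
```

The wall clock (time) is the one reported as adjustable, which is why it is a poor choice for measuring durations.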
Nanosecond precision and float limitations
IEEE 754 doubles (Python’s float) carry a 53-bit mantissa, roughly 15 to 16 significant decimal digits. As perf_counter() values grow (on many platforms the counter starts around boot), the gap between adjacent representable floats widens:
import time
# After 7 days of uptime:
# perf_counter() ≈ 604800.0
# Float spacing: ~0.12 nanoseconds
# After ~115 days:
# perf_counter() ≈ 10000000.0
# Float spacing: ~1.9 nanoseconds (coarser than a 1 ns clock tick)
For nanosecond-level measurements on long-running systems, use perf_counter_ns():
start = time.perf_counter_ns()
operation()  # placeholder for the code being timed
elapsed_ns = time.perf_counter_ns() - start  # exact integer subtraction
# Convert for display only: true division returns a float
elapsed_us = elapsed_ns / 1000
elapsed_ms = elapsed_ns / 1_000_000
The nanosecond variant returns int, avoiding float precision issues entirely.
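The float spacing at a given counter value can be checked directly with math.ulp (available from Python 3.9). This is a small sketch; float_spacing_ns is a hypothetical helper name and the uptime values are just sample magnitudes.

```python
import math


def float_spacing_ns(seconds: float) -> float:
    """Gap between `seconds` and the next representable float, in nanoseconds."""
    return math.ulp(seconds) * 1e9


for label, uptime in [("1 hour", 3600.0), ("7 days", 604800.0), ("~115 days", 1e7)]:
    print(f"{label:>10}: spacing ≈ {float_spacing_ns(uptime):.4f} ns")
```

Once the spacing exceeds the clock's own tick, consecutive float readings can no longer distinguish individual ticks, which is the case perf_counter_ns() avoids.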
Production timing middleware
A FastAPI middleware that tracks request latency with percentile reporting:
import time
from collections import deque
from dataclasses import dataclass, field

from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware


@dataclass
class LatencyTracker:
    """Latency tracking over a fixed-size sliding window.

    Safe within a single event loop; add a lock if threads record concurrently.
    """

    window_size: int = 10000
    _samples: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Size the window from window_size rather than a hard-coded maxlen
        self._samples = deque(maxlen=self.window_size)

    def record(self, duration_ms: float) -> None:
        self._samples.append(duration_ms)

    def percentile(self, p: float) -> float | None:
        if not self._samples:
            return None
        sorted_samples = sorted(self._samples)
        idx = int(len(sorted_samples) * p / 100)
        return sorted_samples[min(idx, len(sorted_samples) - 1)]

    def stats(self) -> dict:
        if not self._samples:
            return {"count": 0}
        return {
            "count": len(self._samples),
            "p50": self.percentile(50),
            "p95": self.percentile(95),
            "p99": self.percentile(99),
            "min": min(self._samples),
            "max": max(self._samples),
        }


latency = LatencyTracker()


class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration_ms = (time.perf_counter() - start) * 1000
        latency.record(duration_ms)
        response.headers["X-Response-Time"] = f"{duration_ms:.2f}ms"
        return response


app = FastAPI()
app.add_middleware(TimingMiddleware)


@app.get("/metrics/latency")
async def get_latency():
    return latency.stats()
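The percentile method above indexes straight into the sorted samples. As a cross-check, the standard library's statistics.quantiles computes interpolated cut points. The sketch below assumes a plain list of millisecond samples, and percentiles_ms is a hypothetical helper name, not part of the middleware.

```python
import statistics


def percentiles_ms(samples: list[float]) -> dict[str, float]:
    """p50/p95/p99 via statistics.quantiles (expects at least two samples)."""
    # n=100 yields 99 cut points; cut point k-1 is the k-th percentile
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


samples = [float(ms) for ms in range(1, 101)]  # synthetic data: 1.0 .. 100.0 ms
print(percentiles_ms(samples))
```

With small windows the two approaches can disagree noticeably, so it is worth deciding on one definition and using it consistently across dashboards.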
Async-aware timing
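Timing async code requires care: await suspends the coroutine, and other coroutines run during the suspension. perf_counter measures wall time (including suspension), while process_time counts only CPU time consumed by the process, so it barely moves while the event loop sits idle waiting on I/O (it still advances if other coroutines burn CPU in the meantime):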
import asyncio
import time


async def example():
    # Wall time includes the sleep
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    await asyncio.sleep(1.0)  # Simulating I/O
    wall_elapsed = time.perf_counter() - wall_start  # ≈ 1.0s
    cpu_elapsed = time.process_time() - cpu_start  # ≈ 0.0s
For async applications, you typically want wall time (perf_counter) because users experience latency, not CPU time.
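To make the difference between the two clocks concrete, the following sketch measures both around an I/O-like await and around a CPU-bound coroutine. The names measure, io_like, and cpu_bound are hypothetical, and the sleep duration and loop size are arbitrary choices for illustration.

```python
import asyncio
import time


async def measure(coro):
    """Await `coro`; return (wall seconds, process CPU seconds) spent meanwhile."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    await coro
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


async def io_like():
    await asyncio.sleep(0.2)  # the event loop is idle, so little CPU is used


async def cpu_bound():
    total = 0
    for i in range(2_000_000):  # pure Python loop: burns CPU and never yields
        total += i
    return total


async def main():
    io_wall, io_cpu = await measure(io_like())
    cpu_wall, cpu_cpu = await measure(cpu_bound())
    print(f"io-like:   wall={io_wall:.3f}s cpu={io_cpu:.3f}s")
    print(f"cpu-bound: wall={cpu_wall:.3f}s cpu={cpu_cpu:.3f}s")
    return io_wall, io_cpu, cpu_wall, cpu_cpu


if __name__ == "__main__":
    asyncio.run(main())
```

A large gap between wall and CPU time points at waiting; similar values point at computation that is blocking the event loop.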
Async timing decorator
import functools
import time

import httpx


def async_timed(func):
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return await func(*args, **kwargs)
        finally:
            # Runs on success and on exception, so failures are timed too
            elapsed = time.perf_counter() - start
            print(f"{func.__name__}: {elapsed:.4f}s")
    return wrapper


@async_timed
async def fetch_data(url: str):
    async with httpx.AsyncClient() as client:
        return await client.get(url)
Comparative benchmarking framework
A reusable framework for comparing implementations:
import heapq
import statistics
import time
from typing import Callable


def benchmark(
    functions: dict[str, Callable],
    args: tuple = (),
    kwargs: dict | None = None,
    iterations: int = 1000,
    warmup: int = 100,
) -> dict[str, dict]:
    kwargs = kwargs or {}
    results = {}
    for name, func in functions.items():
        # Warmup: populate caches, trigger JIT (if using PyPy)
        for _ in range(warmup):
            func(*args, **kwargs)
        # Measure
        times = []
        for _ in range(iterations):
            start = time.perf_counter_ns()
            func(*args, **kwargs)
            times.append(time.perf_counter_ns() - start)
        results[name] = {
            "median_ns": statistics.median(times),
            "mean_ns": statistics.mean(times),
            "stdev_ns": statistics.stdev(times) if len(times) > 1 else 0,
            "min_ns": min(times),
            "max_ns": max(times),
            "p95_ns": sorted(times)[int(len(times) * 0.95)],
        }
    return results


# Usage
data = list(range(10000))
results = benchmark({
    "sorted()": lambda: sorted(data),
    "list.sort()": lambda: data.copy().sort(),
    # heapq is imported at module level so the import is not part of the timing
    "heapq": lambda: heapq.nsmallest(len(data), data),
}, iterations=500)
for name, stats in results.items():
    print(f"{name}: median={stats['median_ns']/1e6:.3f}ms p95={stats['p95_ns']/1e6:.3f}ms")
Integration with cProfile and line_profiler
perf_counter gives you targeted timing. For holistic profiling, combine it with Python’s profiling tools:
import cProfile
import pstats
import time


def profile_with_timing(func, *args, **kwargs):
    """Run cProfile and wall-clock timing together."""
    profiler = cProfile.Profile()
    wall_start = time.perf_counter()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    wall_elapsed = time.perf_counter() - wall_start
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    stats.print_stats(20)
    print(f"\nWall time: {wall_elapsed:.4f}s")
    return result
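cProfile attributes time to Python-level and built-in calls but cannot see inside C extensions, and its instrumentation adds overhead to every call. Because the timer here brackets the profiled run, the wall time includes that overhead; use the profile for the relative breakdown and an unprofiled perf_counter measurement for the absolute figure.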
Avoiding timing pitfalls
Garbage collection interference
GC pauses can spike measurements. For micro-benchmarks, disable GC:
import gc
import time

gc.disable()
try:
    start = time.perf_counter_ns()
    operation()
    elapsed = time.perf_counter_ns() - start
finally:
    gc.enable()
But only for isolated benchmarks — disabling GC in production is dangerous.
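One refinement worth considering is to restore the collector's previous state instead of enabling it unconditionally, so a benchmark nested inside code that already disabled GC does not switch it back on. A minimal sketch, where gc_paused is a hypothetical helper name and the sorted call is only a stand-in workload:

```python
import gc
import time
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic GC inside the block, then restore its prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:  # do not switch GC on if the caller had it off
            gc.enable()


with gc_paused():
    start = time.perf_counter_ns()
    sorted(range(100_000), reverse=True)  # stand-in for the code under test
    elapsed_ns = time.perf_counter_ns() - start

print(f"elapsed: {elapsed_ns / 1e6:.3f} ms")
```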
CPU frequency scaling
Modern CPUs dynamically adjust clock speed. A function that runs in 1ms when the CPU is at full speed might take 3ms when throttled. For consistent benchmarks:
- Run multiple iterations and report percentiles
- Warm up the CPU before measuring (a few seconds of computation)
- On Linux, consider setting the CPU governor to performance for benchmarks
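Before trusting a benchmark run on Linux, it can be useful to record which governor was active. The sketch below assumes the common sysfs location /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor, which may be absent in containers, virtual machines, or on other operating systems; current_governor is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs path for cpu0; not every system exposes cpufreq here
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")


def current_governor(path: Path = GOVERNOR_PATH) -> Optional[str]:
    """Return cpu0's governor name, or None where the file is unavailable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


governor = current_governor()
print(f"CPU governor: {governor if governor is not None else 'unknown'}")
```

Logging this value next to the results makes it easier to explain why two runs on the same machine differ.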
Measuring too little
Operations under ~100ns are within the timer overhead itself. For such fine-grained measurements, time a batch:
start = time.perf_counter_ns()
for _ in range(1_000_000):
    operation()
per_op_ns = (time.perf_counter_ns() - start) / 1_000_000
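The standard library's timeit module packages this batching pattern: by default it uses perf_counter as its timer and disables garbage collection while timing. A small sketch, in which operation is a hypothetical stand-in and the batch size and repeat count are arbitrary:

```python
import timeit


def operation():
    """Hypothetical stand-in for a very cheap operation."""
    return 3 * 7


number = 200_000
# repeat() returns the total seconds for each run of `number` calls
runs = timeit.repeat(operation, number=number, repeat=5)
# The fastest run is usually the least disturbed by other system activity
per_op_ns = min(runs) / number * 1e9
print(f"best of {len(runs)}: ~{per_op_ns:.1f} ns per call")
```

Note that the figure still includes the cost of the function call itself, so it is an upper bound on the operation's own cost.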
Compiler/interpreter optimisation
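CPython’s compiler performs only light optimisation, but it does fold constant expressions and drop unused constant statements, so a trivial expression may never execute at all (JIT-based interpreters such as PyPy can eliminate more). Ensure the timed code does real work and produces a result you actually use:
# Bad: CPython folds 1 + 1 at compile time and discards it
start = time.perf_counter()
1 + 1
elapsed = time.perf_counter() - start
# Better: store and use the result
start = time.perf_counter()
result = sum(range(10000))
elapsed = time.perf_counter() - start
assert result  # Ensure result is used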
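Whether a given expression survives compilation can be checked by inspecting the bytecode with the dis module. The sketch below is a rough check for CPython only; has_binary_op is a hypothetical helper, and it relies on arithmetic opcodes having names that start with BINARY, which holds for the CPython versions I am aware of but is an implementation detail.

```python
import dis


def has_binary_op(source: str) -> bool:
    """True if the compiled source still contains a BINARY_* instruction."""
    code = compile(source, "<example>", "exec")
    return any(ins.opname.startswith("BINARY") for ins in dis.get_instructions(code))


# A constant expression is folded away; one involving a variable is not
print("1 + 1         ->", has_binary_op("1 + 1"))
print("x = 1; x + 1  ->", has_binary_op("x = 1\nx + 1"))
```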
Structured logging with timing
For production observability, emit timing data as structured logs:
import functools
import json
import logging
import time

logger = logging.getLogger(__name__)


def timed_operation(operation_name: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                duration = time.perf_counter() - start
                logger.info(json.dumps({
                    "event": "operation_complete",
                    "operation": operation_name,
                    "duration_ms": round(duration * 1000, 2),
                    "status": "success",
                }))
                return result
            except Exception as e:
                duration = time.perf_counter() - start
                logger.error(json.dumps({
                    "event": "operation_failed",
                    "operation": operation_name,
                    "duration_ms": round(duration * 1000, 2),
                    "status": "error",
                    "error": str(e),
                }))
                raise
        return wrapper
    return decorator
This feeds naturally into log aggregation systems (ELK, Datadog, Grafana Loki) for dashboarding and alerting on latency regressions.
The one thing to remember: perf_counter is the foundation — but production timing needs statistical rigour (percentiles over averages), awareness of measurement pitfalls (GC, CPU scaling, float precision), and integration with observability infrastructure to turn numbers into actionable insights.