Python timeit Best Practices — Deep Dive

Internals of Python's timeit module, adaptive calibration, and advanced patterns for production-grade micro-benchmarks.

The timeit module is deceptively simple on the surface but contains careful design decisions for measurement accuracy. Understanding its internals helps you use it correctly and know when to reach for something more powerful.

How timeit works under the hood

Template compilation

When you call timeit.timeit(stmt, setup), the module compiles a function from a template:

# Simplified version of what timeit generates internally
def inner(_it, _timer):
    setup_code
    _t0 = _timer()
    for _i in _it:
        stmt_code
    _t1 = _timer()
    return _t1 - _t0

The statement is literally inserted into a loop body via compile(). This means:

No function call overhead on the measured code when stmt is a string: it is inlined into the loop body
The setup runs inside the function scope, so variables from setup are locals, which are faster than globals in CPython
The loop iterates over itertools.repeat(None, number) rather than a range, so the per-iteration overhead is minimal
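To make the template idea concrete, here is a minimal sketch that builds and compiles a comparable function by hand. TEMPLATE and make_inner are illustrative names, not part of timeit, and unlike the real module the sketch only handles single-line stmt and setup strings.

```python
import itertools
import time

# Illustrative template modeled on the simplified one above (not timeit's own).
TEMPLATE = """
def inner(_it, _timer):
    {setup}
    _t0 = _timer()
    for _i in _it:
        {stmt}
    _t1 = _timer()
    return _t1 - _t0
"""

def make_inner(stmt, setup="pass"):
    """Compile a timing function with stmt inlined into the loop body."""
    src = TEMPLATE.format(stmt=stmt, setup=setup)
    namespace = {}
    exec(compile(src, "<timeit-sketch>", "exec"), namespace)
    return namespace["inner"]

inner = make_inner("total = sum(data)", setup="data = list(range(100))")

# Drive the loop with itertools.repeat(None, number), as timeit itself does.
elapsed = inner(itertools.repeat(None, 1000), time.perf_counter)

# Names assigned by setup and stmt are locals of the generated function.
print(inner.__code__.co_varnames)
print(f"{elapsed / 1000 * 1e6:.2f} µs per iteration")
```

Inspecting co_varnames confirms that data and total are local variables of the generated function, which is why setup-defined names are cheap to access in the timed loop.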

The timer function

timeit uses time.perf_counter() by default (since Python 3.3). This is a monotonic clock with the highest available resolution — typically nanosecond precision on modern systems:

import time

# Check your system's resolution (example output from Linux; values vary by platform)
print(time.get_clock_info('perf_counter'))
# namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)',
#           monotonic=True, resolution=1e-09)

For CPU-only work, time.process_time() can be more appropriate because it excludes sleep and I/O wait. timeit doesn’t use it by default, but you can pass it as the timer argument or select it on the command line with -p (Python 3.7+).
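The difference between the two clocks is easiest to see with a statement that mostly sleeps. The sketch below passes time.process_time through the timer parameter of timeit.timeit(); the 10 ms sleep and number=5 are arbitrary choices for illustration.

```python
import time
import timeit

stmt = "time.sleep(0.01)"

# Default timer (time.perf_counter): wall-clock time, so the sleep is counted.
wall = timeit.timeit(stmt, setup="import time", number=5)

# CPU-time clock: time spent sleeping is not counted.
cpu = timeit.timeit(stmt, setup="import time", timer=time.process_time, number=5)

print(f"perf_counter: {wall * 1e3:.1f} ms")
print(f"process_time: {cpu * 1e3:.1f} ms")
```

The wall-clock figure should land near 50 ms, while the CPU-time figure stays far below it.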

Adaptive calibration

When you use the CLI without specifying -n:

python -m timeit "sum(range(1000))"

timeit runs an auto-calibration loop. It starts with number=1 and steps through the sequence 1, 2, 5, 10, 20, 50, … until a batch takes at least 0.2 seconds in total, then uses that number for the actual measurement.

# The calibration algorithm (simplified from Timer.autorange)
def autorange(timer):
    i = 1
    while True:
        for j in (1, 2, 5):
            number = i * j  # 1, 2, 5, 10, 20, 50, ...
            time_taken = timer.timeit(number)
            if time_taken >= 0.2:
                return number, time_taken
        i *= 10

This ensures each batch takes long enough to be meaningful but doesn’t waste time on excessive iterations.
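Timer.autorange() accepts an optional callback that is invoked after each trial with that trial's number and elapsed time, which makes the calibration steps visible. A short sketch (the steps list is just a local helper):

```python
import timeit

timer = timeit.Timer("sum(range(1000))")

steps = []  # (number, seconds) for every calibration trial
number, total = timer.autorange(callback=lambda n, t: steps.append((n, t)))

for n, t in steps:
    print(f"number={n:>7}  total={t:.4f} s")
print(f"chosen number={number}, {total / number * 1e6:.2f} µs per loop")
```

The last recorded trial is the one whose total time first reached the 0.2 second threshold.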

Advanced patterns

Comparing callables directly

Instead of string statements, you can pass callables for cleaner code, at the cost of one extra function call per iteration:

import timeit

def approach_a():
    return [x**2 for x in range(1000)]

def approach_b():
    return list(map(lambda x: x**2, range(1000)))

# Using Timer objects for precise control
timer_a = timeit.Timer(approach_a)
timer_b = timeit.Timer(approach_b)

# Auto-range for fair comparison
n_a, t_a = timer_a.autorange()
n_b, t_b = timer_b.autorange()

print(f"List comp: {t_a/n_a*1e6:.2f} µs/op (n={n_a})")
print(f"Map:       {t_b/n_b*1e6:.2f} µs/op (n={n_b})")

Note: autorange() may choose different n values for each approach. For strict comparison, use the same number for both.
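One way to follow that advice is to let autorange() calibrate each candidate and then rerun both with the larger of the two counts. The sketch below assumes that picking max(n_a, n_b) and taking the minimum of three repeats is acceptable for your workload; both are illustrative choices.

```python
import timeit

def approach_a():
    return [x**2 for x in range(1000)]

def approach_b():
    return list(map(lambda x: x**2, range(1000)))

timer_a = timeit.Timer(approach_a)
timer_b = timeit.Timer(approach_b)

# Calibrate separately, then use one shared iteration count for both.
n_a, _ = timer_a.autorange()
n_b, _ = timer_b.autorange()
number = max(n_a, n_b)

best_a = min(timer_a.repeat(repeat=3, number=number))
best_b = min(timer_b.repeat(repeat=3, number=number))

print(f"List comp: {best_a / number * 1e6:.2f} µs/op (n={number})")
print(f"Map:       {best_b / number * 1e6:.2f} µs/op (n={number})")
```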

Handling mutable state correctly

The hardest problem in micro-benchmarking is mutable state. Here’s a robust pattern:

import timeit
import copy
import random

original_data = random.sample(range(10000), 10000)

def bench_sort():
    data = copy.copy(original_data)  # shallow copy per iteration
    data.sort()
    return data

# copy cost is included in timing — subtract it
copy_time = timeit.timeit(
    lambda: copy.copy(original_data),
    number=10000
)
sort_time = timeit.timeit(bench_sort, number=10000)
net_sort_time = sort_time - copy_time

print(f"Pure sort time: {net_sort_time/10000*1e6:.1f} µs")

This “subtract the baseline” approach isn’t perfect — copy and sort interact through cache effects — but it’s better than timing an already-sorted list.
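An alternative that avoids the subtraction is to let setup do the copy and time a single sort per run. With number=1, timeit.repeat() reruns setup before every timed execution, so each sort sees fresh unsorted data. This is a sketch that assumes the operation is slow enough (here, on the order of a millisecond) for timer overhead to be negligible; repeat=200 is an arbitrary choice.

```python
import random
import timeit

original_data = random.sample(range(10_000), 10_000)

times = timeit.repeat(
    stmt="data.sort()",
    setup="data = original_data.copy()",  # runs once per repeat, before timing starts
    repeat=200,
    number=1,  # one sort per fresh copy
    globals={"original_data": original_data},
)

print(f"Best sort time:   {min(times) * 1e6:.1f} µs")
print(f"Median sort time: {sorted(times)[len(times) // 2] * 1e6:.1f} µs")
```

The trade-off is that very fast operations cannot be measured this way, because a single execution would be too close to the timer's resolution.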

Multi-line statements with proper scope

import timeit

setup = """
import json
import random
data = {f'key_{i}': random.random() for i in range(100)}
"""

# Multi-line statement using triple-quote
stmt = """
encoded = json.dumps(data)
decoded = json.loads(encoded)
assert decoded == data  # sanity check; its cost is included in the timing
"""

result = timeit.repeat(stmt, setup, repeat=5, number=10000)
best = min(result)
print(f"Round-trip: {best/10000*1e6:.1f} µs")
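If the benchmark data already lives in your program, the globals parameter (available since Python 3.5) lets the statement see it without rebuilding it in a setup string. A minimal sketch; note that names supplied this way are looked up as globals rather than locals, which can be marginally slower for very small statements.

```python
import json
import random
import timeit

data = {f"key_{i}": random.random() for i in range(100)}

stmt = """
encoded = json.dumps(data)
decoded = json.loads(encoded)
"""

# Expose only the names the statement needs.
results = timeit.repeat(
    stmt,
    repeat=5,
    number=1000,
    globals={"json": json, "data": data},
)
best = min(results)
print(f"Round-trip: {best / 1000 * 1e6:.1f} µs")
```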

Parameterized benchmarks

import timeit

def bench_range_sum(size):
    """Benchmark sum() for different input sizes"""
    timer = timeit.Timer(
        f'sum(range({size}))',
    )
    n, total = timer.autorange()
    per_call = total / n
    return per_call

sizes = [10, 100, 1000, 10000, 100000]
for size in sizes:
    t = bench_range_sum(size)
    print(f"sum(range({size:>6})): {t*1e6:>10.2f} µs")
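When the code under test is a function rather than a string, functools.partial is one way to bind the parameter for each run. This is a sketch with a hypothetical build_squares function and a fixed number=200 chosen only to keep the run short.

```python
import functools
import timeit

def build_squares(size):
    """Hypothetical function under test."""
    return [x * x for x in range(size)]

def bench(func, size, number=200):
    """Return seconds per call for func(size)."""
    timer = timeit.Timer(functools.partial(func, size))
    return timer.timeit(number=number) / number

results = {size: bench(build_squares, size) for size in (10, 100, 1000)}
for size, per_call in results.items():
    print(f"build_squares({size:>5}): {per_call * 1e6:>8.2f} µs")
```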

Statistical analysis of timeit results

timeit.repeat() returns a list of times. Here’s how to analyze them properly:

import timeit
import statistics

results = timeit.repeat(
    'sorted(data)',
    setup='import random; data = random.sample(range(1000), 1000)',
    repeat=20,
    number=1000
)

per_call = [t / 1000 for t in results]

print(f"Min:    {min(per_call)*1e6:.2f} µs")
print(f"Median: {statistics.median(per_call)*1e6:.2f} µs")
print(f"Mean:   {statistics.mean(per_call)*1e6:.2f} µs")
print(f"Stdev:  {statistics.stdev(per_call)*1e6:.2f} µs")
print(f"CV:     {statistics.stdev(per_call)/statistics.mean(per_call)*100:.1f}%")

# CV (coefficient of variation) = stdev / mean, expressed as a percentage

As a rule of thumb, a coefficient of variation above about 10% suggests that your environment isn’t controlled enough or your number is too low.
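The same statistics can be wrapped in a small helper so every benchmark reports them consistently. The summarize function and its cv_threshold parameter are illustrative names, and the 10% default simply mirrors the rule of thumb above. The input here is synthetic, so the output is deterministic.

```python
import statistics

def summarize(times, number, cv_threshold=10.0):
    """Summarize timeit.repeat() output; per-call values are in seconds."""
    per_call = [t / number for t in times]
    mean = statistics.mean(per_call)
    stdev = statistics.stdev(per_call)
    cv = stdev / mean * 100
    return {
        "min": min(per_call),
        "median": statistics.median(per_call),
        "mean": mean,
        "stdev": stdev,
        "cv_percent": cv,
        "noisy": cv > cv_threshold,
    }

# Synthetic totals standing in for timeit.repeat(..., number=1000) output.
summary = summarize([1.0, 1.1, 0.9, 1.0], number=1000)
print(summary)
```

For these four synthetic samples the CV works out to roughly 8.2%, so the noisy flag stays False.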

Interaction with the GIL and threads

timeit disables garbage collection but doesn’t control the GIL. If you’re benchmarking in a multi-threaded application, other threads will compete for the GIL during your measurement.

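import threading
import timeit

def target_function():
    # Placeholder for the operation under test
    return sorted(range(1000), reverse=True)

def background_work():
    # Placeholder CPU-bound worker that competes for the GIL
    while True:
        sum(range(10000))

# BAD: background thread adds GIL contention
worker = threading.Thread(target=background_work, daemon=True)
worker.start()
timeit.timeit(target_function, number=10000)  # Inflated by GIL contention

# BETTER: ensure single-threaded during measurement
# Or use multiprocessing-based benchmarks (pyperf does this)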

pyperf sidesteps in-process interference by spawning separate worker processes for its measurements, so threads in your application cannot compete with the benchmark for the GIL.
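Even without pyperf, the standard library can give you process isolation by running the timeit CLI in a child interpreter. The sketch below is not how pyperf works internally; it only illustrates measuring in a process that has no other application threads. The statement and the -n/-r values are arbitrary.

```python
import subprocess
import sys

# Run `python -m timeit` in a fresh interpreter so that threads in the
# current process cannot compete with the measurement for the GIL.
completed = subprocess.run(
    [sys.executable, "-m", "timeit", "-n", "1000", "-r", "5", "sum(range(1000))"],
    capture_output=True,
    text=True,
    check=True,
)
report = completed.stdout.strip()
print(report)  # a line of the form "1000 loops, best of 5: ... per loop"
```

Threads in the parent can still compete for CPU cores, so this removes GIL contention but not general system load.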

When timeit isn’t enough

Limitation	Alternative
No statistical tests	`pyperf compare_to`
No regression tracking	`pytest-benchmark` (`pytest --benchmark-autosave`)
No flame graphs	`py-spy` or `cProfile` + `snakeviz`
No memory measurement	`tracemalloc` or `memory_profiler`
No async support	Wrap in `asyncio.run()` or use `aiotools`
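For the async row, one option is to keep the timing loop inside a single event loop so that loop startup is not charged to every call. A minimal sketch, assuming a hypothetical coroutine fetch_value as the operation under test:

```python
import asyncio
import time

async def fetch_value():
    """Hypothetical coroutine under test."""
    await asyncio.sleep(0)  # yield to the event loop once
    return sum(range(1000))

async def bench(number):
    """Time `number` sequential awaits inside one running event loop."""
    start = time.perf_counter()
    for _ in range(number):
        await fetch_value()
    return time.perf_counter() - start

number = 1000
elapsed = asyncio.run(bench(number))
print(f"{elapsed / number * 1e6:.2f} µs per await")
```

Wrapping each call in its own asyncio.run() also works with timeit directly, but then every iteration pays for creating and closing an event loop.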

Production micro-benchmark template

#!/usr/bin/env python3
"""Benchmark template following best practices."""
import gc
import statistics
import sys
import timeit

WARMUP = 10
REPEAT = 20
NUMBER = 10000

def setup():
    """Prepare benchmark data."""
    import random
    return random.sample(range(10000), 10000)

def target(data):
    """The operation under test."""
    return sorted(data)

def main():
    data = setup()

    # Warmup
    for _ in range(WARMUP):
        target(data)

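    # Measure (timeit already disables GC while timing; disabling it here
    # also keeps it off between repeats)
    gc.disable()
    try:
        times = timeit.repeat(
            lambda: target(data),
            repeat=REPEAT,
            number=NUMBER,
        )
    finally:
        gc.enable()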

    per_call = [t / NUMBER * 1e6 for t in times]  # microseconds

    print(f"Python {sys.version}")
    print(f"Iterations: {NUMBER} × {REPEAT} repeats")
    print(f"Min:    {min(per_call):.2f} µs")
    print(f"Median: {statistics.median(per_call):.2f} µs")
    print(f"Stdev:  {statistics.stdev(per_call):.2f} µs")
    print(f"CV:     {statistics.stdev(per_call)/statistics.mean(per_call)*100:.1f}%")

if __name__ == '__main__':
    main()

The one thing to remember: timeit handles loop compilation, GC suppression, and high-resolution timing automatically — your job is to isolate setup from measurement, use repeat for multiple independent samples, and analyze the distribution rather than trusting a single number.
