Python timeit Best Practices — Deep Dive
The timeit module is deceptively simple on the surface but contains careful design decisions for measurement accuracy. Understanding its internals helps you use it correctly and know when to reach for something more powerful.
How timeit works under the hood
Template compilation
When you call timeit.timeit(stmt, setup), the module compiles a function from a template:
# Simplified version of what timeit generates internally
def inner(_it, _timer):
    setup_code
    _t0 = _timer()
    for _i in _it:
        stmt_code
    _t1 = _timer()
    return _t1 - _t0
The statement is literally inserted into a loop body via compile(). This means:
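- A string statement incurs no function call overhead, because it is inlined into the loop body. A callable passed as stmt is still called once per iteration.
- The setup runs inside the function scope, so variables created in setup are locals, which are faster to access than globals in CPython
- The loop iterates over itertools.repeat(None, number) rather than a range, which adds minimal per-iteration overhead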
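To look at the generated function yourself, CPython's Timer keeps the filled-in template source on an attribute named src. That attribute is an undocumented implementation detail, so it may be missing on other interpreters or change in future versions. Treat the following as an exploratory sketch, not a supported API:

```python
import timeit

# Build a Timer from a string statement and setup; nothing is executed yet.
timer = timeit.Timer(
    stmt="total = sum(values)",
    setup="values = list(range(100))",
)

# Timer.src is an undocumented CPython attribute holding the generated source.
# getattr with a default keeps the sketch from failing where it is absent.
generated_source = getattr(timer, "src", "")
print(generated_source)
```

The printed source shows the setup line before the first timer call and the statement inside the for loop.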
The timer function
timeit uses time.perf_counter() by default (since Python 3.3). This is a monotonic clock with the highest available resolution. The reported resolution depends on the platform and is often in the nanosecond range on modern systems:
import time
# Check your system's resolution
print(time.get_clock_info('perf_counter'))
# namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)',
# monotonic=True, resolution=1e-09)
For CPU-only work, time.process_time() can be more appropriate as it excludes sleep and I/O wait, but timeit doesn’t use it by default.
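A different clock can be passed through the timer parameter. The sketch below times a statement that mostly sleeps, once with the default clock and once with time.process_time, to show how far the two can diverge. The statement and the iteration count are illustrative choices:

```python
import time
import timeit

# A statement that mostly waits: 5 iterations of a 10 ms sleep.
stmt = "time.sleep(0.01)"

# Default timer (perf_counter) measures elapsed wall-clock time.
wall = timeit.timeit(stmt, setup="import time", number=5)

# process_time counts only CPU time used by this process, so sleep is excluded.
cpu = timeit.timeit(stmt, setup="import time", number=5, timer=time.process_time)

print(f"wall clock: {wall * 1e3:.1f} ms")
print(f"CPU time:   {cpu * 1e3:.1f} ms")
```

Note that process_time can have a much coarser resolution than perf_counter on some platforms, so it suits longer-running statements.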
Adaptive calibration
When you use the CLI without specifying -n:
python -m timeit "sum(range(1000))"
timeit runs an auto-calibration loop. It tries number values in the sequence 1, 2, 5, 10, 20, 50, … until the total time reaches at least 0.2 seconds, then uses that number for the actual measurement.
# The calibration algorithm (simplified from Timer.autorange)
def autorange(timer):
    i = 1
    while True:
        for j in (1, 2, 5):
            number = i * j
            time_taken = timer.timeit(number)
            if time_taken >= 0.2:
                return number, time_taken
        i *= 10
This ensures each batch takes long enough to be meaningful but doesn’t waste time on excessive iterations.
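You can watch the calibration from Python as well: Timer.autorange() accepts an optional callback that it calls after every trial. A small sketch, where the record function and the trials list are names made up for this example:

```python
import timeit

trials = []  # (number, seconds) for every calibration attempt


def record(number, time_taken):
    # autorange() calls this after each trial with the loop count
    # and the total elapsed time for that trial.
    trials.append((number, time_taken))


timer = timeit.Timer("sum(range(1000))")
number, total = timer.autorange(callback=record)

for n, seconds in trials:
    print(f"number={n:>7}  total={seconds:.4f} s")
print(f"chosen number={number}, {total / number * 1e6:.2f} µs per loop")
```

The last recorded trial is the one autorange() returns, and it is the first whose total reaches 0.2 seconds.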
Advanced patterns
Comparing callables directly
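Instead of string statements, you can pass callables for cleaner code. Each iteration then includes one function call, so this works best when the measured work is large relative to that call overhead: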
import timeit

def approach_a():
    return [x**2 for x in range(1000)]

def approach_b():
    return list(map(lambda x: x**2, range(1000)))

# Using Timer objects for precise control
timer_a = timeit.Timer(approach_a)
timer_b = timeit.Timer(approach_b)

# Auto-range for fair comparison
n_a, t_a = timer_a.autorange()
n_b, t_b = timer_b.autorange()
print(f"List comp: {t_a/n_a*1e6:.2f} µs/op (n={n_a})")
print(f"Map: {t_b/n_b*1e6:.2f} µs/op (n={n_b})")
Note: autorange() may choose different n values for each approach. For strict comparison, use the same number for both.
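One way to do that strict comparison is to fix the loop count and take the best of several batches for each candidate. This is a sketch: the helper best_per_call and the value of NUMBER are illustrative, not part of timeit:

```python
import timeit


def approach_a():
    return [x**2 for x in range(1000)]


def approach_b():
    return list(map(lambda x: x**2, range(1000)))


def best_per_call(func, number, repeat=5):
    """Return the fastest per-call time in seconds over `repeat` batches."""
    batches = timeit.repeat(func, number=number, repeat=repeat)
    return min(batches) / number


NUMBER = 500  # same loop count for both candidates (illustrative value)
results = {
    "list comp": best_per_call(approach_a, NUMBER),
    "map": best_per_call(approach_b, NUMBER),
}
for name, seconds in results.items():
    print(f"{name:<10} {seconds * 1e6:.2f} µs/op (n={NUMBER})")
```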
Handling mutable state correctly
The hardest problem in micro-benchmarking is mutable state. Here’s a robust pattern:
import timeit
import copy
import random

original_data = random.sample(range(10000), 10000)

def bench_sort():
    data = copy.copy(original_data)  # shallow copy per iteration
    data.sort()
    return data

# copy cost is included in timing, so subtract it
copy_time = timeit.timeit(
    lambda: copy.copy(original_data),
    number=10000
)
sort_time = timeit.timeit(bench_sort, number=10000)
net_sort_time = sort_time - copy_time
print(f"Pure sort time: {net_sort_time/10000*1e6:.1f} µs")
This “subtract the baseline” approach isn’t perfect — copy and sort interact through cache effects — but it’s better than timing an already-sorted list.
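An alternative that avoids the subtraction is to let setup make the copy. timeit runs setup once per repeat, outside the timed region, so with number=1 every measurement sorts a fresh unsorted list. The repeat count below is an arbitrary choice for this sketch:

```python
import random
import timeit

original_data = random.sample(range(10000), 10000)

# setup runs once per repeat and is not timed. number=1 keeps the sort
# from being measured on its own already-sorted output.
times = timeit.repeat(
    stmt="data.sort()",
    setup="data = original_data.copy()",
    repeat=200,
    number=1,
    globals={"original_data": original_data},
)
print(f"Best sort time: {min(times) * 1e6:.1f} µs")
```

The trade-off is that single-call measurements are more exposed to timer overhead and noise, so this fits operations that take well above the clock resolution.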
Multi-line statements with proper scope
import timeit
setup = """
import json
import random
data = {f'key_{i}': random.random() for i in range(100)}
"""
# Multi-line statement using triple-quote
stmt = """
encoded = json.dumps(data)
decoded = json.loads(encoded)
assert decoded == data
"""
result = timeit.repeat(stmt, setup, repeat=5, number=10000)
best = min(result)
print(f"Round-trip: {best/10000*1e6:.1f} µs")
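When the data already exists in your program, the globals parameter (available since Python 3.5) lets the statement see it without rebuilding it in a setup string. A minimal sketch, in which the namespace dictionary and its contents are illustrative:

```python
import json
import random
import timeit

data = {f"key_{i}": random.random() for i in range(100)}

stmt = """
encoded = json.dumps(data)
decoded = json.loads(encoded)
"""

# Names in this dict are visible to stmt as globals.
namespace = {"json": json, "data": data}
result = timeit.repeat(stmt, repeat=3, number=1000, globals=namespace)
best = min(result)
print(f"Round-trip: {best / 1000 * 1e6:.1f} µs")
```

Names supplied this way are resolved as globals, so lookups are slightly slower than for variables created in setup, which are locals of the generated function.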
Parameterized benchmarks
import timeit

def bench_range_sum(size):
    """Benchmark sum() for different input sizes"""
    timer = timeit.Timer(
        f'sum(range({size}))',
    )
    n, total = timer.autorange()
    per_call = total / n
    return per_call

sizes = [10, 100, 1000, 10000, 100000]
for size in sizes:
    t = bench_range_sum(size)
    print(f"sum(range({size:>6})): {t*1e6:>10.2f} µs")
Statistical analysis of timeit results
timeit.repeat() returns a list of times. Here’s how to analyze them properly:
import timeit
import statistics
results = timeit.repeat(
    'sorted(data)',
    setup='import random; data = random.sample(range(1000), 1000)',
    repeat=20,
    number=1000
)
per_call = [t / 1000 for t in results]
print(f"Min: {min(per_call)*1e6:.2f} µs")
print(f"Median: {statistics.median(per_call)*1e6:.2f} µs")
print(f"Mean: {statistics.mean(per_call)*1e6:.2f} µs")
print(f"Stdev: {statistics.stdev(per_call)*1e6:.2f} µs")
print(f"CV: {statistics.stdev(per_call)/statistics.mean(per_call)*100:.1f}%")
# Coefficient of variation (CV) > 10% means the benchmark is too noisy
As a rule of thumb, a coefficient of variation above 10% suggests that your environment isn’t controlled enough or your number is too low.
Interaction with the GIL and threads
timeit disables garbage collection but doesn’t control the GIL. If you’re benchmarking in a multi-threaded application, other threads will compete for the GIL during your measurement.
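import threading
import timeit

def background_work():
    # Placeholder CPU-bound loop that keeps competing for the GIL
    while True:
        sum(range(1000))

def target_function():
    # Placeholder for the code under test
    return sorted(range(1000, 0, -1))

# BAD: background thread adds GIL contention
worker = threading.Thread(target=background_work, daemon=True)
worker.start()
timeit.timeit(target_function, number=10000)  # Inflated by GIL contention
# BETTER: ensure single-threaded during measurement
# Or use multiprocessing-based benchmarks (pyperf does this)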
pyperf runs benchmarks in separate worker processes, which keeps the measurement isolated from threads running in your application.
When timeit isn’t enough
| Limitation | Alternative |
|---|---|
| No statistical tests | pyperf compare_to |
| No regression tracking | pytest-benchmark --benchmark-autosave |
| No flame graphs | py-spy or cProfile + snakeviz |
| No memory measurement | tracemalloc or memory_profiler |
| No async support | Wrap in asyncio.run() or use aiotools |
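For the async row, wrapping every call in asyncio.run() would also time event loop creation and teardown. One possible sketch starts a single loop and times the awaits inside it. The work coroutine and the bench helper are hypothetical names for this example:

```python
import asyncio
import time


async def work():
    # Hypothetical coroutine under test; replace with the real one.
    await asyncio.sleep(0)


async def bench(coro_func, number):
    """Await coro_func() `number` times and return the total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(number):
        await coro_func()
    return time.perf_counter() - start


NUMBER = 1000
total = asyncio.run(bench(work, NUMBER))
print(f"{total / NUMBER * 1e6:.2f} µs per await")
```

Unlike timeit, this sketch does not disable garbage collection or repeat the measurement, so those steps would need to be added for serious use.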
Production micro-benchmark template
#!/usr/bin/env python3
"""Benchmark template following best practices."""
import gc
import statistics
import sys
import timeit

WARMUP = 10
REPEAT = 20
NUMBER = 10000

def setup():
    """Prepare benchmark data."""
    import random
    return random.sample(range(10000), 10000)

def target(data):
    """The operation under test."""
    return sorted(data)

def main():
    data = setup()

    # Warmup
    for _ in range(WARMUP):
        target(data)

    # Measure. timeit already disables GC while timing; the explicit
    # calls keep it off between repeats too, and try/finally makes sure
    # it is re-enabled if the benchmark raises.
    gc.disable()
    try:
        times = timeit.repeat(
            lambda: target(data),
            repeat=REPEAT,
            number=NUMBER,
        )
    finally:
        gc.enable()

    per_call = [t / NUMBER * 1e6 for t in times]  # microseconds
    print(f"Python {sys.version}")
    print(f"Iterations: {NUMBER} × {REPEAT} repeats")
    print(f"Min: {min(per_call):.2f} µs")
    print(f"Median: {statistics.median(per_call):.2f} µs")
    print(f"Stdev: {statistics.stdev(per_call):.2f} µs")
    print(f"CV: {statistics.stdev(per_call)/statistics.mean(per_call)*100:.1f}%")

if __name__ == '__main__':
    main()
The one thing to remember: timeit handles loop compilation, GC suppression, and high-resolution timing automatically — your job is to isolate setup from measurement, use repeat for multiple independent samples, and analyze the distribution rather than trusting a single number.