Python statistics Module — Deep Dive
Design Philosophy
The statistics module was introduced in PEP 450 with a clear mandate: provide correct, readable statistical functions for typical use cases. The emphasis on “correct” is key — it deliberately uses exact rational arithmetic internally to avoid floating-point accumulation errors.
import statistics
from fractions import Fraction
# Exact arithmetic with Fraction inputs
data = [Fraction(1, 3), Fraction(1, 6), Fraction(1, 2)]
statistics.mean(data) # Fraction(1, 3) — exact, not 0.333...
This makes it suitable for financial calculations and educational contexts where floating-point surprises are unacceptable.
The Exact Arithmetic Pipeline
How mean Works Internally
The naive approach sum(data) / len(data) suffers from floating-point errors for large datasets. The statistics module converts values to Fraction internally:
# Simplified version of the internal logic
def _exact_mean(data):
n = 0
total = Fraction(0)
for x in data:
n += 1
total += Fraction(x) # Exact representation
return total / n
This is why statistics.mean is slower than numpy.mean — it trades speed for correctness. For [0.1, 0.1, 0.1], the standard library returns exactly 0.1, while a naive floating-point sum might return 0.10000000000000002.
fmean: The Fast Path (Python 3.8+)
When you don’t need exact arithmetic:
statistics.fmean(data) # Uses math.fsum for compensated summation
fmean uses math.fsum (Kahan summation algorithm) for nearly-exact float accumulation. It’s ~3-5x faster than mean while still being more accurate than naive sum()/len().
NormalDist: A Full Distribution Object (Python 3.8+)
The module includes NormalDist — a lightweight class for working with normal (Gaussian) distributions:
from statistics import NormalDist
# Create from parameters
dist = NormalDist(mu=170, sigma=10) # Heights in cm
# Probability density
dist.pdf(175) # 0.0352...
# Cumulative distribution
dist.cdf(180) # 0.8413... (probability of being ≤ 180 cm)
# Inverse CDF (quantile function)
dist.inv_cdf(0.95) # 186.45... (95th percentile)
# Create from data
measured = NormalDist.from_samples([168, 172, 175, 170, 169, 173])
measured.mean # 171.17
measured.stdev # 2.56
Distribution Arithmetic
NormalDist supports operations that follow the rules of normal distribution algebra:
height = NormalDist(170, 10)
shoe_size = NormalDist(42, 2)
# Sum of independent normals
combined = height + shoe_size # NormalDist(mu=212, sigma=sqrt(104))
# Scaling
doubled = height * 2 # NormalDist(mu=340, sigma=20)
# Overlap between two distributions
overlap = height.overlap(NormalDist(175, 8)) # Probability of shared area
Monte Carlo Validation
# Generate samples and verify statistics
samples = dist.samples(10_000, seed=42)
empirical = NormalDist.from_samples(samples)
assert abs(empirical.mean - dist.mean) < 0.5
assert abs(empirical.stdev - dist.stdev) < 0.5
Quantiles in Detail
data = list(range(1, 101)) # 1 to 100
# Quartiles (default)
statistics.quantiles(data, n=4)
# [25.75, 50.5, 75.25] — exclusive method (default)
statistics.quantiles(data, n=4, method='inclusive')
# [25.5, 50.5, 75.5] — inclusive method
Exclusive vs Inclusive: The exclusive method (default) is appropriate for continuous data where you’re estimating population quantiles from a sample. The inclusive method matches what you’d calculate by hand for the dataset itself. Excel’s PERCENTILE.INC uses inclusive; PERCENTILE.EXC uses exclusive.
Edge Cases and Gotchas
Empty Data
statistics.mean([]) # Raises StatisticsError: mean requires at least one data point
All functions raise StatisticsError for empty inputs. This is intentional — returning 0 or None would silently produce wrong results.
Single Data Point
statistics.variance([42]) # Raises StatisticsError: variance requires at least two data points
statistics.pvariance([42]) # 0 — population variance of a single point is 0
Sample variance needs n-1 in the denominator, so a single point is undefined.
Mixed Types
statistics.mean([1, 2.5, Fraction(1, 3)]) # Works — converts to common type
statistics.mean([1, 2, "three"]) # Raises TypeError
The module handles mixed numeric types (int, float, Fraction, Decimal) gracefully but rejects non-numeric values.
Decimal Precision
from decimal import Decimal
data = [Decimal("0.1"), Decimal("0.2"), Decimal("0.3")]
statistics.mean(data) # Decimal('0.2') — preserves Decimal type and precision
This is valuable for financial calculations where Decimal precision semantics matter.
Building on the Module
Weighted Statistics (DIY)
The module doesn’t include weighted means, but you can build one:
def weighted_mean(values, weights):
"""Weighted mean using exact arithmetic."""
total = sum(Fraction(v) * Fraction(w) for v, w in zip(values, weights))
weight_sum = sum(Fraction(w) for w in weights)
return float(total / weight_sum)
weighted_mean([80, 90, 70], [0.3, 0.5, 0.2]) # 83.0
Rolling Statistics
from collections import deque
class RollingStats:
"""Maintain running statistics over a sliding window."""
def __init__(self, window_size):
self._data = deque(maxlen=window_size)
def push(self, value):
self._data.append(value)
@property
def mean(self):
return statistics.fmean(self._data)
@property
def stdev(self):
if len(self._data) < 2:
return 0.0
return statistics.stdev(self._data)
@property
def median(self):
return statistics.median(self._data)
Confidence Intervals with NormalDist
from statistics import NormalDist
import math
def confidence_interval(data, confidence=0.95):
"""Calculate confidence interval for the mean."""
n = len(data)
mean = statistics.fmean(data)
se = statistics.stdev(data) / math.sqrt(n) # Standard error
z = NormalDist().inv_cdf((1 + confidence) / 2)
margin = z * se
return (mean - margin, mean + margin)
data = [23.1, 24.5, 22.8, 25.0, 23.7, 24.2]
lo, hi = confidence_interval(data)
# 95% CI: (22.87, 25.00) approximately
Performance Comparison
import timeit
import numpy as np
import statistics
data_list = list(range(10_000))
data_array = np.array(data_list)
# statistics.mean: ~15ms (exact arithmetic)
timeit.timeit(lambda: statistics.mean(data_list), number=100)
# statistics.fmean: ~2ms (compensated float sum)
timeit.timeit(lambda: statistics.fmean(data_list), number=100)
# numpy.mean: ~0.02ms (compiled C)
timeit.timeit(lambda: np.mean(data_array), number=100)
For 10,000 items, NumPy is ~100x faster than fmean and ~750x faster than mean. The crossover point where the statistics module becomes impractical is around 100,000 items for fmean and 10,000 for mean.
What’s Missing (and Where to Go)
| Need | Not in statistics | Use instead |
|---|---|---|
| Weighted statistics | ❌ | NumPy |
| Statistical tests (t-test, etc.) | ❌ | SciPy |
| Distributions beyond Normal | ❌ | SciPy |
| Regression beyond linear | ❌ | scikit-learn |
| Time series | ❌ | pandas, statsmodels |
| Bayesian statistics | ❌ | PyMC, ArviZ |
One Thing to Remember
The statistics module trades speed for correctness — its exact arithmetic and clean API make it the right choice for small datasets, teaching, and financial calculations where every decimal matters.
See Also
- Python Random Module Patterns Learn how Python picks random numbers, shuffles cards, and makes fair choices — and why it's not truly random.
- Python Scipy Scientific Computing Learn why scientists and engineers reach for SciPy when they need Python to crunch serious math problems.
- Python Sympy Symbolic Math See how Python can solve algebra homework for you — with letters instead of just numbers.
- Ci Cd Why big apps can ship updates every day without turning your phone into a glitchy mess — CI/CD is the behind-the-scenes quality gate and delivery truck.
- Containerization Why does software that works on your computer break on everyone else's? Containers fix that — and they're why Netflix can deploy 100 updates a day without the site going down.