A/B Testing — Core Concepts

Statistical hypothesis testing for product experiments: p-values, statistical power, sample size calculation, and why peeking at results ruins your experiment.

The Statistical Foundation

An A/B test is a randomized controlled experiment. The null hypothesis $H_0$ states that the treatment (version B) has no effect — any observed difference is due to random chance. The alternative hypothesis $H_1$ states the treatment has a real effect.

The question: if the null hypothesis were true, how surprising would the observed data be?

The p-value: The probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true. If $p = 0.03$, there’s only a 3% chance of seeing a difference at least this large by random chance when there’s actually no true effect. It is not the probability that the null hypothesis is true.

The conventional threshold is $p < 0.05$ (5% significance level). This means you’ll incorrectly reject a true null hypothesis 5% of the time (Type I error rate, or false positive rate) — acceptable in most settings.
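
To make the definition concrete, here is a minimal sketch of a two-sided p-value for a difference in conversion rates, using a pooled two-proportion z-test. The function name and the visitor counts are illustrative assumptions, not figures from a real experiment, and the normal approximation is only reasonable for fairly large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the common rate if H0 is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large under a standard normal.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 500 of 10,000 convert in A, 560 of 10,000 in B.
p_value = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"p-value = {p_value:.3f}")
print("reject H0 at alpha = 0.05" if p_value < 0.05 else "fail to reject H0")
```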

Two Types of Errors

Type I Error (False Positive): Concluding there’s an effect when there isn’t one. Controlled by your significance level $\alpha$. Set $\alpha = 0.05$ and you’ll make this mistake 5% of the time in properly conducted tests.

Type II Error (False Negative): Missing a real effect — failing to detect a difference that exists. The probability of Type II error is $\beta$. Statistical power is $1 - \beta$ — the probability of correctly detecting a real effect.

Typical power target: 80% (you’ll miss 20% of true effects). This is a tradeoff: higher power requires more data.

Sample Size Calculation

Before running an experiment, estimate the sample size per variant needed to detect an effect of a given size with acceptable power:

$$n \approx \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2}$$

Where:

$z_{\alpha/2} = 1.96$ for $\alpha = 0.05$ (two-tailed)
$z_\beta = 0.84$ for 80% power
$\sigma^2$ is the variance of the metric
$\delta$ is the minimum detectable effect (MDE) — the smallest real difference worth caring about

Example: You’re testing a checkout flow change and want to detect a 1 percentage point absolute improvement in conversion rate (currently 5%). With $\sigma^2 = p(1-p) = 0.05 \times 0.95 = 0.0475$:

$$n \approx \frac{2(1.96 + 0.84)^2 \times 0.0475}{0.01^2} = \frac{0.7448}{0.0001} \approx 7,450 \text{ users per variant}$$

That is roughly 14,900 users in total. At 10,000 daily visitors, the sample fills in about a day and a half, although running for at least one full week is usually wiser so that weekday and weekend behavior are both represented.

Conversely: if you can only afford 3,000 users per variant, your MDE is ~1.6 percentage points, so you can only reliably detect improvements larger than that.
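
The formula can be evaluated directly. The sketch below, with hypothetical function names, computes the per-variant sample size for a given MDE and inverts the same formula to get the MDE for a given sample size. It takes the z-scores from the standard normal rather than the rounded 1.96 and 0.84, so results differ slightly from hand calculations.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_scores(alpha=0.05, power=0.80):
    """Return (z_{alpha/2}, z_beta) for a two-tailed test."""
    norm = NormalDist()
    return norm.inv_cdf(1 - alpha / 2), norm.inv_cdf(power)


def sample_size_per_variant(variance, mde, alpha=0.05, power=0.80):
    """n = 2 (z_{alpha/2} + z_beta)^2 sigma^2 / delta^2, rounded up."""
    z_alpha, z_beta = z_scores(alpha, power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2)


def minimum_detectable_effect(variance, n_per_variant, alpha=0.05, power=0.80):
    """Same formula solved for delta at a fixed sample size."""
    z_alpha, z_beta = z_scores(alpha, power)
    return (z_alpha + z_beta) * sqrt(2 * variance / n_per_variant)


baseline = 0.05                        # current conversion rate
variance = baseline * (1 - baseline)   # p(1 - p) for a binary metric

n = sample_size_per_variant(variance, mde=0.01)
print(f"users per variant for a 1 point MDE: {n}")

mde = minimum_detectable_effect(variance, n_per_variant=3_000)
print(f"MDE with 3,000 users per variant: {mde:.4f}")
```

Because $n$ scales with $1/\delta^2$, halving the MDE quadruples the required sample.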

The Peeking Problem

A pervasive mistake: running an experiment until results look significant, then stopping.

If you continuously monitor a test and stop whenever $p < 0.05$, your actual false positive rate is much higher than 5%. In simulation: checking every day, with no true effect, you’ll get $p < 0.05$ at some point during the experiment about 22% of the time, not 5%. The exact figure depends on how many times you look: the more checks, the higher the rate.

Why? Small random fluctuations can temporarily push $p$ below 0.05 early in an experiment, even when there’s no real effect. If you stop there, you’re capturing noise.
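
The inflation is easy to reproduce. The sketch below is a simplified simulation, not the one behind the figure quoted above: it assumes equal traffic on every day and a metric with known variance, so each day's data contributes one standard normal increment and the z-statistic after $k$ days is the running sum divided by $\sqrt{k}$. There is no true effect, and the test stops at the first look where $|z|$ crosses the two-sided 5% threshold.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, trials=20_000, alpha=0.05, seed=0):
    """Share of null experiments that hit p < alpha on at least one look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(trials):
        running_sum = 0.0
        for day in range(1, looks + 1):
            # One day's standardized evidence under H0 (no real effect).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / day ** 0.5
            if abs(z) > z_crit:
                false_positives += 1
                break  # the experimenter stops as soon as it "looks significant"
    return false_positives / trials


for looks in (1, 5, 14):
    rate = peeking_false_positive_rate(looks)
    print(f"{looks:>2} looks: false positive rate = {rate:.3f}")
```

With a single look the rate stays near the nominal 5%; with daily looks over two weeks it climbs to several times that.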

Solutions:

Pre-commit: Decide sample size in advance; only look at results once
Sequential testing: Use statistical methods designed for continuous monitoring (group sequential tests, alpha-spending functions, the sequential probability ratio test or SPRT)
Always Valid Inference: Methods by Johari et al. (Optimizely, 2015) that maintain valid p-values regardless of when you look

Common Pitfalls

Multiple testing: Test 20 changes, and by chance about one will show $p < 0.05$ with no real effect. The Bonferroni correction divides the significance threshold by the number of tests: for 20 tests, use $\alpha = 0.05/20 = 0.0025$ instead of $0.05$.
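
A short sketch of the arithmetic, assuming the tests are independent and every null hypothesis is true (the function names are illustrative). It shows the chance of at least one false positive across a batch of tests, with and without the Bonferroni threshold.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """P(at least one false positive) across independent tests, all nulls true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values survive the corrected threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p < threshold for p in p_values]


num_tests = 20
print(f"uncorrected: {familywise_error_rate(num_tests):.3f}")
corrected_alpha = bonferroni_threshold(num_tests)
print(f"per-test threshold: {corrected_alpha:.4f}")
print(f"corrected: {familywise_error_rate(num_tests, corrected_alpha):.3f}")
```

Without correction, 20 independent tests produce at least one false positive roughly 64% of the time; with the corrected threshold the familywise rate drops back below 5%.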

Network effects and spillover: If user A is in the control group but user A’s friends are in the treatment group, user A’s experience is affected by the treatment. Common in social networks. Solution: cluster randomization (assign clusters of connected users to the same group).

Novelty effect: Users engage more with anything new, simply because it’s different. A new feature might show initial improvement that fades after users habituate. Run experiments long enough to capture steady-state behavior.

Selection bias in assignment: If your randomization is by day (Monday gets version A, Tuesday gets version B), behavior differences between days contaminate your results. Randomize at the user level, or at the cluster level when spillover is a concern.
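
One common way to randomize at the user level is to hash a stable user identifier together with the experiment name. The sketch below is a hypothetical helper, not any particular platform's implementation; `assign_variant`, the experiment name, and the ID format are all assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' or 'B' for one experiment."""
    # Including the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "B" if bucket < treatment_share else "A"


# Hypothetical user IDs; the same user always lands in the same group.
users = [f"user-{i}" for i in range(10_000)]
share_b = sum(assign_variant(u, "checkout-flow") == "B" for u in users) / len(users)
print(f"share in treatment: {share_b:.3f}")
```

Because the assignment depends only on the user and the experiment, it does not vary by day of week, device, or visit order, and a returning user sees a consistent version.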

Metric selection: What you measure matters enormously. Optimizing click-through rate can hurt revenue. Optimizing short-session engagement can hurt long-term retention. Always pre-specify your primary metric (and secondary ones) before running the test.

The Switch from “Did We Win?” to “How Much?”

Traditional A/B testing asks “is there a statistically significant effect?” Bayesian A/B testing asks “what is the probability that B is better than A, and by how much?”

The Bayesian approach combines prior beliefs with observed data to produce a posterior distribution over the effect size. Benefits:

More natural interpretation (“B is probably 2.3% better, with 95% credible interval [1.1%, 3.5%]”)
Less dependence on a fixed stopping rule (though stopping the moment a threshold is crossed can still mislead)
Can incorporate prior knowledge (similar tests showed 1-2% effects)
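
As a rough illustration of the Bayesian framing, the sketch below uses a Beta-Binomial model for conversion rates: a Beta prior on each variant's rate, updated with observed conversions, then sampled to estimate the probability that B beats A and a 95% credible interval for the difference. The uniform Beta(1, 1) prior, the function name, and the counts are assumptions chosen for illustration.

```python
import random


def compare_variants(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0,
                     prior_alpha=1.0, prior_beta=1.0):
    """Return (P(B > A), (lower, upper)) for the lift rate_b - rate_a."""
    rng = random.Random(seed)
    lifts = []
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(prior_alpha + successes, prior_beta + failures).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        lifts.append(rate_b - rate_a)
        wins += rate_b > rate_a
    lifts.sort()
    lower = lifts[int(0.025 * draws)]
    upper = lifts[int(0.975 * draws) - 1]
    return wins / draws, (lower, upper)


# Hypothetical counts: 500 of 10,000 convert in A, 560 of 10,000 in B.
prob, (low, high) = compare_variants(500, 10_000, 560, 10_000)
print(f"P(B > A) = {prob:.3f}")
print(f"95% credible interval for the lift: [{low:.4f}, {high:.4f}]")
```

Prior knowledge from similar tests can be encoded by replacing the uniform prior with more informative `prior_alpha` and `prior_beta` values.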

Many platforms offer Bayesian testing options (VWO does, as did Google Optimize before it was discontinued in 2023). Large companies with strong statistical expertise (Airbnb, Netflix) often use hybrid approaches.

One thing to remember: The hardest part of A/B testing isn’t the statistics — it’s discipline: pre-specifying your metrics, respecting your sample size calculation, and not peeking at results until you’ve collected enough data to trust them.
