A/B Testing — Deep Dive

CUPED variance reduction, interleaving for recommendation systems, multi-armed bandits, Bayesian sequential testing, and how Booking.com runs 1000+ simultaneous tests.

CUPED: Variance Reduction Through Covariates

A fundamental constraint on A/B test speed: higher variance in the metric requires larger samples. If users’ weekly purchase amounts range from $0 to $500, you need many users to reliably detect a $5 improvement.
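To make the scale concrete, here is a rough sketch of the standard two-sample sample-size approximation, $n \approx 2(z_{1-\alpha/2} + z_{\text{power}})^2 \sigma^2 / \delta^2$ per arm. The standard deviation of $100 is an assumed value for illustration (the text only gives a range), and `sample_size_per_arm` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided, two-sample z-test.

    sigma: assumed standard deviation of the metric (same in both arms)
    delta: smallest difference in means we want to detect
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Assumed sigma = $100; detecting a $5 lift at 5% significance and 80% power.
n_raw = sample_size_per_arm(sigma=100.0, delta=5.0)
print(n_raw)
```

Under these assumptions the answer is about 6,300 users per arm. Because $n$ scales with $\sigma^2$, any technique that lowers metric variance lowers the required sample size proportionally.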

CUPED (Controlled-experiment Using Pre-Experiment Data), introduced by Microsoft Research (Deng et al., 2013), reduces metric variance using pre-experiment observations.

The intuition: a user who spent $100 last week is likely to spend around $100 this week regardless of which variant they’re in. Pre-experiment spending predicts in-experiment spending. By “removing” this predictable variation, we reduce the noise we need to detect through the experiment.

Formally, define adjusted metric $\tilde{Y}$: $$\tilde{Y}_i = Y_i - \theta X_i$$

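Where $X_i$ is a pre-experiment covariate (e.g., the previous week’s purchases) and $\theta = \text{Cov}(Y, X) / \text{Var}(X)$, which is the OLS slope from regressing $Y$ on $X$ and the value of $\theta$ that minimizes $\text{Var}(\tilde{Y})$.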

The variance reduction: $$\text{Var}(\tilde{Y}) = \text{Var}(Y)(1 - \rho_{XY}^2)$$

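Where $\rho_{XY}$ is the correlation between $Y$ and $X$. If pre-experiment purchases correlate 0.5 with in-experiment purchases, then $\rho_{XY}^2 = 0.25$ and variance is reduced by 25%. Because required sample size is proportional to metric variance, the required sample size also falls by about 25%.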
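The variance identity follows directly from the definitions of $\tilde{Y}$ and $\theta$. A short derivation, using only the symbols already introduced:

$$\begin{aligned}
\text{Var}(\tilde{Y}) &= \text{Var}(Y - \theta X) \\
&= \text{Var}(Y) - 2\theta\,\text{Cov}(Y, X) + \theta^2\,\text{Var}(X) \\
&= \text{Var}(Y) - \frac{\text{Cov}(Y, X)^2}{\text{Var}(X)} \\
&= \text{Var}(Y)\left(1 - \rho_{XY}^2\right)
\end{aligned}$$

The third line substitutes $\theta = \text{Cov}(Y, X) / \text{Var}(X)$, and the last line uses $\rho_{XY}^2 = \text{Cov}(Y, X)^2 / (\text{Var}(X)\,\text{Var}(Y))$.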

Microsoft reported variance reductions of around 50% on many metrics. Booking.com applies CUPED to nearly all experiments, substantially reducing experiment duration.

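Key requirement: $X$ must be measured before randomization. It is then unaffected by treatment assignment, so subtracting $\theta X_i$ shifts both variants equally in expectation and CUPED doesn’t introduce bias into the estimated treatment effect.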
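A minimal sketch of the adjustment on synthetic data may help. Everything here is an illustrative assumption: the data-generating process, the true effect of 5.0, and the helper names `cuped_theta` and `cuped_adjust`. It estimates $\theta$ on the pooled data from both variants and applies $\tilde{Y}_i = Y_i - \theta X_i$.

```python
import random
from statistics import mean, variance


def cuped_theta(y, x):
    """OLS slope Cov(Y, X) / Var(X), using sample estimates."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov_xy / variance(x)


def cuped_adjust(y, x, theta):
    """Adjusted metric: Y_i - theta * X_i."""
    return [yi - theta * xi for xi, yi in zip(x, y)]


# Synthetic data (assumed): pre-period spend x predicts in-period spend y.
rng = random.Random(42)
n = 5000
x_control = [rng.gauss(100, 30) for _ in range(n)]
x_treat = [rng.gauss(100, 30) for _ in range(n)]
y_control = [0.8 * x + rng.gauss(0, 20) for x in x_control]
y_treat = [0.8 * x + rng.gauss(0, 20) + 5.0 for x in x_treat]  # true effect = 5.0

# Estimate theta once on pooled data so both variants get the same adjustment.
theta = cuped_theta(y_control + y_treat, x_control + x_treat)
adj_control = cuped_adjust(y_control, x_control, theta)
adj_treat = cuped_adjust(y_treat, x_treat, theta)

raw_effect = mean(y_treat) - mean(y_control)
cuped_effect = mean(adj_treat) - mean(adj_control)
var_ratio = variance(adj_control) / variance(y_control)
print(f"theta={theta:.3f} raw={raw_effect:.2f} cuped={cuped_effect:.2f} var_ratio={var_ratio:.2f}")
```

Both estimators target the same treatment effect; the adjusted one simply has a smaller variance. Many implementations subtract $\theta(X_i - \bar{X})$ instead, which keeps the metric on its original scale and leaves the variance unchanged.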

Interleaving: Faster Comparisons for Ranking

Standard A/B testing for recommendation systems has a fundamental problem: user behavior is highly noisy, and the difference between two ranking algorithms might affect only 1–2% of queries. Detecting a difference that small can require millions of users.

Interleaving (Joachims, 2003; Radlinski & Craswell, 2013) provides much faster signal by showing users results from both rankers simultaneously:

Ranker A produces ranking $[a_1, a_2, a_3, …]$
Ranker B produces ranking $[b_1, b_2, b_3, …]$
Interleave: take turns pulling from each ranker while deduplicating. Result: $[a_1, b_1, a_2, b_2, …]$
Show this combined list to the user
Track which items get clicked: score each ranker by how many of their items were clicked

If users click significantly more items from ranker A, ranker A wins. Because both rankers compete within the same page view, each user acts as their own control and between-user variance drops out of the comparison, which sharply reduces noise.
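The steps above can be sketched as a simple alternating interleave with click attribution. This is a simplified illustration rather than any production scheme: ranker A always picks first here, whereas methods such as team-draft interleaving randomize the pick order to avoid favoring one ranker. The function names `interleave` and `score_clicks` are hypothetical.

```python
def interleave(ranking_a, ranking_b, length):
    """Alternate picks between two rankers, skipping items already shown.

    Returns a list of (item, credited_ranker) pairs.
    """
    rankings = {"A": ranking_a, "B": ranking_b}
    cursor = {"A": 0, "B": 0}
    shown = set()
    combined = []
    turn = "A"
    while len(combined) < length:
        ranking = rankings[turn]
        i = cursor[turn]
        # Skip items the other ranker has already contributed.
        while i < len(ranking) and ranking[i] in shown:
            i += 1
        if i < len(ranking):
            shown.add(ranking[i])
            combined.append((ranking[i], turn))
            i += 1
        cursor[turn] = i
        # Stop when neither ranker has anything left to contribute.
        if all(cursor[k] >= len(rankings[k]) for k in rankings):
            break
        turn = "B" if turn == "A" else "A"
    return combined


def score_clicks(combined, clicked):
    """Credit each click to the ranker that contributed the clicked item."""
    credit = {"A": 0, "B": 0}
    for item, ranker in combined:
        if item in clicked:
            credit[ranker] += 1
    return credit


page = interleave(["x", "y", "z"], ["y", "w", "x"], length=4)
print(page)
print(score_clicks(page, clicked={"y", "w"}))
```

Aggregating these per-page credits over many page views, and testing whether one ranker wins more often than chance, gives the preference signal described above.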

Multi-interleaving extends this to comparing 3+ rankers simultaneously. Netflix, Spotify, and Amazon use variants of interleaving for ranking experiments.

Required sample size: reported figures are on the order of 100–1000x fewer users than A/B testing for equivalent sensitivity. Tradeoff: interleaving measures only which ranker users prefer, not absolute click rates or downstream metrics.

Multi-Armed Bandits

Traditional A/B testing assigns equal traffic to each variant throughout the experiment. This is wasteful when one variant is clearly better early — you’re still sending half your traffic to the inferior version for weeks.

Multi-Armed Bandit (MAB) algorithms adaptively allocate more traffic to better-performing variants:

Thompson Sampling: Maintain a Beta distribution posterior over the conversion rate for each variant. To choose which variant to show:

Sample from each variant’s posterior: $\hat{p}_k \sim \text{Beta}(\alpha_k, \beta_k)$
Show the variant with the highest sample

As evidence accumulates, the posterior for the better variant becomes more concentrated at a higher value, so it gets selected more often.
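A minimal simulation of this loop, assuming binary conversions, a uniform Beta(1, 1) prior for every variant, and made-up true conversion rates. The update rule (add 1 to $\alpha_k$ on a conversion, add 1 to $\beta_k$ otherwise) is the standard Beta-Bernoulli posterior update; `thompson_sampling` is a hypothetical function name.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Thompson Sampling over Bernoulli arms; returns pulls per arm."""
    rng = random.Random(seed)
    k_arms = len(true_rates)
    alpha = [1.0] * k_arms  # prior successes + 1
    beta = [1.0] * k_arms   # prior failures + 1
    pulls = [0] * k_arms
    for _ in range(rounds):
        # Draw one plausible conversion rate per variant from its posterior.
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(k_arms)]
        arm = max(range(k_arms), key=lambda k: samples[k])
        # Simulated user response (true_rates are unknown in a real experiment).
        converted = rng.random() < true_rates[arm]
        if converted:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling(true_rates=[0.10, 0.30], rounds=5000)
print(pulls)
```

In runs like this, most traffic ends up on the higher-converting variant, while the weaker one still receives occasional exploratory traffic.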

UCB (Upper Confidence Bound): Select the variant with the highest upper confidence bound: $$A_t = \arg\max_k \left[\hat{\mu}_k + c\sqrt{\frac{\ln t}{n_k}}\right]$$

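Where $\hat{\mu}_k$ is the estimated mean reward of variant $k$, $t$ is the total number of rounds played so far, $n_k$ is the number of times variant $k$ has been shown, and $c$ controls how strongly the algorithm explores. The term $\sqrt{\ln t / n_k}$ is the exploration bonus, which shrinks as variant $k$ is shown more often.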
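A small sketch of this selection rule on simulated Bernoulli rewards. The choice $c = \sqrt{2}$ (the classic UCB1 constant), the initial pass that shows each variant once so that $n_k > 0$, and the true rates are assumptions made for illustration; `ucb1` is a hypothetical function name.

```python
import math
import random


def ucb1(true_rates, rounds, c=math.sqrt(2), seed=0):
    """Simulate UCB selection over Bernoulli arms; returns (counts, means)."""
    rng = random.Random(seed)
    k_arms = len(true_rates)
    counts = [0] * k_arms   # n_k: times each variant was shown
    means = [0.0] * k_arms  # running estimate of mean reward per variant
    for t in range(1, rounds + 1):
        if t <= k_arms:
            arm = t - 1  # show every variant once so n_k > 0
        else:
            arm = max(
                range(k_arms),
                key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]),
            )
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


counts, means = ucb1(true_rates=[0.2, 0.8], rounds=2000)
print(counts, [round(m, 3) for m in means])
```

Unlike Thompson Sampling, the choice here is deterministic given the history: randomness enters only through the observed rewards.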

The exploration-exploitation tradeoff: Bandit algorithms sacrifice some statistical rigor (harder to compute p-values) for reduced regret (fewer users exposed to the inferior variant). Most appropriate when the cost of showing inferior versions is high (user experience degradation, revenue loss) and experiment duration is long.

Sequential Testing: Looking Early, Validly

Sequential Probability Ratio Test (SPRT), Wald (1945): A test that can be stopped at any time with valid error guarantees. At each observation, compute the likelihood ratio:

$$\Lambda_n = \prod_{i=1}^n \frac{f_1(x_i)}{f_0(x_i)}$$

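Here $\alpha$ and $\beta$ are the target type I and type II error rates. Stop and conclude $H_1$ if $\Lambda_n \geq B$; stop and conclude $H_0$ if $\Lambda_n \leq A$; continue otherwise. Wald’s approximate thresholds are $B = (1-\beta)/\alpha$ and $A = \beta/(1-\alpha)$.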
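For a concrete case, the sketch below runs the test on a stream of binary outcomes, with simple hypotheses $H_0: p = p_0$ and $H_1: p = p_1$. Working in log space turns the product into a running sum. The function name `sprt_bernoulli` and the example rates are assumptions for illustration.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for Bernoulli data. Returns (decision, samples_used)."""
    upper = math.log((1 - beta) / alpha)  # log B
    lower = math.log(beta / (1 - alpha))  # log A
    llr = 0.0  # running log-likelihood ratio, log(Lambda_n)
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", n  # ran out of data without crossing a threshold


print(sprt_bernoulli([1] * 10, p0=0.1, p1=0.3))  # ('accept H1', 3)
print(sprt_bernoulli([0] * 10, p0=0.1, p1=0.3))  # ('accept H0', 7)
```

With these settings, three consecutive conversions are enough to cross the upper threshold, while seven consecutive non-conversions cross the lower one.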

The payoff is a smaller expected sample size than a fixed-sample test with the same error rates. Under $H_0$, the test stops early when the data clearly favor the null. Under $H_1$, it stops early when the effect is large.

mSPRT (mixture SPRT) (Johari et al., 2017, Optimizely): Always-valid inference that maintains type I error control regardless of when you stop. Instead of a single alternative $f_1$, the likelihood ratio is averaged over a mixing distribution on possible effect sizes, typically a normal distribution centered on the null value. This provides valid p-values and confidence intervals at any point in the experiment.
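A simplified sketch of the idea, under assumptions that are stronger than a real platform would make: a single stream of observations (for example, per-pair treatment-minus-control differences) that are normal with known variance `sigma2`, and a normal mixing distribution $N(\theta_0, \tau^2)$ on the effect. In that case the mixture likelihood ratio has the closed form $\Lambda_n = \sqrt{\sigma^2 / (\sigma^2 + n\tau^2)}\,\exp\!\left(\frac{n^2 \tau^2 (\bar{x}_n - \theta_0)^2}{2\sigma^2(\sigma^2 + n\tau^2)}\right)$, and the always-valid p-value is the running minimum of $1/\Lambda_n$. The name `msprt_p_values` and the parameter values are hypothetical.

```python
import math


def msprt_p_values(differences, sigma2, tau2, theta0=0.0):
    """Always-valid p-values from a normal-mixture SPRT (known variance)."""
    p = 1.0
    total = 0.0
    p_values = []
    for n, d in enumerate(differences, start=1):
        total += d
        x_bar = total / n
        # log of the mixture likelihood ratio Lambda_n
        log_lambda = 0.5 * math.log(sigma2 / (sigma2 + n * tau2)) + (
            n ** 2 * tau2 * (x_bar - theta0) ** 2
        ) / (2 * sigma2 * (sigma2 + n * tau2))
        # p-value can only decrease: p_n = min(p_{n-1}, 1 / Lambda_n)
        p = min(p, math.exp(-log_lambda))
        p_values.append(p)
    return p_values


null_stream = msprt_p_values([0.0] * 50, sigma2=1.0, tau2=1.0)
effect_stream = msprt_p_values([1.0] * 50, sigma2=1.0, tau2=1.0)
print(null_stream[-1], effect_stream[-1])
```

Because the p-value sequence is valid at every $n$, an experimenter can check it continuously and stop the first time it drops below $\alpha$ without inflating the type I error rate.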

Experimentation at Scale: Booking.com and Microsoft

Booking.com (Vermeersch, 2019): Runs 1000+ simultaneous experiments on a product used by millions. Key practices:

Auto-detection of instrumentation errors: Compare assignment logs with product logs to detect missing events
Interaction detection: Test for statistical interactions between concurrent experiments (rare but important)
Guardrail metrics: Automatic experiment termination if core metrics (latency, error rates, core conversion) degrade significantly
Experiment templates: Standardized configurations for common experiment types reduce setup time and human error

Microsoft ExP (Experimentation Platform) (Kohavi et al.): Processes trillions of rows of data daily across 100+ products. Key innovation: triggered analysis, which restricts the analysis to users, in both control and treatment, who were in a situation where the feature would actually have been shown (for features that are only visible in certain conditions). This removes noise from users who could not have experienced any difference.

The Microsoft team’s analysis of 20,000 experiments found: approximately 1/3 of experiments show positive results, 1/3 show neutral/inconclusive results, and 1/3 show negative results. Without testing, you’d ship all of them indiscriminately.

One thing to remember: The sophistication of modern experimentation platforms — CUPED, interleaving, sequential testing, interaction detection — exists to solve one problem: making decisions faster with less data, while maintaining statistical validity.
