A/B Testing ML Models in Python — Core Concepts

Why A/B Test Models?

Offline evaluation (test set accuracy, cross-validation) tells you how a model performs on historical data. It does not tell you how it performs in production with real users, real latency, and real feedback loops.

A/B testing bridges this gap by exposing a new model to a fraction of live traffic and measuring its impact against the current production model. It answers: “Does this model actually improve the business metric we care about?”

How Model A/B Tests Work

  1. Define the hypothesis — “Model B will increase click-through rate by at least 2% compared to Model A”
  2. Choose metrics — a primary metric (what you optimize) and guardrail metrics (what must not degrade)
  3. Calculate sample size — how much traffic you need, and how long the test must run, to detect the target effect with adequate statistical power
  4. Split traffic — randomly assign users to control (Model A) or treatment (Model B)
  5. Run the experiment — collect data without peeking
  6. Analyze results — statistical test to determine if the difference is real
  7. Decide — promote Model B, iterate, or discard

Key Concepts

Statistical Significance

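A result is statistically significant when the observed difference would be unlikely if the two models truly performed the same. The conventional threshold is p < 0.05: if there were no real difference, a gap at least this large would show up less than 5% of the time. Note that the p-value is not the probability that the difference is random noise; it only describes how surprising the data would be under the "no difference" assumption.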
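As a rough illustration, the p-value for a difference in click-through rate can be computed with a two-sided two-proportion z-test. The sketch below uses only the standard library; the function name and the click counts are hypothetical, and a z-test is one reasonable choice for large samples, not necessarily the one your team uses.

```python
import math


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled rate under the null hypothesis that both models convert equally.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(clicks_a=1000, users_a=10000,
                             clicks_b=1100, users_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

If SciPy or statsmodels is available, their tested implementations are usually preferable to a hand-rolled version like this one.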

Sample Size

Underpowered tests are one of the most common mistakes. If you need 10,000 users per group to detect a 2% lift but only run the test on 1,000, you will likely fail to detect a difference even when Model B is genuinely better. This is a Type II error (false negative).
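The required sample size can be estimated before the test starts. The sketch below applies the standard normal-approximation formula for comparing two proportions; the function name, the default alpha of 0.05 and power of 0.80, and the example rates are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, treatment_rate, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + treatment_rate * (1 - treatment_rate))
    effect = treatment_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates: a 10% baseline CTR and a hoped-for 12% treatment CTR.
n_large_effect = sample_size_per_group(0.10, 0.12)
# A much smaller effect (10.0% vs 10.2%) needs far more traffic.
n_small_effect = sample_size_per_group(0.10, 0.102)
print(n_large_effect, n_small_effect)
```

The required size grows with the inverse square of the effect, so halving the lift you want to detect roughly quadruples the traffic you need.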

Guardrail Metrics

Primary metrics are what you want to improve. Guardrail metrics are what must not get worse. For a recommendation model:

Metric Type    Example
Primary        Click-through rate
Guardrail      Page load time
Guardrail      Unsubscribe rate
Guardrail      Revenue per user

A model that improves clicks but doubles unsubscribes is a net loss.
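One way to encode this rule is a small check that promotes the new model only if the primary metric improved and no guardrail degraded beyond its tolerance. Everything here is a hypothetical sketch: the class, the tolerance values, and the metric numbers are invented for illustration, and a real analysis would also test each metric for statistical significance rather than compare point estimates.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    control: float
    treatment: float
    higher_is_better: bool
    tolerance: float = 0.0  # allowed relative degradation (hypothetical)

    def harm(self) -> float:
        """Relative change in the 'worse' direction; negative means improvement."""
        change = (self.treatment - self.control) / self.control
        return -change if self.higher_is_better else change


def evaluate(primary, guardrails):
    """Return (promote, violated_guardrail_names)."""
    improved = primary.harm() < 0
    violations = [g.name for g in guardrails if g.harm() > g.tolerance]
    return improved and not violations, violations


# Hypothetical results: clicks improve, but unsubscribes double.
primary = MetricResult("click_through_rate", 0.100, 0.105, higher_is_better=True)
guardrails = [
    MetricResult("page_load_ms", 420.0, 430.0, higher_is_better=False, tolerance=0.05),
    MetricResult("unsubscribe_rate", 0.010, 0.020, higher_is_better=False, tolerance=0.05),
    MetricResult("revenue_per_user", 2.50, 2.52, higher_is_better=True),
]
promote, violations = evaluate(primary, guardrails)
print(promote, violations)
```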

Traffic Splitting Strategies

  • User-level split — each user always sees the same model. Prevents inconsistent experiences.
  • Request-level split — each request is randomly routed. Simpler but can confuse users with inconsistent results.
  • Session-level split — consistent within a browsing session, may change between sessions.

User-level splitting is the standard for most ML A/B tests because model behavior affects the entire user experience.
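A common way to get a stable user-level split is to hash the user ID together with the experiment name and map the hash to a bucket. The sketch below is one minimal version; the function name, the experiment name, and the choice of SHA-256 are assumptions, not a prescribed implementation.

```python
import hashlib


def assign_variant(user_id, experiment_name, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same group for a given experiment.
first = assign_variant("user-123", "ranker-v2-test")
second = assign_variant("user-123", "ranker-v2-test")
print(first, first == second)
```

Because the assignment is a pure function of the user ID, no lookup table is needed and every service that routes traffic will agree on which model a user sees.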

The Peeking Problem

Checking results daily and stopping the test when it “looks significant” inflates false positive rates. A test designed for two weeks should run for two weeks. Early stopping requires sequential analysis methods designed for that purpose.
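A small simulation makes the inflation visible. The sketch below runs A/A tests, where both groups have the same true conversion rate, so every "significant" result is a false positive. It compares checking once at the end against checking every day and stopping at the first significant result. The traffic numbers, conversion rate, and function names are all hypothetical.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed, so no evidence of a difference
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_tests=300, days=14, users_per_day=100,
                      rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with daily peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            # A peeker would stop and declare a winner at the first such day.
            if p_value(conv_a, n, conv_b, n) < alpha:
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += p_value(conv_a, n, conv_b, n) < alpha
    return peeking_hits / n_tests, fixed_hits / n_tests


peek_rate, fixed_rate = simulate_aa_tests()
print(f"false positives with daily peeking: {peek_rate:.1%}")
print(f"false positives with a single final check: {fixed_rate:.1%}")
```

The single final check should land near the nominal 5%, while the peeking strategy can only do as badly or worse, because it counts a win on any day the result crosses the threshold.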

Common Misconception

Many teams compare Model A and Model B only on the test set and skip the A/B test entirely. Offline metrics and online metrics often disagree. A model with 1% higher accuracy might have 5% worse latency, causing users to leave before seeing the result. A/B testing catches these real-world effects that offline evaluation misses.

One thing to remember: A/B testing is the only reliable way to measure whether a model change actually helps real users — offline metrics alone are not enough to justify a production deployment.

