Python A/B Testing Framework — Core Concepts

Why Not Just Ship and See?

You could deploy the green button to everyone and compare this week’s metrics to last week’s. But last week had a holiday, or a marketing campaign, or different weather. Without a control group running at the same time, you can’t separate the effect of your change from everything else happening in the world.

A/B testing isolates the variable. Both groups experience the same external conditions. The only difference is your change.

The Components of an A/B Test

Every experiment has four parts:

ComponentPurposeExample
HypothesisWhat you think will happen”A larger CTA button will increase signups by 5%“
VariantsThe versions being comparedControl (current) vs Treatment (larger button)
AssignmentHow users are split into groups50/50 random split, consistent per user
MetricWhat you’re measuringSignup conversion rate

Statistical Significance: Is the Difference Real?

If Group A converts at 4.8% and Group B at 5.2%, is B actually better? Or did you just get lucky with the sample?

p-value answers this. It tells you the probability of seeing this difference (or larger) if there’s actually no real difference between A and B. The industry standard threshold is p < 0.05 — meaning less than a 5% chance the result is due to random noise.

Sample size determines how small a difference you can detect. To detect a 1% lift on a 5% base conversion rate with 95% confidence, you need roughly 30,000 users per variant. Smaller effects need bigger samples.

Statistical power is the probability of detecting a real effect when one exists. A standard target is 80% power. Running an experiment with too few users gives you low power — you might miss a real improvement.

Assignment Strategies

Random assignment — each new user is randomly placed in a group. Simple but can create imbalanced groups with small samples.

Consistent hashing — hash the user ID to determine their group. The same user always sees the same variant, even across sessions. This is the standard approach.

Stratified assignment — ensure key demographics (country, device, plan) are balanced across groups. More complex but reduces noise in the results.

Guardrail Metrics

Your experiment might improve signups (the primary metric) while destroying something else — like increasing page load time or crashing the checkout flow. Guardrail metrics are secondary metrics that must not get worse:

  • Error rate
  • Page load time
  • Revenue per user
  • Customer support tickets

If a guardrail metric degrades beyond a threshold, the experiment should be stopped automatically, regardless of how well the primary metric is doing.

Common Misconception

“We can stop the test early when we see a winner.” Peeking at results and stopping when they look good inflates your false-positive rate dramatically. If you check results 10 times during an experiment, your actual false-positive rate jumps from 5% to over 25%. Either commit to a fixed sample size upfront, or use sequential testing methods that account for peeking.

One thing to remember: An A/B test needs a clear hypothesis, consistent user assignment, enough sample size for statistical power, and guardrail metrics to catch unintended harm. Cutting corners on any of these makes results unreliable.

pythonexperimentationdata

See Also