Python A/B Testing Framework — Core Concepts

Understand experiment design, statistical significance, and the components of an A/B testing platform built in Python.

Why Not Just Ship and See?

You could deploy the green button to everyone and compare this week’s metrics to last week’s. But last week had a holiday, or a marketing campaign, or different weather. Without a control group running at the same time, you can’t separate the effect of your change from everything else happening in the world.

A/B testing isolates the variable. Both groups experience the same external conditions. The only difference is your change.

The Components of an A/B Test

Every experiment has four parts:

Component	Purpose	Example
Hypothesis	What you think will happen	”A larger CTA button will increase signups by 5%“
Variants	The versions being compared	Control (current) vs Treatment (larger button)
Assignment	How users are split into groups	50/50 random split, consistent per user
Metric	What you’re measuring	Signup conversion rate

Statistical Significance: Is the Difference Real?

If Group A converts at 4.8% and Group B at 5.2%, is B actually better? Or did you just get lucky with the sample?

p-value answers this. It tells you the probability of seeing this difference (or larger) if there’s actually no real difference between A and B. The industry standard threshold is p < 0.05 — meaning less than a 5% chance the result is due to random noise.

Sample size determines how small a difference you can detect. To detect a 1% lift on a 5% base conversion rate with 95% confidence, you need roughly 30,000 users per variant. Smaller effects need bigger samples.

Statistical power is the probability of detecting a real effect when one exists. A standard target is 80% power. Running an experiment with too few users gives you low power — you might miss a real improvement.

Assignment Strategies

Random assignment — each new user is randomly placed in a group. Simple but can create imbalanced groups with small samples.

Consistent hashing — hash the user ID to determine their group. The same user always sees the same variant, even across sessions. This is the standard approach.

Stratified assignment — ensure key demographics (country, device, plan) are balanced across groups. More complex but reduces noise in the results.

Guardrail Metrics

Your experiment might improve signups (the primary metric) while destroying something else — like increasing page load time or crashing the checkout flow. Guardrail metrics are secondary metrics that must not get worse:

Error rate
Page load time
Revenue per user
Customer support tickets

If a guardrail metric degrades beyond a threshold, the experiment should be stopped automatically, regardless of how well the primary metric is doing.

Common Misconception

“We can stop the test early when we see a winner.” Peeking at results and stopping when they look good inflates your false-positive rate dramatically. If you check results 10 times during an experiment, your actual false-positive rate jumps from 5% to over 25%. Either commit to a fixed sample size upfront, or use sequential testing methods that account for peeking.

One thing to remember: An A/B test needs a clear hypothesis, consistent user assignment, enough sample size for statistical power, and guardrail metrics to catch unintended harm. Cutting corners on any of these makes results unreliable.

pythonexperimentationdata