A/B Testing ML Models in Python — ELI5

Imagine you baked two batches of cookies — one with chocolate chips and one with peanut butter chips. You want to know which one people like more. You could eat both yourself and decide, but that is just your opinion. A better way is to give half your friends the chocolate chip cookies and the other half the peanut butter ones, then ask everyone to rate them. That is an A/B test.

Companies do the same thing with their computer brains (models). Say Netflix has a new recommendation model that it thinks will suggest better movies. But “thinks” is not the same as “knows.” The old model has been working fine. What if the new one is actually worse?

So Netflix splits its users into two groups. Group A keeps seeing recommendations from the old model. Group B gets recommendations from the new model. Nobody knows which group they are in. After a few weeks, Netflix checks: did Group B watch more movies? Did they rate them higher? Did fewer people cancel their subscriptions?
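In Python, the "split users into two groups" step could be sketched like this. It is a minimal illustration, not how Netflix actually does it: the function name `assign_group`, the experiment label, and the user IDs are all made up for the example. The idea is to hash each user ID so the split looks random across users, while any one user always lands in the same group.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "new-recs-model") -> str:
    """Return "A" (old model) or "B" (new model) for a user.

    Hashing the experiment name together with the user ID gives a split
    that looks random across users but is stable: the same user always
    gets the same group. (Names here are hypothetical.)
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return "A" if bucket < 50 else "B"

# Made-up user IDs, just to show that the split comes out close to half and half.
users = [f"user-{i}" for i in range(10_000)]
groups = [assign_group(u) for u in users]
share_b = groups.count("B") / len(groups)
print(f"Share of users in group B: {share_b:.3f}")
```

Using a hash instead of a coin flip on every visit means a user does not bounce between the old and new model from one day to the next.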

If Group B did better, the new model wins and everyone gets it. If not, the new model gets sent back to the drawing board, and the damage stays limited because only half the users ever saw it.
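The "check how each group did" step is mostly bookkeeping. Here is a small sketch with invented toy records (six pretend users, nowhere near enough for a real decision) that adds up the same numbers for each group: average hours watched and the share of users who cancelled. The record layout and the `summarize` helper are assumptions made for this example.

```python
from statistics import mean

# Invented toy records: (group, hours_watched, cancelled)
records = [
    ("A", 10.0, False),
    ("A", 8.0, True),
    ("A", 12.0, False),
    ("B", 11.0, False),
    ("B", 13.0, False),
    ("B", 9.0, False),
]

def summarize(records, group):
    """Collect the headline numbers for one group."""
    rows = [r for r in records if r[0] == group]
    return {
        "users": len(rows),
        "avg_hours": mean(r[1] for r in rows),
        "cancel_rate": mean(1.0 if r[2] else 0.0 for r in rows),
    }

summary_a = summarize(records, "A")
summary_b = summarize(records, "B")
print("Group A:", summary_a)
print("Group B:", summary_b)
```

Both groups are measured with exactly the same yardstick, which is what makes the comparison fair.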

The important part is fairness. The groups need to be random and big enough that the results are not just luck. If you only tested on five friends, maybe the peanut butter group just happened to love peanut butter. With thousands of people, the results are much more trustworthy.
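To see why group size matters, here is a rough sketch of one common way to ask "could this gap just be luck?": a two-proportion z-test, written with only the standard library. The helper name and the counts are invented for illustration, and the test relies on a normal approximation that is shaky for tiny groups, so treat the small-group number as a rough signal only. A small p-value means the gap would be surprising if the two groups really liked their cookies equally.

```python
import math

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for the gap between two proportions (z-test)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both groups identical, no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Same gap in both cases (60% vs 80% liked it), very different group sizes.
# All counts are made up for the example.
small_p = two_proportion_p_value(3, 5, 4, 5)            # five friends per group
large_p = two_proportion_p_value(3000, 5000, 4000, 5000)  # thousands per group

print(f"5 per group:    p = {small_p:.3f}")
print(f"5000 per group: p = {large_p:.3g}")
```

With five friends per group, a 60% versus 80% gap could easily be chance. With five thousand per group, the same gap is very hard to explain as luck.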

One thing to remember: A/B testing lets you check with real users whether a new model is actually better before rolling it out to everyone, which lowers the risk of making things worse.

Tags: python, ab-testing, machine-learning, mlops
