Scikit-Learn Imbalanced Data — ELI5

Why a model that's 99% accurate can still be useless — and how to fix it when one group vastly outnumbers another.

Imagine a fire alarm that never goes off. It’s “correct” 99.9% of the time — because fires are extremely rare. But the one time there’s an actual fire, it stays silent. That’s not a good alarm.

This is exactly what happens when you train a machine learning model on imbalanced data — when one group massively outnumbers another.

Say you’re building a model to detect credit card fraud. Out of 10,000 transactions, only 50 are fraudulent. If the model simply says “not fraud” for every single transaction, it’s right 99.5% of the time. Impressive accuracy — but it catches zero fraud. Completely useless.
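To see the arithmetic for yourself, here is a minimal sketch using the numbers above. It assumes scikit-learn's DummyClassifier as a stand-in for the "always say not fraud" model; the labels are made up to match the 10,000 / 50 split, and the single all-zero feature column is just a placeholder.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 10,000 transactions: 9,950 legitimate (0) and 50 fraudulent (1).
y = np.array([0] * 9950 + [1] * 50)
X = np.zeros((len(y), 1))  # placeholder features; this baseline ignores them

# The "lazy" model: always predict the most common label, "not fraud".
lazy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = lazy.predict(X)

accuracy = accuracy_score(y, y_pred)
fraud_caught = int(y_pred[y == 1].sum())  # frauds that were actually flagged

print(f"Accuracy: {accuracy:.1%}")
print(f"Fraud caught: {fraud_caught} of {int(y.sum())}")
```

A baseline like this is a handy sanity check: if a real model cannot beat it on the rare class, its high accuracy is not telling you much.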

The problem is that the model takes the lazy path. Why bother learning subtle fraud patterns when guessing “not fraud” gets such a high score? It’s like a student who discovers they can pass an exam by always picking answer “C.”

To fix this, you need to change the rules of the game:

Make rare events louder. Tell the model that missing a fraud case is, say, 100 times worse than wrongly flagging a legitimate transaction. Now the lazy path is punished.
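In scikit-learn, one way to make rare events louder is the class_weight parameter that many classifiers accept. The sketch below is only an illustration: the data is synthetic (make_classification), LogisticRegression is an arbitrary choice of model, and the 100x weight simply mirrors the example above rather than being a recommended value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fraud data: roughly 1% of rows belong to the rare class (1).
X, y = make_classification(
    n_samples=10_000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Same model twice: once with default weights, once where an error on the
# rare class costs 100 times as much as an error on the common class.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(
    max_iter=1000, class_weight={0: 1, 1: 100}
).fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))

print(f"Rare cases caught, default weights: {recall_plain:.0%}")
print(f"Rare cases caught, 100x weight:     {recall_weighted:.0%}")
```

The trade-off is more false alarms, so the right weight depends on how costly each kind of mistake really is. Passing class_weight="balanced" instead lets scikit-learn pick weights from the class frequencies.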

Balance the training data. Either duplicate the rare examples (oversampling) or use fewer common examples (undersampling) so both groups are equally represented.
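Both kinds of rebalancing can be sketched with sklearn.utils.resample. The data here is random noise shaped like the fraud example (9,950 common rows, 50 rare ones), so treat it as an illustration of the mechanics rather than a recipe.

```python
import numpy as np
from sklearn.utils import resample

# Made-up data: 9,950 common rows (label 0) and 50 rare rows (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = np.array([0] * 9950 + [1] * 50)

X_common, X_rare = X[y == 0], X[y == 1]

# Oversampling: draw rare rows WITH replacement until they match the common count.
X_rare_up = resample(X_rare, replace=True, n_samples=len(X_common), random_state=0)
X_over = np.vstack([X_common, X_rare_up])
y_over = np.array([0] * len(X_common) + [1] * len(X_rare_up))

# Undersampling: keep a random subset of common rows, WITHOUT replacement.
X_common_down = resample(X_common, replace=False, n_samples=len(X_rare), random_state=0)
X_under = np.vstack([X_common_down, X_rare])
y_under = np.array([0] * len(X_common_down) + [1] * len(X_rare))

print("Oversampled class counts: ", np.bincount(y_over))
print("Undersampled class counts:", np.bincount(y_under))
```

In practice you would rebalance only the training split and leave the test data at its real-world proportions, so the evaluation still reflects how rare fraud actually is.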

Change how you measure success. Stop relying on accuracy alone. Instead, ask: “Of all the real fraud cases, how many did you catch?” That metric is called recall, and it exposes the lazy model immediately: a model that never flags fraud scores zero.
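The question "of all the real fraud cases, how many did you catch?" is what scikit-learn computes with recall_score. The sketch below compares the lazy model against a hypothetical "careful" model; the careful model's numbers (40 of 50 frauds caught, 200 false alarms) are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 9,950 legitimate (0) followed by 50 fraudulent (1) transactions.
y_true = np.array([0] * 9950 + [1] * 50)

# Lazy model: never flags anything.
lazy_pred = np.zeros_like(y_true)

# Hypothetical careful model: 200 false alarms, 40 of the 50 frauds caught.
careful_pred = np.zeros_like(y_true)
careful_pred[:200] = 1        # legitimate transactions wrongly flagged
careful_pred[9950:9990] = 1   # fraud cases correctly flagged

results = {}
for name, pred in [("lazy", lazy_pred), ("careful", careful_pred)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, pred),
        "recall": recall_score(y_true, pred),
    }
    print(f"{name:8s} accuracy={results[name]['accuracy']:.3f} "
          f"recall={results[name]['recall']:.2f}")
```

The careful model ends up with lower accuracy than the lazy one, yet it is the only one of the two that catches any fraud. Recall makes that visible where accuracy hides it.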

One thing to remember: When your data has a rare but important class, accuracy lies. Change the metric, rebalance the data, or adjust the penalty — otherwise your model learns to ignore the thing you care about most.

