Scikit-Learn Imbalanced Data — Core Concepts
Why imbalanced data breaks standard models
Most machine learning algorithms optimize overall accuracy or an average loss, which treats every sample equally rather than every class. When one class represents 95% or more of the samples, the algorithm finds that predicting the majority class everywhere already minimizes overall error. The minority class, often the one you actually care about, gets sacrificed.
This is not a bug in the algorithm. It’s doing exactly what you asked: minimize total mistakes. The fix requires changing what you ask for.
The four lines of defense
1. Better evaluation metrics
Stop using accuracy. Use metrics designed for imbalanced scenarios:
- Precision — of all predicted positives, how many are correct? (Important when false alarms are costly)
- Recall (Sensitivity) — of all actual positives, how many did you catch? (Important when missing cases is costly)
- F1 score — harmonic mean of precision and recall, balancing both
- ROC AUC — measures how well the model separates classes across all probability thresholds
- Precision-Recall AUC — more informative than ROC AUC when the positive class is very rare
- Balanced accuracy — average recall across classes, unaffected by class proportions
Use classification_report for a complete view across all classes.
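To make the difference between these metrics concrete, here is a minimal sketch on made-up labels (95 negatives, 5 positives). The arrays and the two "models" are hypothetical, chosen only so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth: 95 negatives followed by 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# Baseline that always predicts the majority class.
y_majority = np.zeros_like(y_true)

# Hypothetical model: 6 false alarms, 1 missed positive, 4 positives caught.
y_model = y_true.copy()
y_model[:6] = 1   # six negatives flagged as positive
y_model[95] = 0   # one positive missed

for name, y_pred in [("always-majority", y_majority), ("model", y_model)]:
    print(name)
    print("  accuracy         ", accuracy_score(y_true, y_pred))
    print("  balanced accuracy", balanced_accuracy_score(y_true, y_pred))
    print("  precision        ", precision_score(y_true, y_pred, zero_division=0))
    print("  recall           ", recall_score(y_true, y_pred, zero_division=0))
    print("  F1               ", f1_score(y_true, y_pred, zero_division=0))

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_true, y_model, zero_division=0))
```

The always-majority baseline reaches 0.95 accuracy but only 0.5 balanced accuracy and zero recall, while the hypothetical model scores lower on accuracy (0.93) despite catching 4 of the 5 positives.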
2. Class weights
Many scikit-learn classifiers accept a class_weight parameter:
from sklearn.ensemble import RandomForestClassifier
# 'balanced' automatically sets weights inversely proportional to class frequency
model = RandomForestClassifier(class_weight='balanced', random_state=42)
With class_weight='balanced', scikit-learn weights each class by n_samples / (n_classes * np.bincount(y)), so a class that appears 100x less often gets 100x more weight per sample during training. The model is penalized heavily for misclassifying minority samples.
You can also set custom weights: class_weight={0: 1, 1: 50} for fine-grained control.
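To see which weights 'balanced' would produce for a given label array, one option is to compute them directly. This is a small sketch with a made-up label vector (1000 negatives, 10 positives); the resulting dictionary could be passed as a custom class_weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: the positive class is 100x rarer than the negative class.
y = np.array([0] * 1000 + [1] * 10)
classes = np.unique(y)

# Same heuristic as class_weight='balanced':
# n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Dictionary form, usable as class_weight={...} on a classifier.
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Ratio between minority and majority weight.
ratio = weights[1] / weights[0]
print(ratio)
```

Here the ratio between the two weights equals the ratio between the class counts, which is the "100x rarer, 100x heavier" behavior described above.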
3. Resampling strategies
Oversampling — duplicate minority class samples or generate synthetic ones. SMOTE (Synthetic Minority Over-sampling Technique) creates new samples by interpolating between existing minority examples. Available through the imbalanced-learn library.
Undersampling — reduce majority class to match minority class size. Simple but loses information. Works best with large datasets where you can afford to discard majority samples.
Combination — oversample the minority slightly and undersample the majority slightly, meeting somewhere in between.
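The mechanics of the two simplest strategies can be sketched with scikit-learn alone. This example uses random oversampling and random undersampling via sklearn.utils.resample on synthetic data; it is not SMOTE, which interpolates new points and lives in imbalanced-learn. The array shapes and class counts are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical dataset: 950 majority rows, 50 minority rows, 3 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

X_maj, X_min = X[y == 0], X[y == 1]

# Random oversampling: draw minority rows with replacement
# until the minority class is as large as the majority class.
X_min_over = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_over = np.vstack([X_maj, X_min_over])
y_over = np.array([0] * len(X_maj) + [1] * len(X_min_over))

# Random undersampling: keep a random subset of majority rows
# the same size as the minority class.
X_maj_under = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)
X_under = np.vstack([X_maj_under, X_min])
y_under = np.array([0] * len(X_maj_under) + [1] * len(X_min))

print("original    ", np.bincount(y))
print("oversampled ", np.bincount(y_over))
print("undersampled", np.bincount(y_under))
```

Oversampling keeps every majority row but repeats minority rows many times; undersampling here throws away 900 of the 950 majority rows, which is the information loss mentioned above.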
4. Threshold tuning
By default, a binary classifier's predict method returns the positive class when its predicted probability exceeds 0.5. For imbalanced data, adjusting this threshold can dramatically improve results:
- Lower threshold (e.g., 0.3) → catches more positives but more false alarms
- Higher threshold (e.g., 0.7) → fewer false alarms but misses more positives
The optimal threshold depends on the relative cost of false positives vs. false negatives in your business context.
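The trade-off can be sketched by thresholding predict_proba manually. The dataset below is synthetic (make_classification with roughly 5% positives), and the model choice and the three thresholds are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each held-out sample.
proba = clf.predict_proba(X_test)[:, 1]

# Apply several thresholds instead of relying on predict().
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.3f} recall={recall:.3f}")
```

Recall can only stay the same or rise as the threshold drops, because every sample flagged at a higher threshold is also flagged at a lower one. In practice the threshold would be chosen on a validation split rather than on the final test set.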
Choosing the right approach
Mild imbalance (70/30 to 90/10): Class weights are usually sufficient. Models can still learn minority patterns from enough examples.
Moderate imbalance (90/10 to 99/1): Combine class weights with appropriate metrics. Consider SMOTE for additional synthetic minority samples.
Severe imbalance (99/1 or worse): Use all strategies together — class weights, resampling, threshold tuning, and specialized metrics. At extreme ratios, anomaly detection approaches may outperform classification.
Common misconception
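A common misconception is that resampling can be applied to the whole dataset before it is split. Resampling should only happen on training data, never on validation or test data. If you apply SMOTE before splitting, synthetic samples derived from points that later land in the test set leak into training, producing inflated scores that don’t reflect real-world performance.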
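The leak is easy to observe with random oversampling, used here as a simpler stand-in for SMOTE. In this sketch the data is synthetic, and oversample_minority and count_shared_rows are hypothetical helpers written for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical dataset: 950 majority rows, 50 minority rows.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate class-1 rows until both classes are the same size."""
    X_maj, X_min = X[y == 0], X[y == 1]
    X_min_over = resample(
        X_min, replace=True, n_samples=len(X_maj), random_state=random_state
    )
    X_bal = np.vstack([X_maj, X_min_over])
    y_bal = np.concatenate(
        [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_over), dtype=int)]
    )
    return X_bal, y_bal

def count_shared_rows(X_a, X_b):
    """Count rows of X_b that also appear, byte for byte, in X_a."""
    seen = {row.tobytes() for row in X_a}
    return sum(row.tobytes() in seen for row in X_b)

# Wrong order: resample first, then split.
X_bal, y_bal = oversample_minority(X, y)
X_tr_bad, X_te_bad, _, _ = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0
)
leaked_bad = count_shared_rows(X_tr_bad, X_te_bad)

# Right order: split first, then resample the training part only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_tr_bal, y_tr_bal = oversample_minority(X_tr, y_tr)
leaked_ok = count_shared_rows(X_tr_bal, X_te)

print("test rows duplicated in training (resample, then split):", leaked_bad)
print("test rows duplicated in training (split, then resample):", leaked_ok)
```

With the wrong order, copies of the same minority row end up on both sides of the split, so the model is partly scored on rows it has already seen. With SMOTE the leaked points are interpolations rather than exact copies, but the effect on the scores is of the same kind.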
Cross-validation with imbalanced data
Always use StratifiedKFold to maintain class proportions across folds:
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
Without stratification, some folds may have zero minority samples, producing meaningless scores.
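The failure mode can be shown by counting minority samples per test fold. This sketch assumes a label array sorted by class (990 negatives, then 10 positives), which makes the problem with unshuffled, unstratified folds especially visible; X is a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels, sorted by class: 990 negatives, then 10 positives.
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

def minority_per_fold(splitter):
    """Number of positive samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = minority_per_fold(KFold(n_splits=5))
stratified = minority_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold          :", plain)
print("StratifiedKFold:", stratified)
```

With plain KFold on this ordering, four of the five test folds contain no positives at all, so metrics such as F1 or recall cannot be computed meaningfully on them. StratifiedKFold spreads the ten positives evenly, two per fold.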
One thing to remember: Imbalanced data requires changing three things simultaneously: how you train (weights/resampling), how you predict (threshold), and how you evaluate (metrics). Fixing only one usually isn’t enough.