Scikit-Learn Imbalanced Data — Core Concepts

Handle class imbalance in scikit-learn with resampling, class weights, threshold tuning, and proper evaluation metrics.

Why imbalanced data breaks standard models

Most machine learning algorithms minimize an overall loss (or maximize overall accuracy) in which every sample counts equally. When one class makes up 95% or more of the samples, predicting the majority class everywhere already drives that overall error close to its minimum. The minority class, often the one you actually care about, gets sacrificed.

This is not a bug in the algorithm. It’s doing exactly what you asked: minimize total mistakes. The fix requires changing what you ask for.
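To make the point concrete, here is a minimal sketch using made-up labels with a 95/5 split (the data and variable names are illustrative, not from a real dataset). A baseline that always predicts the majority class scores high on accuracy while catching none of the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: 950 negatives and 50 positives (a 95/5 split).
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # features do not matter for this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred)  # recall for class 1

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# accuracy=0.95, minority recall=0.00
```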

The four lines of defense

1. Better evaluation metrics

Stop using accuracy. Use metrics designed for imbalanced scenarios:

Precision — of all predicted positives, how many are correct? (Important when false alarms are costly)
Recall (Sensitivity) — of all actual positives, how many did you catch? (Important when missing cases is costly)
F1 score — harmonic mean of precision and recall, balancing both
ROC AUC — measures how well the model separates classes across all probability thresholds
Precision-Recall AUC — more informative than ROC AUC when the positive class is very rare
Balanced accuracy — average recall across classes, unaffected by class proportions

Use classification_report to see precision, recall, F1, and support for every class in one table.
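The metrics above can be computed roughly as follows. This is a sketch on synthetic data from make_classification; the dataset, the LogisticRegression model, and all variable names are assumptions chosen for illustration. Note that average_precision_score is used here as a common summary of the precision-recall curve, which is close to but not identical to a trapezoidal PR AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    classification_report,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (illustrative only).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)               # hard labels for threshold-based metrics
y_score = clf.predict_proba(X_test)[:, 1]  # positive-class scores for ranking metrics

# Per-class precision, recall, F1, and support.
report = classification_report(y_test, y_pred, zero_division=0)
print(report)

bal_acc = balanced_accuracy_score(y_test, y_pred)   # mean recall over classes
roc_auc = roc_auc_score(y_test, y_score)            # needs scores, not labels
pr_auc = average_precision_score(y_test, y_score)   # summary of the PR curve
print(f"balanced acc={bal_acc:.3f}  ROC AUC={roc_auc:.3f}  AP={pr_auc:.3f}")
```

The ranking metrics (ROC AUC and average precision) take probabilities or scores, while the others take predicted labels, so they answer slightly different questions about the same model.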

2. Class weights

Many scikit-learn classifiers accept a class_weight parameter:

from sklearn.ensemble import RandomForestClassifier

# 'balanced' automatically sets weights inversely proportional to class frequency
model = RandomForestClassifier(class_weight='balanced', random_state=42)

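With class_weight='balanced', each class is weighted by n_samples / (n_classes * class_count), so a class that appears 100x less often gets 100x more weight per sample during training. Misclassifying a minority sample then costs the model far more than misclassifying a majority sample.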

You can also set custom weights: class_weight={0: 1, 1: 50} for fine-grained control.
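To see what 'balanced' actually produces, the weights can be computed directly with compute_class_weight. The sketch below uses made-up labels with a 100:1 ratio and checks the result against the formula n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: class 1 is 100x rarer than class 0.
y = np.array([0] * 1000 + [1] * 10)
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same numbers from the formula: n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
# {0: 0.505, 1: 50.5}
```

The absolute values matter less than the ratio between them: here the minority class carries 100 times the per-sample weight of the majority class.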

3. Resampling strategies

Oversampling: duplicate minority class samples or generate synthetic ones. SMOTE (Synthetic Minority Over-sampling Technique) creates new samples by interpolating between existing minority examples. It is not part of scikit-learn itself; it is available through the separate imbalanced-learn library.

Undersampling: reduce the majority class, typically down to the minority class size. This is simple but throws away information, so it works best with large datasets where you can afford to discard majority samples.

Combination: oversample the minority slightly and undersample the majority slightly, meeting somewhere in between.
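As a rough illustration of the first two strategies using only scikit-learn, the sketch below does plain random oversampling and random undersampling with sklearn.utils.resample. This is deliberately not SMOTE: it duplicates or drops existing rows rather than synthesizing new ones. The data and variable names are made up for the example.

```python
import numpy as np
from sklearn.utils import resample

# Illustrative data: 900 majority rows (class 0), 100 minority rows (class 1).
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

X_maj, X_min = X[y == 0], X[y == 1]

# Random oversampling: draw minority rows WITH replacement up to majority size.
X_min_over = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_over = np.vstack([X_maj, X_min_over])
y_over = np.array([0] * len(X_maj) + [1] * len(X_min_over))

# Random undersampling: draw majority rows WITHOUT replacement down to minority size.
X_maj_under = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)
X_under = np.vstack([X_maj_under, X_min])
y_under = np.array([0] * len(X_maj_under) + [1] * len(X_min))

print(np.bincount(y_over), np.bincount(y_under))
# [900 900] [100 100]
```

A combined strategy would pick an intermediate n_samples for both calls. For synthetic samples rather than duplicates, the imbalanced-learn library provides dedicated samplers.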

4. Threshold tuning

By default, a binary classifier's predict method labels a sample positive when its predicted probability exceeds 0.5. For imbalanced data, moving this threshold shifts the trade-off between precision and recall, often substantially:

Lower threshold (e.g., 0.3) → catches more positives but raises more false alarms
Higher threshold (e.g., 0.7) → raises fewer false alarms but misses more positives

The optimal threshold depends on the relative cost of false positives vs. false negatives in your business context.
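A minimal sketch of threshold tuning, assuming a model that exposes predict_proba: instead of calling predict, compare the positive-class probability against a threshold of your choosing. The synthetic dataset, the LogisticRegression model, and the three candidate thresholds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (illustrative only).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
# Thresholds are chosen on a validation split, not on the final test set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Apply a custom cut-off instead of the default used by predict().
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_val, y_pred, zero_division=0),
        recall_score(y_val, y_pred),
    )

for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Recall can only stay the same or fall as the threshold rises, since every sample flagged at 0.7 is also flagged at 0.3. Precision usually moves the other way, which is the trade-off the cost analysis has to settle.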

Choosing the right approach

Mild imbalance (70/30 to 90/10): Class weights are usually sufficient. Models can still learn minority patterns from enough examples.

Moderate imbalance (90/10 to 99/1): Combine class weights with appropriate metrics. Consider SMOTE for additional synthetic minority samples.

Severe imbalance (99/1 or worse): Use all strategies together — class weights, resampling, threshold tuning, and specialized metrics. At extreme ratios, anomaly detection approaches may outperform classification.

Common misconception

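The misconception is that you can resample the whole dataset first and split afterwards. Resampling should only happen on training data, never on validation or test data. If you apply SMOTE before splitting, synthetic samples interpolated from rows that later land in the test set leak into training, producing inflated scores that don’t reflect real-world performance.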
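The safe ordering can be sketched as: split first, then resample only the training portion. The example below uses random oversampling via sklearn.utils.resample as a stand-in for SMOTE, and assumes class 1 is the minority; the data and names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data with roughly 10% positives (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Step 1: split BEFORE any resampling so the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 2: oversample the minority class (assumed to be class 1) in the training set only.
minority_mask = y_train == 1
n_majority = int((~minority_mask).sum())
X_min_up, y_min_up = resample(
    X_train[minority_mask],
    y_train[minority_mask],
    replace=True,
    n_samples=n_majority,
    random_state=0,
)
X_train_bal = np.vstack([X_train[~minority_mask], X_min_up])
y_train_bal = np.concatenate([y_train[~minority_mask], y_min_up])

# The training set is now balanced; the test set keeps the real-world ratio.
test_minority_share = y_test.mean()
print(np.bincount(y_train_bal), f"test minority share={test_minority_share:.3f}")
```

Inside cross-validation the same rule applies per fold. The imbalanced-learn library offers its own Pipeline class that applies samplers only when fitting, which is one way to avoid doing this by hand.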

Cross-validation with imbalanced data

Always use StratifiedKFold to maintain class proportions across folds:

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

Without stratification, some folds may have zero minority samples, producing meaningless scores.
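The difference is easy to see by counting minority samples per test fold. The sketch below assumes a worst case where the labels are sorted and the unstratified KFold is not shuffled; the labels are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Illustrative labels: 990 negatives followed by 10 positives (sorted on purpose).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))

plain = KFold(n_splits=5)  # no shuffling: folds follow row order
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count minority samples that land in each test fold.
plain_counts = [int(y[test_idx].sum()) for _, test_idx in plain.split(X)]
strat_counts = [int(y[test_idx].sum()) for _, test_idx in strat.split(X, y)]

print("KFold:          ", plain_counts)  # [0, 0, 0, 0, 10]
print("StratifiedKFold:", strat_counts)  # [2, 2, 2, 2, 2]
```

Four of the five unstratified folds contain no positives at all, so a score like F1 for the positive class cannot be meaningfully computed on them.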

One thing to remember: Imbalanced data requires changing three things simultaneously: how you train (weights/resampling), how you predict (threshold), and how you evaluate (metrics). Fixing only one usually isn’t enough.
