Scikit-Learn Imbalanced Data — Deep Dive

Production strategies for class imbalance in scikit-learn — from cost-sensitive learning and threshold optimization to SMOTE pipelines and evaluation traps.

Technical foundation

Class imbalance affects learning through skewed class priors. Most classifiers learn to approximate P(y|X), and when P(y=1) is very small, the majority-class posterior dominates across most of the feature space. The decision boundary is pushed toward (or into) the minority class region, which reduces minority recall.

Two theoretical frameworks address this:

Cost-sensitive learning — modify the objective function to penalize minority class errors more heavily
Sampling-based correction — modify the training distribution to approximate balanced priors

Both work by changing the effective class prior that the model sees during training.
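To make the effect of the prior concrete, here is a minimal sketch for a toy case that is not part of the setup below: two 1-D Gaussian classes with equal variance. The means (0 and 2), the unit variance, and the helper names are assumptions chosen for illustration. For this toy model the Bayes boundary has a closed form, so the shift caused by a 3% prior can be computed directly.

```python
import math

def bayes_boundary(mu0, mu1, sigma, prior1):
    """Decision boundary between two equal-variance 1-D Gaussians.

    Predict class 1 when x exceeds the returned value (assumes mu1 > mu0).
    """
    prior0 = 1.0 - prior1
    midpoint = (mu0 + mu1) / 2.0
    shift = sigma ** 2 / (mu1 - mu0) * math.log(prior0 / prior1)
    return midpoint + shift

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minority_recall(mu1, sigma, boundary):
    """Fraction of class-1 probability mass that falls above the boundary."""
    return 1.0 - normal_cdf((boundary - mu1) / sigma)

balanced_cut = bayes_boundary(mu0=0.0, mu1=2.0, sigma=1.0, prior1=0.5)
skewed_cut = bayes_boundary(mu0=0.0, mu1=2.0, sigma=1.0, prior1=0.03)

print(f"Boundary with equal priors: {balanced_cut:.3f}")
print(f"Boundary with P(y=1)=0.03:  {skewed_cut:.3f}")
print(f"Minority recall, equal priors: {minority_recall(2.0, 1.0, balanced_cut):.3f}")
print(f"Minority recall, 3% prior:     {minority_recall(2.0, 1.0, skewed_cut):.3f}")
```

In this toy case the boundary moves from the midpoint to a point beyond the minority mean, so most minority samples land on the wrong side. Both cost-sensitive learning and resampling can be read as ways of pulling that boundary back.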

Comprehensive evaluation setup

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import (
    make_scorer, f1_score, precision_score, recall_score,
    average_precision_score, balanced_accuracy_score
)

# Create imbalanced dataset: 97% negative, 3% positive
X, y = make_classification(
    n_samples=20000, n_features=25, n_informative=10,
    weights=[0.97, 0.03], flip_y=0.02, random_state=42
)

print(f"Class distribution: {np.bincount(y)} (ratio: {np.bincount(y)[0]/np.bincount(y)[1]:.0f}:1)")

scoring = {
    'f1': make_scorer(f1_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score),
    'balanced_acc': make_scorer(balanced_accuracy_score),
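    # Built-in scorer string: scores on predicted probabilities / decision values.
    # (make_scorer(..., needs_proba=True) is deprecated in recent scikit-learn releases.)
    'avg_precision': 'average_precision',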
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: no imbalance handling
baseline = HistGradientBoostingClassifier(max_iter=200, random_state=42)
baseline_results = cross_validate(baseline, X, y, cv=cv, scoring=scoring)

for metric, scores in baseline_results.items():
    if metric.startswith('test_'):
        print(f"Baseline {metric[5:]}: {scores.mean():.4f} ± {scores.std():.4f}")
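The scoring dictionary above leaves out plain accuracy on purpose. As a small illustration of why, the sketch below scores a classifier that always predicts the majority class. The dataset size and the `X_demo`/`y_demo` names are assumptions for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)

# Always predicts the majority class and ignores the features entirely
majority_only = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = majority_only.predict(X_demo)

acc = accuracy_score(y_demo, y_hat)
bal_acc = balanced_accuracy_score(y_demo, y_hat)
rec = recall_score(y_demo, y_hat)

print(f"accuracy={acc:.3f}  balanced_accuracy={bal_acc:.3f}  minority_recall={rec:.3f}")
```

Accuracy looks excellent even though the model never finds a single positive, while balanced accuracy and recall expose the failure. Any real model should be compared against this kind of baseline.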

Class weight strategies

Built-in class_weight='balanced'

from sklearn.ensemble import RandomForestClassifier

# Weight = n_samples / (n_classes × n_samples_per_class)
# For 97/3 split: class 0 weight ≈ 0.52, class 1 weight ≈ 16.7
rf_balanced = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced',
    random_state=42
)

results = cross_validate(rf_balanced, X, y, cv=cv, scoring=scoring)
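The weight formula in the comment above can be checked with scikit-learn's `compute_class_weight` helper. This sketch uses an idealized label vector with an exact 97/3 split (an assumption; the generated dataset is slightly noisier because of `flip_y`).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with an exact 97/3 split (not the noisy dataset above)
y_example = np.array([0] * 9700 + [1] * 300)

classes = np.unique(y_example)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_example)

# Same formula by hand: n_samples / (n_classes * count_per_class)
manual = len(y_example) / (len(classes) * np.bincount(y_example))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight = {w:.3f}")
```

The two computations agree, which confirms that 'balanced' simply rescales each class so that both contribute the same total weight to the loss.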

Custom cost matrix

When business costs are asymmetric:

# Fraud detection: missing fraud ($500 loss) vs. false alarm ($10 investigation cost)
cost_ratio = 500 / 10  # 50:1

rf_custom = RandomForestClassifier(
    n_estimators=300,
    class_weight={0: 1, 1: cost_ratio},
    random_state=42
)

class_weight='balanced_subsample' for Random Forests

rf_subsample = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',  # recompute weights per bootstrap sample
    random_state=42
)

This recalculates class weights for each tree’s bootstrap sample, which can produce different effective weights per tree — adding diversity to the ensemble.
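As a rough illustration of that per-tree variation, the sketch below draws a few bootstrap samples of an idealized label vector and recomputes the balanced weights on each. It approximates the idea and does not reproduce RandomForestClassifier's internal code; the sample size and seed are assumptions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
y_example = np.array([0] * 970 + [1] * 30)

minority_weights = []
for _ in range(5):
    # Draw a bootstrap sample of the labels, similar to what each tree would see
    boot = rng.choice(y_example, size=len(y_example), replace=True)
    w = compute_class_weight("balanced", classes=np.array([0, 1]), y=boot)
    minority_weights.append(w[1])

print("Minority weight per bootstrap:", np.round(minority_weights, 2))
```

Because the number of minority rows in each bootstrap fluctuates, the minority weight fluctuates with it. The smaller the minority class, the larger this tree-to-tree variation tends to be.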

Threshold optimization

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, f1_score
import matplotlib.pyplot as plt

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = HistGradientBoostingClassifier(max_iter=300, random_state=42)
model.fit(X_train, y_train)

# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]

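# Find optimal threshold for F1
# Note: tuning on the test set gives an optimistic F1; use a validation split in practice
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-10)
# precisions/recalls have one more entry than thresholds; skip the final (1, 0) point
optimal_idx = np.argmax(f1_scores[:-1])
optimal_threshold = thresholds[optimal_idx]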

print(f"Default threshold (0.5) F1: {f1_score(y_test, y_proba >= 0.5):.4f}")
print(f"Optimal threshold ({optimal_threshold:.3f}) F1: {f1_score(y_test, y_proba >= optimal_threshold):.4f}")

# Visualize threshold impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(thresholds, precisions[:-1], label='Precision')
ax1.plot(thresholds, recalls[:-1], label='Recall')
ax1.axvline(optimal_threshold, color='red', linestyle='--', label=f'Optimal ({optimal_threshold:.3f})')
ax1.set_xlabel('Threshold')
ax1.legend()
ax1.set_title('Precision vs Recall by Threshold')

ax2.plot(thresholds, f1_scores[:-1])
ax2.axvline(optimal_threshold, color='red', linestyle='--')
ax2.set_xlabel('Threshold')
ax2.set_ylabel('F1 Score')
ax2.set_title('F1 Score by Threshold')
plt.tight_layout()
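The search above picks the threshold on the same test set that is then used to report F1, which tends to overstate the gain. One way around this, sketched below under assumptions, is to choose the threshold from out-of-fold predictions on the training data and apply it to the test set once. The logistic regression model and the small synthetic dataset are stand-ins chosen to keep the example fast; recent scikit-learn releases also provide `TunedThresholdClassifierCV` for a similar purpose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: every training row is scored by a model that never saw it
oof_proba = cross_val_predict(clf, X_tr, y_tr, cv=folds, method="predict_proba")[:, 1]

prec, rec, thr = precision_recall_curve(y_tr, oof_proba)
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-10)
chosen_threshold = thr[np.argmax(f1)]

# Refit on all training data, then apply the frozen threshold to the test set once
clf.fit(X_tr, y_tr)
test_proba = clf.predict_proba(X_te)[:, 1]
test_f1 = f1_score(y_te, (test_proba >= chosen_threshold).astype(int))

print(f"Threshold chosen on out-of-fold data: {chosen_threshold:.3f}")
print(f"Test F1 at that threshold: {test_f1:.4f}")
```

The test F1 reported this way is an honest estimate, because the test labels played no part in choosing the threshold.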

Business-cost threshold optimization

def find_cost_optimal_threshold(y_true, y_proba, fp_cost=10, fn_cost=500):
    """Find threshold that minimizes total business cost."""
    thresholds = np.linspace(0.01, 0.99, 200)
    costs = []

    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        fp = ((y_pred == 1) & (y_true == 0)).sum()
        fn = ((y_pred == 0) & (y_true == 1)).sum()
        total_cost = fp * fp_cost + fn * fn_cost
        costs.append(total_cost)

    optimal_idx = np.argmin(costs)
    return thresholds[optimal_idx], costs[optimal_idx]

best_threshold, min_cost = find_cost_optimal_threshold(y_test, y_proba)
print(f"Cost-optimal threshold: {best_threshold:.3f} (total cost: ${min_cost:,.0f})")
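The grid search above is empirical. When the predicted probabilities are well calibrated and correct decisions cost nothing (both assumptions), the cost-minimizing threshold also has a closed form: flag a case when p × fn_cost exceeds (1 − p) × fp_cost, which gives p > fp_cost / (fp_cost + fn_cost). The helper names below are hypothetical.

```python
def bayes_cost_threshold(fp_cost, fn_cost):
    """Threshold that minimizes expected cost, assuming calibrated probabilities
    and zero cost for correct predictions."""
    return fp_cost / (fp_cost + fn_cost)

def expected_cost(p, threshold, fp_cost, fn_cost):
    """Expected cost of the decision made for one case with P(y=1) = p."""
    if p >= threshold:
        return (1.0 - p) * fp_cost   # flagged: pay only if it was a false alarm
    return p * fn_cost               # not flagged: pay only if it was a miss

t_star = bayes_cost_threshold(fp_cost=10, fn_cost=500)
print(f"Analytic cost-optimal threshold: {t_star:.4f}")

# A case with a 5% fraud probability: cheaper to investigate than to ignore
print(f"Cost at analytic threshold: {expected_cost(0.05, t_star, 10, 500):.2f}")
print(f"Cost at 0.5 threshold:      {expected_cost(0.05, 0.5, 10, 500):.2f}")
```

If the empirical threshold from the grid search lands far from this analytic value, that is a hint that the model's probabilities may be poorly calibrated.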

Resampling with imbalanced-learn

# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline

# SMOTE pipeline (note: imblearn Pipeline, not sklearn Pipeline)
smote_pipe = ImbPipeline([
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),  # minority:majority = 0.5 after resampling, not full balance
    ('model', HistGradientBoostingClassifier(max_iter=300, random_state=42)),
])

# SMOTE only applies during fit, not during predict — no leakage
smote_results = cross_validate(smote_pipe, X, y, cv=cv, scoring=scoring)
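To see why resampling has to happen inside each fold, the sketch below compares the two orderings on pure-noise data. It uses naive random oversampling with NumPy instead of SMOTE so that it runs without imbalanced-learn; the helper names, the decision tree, and the dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def oversample_minority(X, y, times, rng):
    """Naive random oversampling: append len(minority) * times duplicated minority rows."""
    idx = np.flatnonzero(y == 1)
    extra = rng.choice(idx, size=len(idx) * times, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def cv_recall(X, y, resample_inside_fold, seed=0):
    rng = np.random.default_rng(seed)
    if not resample_inside_fold:
        X, y = oversample_minority(X, y, times=10, rng=rng)   # leaky: before the split
    recalls = []
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for tr, te in splitter.split(X, y):
        X_tr, y_tr = X[tr], y[tr]
        if resample_inside_fold:
            X_tr, y_tr = oversample_minority(X_tr, y_tr, times=10, rng=rng)
        model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        recalls.append(recall_score(y[te], model.predict(X[te])))
    return float(np.mean(recalls))

# Pure-noise data: the features carry no information about the label
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(2000, 5))
y_noise = (rng.random(2000) < 0.05).astype(int)

leaky = cv_recall(X_noise, y_noise, resample_inside_fold=False)
honest = cv_recall(X_noise, y_noise, resample_inside_fold=True)
print(f"Recall, oversampling before the split: {leaky:.3f}")
print(f"Recall, oversampling inside each fold: {honest:.3f}")
```

Oversampling before the split puts copies of the same minority rows in both training and test folds, so the model appears to detect positives in data that contains no signal at all. Resampling inside the fold, which is what the imblearn Pipeline does, removes that illusion.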

SMOTE variants

Standard SMOTE: Generates synthetic samples by interpolating between a minority sample and one of its k nearest minority-class neighbors. Can create noisy samples in regions where the classes overlap.

BorderlineSMOTE: Only synthesizes samples near the decision boundary where they’re most useful. More targeted than standard SMOTE.

ADASYN: Generates more synthetic samples for minority instances that are harder to classify (surrounded by majority neighbors). Focuses effort where it matters most.
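The interpolation step behind standard SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_like_samples` and its parameters are hypothetical, and details such as how seed points are allocated are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)   # seed point for each new sample
    pick = rng.integers(1, k + 1, size=n_new)             # skip column 0 (the point itself)
    neighbor = neighbor_idx[base, pick]
    gap = rng.random((n_new, 1))                          # position along the segment

    # New point lies on the line segment between the seed and the chosen neighbor
    return X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(40, 3))
X_new = smote_like_samples(X_min, n_new=100)
print(X_new.shape)
```

Every synthetic point is a convex combination of two real minority points, so it never leaves the region already spanned by the minority class. That also explains the noise problem: if a minority point sits inside majority territory, the segment to its neighbor may run straight through it.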

# Compare SMOTE variants
variants = {
    'SMOTE': SMOTE(random_state=42),
    'BorderlineSMOTE': BorderlineSMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'SMOTETomek': SMOTETomek(random_state=42),
}

for name, sampler in variants.items():
    pipe = ImbPipeline([
        ('sampler', sampler),
        ('model', HistGradientBoostingClassifier(max_iter=200, random_state=42)),
    ])
    results = cross_validate(pipe, X, y, cv=cv, scoring={'f1': make_scorer(f1_score)})
    print(f"{name:20s} F1: {results['test_f1'].mean():.4f} ± {results['test_f1'].std():.4f}")

Sampling ratio matters

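Full balancing (1:1 ratio) is not always optimal. A moderate minority-to-majority ratio (e.g., 1:3 to 1:5, which corresponds to a sampling_strategy of roughly 0.2 to 0.33) can outperform full balancing because it preserves some of the natural class prior, so it is worth sweeping the ratio: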

for ratio in [0.1, 0.2, 0.3, 0.5, 0.75, 1.0]:
    pipe = ImbPipeline([
        ('smote', SMOTE(sampling_strategy=ratio, random_state=42)),
        ('model', HistGradientBoostingClassifier(max_iter=200, random_state=42)),
    ])
    results = cross_validate(pipe, X, y, cv=cv, scoring={'f1': make_scorer(f1_score)})
    print(f"Ratio {ratio:.2f}: F1 = {results['test_f1'].mean():.4f}")
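It helps to know how much synthetic data each ratio implies. For a float `sampling_strategy`, the target is the minority-to-majority ratio after resampling, so the arithmetic can be sketched as below. The helper name and the idealized counts (an exact 97/3 split of 20,000 rows) are assumptions; the real dataset differs slightly because of label noise.

```python
def synthetic_needed(n_majority, n_minority, sampling_strategy):
    """Synthetic minority rows required so that minority / majority == sampling_strategy."""
    target_minority = int(n_majority * sampling_strategy)
    if target_minority < n_minority:
        # Oversampling cannot lower the ratio; it would have to remove minority rows
        raise ValueError("Requested ratio is below the existing minority/majority ratio")
    return target_minority - n_minority

# Illustrative counts for an exact 97/3 split of 20,000 rows
n_maj, n_min = 19400, 600
for ratio in [0.1, 0.2, 0.5, 1.0]:
    extra = synthetic_needed(n_maj, n_min, ratio)
    print(f"sampling_strategy={ratio:.1f}: add {extra:>6d} synthetic rows "
          f"(training set grows to {n_maj + n_min + extra})")
```

At full balance the synthetic rows outnumber the real minority rows by a wide margin, which is one reason a moderate ratio can generalize better: the model spends less capacity fitting interpolated points.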

Calibration after resampling

Resampling changes the effective class prior, which miscalibrates predicted probabilities. Recalibrate after training:

from sklearn.calibration import CalibratedClassifierCV

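# Hold out a calibration set from the training data (never resample it)
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
)

# Train with SMOTE on the fitting portion only
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_fit, y_fit)

model = HistGradientBoostingClassifier(max_iter=300, random_state=42)
model.fit(X_resampled, y_resampled)

# Calibrate on original (non-resampled) held-out data so the test set stays untouched.
# Note: newer scikit-learn releases deprecate cv='prefit' in favor of wrapping the
# fitted model in sklearn.frozen.FrozenEstimator; adjust to your installed version.
calibrated = CalibratedClassifierCV(model, cv='prefit', method='isotonic')
calibrated.fit(X_cal, y_cal)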
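The size of the distortion can be estimated analytically. If resampling changed only the class proportions and left the within-class feature distributions alone, Bayes' rule gives a closed-form mapping from the resampled posterior back to the original prior. That assumption holds for random over- or undersampling and only approximately for SMOTE, so treat this sketch as a sanity check and not as a replacement for empirical calibration. The function name is hypothetical.

```python
import numpy as np

def correct_for_prior_shift(p_resampled, train_prior, true_prior):
    """Map P(y=1|x) learned under train_prior back to the original true_prior.

    Assumes resampling changed only the class proportions, not the
    within-class feature distributions (only approximately true for SMOTE).
    """
    p = np.asarray(p_resampled, dtype=float)
    pos = p * (true_prior / train_prior)
    neg = (1.0 - p) * ((1.0 - true_prior) / (1.0 - train_prior))
    return pos / (pos + neg)

# sampling_strategy=0.5 means minority/majority = 0.5, so the training prior is 1/3
train_prior = 0.5 / (1 + 0.5)
true_prior = 0.03

scores = np.array([0.10, 1 / 3, 0.50, 0.90])
adjusted = correct_for_prior_shift(scores, train_prior, true_prior)

for s, a in zip(scores, adjusted):
    print(f"resampled score {s:.3f} -> corrected probability {a:.3f}")
```

A score equal to the training prior maps back to the true prior, and every score shrinks toward zero. This is why thresholds or cost calculations built on uncorrected probabilities from a resampled model tend to over-flag positives.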

When imbalance isn’t the real problem

Sometimes what looks like a class imbalance problem is actually:

Insufficient features: The minority class isn’t separable with available features. No resampling fixes this.
Label noise: Mislabeled minority samples corrupt the decision boundary. Clean labels before resampling.
Overlapping classes: When feature distributions of both classes heavily overlap, even balanced data won’t produce good separation.

Diagnostic: if a model trained and evaluated on balanced data still scores poorly (for example, balanced accuracy close to 0.5), the problem is separability, not imbalance.
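One way to run this diagnostic is sketched below: undersample the majority class to 1:1, cross-validate a simple model on the balanced subset, and look at balanced accuracy. The helper names, the logistic regression model, and the two synthetic Gaussian datasets are assumptions for illustration; the minority class is assumed to be labeled 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def balanced_subsample_score(X, y, seed=0):
    """Cross-validated balanced accuracy after undersampling the majority to 1:1."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = rng.choice(np.flatnonzero(y == 0), size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, majority_idx])
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X[keep], y[keep],
        cv=5, scoring="balanced_accuracy",
    )
    return scores.mean()

def make_two_gaussians(separation, n_major=3000, n_minor=100, seed=0):
    """Synthetic 2-D data: class means sit `separation` apart on the first axis."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(size=(n_major, 2))
    X1 = rng.normal(size=(n_minor, 2)) + np.array([separation, 0.0])
    return np.vstack([X0, X1]), np.array([0] * n_major + [1] * n_minor)

separable = balanced_subsample_score(*make_two_gaussians(separation=4.0))
overlapping = balanced_subsample_score(*make_two_gaussians(separation=0.0))
print(f"Balanced accuracy, well-separated classes: {separable:.3f}")
print(f"Balanced accuracy, fully overlapping classes: {overlapping:.3f}")
```

Both datasets have the same imbalance ratio, yet only the overlapping one stays near chance after balancing. When real data behaves like the second case, better features or cleaner labels are likely to help more than any resampling scheme.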

Production monitoring

from sklearn.metrics import classification_report

# Monitor per-class metrics in production
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)

# Alert if minority class recall drops below threshold
if report['1']['recall'] < 0.70:
    print("ALERT: Minority class recall below 70% — potential data drift or model degradation")
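In production, labels usually arrive in batches, and a small batch may contain too few positives for recall to mean much. The sketch below wraps the check in a hypothetical helper, `check_minority_recall`, that refuses to alert when the batch has fewer positives than a chosen minimum. The thresholds (70% recall, 30 positives) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import recall_score

def check_minority_recall(y_true, y_pred, min_recall=0.70, min_positives=30):
    """Return (status, recall) for one batch of labeled production data.

    status is 'ok', 'alert', or 'insufficient_data' when the batch holds too
    few positives for the recall estimate to be meaningful.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_pos = int((y_true == 1).sum())
    if n_pos < min_positives:
        return "insufficient_data", None
    recall = recall_score(y_true, y_pred, pos_label=1)
    return ("alert" if recall < min_recall else "ok"), recall

# Toy batch: 40 positives, of which the model catches 24
y_true_batch = np.array([1] * 40 + [0] * 960)
y_pred_batch = np.array([1] * 24 + [0] * 16 + [0] * 960)
status, recall = check_minority_recall(y_true_batch, y_pred_batch)
print(status, recall)
```

Separating "too little data" from "recall dropped" keeps the alert from firing on noise, which matters at high imbalance ratios where a day's traffic may hold only a handful of positives.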

Tradeoffs

Strategy	Pros	Cons	Best For
Class weights	No data modification, fast	Doesn’t help with feature overlap	Mild to moderate imbalance
SMOTE	Creates informative synthetic data	Can create noise in overlap regions	Moderate imbalance, clean data
Undersampling	Simple, fast	Loses majority class information	Large datasets
Threshold tuning	No retraining needed	Requires calibrated probabilities	Post-hoc optimization
Anomaly detection	Works at extreme ratios	Doesn’t use minority labels	99.9%+ imbalance

One thing to remember: The right strategy depends on the imbalance ratio, the overlap between classes, and the business cost of each error type. There’s no universal fix — but combining class weights with proper metrics and threshold tuning covers most production scenarios.
