Scikit-Learn Imbalanced Data — Deep Dive

Technical foundation

Class imbalance affects learning through the class prior. Most classifiers approximate P(y|X), which by Bayes' rule is proportional to P(X|y)P(y). When P(y=1) is very small, the majority-class posterior dominates across most of the feature space. The decision boundary is pushed toward (or into) the minority-class region, which lowers minority recall.

Two theoretical frameworks address this:

  1. Cost-sensitive learning — modify the objective function to penalize minority class errors more heavily
  2. Sampling-based correction — modify the training distribution to approximate balanced priors

Both work by changing the effective class prior that the model sees during training.
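
As a rough illustration of the "effective prior" idea, the sketch below fits the same logistic regression with and without class weights and compares the mean predicted P(y=1). The toy dataset, the variable names, and the use of the mean prediction as a proxy for the learned prior are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: roughly 97% negative, 3% positive
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.97, 0.03], random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_toy, y_toy)

# Mean predicted P(y=1) is a rough proxy for the prior the model has absorbed
mean_plain = plain.predict_proba(X_toy)[:, 1].mean()
mean_weighted = weighted.predict_proba(X_toy)[:, 1].mean()

print(f"Observed positive rate:        {y_toy.mean():.3f}")
print(f"Mean P(y=1), unweighted model: {mean_plain:.3f}")
print(f"Mean P(y=1), balanced weights: {mean_weighted:.3f}")
```

The unweighted model should track the observed positive rate, while the weighted model behaves as if positives were far more common. This is also why probabilities from weighted or resampled models usually need recalibration.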

Comprehensive evaluation setup

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import (
    make_scorer, f1_score, precision_score, recall_score,
    average_precision_score, balanced_accuracy_score
)

# Create imbalanced dataset: 97% negative, 3% positive
X, y = make_classification(
    n_samples=20000, n_features=25, n_informative=10,
    weights=[0.97, 0.03], flip_y=0.02, random_state=42
)

print(f"Class distribution: {np.bincount(y)} (ratio: {np.bincount(y)[0]/np.bincount(y)[1]:.0f}:1)")

scoring = {
    'f1': make_scorer(f1_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score),
    'balanced_acc': make_scorer(balanced_accuracy_score),
    'avg_precision': 'average_precision',  # built-in scorer; make_scorer(needs_proba=True) was removed in recent scikit-learn
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: no imbalance handling
baseline = HistGradientBoostingClassifier(max_iter=200, random_state=42)
baseline_results = cross_validate(baseline, X, y, cv=cv, scoring=scoring)

for metric, scores in baseline_results.items():
    if metric.startswith('test_'):
        print(f"Baseline {metric[5:]}: {scores.mean():.4f} ± {scores.std():.4f}")
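
To put the baseline numbers in context, it can help to see what a model that ignores the features would score. The sketch below uses hypothetical labels with exactly 3% positives and a majority-class DummyClassifier. It is meant only to show why accuracy is uninformative here and why the scoring dictionary above relies on per-class metrics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, recall_score, average_precision_score
)

# Hypothetical labels with exactly 3% positives; the features are irrelevant here
y_demo = np.array([0] * 970 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_hat = dummy.predict(X_demo)
scores = dummy.predict_proba(X_demo)[:, 1]  # the same score for every sample

acc = accuracy_score(y_demo, y_hat)
bal_acc = balanced_accuracy_score(y_demo, y_hat)
rec = recall_score(y_demo, y_hat)
ap = average_precision_score(y_demo, scores)

print(f"accuracy={acc:.2f}  balanced_acc={bal_acc:.2f}  recall={rec:.2f}  avg_precision={ap:.2f}")
```

Accuracy looks excellent while recall is zero, and average precision collapses to the positive rate. A real model should be judged against these floors, not against 0.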

Class weight strategies

Built-in class_weight='balanced'

from sklearn.ensemble import RandomForestClassifier

# Weight = n_samples / (n_classes × n_samples_per_class)
# For 97/3 split: class 0 weight ≈ 0.52, class 1 weight ≈ 16.7
rf_balanced = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced',
    random_state=42
)

results = cross_validate(rf_balanced, X, y, cv=cv, scoring=scoring)
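
The weight formula in the comment above can be checked directly. This sketch uses hypothetical labels with an exact 97/3 split and compares a manual computation with scikit-learn's compute_class_weight helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 970 negatives, 30 positives
y_demo = np.array([0] * 970 + [1] * 30)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# scikit-learn's helper should agree with the manual computation
auto = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

for cls, w_manual, w_auto in zip(classes, manual, auto):
    print(f"class {cls}: manual={w_manual:.3f}, sklearn={w_auto:.3f}")
```

On the dataset used in this article the weights will differ slightly from these values, because flip_y moves the realized class ratio away from exactly 97/3.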

Custom cost matrix

When business costs are asymmetric:

# Fraud detection: missing fraud ($500 loss) vs. false alarm ($10 investigation cost)
cost_ratio = 500 / 10  # 50:1

rf_custom = RandomForestClassifier(
    n_estimators=300,
    class_weight={0: 1, 1: cost_ratio},
    random_state=42
)
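
Some estimators or training APIs accept only per-row weights. As a sketch, the same 50:1 cost ratio can be expanded into a sample_weight array with compute_sample_weight. The toy data and the logistic regression used here are assumptions for illustration, not part of the pipeline above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy data with about 5% positives
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0
)

cost_weights = {0: 1.0, 1: 50.0}  # same 50:1 ratio as above

# Expand the per-class costs into one weight per training row
sw = compute_sample_weight(class_weight=cost_weights, y=y_toy)

via_class_weight = LogisticRegression(max_iter=2000, class_weight=cost_weights).fit(X_toy, y_toy)
via_sample_weight = LogisticRegression(max_iter=2000).fit(X_toy, y_toy, sample_weight=sw)

# Both routes optimize the same weighted loss, so the coefficients should match
max_diff = np.abs(via_class_weight.coef_ - via_sample_weight.coef_).max()
print(f"Largest coefficient difference: {max_diff:.2e}")
```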

class_weight='balanced_subsample' for Random Forests

rf_subsample = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',  # recompute weights per bootstrap sample
    random_state=42
)

This recalculates class weights for each tree’s bootstrap sample, which can produce different effective weights per tree — adding diversity to the ensemble.
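
To see how much the weights can move between trees, the following sketch applies the 'balanced' formula to repeated bootstrap samples of a hypothetical 97/3 label vector. It mimics the idea only and is not scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = np.array([0] * 970 + [1] * 30)  # hypothetical labels

minority_weights = []
for _ in range(20):
    # Draw a bootstrap sample of the labels, as a forest does for each tree
    boot = rng.choice(y_demo, size=len(y_demo), replace=True)
    counts = np.bincount(boot, minlength=2)
    # 'balanced' formula applied to this bootstrap sample only
    weights = len(boot) / (2 * counts)
    minority_weights.append(weights[1])

print(f"Minority weight over 20 bootstraps: "
      f"min={min(minority_weights):.1f}, max={max(minority_weights):.1f}")
```

The smaller the minority class, the more its bootstrap count fluctuates, so the per-tree weights vary most in exactly the cases where imbalance is severe.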

Threshold optimization

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, f1_score
import matplotlib.pyplot as plt

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = HistGradientBoostingClassifier(max_iter=300, random_state=42)
model.fit(X_train, y_train)

# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]

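# Find the F1-optimal threshold. precision_recall_curve returns one more
# precision/recall value than thresholds, so the final point (precision=1,
# recall=0) is excluded from the argmax to keep the index valid.
# Note: picking the threshold on the test set makes the reported F1 optimistic;
# use a validation split or out-of-fold predictions in practice.
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-10)
optimal_idx = np.argmax(f1_scores[:-1])
optimal_threshold = thresholds[optimal_idx]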

print(f"Default threshold (0.5) F1: {f1_score(y_test, y_proba >= 0.5):.4f}")
print(f"Optimal threshold ({optimal_threshold:.3f}) F1: {f1_score(y_test, y_proba >= optimal_threshold):.4f}")

# Visualize threshold impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(thresholds, precisions[:-1], label='Precision')
ax1.plot(thresholds, recalls[:-1], label='Recall')
ax1.axvline(optimal_threshold, color='red', linestyle='--', label=f'Optimal ({optimal_threshold:.3f})')
ax1.set_xlabel('Threshold')
ax1.legend()
ax1.set_title('Precision vs Recall by Threshold')

ax2.plot(thresholds, f1_scores[:-1])
ax2.axvline(optimal_threshold, color='red', linestyle='--')
ax2.set_xlabel('Threshold')
ax2.set_ylabel('F1 Score')
ax2.set_title('F1 Score by Threshold')
plt.tight_layout()
plt.show()
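
A threshold chosen and scored on the same data tends to look better than it is. One way around this, sketched below on hypothetical toy data with a logistic regression, is to pick the threshold from out-of-fold predictions on the training set and to touch the test set only once. Recent scikit-learn releases also ship a TunedThresholdClassifierCV helper for this purpose; the manual version here avoids depending on a specific version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

# Hypothetical toy data with about 10% positives
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

clf = LogisticRegression(max_iter=1000)
cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: each training row is scored by a model that never saw it
oof_proba = cross_val_predict(clf, X_tr, y_tr, cv=cv_inner, method='predict_proba')[:, 1]

prec, rec, thr = precision_recall_curve(y_tr, oof_proba)
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-10)
tuned_threshold = thr[np.argmax(f1)]

# Refit on all training data, then evaluate once on the untouched test set
clf.fit(X_tr, y_tr)
test_proba = clf.predict_proba(X_te)[:, 1]
f1_default = f1_score(y_te, test_proba >= 0.5)
f1_tuned = f1_score(y_te, test_proba >= tuned_threshold)

print(f"Tuned threshold: {tuned_threshold:.3f}")
print(f"Test F1 at 0.5: {f1_default:.4f}, at tuned threshold: {f1_tuned:.4f}")
```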

Business-cost threshold optimization

def find_cost_optimal_threshold(y_true, y_proba, fp_cost=10, fn_cost=500):
    """Find threshold that minimizes total business cost."""
    thresholds = np.linspace(0.01, 0.99, 200)
    costs = []

    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        fp = ((y_pred == 1) & (y_true == 0)).sum()
        fn = ((y_pred == 0) & (y_true == 1)).sum()
        total_cost = fp * fp_cost + fn * fn_cost
        costs.append(total_cost)

    optimal_idx = np.argmin(costs)
    return thresholds[optimal_idx], costs[optimal_idx]

best_threshold, min_cost = find_cost_optimal_threshold(y_test, y_proba)
print(f"Cost-optimal threshold: {best_threshold:.3f} (total cost: ${min_cost:,.0f})")
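
When the predicted probabilities are well calibrated and correct decisions cost nothing, the grid search has a closed-form counterpart: flag a case whenever p × fn_cost exceeds (1 − p) × fp_cost, that is, whenever p ≥ fp_cost / (fp_cost + fn_cost). The sketch below checks this on simulated, perfectly calibrated scores. The helper names and the simulated data are assumptions for illustration.

```python
import numpy as np

def cost_threshold(fp_cost, fn_cost):
    """Threshold that minimizes expected cost when probabilities are calibrated."""
    return fp_cost / (fp_cost + fn_cost)

def total_cost(y_true, y_proba, threshold, fp_cost, fn_cost):
    """Total cost of false positives and false negatives at a given threshold."""
    y_pred = y_proba >= threshold
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return fp * fp_cost + fn * fn_cost

# Simulated, perfectly calibrated scores: each label is drawn with its own probability
rng = np.random.default_rng(0)
p_sim = rng.beta(1, 20, size=100_000)
y_sim = (rng.random(p_sim.size) < p_sim).astype(int)

t_star = cost_threshold(fp_cost=10, fn_cost=500)
cost_at_t_star = total_cost(y_sim, p_sim, t_star, 10, 500)
cost_at_half = total_cost(y_sim, p_sim, 0.5, 10, 500)

print(f"Closed-form threshold: {t_star:.4f}")
print(f"Cost at closed-form threshold: ${cost_at_t_star:,.0f}; at 0.5: ${cost_at_half:,.0f}")
```

If the empirical grid search lands far from this value on real data, that is a hint that the model's probabilities are miscalibrated.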

Resampling with imbalanced-learn

# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline

# SMOTE pipeline (note: imblearn Pipeline, not sklearn Pipeline)
smote_pipe = ImbPipeline([
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),  # 50% ratio, not full balance
    ('model', HistGradientBoostingClassifier(max_iter=300, random_state=42)),
])

# SMOTE only applies during fit, not during predict — no leakage
smote_results = cross_validate(smote_pipe, X, y, cv=cv, scoring=scoring)

SMOTE variants

Standard SMOTE: Generates synthetic samples by interpolating between a minority sample and one of its k nearest minority-class neighbors. Can create noisy samples in regions where the classes overlap.

BorderlineSMOTE: Only synthesizes samples near the decision boundary where they’re most useful. More targeted than standard SMOTE.

ADASYN: Generates more synthetic samples for minority instances that are harder to classify (surrounded by majority neighbors). Focuses effort where it matters most.
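
The interpolation step behind standard SMOTE can be sketched in a few lines. This is a simplified illustration under stated assumptions (continuous features, a hypothetical smote_sketch helper, random toy points), not imbalanced-learn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation between minority samples."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)       # random minority samples
    pick = rng.integers(1, k + 1, size=n_new)            # skip column 0 (the point itself)
    chosen = neighbor_idx[base, pick]                    # one of the k neighbors
    lam = rng.random((n_new, 1))                         # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Hypothetical minority points in two dimensions
X_minority = np.random.default_rng(1).normal(size=(30, 2))
X_synth = smote_sketch(X_minority, n_new=100)
print(X_synth.shape)
```

Every synthetic point lies on a line segment between two real minority points. That explains the noise problem noted above: if a segment crosses a region dominated by the majority class, the new point lands there too.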

# Compare SMOTE variants
variants = {
    'SMOTE': SMOTE(random_state=42),
    'BorderlineSMOTE': BorderlineSMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'SMOTETomek': SMOTETomek(random_state=42),
}

for name, sampler in variants.items():
    pipe = ImbPipeline([
        ('sampler', sampler),
        ('model', HistGradientBoostingClassifier(max_iter=200, random_state=42)),
    ])
    results = cross_validate(pipe, X, y, cv=cv, scoring={'f1': make_scorer(f1_score)})
    print(f"{name:20s} F1: {results['test_f1'].mean():.4f} ± {results['test_f1'].std():.4f}")

Sampling ratio matters

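Full balancing (1:1 ratio) is not always optimal. A moderate minority-to-majority ratio (e.g., 1:3 to 1:5, which corresponds to a sampling_strategy of roughly 0.2 to 0.33) can outperform full balancing because it preserves some of the natural class prior. Compare several ratios empirically: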

for ratio in [0.1, 0.2, 0.3, 0.5, 0.75, 1.0]:
    pipe = ImbPipeline([
        ('smote', SMOTE(sampling_strategy=ratio, random_state=42)),
        ('model', HistGradientBoostingClassifier(max_iter=200, random_state=42)),
    ])
    results = cross_validate(pipe, X, y, cv=cv, scoring={'f1': make_scorer(f1_score)})
    print(f"Ratio {ratio:.2f}: F1 = {results['test_f1'].mean():.4f}")
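
For over-sampling, a float sampling_strategy is the minority count after resampling divided by the majority count. The small helper below (a hypothetical function, with made-up fold sizes) shows how many synthetic rows each ratio implies, which is useful for judging training cost before running the loop above.

```python
def n_synthetic_needed(n_minority, n_majority, sampling_strategy):
    """Synthetic minority rows needed to reach the requested minority/majority ratio."""
    target_minority = int(sampling_strategy * n_majority)
    if target_minority < n_minority:
        raise ValueError("Requested ratio is below the current ratio.")
    return target_minority - n_minority

# Hypothetical training fold: 15,520 majority and 480 minority samples
needed = {ratio: n_synthetic_needed(480, 15520, ratio) for ratio in [0.1, 0.5, 1.0]}
for ratio, n_new in needed.items():
    print(f"sampling_strategy={ratio}: {n_new} synthetic samples")
```

At full balance, most of the minority class seen by the model is synthetic, which is one reason moderate ratios are worth testing.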

Calibration after resampling

Resampling changes the effective class prior, which miscalibrates predicted probabilities. Recalibrate after training:

from sklearn.calibration import CalibratedClassifierCV

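# Hold out part of the training data for calibration (it is never resampled)
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
)

# Train with SMOTE on the fitting split only
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_fit, y_fit)

model = HistGradientBoostingClassifier(max_iter=300, random_state=42)
model.fit(X_resampled, y_resampled)

# Calibrate on original (non-resampled) held-out data; X_test stays untouched for evaluation.
# Note: cv='prefit' is deprecated in recent scikit-learn releases (1.6+);
# there, wrap the fitted model in sklearn.frozen.FrozenEstimator instead.
calibrated = CalibratedClassifierCV(model, cv='prefit', method='isotonic')
calibrated.fit(X_cal, y_cal)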
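
As a complement to fitted calibration, the shift in class prior can also be undone analytically. The sketch below rescales the predicted odds by the factor by which resampling inflated them. This correction is exact only when resampling leaves the class-conditional distributions unchanged (random over- or under-sampling); for SMOTE it is an approximation. The function name and example priors are assumptions for illustration.

```python
import numpy as np

def correct_for_resampling(p_resampled, prior_original, prior_resampled):
    """Map probabilities learned at prior_resampled back to prior_original."""
    p = np.asarray(p_resampled, dtype=float)
    # Factor by which resampling inflated the positive:negative odds
    beta = (prior_resampled / (1 - prior_resampled)) / (prior_original / (1 - prior_original))
    # Equivalent to dividing the odds p / (1 - p) by beta, written to stay finite at p = 1
    return p / (p + beta * (1 - p))

# Example: 3% positives originally, 1:2 minority:majority after resampling (prior = 1/3)
raw = np.array([0.0, 1 / 3, 0.5, 0.9, 1.0])
corrected = correct_for_resampling(raw, prior_original=0.03, prior_resampled=1 / 3)
print(np.round(corrected, 4))
```

A score equal to the resampled prior maps back to the original prior, which is a quick sanity check for any correction of this kind.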

When imbalance isn’t the real problem

Sometimes what looks like a class imbalance problem is actually:

  1. Insufficient features: The minority class isn’t separable with available features. No resampling fixes this.
  2. Label noise: Mislabeled minority samples corrupt the decision boundary. Clean labels before resampling.
  3. Overlapping classes: When feature distributions of both classes heavily overlap, even balanced data won’t produce good separation.

Diagnostic: if a model trained on perfectly balanced data still has low F1, the problem isn’t imbalance — it’s separability.
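
The diagnostic can be tried on synthetic data where the amount of class overlap is controlled. In this sketch both datasets are perfectly balanced and differ only in make_classification's class_sep parameter. The specific settings and the logistic regression are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def balanced_f1(class_sep):
    """Cross-validated F1 on a perfectly balanced synthetic dataset."""
    X_bal, y_bal = make_classification(
        n_samples=2000, n_features=10, n_informative=4, n_redundant=0,
        n_clusters_per_class=1, weights=[0.5, 0.5],
        class_sep=class_sep, random_state=0
    )
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X_bal, y_bal, cv=5, scoring='f1').mean()

f1_separable = balanced_f1(class_sep=2.0)    # classes far apart
f1_overlapping = balanced_f1(class_sep=0.1)  # classes nearly on top of each other

print(f"Balanced, well separated: F1 = {f1_separable:.3f}")
print(f"Balanced, overlapping:    F1 = {f1_overlapping:.3f}")
```

If real data behaves like the second case even after balancing, effort is better spent on features and labels than on resampling.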

Production monitoring

from sklearn.metrics import classification_report

# Monitor per-class metrics in production
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)

# Alert if minority class recall drops below threshold
if report['1']['recall'] < 0.70:
    print("ALERT: Minority class recall below 70% — potential data drift or model degradation")
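
In production, labeled positives often arrive slowly, so a recall estimate from a small batch can be too noisy to act on. The sketch below wraps the check in a hypothetical helper that skips batches with too few positives. The function name and the min_positives default are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.metrics import recall_score

def minority_recall_alert(y_true, y_pred, min_recall=0.70, min_positives=20):
    """Return (recall, alert) for one batch; skip batches with too few positives."""
    y_true = np.asarray(y_true)
    n_pos = int((y_true == 1).sum())
    if n_pos < min_positives:
        # Too few labeled positives for a stable estimate; do not alert on noise
        return None, False
    recall = recall_score(y_true, y_pred, pos_label=1)
    return recall, recall < min_recall

# Hypothetical batch: 20 positives, the model catches only 10 of them
y_batch = np.array([1] * 20 + [0] * 80)
y_batch_pred = np.array([1] * 10 + [0] * 10 + [0] * 80)
recall, alert = minority_recall_alert(y_batch, y_batch_pred)
print(f"recall={recall:.2f}, alert={alert}")
```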

Tradeoffs

Strategy          | Pros                                | Cons                                 | Best For
------------------|-------------------------------------|--------------------------------------|-------------------------------
Class weights     | No data modification, fast          | Doesn’t help with feature overlap    | Mild to moderate imbalance
SMOTE             | Creates informative synthetic data  | Can create noise in overlap regions  | Moderate imbalance, clean data
Undersampling     | Simple, fast                        | Loses majority class information     | Large datasets
Threshold tuning  | No retraining needed                | Requires calibrated probabilities    | Post-hoc optimization
Anomaly detection | Works at extreme ratios             | Doesn’t use minority labels          | 99.9%+ imbalance

One thing to remember: The right strategy depends on the imbalance ratio, the overlap between classes, and the business cost of each error type. There’s no universal fix — but combining class weights with proper metrics and threshold tuning covers most production scenarios.
