Model Evaluation Metrics in Python — Deep Dive

Implement classification and regression metrics in scikit-learn with threshold tuning, multi-class strategies, and custom scoring functions.

Classification Metrics in Detail

The Confusion Matrix Foundation

Every binary classification metric derives from four counts:

True Positives (TP): Correctly predicted positive.
True Negatives (TN): Correctly predicted negative.
False Positives (FP): Predicted positive, actually negative (Type I error).
False Negatives (FN): Predicted negative, actually positive (Type II error).

from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows are actual classes, columns are predicted: [[TN FP], [FN TP]]
# [[4 1]
#  [1 4]]

print(classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))
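To make the link between the four counts and the headline metrics concrete, here is a minimal sketch that unpacks the matrix for the same toy labels and computes precision, recall, and accuracy by hand. It assumes binary labels encoded as 0 and 1, which is what makes the `ravel()` ordering below valid.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)                # 4 1 1 4
print(precision, recall, accuracy)   # 0.8 0.8 0.8
```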

Beyond Accuracy: The Full Metric Zoo

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    balanced_accuracy_score, matthews_corrcoef, cohen_kappa_score,
)

print(f"Accuracy:          {accuracy_score(y_true, y_pred):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.4f}")
print(f"Precision:         {precision_score(y_true, y_pred):.4f}")
print(f"Recall:            {recall_score(y_true, y_pred):.4f}")
print(f"F1:                {f1_score(y_true, y_pred):.4f}")
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.4f}")
print(f"Cohen's Kappa:     {cohen_kappa_score(y_true, y_pred):.4f}")

Matthews Correlation Coefficient (MCC) ranges from -1 to +1 and accounts for all four quadrants of the confusion matrix. It is considered one of the most balanced single metrics for binary classification, especially on imbalanced datasets.

Cohen’s Kappa measures agreement between predicted and actual labels, adjusted for chance agreement. A kappa of 0 means agreement no better than chance, 1 means perfect agreement, and negative values mean agreement worse than chance.
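Both metrics can be reproduced from the four confusion-matrix counts, which helps show what "accounts for all four quadrants" and "adjusted for chance" mean in practice. The sketch below uses the standard textbook formulas and checks them against scikit-learn on the toy labels from earlier; the variable names are illustrative.

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef, cohen_kappa_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())
n = tn + fp + fn + tp

# MCC: correlation between predicted and actual labels, using all four counts
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Kappa: observed agreement corrected by the agreement expected from the marginals
p_observed = (tp + tn) / n
p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
kappa_manual = (p_observed - p_chance) / (1 - p_chance)

print(mcc_manual, matthews_corrcoef(y_true, y_pred))
print(kappa_manual, cohen_kappa_score(y_true, y_pred))
```

On this balanced toy example both come out at about 0.6, lower than the 0.8 accuracy because half of that accuracy would be expected by chance.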

Threshold Tuning

Most classifiers output probabilities. The default threshold of 0.5 is arbitrary. Moving it changes the precision-recall balance:

from sklearn.metrics import precision_recall_curve
import numpy as np

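# Assumes `model` is a fitted classifier and y_true holds the labels for X_test
y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_true, y_proba)

# Recall falls as the threshold rises, so take the highest threshold with recall >= 0.90
idx = np.where(recalls >= 0.90)[0][-1]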
optimal_threshold = thresholds[idx]
print(f"Threshold for 90% recall: {optimal_threshold:.3f}")
print(f"Precision at that threshold: {precisions[idx]:.3f}")

# Apply custom threshold
y_pred_custom = (y_proba >= optimal_threshold).astype(int)

In medical screening, you might set the threshold low to catch almost all positive cases (high recall), accepting more false positives. In ad targeting, you might set it high to avoid wasting budget on unlikely conversions (high precision).
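For readers who want to run the threshold search end to end, here is a self-contained sketch. The synthetic dataset, the logistic regression model, and the 0.90 recall target are all assumptions chosen for illustration. In a real project the threshold would normally be chosen on a validation set rather than the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustrative only)
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# precisions/recalls have one more entry than thresholds; the final point
# (precision=1, recall=0) has no threshold, so it is excluded from the search
idx = np.where(recalls[:-1] >= 0.90)[0][-1]
threshold = thresholds[idx]

y_pred_custom = (y_proba >= threshold).astype(int)
achieved_recall = recall_score(y_test, y_pred_custom)

print(f"threshold={threshold:.3f} recall={achieved_recall:.3f} "
      f"precision={precisions[idx]:.3f}")
```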

F-beta Score

When precision and recall are not equally important, F-beta lets you weight one over the other:

from sklearn.metrics import fbeta_score

# F2 emphasizes recall (beta=2 means recall is 2x as important as precision)
f2 = fbeta_score(y_true, y_pred, beta=2)

# F0.5 emphasizes precision
f05 = fbeta_score(y_true, y_pred, beta=0.5)
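The effect of beta is easier to see when precision and recall differ. The sketch below uses hypothetical labels with perfect precision but only 50 percent recall, and compares the standard formula F_beta = (1 + beta²)·P·R / (beta²·P + R) against scikit-learn.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical predictions: no false positives, but half the positives are missed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 1.0
r = recall_score(y_true, y_pred)     # 0.5


def fbeta_manual(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


for beta in (0.5, 1, 2):
    print(beta, fbeta_manual(p, r, beta), fbeta_score(y_true, y_pred, beta=beta))
```

Because recall is the weak side here, the score falls as beta grows: F0.5 is the most forgiving and F2 the harshest.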

Multi-Class Metrics

Averaging Strategies

For multi-class problems, precision, recall, and F1 are computed per-class and then averaged:

Macro: Unweighted mean across classes. Treats rare classes equally.
Micro: Computes globally by counting total TP, FP, FN. Equivalent to accuracy for single-label problems.
Weighted: Weighted by class frequency. Accounts for imbalance but can hide poor performance on rare classes.

from sklearn.metrics import f1_score

f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")

For a dataset with 90 percent class A, 5 percent class B, and 5 percent class C, micro F1 is dominated by class A performance. Macro F1 gives equal voice to each class.
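A quick way to see the difference is a degenerate classifier that always predicts the majority class. The sketch below builds the 90/5/5 split described above with made-up labels; `zero_division=0` is passed so the classes that are never predicted score 0 without a warning.

```python
from sklearn.metrics import accuracy_score, f1_score

# 90% class A, 5% class B, 5% class C; the "model" always predicts A
y_true = ["A"] * 90 + ["B"] * 5 + ["C"] * 5
y_pred = ["A"] * 100

f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
acc = accuracy_score(y_true, y_pred)

print(f"accuracy={acc:.3f} micro={f1_micro:.3f} "
      f"weighted={f1_weighted:.3f} macro={f1_macro:.3f}")
```

Micro F1 matches accuracy at 0.9, while macro F1 drops to roughly 0.32 because two of the three classes are never predicted correctly.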

Multi-Label Metrics

When each sample can belong to multiple classes (e.g., document tagging):

from sklearn.metrics import multilabel_confusion_matrix

mcm = multilabel_confusion_matrix(y_true_multi, y_pred_multi)
# Returns a confusion matrix per label
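As a small worked example, the sketch below uses a hypothetical three-tag problem encoded as a binary indicator matrix (one column per tag). The tag names and data are invented for illustration. Each 2x2 matrix in the result follows the same [[TN, FP], [FN, TP]] layout as the binary case.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

labels = ["sports", "politics", "tech"]  # hypothetical tag names

# One row per document, one column per tag (1 = tag applies)
y_true_multi = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred_multi = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

mcm = multilabel_confusion_matrix(y_true_multi, y_pred_multi)
print(mcm.shape)  # (3, 2, 2): one 2x2 matrix per tag

for name, matrix in zip(labels, mcm):
    tn, fp, fn, tp = matrix.ravel()
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn}")
```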

Regression Metrics in Detail

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error,
)
import numpy as np

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

When to Use Which

MAE is less sensitive to outliers than RMSE and is interpretable in the original units.
RMSE penalizes large errors quadratically; use it when big mistakes are especially costly.
MAPE expresses error relative to the true value, useful for comparing across different scales. It breaks down when true values are near zero, and scikit-learn returns it as a fraction (0.05 means 5 percent).
R² has a maximum of 1 and goes negative when the model does worse than always predicting the mean; it can be misleading on non-linear or heteroscedastic data.
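The trade-offs above show up clearly on a tiny example. The sketch below uses invented numbers: one set of predictions is off by 1 everywhere, another has a single large miss, and a third is worse than predicting the mean, which pushes R² below zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred_even = y_true + 1.0                                    # every error is 1
y_pred_outlier = y_true + np.array([1.0, 1.0, 1.0, 1.0, 11.0])  # one big miss
y_pred_reversed = y_true[::-1]                                # worse than the mean


def rmse(actual, predicted):
    return float(np.sqrt(mean_squared_error(actual, predicted)))


mae_even = mean_absolute_error(y_true, y_pred_even)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
rmse_even = rmse(y_true, y_pred_even)
rmse_outlier = rmse(y_true, y_pred_outlier)
r2_reversed = r2_score(y_true, y_pred_reversed)

print(f"even errors:  MAE={mae_even:.1f}  RMSE={rmse_even:.1f}")
print(f"one outlier:  MAE={mae_outlier:.1f}  RMSE={rmse_outlier:.1f}")
print(f"reversed predictions: R2={r2_reversed:.1f}")
```

The single outlier triples MAE (1 to 3) but multiplies RMSE by five (1 to 5), and the reversed predictions give an R² of -3.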

Quantile and Interval Metrics

For probabilistic regression (prediction intervals):

from sklearn.metrics import mean_pinball_loss

# Pinball loss for the 90th percentile
loss_90 = mean_pinball_loss(y_true, y_pred_q90, alpha=0.9)

Custom Scoring Functions

Scikit-learn lets you create custom scorers for use in cross-validation and grid search:

from sklearn.metrics import make_scorer

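def business_cost(y_true, y_pred):
    """Custom metric: FP costs $10, FN costs $500."""
    # Convert to arrays so the element-wise comparisons also work on plain lists
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return -(fp * 10 + fn * 500)  # Negated so that a higher score means lower cost

# The function already returns "higher is better", so no sign flip is needed
cost_scorer = make_scorer(business_cost, greater_is_better=True)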

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring=cost_scorer)

This aligns model selection directly with business objectives rather than generic statistical metrics.
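An equivalent formulation, sketched below, returns the cost as a positive number and passes `greater_is_better=False`, which tells scikit-learn to negate it internally. This is a variation on the snippet above, not a replacement for it; the synthetic data, the model, and the `fp_cost`/`fn_cost` parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def business_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Total dollar cost of the mistakes (lower is better)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return fp * fp_cost + fn * fn_cost


# greater_is_better=False makes the scorer return the negated cost,
# so cross-validation and grid search can still maximize it
cost_scorer = make_scorer(business_cost, greater_is_better=False)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring=cost_scorer
)
print(scores)  # one negated cost per fold; closer to 0 is better
```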

Calibration

A model’s predicted probabilities should match observed frequencies. If a model predicts 70 percent probability for a set of events, approximately 70 percent should actually occur. Calibration curves diagnose this:

from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_true, y_proba, n_bins=10)

Poorly calibrated models can be fixed with CalibratedClassifierCV using Platt scaling or isotonic regression.
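Here is a minimal sketch of that workflow, assuming a synthetic dataset and Gaussian naive Bayes as the base model (a classifier whose probabilities are often poorly calibrated on correlated features). It uses isotonic regression; passing `method="sigmoid"` instead would apply Platt scaling. The Brier score is included as a single-number summary where lower is better.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
# Calibration is learned with internal 5-fold cross-validation on the training data
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

results = {}
for name, clf in [("raw", raw), ("isotonic", calibrated)]:
    proba = clf.predict_proba(X_test)[:, 1]
    prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
    results[name] = (brier_score_loss(y_test, proba), prob_true, prob_pred)
    print(f"{name}: Brier score = {results[name][0]:.4f}")
```

Plotting `prob_pred` against `prob_true` for each model gives the calibration curve; a well-calibrated model stays close to the diagonal.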

Statistical Comparison of Models

Comparing two models on the same CV splits requires a paired test:

from scipy.stats import wilcoxon

scores_a = cross_val_score(model_a, X, y, cv=10, scoring="f1")
scores_b = cross_val_score(model_b, X, y, cv=10, scoring="f1")

stat, p_value = wilcoxon(scores_a, scores_b)
print(f"Wilcoxon p-value: {p_value:.4f}")

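A small p-value (< 0.05) suggests the performance difference is unlikely to be due to chance alone. Treat it as approximate: fold scores share training data, so they are not fully independent, and ten folds give the test limited power. Always evaluate both models on the same CV splits (the same splitter, with the same random_state if it shuffles) for a fair comparison.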
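One way to guarantee identical splits is to build a single splitter object with a fixed `random_state` and pass it to both calls. The sketch below does this on synthetic data with two arbitrary models chosen for illustration; the guard around `wilcoxon` is there because the test is undefined when every paired difference is zero.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# A fixed random_state means both models see exactly the same 10 folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="f1")

diff = scores_a - scores_b
if np.any(diff != 0):
    stat, p_value = wilcoxon(scores_a, scores_b)
else:
    p_value = 1.0  # identical scores on every fold: nothing to distinguish

print(f"mean difference: {diff.mean():.4f}, Wilcoxon p-value: {p_value:.4f}")
```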

Metric Dashboards in Practice

Production ML systems track metrics over time to detect degradation:

Log predictions and ground truth to a data store.
Compute daily/weekly metrics on a sliding window.
Alert when a metric drops below a threshold or drifts beyond a confidence interval.
Segment metrics by subpopulation (age group, region) to catch localized failures.
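The sliding-window and alerting steps above can be sketched in a few lines of NumPy. Everything here is simulated: the log of correct/incorrect predictions, the window of 20, and the 0.78 alert threshold are hypothetical values, and a real system would read the log from its data store.

```python
import numpy as np

window = 20
alert_below = 0.78  # hypothetical alert threshold on windowed accuracy

# Simulated prediction log: 1 = prediction matched ground truth, 0 = it did not.
# The model is perfect for 60 predictions, then gets every other one wrong.
correct = np.ones(100)
correct[61::2] = 0

# Accuracy over each run of `window` consecutive predictions
rolling_acc = np.convolve(correct, np.ones(window), mode="valid") / window

alerts = rolling_acc < alert_below
first_alert = int(np.argmax(alerts)) if alerts.any() else None

print(f"first window accuracy: {rolling_acc[0]:.2f}")
print(f"last window accuracy:  {rolling_acc[-1]:.2f}")
if first_alert is not None:
    print(f"alert fires for the window ending at prediction {first_alert + window - 1}")
```

Running the same computation separately per segment (for example, per region) is one way to cover the last item in the list.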

Common Pitfalls

Optimizing for accuracy on imbalanced data: Use F1, MCC, or AUC instead.
Reporting training metrics: Always report test/validation scores. Training scores overestimate generalization performance and are useful mainly for diagnosing overfitting.
Ignoring metric variance: A single score without confidence intervals is unreliable. Use CV or bootstrapping.
Mixing up macro and micro averaging: They answer different questions. Report both or justify your choice.
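For the metric-variance pitfall, a percentile bootstrap is one simple way to attach an interval to a single test-set score. The sketch below uses simulated labels and predictions (about 85 percent agreement) and 1,000 resamples; both numbers are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Simulated test set: predictions disagree with the truth about 15% of the time
y_true = rng.integers(0, 2, size=500)
flip = rng.random(500) < 0.15
y_pred = np.where(flip, 1 - y_true, y_true)

point_estimate = f1_score(y_true, y_pred)

# Resample (label, prediction) pairs with replacement and recompute the metric
boot_scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    boot_scores.append(f1_score(y_true[idx], y_pred[idx]))

lower, upper = np.percentile(boot_scores, [2.5, 97.5])
print(f"F1 = {point_estimate:.3f} (95% bootstrap CI: {lower:.3f} to {upper:.3f})")
```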

One thing to remember: The metric you optimize is the behavior you get — choose a metric that captures what actually matters to the people who will use your model’s predictions.
