Model Evaluation Metrics in Python — Deep Dive
Classification Metrics in Detail
The Confusion Matrix Foundation
Every binary classification metric derives from four counts:
- True Positives (TP): Correctly predicted positive.
- True Negatives (TN): Correctly predicted negative.
- False Positives (FP): Predicted positive, actually negative (Type I error).
- False Negatives (FN): Predicted negative, actually positive (Type II error).
from sklearn.metrics import confusion_matrix, classification_report
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4, 1],
#  [1, 4]]
# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]]
print(classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))
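The four counts can be read straight out of the matrix and turned into the familiar metrics by hand. The following is a minimal sketch that reuses the example labels above; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, sklearn lays the matrix out as [[TN, FP], [FN, TP]],
# so ravel() yields the counts in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```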
Beyond Accuracy: The Full Metric Zoo
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    balanced_accuracy_score, matthews_corrcoef, cohen_kappa_score,
)
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1: {f1_score(y_true, y_pred):.4f}")
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.4f}")
print(f"Cohen's Kappa: {cohen_kappa_score(y_true, y_pred):.4f}")
Matthews Correlation Coefficient (MCC) ranges from -1 to +1 and accounts for all four quadrants of the confusion matrix. It is considered one of the most balanced single metrics for binary classification, especially on imbalanced datasets.
Cohen’s Kappa measures agreement between predicted and actual labels, adjusted for chance agreement. A kappa of 0 means agreement no better than chance, 1 means perfect agreement, and negative values mean agreement worse than chance.
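Both definitions can be checked against the four confusion-matrix counts. This sketch computes MCC and kappa by hand for the small example labels used earlier and compares them with the scikit-learn results; the helper variable names are illustrative.

```python
import math
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())
n = tp + tn + fp + fn

# MCC: correlation between predicted and actual labels, using all four counts.
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Kappa: observed agreement corrected by the agreement expected by chance.
p_observed = (tp + tn) / n
p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
kappa_manual = (p_observed - p_chance) / (1 - p_chance)

print(mcc_manual, matthews_corrcoef(y_true, y_pred))
print(kappa_manual, cohen_kappa_score(y_true, y_pred))
```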
Threshold Tuning
Most classifiers output probabilities. The default threshold of 0.5 is arbitrary. Moving it changes the precision-recall balance:
from sklearn.metrics import precision_recall_curve
import numpy as np
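# Assumes `model` is a fitted classifier and that X_test and y_true are the
# held-out features and labels for the same rows.
y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_true, y_proba)
# Recall falls as the threshold rises, so the last index with recall >= 0.90
# is the highest threshold that still reaches 90% recall.
idx = np.where(recalls >= 0.90)[0][-1]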
optimal_threshold = thresholds[idx]
print(f"Threshold for 90% recall: {optimal_threshold:.3f}")
print(f"Precision at that threshold: {precisions[idx]:.3f}")
# Apply custom threshold
y_pred_custom = (y_proba >= optimal_threshold).astype(int)
In medical screening, you might set the threshold low to catch almost all positive cases (high recall), accepting more false positives. In ad targeting, you might set it high to avoid wasting budget on unlikely conversions (high precision).
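For a version that runs end to end, here is a sketch on synthetic data. The dataset, the logistic regression model, and the 90 percent recall target are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 20% positives).
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# precisions and recalls have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the arrays line up.
ok = np.where(recalls[:-1] >= 0.90)[0]
idx = ok[-1]  # recall falls as the threshold rises, so this is the highest valid one
threshold = thresholds[idx]

y_pred_custom = (y_proba >= threshold).astype(int)
achieved = recall_score(y_test, y_pred_custom)
print(f"threshold={threshold:.3f} recall={achieved:.3f} precision={precisions[idx]:.3f}")
```

In practice the threshold would be chosen on a validation split rather than on the final test set, so that the reported test metrics stay unbiased.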
F-beta Score
When precision and recall should not be weighted equally, the F-beta score lets you set the balance:
from sklearn.metrics import fbeta_score
# F2 emphasizes recall (beta=2 means recall is 2x as important as precision)
f2 = fbeta_score(y_true, y_pred, beta=2)
# F0.5 emphasizes precision
f05 = fbeta_score(y_true, y_pred, beta=0.5)
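The effect of beta is easiest to see when precision and recall differ. This sketch uses made-up labels for a cautious classifier and checks a hand-written formula against fbeta_score; fbeta_manual is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Illustrative labels: perfect precision, but half the positives are missed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2 / 2 = 1.0
r = recall_score(y_true, y_pred)     # 2 / 4 = 0.5

def fbeta_manual(p, r, beta):
    """F-beta from precision and recall: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

for beta in (0.5, 1, 2):
    manual = fbeta_manual(p, r, beta)
    library = fbeta_score(y_true, y_pred, beta=beta)
    print(beta, round(manual, 4), round(library, 4))
# 0.5 -> 0.8333, 1 -> 0.6667, 2 -> 0.5556
```

Because recall is the weak side here, the score falls as beta grows and recall gets more weight.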
Multi-Class Metrics
Averaging Strategies
For multi-class problems, precision, recall, and F1 are computed per-class and then averaged:
- Macro: Unweighted mean across classes. Treats rare classes equally.
- Micro: Computes globally by counting total TP, FP, FN. Equivalent to accuracy for single-label problems.
- Weighted: Weighted by class frequency. Accounts for imbalance but can hide poor performance on rare classes.
from sklearn.metrics import f1_score
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")
For a dataset with 90 percent class A, 5 percent class B, and 5 percent class C, micro F1 is dominated by class A performance. Macro F1 gives equal voice to each class.
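That 90/5/5 split can be reproduced in a few lines. This sketch assumes a degenerate model that always predicts the majority class, which makes the gap between the averages as large as possible.

```python
import numpy as np
from sklearn.metrics import f1_score

# 90 samples of class 0 ("A"), 5 each of class 1 ("B") and class 2 ("C").
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
# A degenerate model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# zero_division=0 scores never-predicted classes as 0 without emitting warnings.
f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"micro={f1_micro:.4f} macro={f1_macro:.4f} weighted={f1_weighted:.4f}")
# micro=0.9000 macro=0.3158 weighted=0.8526
```

Micro and weighted F1 look healthy even though two of the three classes are never predicted; only macro F1 exposes the problem.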
Multi-Label Metrics
When each sample can belong to multiple classes (e.g., document tagging):
from sklearn.metrics import multilabel_confusion_matrix
mcm = multilabel_confusion_matrix(y_true_multi, y_pred_multi)
# Returns a confusion matrix per label
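As a concrete illustration, the sketch below builds small indicator matrices for three hypothetical tags and prints the per-label counts. The tag names and data are invented for the example.

```python
import numpy as np
from sklearn.metrics import f1_score, multilabel_confusion_matrix

# Rows are documents, columns are hypothetical tags: [python, ml, web].
y_true_multi = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y_pred_multi = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]])

mcm = multilabel_confusion_matrix(y_true_multi, y_pred_multi)
# mcm has shape (n_labels, 2, 2); each slice is [[TN, FP], [FN, TP]] for one tag.
for tag, m in zip(["python", "ml", "web"], mcm):
    tn, fp, fn, tp = m.ravel()
    print(tag, tn, fp, fn, tp)

# average=None returns one F1 value per label instead of a single number.
per_label_f1 = f1_score(y_true_multi, y_pred_multi, average=None)
print(per_label_f1)
```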
Regression Metrics in Detail
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error,
)
import numpy as np
# Here y_true and y_pred are continuous targets and predictions, not class labels.
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
When to Use Which
- MAE is less sensitive to outliers than RMSE and is interpretable in the original units.
- RMSE penalizes large errors quadratically — use when big mistakes are especially costly.
- MAPE expresses error as a percentage, useful for comparing across different scales. Breaks down when true values are near zero.
- R² equals 1 for a perfect fit and 0 for a model that always predicts the mean; it can be negative for worse models, and it can be misleading on non-linear or heteroscedastic data.
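Two of these points are easy to verify numerically. In this sketch with made-up numbers, two prediction sets share the same MAE but differ in RMSE, and a reversed prediction produces a negative R².

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 10.0, 10.0, 10.0])
spread_out = y_true + np.array([1.0, 1.0, 1.0, 1.0])    # four small errors
one_big_miss = y_true + np.array([0.0, 0.0, 0.0, 4.0])  # one large error

results = {}
for name, y_pred in [("spread_out", spread_out), ("one_big_miss", one_big_miss)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    results[name] = (mae, rmse)
    print(name, mae, rmse)
# spread_out: MAE 1.0, RMSE 1.0; one_big_miss: MAE 1.0, RMSE 2.0

# R² below zero: predictions that are worse than always guessing the mean.
r2_bad = r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(r2_bad)  # -3.0
```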
Quantile and Interval Metrics
For probabilistic regression (prediction intervals):
from sklearn.metrics import mean_pinball_loss
# Pinball loss for the 90th percentile
loss_90 = mean_pinball_loss(y_true, y_pred_q90, alpha=0.9)
Custom Scoring Functions
Scikit-learn lets you create custom scorers for use in cross-validation and grid search:
from sklearn.metrics import make_scorer
def business_cost(y_true, y_pred):
    """Custom metric: FP costs $10, FN costs $500."""
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return -(fp * 10 + fn * 500)  # Negative because sklearn maximizes
cost_scorer = make_scorer(business_cost, greater_is_better=True)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring=cost_scorer)
This aligns model selection directly with business objectives rather than generic statistical metrics.
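A self-contained version of the same idea might look like the sketch below. The synthetic dataset and the two logistic regression variants are assumptions for illustration, and the cost parameters are exposed as hypothetical keyword arguments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def business_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Negated total cost, so that higher is better."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return -(fp * fp_cost + fn * fn_cost)

# The function already returns "higher is better", so greater_is_better stays True.
cost_scorer = make_scorer(business_cost)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
plain = LogisticRegression(max_iter=1000)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced")

for name, clf in [("plain", plain), ("balanced", balanced)]:
    scores = cross_val_score(clf, X, y, cv=5, scoring=cost_scorer)
    print(name, scores.mean())
```

Each fold score is the negated total cost on that fold, so values closer to zero are better.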
Calibration
A model’s predicted probabilities should match observed frequencies. If a model predicts 70 percent probability for a set of events, approximately 70 percent should actually occur. Calibration curves diagnose this:
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_true, y_proba, n_bins=10)
Poorly calibrated models can be fixed with CalibratedClassifierCV using Platt scaling or isotonic regression.
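A sketch of that workflow on synthetic data follows. The Gaussian naive Bayes base model and the Brier score comparison are illustrative choices, not something the text prescribes.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
# method="sigmoid" is Platt scaling; method="isotonic" is the non-parametric option.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

curves = {}
for name, clf in [("raw", raw), ("calibrated", calibrated)]:
    proba = clf.predict_proba(X_test)[:, 1]
    # Empty bins are dropped, so each array has at most n_bins entries.
    prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
    curves[name] = (prob_true, prob_pred)
    # Brier score: mean squared error of the probabilities (lower is better).
    print(name, brier_score_loss(y_test, proba))
```

Plotting prob_pred against prob_true for each model gives the calibration curve; a well-calibrated model stays close to the diagonal.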
Statistical Comparison of Models
Comparing two models on the same CV splits requires a paired test:
from scipy.stats import wilcoxon
scores_a = cross_val_score(model_a, X, y, cv=10, scoring="f1")
scores_b = cross_val_score(model_b, X, y, cv=10, scoring="f1")
stat, p_value = wilcoxon(scores_a, scores_b)
print(f"Wilcoxon p-value: {p_value:.4f}")
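A small p-value (conventionally < 0.05) suggests the performance difference is unlikely to be due to chance alone. Fold scores are not fully independent, so treat the result as indicative rather than conclusive. Always use the same CV splits (same splitter and random_state) for a fair comparison.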
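One way to guarantee identical folds is to build a single splitter and pass it to both evaluations. The sketch below assumes a synthetic dataset and two arbitrary models chosen for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# One splitter with a fixed seed gives both models exactly the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="f1")

diff = scores_a - scores_b
if np.allclose(diff, 0):
    # The signed-rank test is undefined when every paired difference is zero.
    p_value = 1.0
else:
    stat, p_value = wilcoxon(scores_a, scores_b)
print(f"mean difference: {diff.mean():.4f}, p-value: {p_value:.4f}")
```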
Metric Dashboards in Practice
Production ML systems track metrics over time to detect degradation:
- Log predictions and ground truth to a data store.
- Compute daily/weekly metrics on a sliding window.
- Alert when a metric drops below a threshold or drifts beyond a confidence interval.
- Segment metrics by subpopulation (age group, region) to catch localized failures.
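The windowing, alerting, and segmentation steps above can be sketched with NumPy alone. Everything in this example is synthetic: the simulated accuracy drop on day 20, the 7-day window, the 0.85 alert floor, and the region names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, per_day = 30, 200

# Synthetic prediction log: a hypothetical model whose accuracy drops
# from 0.90 to 0.75 starting on day 20.
day = np.repeat(np.arange(n_days), per_day)
region = rng.choice(["north", "south"], size=day.size)
p_correct = np.where(day < 20, 0.90, 0.75)
correct = rng.random(day.size) < p_correct

# Daily accuracy, then a 7-day sliding-window mean.
daily_acc = np.array([correct[day == d].mean() for d in range(n_days)])
window = 7
rolling = np.convolve(daily_acc, np.ones(window) / window, mode="valid")

# Flag every window whose mean falls below a fixed floor.
floor = 0.85
alerts = np.flatnonzero(rolling < floor) + window - 1  # last day of each flagged window

# Segment the same metric by subpopulation.
by_region = {str(r): correct[region == r].mean() for r in np.unique(region)}

print("first alert on day:", alerts.min() if alerts.size else None)
print(by_region)
```

A production system would read the log from a data store and would usually derive the floor from historical variance instead of hard-coding it.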
Common Pitfalls
- Optimizing for accuracy on imbalanced data: Use F1, MCC, or AUC instead.
- Reporting training metrics: Always report test/validation scores. Training scores say little about generalization; their main use is to expose overfitting through the gap to validation scores.
- Ignoring metric variance: A single score without confidence intervals is unreliable. Use CV or bootstrapping.
- Mixing up macro and micro averaging: They answer different questions. Report both or justify your choice.
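For the variance pitfall, a percentile bootstrap is one simple way to put an interval around a test-set score. This sketch uses synthetic labels; the 1000 resamples and the 95 percent level are conventional choices, not requirements.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic test set: 300 labels, with predictions that agree about 85% of the time.
y_true = rng.integers(0, 2, size=300)
flip = rng.random(300) < 0.15
y_pred = np.where(flip, 1 - y_true, y_true)

point = f1_score(y_true, y_pred)

# Percentile bootstrap: resample test rows with replacement and rescore each time.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    boot.append(f1_score(y_true[idx], y_pred[idx]))
low, high = np.percentile(boot, [2.5, 97.5])

print(f"F1 = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```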
One thing to remember: The metric you optimize is the behavior you get — choose a metric that captures what actually matters to the people who will use your model’s predictions.
See Also
- Python Confusion Matrix: See how a simple grid of right and wrong answers reveals what your computer is actually getting confused about.
- Python Cross Validation: Find out why testing a computer's homework on different practice sets keeps it from cheating.
- Python Roc Auc Curves: Understand how one picture and one number tell you whether a computer's predictions are trustworthy or just lucky guesses.
- Python Sklearn Learning Curves: Why your machine learning model might need more data, or a simpler brain, explained with zero jargon.
- Activation Functions: Why neural networks need these tiny mathematical functions, and how ReLU's simplicity accidentally made deep learning possible.