Model Evaluation — Core Concepts
Classification Metrics: The Full Picture
For a binary classification model with positive class (P) and negative class (N):
Confusion matrix:
| | Predicted P | Predicted N |
|---|---|---|
| Actual P | TP | FN |
| Actual N | FP | TN |
Derived metrics:
- Accuracy: $(TP + TN) / (TP + TN + FP + FN)$
- Precision: $TP / (TP + FP)$ — “Of all predicted positives, how many are correct?”
- Recall / Sensitivity / TPR: $TP / (TP + FN)$ — “Of all actual positives, how many did we find?”
- Specificity / TNR: $TN / (TN + FP)$ — “Of all actual negatives, how many did we correctly identify?”
- F1: $2 \times \text{Precision} \times \text{Recall} / (\text{Precision} + \text{Recall})$
- F-beta: $(1+\beta^2) \times \text{Precision} \times \text{Recall} / (\beta^2 \text{Precision} + \text{Recall})$ — β > 1 weights recall higher; β < 1 weights precision higher
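The formulas above can be sketched as a small helper. This is a minimal illustration, not a library API: the function name `classification_metrics` and the choice to report 0.0 when a denominator is zero are assumptions made here.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the derived metrics from confusion-matrix counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        # beta = 1 gives F1; beta > 1 weights recall higher
        "f_beta": ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }


# Illustrative counts only
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

With these illustrative counts, accuracy is 0.70, precision 0.80, recall about 0.667, specificity 0.75, and F1 about 0.727.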
The imbalanced class problem: With a 99% negative class, predicting negative for every sample achieves 99% accuracy but 0% recall, so accuracy alone is misleading. Use F1, AUPRC, or the Matthews correlation coefficient (MCC) instead.
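A quick numeric check of the all-negatives case, as a sketch. The sample counts are made up for illustration, and returning 0.0 for MCC when its denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when the denominator is zero
    return (tp * tn - fp * fn) / den if den else 0.0


# Hypothetical data: 1,000 samples, 1% positive, model predicts negative for all
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
score = mcc(tp, fp, fn, tn)

print(accuracy, recall, score)
```

Accuracy comes out at 0.99 while recall and MCC are both 0.0, which is the gap the paragraph above describes.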
ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots TPR vs. FPR across all classification thresholds:
$$\text{TPR}(\tau) = P(\hat{f}(x) > \tau \mid y = 1)$$
$$\text{FPR}(\tau) = P(\hat{f}(x) > \tau \mid y = 0)$$
As threshold $\tau$ decreases: TPR increases (catch more positives) and FPR increases (more false positives). The ROC curve traces this tradeoff.
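The threshold sweep can be sketched in a few lines. This is an illustrative, quadratic-time version. The function name `roc_points` and the toy scores are assumptions, and it expects at least one positive and one negative label.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs while lowering the threshold tau.

    A sample is predicted positive when score > tau, matching the
    definitions of TPR(tau) and FPR(tau).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start at the highest score (nothing predicted positive) and end at
    # -inf (everything predicted positive).
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for tau in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data for illustration
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
points = roc_points(scores, labels)
print(points)
```

The points run from (0, 0) to (1, 1), and both coordinates only ever increase as the threshold drops.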
AUC (Area Under the ROC Curve): A scalar summary. AUC = 0.5: random classifier; AUC = 1.0: perfect classifier; AUC = 0.9: very good (correctly ranks 90% of positive/negative pairs).
Probabilistic interpretation: AUC = P(score for positive > score for negative) for a randomly chosen (positive, negative) pair.
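The pairwise interpretation translates directly into code. The sketch below counts, over all (positive, negative) pairs, how often the positive gets the higher score. Counting ties as half a win is a common convention assumed here, and the function name and toy data are illustrative.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(pos) * len(neg))


# Toy data for illustration
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
auc = pairwise_auc(scores, labels)
print(auc)
```

Here 4 of the 6 positive/negative pairs are ordered correctly, so the AUC is 2/3. This loop is O(n²); a sort-based method is the usual choice for large datasets.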
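When AUC misleads: On highly imbalanced datasets, AUC can look misleadingly good because FPR = FP / (FP + TN) is divided by the large number of negatives. A model can produce many false positives relative to the few true positives while its FPR stays small. AUPRC (Area Under the Precision-Recall Curve) is more informative for imbalanced problems because precision compares false positives directly with true positives.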
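A small arithmetic sketch shows the effect. All numbers are hypothetical: a 1% positive rate and a single operating point with 80% TPR and 1% FPR.

```python
# Hypothetical dataset: 100,000 samples, 1% positive
n_pos, n_neg = 1_000, 99_000

# Hypothetical operating point that looks strong on a ROC curve
tpr, fpr = 0.80, 0.01

tp = tpr * n_pos  # true positives found
fp = fpr * n_neg  # false positives raised
precision = tp / (tp + fp)

print(f"TP={tp:.0f} FP={fp:.0f} precision={precision:.3f}")
```

Even at a 1% false positive rate, false positives (about 990) outnumber true positives (about 800), so precision is only about 0.45. The ROC view hides this; the precision-recall view does not.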
Cross-Validation Strategies
Single train/test splits have high variance — different random splits give different estimates. Cross-validation averages across multiple splits.
k-fold CV: Split into $k$ equal folds. Train on $k-1$ folds, test on the held-out fold. Repeat $k$ times. Report mean ± std. $k=5$ or $k=10$ is standard.
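A minimal sketch of k-fold index generation using only the standard library. The function name `k_fold_indices` and the seeded shuffle are choices made here; in practice a library routine would usually be used.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Interleaved slices give folds whose sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Each index lands in exactly one test fold, so every sample is used for testing once and for training k-1 times.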
Stratified k-fold: Ensures each fold has the same class proportions as the full dataset. Essential for imbalanced datasets.
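One simple way to stratify, sketched below, is to shuffle each class separately and deal its members round-robin into the folds. The function name and toy labels are illustrative, and fold sizes can be slightly uneven when class counts are not divisible by k.

```python
import random
from collections import defaultdict


def stratified_k_fold_indices(labels, k, seed=0):
    """Yield (train, test) index lists with class proportions kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # deal each class round-robin

    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy data: 10% positive
labels = [1] * 10 + [0] * 90
splits = list(stratified_k_fold_indices(labels, k=5))
for _, test in splits:
    print(len(test), sum(labels[i] for i in test))
```

With this toy data every test fold holds 20 samples, 2 of them positive, matching the 10% positive rate of the full set.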
Leave-one-out CV (LOOCV): $k = n$, so each sample serves as the test set exactly once. Maximum data utilization, but the estimate has high variance and requires $n$ model fits, which is computationally expensive.
Time series split: For temporal data, folds must respect time order — always train on past, test on future. Standard k-fold would leak future information.
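An expanding-window split is one common way to respect time order. The sketch below assumes the samples are already sorted by time; the function name and the equal-sized test blocks are choices made here.

```python
def time_series_splits(n, n_splits):
    """Yield (train, test) index lists where train always precedes test.

    Assumes indices 0..n-1 are in chronological order.
    """
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = fold * i
        # The last test block absorbs any leftover samples
        test_end = fold * (i + 1) if i < n_splits else n
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(time_series_splits(n=10, n_splits=4))
for train, test in splits:
    print(train, test)
```

The training window grows with each split, and every test index is later than every training index, so no future information reaches the model.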
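Nested CV: Two loops of cross-validation. The outer loop estimates generalization error; the inner loop selects hyperparameters using only the outer training data. This keeps the reported score from being optimistically biased by hyperparameters tuned on the same data used for testing.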
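The two loops can be sketched with a deliberately tiny model. Everything here is hypothetical scaffolding: the "model" is a mean shrunk toward zero with one hyperparameter `lam`, the metric is mean squared error, and the data are made up. Only the loop structure is the point.

```python
def folds(indices, k):
    """Split a list of indices into k interleaved folds: (train, test)."""
    parts = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, parts[i]


def fit(y, lam):
    """Toy model: a constant prediction, the mean shrunk toward zero."""
    return sum(y) / (len(y) + lam)


def mse(pred, y):
    return sum((v - pred) ** 2 for v in y) / len(y)


def nested_cv(y, lams, outer_k=5, inner_k=3):
    outer_scores = []
    for train, test in folds(list(range(len(y))), outer_k):
        # Inner loop: pick lam using the outer-training data only
        def inner_score(lam):
            errs = [mse(fit([y[i] for i in tr], lam), [y[i] for i in va])
                    for tr, va in folds(train, inner_k)]
            return sum(errs) / len(errs)

        best_lam = min(lams, key=inner_score)
        # Refit on all outer-training data, then score once on the
        # untouched outer test fold
        model = fit([y[i] for i in train], best_lam)
        outer_scores.append(mse(model, [y[i] for i in test]))
    return outer_scores


# Made-up data for illustration
y = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2, 2.3, 1.7, 2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.5]
scores = nested_cv(y, lams=[0.0, 1.0, 10.0])
print(sum(scores) / len(scores))
```

The outer test fold is never seen while `best_lam` is chosen, which is what separates the outer score from a score reported on the same data used for tuning.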
LLM Evaluation: The Benchmarking Crisis
Evaluating LLMs is fundamentally different from evaluating classifiers. Language generation isn’t right/wrong — it’s a spectrum.
Multiple-choice benchmarks (most common):
- MMLU: 57 subjects, 4-choice questions. Tests knowledge breadth.
- HellaSwag: Sentence completion (reasoning)
- ARC: Grade-school science questions
- BBH (BIG-Bench Hard): Harder tasks with clear correct answers
Problems:
- Benchmark contamination: Models trained on data scraped from the internet may have seen benchmark questions. Many widely used benchmarks were published before GPT-4’s training data cutoff, and their questions and answers appear online.
- Benchmark saturation: GPT-4, Claude, and Gemini all score 85-90%+ on MMLU. The benchmark is no longer discriminating between frontier models.
- Goodhart’s Law: Models optimized for benchmarks may not improve on the underlying capability. Fine-tuning on benchmark-style questions can raise scores without improving the reasoning ability the benchmark is meant to measure.
Open-ended evaluation (harder but more honest):
- LMSYS Chatbot Arena: Anonymous pairwise comparisons by real users
- MT-Bench: Multi-turn dialogue quality (GPT-4 as judge)
- HumanEval: Code execution correctness
- GPQA: Graduate-level science questions written by domain experts (multiple-choice in format, but designed to be hard to answer by lookup)
LLM-as-judge (emerging standard): Use a strong LLM (GPT-4, Claude Opus) to evaluate open-ended outputs. More scalable than human evaluation, catches quality issues that reference answers miss. Bias: the judge tends to prefer responses similar to its own style.
Evaluation for AI Safety
Standard capability benchmarks miss safety-critical properties:
Jailbreak robustness: What fraction of adversarial prompts bypass content policies? (JailbreakBench)
Factual accuracy: What fraction of generated claims are true and supported by a reliable source? (TruthfulQA, FActScore)
Calibration: Does confidence correlate with accuracy? Overconfident wrong answers are worse than uncertain wrong answers.
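One common way to quantify calibration is Expected Calibration Error (ECE): bin predictions by confidence and take the weighted average gap between mean confidence and accuracy in each bin. The sketch below assumes confidences in [0, 1] and equal-width bins; the function name and toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy data: the model claims 95% confidence but is right only half the time
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

Most of the error here comes from the overconfident 0.95 bin, where accuracy is only 0.5. A perfectly calibrated model would score 0.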
Behavioral consistency: Does the model give the same answer when a question is paraphrased? Inconsistency suggests pattern-matching rather than reasoning.
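A simple consistency score, as a sketch: collect the model's answers to several paraphrases of one question and measure how many agree with the most common answer. Comparing answers by exact match after trimming and lowercasing is an assumption that suits short answers only; the function name and the sample answers are hypothetical.

```python
from collections import Counter


def consistency_rate(answers):
    """Fraction of answers that match the most common (normalized) answer."""
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


# Hypothetical answers to four paraphrases of the same question
answers = ["Paris", "paris", "Paris ", "Lyon"]
rate = consistency_rate(answers)
print(rate)
```

Three of the four answers agree after normalization, giving a rate of 0.75. Longer free-form answers would need a semantic comparison rather than string matching.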
One thing to remember: Model evaluation is increasingly the bottleneck in AI development — it’s harder to measure real-world capability and safety than to improve them, which is why benchmark saturation and evaluation methodology debates are central concerns in the field.