Model Evaluation — Core Concepts

Classification metrics deep dive, ROC curves and AUC, cross-validation strategies, LLM evaluation challenges, and why benchmark saturation is a crisis for AI development.

Classification Metrics: The Full Picture

For a binary classification model with positive class (P) and negative class (N):

Confusion matrix:

	Predicted P	Predicted N
Actual P	TP	FN
Actual N	FP	TN

Derived metrics:

Accuracy: $(TP + TN) / (TP + TN + FP + FN)$
Precision: $TP / (TP + FP)$ — “Of all predicted positives, how many are correct?”
Recall / Sensitivity / TPR: $TP / (TP + FN)$ — “Of all actual positives, how many did we find?”
Specificity / TNR: $TN / (TN + FP)$ — “Of all actual negatives, how many did we correctly identify?”
F1: $2 \times \text{Precision} \times \text{Recall} / (\text{Precision} + \text{Recall})$
F-beta: $(1+\beta^2) \times \text{Precision} \times \text{Recall} / (\beta^2 \text{Precision} + \text{Recall})$ — β > 1 weights recall higher; β < 1 weights precision higher
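
The formulas above can be checked with a few lines of Python. This is a minimal sketch: the function name `classification_metrics`, the zero-denominator convention, and the example counts are all assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the derived metrics from raw confusion-matrix counts."""

    def safe_div(num, den):
        # Assumed convention: a metric with a zero denominator is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        # beta = 1 gives F1; beta > 1 weights recall more heavily.
        "f_beta": safe_div((1 + b2) * precision * recall, b2 * precision + recall),
    }


# Illustrative counts, not taken from any real model.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
f2 = classification_metrics(tp=40, fp=10, fn=20, tn=30, beta=2.0)["f_beta"]
print(m)
print("F2:", f2)
```

In this example recall (about 0.67) is lower than precision (0.80), so F2 comes out lower than F1, which is what weighting recall more heavily should do.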

The imbalanced class problem: With a 99% negative class, predicting all negatives achieves 99% accuracy but 0% recall, so accuracy alone is misleading. Use F1, AUPRC, or the Matthews correlation coefficient (MCC) instead.
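
The all-negative predictor is easy to reproduce. The sketch below uses a synthetic 1,000-sample dataset with 1% positives; the helper names and the convention of reporting MCC as 0 when its denominator is zero are assumptions, not something the text prescribes.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / den if den else 0.0


y_true = [1] * 10 + [0] * 990   # synthetic data: 1% positive class
y_pred = [0] * 1000             # a "model" that always predicts negative

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
mcc_score = mcc(tp, fp, fn, tn)
print(accuracy, recall, mcc_score)
```

Accuracy looks excellent here, while recall and MCC both report that the predictor has learned nothing about the positive class.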

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots TPR vs. FPR across all classification thresholds:

$$\text{TPR}(\tau) = P(\hat{f}(x) > \tau \mid y = 1)$$
$$\text{FPR}(\tau) = P(\hat{f}(x) > \tau \mid y = 0)$$

As threshold $\tau$ decreases: TPR increases (catch more positives) and FPR increases (more false positives). The ROC curve traces this tradeoff.
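
The threshold sweep can be written out directly. The following is a small sketch on made-up scores: `roc_points` and `trapezoid_auc` are hypothetical helper names, and the strict `>` comparison mirrors the definitions of TPR and FPR above.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the threshold sweeps from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start at the highest score (nothing is strictly above it) and end at
    # -inf (everything is predicted positive).
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for tau in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores and labels for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = trapezoid_auc(points)
print(points)
print(area)
```

The curve starts at (0, 0), ends at (1, 1), and both coordinates only ever increase as the threshold drops.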

AUC (Area Under the ROC Curve): A scalar summary. AUC = 0.5: random classifier; AUC = 1.0: perfect classifier; AUC = 0.9: very good (correctly ranks 90% of positive/negative pairs).

Probabilistic interpretation: AUC = P(score for positive > score for negative) for a randomly chosen (positive, negative) pair.
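
This interpretation gives a direct, if quadratic-time, way to compute AUC: count how often a positive outscores a negative. The sketch below assumes ties count as half a win, a common convention that the text does not specify; `pairwise_auc` is a hypothetical name and the data are made up.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed tie-handling convention
    return wins / (len(pos) * len(neg))


# Made-up scores and labels for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(auc)  # 8 of the 9 pairs are ranked correctly
```

The only misranked pair is the positive scored 0.4 against the negative scored 0.7.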

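When AUC misleads: On highly imbalanced datasets, AUC can be misleadingly high. FPR = FP / (FP + TN) has the large negative class in its denominator, so even a substantial number of false positives barely moves the curve. Precision, TP / (TP + FP), does register those false positives, which is why AUPRC (Area Under the Precision-Recall Curve) is more meaningful for imbalanced problems.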
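
A synthetic example makes the gap concrete. In the sketch below, 20 negatives outrank all 10 positives, and the remaining 980 negatives score low. AUPRC is approximated by step-wise average precision, which is one common estimator and an assumption here; the function names and data are illustrative only.

```python
def roc_auc(scores, labels):
    """AUC via pairwise ranking; ties count as half (assumed convention)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(labels)
    ap, prev_recall = 0.0, 0.0
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Synthetic data: 10 positives among 1,000 negatives.
labels = [0] * 20 + [1] * 10 + [0] * 980
scores = [0.9] * 20 + [0.8] * 10 + [0.1] * 980

auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(auc, ap)
```

AUC comes out at 0.98 because each positive still beats 980 of the 1,000 negatives, while average precision is only about 0.33: by the time all positives are retrieved, two of every three flagged items are false alarms.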

Cross-Validation Strategies

Single train/test splits have high variance — different random splits give different estimates. Cross-validation averages across multiple splits.

k-fold CV: Split into $k$ equal folds. Train on $k-1$ folds, test on the held-out fold. Repeat $k$ times. Report mean ± std. $k=5$ or $k=10$ is standard.
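
The splitting step can be sketched in a few lines. This version produces contiguous folds over sample indices; in practice the data are usually shuffled first, which is left out here. `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


splits = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each one contributes to the estimate once.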

Stratified k-fold: Ensures each fold has the same class proportions as the full dataset. Essential for imbalanced datasets.
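
One simple way to get matching class proportions is to deal each class's samples across the folds in round-robin order. This is a sketch of that idea, not the only way to stratify; `stratified_folds` is a hypothetical name and the labels are synthetic.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's samples across the folds like cards.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


labels = [1] * 5 + [0] * 45   # synthetic data: 10% positive class
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(len(fold), "samples,", sum(labels[i] for i in fold), "positive")
```

With plain k-fold on the same data, a fold could easily end up with no positives at all, which would make recall undefined for that fold.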

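Leave-one-out CV (LOOCV): k=n (each sample is the test set exactly once). Each model trains on all but one sample, so data utilization is maximal, but the error estimate can have high variance and the procedure requires n model fits, which is computationally expensive.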

Time series split: For temporal data, folds must respect time order — always train on past, test on future. Standard k-fold would leak future information.
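
A minimal expanding-window splitter illustrates the constraint. The sketch assumes samples are already sorted by time and that every test block has the same size; `time_series_splits` is a hypothetical name.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) where training data always precede test data."""
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        # The training window grows with each split; the test block follows it.
        train_end = n_samples - (n_splits - i + 1) * test_size
        yield list(range(train_end)), list(range(train_end, train_end + test_size))


splits = list(time_series_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Unlike standard k-fold, early samples never appear in a test set and late samples never appear in the training set of an earlier split.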

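Nested CV: Two loops of cross-validation. The outer loop estimates generalization error; the inner loop selects hyperparameters using only the outer loop's training data. This keeps the reported score from being inflated by hyperparameters that were tuned on the same data used for testing.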
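
The two loops are easier to see in code. The sketch below uses a toy 1-D k-nearest-neighbours classifier as a stand-in model and the number of neighbours as the hyperparameter; the model, the interleaved fold scheme, the data, and every function name are assumptions chosen to keep the example self-contained.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D data)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(data, n_folds):
    """Interleaved folds: item i is held out in fold i % n_folds."""
    for f in range(n_folds):
        test = [d for i, d in enumerate(data) if i % n_folds == f]
        train = [d for i, d in enumerate(data) if i % n_folds != f]
        yield train, test


def nested_cv(data, candidates, outer_k=4, inner_k=3):
    outer_scores = []
    for outer_train, outer_test in folds(data, outer_k):
        # Inner loop: choose the hyperparameter using outer_train only.
        def inner_score(k):
            scores = [accuracy(tr, va, k) for tr, va in folds(outer_train, inner_k)]
            return sum(scores) / len(scores)

        best_k = max(candidates, key=inner_score)
        # Outer loop: score the chosen setting on data the inner loop never saw.
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return outer_scores


# Synthetic 1-D data: label is 1 when x >= 12.
data = [(float(x), int(x >= 12)) for x in range(24)]
outer_scores = nested_cv(data, candidates=[1, 3, 5])
print(outer_scores, sum(outer_scores) / len(outer_scores))
```

The key property is that `outer_test` is never passed to `inner_score`, so the hyperparameter choice cannot adapt to the data used for the final estimate.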

LLM Evaluation: The Benchmarking Crisis

Evaluating LLMs is fundamentally different from evaluating classifiers. Language generation isn’t right/wrong — it’s a spectrum.

Multiple-choice benchmarks (most common):

MMLU: 57 subjects, 4-choice questions. Tests knowledge breadth.
HellaSwag: Sentence completion (commonsense reasoning)
ARC: Grade-school science questions
BBH (BIG-Bench Hard): Harder tasks with clear correct answers

Problems:

Benchmark contamination: Models trained on data scraped from the internet may have seen benchmark questions. Many widely used benchmarks were published before GPT-4's training data cutoff, and their questions and answers appear online, so they may have ended up in the training data.
Benchmark saturation: GPT-4, Claude, and Gemini all score 85-90%+ on MMLU. The benchmark is no longer discriminating between frontier models.
Goodhart’s Law: Models optimized for benchmarks may not improve on the underlying capability. Fine-tuning on benchmark-style questions without improving reasoning ability is well-documented.

Open-ended evaluation (harder but more honest):

LMSYS Chatbot Arena: Anonymous pairwise comparisons by real users
MT-Bench: Multi-turn dialogue quality (GPT-4 as judge)
HumanEval: Code generation, scored by running the generated code against unit tests
GPQA: Graduate-level science questions from domain experts

LLM-as-judge (emerging standard): Use a strong LLM (GPT-4, Claude Opus) to evaluate open-ended outputs. More scalable than human evaluation, catches quality issues that reference answers miss. Bias: the judge tends to prefer responses similar to its own style.

Evaluation for AI Safety

Standard capability benchmarks miss safety-critical properties:

Jailbreak robustness: What fraction of adversarial prompts bypass content policies? (JailbreakBench)

Factual accuracy: What fraction of generated facts are verifiable? (TruthfulQA, FActScore)

Calibration: Does confidence correlate with accuracy? Overconfident wrong answers are worse than uncertain wrong answers.
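
One common way to quantify this is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its accuracy. ECE is brought in here as an illustration and is not named in the text; the bin count, function name, and example data are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(avg_conf - acc)
    return ece


# Illustrative data: both models answer 6 of 10 questions correctly.
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ece_overconfident = expected_calibration_error([0.95] * 10, correct)
ece_calibrated = expected_calibration_error([0.60] * 10, correct)
print(ece_overconfident, ece_calibrated)
```

Both hypothetical models have the same accuracy, but the one claiming 95% confidence has an ECE of about 0.35 while the one claiming 60% is close to 0.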

Behavioral consistency: Does the model give the same answer when a question is paraphrased? Inconsistency suggests pattern-matching rather than reasoning.
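
A simple consistency score is the fraction of questions for which every paraphrase receives the same answer. The sketch below assumes answers are short strings compared after trimming and lowercasing, which is a crude normalization; the function name and the example answers are hypothetical.

```python
def consistency_rate(answers_by_question):
    """Fraction of questions whose paraphrases all got the same normalized answer."""
    consistent = sum(
        1 for answers in answers_by_question
        if len({a.strip().lower() for a in answers}) == 1
    )
    return consistent / len(answers_by_question)


# Hypothetical model answers: one inner list per question, one entry per paraphrase.
answers_by_question = [
    ["Paris", "paris ", "Paris"],
    ["4", "4", "5"],
    ["Yes", "yes", "YES"],
]
rate = consistency_rate(answers_by_question)
print(rate)
```

For free-form answers, exact string matching is too strict, and some form of semantic comparison would be needed instead.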

One thing to remember: Model evaluation is increasingly the bottleneck in AI development — it’s harder to measure real-world capability and safety than to improve them, which is why benchmark saturation and evaluation methodology debates are central concerns in the field.
