Model Evaluation — Core Concepts

Classification Metrics: The Full Picture

For a binary classification model with positive class (P) and negative class (N):

Confusion matrix:

| | Predicted P | Predicted N |
|---|---|---|
| Actual P | TP | FN |
| Actual N | FP | TN |

Derived metrics:

  • Accuracy: $(TP + TN) / (TP + TN + FP + FN)$
  • Precision: $TP / (TP + FP)$ — “Of all predicted positives, how many are correct?”
  • Recall / Sensitivity / TPR: $TP / (TP + FN)$ — “Of all actual positives, how many did we find?”
  • Specificity / TNR: $TN / (TN + FP)$ — “Of all actual negatives, how many did we correctly identify?”
  • F1: $2 \times \text{Precision} \times \text{Recall} / (\text{Precision} + \text{Recall})$
  • F-beta: $(1+\beta^2) \times \text{Precision} \times \text{Recall} / (\beta^2 \text{Precision} + \text{Recall})$ — β > 1 weights recall higher; β < 1 weights precision higher
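As a minimal sketch, the formulas above can be computed directly from the four confusion-matrix counts. The function name `classification_metrics`, the example counts, and the choice to return 0.0 for an undefined ratio are illustrative assumptions, not part of any particular library.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the derived metrics from confusion-matrix counts."""
    def safe_div(num, den):
        # Assumption: report 0.0 when a ratio is undefined (zero denominator).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        # With beta = 1 this is the F1 score.
        "f_beta": safe_div((1 + b2) * precision * recall, b2 * precision + recall),
    }


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Passing `beta=2.0` weights recall more heavily, and `beta=0.5` weights precision more heavily, matching the F-beta definition above.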

The imbalanced class problem: With a 99% negative class, predicting all negatives achieves 99% accuracy but 0% recall, so accuracy alone is misleading. Use F1, AUPRC, or the Matthews correlation coefficient (MCC) instead.
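The all-negatives scenario can be checked numerically. The sketch below assumes a made-up dataset of 1,000 samples with 10 positives, and uses the standard MCC formula; returning 0.0 when the MCC denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat the undefined 0/0 case as 0.0 (no correlation).
    return (tp * tn - fp * fn) / den if den else 0.0


# A model that predicts "negative" for every sample:
# 10 actual positives are all missed, 990 actual negatives are all correct.
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # high, despite finding no positives
print(f"recall   = {recall:.2f}")
print(f"MCC      = {mcc(tp, fp, fn, tn):.2f}")
```

Accuracy comes out at 0.99 while recall and MCC are both 0, which is the gap the paragraph above describes.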

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots TPR vs. FPR across all classification thresholds:

$$\text{TPR}(\tau) = P(\hat{f}(x) > \tau \mid y = 1)$$
$$\text{FPR}(\tau) = P(\hat{f}(x) > \tau \mid y = 0)$$

As threshold $\tau$ decreases: TPR increases (catch more positives) and FPR increases (more false positives). The ROC curve traces this tradeoff.
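The threshold sweep can be sketched in a few lines. The scores and labels below are made up for illustration, the helper name `tpr_fpr` is hypothetical, and the code assumes both classes are present so neither denominator is zero.

```python
def tpr_fpr(scores, labels, tau):
    """Empirical TPR and FPR when predicting positive for score > tau."""
    tp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


# Toy data: model scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Lowering the threshold moves along the ROC curve from (0, 0) to (1, 1).
for tau in [0.95, 0.75, 0.5, 0.2, 0.0]:
    tpr, fpr = tpr_fpr(scores, labels, tau)
    print(f"tau={tau:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Neither rate ever decreases as `tau` drops, which is why the ROC curve is monotone.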

AUC (Area Under the ROC Curve): A scalar summary. AUC = 0.5: random classifier; AUC = 1.0: perfect classifier; AUC = 0.9: very good (correctly ranks 90% of positive/negative pairs).

Probabilistic interpretation: AUC = P(score for positive > score for negative) for a randomly chosen (positive, negative) pair.
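The pairwise interpretation suggests a direct, if quadratic-time, way to compute AUC: count the fraction of (positive, negative) pairs that are ranked correctly. This is a sketch on made-up data; the function name `pairwise_auc` is hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data: 3 positives and 3 negatives give 9 pairs in total.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUC = {pairwise_auc(scores, labels):.3f}")
```

Here the only misranked pair is the positive scored 0.4 against the negative scored 0.7, so 8 of 9 pairs are ordered correctly.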

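When AUC misleads: On highly imbalanced datasets, AUC can look deceptively high. FPR is computed as FP / (FP + TN), so when negatives vastly outnumber positives, even a large absolute number of false positives yields a small FPR. AUPRC (Area Under the Precision-Recall Curve) is more informative for imbalanced problems because precision compares false positives directly against true positives.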

Cross-Validation Strategies

Single train/test splits have high variance — different random splits give different estimates. Cross-validation averages across multiple splits.

k-fold CV: Split into $k$ equal folds. Train on $k-1$ folds, test on the held-out fold. Repeat $k$ times. Report mean ± std. $k=5$ or $k=10$ is standard.
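A minimal sketch of the index bookkeeping behind k-fold CV, in plain Python. The generator name `k_fold_indices` and the seed are illustrative assumptions. Folds are built by striding over a shuffled index list, so fold sizes differ by at most one when $k$ does not divide the sample count evenly.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=3)):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Each sample lands in exactly one test fold, so every data point is used for evaluation once and for training $k-1$ times.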

Stratified k-fold: Ensures each fold has the same class proportions as the full dataset. Essential for imbalanced datasets.
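One way to sketch stratification is to deal each class's samples into the folds round-robin, so every fold receives its share of each class. The function name `stratified_k_fold` and the 10%-positive toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        # Deal this class's samples across the folds one at a time.
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# Toy imbalanced labels: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90
for train_idx, test_idx in stratified_k_fold(labels, k=5):
    n_pos = sum(labels[i] for i in test_idx)
    print(f"test size={len(test_idx)}, positives={n_pos}")
```

With plain k-fold on the same labels, a fold could by chance contain no positives at all, which would make recall undefined for that fold.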

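Leave-one-out CV (LOOCV): k = n, so each sample serves as the test set exactly once. Each model trains on nearly all the data, but the error estimate can have high variance, and the procedure requires n model fits, which is computationally expensive.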

Time series split: For temporal data, folds must respect time order — always train on past, test on future. Standard k-fold would leak future information.
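A minimal sketch of an expanding-window split, assuming the samples are already sorted by time. The function name `time_series_splits` and the block-sizing rule (equal blocks, with the last test block absorbing any remainder) are illustrative choices, not the only way to do this.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * block
        # The final split takes whatever samples remain.
        test_end = train_end + block if i < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


for train_idx, test_idx in time_series_splits(n_samples=12, n_splits=3):
    print(f"train=0..{train_idx[-1]}  test={test_idx[0]}..{test_idx[-1]}")
```

The training window grows with each split, and no test index is ever earlier than a training index, so no future information reaches the model.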

Nested CV: Two loops of cross-validation. Outer loop estimates generalization error; inner loop selects hyperparameters. Prevents hyperparameter overfitting to the test set.
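The two loops can be sketched with a deliberately tiny model. Everything here is a hypothetical illustration: the "model" predicts a constant equal to `shrink` times the training mean, `shrink` is the hyperparameter being tuned, the score is negative mean squared error, and the data is synthetic.

```python
def strided_folds(items, k):
    """Split a list into k folds by striding (no shuffling, for brevity)."""
    return [items[i::k] for i in range(k)]


def cv_score(data, k, hyper, fit, score):
    """Mean k-fold score of one hyperparameter value."""
    folds = strided_folds(data, k)
    total = 0.0
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        total += score(fit(train, hyper), folds[i])
    return total / k


def nested_cv(data, k_outer, k_inner, candidates, fit, score):
    """Return one generalization score per outer fold."""
    outer_folds = strided_folds(data, k_outer)
    outer_scores = []
    for i in range(k_outer):
        outer_train = [x for j, fold in enumerate(outer_folds) if j != i for x in fold]
        # Inner loop: choose the hyperparameter using outer_train only.
        best = max(candidates, key=lambda h: cv_score(outer_train, k_inner, h, fit, score))
        # Outer loop: refit on all of outer_train, score on the untouched fold.
        outer_scores.append(score(fit(outer_train, best), outer_folds[i]))
    return outer_scores


def fit(train, shrink):
    # Toy model: a single constant prediction.
    return shrink * sum(train) / len(train)


def score(prediction, test):
    # Negative MSE, so that higher is better.
    return -sum((y - prediction) ** 2 for y in test) / len(test)


data = [2.0 + 0.1 * (i % 3) for i in range(30)]  # synthetic targets
scores = nested_cv(data, k_outer=5, k_inner=3, candidates=[0.0, 0.5, 1.0], fit=fit, score=score)
print([round(s, 4) for s in scores])
```

The key property is that each outer test fold plays no part in choosing `shrink`, so the outer scores are not biased by the hyperparameter search.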

LLM Evaluation: The Benchmarking Crisis

Evaluating LLMs is fundamentally different from evaluating classifiers. Language generation isn’t right/wrong — it’s a spectrum.

Multiple-choice benchmarks (most common):

  • MMLU: 57 subjects, 4-choice questions. Tests knowledge breadth.
  • HellaSwag: Sentence completion that tests commonsense reasoning
  • ARC: Grade-school science questions
  • BBH (BIG-Bench Hard): Harder tasks with clear correct answers

Problems:

  1. Benchmark contamination: Models trained on data scraped from the internet may have seen benchmark questions. Many widely used benchmarks were published before the training data cutoffs of current models, and their questions and answers circulate online.
  2. Benchmark saturation: GPT-4, Claude, and Gemini all score 85-90%+ on MMLU. The benchmark is no longer discriminating between frontier models.
  3. Goodhart’s Law: Models optimized for benchmarks may not improve on the underlying capability. Fine-tuning on benchmark-style questions without improving reasoning ability is well-documented.

Open-ended evaluation (harder but more honest):

  • LMSYS Chatbot Arena: Anonymous pairwise comparisons by real users
  • MT-Bench: Multi-turn dialogue quality (GPT-4 as judge)
  • HumanEval: Code generation, scored by running the generated code against unit tests
  • GPQA: Graduate-level science questions from domain experts

LLM-as-judge (emerging standard): Use a strong LLM (GPT-4, Claude Opus) to evaluate open-ended outputs. More scalable than human evaluation, catches quality issues that reference answers miss. Bias: the judge tends to prefer responses similar to its own style.

Evaluation for AI Safety

Standard capability benchmarks miss safety-critical properties:

Jailbreak robustness: What fraction of adversarial prompts bypass content policies? (JailbreakBench)

Factual accuracy: What fraction of generated facts are verifiable? (TruthfulQA, FActScore)

Calibration: Does confidence correlate with accuracy? Overconfident wrong answers are worse than uncertain wrong answers.
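One common way to quantify calibration is expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy in each bin. The text does not name a specific metric, so the choice of ECE, the equal-width binning, and the toy inputs are all assumptions of this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Overconfident toy model: claims 95% confidence but is right half the time.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
# Calibrated toy model: claims 75% confidence and is right 3 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(f"overconfident ECE = {overconfident:.2f}")
print(f"calibrated ECE    = {calibrated:.2f}")
```

A lower ECE means stated confidence tracks observed accuracy more closely; it says nothing about how accurate the model is overall.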

Behavioral consistency: Does the model give the same answer when a question is paraphrased? Inconsistency suggests pattern-matching rather than reasoning.
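A simple consistency score is the fraction of questions for which every paraphrase received the same answer. The sketch below assumes answers have already been collected per question; the function name, the question keys, and the answers are made up, and exact match after lowercasing and stripping whitespace is a crude stand-in for real answer comparison.

```python
def consistency_rate(answers_by_question):
    """Fraction of questions whose paraphrases all produced the same answer."""
    def normalize(answer):
        # Assumption: exact match after trimming whitespace and lowercasing.
        return answer.strip().lower()

    consistent = sum(
        1
        for answers in answers_by_question.values()
        if len({normalize(a) for a in answers}) == 1
    )
    return consistent / len(answers_by_question)


# Hypothetical model answers to three paraphrases of each question.
answers = {
    "capital_of_france": ["Paris", "paris ", "Paris"],
    "boiling_point_of_water_c": ["100", "212", "100"],
}
print(f"consistency = {consistency_rate(answers):.2f}")
```

In practice, free-form answers usually need a more forgiving comparison than string equality, such as extracting the final answer first.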

One thing to remember: Model evaluation is increasingly the bottleneck in AI development — it’s harder to measure real-world capability and safety than to improve them, which is why benchmark saturation and evaluation methodology debates are central concerns in the field.
