Scikit-Learn Learning Curves — Core Concepts

How to diagnose underfitting and overfitting in scikit-learn using learning curves — with visual patterns every ML practitioner should recognize.

Why learning curves matter

Every machine learning project faces a fork: should you gather more data or try a different model? Learning curves answer that question empirically by plotting model performance against the number of training samples.

Without this diagnostic, a team can spend weeks collecting data that doesn't improve results, or swap models when the real bottleneck is sample size.

How learning curves work

A learning curve is built by training the same model on progressively larger subsets of the training data, for example 10%, 20%, 40%, 60%, 80%, and 100%. At each size, two scores are recorded:

Training score — how well the model fits the data it learned from
Validation score — how well the model generalizes to unseen data

These two lines, plotted together, create diagnostic patterns.
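
The procedure above can be sketched by hand before reaching for a library helper. This is a minimal illustration, not a recommended workflow: the synthetic dataset, the logistic regression model, and the single hold-out validation split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A single hold-out split keeps the sketch short; it is noisier than cross-validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

fractions = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
rows = []
for frac in fractions:
    n = round(frac * len(X_train))
    # train_test_split already shuffled the rows, so the first n form a random subset.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])
    train_score = model.score(X_train[:n], y_train[:n])  # fit on seen data
    val_score = model.score(X_val, y_val)                # generalization
    rows.append((n, train_score, val_score))

for n, train_score, val_score in rows:
    print(f"n={n:4d}  train={train_score:.3f}  validation={val_score:.3f}")
```

Each row is one point on the two lines: the training score at that size and the validation score at that size.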

Three patterns to recognize

High bias (underfitting): Both training and validation scores are low and converge early. The model is too simple to capture the underlying signal. More data won’t help — you need a more expressive model, additional features, or less aggressive regularization.

High variance (overfitting): The training score is near-perfect but the validation score lags far behind, leaving a persistent gap between the curves. More training data typically narrows this gap, provided the validation curve is still rising at the largest training size. Alternatively, simplify the model or add regularization.

Good fit: Both scores are reasonably high, the gap between them is small, and the validation curve has plateaued. This is the sweet spot where additional data yields diminishing returns and the model generalizes well.
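
The three patterns can be turned into a rough rule of thumb. The helper below is a hypothetical sketch: the name diagnose_curve and the thresholds target and max_gap are assumptions chosen for illustration, and sensible values depend on the task and metric. It assumes a higher-is-better score such as accuracy and looks only at the last point of each curve.

```python
def diagnose_curve(train_means, val_means, target=0.85, max_gap=0.05):
    """Label a learning curve from its final (largest training size) point.

    train_means, val_means: mean scores per training size, smallest size first.
    target, max_gap: hypothetical, project-specific thresholds.
    """
    final_train = train_means[-1]
    final_val = val_means[-1]
    gap = final_train - final_val

    # A wide gap is checked first: the model fits its training data far
    # better than unseen data, whatever the absolute level.
    if gap > max_gap:
        return "high variance"
    # Small gap but a low validation score: both curves settled too low.
    if final_val < target:
        return "high bias"
    return "good fit"


print(diagnose_curve([0.72, 0.70, 0.69], [0.60, 0.66, 0.68]))  # high bias
print(diagnose_curve([1.00, 0.99, 0.99], [0.70, 0.75, 0.80]))  # high variance
print(diagnose_curve([0.95, 0.93, 0.92], [0.85, 0.89, 0.90]))  # good fit
```

A label from a rule like this is a prompt to look at the plot, not a replacement for it. A curve can show both a wide gap and a low validation score, and this sketch reports only the gap in that case.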

Using scikit-learn’s learning_curve function

Scikit-learn provides sklearn.model_selection.learning_curve, which handles the mechanics: splitting data into subsets, cross-validating at each size, and returning arrays of scores.

Key parameters include:

estimator: any scikit-learn model or pipeline
train_sizes: fractions of the largest available training set, or absolute numbers of training examples, to evaluate
cv: the cross-validation strategy (e.g., 5-fold)
scoring: the metric used to evaluate the model at each size (e.g., "accuracy", "f1", "r2")

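The function returns three arrays: the absolute training sizes used, the training scores, and the test scores (the validation scores in this article's terms). Each score array has one row per training size and one column per cross-validation fold, so average across folds before plotting.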
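
A minimal call might look like the following. The synthetic dataset and the logistic regression estimator are assumptions made for the example; the parameters passed are the ones described above, plus shuffle and random_state so the subsets are drawn reproducibly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0],  # fractions of the largest training fold
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Score arrays have shape (n_train_sizes, n_cv_folds); summarize across folds.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

for n, tr, va, sd in zip(train_sizes, train_mean, val_mean, val_std):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f} (+/- {sd:.3f})")
```

With 5-fold cross-validation on 1,000 samples, each training fold holds 800 samples, so a fraction of 1.0 corresponds to 800 training examples. The mean arrays can be plotted against train_sizes with any plotting library, and the standard deviation gives a band showing how much the folds disagree.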

Common misconception

Many practitioners assume a flat validation curve always means the model is as good as it needs to be. In reality, it can also mean the model has plateaued at a mediocre score and needs a different architecture or better features. Always check where the curve plateaued, not just that it stopped moving.
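
One way to make that check explicit is to report flatness and level separately. The helper below is a hypothetical sketch: the name describe_plateau and the thresholds flat_tol and acceptable are assumptions for illustration, it assumes a higher-is-better score, and it judges flatness from the last two points only.

```python
def describe_plateau(val_means, flat_tol=0.005, acceptable=0.85):
    """Say whether the validation curve is flat and, if so, at what level.

    val_means: mean validation score per training size, smallest size first.
    flat_tol, acceptable: hypothetical, project-specific thresholds.
    """
    last_step = val_means[-1] - val_means[-2]
    if abs(last_step) > flat_tol:
        return "still changing"
    if val_means[-1] >= acceptable:
        return "flat and acceptable"
    return "flat but mediocre"


print(describe_plateau([0.60, 0.655, 0.66, 0.661]))  # flat but mediocre
print(describe_plateau([0.80, 0.88, 0.91, 0.912]))   # flat and acceptable
print(describe_plateau([0.60, 0.70, 0.80]))          # still changing
```

The first curve has stopped moving just as clearly as the second, but only the second has settled at a level worth keeping.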

When to use learning curves

Before collecting expensive new data: check if more samples will actually help
During model selection: compare how different models respond to data volume
After feature engineering: verify that new features raised the validation score or narrowed the gap between training and validation scores
In production monitoring: detect when retraining with fresh data stops improving performance
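
For the model selection case, the same call can be repeated per candidate and the curves summarized side by side. This is a sketch under assumptions: the synthetic dataset, the two candidate models, and the three summary fields are all chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Hypothetical candidates: one simple linear model, one unpruned tree.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

summary = {}
for name, estimator in candidates.items():
    sizes, train_scores, val_scores = learning_curve(
        estimator,
        X,
        y,
        train_sizes=[0.2, 0.6, 1.0],
        cv=5,
        scoring="accuracy",
        shuffle=True,
        random_state=0,
    )
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    summary[name] = {
        "final_validation": float(val_mean[-1]),
        "final_gap": float(train_mean[-1] - val_mean[-1]),
        # How much the validation score moved from the smallest to the largest size.
        "gain_from_more_data": float(val_mean[-1] - val_mean[0]),
    }

for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Comparing final_gap and gain_from_more_data across candidates shows which model is still benefiting from additional samples and which has already levelled off.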

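One thing to remember: The gap between training and validation curves is the story, read together with the level at which the curves settle. A gap that shrinks as the training set grows means more data is working. A gap that stays wide means the model needs structural change, such as a simpler model or stronger regularization.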
