Gradient Descent — Core Concepts
Why AI Training Is Basically Downhill Skiing
Most of the machine learning systems you interact with (ChatGPT, Google’s spam filter, the algorithm choosing your Spotify recommendations) were shaped by gradient descent or one of its variants. It’s the workhorse of modern AI, and most people who use AI daily have no idea it exists.
Here’s the core idea: training a model is an optimization problem. You have a model that makes predictions, you have correct answers to compare against, and you need the model’s predictions to get as close to correct as possible. Gradient descent is how you get there.
The Loss Function: Measuring Wrongness
Before descent can happen, you need a way to measure how wrong the model is. That’s the loss function (sometimes called the cost function).
Imagine your model is predicting house prices. It predicts $400,000. The real price was $350,000. The loss function captures that $50,000 gap — and the goal is to make that number as small as possible across all your training examples.
Different problems use different loss functions:
- Mean Squared Error — common for predicting numbers (house prices, temperatures)
- Cross-Entropy Loss: standard for classification into several categories (which digit is in this image?)
- Binary Cross-Entropy: the two-class special case, for yes/no predictions (is this email spam or not?)
The loss function turns “how wrong is the model?” into a single number you can work with mathematically.
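As a rough illustration of how a loss function collapses many errors into one number, here is a minimal sketch of mean squared error and binary cross-entropy in plain Python. The function names, the second house price, and the spam probabilities are made up for the example; real frameworks ship their own, more careful implementations.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and correct answers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Average negative log-likelihood for yes/no labels (1 = yes, 0 = no)."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1 - eps)  # clamp so we never take log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# House prices in thousands of dollars. The first pair is the 400 vs. 350
# example from the text; the second pair is invented.
mse = mean_squared_error([400, 310], [350, 300])

# Hypothetical spam probabilities for two emails labeled [spam, not spam].
bce_confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
bce_confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
```

Squaring makes the $50,000 miss count far more than the $10,000 one, and cross-entropy punishes confident wrong answers much harder than confident right ones.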
The Landscape Metaphor (And Why It’s Not Just Metaphor)
With a loss function, you can imagine a landscape. Every possible configuration of the model’s parameters (its internal settings; large language models have billions of them, and OpenAI has not published an official count for GPT-4) corresponds to a point in this landscape, and the height of that point is the loss.
High points = bad model. Low points = good model.
Gradient descent is the algorithm that navigates this landscape downhill.
The gradient is the mathematical name for slope. In multiple dimensions — and ML models have millions of dimensions — the gradient is a vector that points in the direction of steepest uphill. Flip it, and you have the direction of steepest downhill.
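To make "the gradient points uphill" concrete, here is a small sketch that estimates the gradient of a made-up two-parameter loss with finite differences, then takes one small step along it and one against it. The loss function, the step size of 0.01, and the helper name `numerical_gradient` are assumptions chosen for illustration; real training computes gradients analytically via backpropagation rather than by nudging each parameter.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its lowest point at (1, -2).
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Estimate each partial derivative with a central difference."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

w = [0.0, 0.0]
g = numerical_gradient(loss, w)  # one slope per parameter

# Step a little along the gradient (uphill) and against it (downhill).
uphill = [w[i] + 0.01 * g[i] for i in range(len(w))]
downhill = [w[i] - 0.01 * g[i] for i in range(len(w))]

loss_here = loss(w)
loss_uphill = loss(uphill)
loss_downhill = loss(downhill)
```

Under these assumptions, `loss_uphill` comes out larger than `loss_here` and `loss_downhill` comes out smaller, which is the whole trick behind flipping the gradient.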
Each training step:
- Calculate the gradient at the current position
- Move a small step in the negative gradient direction (downhill)
- Repeat
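The three steps above can be sketched in a few lines. This toy version uses a one-parameter loss, (w - 3)^2, whose gradient is known in closed form; the loss, the starting point, and the learning rate of 0.1 are assumptions picked so the loop is easy to follow, not settings from any real model.

```python
def loss(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        g = gradient(w)             # 1. calculate the gradient here
        w = w - learning_rate * g   # 2. move a small step downhill
    return w                        # 3. (the loop is the "repeat")

w_final = gradient_descent(w=0.0)
```

Starting from 0, each step closes 20% of the remaining distance to 3, so `w_final` ends up extremely close to the minimum.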
Learning Rate: The Most Important Hyperparameter You’ve Never Heard Of
The size of each step is controlled by the learning rate — a small number, typically something like 0.001 or 0.0001.
Too large a learning rate: you overshoot the valley, bouncing around without settling. Too small: training takes forever, or you get stuck in minor dips.
Getting the learning rate right is part art, part science. Teams training large models put substantial effort into tuning it, and into scheduling how it changes over the course of a run, because a bad learning rate can mean a training run that costs millions in compute teaches the model almost nothing.
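The effect of step size is easy to see on a toy problem. The sketch below runs 50 steps of gradient descent on the loss (w - 3)^2 with three learning rates; the loss and the specific rates (1.1, 0.1, 0.0001) are assumptions chosen to exaggerate each failure mode, and sensible values for a real model will differ.

```python
def distance_after_training(learning_rate, steps=50, w=0.0):
    """Run gradient descent on (w - 3)^2 and report how far w is from 3."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return abs(w - 3.0)

too_large = distance_after_training(1.1)      # overshoots further each step
reasonable = distance_after_training(0.1)     # settles near the minimum
too_small = distance_after_training(0.0001)   # has barely moved
```

All three runs start 3.0 away from the minimum. With the large rate the distance grows instead of shrinking, with the small rate it is still close to 3.0 after 50 steps, and only the middle rate gets near zero.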
Variants: Not All Descent Is Equal
Plain gradient descent — running through your entire dataset before each step — is rarely used today. It’s too slow. Instead:
Stochastic Gradient Descent (SGD): Uses one random training example per step. Fast but noisy: the path zigzags wildly toward the minimum.
Mini-Batch Gradient Descent: The practical default. Uses a small batch (for example 32, 64, or 256 examples) per step. Balanced between speed and stability. Almost every modern model trains this way.
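Here is a minimal sketch of mini-batch gradient descent fitting a straight line with mean squared error. The dataset (noiseless points on y = 2x + 1), the batch size of 4, the learning rate, and the function name `minibatch_sgd` are all assumptions made for the example. Setting `batch_size=1` turns it into one-example-at-a-time SGD, and setting it to the dataset size gives plain full-batch gradient descent.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # model: prediction = w * x + b
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)             # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean squared error, averaged over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [i / 10 for i in range(20)]         # 0.0, 0.1, ..., 1.9
ys = [2.0 * x + 1.0 for x in xs]         # toy targets on the line y = 2x + 1
w, b = minibatch_sgd(xs, ys)
```

Each step looks at only 4 of the 20 points, yet under these assumptions `w` and `b` end up very close to 2 and 1.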
Adam (Adaptive Moment Estimation): The current favorite for most deep learning. Introduced by Diederik Kingma and Jimmy Ba in 2014, Adam adapts the step size for each parameter individually, using running averages of that parameter’s gradient and squared gradient. In practice, it often converges faster and tends to be less sensitive to the learning rate choice than plain SGD. Neither PyTorch nor TensorFlow picks an optimizer for you, but Adam is the one most tutorials and codebases reach for first, for good reason.
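For readers who want to see the per-parameter adaptation, here is a from-scratch sketch of the Adam update in plain Python. It uses the commonly cited decay rates (0.9 and 0.999), but the toy loss, the learning rate of 0.1, and the function names are assumptions for illustration; in practice you would use the optimizer your framework provides rather than this loop.

```python
import math

def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = [0.0] * len(w)  # running average of each parameter's gradient
    v = [0.0] * len(w)  # running average of each parameter's squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # correct the bias toward zero
            v_hat = v[i] / (1 - beta2 ** t)
            # Each parameter gets its own effective step size.
            w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

def toy_gradient(w):
    # Gradient of the made-up loss 100 * (w0 - 1)^2 + 0.01 * (w1 + 2)^2:
    # very steep along w0, nearly flat along w1.
    return [200.0 * (w[0] - 1.0), 0.02 * (w[1] + 2.0)]

w_final = adam_minimize(toy_gradient, [0.0, 0.0])
```

The two directions differ in steepness by a factor of 10,000. A single shared step size small enough to be stable along `w0` would crawl along `w1`, while the per-parameter scaling lets both coordinates approach their targets (1 and -2) at a similar pace in this toy setup.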
The Local Minimum Problem (It’s Complicated)
A valley in a landscape isn’t always the deepest valley. The model might roll into a local minimum — a solution that’s good but not optimal — without ever finding the global minimum.
For years, this terrified researchers. But in practice, large neural networks rarely get completely stuck. In a loss landscape with so many dimensions, true local minima appear to be rare: most apparent valleys have at least one escape route in some direction. The bigger problem in practice is saddle points, where the landscape curves upward in some dimensions and downward in others, so the gradient shrinks toward zero and progress stalls.
Optimizers that use momentum and adaptive step sizes, like Adam, tend to get past saddle points faster than vanilla gradient descent.
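A saddle point is easy to build by hand. The sketch below uses a made-up two-parameter loss, x^2 + y^4/4 - y^2/2, which curves upward along x and downward along y at the origin, with true minima at y = 1 and y = -1. The loss, the learning rate, and the tiny nudge of 1e-6 are assumptions for illustration; the nudge stands in for the noise that mini-batches naturally add.

```python
def gradient(x, y):
    # Gradient of the toy loss x^2 + y^4/4 - y^2/2.
    return 2.0 * x, y ** 3 - y

def descend(x, y, learning_rate=0.1, steps=500):
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

# Starting exactly on the ridge line y = 0, the y-gradient is always zero,
# so plain gradient descent slides into the saddle at (0, 0) and stays there.
stuck_x, stuck_y = descend(1.0, 0.0)

# A tiny nudge off the ridge is enough to find the downhill direction.
free_x, free_y = descend(1.0, 1e-6)
```

In the first run `y` never moves from 0, so the search stalls at the saddle. In the second, the small offset grows a little each step until the run rolls down to the minimum near y = 1.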
A Common Misconception
Most people think gradient descent finds the best possible solution. It doesn’t. It finds a good solution, and that’s usually fine. The model doesn’t need to be perfect — it needs to generalize well to new data it hasn’t seen. Sometimes a slightly less perfect fit on training data means better performance on real-world inputs. This is why training isn’t just about minimizing loss to zero.
One Thing to Remember
Gradient descent doesn’t find the perfect answer — it finds a good-enough answer by taking millions of tiny steps downhill through a mathematical landscape. The learning rate controls how big those steps are, and getting it wrong can waste millions in compute.
See Also
- AI Hallucinations ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why, and why it's not as simple as 'the AI lied.'
- Artificial Intelligence What is AI really? Think of it as a dog that learned tricks — impressive, but it doesn't know why it's doing them.
- Bias Variance Tradeoff The fundamental tension in machine learning between being wrong in the same way vs. being wrong in different ways — and why the simplest model isn't always best.
- Deep Learning Why your phone can spot your face in a messy photo album — and why that trick comes from practice, not magic.
- Embeddings How do computers know that 'dog' and 'puppy' mean almost the same thing? They don't read definitions — they turn words into secret map coordinates, and nearby coordinates mean nearby meanings.