Gradient Descent — Core Concepts

The algorithm behind nearly every modern AI system you’ve used, and the surprisingly simple math that makes it work.

Why AI Training Is Basically Downhill Skiing

Most of the machine learning models you interact with, from ChatGPT to Google’s spam filter to the algorithm choosing your Spotify recommendations, were shaped by gradient descent or a close variant of it. It’s the workhorse of modern AI, and most people who use AI daily have no idea it exists.

Here’s the core idea: training a model is an optimization problem. You have a model that makes predictions, you have correct answers to compare against, and you need the model’s predictions to get as close to correct as possible. Gradient descent is how you get there.

The Loss Function: Measuring Wrongness

Before descent can happen, you need a way to measure how wrong the model is. That’s the loss function (sometimes called cost function).

Imagine your model is predicting house prices. It predicts $400,000. The real price was $350,000. The loss function captures that $50,000 gap — and the goal is to make that number as small as possible across all your training examples.

Different problems use different loss functions:

Mean Squared Error: common for predicting numbers (house prices, temperatures)
Cross-Entropy Loss: standard for classification (which of several categories does this input belong to?)
Binary Cross-Entropy: the two-class case of cross-entropy, for yes/no predictions (is this email spam or not?)

The loss function turns “how wrong is the model?” into a single number you can work with mathematically.
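
To make the "single number" idea concrete, here is a minimal sketch of two of the losses above in plain Python. The house prices and spam probabilities are made-up illustrative values, and the function names are my own rather than any library's API.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of squared differences; used when predicting numbers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Average negative log-likelihood for yes/no labels (1 or 0)."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Hypothetical house prices in thousands of dollars; the first prediction is off by 50.
predicted_prices = [400.0, 310.0]
actual_prices = [350.0, 300.0]
mse = mean_squared_error(predicted_prices, actual_prices)

# Hypothetical spam probabilities from a classifier, with the true labels.
spam_probs = [0.9, 0.2, 0.6]
spam_labels = [1, 0, 1]
bce = binary_cross_entropy(spam_probs, spam_labels)

print(mse, bce)
```

Squaring means the 50-unit miss contributes 25 times as much as the 10-unit miss, which is why Mean Squared Error punishes large errors so heavily.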

The Landscape Metaphor (And Why It’s Not Just Metaphor)

With a loss function, you can imagine a landscape. Every possible configuration of the model’s parameters (its internal settings; GPT-4 is widely reported, though not officially confirmed, to have about 1.8 trillion of them) corresponds to a point in this landscape, and the height of that point is the loss.

High points = bad model. Low points = good model.

Gradient descent is the algorithm that navigates this landscape downhill.

The gradient generalizes the idea of slope. In multiple dimensions, and ML models have millions to billions of them (one per parameter), the gradient is a vector that points in the direction of steepest uphill. Flip it, and you have the direction of steepest downhill.
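
As a rough illustration, here is the gradient of a toy two-parameter loss, computed once from the formula and once by nudging each parameter. The loss function and the starting point are assumptions chosen for the example; real models get their gradients from automatic differentiation, not finite differences.

```python
def loss(w1, w2):
    """A hypothetical two-parameter loss: a bowl with its lowest point at (3, -1)."""
    return (w1 - 3) ** 2 + 2 * (w2 + 1) ** 2

def analytic_gradient(w1, w2):
    """Partial derivatives of the loss with respect to each parameter."""
    return (2 * (w1 - 3), 4 * (w2 + 1))

def numerical_gradient(f, w1, w2, h=1e-6):
    """Central finite differences: nudge each parameter and watch the loss change."""
    d1 = (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)
    d2 = (f(w1, w2 + h) - f(w1, w2 - h)) / (2 * h)
    return (d1, d2)

point = (0.0, 0.0)
grad = analytic_gradient(*point)            # points uphill
approx = numerical_gradient(loss, *point)   # should agree closely with grad
downhill = (-grad[0], -grad[1])             # flip the sign to head downhill

print(grad, approx, downhill)
```

At the starting point the downhill direction increases the first parameter and decreases the second, which is exactly the way toward the bottom of the bowl at (3, -1).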

Each training step:

1. Calculate the gradient at the current position
2. Move a small step in the negative gradient direction (downhill)
3. Repeat
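
The three steps above can be sketched in a few lines. This assumes a toy one-parameter loss, (w - 3)^2, picked only because its minimum is easy to verify by eye; the function names and default values are illustrative.

```python
def gradient(w):
    """Gradient of the hypothetical loss (w - 3)^2, whose minimum is at w = 3."""
    return 2 * (w - 3)

def gradient_descent(start, learning_rate=0.1, steps=100):
    w = start
    for _ in range(steps):               # the loop itself is the "repeat"
        g = gradient(w)                  # calculate the gradient at the current position
        w = w - learning_rate * g        # step in the negative gradient direction
    return w

w_final = gradient_descent(start=0.0)
print(w_final)  # ends up very close to 3
```

A real training loop has the same shape. The differences are that w is millions of numbers instead of one, and the gradient comes from backpropagation over training data.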

Learning Rate: The Most Important Hyperparameter You’ve Never Heard Of

The size of each step is controlled by the learning rate — a small number, typically something like 0.001 or 0.0001.

Too large a learning rate: you overshoot the valley, bouncing around without settling. Too small: training takes forever, or you get stuck in minor dips.
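
Here is a small experiment showing all three regimes on the same toy loss, (w - 3)^2. The specific rates are assumptions tuned to this one function: 1.1 is far larger than anything you would use on a real model, and it is chosen only because it makes this particular loss diverge.

```python
def run(learning_rate, steps=50, start=0.0):
    """Gradient descent on the hypothetical loss (w - 3)^2.

    Returns the final distance from the minimum at w = 3.
    """
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)
    return abs(w - 3)

too_small = run(0.0001)   # barely moves in 50 steps
reasonable = run(0.1)     # settles close to the minimum
too_large = run(1.1)      # each step overshoots further than the last

print(too_small, reasonable, too_large)
```

With the tiny rate the model is still almost where it started, and with the oversized rate it ends up much further from the minimum than it began.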

Getting the learning rate right is part art, part science. Teams training large models put substantial effort into tuning learning rates and the schedules that change them during training. A bad learning rate can mean that a training run costing millions of dollars in compute teaches the model almost nothing.

Variants: Not All Descent Is Equal

Plain gradient descent — running through your entire dataset before each step — is rarely used today. It’s too slow. Instead:

Stochastic Gradient Descent (SGD): Uses one random training example per step. Fast, but noisy, so the path zigzags wildly toward the minimum.

Mini-Batch Gradient Descent: The practical default. Uses a small batch (32, 64, 256 examples) per step. Balanced between speed and stability. Almost every modern model trains this way.
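
The only thing separating plain, stochastic, and mini-batch gradient descent is how many examples feed each step. The sketch below makes that a single batch_size argument. The data (a noisy line, y = 2x + 1), the linear model, and all the names are assumptions for illustration.

```python
import random

def make_data(n=200, seed=0):
    """Hypothetical data from y = 2x + 1 with a little noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.05)) for x in xs]

def train(data, batch_size, learning_rate=0.1, epochs=100, seed=0):
    """Fit y = w*x + b with mean squared error, one update per batch."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not change the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Average the MSE gradient over just this batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

data = make_data()
w_sgd, b_sgd = train(data, batch_size=1)            # stochastic: one example per step
w_mini, b_mini = train(data, batch_size=32)         # mini-batch
w_full, b_full = train(data, batch_size=len(data))  # plain (full-batch) gradient descent

print((w_sgd, b_sgd), (w_mini, b_mini), (w_full, b_full))
```

All three should land near w = 2 and b = 1 on a problem this easy. The trade-off shows up in the step count: for the same number of passes over the data, batch size 1 takes 200 noisy steps per pass while full-batch takes one precise step.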

Adam (Adaptive Moment Estimation): The current favorite for most deep learning. Introduced by Diederik Kingma and Jimmy Ba in 2014, Adam adapts the step size for each parameter individually, using running averages of that parameter’s recent gradients and squared gradients. In practice, it often converges faster and is less sensitive to learning rate choice than plain SGD. Neither PyTorch nor TensorFlow picks an optimizer for you, but Adam is a very common first choice in both, for good reason.
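
For the curious, here is a bare-bones sketch of the Adam update rule as described in the Kingma and Ba paper, written in plain Python rather than taken from any library. The toy loss, the function names, and the learning rate of 0.1 are assumptions for this example; the beta and epsilon values are the paper's suggested defaults.

```python
import math

def adam_minimize(grad_fn, params, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Apply the Adam update rule to a list of floats."""
    params = list(params)
    m = [0.0] * len(params)  # running average of gradients (first moment)
    v = [0.0] * len(params)  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for early steps
            v_hat = v[i] / (1 - beta2 ** t)
            # Each parameter's step is scaled by its own gradient history.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params

def toy_gradient(p):
    """Gradient of the hypothetical loss (p0 - 3)^2 + 100 * (p1 + 1)^2."""
    return [2 * (p[0] - 3), 200 * (p[1] + 1)]

solution = adam_minimize(toy_gradient, [0.0, 0.0])
print(solution)  # should be close to [3, -1]
```

The toy loss is deliberately lopsided: the second parameter's gradients are 100 times larger than the first's. Plain gradient descent with the same 0.1 rate would blow up along that steep direction, while Adam's per-parameter scaling keeps both steps at a sensible size.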

The Local Minimum Problem (It’s Complicated)

A valley in a landscape isn’t always the deepest valley. The model might roll into a local minimum — a solution that’s good but not optimal — without ever finding the global minimum.

For years, this terrified researchers. But in practice, large neural networks rarely get completely stuck. With millions of dimensions to move in, true local minima appear to be rare: for a point to be a local minimum, the loss has to curve upward in every direction at once, and most apparent valleys have at least one escape route in some direction. The bigger problem in practice is saddle points, where the landscape curves downward in some dimensions and upward in others, and the gradient shrinks toward zero so progress can stall.

Modern optimizers like Adam, which combine momentum with per-parameter step sizes, tend to move through saddle points more readily than vanilla gradient descent.
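
A saddle point is easier to see than to describe. The sketch below runs plain gradient descent (not Adam) on a toy two-parameter loss that I picked because it has a saddle at (0, 0) and true minima at (0, 1) and (0, -1). Everything here is an illustrative assumption, not a model of a real network.

```python
def gradient(x, y):
    """Gradient of the hypothetical loss x^2 + y^4/4 - y^2/2.

    (0, 0) is a saddle point: the surface curves up along x and down along y.
    The true minima sit at (0, 1) and (0, -1).
    """
    return 2 * x, y ** 3 - y

def descend(x, y, learning_rate=0.1, steps=500):
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

stuck = descend(1.0, 0.0)      # the y-gradient is exactly zero, so y never moves
escaped = descend(1.0, 1e-6)   # a tiny nudge in y is enough to slide off the saddle

print(stuck, escaped)
```

The first run parks itself on the saddle, while the second finds the real minimum at y = 1. This is one commonly cited reason the noise in mini-batch training can help: it supplies small nudges like this for free.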

A Common Misconception

Most people think gradient descent finds the best possible solution. It doesn’t. It finds a good solution, and that’s usually fine. The model doesn’t need to be perfect — it needs to generalize well to new data it hasn’t seen. Sometimes a slightly less perfect fit on training data means better performance on real-world inputs. This is why training isn’t just about minimizing loss to zero.

One Thing to Remember

Gradient descent doesn’t find the perfect answer — it finds a good-enough answer by taking millions of tiny steps downhill through a mathematical landscape. The learning rate controls how big those steps are, and getting it wrong can waste millions in compute.
