Gradient Descent: The Algorithm That Trains Every AI Model

If you understand one algorithm in all of AI, make it gradient descent. It’s how neural networks learn. Every model you’ve ever used — GPT, Stable Diffusion, BERT, AlphaGo — was trained by some variant of this single idea.

The concept is beautifully simple. The implementation details fill textbooks. This guide focuses on building real intuition.

The core idea

Imagine you’re blindfolded on a hilly landscape. Your goal is to find the lowest valley. You can’t see, but you can feel the slope under your feet.

Strategy: Take a step in whichever direction goes downhill most steeply. Repeat.

That’s gradient descent. The “landscape” is the loss function (how wrong your model is), the “position” is your model’s parameters (weights and biases), and the “slope” is the gradient (the direction and rate of steepest increase in loss).

You step in the negative gradient direction — downhill — to reduce loss.

Making it precise

A neural network has millions (or billions) of parameters: θ = [θ₁, θ₂, …, θₙ].

The loss function L(θ) measures how bad the model’s predictions are on the training data. Lower loss = better predictions.

The gradient ∇L(θ) is a vector of partial derivatives — one per parameter — pointing in the direction of steepest increase in loss.

The update rule:

θ_new = θ_old - α × ∇L(θ_old)

Where α (alpha) is the learning rate — how big each step is.

That’s the entire algorithm. Compute the gradient, step opposite to it, repeat.

Why the learning rate matters enormously

The learning rate is the most important hyperparameter in deep learning. Get it wrong and nothing works.

Too large: You overshoot valleys, bouncing back and forth, potentially diverging to infinity. The model never converges.

Too small: You make progress painfully slowly. Training that should take hours takes weeks. You might also get stuck in shallow local minima.

Just right: The model converges efficiently to a good solution. But “just right” changes during training — you want larger steps early (exploring) and smaller steps later (refining).

Learning rate schedules

Modern training uses learning rate schedules that change α over time:

Warmup: Start very small, increase linearly for the first few thousand steps. This prevents early instability.
Cosine decay: After warmup, decrease α following a cosine curve. Smooth and effective.
Step decay: Drop α by a fixed factor (e.g., 10x) at predetermined intervals.
One-cycle: Increase then decrease α in a single cycle. Surprisingly effective for many tasks.

The learning rate is so important that finding a good one is worth more than almost any other architectural change.

Stochastic gradient descent (SGD)

Computing the gradient over the entire training dataset is expensive. For a dataset with millions of examples, one gradient computation requires processing every single one.

Stochastic gradient descent computes the gradient on a small random subset (a mini-batch) instead. This gradient is noisy — it doesn’t point exactly downhill — but it points roughly in the right direction.

The noise is actually beneficial:

Escaping local minima. The randomness helps the optimizer bounce out of shallow valleys and find deeper ones.
Implicit regularization. SGD’s noise acts as a regularizer, preventing the model from memorizing the training data too precisely.
Speed. Processing 32 examples is much faster than processing 1 million, and you make progress with every batch.

Typical mini-batch sizes: 32, 64, 128, 256, or 512. Larger batches give more accurate gradient estimates but less regularization benefit. There’s an active debate about optimal batch sizes.

Beyond vanilla SGD

Plain SGD has limitations. The gradient at any point only tells you about the local slope — it says nothing about curvature or momentum. Modern optimizers add these signals.

Momentum

Instead of stepping purely based on the current gradient, accumulate a running average of recent gradients:

velocity = β × velocity + gradient
θ = θ - α × velocity

This is like a ball rolling downhill — it builds up speed in consistent directions and dampens oscillation in noisy directions. β is typically 0.9.

Adam (Adaptive Moment Estimation)

The most widely used optimizer in deep learning. It combines two ideas:

Momentum (first moment) — running average of gradients
Adaptive learning rates (second moment) — running average of squared gradients

Parameters that receive large, consistent gradients get smaller learning rates (they’re already learning fast). Parameters with small, noisy gradients get larger learning rates (they need more help).

m = β₁ × m + (1-β₁) × gradient          # first moment
v = β₂ × v + (1-β₂) × gradient²         # second moment  
θ = θ - α × m / (√v + ε)                # update

Default hyperparameters (β₁=0.9, β₂=0.999, ε=1e-8) work well for most tasks. Adam is the “safe default” optimizer — if you’re not sure what to use, use Adam.

AdamW

A corrected version of Adam that properly implements weight decay regularization. It’s the standard for training transformers, including all modern LLMs.

The loss landscape

For simple models, the loss landscape is a smooth bowl with one minimum. Gradient descent rolls straight to it.

For deep neural networks, the landscape is astronomically high-dimensional and wildly non-convex. There are:

Saddle points — flat in some directions, sloped in others. Far more common than local minima in high dimensions.
Plateaus — large flat regions where gradients are near zero. Training appears stuck.
Sharp vs. flat minima — sharp minima generalize poorly; flat minima generalize well. SGD’s noise naturally favors flat minima.

Despite the theoretical nightmare of non-convex optimization, gradient descent works remarkably well in practice for neural networks. Why? The most compelling explanation: in very high-dimensional spaces, most local minima have similar loss values to the global minimum. Bad local minima are rare.

Backpropagation: computing the gradient

The gradient requires computing how the loss changes with respect to every parameter. For a model with billions of parameters, this seems impossible.

Backpropagation makes it efficient using the chain rule of calculus. It computes gradients layer by layer, from output back to input, reusing intermediate results. The cost of computing the gradient is roughly 2-3x the cost of a forward pass — independent of the number of parameters.

This efficiency is why neural networks are trainable at all. Without backpropagation, computing gradients for a billion-parameter model would be intractable.

What gradient descent can’t do

Gradient descent requires a differentiable loss function — you need to compute derivatives. This means:

Discrete decisions (yes/no, select from a list) aren’t directly optimizable. Workarounds like the straight-through estimator or Gumbel-softmax exist but add complexity.
Non-differentiable metrics (BLEU, exact match, human preference) can’t be directly optimized. RLHF is partly a solution to this — it creates a differentiable proxy (the reward model) for human preferences.
Architecture search — choosing the model structure itself isn’t a gradient descent problem.

The remarkable generality

What makes gradient descent extraordinary isn’t any single application — it’s the generality. The same algorithm trains:

A spam filter with 1,000 parameters
A language model with 1 trillion parameters
An image generator, a speech recognizer, a game-playing agent

The model architecture changes. The data changes. The optimizer variant changes. But the fundamental loop — compute gradient, step downhill, repeat — is always the same.

Understanding gradient descent means understanding how all of modern AI learns. Everything else is structure built on top of this foundation.