ML Optimization

TL;DR

Optimization is how a model learns — you define what "wrong" means with a loss function, then gradient descent finds the parameter values that minimize it. The learning rate is the most consequential hyperparameter you'll ever set. SGD is noisy but that noise is a feature. Momentum smooths out the chaos. Adam adapts learning rates per-parameter and is the practical default for almost everything. AdamW fixes a subtle but important bug in Adam's handling of weight decay. Learning rate schedules (warmup + cosine decay) and gradient clipping round out the modern training recipe. All of this is powered by backpropagation, the calculus trick that makes computing gradients feasible.

I avoided the math behind optimization for longer than I'd like to admit. Every time I saw a gradient descent equation in a paper, I'd nod knowingly and scroll past, operating on faith that PyTorch was handling it. I could call optimizer.step() and get results, which felt like understanding. It wasn't. Finally, the discomfort of not knowing what was actually happening under the hood when my model refused to converge — and having no intuition for why — grew too great. Here is that dive.

Optimization in machine learning is the process of finding the best parameter values for a model. It was formalized long before ML existed — in operations research and numerical analysis — but the specific cocktail of stochastic gradient descent, momentum, and adaptive learning rates that powers modern deep learning came together between roughly 2012 and 2015. Today, it's the engine behind every neural network, every language model, every recommendation system that learns from data.

Before we start, a heads-up. We're going to be building gradient descent from absolute zero, walking through optimizer internals, and getting into the weeds on learning rates and loss landscapes. You don't need to know calculus or linear algebra beforehand. We'll add the concepts we need one piece at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

What Does "Wrong" Mean? The Loss Function
The Foggy Landscape
Feeling the Slope: The Gradient
Taking a Step: The Update Rule
How Much Data Per Step? Batch, Stochastic, Mini-batch
The Learning Rate: One Number to Rule Them All
Rest Stop
Momentum: The Heavy Ball
Nesterov: Looking Before You Leap
Adaptive Learning Rates: AdaGrad, RMSprop, Adam
AdamW: The One You Should Actually Use
Learning Rate Schedules
The Terrain Gets Weird: Saddle Points and Flat Valleys
Gradient Clipping: Insurance Against Catastrophe
Backpropagation Is Not Gradient Descent
Second-Order Methods (Brief Detour)
The Full Recipe
Resources

What Does "Wrong" Mean? The Loss Function

Imagine we're building a model to predict house prices. We have one input — square footage — and one output — price. Our model is absurdly simple: multiply the square footage by some number w (the weight), add some number b (the bias), and that's our prediction.

predicted_price = w * square_footage + b

Right now w and b are random. Our model predicts that a 1,500 sqft house costs $47. That's wrong. But how wrong? We need a number that captures the wrongness. That number is the loss — a single value that says how far our predictions are from reality. A common choice is mean squared error: take the difference between each prediction and the actual price, square it (so negative errors don't cancel positive ones), and average across all houses.

# Our tiny dataset: 3 houses
actual_prices = [300_000, 450_000, 200_000]
predicted     = [47, 47, 47]        # random initial predictions

# Mean squared error
errors_squared = [(a - p) ** 2 for a, p in zip(actual_prices, predicted)]
loss = sum(errors_squared) / len(actual_prices)   # a very large number

The loss is enormous. Good — it should be. Our model knows nothing yet. The entire goal of optimization is to find values of w and b that make this number as small as possible. That's it. Every optimizer, every training loop, every convergence curve you've ever seen — they're all chasing a smaller loss.

The Foggy Landscape

Here's the mental image that makes everything click. Picture the loss as a physical terrain. Every possible combination of w and b is a point on a vast landscape, and the height at that point is the loss. High ground means bad predictions. Valleys mean good ones. The deepest valley is the best our model can do.

Now imagine you're standing on this landscape in thick fog. You can't see the valleys. You can't see the horizon. You can feel the slope under your boots — the ground tilts to the left and slightly downward. That's all you have.

So you take a step downhill. You feel the slope again. Take another step. Keep going. Eventually you end up in a valley. Maybe not the deepest valley on the whole landscape, but a valley. Your loss went down, and your model got better.

That's gradient descent. The entire idea. We'll keep coming back to this foggy landscape throughout the post, because it turns out the terrain is stranger than you'd expect — full of saddle points, narrow ravines, and vast flat plateaus. But the core of it is always: feel the slope, step downhill.

Feeling the Slope: The Gradient

The slope under your feet has a technical name: the gradient. It's a vector — one number for each parameter — that tells you which direction is steepest uphill. For our house price model with two parameters (w and b), the gradient has two components: how much the loss changes when you nudge w a tiny bit, and how much it changes when you nudge b a tiny bit.

Those individual numbers are called partial derivatives. The partial derivative of the loss with respect to w answers one question: "If I increase w by a tiny amount, does the loss go up or down, and by how much?" If the answer is "up by a lot," then we should decrease w. If "down by a little," we should increase it, gently.

The gradient bundles all these partial derivatives together into a single arrow pointing uphill. We want to go downhill, so we move in the opposite direction — the negative gradient.

# For our 2-parameter model:
# gradient = [∂L/∂w, ∂L/∂b]
#
# ∂L/∂w answers: "how does the loss change when I wiggle w?"
# ∂L/∂b answers: "how does the loss change when I wiggle b?"
#
# We step in the OPPOSITE direction:
# w_new = w - step_size * ∂L/∂w
# b_new = b - step_size * ∂L/∂b

I'll be honest — when I first encountered partial derivatives, the notation scared me more than the concept. The concept is "wiggle one knob, see what happens to the loss." That's all a partial derivative is.
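You can verify the "wiggle" intuition numerically, no calculus required: nudge one parameter by a tiny ε and watch the loss move. Here's a minimal sketch on the house-price model (the square footages are made up for illustration):

def mse_loss(w, b):
    sqft   = [1500, 2000, 1000]                 # illustrative inputs
    prices = [300_000, 450_000, 200_000]
    preds  = [w * s + b for s in sqft]
    return sum((p - y) ** 2 for p, y in zip(preds, prices)) / len(prices)

eps = 1e-4
w, b = 0.0, 0.0

# "Wiggle one knob, see what happens to the loss"
dL_dw = (mse_loss(w + eps, b) - mse_loss(w, b)) / eps
dL_db = (mse_loss(w, b + eps) - mse_loss(w, b)) / eps
print(dL_dw, dL_db)   # both negative: increasing w or b would lower the loss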

Taking a Step: The Update Rule

With the gradient in hand, the actual update is almost anticlimactic:

θ = θ - α * ∇L(θ)

# θ = all parameters (w and b in our case)
# α = learning rate (how big a step)
# ∇L(θ) = gradient of loss with respect to θ (the slope)

That single line is gradient descent. Every optimizer you'll ever encounter — SGD, Adam, AdaGrad, all of them — is a variation on this theme. They all compute some version of "which direction is downhill" and take some version of "a step in that direction." The differences lie in how they estimate the direction and how they decide the step size.

Let's watch it work on our house price model. We'll simplify to one parameter — the weight w — and use a toy loss function where the best value is w = 150.

# Toy: loss = (w - 150)^2, minimum at w = 150
def loss(w):
    return (w - 150) ** 2

def gradient(w):
    return 2 * (w - 150)

w = 0.0          # start with a terrible guess
alpha = 0.1      # learning rate

for step in range(8):
    grad = gradient(w)
    w = w - alpha * grad
    print(f"Step {step}: w = {w:.1f}, loss = {loss(w):.0f}")

# Step 0: w = 30.0,  loss = 14400
# Step 1: w = 54.0,  loss = 9216
# Step 2: w = 73.2,  loss = 5898
# Step 3: w = 88.6,  loss = 3775
# Step 4: w = 100.8, loss = 2416
# ...converging toward 150

Each step, the loss drops. The gradient tells us "you're too low, go up," and we go up. As we get closer to 150, the gradient gets smaller (the slope flattens), so our steps naturally shrink. We're rolling into the valley and slowing down.

How Much Data Per Step? Batch, Stochastic, Mini-batch

Our toy example had 3 houses. The gradient was cheap to compute. Real datasets have millions of examples. Computing the exact gradient over all of them for every single step is like surveying the entire mountainside before taking one step. Accurate, but agonizingly slow.

This tension — accuracy of the gradient estimate versus computational cost per step — splits gradient descent into three flavors.

Batch gradient descent uses every training example to compute one gradient, then takes one step. The gradient is as accurate as it can be for this dataset. The path downhill is smooth and predictable. But with 10 million examples, you're doing 10 million calculations for a single step. Then doing it again. For large datasets, this is impractical.

Stochastic gradient descent (SGD) goes to the other extreme. Pick one training example at random, compute the gradient from that single example, take a step. The gradient estimate is terrible — one house is a pitiful representation of the whole market. The path downhill looks drunk. But you're taking N steps per pass through the data instead of one.

And here's the part that surprised me: the noise is actually helpful. A single-sample gradient is inaccurate, yes, but that inaccuracy acts like random jostling. It kicks the optimizer out of shallow valleys and past saddle points that a clean, exact gradient would settle into contentedly. The noise is a feature, not a bug. We'll revisit this idea later when we talk about sharp and flat minima.

Mini-batch gradient descent is the practical middle ground. Grab a random batch of B examples (typically 32 to 512), compute the gradient from those, take a step. The estimate is much better than a single sample, the cost is manageable, and GPUs are designed to process batches efficiently. This is what everyone actually uses. When someone says "SGD" in a paper or codebase, they almost always mean mini-batch SGD with a batch size of 32 to 256.

# What "training" actually looks like
for epoch in range(num_epochs):
    for batch_X, batch_y in random_batches(X, y, batch_size=64):
        gradient = compute_gradient(batch_X, batch_y, θ)
        θ = θ - α * gradient
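The random_batches helper in that loop is pseudocode. A minimal numpy sketch of what it might do (the name and signature are assumptions, not a library API):

import numpy as np

def random_batches(X, y, batch_size=64):
    indices = np.random.permutation(len(X))      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]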

The batch size controls a tradeoff. Larger batches give cleaner gradient estimates and use the GPU more efficiently. Smaller batches introduce more noise, which often leads to better generalization. It's one of those knobs where the right answer is "it depends," but 32 to 256 is the sweet spot for most tasks.

The Learning Rate: One Number to Rule Them All

If you only tune one hyperparameter, tune the learning rate. Nothing else is close.

The learning rate α controls how far you step with each update. It's the single most consequential number in your entire training configuration. Get it wrong and nothing else you do matters.

Set it too high and you overshoot the valley entirely. The loss doesn't decrease — it oscillates wildly or explodes to infinity. Your terminal fills with NaN. Set it too low and you inch forward so slowly that training takes days when it should take hours, and you might never escape a mediocre valley because you lack the energy to climb over the ridge to a better one.

# Too high — the optimizer overshoots and diverges
w = 5.0
for step in range(5):
    w = w - 1.1 * (2 * w)  # α = 1.1
    print(f"Step {step}: w = {w:.1f}")
# w: 5.0 → -6.0 → 7.2 → -8.6 → 10.4   (exploding!)

# Too low — barely moves
w = 5.0
for step in range(5):
    w = w - 0.001 * (2 * w)  # α = 0.001
    print(f"Step {step}: w = {w:.4f}")
# w: 5.0 → 4.990 → 4.980 → 4.970 → 4.960  (glacial)

# Sweet spot
w = 5.0
for step in range(5):
    w = w - 0.3 * (2 * w)  # α = 0.3
    print(f"Step {step}: w = {w:.3f}")
# w: 5.0 → 2.0 → 0.8 → 0.32 → 0.128       (converging!)

There's a bit of deep learning folklore worth knowing. Andrej Karpathy, in his code and lectures, used 3e-4 (0.0003) as a default learning rate for Adam so often that the community started calling it the "Karpathy constant." It's not magic — it's an empirically good starting point that happens to work across a surprising range of architectures. For SGD with momentum, the common starting point is around 0.1, which you then decay over training.

💡 The Learning Rate Finder

A practical trick: train for one epoch while exponentially increasing the learning rate from very small (1e-7) to very large (10). Plot loss versus learning rate. The loss will decrease, hit a minimum, then explode. Pick the learning rate at the steepest point of descent — roughly one order of magnitude below where the loss starts exploding. This takes minutes and saves hours of guessing.
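Here's a minimal sketch of that idea in PyTorch (not a polished implementation; model, loader, and criterion are placeholders you'd supply):

import copy
import torch

def lr_find(model, loader, criterion, lr_min=1e-7, lr_max=10.0, num_steps=100):
    model = copy.deepcopy(model)                   # don't touch the real model
    opt = torch.optim.AdamW(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1 / num_steps)   # exponential ramp per step
    lrs, losses = [], []
    for _, (X, y) in zip(range(num_steps), loader):
        loss = criterion(model(X), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        lrs.append(opt.param_groups[0]["lr"])
        losses.append(loss.item())
        for group in opt.param_groups:
            group["lr"] *= gamma                   # crank the learning rate up
        if losses[-1] > 4 * min(losses):           # loss exploded; stop early
            break
    return lrs, losses   # plot loss vs. lr; pick ~10x below the explosion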

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a working mental model of ML optimization: there's a loss landscape defined by your parameters, gradient descent walks downhill by following the slope, the learning rate controls the step size, and mini-batch training is the practical way to do it at scale. That model is genuinely useful. You can read training logs, diagnose divergence, and understand why someone is tuning a learning rate.

It doesn't tell the whole story, though. Vanilla gradient descent oscillates in narrow valleys, treats every parameter identically even when they have wildly different update needs, and uses a fixed learning rate that's never quite right for the whole training run. The optimizers that actually power modern deep learning — Adam, AdamW, schedules — were invented because people ran into exactly these limitations and got frustrated enough to fix them.

The short version is: Adam adapts the learning rate per parameter and almost always works. There. You're 80% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Momentum: The Heavy Ball

Vanilla SGD has an annoying habit. The loss landscape is often shaped like an elongated valley — steep walls on either side, gentle slope along the floor. SGD oscillates back and forth across the steep walls while making painfully slow progress along the valley floor. It's wasting most of its energy bouncing side-to-side instead of rolling forward.

Picture a heavy ball rolling downhill. In directions where gravity consistently pulls the same way (the valley floor), the ball builds up speed — it accelerates. In directions where it keeps getting pushed back and forth (the valley walls), the pushes cancel out and the oscillations dampen.

Momentum captures this idea. Instead of stepping based only on today's gradient, you keep a running average — a velocity — that accumulates past gradients:

v = 0            # velocity, same shape as parameters
β = 0.9          # momentum coefficient

for batch_X, batch_y in batches:
    g = compute_gradient(batch_X, batch_y, θ)
    v = β * v + g           # 90% old velocity + new gradient
    θ = θ - α * v           # step using velocity, not raw gradient

The coefficient β is typically 0.9, which means the velocity is roughly an exponential moving average of the last 10 gradients. Consistent gradients accumulate into speed. Oscillating gradients cancel out. Same physics as a low-pass filter — it smooths the noise and lets the signal through.
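You can watch the damping happen in a small numpy experiment on an elongated valley, loss = 0.5 * (100*x^2 + y^2): steep across the valley (the x direction), gentle along the floor (y). The constants are purely illustrative:

import numpy as np

def grad(p):                          # gradient of 0.5 * (100*x^2 + y^2)
    return np.array([100 * p[0], p[1]])

alpha = 0.015

# Vanilla gradient descent: oscillates across x, crawls along y
p = np.array([1.0, 1.0])
for _ in range(100):
    p = p - alpha * grad(p)
print("vanilla :", 0.5 * (100 * p[0]**2 + p[1]**2))

# Momentum: the x-pushes cancel out, the y-progress compounds
p, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    v = 0.9 * v + grad(p)
    p = p - alpha * v
print("momentum:", 0.5 * (100 * p[0]**2 + p[1]**2))
# momentum ends at a substantially lower loss than vanilla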

Our heavy ball analogy will come back. It's the thread that connects momentum, Nesterov, and eventually Adam — they're all variations on "what if the ball were smarter?"

Nesterov: Looking Before You Leap

Standard momentum computes the gradient at your current position, then applies velocity. Nesterov accelerated gradient does it in reverse: first take a step in the direction of your accumulated velocity (a "lookahead"), then compute the gradient at that future position.

v = 0
β = 0.9

for batch_X, batch_y in batches:
    θ_lookahead = θ - α * β * v              # peek ahead
    g = compute_gradient(batch_X, batch_y, θ_lookahead)
    v = β * v + g
    θ = θ - α * v

Why does this matter? If momentum is carrying our heavy ball past the valley floor and up the other side, standard momentum doesn't notice until the ball is already climbing. Nesterov evaluates the gradient at where the ball is about to be, so it sees the uphill slope sooner and applies the brakes earlier. It's a small tweak — one line of code — but it tightens convergence, especially for well-behaved (convex) problems. Use it when your framework offers it; it's free.

Adaptive Learning Rates: AdaGrad, RMSprop, Adam

Here's the insight that changed everything: different parameters need different learning rates.

Think about word embeddings. The embedding vector for "the" gets updated on nearly every batch — it appears in almost every sentence. The embedding for "defenestrate" might get updated once in a thousand batches. If you use the same learning rate for both, you're either updating "the" too aggressively or "defenestrate" too conservatively. Ideally, you'd turn down the knob for frequently-updated parameters and turn it up for rare ones. Adaptive methods do exactly this, automatically.

AdaGrad (2011) was the first to try. It tracks the sum of squared gradients for each parameter. Parameters that have received large gradients accumulate a large sum, which divides the learning rate down. Parameters with small, infrequent gradients keep a higher effective learning rate.

# AdaGrad
cache = 0       # sum of squared gradients, per parameter
ε = 1e-8        # tiny constant to avoid division by zero

for batch_X, batch_y in batches:
    g = compute_gradient(batch_X, batch_y, θ)
    cache = cache + g ** 2
    θ = θ - α * g / (sqrt(cache) + ε)

For sparse problems — NLP, recommender systems — AdaGrad is beautiful. Rare features get amplified. But there's a fatal flaw: that cache only grows. It never shrinks. Over thousands of steps, the accumulated sum gets so large that the effective learning rate drops to near zero and learning halts entirely. AdaGrad dies on long training runs.

RMSprop fixes this with one change: instead of accumulating all past squared gradients, use an exponential moving average that gradually forgets old gradients. The learning rate adapts but never decays to zero.

# RMSprop
cache = 0
β = 0.9         # decay rate (Hinton's original lecture used 0.9)

for batch_X, batch_y in batches:
    g = compute_gradient(batch_X, batch_y, θ)
    cache = β * cache + (1 - β) * g ** 2    # EMA, forgets old gradients
    θ = θ - α * g / (sqrt(cache) + ε)

RMSprop was never published in a paper. Geoffrey Hinton introduced it in a Coursera lecture in 2012, which might be the most consequential slide deck in deep learning history. It works. It became the direct ancestor of Adam.

Adam (Adaptive Moment Estimation, 2014) puts momentum and RMSprop together. It maintains two running averages: the first moment m (an EMA of the gradients themselves, like momentum) and the second moment v (an EMA of the squared gradients, like RMSprop). Together, they give you both a smoothed gradient direction and a per-parameter learning rate.

# Adam — the full algorithm
m = 0          # first moment (momentum)
v = 0          # second moment (RMSprop)
β1 = 0.9       # decay rate for first moment
β2 = 0.999     # decay rate for second moment
ε = 1e-8
t = 0          # timestep

for batch_X, batch_y in batches:
    t += 1
    g = compute_gradient(batch_X, batch_y, θ)

    m = β1 * m + (1 - β1) * g          # momentum-style update
    v = β2 * v + (1 - β2) * g ** 2     # RMSprop-style update

    m_hat = m / (1 - β1 ** t)          # bias correction
    v_hat = v / (1 - β2 ** t)          # bias correction

    θ = θ - α * m_hat / (sqrt(v_hat) + ε)

Those bias correction terms (m_hat, v_hat) deserve a sentence. Since m and v start at zero, they're biased toward zero in the early steps. Dividing by (1 - β^t) inflates them back to their true scale. It's a bigger correction early on and vanishes as t grows.

My favorite thing about Adam is that, aside from high-level explanations like the one I gave, no one is completely certain why it works so well across such a wide range of problems. The default hyperparameters (β1=0.9, β2=0.999, ε=1e-8, with a learning rate around the paper's 1e-3 or the folklore 3e-4) work on everything from tiny classifiers to massive language models. That's unusual. Most things in ML need tuning. Adam somehow doesn't — at least not much.

AdamW: The One You Should Actually Use

There's a subtle but important bug in how Adam interacts with weight decay, a regularization technique that penalizes large parameter values to prevent overfitting. In standard Adam, weight decay is applied through the gradient — you add a term to the gradient that pushes parameters toward zero. The problem is that Adam's adaptive scaling then distorts that push. Parameters with large historical gradients get less regularization, and parameters with small historical gradients get more. The regularization becomes inconsistent and unpredictable.

In 2019, Loshchilov and Hutter published "Decoupled Weight Decay Regularization," and the fix is almost embarrassingly elegant: apply the weight decay directly to the parameters, separate from the gradient update, so it's never touched by the adaptive scaling.

# AdamW — decoupled weight decay
# ... (same m, v, bias correction as Adam) ...

# The key difference: weight decay is DECOUPLED
θ = θ - α * m_hat / (sqrt(v_hat) + ε) - α * λ * θ
#                                        ^^^^^^^^^^^
#           applied directly to params, NOT through the gradient

The result is consistent, predictable regularization that matches what you'd expect from weight decay in plain SGD. Empirically, AdamW generalizes better than Adam with L2 regularization across a wide range of tasks. It's the default optimizer in Hugging Face, the default in most transformer training recipes, and what you should use unless you have a specific reason not to. PyTorch ships it as torch.optim.AdamW.

⚠️ SGD vs. Adam: The Debate That Won't Die

Adam converges faster and needs less tuning. SGD with a carefully crafted learning rate schedule sometimes generalizes slightly better on vision tasks (ResNets, for instance). In NLP and transformer land, AdamW dominates unchallenged. I'll be honest — the field doesn't have a clean, unified answer here. The practical advice: start with AdamW. If you're squeezing the last 0.3% accuracy on an image classification benchmark and have the compute budget to tune schedules, experiment with SGD + momentum + cosine annealing.
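If you do go the SGD route, the standard vision-style setup looks something like this (the values are typical starting points, not gospel):

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      nesterov=True, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)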

Learning Rate Schedules

A fixed learning rate is almost never optimal. The intuition maps directly back to our foggy landscape: early in training, you're far from any valley and need big steps to explore broadly. Late in training, you're close to a valley floor and need small, careful steps to settle in without overshooting. A schedule that starts high and decays captures this perfectly.

Step decay is the oldest approach: multiply the learning rate by a factor (typically 0.1) at fixed epochs. "Divide LR by 10 at epochs 30, 60, 90" was the standard recipe in older vision papers. It works, but requires manually choosing when to drop.

Cosine annealing smoothly decays the learning rate following a cosine curve from the initial value to near zero. No abrupt jumps, no schedule to hand-design. It's become the default for most modern training recipes because it works well and asks nothing of you.

Warmup addresses a problem at the very start of training. When parameters are freshly initialized — essentially random — the gradients are unreliable. Taking large steps based on unreliable gradients is reckless. Warmup starts with a tiny learning rate and linearly increases it over the first few hundred or thousand steps, letting the model stabilize before you open the throttle. I'm still developing my intuition for exactly why warmup helps as much as it does, but empirically it's essential for transformer training.

The modern recipe combines them: linear warmup for 5–10% of total steps, then cosine decay for the rest.

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Warmup for 1000 steps, then cosine decay for 9000 steps
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=9000, eta_min=1e-6)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[1000])

One-cycle policy, from Leslie Smith, takes a different approach: ramp the learning rate up to a high value, then anneal it down below the starting value, all within one training run. It's counterintuitive — why would you increase the learning rate mid-training? — but it empirically matches or beats carefully tuned step schedules, and it eliminates all the manual scheduling decisions. PyTorch ships it as OneCycleLR.
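In PyTorch, setting it up takes two lines (the max_lr here is illustrative, not a recommendation):

from torch.optim.lr_scheduler import OneCycleLR

scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       total_steps=len(train_loader) * num_epochs)
# step it after every optimizer.step(), not once per epoch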

The Terrain Gets Weird: Saddle Points and Flat Valleys

Time to return to our foggy landscape, because the terrain is stranger than we first imagined.

For years, people worried that gradient descent would get trapped in bad local minima — valleys that aren't the deepest, but where the gradient is zero so the optimizer has no reason to leave. It turns out, in high dimensions, this fear is mostly misplaced.

Here's why. For a critical point (where the gradient is zero) to be a true local minimum, the loss must curve upward in every single direction. In a model with a million parameters, that means a million directions must all curve up simultaneously. The probability of that happening by chance is vanishingly small. Almost every critical point in a high-dimensional loss landscape is a saddle point — a minimum in some directions but a maximum in others, like the center of a horse saddle or a Pringles chip.

Saddle points do slow down training, because the gradient near a saddle point is close to zero, so steps become tiny. But SGD's noise provides enough random perturbation to eventually nudge the optimizer off the saddle and back onto a descending path. This is another reason to appreciate the noise in stochastic gradients — it's rescue from saddle points.

Sharp vs. Flat Minima

Not all valleys are equally good. A sharp minimum sits at the bottom of a narrow, steep-walled crevice. A flat minimum sits in a broad, gently-sloped basin. Both achieve low training loss, but they behave very differently on new data.

A sharp minimum is fragile. Tiny changes in the parameters cause large changes in the loss. When your training data and your real-world data differ even slightly — and they always do — the performance of a model sitting in a sharp minimum collapses. A flat minimum is robust. Small perturbations barely change the loss, so the model's predictions stay reliable on data it hasn't seen.

This connects to a practical observation that confused me for a while: larger batch sizes tend to find sharp minima (worse generalization), while smaller batches tend to find flat ones (better generalization). The noise from small batches acts like a constant jostling that prevents the optimizer from settling into narrow crevices — it can only rest in wide basins where the noise doesn't bounce it out. It's why cranking up the batch size for speed sometimes hurts your final model, even though the gradient estimate is more accurate.

This area is still under active research. Sharpness-Aware Minimization (SAM), introduced in 2021, explicitly seeks flat minima by optimizing not for the lowest loss at a point, but for the lowest loss in the worst-case neighborhood around that point. It's shown consistent improvements, though at the cost of roughly doubling the compute per step. The theory connecting sharpness to generalization is compelling but not settled — there are researchers who argue the connection is more nuanced than the clean story I told above.
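For the curious, one SAM step in pseudocode looks roughly like this (a sketch of the idea, not the paper's exact procedure):

# One SAM step (sketch); ρ is the neighborhood radius
g = compute_gradient(batch_X, batch_y, θ)
θ_adv = θ + ρ * g / norm(g)               # climb to the worst nearby point
g_adv = compute_gradient(batch_X, batch_y, θ_adv)
θ = θ - α * g_adv                         # descend using the worst-case gradient

The two gradient computations per step are where the doubled compute comes from.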

Gradient Clipping: Insurance Against Catastrophe

Sometimes, despite your best efforts, the gradient explodes. A single bad batch produces a gradient with a norm of 10,000 when the usual is around 1. One update with that gradient destroys everything the model has learned. This is especially common in recurrent neural networks and early-stage transformer training, where long sequences create multiplicative chains that amplify gradients.

The fix is embarrassingly simple: if the gradient's total magnitude exceeds a threshold, scale the whole thing down proportionally. The direction is preserved; the magnitude is capped.

# Gradient clipping in PyTorch — one line
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Under the hood:
# 1. Compute total_norm = sqrt(sum(p.grad.norm()**2 for all params))
# 2. If total_norm > max_norm:
#        scale = max_norm / total_norm
#        multiply every gradient by scale

A clip value of 1.0 is a safe default. It's used in virtually every serious training codebase — GPT, BERT, LLaMA, you name it. Think of it as insurance: most of the time the gradients are fine and the clipping does nothing, but when a rogue gradient appears, it prevents catastrophe. The cost is a single norm computation per step. Free.

Backpropagation Is Not Gradient Descent

These two are confused with each other constantly, so let's draw a bright line.

Gradient descent answers: "Given the gradient of the loss with respect to every parameter, how do I update the parameters?" It's the optimization algorithm — the stepping-downhill part.

Backpropagation answers: "How do I compute that gradient efficiently for a neural network?" It's the calculus trick — an application of the chain rule that propagates error signals backward through the layers.

Without backpropagation, you could still do gradient descent. You'd compute gradients by finite differences: nudge each parameter by a tiny amount ε, measure the change in loss, divide. For a model with 100 million parameters, that means 100 million forward passes per gradient step. Backpropagation computes the exact same gradient in one backward pass. That's the transformation — from O(n) forward passes to O(1). It's what makes training neural networks feasible at all.
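You can watch both approaches agree in a few lines of PyTorch (a toy check, not something you'd do in practice):

import torch

x = torch.tensor([3.0, -2.0], requires_grad=True)
loss = (x ** 2).sum()
loss.backward()                  # backprop: one backward pass, every gradient
print(x.grad)                    # tensor([ 6., -4.])

# Finite differences: one extra forward pass PER parameter
eps = 1e-4
for i in range(2):
    x2 = x.detach().clone()
    x2[i] += eps
    approx = ((x2 ** 2).sum() - (x.detach() ** 2).sum()) / eps
    print(approx.item())         # ≈ 6.0, then ≈ -4.0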

We'll go deep on the mechanics of backpropagation in the deep learning chapters. For now, the division of labor is the thing to remember: gradient descent decides what to do with gradients, backpropagation decides how to compute them.

Second-Order Methods (Brief Detour)

Everything so far has been first-order — we only used the gradient, which is the first derivative. Second-order methods additionally use the curvature, encoded in a matrix called the Hessian. The Hessian tells you not which direction is downhill, but how the slope itself is changing — whether you're approaching a cliff, a gentle valley, or a flat plateau.

Back on our foggy landscape: first-order methods feel the tilt under your feet. Second-order methods also feel whether the ground is curving — whether your next step is going to steepen or flatten. That's strictly more information, so you'd expect better steps. And you'd be right.

Newton's method uses the full Hessian to take "perfect" steps that account for curvature. Near the minimum, it converges quadratically — dramatically faster than any first-order method. The problem is the Hessian. For a model with n parameters, it's an n × n matrix. A model with 10 million parameters produces a Hessian with 100 trillion entries. Not happening.

L-BFGS (Limited-memory BFGS) approximates the Hessian using the last k gradient updates without ever materializing the full matrix. It works brilliantly for small-to-medium models and is the default solver for scikit-learn's logistic regression. For deep learning, L-BFGS is impractical because it doesn't handle the noise from mini-batch gradients well.

The practical bottom line: for models under ~100K parameters with full-batch gradients, L-BFGS will probably converge faster than Adam. For anything larger or mini-batch-based, stick with first-order methods. Most of us will never need second-order optimization, but knowing it exists and why it exists fills in a conceptual gap.
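For a taste, PyTorch does ship torch.optim.LBFGS. Here it is on our toy loss from earlier (a sketch; note that LBFGS requires a closure that recomputes the loss):

import torch

w = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.LBFGS([w], lr=1.0)

def closure():
    opt.zero_grad()
    loss = ((w - 150) ** 2).sum()   # same toy loss, minimum at w = 150
    loss.backward()
    return loss

for _ in range(5):
    opt.step(closure)
print(w.item())                      # ≈ 150.0 after just a few steps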

The Full Recipe

Everything we've built — the gradient, the learning rate, momentum, adaptive rates, schedules, clipping — comes together in a training loop that fits on a screen. This is what modern deep learning training actually looks like:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = YourModel()

# AdamW: the default for almost everything
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,             # the "Karpathy constant"
    weight_decay=0.01,   # mild, decoupled regularization
    betas=(0.9, 0.999)   # momentum + RMSprop decay rates
)

# Schedule: warmup + cosine decay
total_steps = len(train_loader) * num_epochs
warmup_steps = int(0.05 * total_steps)  # 5% warmup
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_steps])

# Training loop
for epoch in range(num_epochs):
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_X), batch_y)
        loss.backward()                                        # backprop: compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip: insurance
        optimizer.step()                                       # gradient descent: update params
        scheduler.step()                                       # schedule: adjust learning rate

That's the entire optimization stack. AdamW + warmup + cosine decay + gradient clipping. It's what GPT uses. It's what BERT uses. It's what Vision Transformers use. Understand each piece — and you do now — and you can read any training codebase and know exactly what's happening.

Resources

If you're still with me, thank you. I hope it was worth it.

We started with a model that predicted every house costs $47, defined what "wrong" means with a loss function, walked through a foggy landscape feeling for slopes, discovered that noise is a feature, watched a heavy ball teach us about momentum, traced the lineage from AdaGrad to Adam to AdamW, and learned that the terrain itself — saddle points, sharp crevices, flat basins — shapes what the optimizer finds. My hope is that the next time you see a training curve flatline or a loss explode to NaN, instead of randomly poking at hyperparameters, you'll have a mental model of what's happening under the hood and where to look first.

A few resources that helped me build this understanding, each worth your time: