Optimization Algorithms & Schedules
Training a neural network means navigating a rugged, high-dimensional loss landscape to find weights that generalize. SGD is where it starts — compute gradient, step. Momentum gives SGD a memory so it stops bouncing around ravines. Adam adapts a separate learning rate for every single parameter using running statistics of the gradient, and it’s why your model converges without heroic LR tuning. AdamW fixes a subtle but important bug in how Adam applies weight decay, and it’s what you should actually be using. For the learning rate schedule: cosine annealing with linear warmup is the modern default. The optimizer picks the direction; the schedule controls the stride length. AdamW + cosine + warmup handles 90% of what you’ll face.
I’ll be honest — for the longest time, I treated the optimizer as a dropdown menu. Pick Adam, set lr=1e-3, move on to the interesting stuff. I figured the optimizer was a solved problem. It isn’t. The difference between “model sort of converges” and “model converges to something that actually generalizes” lives in the details of how you take gradient steps and how you shrink them over time. I only understood this after watching the same architecture produce wildly different test accuracies depending on which optimizer and schedule I paired it with.
So here’s what we’re going to do. We’ll start with the simplest possible picture — a loss function as a piece of terrain — and walk from vanilla SGD all the way through to the modern AdamW + cosine + warmup recipe that powers transformers and LLMs. Along the way we’ll use a running example: training a tiny image classifier that sorts pictures of cats, dogs, and birds. We’ll start with a toy 2-parameter model so we can literally visualize the landscape, then scale up when intuitions are solid.
Before we start: this section assumes you already know what a gradient is and how backpropagation computes it. If those feel shaky, Chapter 2 (Calculus) and the backpropagation section earlier in this book have you covered. We won’t re-derive the chain rule here.
Let’s walk the landscape.
What We’ll Cover
- The Loss Landscape — terrain, valleys, and why flat beats sharp
- Vanilla SGD — the simplest step
- SGD + Momentum — giving the ball a memory
- The Adaptive Revolution: Adagrad → RMSprop
- Adam — momentum meets adaptivity
- AdamW — what you should actually use
- Rest stop
- Learning Rate Schedules — why they matter
- Warmup — surviving the first few steps
- Cosine Annealing — the modern default decay
- One-Cycle Policy — super-convergence
- Beyond the Defaults: LAMB, LARS, SWA
- The Practical Recipe
- Wrap-up & Resources
The Loss Landscape
Picture this: you have a tiny model with two learnable weights, w1 and w2, classifying images into cat, dog, or bird. The loss function — the number that tells you how wrong your predictions are — is a surface hovering above the (w1, w2) plane. Every point on that plane is a different pair of weights; the height is the loss at those weights. Training means finding a low point.
For our two-parameter toy, you can literally plot this surface. It looks like terrain — hills, valleys, saddle points, plateaus. A saddle point is a spot that looks like a minimum if you face one direction but a maximum if you turn 90 degrees, like the middle of a horse saddle. Real networks have millions of parameters, and the terrain is absurdly high-dimensional, but the intuition transfers: the optimizer is a hiker navigating this landscape, and different optimizers use different strategies to find low ground.
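Here's one way to draw such a surface (a minimal sketch; the loss below is a made-up function chosen to have two valleys and some bumps, not a real model's loss):

import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
# Toy "terrain": a double well in w2 plus ripples in w1
loss = 0.5 * w1**2 + 0.25 * w2**4 - 0.5 * w2**2 + 0.3 * np.sin(3 * w1)

plt.contourf(w1, w2, loss, levels=30, cmap='viridis')
plt.colorbar(label='loss')
plt.xlabel('w1'); plt.ylabel('w2')
plt.show()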
One insight matters for everything that follows: flat minima generalize better than sharp ones. A flat minimum is a wide valley — if you nudge the weights slightly (which is what happens when you evaluate on new data the model hasn’t seen), the loss barely changes. A sharp minimum is a narrow spike pointing down — any perturbation sends the loss shooting up. Much of optimizer and schedule design is, whether explicitly or not, about steering the model toward these wide, flat valleys. We’ll see how momentum, noise from mini-batches, and learning rate schedules all conspire to make this happen.
One more thing about the landscape: it’s non-convex. There is no single global minimum you can march toward with a clean formula. There are many local minima, many saddle points, and vast plateaus where the gradient is nearly zero. The good news is that in high dimensions, most local minima turn out to be roughly as good as the global one. The bad news is that saddle points and plateaus can trap a naïve optimizer for thousands of steps. That’s our motivation. Let’s start hiking.
Vanilla SGD: The Simplest Step
Stochastic Gradient Descent is the starting point for every optimizer in deep learning. The idea is almost laughably simple: compute the gradient of the loss with respect to each weight on a mini-batch of data, then step in the opposite direction.
# Vanilla SGD update for one parameter
theta = theta - lr * gradient
That’s the entire algorithm. For our cat/dog/bird classifier, each training step grabs a mini-batch of, say, 32 images, computes how wrong the current weights are (the loss), backpropagates to get the gradient for every weight, and nudges each weight downhill by lr * gradient.
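In PyTorch, that loop looks like this (a minimal sketch; model, train_loader, and the cross-entropy loss are assumed to exist for our three-class problem):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for images, labels in train_loader:        # mini-batches of, say, 32 images
    optimizer.zero_grad()                  # clear gradients from the last step
    loss = criterion(model(images), labels)
    loss.backward()                        # backprop: gradient for every weight
    optimizer.step()                       # theta = theta - lr * gradient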
It works. Models have been trained this way. But spend a few hours watching a vanilla SGD training run and three problems become painfully obvious.
First, every parameter gets the same learning rate. A gradient for a rarely-activated feature detector in the early layers and a gradient for a heavily-used classifier weight in the final layer receive identical step sizes. That’s like telling a surgeon and a lumberjack to use the same force with their tools. Second, ravine oscillation: if the loss surface is steep in one direction and shallow in another (and it almost always is), SGD bounces back and forth across the steep walls while making agonizingly slow progress along the valley floor. Third, noise. Mini-batch gradients are noisy estimates of the true gradient. Vanilla SGD amplifies that noise because it has no memory — each step is computed from scratch, with no awareness of what previous steps looked like.
Back to our terrain analogy: vanilla SGD is a hiker who looks at the slope directly beneath their feet, takes one step downhill, then forgets everything and does it again. No memory of which direction they’ve been traveling. No sense of scale. That’s enough to motivate the first major improvement.
SGD + Momentum: Giving the Ball a Memory
Imagine replacing our forgetful hiker with a heavy ball rolling downhill. The ball doesn’t stop and re-evaluate at every point — it builds up speed in the direction it’s been consistently rolling. If the terrain oscillates side-to-side (the ravine problem), the ball’s momentum cancels out those sideways jolts. If the terrain slopes consistently in one direction, the ball accelerates.
That’s momentum. We introduce a velocity vector v that accumulates an exponential moving average of past gradients. Each step, the velocity carries forward a fraction of where it was heading before, plus the new gradient.
# SGD with momentum
v = momentum * v + gradient # accumulate velocity
theta = theta - lr * v # step using velocity, not raw gradient
The momentum coefficient is typically 0.9, meaning 90% of the previous velocity survives into the current step. The effect on our cat/dog/bird classifier is dramatic: in dimensions where the gradient keeps flipping sign (the ravine walls), the velocity averages out to near zero, damping the oscillation. In dimensions where the gradient points consistently the same way (the valley floor), the velocity builds up, amplifying progress. I was genuinely surprised the first time I plotted SGD vs. SGD+momentum trajectories on a toy surface — momentum turned a zigzag path into a nearly straight line to the minimum.
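If you want to reproduce that experiment, here's a minimal sketch (the ravine-shaped quadratic and all the constants are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

def grad(w):
    # Gradient of the toy ravine loss 0.5 * (10 * w1**2 + 0.1 * w2**2)
    return np.array([10.0 * w[0], 0.1 * w[1]])

def run(momentum, lr=0.18, steps=60):
    w, v = np.array([-2.0, 4.0]), np.zeros(2)
    path = [w.copy()]
    for _ in range(steps):
        v = momentum * v + grad(w)
        w = w - lr * v
        path.append(w.copy())
    return np.array(path)

# momentum=0.5 keeps this toy stable at this lr; 0.9 would want a smaller lr
for mu, label in [(0.0, 'SGD'), (0.5, 'SGD + momentum')]:
    path = run(mu)
    plt.plot(path[:, 0], path[:, 1], marker='.', label=label)
plt.xlabel('w1'); plt.ylabel('w2'); plt.legend(); plt.show()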
Nesterov Accelerated Gradient
Yurii Nesterov contributed a clever twist: instead of computing the gradient at your current position, first look ahead to where momentum would carry you, and compute the gradient there.
# Nesterov momentum (conceptual)
lookahead = theta - lr * momentum * v   # peek where momentum would carry us
gradient = compute_gradient(lookahead)
v = momentum * v + gradient
theta = theta - lr * v
Why does this help? Standard momentum can overshoot — the ball rolls past the minimum because of accumulated velocity. Nesterov says: “Before you commit to this step, peek ahead and see if you’re about to go uphill. If so, pull back.” It gives the optimizer a corrective, anticipatory character. In practice, the improvement over standard momentum is small but consistent. PyTorch makes it a single flag:
optimizer = torch.optim.SGD(
model.parameters(), lr=0.1, momentum=0.9, nesterov=True
)
SGD with momentum (Nesterov or not) remains the workhorse for training vision models. If you see a ResNet or EfficientNet paper, odds are they trained with SGD+momentum and a carefully tuned learning rate schedule. It tends to find flatter minima than Adam-family optimizers, which often translates to better generalization. The catch is you have to invest effort in tuning the LR schedule — momentum SGD is high-reward but also high-effort. That trade-off is exactly what motivated the next wave of optimizers.
Limitation that motivates what’s next: momentum solves the oscillation problem, but every parameter still gets the same learning rate. A parameter that gets bombarded with large, frequent gradients and a parameter that receives tiny, rare gradient signals are both stepped by lr. We need per-parameter adaptation.
The Adaptive Revolution: From Adagrad to RMSprop
In 2011, John Duchi and collaborators published Adagrad (Adaptive Gradient), and it changed how people thought about learning rates. The core idea: instead of one global learning rate, maintain a per-parameter learning rate that adapts based on how large that parameter’s gradients have been historically.
Adagrad accumulates the sum of squared gradients for each parameter into a variable G. The update divides the learning rate by √G:
# Adagrad (conceptual, one parameter)
G = G + gradient ** 2
theta = theta - (lr / (sqrt(G) + epsilon)) * gradient
For our cat/dog/bird classifier, consider two parameters: one in the final dense layer that sees strong gradients every batch, and one in an early convolutional filter that only activates on rare edge patterns. Adagrad gives the first parameter a small effective LR (its G is huge) and the second parameter a large effective LR (its G is small). The optimizer automatically allocates more learning to the parameters that need it. This was a revelation for sparse problems — NLP embeddings where most words appear rarely, or recommender systems where most items have few interactions.
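To put rough numbers on that, here's a back-of-the-envelope sketch (the gradient histories are made up for illustration):

import math

lr, eps = 0.1, 1e-8
G_frequent = sum(g ** 2 for g in [1.0] * 100)  # big gradients every batch
G_rare = sum(g ** 2 for g in [0.1] * 5)        # small, occasional gradients

print(lr / (math.sqrt(G_frequent) + eps))  # ~0.01: cautious steps
print(lr / (math.sqrt(G_rare) + eps))      # ~0.45: aggressive exploration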
But Adagrad has a fatal flaw: G only grows. It’s a sum, never reset. Over the course of training, G gets larger and larger, the effective learning rate shrinks toward zero, and the optimizer effectively stops learning. For dense networks trained over many epochs, this is a deal-breaker.
Enter Geoffrey Hinton, who in 2012 proposed the fix in — of all places — an unpublished Coursera lecture. (To this day, the canonical reference for RMSprop is a set of lecture slides, not a peer-reviewed paper. I find that delightful.) The fix is elegant: instead of summing all squared gradients forever, use an exponential moving average of squared gradients. Recent history matters more; ancient history fades.
# RMSprop (conceptual, one parameter)
G = decay * G + (1 - decay) * gradient ** 2
theta = theta - (lr / (sqrt(G) + epsilon)) * gradient
With decay = 0.9 (the value Hinton's slides suggest), a gradient's weight in G shrinks by 10% per step, so gradients from 100 steps ago contribute almost nothing. The effective learning rate can now increase again if recent gradients become smaller. The dying-LR problem is solved. RMSprop became the default adaptive optimizer for a few years and still shows up in reinforcement learning codebases and RNN training today.
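To see the fix in numbers, here's a toy comparison (a sketch with made-up gradient histories): feed an Adagrad-style sum and an RMSprop-style moving average the same burst of large gradients followed by small ones, and only RMSprop's effective learning rate recovers.

import math

lr, decay, eps = 0.01, 0.9, 1e-8
grads = [1.0] * 50 + [0.01] * 50   # big gradients early, tiny gradients later

G_ada, G_rms = 0.0, 0.0
for g in grads:
    G_ada += g ** 2                               # Adagrad: sum, never forgets
    G_rms = decay * G_rms + (1 - decay) * g ** 2  # RMSprop: moving average

print(lr / (math.sqrt(G_ada) + eps))   # ~0.0014: stuck small forever
print(lr / (math.sqrt(G_rms) + eps))   # ~0.14: recovered as history fades

But RMSprop was about to be superseded by something that combined both ideas we've seen so far — momentum and per-parameter adaptivity — into a single algorithm.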
Adam: Momentum Meets Adaptivity
Adam (Adaptive Moment Estimation, Kingma & Ba 2014) is probably the most widely used optimizer in deep learning history. Understanding why it works is more important than memorizing its four lines of math, so let’s build it from the pieces we already have.
Adam tracks two quantities for every parameter. The first moment m is an exponential moving average of the gradient itself — this is momentum. It tells you the direction the gradient has been pointing recently. The second moment v is an exponential moving average of the squared gradient — this is the RMSprop piece. It tells you the magnitude of recent gradients for that specific parameter.
# Adam update (simplified, one parameter; t is the 1-indexed step count)
m = beta1 * m + (1 - beta1) * gradient # first moment (momentum)
v = beta2 * v + (1 - beta2) * gradient ** 2 # second moment (RMSprop)
# Bias correction
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
theta = theta - lr * m_hat / (sqrt(v_hat) + epsilon)
The division by √v is where the per-parameter magic happens. Parameters that have been receiving large, frequent gradients accumulate a large v, which shrinks their effective step size — the optimizer takes cautious steps for these already well-constrained weights. Parameters with rare or small gradients have a small v, so their effective step size is larger — the optimizer explores more aggressively where the signal has been weak. This is why Adam converges faster than SGD on most tasks without requiring careful learning rate tuning. It’s doing automatic, per-parameter LR scaling.
Bias Correction: The Startup Fix
Both m and v are initialized to zero. In the first few steps, they’re heavily biased toward zero because they haven’t accumulated enough history. Without correction, the effective learning rate in early training would be far too small — the model would barely move.
The bias correction term m / (1 − β₁ᵗ) compensates. At step t=1 with β₁=0.9, the denominator is 1 − 0.9 = 0.1, so we multiply m by 10×. Within a few dozen steps the correction factor has decayed to essentially 1. Think of it as a startup fix: important for the first steps, irrelevant afterward. I mention this because it comes up in interviews, but it’s not where the real insight lives.
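You can watch the correction fade with a two-line check (values assume β₁ = 0.9):

beta1 = 0.9
for t in [1, 5, 10, 20, 50]:
    print(t, round(1 / (1 - beta1 ** t), 3))
# 1 -> 10.0, 5 -> 2.442, 10 -> 1.535, 20 -> 1.138, 50 -> 1.005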
The Default Hyperparameters
The original paper proposed defaults that have, remarkably, survived a decade of scrutiny:
- β₁ = 0.9 — momentum decay for the first moment
- β₂ = 0.999 — decay for the second moment (slower, so the magnitude estimate is more stable)
- ε = 1e-8 — prevents division by zero
- lr = 0.001 — starting point; almost always needs a schedule on top
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
1e-3 is designed for training from scratch. If you’re fine-tuning a pretrained model (say, a BERT checkpoint for text classification, or a pretrained ResNet for your cat/dog/bird classifier), 1e-3 will blast the pretrained weights out of their good region in the loss landscape. Use 1e-5 to 5e-5 for fine-tuning. I learned this by watching a fine-tuned model produce worse results than the frozen pretrained one, which was a very confusing afternoon.
Limitation: Adam combines momentum and adaptivity beautifully, but there’s a subtle problem lurking in how it interacts with regularization. That brings us to the optimizer you should actually be reaching for.
AdamW: What You Should Actually Use
AdamW is not “Adam with weight decay added on.” It’s a fix for a genuine bug in how the original Adam handles regularization, and the difference matters more than most people realize.
The Bug in Adam + L2
Weight decay is a standard regularization technique: you penalize large weights by adding a term proportional to the weight magnitude to the loss, encouraging the model to keep weights small. In vanilla SGD, weight decay and L2 regularization are mathematically equivalent — adding λ · θ² to the loss produces a gradient term 2λθ, which, once multiplied by the learning rate, shrinks each weight by the same small fraction every step.
But in Adam, this equivalence breaks down. When you implement weight decay as L2 regularization (adding the penalty to the loss), the weight decay gradient 2λθ flows through Adam’s adaptive mechanism — it gets divided by √v along with everything else. The result: parameters with large accumulated squared gradients (large v) have their weight decay weakened. The regularization is no longer uniform. Large weights that happen to live in high-gradient regions escape the penalty. The optimizer is working against the regularizer.
The Fix: Decoupled Weight Decay
Loshchilov and Hutter (2019) identified this and proposed a clean solution: apply weight decay directly to the weights, completely outside the gradient-based update. Don’t let it touch the moments. Don’t let the adaptive scaling interfere with it.
# Adam + L2 (the broken way)
gradient = gradient + weight_decay * theta # decay inside gradient
m = beta1 * m + (1 - beta1) * gradient # decay pollutes moments
v = beta2 * v + (1 - beta2) * gradient ** 2
theta = theta - lr * m_hat / (sqrt(v_hat) + eps) # m_hat, v_hat: bias-corrected m, v
# AdamW (the correct way)
m = beta1 * m + (1 - beta1) * gradient # clean gradient only
v = beta2 * v + (1 - beta2) * gradient ** 2
theta = theta - lr * m_hat / (sqrt(v_hat) + eps) # adaptive step
theta = theta - lr * weight_decay * theta # decay applied separately
The code difference looks trivial — you’re moving one line. But the effect on training is not trivial. With AdamW, the regularization does what you actually intend: it pushes all weights toward zero uniformly, regardless of their gradient history. The model generalizes better because the regularization is consistent.
# AdamW in PyTorch — this should be your default
optimizer = torch.optim.AdamW(
model.parameters(),
lr=1e-3,
weight_decay=0.01 # typical range: 0.01 to 0.1
)
If someone says “Adam” in a modern context, they almost certainly mean AdamW. It is the default optimizer for transformers, large language models, and most modern architectures. Virtually every GPT, BERT, and Vision Transformer paper uses AdamW. Use torch.optim.AdamW, not torch.optim.Adam.
☕ Rest Stop
Take a breath. We’ve covered the entire optimizer lineage that matters: SGD → SGD+momentum → Adagrad → RMSprop → Adam → AdamW. That’s the full arc from “compute gradient, step” to “per-parameter adaptive learning rates with properly decoupled weight decay.”
If you stopped here, you’d have a solid mental model. You could pick an optimizer for any task, explain why Adam converges faster than SGD, articulate the AdamW fix in an interview, and make informed decisions in practice. That’s real, usable knowledge.
But knowing the optimizer is only half the story. The optimizer gives you the direction to step. The learning rate schedule controls how far you step, and how that distance changes over the course of training. Getting the schedule wrong is one of the most common reasons a model trains but doesn’t generalize. That’s where we’re headed next.
Learning Rate Schedules: Why They Matter
Go back to our hiking analogy. Early in training, you’re on a high plateau — the weights are random, the loss is terrible, and you need to take big strides to get anywhere useful. Late in training, you’re near a good valley — big strides would overshoot and send you bouncing between the walls. You need to tiptoe into the flat bottom.
A fixed learning rate is a compromise that’s almost never optimal. Set it high enough for good early progress and you overshoot later. Set it low enough for precise convergence and the first 80% of training is wasted. The learning rate schedule automates the transition: large steps early, small steps later.
I used to skip the schedule during prototyping and wonder why my models plateaued. The schedule isn’t a nice-to-have; it’s load-bearing infrastructure. Let’s look at the pieces.
Warmup: Surviving the First Few Steps
In the very first steps of training, several things are simultaneously terrible. The weights are randomly initialized, so the gradients are essentially noise pointing in arbitrary directions. If you’re using Adam, the moment estimates m and v are initialized to zero and haven’t converged yet — the bias correction helps, but the estimates are still unreliable. If you’re using batch normalization, the running statistics haven’t stabilized.
Taking large steps based on garbage gradients is how you get early training instability: the loss spikes, NaNs appear, and sometimes the model never recovers. Linear warmup is the fix: start with a very small learning rate and linearly ramp it up to the target LR over a set number of steps.
warmup_steps = 1000
target_lr = 1e-3
for step in range(total_steps):
if step < warmup_steps:
lr = target_lr * (step / warmup_steps)
else:
lr = decay_schedule(step) # cosine, step decay, etc.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
At step 0 the LR is essentially zero. At step 500 it’s half the target. At step 1000 it reaches the full target LR, and from there the decay schedule takes over. The ramp gives Adam’s moment estimates time to become meaningful, gives batch norm statistics time to stabilize, and prevents any single garbage gradient from causing irreversible damage.
A typical warmup length is 5–10% of total training steps. For a 100-epoch ImageNet run, that’s roughly 5 epochs. For fine-tuning a transformer, maybe 500–2000 steps. Warmup is especially critical for transformer architectures — the attention mechanism is notoriously sensitive to initialization, and large early learning rates cause attention weights to collapse. The original “Attention Is All You Need” paper used warmup for a reason, and every major LLM training run since has followed suit.
Cosine Annealing: The Modern Default
After warmup, you need a strategy for decaying the learning rate. The most popular modern choice is cosine annealing: the LR follows a cosine curve from its peak value down to a minimum (often near zero).
The formula is straightforward. If t is the current step and T is the total number of steps:
# Cosine annealing formula
lr_t = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
At t=0, the cosine is 1, so lr_t = lr_max. At t=T, the cosine is −1, so lr_t = lr_min. In between, the decay is smooth: the LR lingers near the peak early, falls fastest in the middle, and flattens into a gentle tail at the end.
Why does this work better than the older approach of step decay (dropping the LR by 10× at fixed epoch milestones)? Step decay creates discontinuities — the model is cruising at one LR, then suddenly the rug is pulled. The training dynamics change abruptly, and the first few epochs after a drop are often wasted as the optimizer re-adjusts. Cosine annealing avoids these shocks entirely. The curve is smooth, the model spends substantial time at intermediate learning rates where it’s still making meaningful updates, and the gentle tail at the end gives the model many steps to settle into a flat region of the landscape.
# Cosine annealing in PyTorch
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=total_epochs, # one full cosine half-cycle
eta_min=1e-6 # floor LR
)
for epoch in range(total_epochs):
train_one_epoch(model, optimizer)
scheduler.step()
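In practice you usually want warmup and cosine decay wired into a single scheduler. One way to do that is LambdaLR (a sketch; the step counts are placeholders, and optimizer is whatever optimizer you already built):

import math
import torch

def warmup_cosine(step, warmup_steps=1_000, total_steps=100_000):
    # Returns a multiplier on the optimizer's base LR
    if step < warmup_steps:
        return (step + 1) / warmup_steps                 # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))      # cosine decay to ~0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# Call scheduler.step() once per batch, since the lambda counts steps, not epochs

One practical advantage of this form: the whole schedule lives in scheduler.state_dict(), which pays off when you checkpoint and resume.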
Cosine with Warm Restarts
A variant worth knowing: cosine annealing with warm restarts (SGDR, Loshchilov & Hutter 2017). The LR decays to near zero following a cosine curve, then snaps back to the initial LR and decays again. Each restart is a deliberate shock that lets the optimizer escape a local minimum it may have settled into. Subsequent cycles are often made longer (e.g., doubling the period each time), giving the model progressively more time to converge in later cycles.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
optimizer,
T_0=10, # steps in the first cycle
T_mult=2 # each subsequent cycle is 2x longer
)
Warm restarts are less common than plain cosine in transformer training, but they show up in vision tasks and ensemble methods where you want diverse models from different restart points.
One-Cycle Policy: Super-Convergence
Leslie Smith’s one-cycle policy is a more aggressive schedule built on a counterintuitive insight: using a high learning rate in the middle of training acts as a regularizer. The logic is that a high LR prevents the model from settling into sharp minima early — it forces continued exploration of the loss landscape, favoring the wide, flat valleys that generalize better.
The schedule has three phases: ramp the LR up from a small value to a high peak (warmup), hold near the peak briefly, then anneal it down to a value much smaller than the starting point. Momentum is cycled inversely — high momentum when LR is low, low momentum when LR is high — because the high LR already provides plenty of exploration, and you don’t want momentum amplifying that into instability.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
optimizer,
max_lr=0.01,
total_steps=total_steps,
pct_start=0.3, # 30% of training is the warmup ramp
anneal_strategy='cos' # cosine decay after the peak
)
# CRITICAL: step per batch, not per epoch
for batch in train_loader:
loss = train_step(model, batch, optimizer)
scheduler.step()
OneCycleLR counts steps per batch. If you accidentally call scheduler.step() per epoch instead of per batch, the schedule stretches across the entire run, the LR peak arrives at the wrong time, and the whole thing silently underperforms. This is one of those bugs that doesn’t crash — it produces mediocre results and you blame the model.
Smith found that the one-cycle policy can achieve the same final accuracy in significantly fewer epochs, a phenomenon he called super-convergence. It pairs especially well with SGD+momentum. The high-LR phase is doing something similar to what large-batch training does — it smooths the loss landscape from the optimizer’s perspective, biasing the trajectory toward flat minima.
Beyond the Defaults: LAMB, LARS, and SWA
Three more optimizers worth having in your vocabulary, even if you won’t use them daily.
LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments optimizer for Batch training) address a specific problem: when you scale to very large batch sizes for distributed training, the effective learning rate becomes too aggressive for some layers and too timid for others. LARS (used with SGD) and LAMB (used with Adam) fix this by computing a separate trust ratio for each layer, scaling the learning rate by the ratio of the layer’s weight norm to its gradient norm. This keeps the relative update magnitude consistent across layers regardless of batch size. The LAMB paper famously used this to train BERT with batch sizes up to 65,536 sequences, cutting pre-training from days to about an hour. If you’re not doing large-batch distributed training, you don’t need these — but they show up in system design interviews for large-scale ML.
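As a rough illustration of the trust-ratio idea (a simplified sketch; the real LARS/LAMB updates also fold in weight decay and clipping):

import torch

def trust_ratio(weight: torch.Tensor, grad: torch.Tensor) -> float:
    # Ratio of weight norm to gradient norm for one layer
    w_norm, g_norm = weight.norm(), grad.norm()
    if w_norm > 0 and g_norm > 0:
        return (w_norm / g_norm).item()
    return 1.0

# Per-layer scaled step: effective_lr = base_lr * trust_ratio(W, W.grad)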
Stochastic Weight Averaging (SWA) takes a different approach entirely. Instead of optimizing harder, it averages the weights from multiple points late in training. The idea is grounded in the flat minima principle: an average of several points near a minimum is likely to land closer to the center of a flat region than any individual point. In practice, you train normally until the LR has decayed substantially, then continue training with a constant or cyclical LR while maintaining a running average of the weights. At the end, you use the averaged weights for inference.
# SWA in PyTorch (built-in support)
from torch.optim.swa_utils import AveragedModel, SWALR
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
# swa_start: the epoch where averaging begins, typically ~75% of the way through
for epoch in range(swa_start, total_epochs):
    train_one_epoch(model, optimizer)
    swa_model.update_parameters(model)
    swa_scheduler.step()
# After training: update batch norm stats for swa_model
torch.optim.swa_utils.update_bn(train_loader, swa_model)
SWA is cheap (no extra optimizer state, minimal compute overhead) and can improve generalization by 0.5–1.5% on vision tasks. It’s worth knowing for interviews and for squeezing out the last bit of performance.
The Practical Recipe
After all the theory, here’s what I actually reach for when starting a project. This is the recipe that works for the vast majority of tasks, and it’s what you’ll see in most modern codebases.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = YourModel()
base_lr = 1e-3
optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)

# Cosine decay covers the steps that remain after warmup
warmup_steps = int(0.05 * total_steps)  # 5% linear warmup
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)

step = 0
for epoch in range(num_epochs):
    for batch in train_loader:
        if step < warmup_steps:
            for pg in optimizer.param_groups:
                pg['lr'] = base_lr * (step + 1) / warmup_steps
        loss = train_step(model, batch, optimizer)
        if step >= warmup_steps:
            scheduler.step()  # stepped per batch, so T_max is in steps
        step += 1
That’s AdamW + cosine annealing + linear warmup. It handles 90% of what you’ll encounter. For the remaining 10%, here are domain-specific adjustments:
Computer vision from scratch (ResNets, ConvNets): SGD + momentum (0.9) + Nesterov, LR=0.1, cosine annealing. Vision folks have decades of tuning recipes for SGD, and it still wins on generalization for these architectures. Weight decay around 1e-4.
Computer vision fine-tuning (pretrained backbone + new head): AdamW, LR=1e-4 to 5e-5, linear decay or short cosine. The pretrained weights are already good; you want gentle updates. Consider differential learning rates — lower LR for early layers, higher for the new head (see the sketch after these recipes).
NLP and Transformers: AdamW, LR=1e-4 to 5e-4, cosine decay with linear warmup (5–10% of total steps). This is essentially the recipe from the original BERT paper and it’s barely changed since.
LLM pre-training: AdamW, LR around 3e-4, cosine decay to 10% of peak, 2000-step warmup. β₂ sometimes lowered to 0.95 for stability with very long sequences. Weight decay 0.1.
Quick prototyping: Adam (even plain Adam is fine), LR=1e-3, no schedule. Get something running, validate the data pipeline and architecture, then optimize the training recipe later.
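For the fine-tuning recipe above, differential learning rates are just parameter groups (a sketch; the backbone and head attribute names depend on your model):

optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # gentle: pretrained layers
    {'params': model.head.parameters(), 'lr': 1e-4},      # larger: freshly initialized head
], weight_decay=0.01)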
One last practical note: when you checkpoint a training run, save the optimizer state (optimizer.state_dict()) and the scheduler state (scheduler.state_dict()) alongside the model weights. If you only load model weights, Adam’s moment estimates reset to zero, the scheduler restarts from step 0, warmup runs again unnecessarily, and training quality silently degrades. I have seen production training runs corrupted by this exact mistake, and the failure mode is insidious — the model still trains, it’s just worse, and you might not notice until evaluation.
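A minimal checkpointing sketch (the file name and step counter are placeholders):

# Save everything that carries training state, not just the weights
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),   # Adam moments live here
    'scheduler': scheduler.state_dict(),   # schedule position lives here
    'step': step,
}, 'checkpoint.pt')

# Resume
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])
step = ckpt['step']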
The LR Finder
Before committing to a learning rate, you can empirically find a good starting point. Leslie Smith’s LR finder (2017) works like this: start with a tiny LR, train for one epoch while exponentially increasing the LR each batch, and record the loss at each step. Plot loss versus LR on a log scale. The ideal LR is where the loss is decreasing fastest — the steepest downward slope — not where the loss is at its minimum. The minimum corresponds to a LR that’s already too high for stable long-term training; the steepest descent is where learning is most efficient.
def lr_finder(model, train_loader, optimizer, criterion,
lr_start=1e-7, lr_end=10, num_steps=100):
lrs, losses = [], []
lr_mult = (lr_end / lr_start) ** (1 / num_steps)
lr = lr_start
model.train()
data_iter = iter(train_loader)
for step in range(num_steps):
try:
batch = next(data_iter)
except StopIteration:
data_iter = iter(train_loader)
batch = next(data_iter)
x, y = batch
for pg in optimizer.param_groups:
pg['lr'] = lr
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
lrs.append(lr)
losses.append(loss.item())
lr *= lr_mult
if step > 0 and loss.item() > 4 * min(losses):
break
return lrs, losses
Run this, plot with plt.plot(lrs, losses); plt.xscale('log'), and look for the steepest downhill segment. That’s your learning rate neighborhood.
Wrap-Up
We started with “gradient times learning rate” and built up to the full modern training recipe: AdamW with cosine annealing and linear warmup. Along the way, we saw why each piece exists — momentum to stop bouncing in ravines, adaptive per-parameter rates to handle the diversity of gradient scales across a network, decoupled weight decay to make regularization actually work with adaptive optimizers, warmup to survive the chaos of random initialization, and cosine decay to smoothly transition from exploration to refinement.
The terrain analogy we’ve been using isn’t perfect — real loss landscapes are millions of dimensions, not two — but it captures the essential dynamics. The optimizer is the hiker’s strategy; the schedule is the hiker’s pace. Get both right and the model finds a wide valley that generalizes. Get either wrong and you end up stuck on a plateau, oscillating in a ravine, or sitting in a sharp minimum that falls apart on test data.
Thank you for making it through this. Optimization is one of those topics that seems dry until you realize it’s the reason your model works (or doesn’t). I hope the next time you set up a training run, the optimizer and schedule choices feel like informed decisions rather than cargo cult defaults.
Resources
- Kingma & Ba, “Adam: A Method for Stochastic Optimization” (2014) — The Adam paper. Short, readable, and still the definitive reference. Read Sections 1–3 and the experiment section.
- Loshchilov & Hutter, “Decoupled Weight Decay Regularization” (2019) — The AdamW paper. This is the one that explains why L2 regularization in Adam is broken and how decoupled weight decay fixes it. Concise and important.
- Smith & Topin, “Super-Convergence” (2018) — The one-cycle policy paper. Leslie Smith’s work on learning rate schedules is underappreciated. This paper shows how aggressive LR cycling can cut training time dramatically.
- Izmailov et al., “Averaging Weights Leads to Wider Optima” (2018) — The SWA paper. A beautiful idea that’s easy to implement and consistently improves generalization. The connection to flat minima is well-explained.
- Hinton’s Coursera Lecture 6 Slides — The original “publication” of RMSprop. Yes, one of the most important optimizers in deep learning was published as lecture slides. That’s the field for you.
- PyTorch optim documentation — The practical reference. Every optimizer and scheduler mentioned here has a PyTorch implementation with clear usage examples.
What You Should Now Be Able To Do
- Explain the loss landscape intuitively — non-convexity, saddle points, flat vs. sharp minima — and why optimizer choice affects generalization.
- Walk through the SGD → momentum → Adagrad → RMSprop → Adam → AdamW lineage, explaining what each one fixes and what limitation motivates the next.
- Describe Adam’s per-parameter adaptive learning rate mechanism and explain bias correction intuitively.
- Articulate exactly why AdamW exists: decoupled weight decay vs. L2 regularization through the adaptive term.
- Implement warmup + cosine annealing in PyTorch and explain why warmup stabilizes early training.
- Explain the one-cycle policy and why high LR in the middle acts as regularization.
- Pick appropriate optimizer + schedule combinations for vision, NLP, LLM, and fine-tuning tasks with confidence.
- Use the LR finder to empirically select a learning rate and explain why you pick the steepest descent, not the minimum.