Optimization

Chapter 2: Mathematical Foundations
TL;DR

Optimization is finding the lowest point in a landscape. Gradient descent is a blindfolded hiker feeling the slope underfoot. Convex problems are the friendly ones — a single valley, and any downhill path leads to the bottom. Non-convex problems are deep learning's wilderness, full of saddle points and deceptive valleys, where SGD plus momentum navigate by harnessing noise as a feature. Adam adapts learning rates per-parameter by tracking first and second moments. Learning rate schedules control the pace — warmup for transformers, cosine annealing for smooth convergence. Saddle points, not local minima, are the real enemy in high dimensions. Constrained optimization adds guardrails via Lagrangians and KKT conditions; duality is why SVMs work. Second-order methods use curvature but rarely survive the cost of computing the Hessian.

I avoided optimization theory for longer than I'd like to admit. For years, I treated the optimizer as a black box — pick Adam, set the learning rate to 3e-4, and pray. It worked often enough that I never felt the pressure to understand why it worked. Then I hit a problem where it didn't. The loss plateaued, the gradients exploded, and I had no vocabulary for diagnosing what had gone wrong. That's what sent me down the rabbit hole. This section is what I found at the bottom.

Here's the orientation. Optimization, at its core, is the study of finding the best answer — the lowest loss, the tightest fit, the widest margin. Every ML model you've ever trained was doing optimization under the hood. Linear regression minimizes squared error. SVMs maximize a margin. Neural networks minimize a loss function over millions of parameters. The math changes, but the question is always the same: where is the bottom of this landscape?

Before we start: this section assumes you're comfortable with derivatives and partial derivatives from the calculus section, and with vectors and matrices from linear algebra. If you can compute a gradient of a simple function and understand what a dot product means, you're ready. We'll build everything else from scratch.

The journey goes like this. We start with what optimization even means, then build gradient descent from the ground up with a toy example you can trace by hand. We'll wrestle with the learning rate dilemma, learn why convex problems are easy, and then face the harsh reality of non-convex landscapes. Momentum and mini-batch SGD will be our first tools for navigating the wilderness. There's an explicit rest stop halfway through — enough to be dangerous in a good way. Then we push into adaptive methods like Adam, the surprisingly contentious SGD-vs-Adam debate, learning rate schedules, saddle points, loss landscapes, constrained optimization with Lagrangians and KKT conditions, and finally a brief tour of second-order methods. It's a lot. Let's go.

Table of Contents

What Optimization Is
Gradient Descent from Scratch
The Learning Rate Dilemma
Convex vs. Non-Convex
Momentum
Mini-Batch SGD
🛑 Rest Stop
Adaptive Methods: RMSprop and Adam
The SGD vs. Adam Debate
Learning Rate Schedules
Saddle Points
Loss Landscapes
Constrained Optimization
Second-Order Methods
Wrap-Up
Resources

What Optimization Is

Imagine you're a hiker, blindfolded, dropped somewhere on a vast landscape. Your goal is to find the lowest point — the deepest valley. You can't see the terrain, but you can feel the slope beneath your feet. If the ground tilts left, you step left. If it tilts steeply downward ahead, you take a bigger step forward. That's optimization. You're navigating a landscape you can't see, guided only by local information about which way is downhill.

In mathematical language, we have a function f(x) and we want to find the input x* that makes f as small as possible. The function f is the objective function (in ML, usually a loss function). The input x is the parameter vector — in a neural network, that's millions of weights and biases. The landscape is the surface you get by plotting the loss against all possible parameter values. A point where the function value is lowest among all neighbors is a local minimum. A point where it's lowest everywhere is the global minimum. Our blindfolded hiker wants the global minimum but will often settle for a good local one.

The fundamental tool is the gradient, written ∇f(x). It's a vector that points in the direction of steepest ascent, and its magnitude tells you how steep. To go downhill, you walk opposite to the gradient. That's the entire intuition behind gradient-based optimization, and most of what follows is about doing this more cleverly.

Gradient Descent from Scratch

Let's make this concrete with the smallest possible example. Consider the function f(x) = (x − 3)². This is a parabola — a U-shape with its bottom at x = 3. We know the answer, but let's pretend we don't and find it by feeling the slope.

The gradient (just the derivative in 1D) is f'(x) = 2(x − 3). At any point x, this tells us the slope. If x is less than 3, the slope is negative — the hill goes down to the right, so we should move right. If x is greater than 3, the slope is positive — downhill is to the left. The update rule is:

x_new = x_old - α · f'(x_old)

That α is the learning rate — how big a step we take. It's a number we choose. Let's walk through this by hand. Suppose we start at x = 10 with α = 0.1.

Step 1: f'(10) = 2(10 − 3) = 14. So x_new = 10 − 0.1 × 14 = 8.6. We moved toward 3.

Step 2: f'(8.6) = 2(8.6 − 3) = 11.2. So x_new = 8.6 − 0.1 × 11.2 = 7.48. Still moving toward 3, but the step got smaller because the slope got gentler.

Step 3: f'(7.48) = 2(7.48 − 3) = 8.96. So x_new = 7.48 − 0.1 × 8.96 = 6.584.

You see the pattern. Each step, we get closer to 3 and the steps shrink because the slope is flattening out. After enough steps, we'll be so close to 3 that the gradient is effectively zero and we stop. That's gradient descent. The whole thing.

import numpy as np

# Gradient descent on f(x) = (x - 3)^2
# Gradient: f'(x) = 2*(x - 3)
# Minimum: x = 3, f(3) = 0

x = 10.0        # starting point
lr = 0.1        # learning rate
history = [x]

for step in range(50):
    grad = 2 * (x - 3)        # compute the slope
    x = x - lr * grad          # step opposite to the gradient
    history.append(x)

print(f"After 50 steps: x = {x:.6f}")   # 3.000000 (essentially)
print(f"f(x) = {(x - 3)**2:.10f}")       # ~0.0

# Watch how the steps shrink:
# Step 0: x=10.0 → 8.6   (big step, steep slope)
# Step 5: x≈5.29 → 4.84  (medium step)
# Step 20: x≈3.08 → 3.06 (tiny step, nearly flat)

In multiple dimensions, the idea is identical. Instead of a single derivative, we compute the gradient vector — the vector of partial derivatives with respect to each parameter. Each parameter gets updated independently, all moved in the direction that most steeply decreases the loss. A neural network with 10 million parameters does this exact thing, 10 million updates at once, on every training step. The math is the same; the scale is different.
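
To see the multi-dimensional version in code, here's a minimal sketch on a two-parameter bowl — the same update rule as above, just vectorized (the function and starting point are arbitrary illustrative choices):

import numpy as np

# Gradient descent on f(x, y) = (x - 3)^2 + (y + 2)^2
# Gradient: (2*(x - 3), 2*(y + 2)); minimum at (3, -2)

params = np.array([10.0, 10.0])
lr = 0.1

for step in range(100):
    grad = np.array([2 * (params[0] - 3), 2 * (params[1] + 2)])
    params = params - lr * grad    # every parameter updated at once

print(params)   # ~[3.0, -2.0]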

Why does this work? There's a formal reason. Gradient descent is doing something called steepest descent: among all directions you could step, the negative gradient direction decreases f the most per unit of distance traveled. For small enough step sizes, every step is guaranteed to decrease the loss (on smooth functions). The method isn't clever or creative — it's greedy. It always goes maximally downhill from where it stands. And for a surprising range of problems, greedy is good enough.

The Learning Rate Dilemma

I still sometimes pick a learning rate that's off by a factor of ten. It's the single most important hyperparameter in all of deep learning, and there's no formula for getting it right on the first try.

Here's the dilemma. If the learning rate is too large, you overshoot. Our hiker feels a steep slope and takes a huge leap, sailing right past the valley floor and landing on the opposite hillside — possibly even higher than where they started. Repeat this and the loss increases instead of decreasing, or oscillates wildly. Set it too high and training diverges entirely — the loss goes to infinity.

If the learning rate is too small, you crawl. Each step is so tiny that you'd need millions of iterations to make meaningful progress. Training technically works, but it takes forever and you might get stuck in a shallow dip that a bolder step would have sailed right through.

The sweet spot is a learning rate just large enough to make progress without overshooting. But "just large enough" depends on the loss landscape, which depends on the model architecture, the data, the batch size, and even the training stage. The ideal learning rate at the start of training (when gradients are large and the landscape is rough) is different from the ideal rate near the end (when you're fine-tuning into a narrow valley).

Let's see this concretely. Take our same f(x) = (x − 3)² but try three different learning rates:

import numpy as np

def gradient_descent_demo(lr, steps=20):
    x = 10.0
    path = [x]
    for _ in range(steps):
        grad = 2 * (x - 3)
        x = x - lr * grad
        path.append(x)
    return path

# Too small: barely moves
slow = gradient_descent_demo(lr=0.01)
print(f"lr=0.01 after 20 steps: x={slow[-1]:.4f}")  # ~6.71, far from 3

# Just right: converges smoothly
good = gradient_descent_demo(lr=0.1)
print(f"lr=0.1  after 20 steps: x={good[-1]:.4f}")   # ~3.0001, nailed it

# Too big: oscillates or diverges
wild = gradient_descent_demo(lr=0.9)
print(f"lr=0.9  after 20 steps: x={wild[-1]:.4f}")    # oscillates around 3

# Way too big: explodes
boom = gradient_descent_demo(lr=1.1)
print(f"lr=1.1  after 20 steps: x={boom[-1]:.4f}")    # diverges!

For this particular quadratic, each update multiplies the distance to the minimum by (1 − 2α), so the critical threshold is α = 1.0: beyond it, |1 − 2α| > 1 and every step lands farther from the minimum than the last, on the other side. Below 1.0, it converges, but the speed varies dramatically. Real loss landscapes are far more complex than a parabola, so the critical threshold changes at every point. That's why this is a dilemma, not a problem with a clean solution.

Watch Out

A common first response to "my model isn't learning" is to increase the learning rate. A common first response to "my loss is exploding" is to decrease it. Both instincts are correct about 60% of the time, which is just often enough to be dangerous. The other 40% is some other issue entirely — bad data, a bug in the loss function, or vanishing gradients. Always plot the loss curve before blaming the learning rate.

Convex vs. Non-Convex

Now we need to talk about the terrain itself, because the same gradient descent algorithm behaves completely differently depending on the shape of the landscape.

A convex function is one that curves upward everywhere — picture a bowl. More precisely, if you draw a straight line between any two points on the function's graph, the line sits above or on the graph. No dips. No bumps. One smooth valley. The canonical example is f(x) = x², a parabola. Mathematically, f is convex if for any two points x, y and any t ∈ [0, 1]:

f(t·x + (1-t)·y) ≤ t·f(x) + (1-t)·f(y)

That inequality says: the function value at a blend of two inputs is at most the blend of the two function values. If you can verify this for your function, you've won the optimization lottery.

Why? Because for convex functions, every local minimum is a global minimum. There are no deceptive valleys. If our blindfolded hiker finds a flat spot at the bottom of the bowl, they can be certain it's the bottom. Gradient descent is guaranteed to converge to this unique optimum (given a small enough learning rate). No need for tricks, restarts, or prayers.
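
You can sanity-check the definition numerically. Here's a sketch that samples random point pairs and blend weights, then tests the inequality — it holds everywhere for x² and fails somewhere for the non-convex sin(x):

import numpy as np

rng = np.random.default_rng(0)

def violates_convexity(f, trials=100_000):
    x = rng.uniform(-10, 10, trials)
    y = rng.uniform(-10, 10, trials)
    t = rng.uniform(0, 1, trials)
    lhs = f(t * x + (1 - t) * y)        # function at the blend
    rhs = t * f(x) + (1 - t) * f(y)     # blend of the function values
    return np.any(lhs > rhs + 1e-9)     # tolerance absorbs float noise

print(violates_convexity(np.square))  # False — x^2 is convex
print(violates_convexity(np.sin))     # True  — sin(x) is not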

A convex set is the geometric cousin: a region where the line segment between any two of its points stays entirely inside the region. A filled circle is convex. A crescent moon is not. This matters because many optimization problems restrict x to a feasible region, and if that region is convex and the objective is convex, you're in the promised land of tractable optimization.

Classical ML lives mostly in convex territory. Linear regression with squared error produces a paraboloid — one bottom, guaranteed. Logistic regression with log-loss is convex. SVMs minimize a convex quadratic subject to linear constraints. Ridge and Lasso add convex penalties. For all these models, gradient descent (or a specialized solver) finds the global optimum, not an approximation. That's a strong guarantee, and it's one reason these models remain so trustworthy.

Now the bad news. A non-convex function has multiple valleys, ridges, plateaus, and saddle points. Our hiker's landscape isn't a single bowl anymore — it's a mountain range. Deep learning loss functions are emphatically non-convex. The interactions between layers, the nonlinear activations, the sheer dimensionality — all of it creates a landscape with an incomprehensible number of critical points. Gradient descent might find a local minimum, but there's no guarantee it's the global one.

The remarkable empirical fact is that this usually doesn't matter. We'll see why soon — it has to do with saddle points, the surprising structure of high-dimensional landscapes, and the noise in SGD. But the theoretical distinction is important: convex means you can relax, non-convex means you need tools that cope with uncertainty.

A related result worth knowing is Jensen's inequality: for a convex function f and a random variable X, f(E[X]) ≤ E[f(X)]. The function of the average is at most the average of the function. This underpins the Evidence Lower Bound (ELBO) in variational inference and explains why many bounds in ML are tight for convex objectives and loose for non-convex ones.
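
A quick Monte Carlo sketch of Jensen's inequality, using the convex function exp and a standard normal X (for which E[exp(X)] = e^(1/2) ≈ 1.65 exactly):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, 1_000_000)    # standard normal samples

f_of_mean = np.exp(X.mean())       # f(E[X]) ≈ exp(0) = 1
mean_of_f = np.exp(X).mean()       # E[f(X)] ≈ e^(1/2) ≈ 1.6487

print(f_of_mean <= mean_of_f)      # True — Jensen's inequality
print(f_of_mean, mean_of_f)        # ~1.0, ~1.65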

Key Insight

If you can prove your loss function is convex, you've essentially solved the hard part. The optimizer becomes a detail — almost any reasonable algorithm will find the answer. The real difficulty in ML is that most interesting models produce non-convex landscapes. Knowing which terrain you're on determines which tools to reach for.

Momentum

Plain gradient descent has a problem with valleys that are narrow in one direction and shallow in another. The hiker oscillates back and forth across the narrow walls while making painfully slow progress along the valley floor. It's like trying to roll a ball down a half-pipe — the ball bounces side to side instead of rolling smoothly toward the exit.

The fix comes from physics. Instead of stepping based only on the current gradient, we give our hiker momentum — the ability to accumulate velocity. If the gradient has been pointing in the same direction for several steps, momentum builds up and the steps get bigger. If the gradient keeps switching signs (oscillating), the positive and negative contributions cancel out and the oscillation dampens. It's exactly how a ball rolling down a hill behaves: it accelerates on consistent slopes and smooths out over bumps.

The update rule for Polyak momentum (also called classical momentum or heavy-ball method) introduces a velocity term v:

v_new = β · v_old + ∇f(x)
x_new = x_old - α · v_new

Here β is the momentum coefficient, typically 0.9. On each step, we compute the gradient as usual, but instead of using it directly, we blend it with the previous velocity. The velocity v accumulates gradient history. If the last ten gradients all pointed left, v points strongly left, and we take a big step. If the gradients keep flipping between left and right, v stays small because the contributions cancel.

Think of it this way. Without momentum, the hiker stops at the end of each step, looks at the ground, and takes a fresh step. With momentum, the hiker is sliding. Their past motion carries them forward. They still feel the slope and adjust, but they don't stop between steps.

Nesterov momentum (Nesterov Accelerated Gradient, or NAG) adds a subtle twist that makes a real difference. Instead of computing the gradient at the current position, you compute it at the position you'd reach if you followed the current velocity — a "look-ahead" gradient. The idea is: if I'm sliding to the left, I should check the slope where I'm about to be, not where I am now. This lets the optimizer slow down before it overshoots.

v_new = β · v_old + ∇f(x_old - α · β · v_old)   # look-ahead gradient
x_new = x_old - α · v_new

Nesterov showed that this modification achieves the theoretically optimal convergence rate for convex functions — O(1/t²) instead of O(1/t) for plain gradient descent. In practice, the difference between Polyak and Nesterov momentum is often modest, but Nesterov tends to overshoot less.

Momentum helps with more than just narrow valleys. It also helps escape saddle points. At a saddle point, the gradient is near zero, so plain gradient descent stalls. But if the optimizer has built up velocity from previous steps, the momentum carries it through the flat region and out the other side. This is one of the less obvious but deeply important benefits of momentum — it turns the optimizer's memory into a saddle-point escape mechanism.

Momentum also helps traverse flat plateaus. If a stretch of the loss landscape is nearly flat, the gradient is tiny and progress is glacial. But momentum accumulates even tiny gradients over time, building enough velocity to cross the plateau. Without momentum, you might stall in a flat region for thousands of steps. With it, the ball keeps rolling.
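
Here's a minimal sketch comparing plain gradient descent, Polyak momentum, and Nesterov momentum on a valley 100× steeper in y than in x — the half-pipe from the analogy above (the function and hyperparameters are illustrative choices):

import numpy as np

# A valley 100x steeper in y than in x: f(x, y) = x^2 + 100*y^2
def grad(p):
    return np.array([2 * p[0], 200 * p[1]])

def run(momentum=False, nesterov=False, lr=0.009, beta=0.9, steps=200):
    p = np.array([10.0, 10.0])
    v = np.zeros(2)
    for _ in range(steps):
        g = grad(p - lr * beta * v) if nesterov else grad(p)  # look-ahead for NAG
        v = beta * v + g if momentum else g
        p = p - lr * v
    return p

print(run())                              # plain GD: x still ~0.26 from the minimum
print(run(momentum=True))                 # Polyak: orders of magnitude closer
print(run(momentum=True, nesterov=True))  # Nesterov: similar, with less overshoot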

Mini-Batch SGD

So far, we've assumed we compute the gradient on the entire training set at each step. For a dataset of a million examples, that means a million forward passes and a million backward passes before a single parameter update. That's batch gradient descent, and it's too slow for modern scale.

The alternative is Stochastic Gradient Descent (SGD) — compute the gradient on a single randomly chosen example and update immediately. This is noisy. One example gives a crude estimate of the true gradient. But it's fast, and the noise turns out to be a feature, not a bug.

In practice, we use mini-batch SGD: compute the gradient on a small random subset (a mini-batch) of, say, 32 or 256 examples. This is a compromise. The mini-batch gradient is a noisy estimate of the full-batch gradient, but it's much cheaper to compute, and the noise isn't too wild.
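
Here's a minimal sketch of the mini-batch loop on a toy linear regression; the data, batch size, and learning rate are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 1000)
y = 2 * X + 1 + 0.1 * rng.normal(size=1000)    # true w=2, b=1, plus noise

w, b, lr, batch = 0.0, 0.0, 0.1, 32

for epoch in range(20):
    idx = rng.permutation(1000)                # reshuffle every epoch
    for i in range(0, 1000, batch):
        j = idx[i:i + batch]                   # one random mini-batch
        err = (w * X[j] + b) - y[j]
        w -= lr * 2 * np.mean(err * X[j])      # noisy estimate of the true gradient
        b -= lr * 2 * np.mean(err)

print(f"w={w:.3f}, b={b:.3f}")                 # ~2.0, ~1.0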

Why is the noise a feature? Think about our hiker analogy. Full-batch gradient descent is the hiker on a clear day, seeing the exact slope everywhere. They walk directly to the nearest valley bottom and stop — even if it's a shallow local minimum with a much deeper valley just over the next ridge. Mini-batch SGD is the hiker in fog and wind. They can feel the general direction of downhill, but there's randomness — gusts that push them sideways, mist that obscures the exact slope. Sometimes those gusts push them out of a shallow minimum and over the ridge into a deeper valley.

The noise from mini-batches has real mathematical consequences. It acts as implicit regularization: the optimizer can't settle into the sharpest, narrowest valleys because the noise bounces it back out. It tends to find flat minima — regions where the loss is low and stays low even when parameters are perturbed. There's accumulating evidence that flat minima generalize better than sharp ones, because the flatness means the model's performance is robust to small perturbations in the weights.

Batch size interacts with the learning rate. Larger batches give cleaner gradient estimates, which allows larger learning rates. Smaller batches give noisier estimates, which requires smaller learning rates to avoid instability. A common rule of thumb (the "linear scaling rule") is: if you double the batch size, double the learning rate. This heuristic works reasonably well in practice, though it breaks down at very large batch sizes.

Key Insight

SGD's noise is one of deep learning's most important features. It helps escape saddle points, avoid sharp local minima, and find solutions that generalize well. Full-batch gradient descent gives exact gradients but often converges to worse solutions. More noise, within reason, leads to better generalization. This counterintuitive fact is one of the central results in modern optimization theory for deep learning.

🛑 Rest Stop

Pause here if you want. You now have a working mental model of optimization in deep learning: gradient descent navigates a loss landscape by feeling the slope. The learning rate controls step size — too big and you overshoot, too small and you crawl. Convex landscapes are friendly (one valley, guaranteed convergence); non-convex landscapes are the norm in deep learning. Momentum gives the optimizer memory, helping it build speed on consistent slopes and dampen oscillations. Mini-batch SGD adds beneficial noise that helps escape bad local minima and find flat regions that generalize well.

That's enough to understand most training pipelines. You know why Adam often works out of the box (we'll explain the mechanics next), you know why people fiddle with learning rates, and you know why SGD + momentum remains competitive despite being the oldest trick in the book. If you're comfortable here, this is a perfectly fine place to stop.

What follows is the deeper story: adaptive optimizers, the Adam-vs-SGD debate, learning rate schedules, the geometry of saddle points, what loss landscapes actually look like, constrained optimization (where Lagrangians and KKT conditions live), and a brief tour of second-order methods. It's the stuff that separates "I can train models" from "I can debug training and reason about why."

Adaptive Methods: RMSprop and Adam

The learning rate dilemma has a deeper layer we haven't addressed: different parameters might need different learning rates. In a neural network, some parameters see steep gradients (they're in the action) and some see tiny gradients (they're in a quiet corner of the landscape). A single global learning rate that works for the steep parameters is too aggressive for the quiet ones, and a rate that works for the quiet ones is too timid for the steep ones.

Adaptive methods solve this by giving each parameter its own effective learning rate, adjusted automatically based on the history of gradients that parameter has seen.

The story starts with AdaGrad (Duchi et al., 2011). AdaGrad keeps a running sum of squared gradients for each parameter. Parameters that have seen large gradients get their learning rate scaled down; parameters that have seen small gradients get their learning rate scaled up. The update for parameter i is:

s_i = s_i + (grad_i)²
x_i = x_i - α · grad_i / (√s_i + ε)

The accumulated squared gradient s_i in the denominator shrinks the effective learning rate over time. For sparse problems (like natural language, where most features are zero most of the time), this is brilliant — rare features get larger updates. But for deep learning, there's a fatal flaw: s_i only grows. The learning rate monotonically decreases and eventually becomes so small that learning stops entirely. Training dies of old age.

RMSprop (Hinton, unpublished but taught in his Coursera lectures, 2012) fixes the accumulation problem with a simple idea: use an exponential moving average instead of a sum. Recent gradients matter more than ancient ones.

s_i = ρ · s_i + (1 - ρ) · (grad_i)²        # exponential average of squared gradients
x_i = x_i - α · grad_i / (√s_i + ε)

With ρ typically 0.9 or 0.99, RMSprop forgets old gradients and adapts to the current landscape. If a parameter's gradients recently became small, its learning rate increases. If they became large, the learning rate decreases. It's like our hiker adjusting their stride based on the terrain they've been walking through recently, not the terrain from three days ago.
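
The difference is easy to see numerically. Here's a sketch: feed both accumulators a constant unit gradient and watch the effective learning rate α/(√s + ε) — AdaGrad's decays toward zero while RMSprop's levels off near α:

import numpy as np

lr, rho, eps = 0.01, 0.9, 1e-8
s_ada, s_rms = 0.0, 0.0

for t in range(1, 1001):
    g2 = 1.0                                   # constant squared gradient
    s_ada += g2                                # AdaGrad: the sum only grows
    s_rms = rho * s_rms + (1 - rho) * g2       # RMSprop: moving average forgets
    if t in (1, 10, 100, 1000):
        print(t, lr / (np.sqrt(s_ada) + eps),  # shrinks like 1/sqrt(t) — training dies
                 lr / (np.sqrt(s_rms) + eps))  # settles near lr — training continues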

Now we arrive at Adam (Kingma and Ba, 2015), probably the most widely used optimizer in deep learning. Adam combines the best of momentum and RMSprop. It tracks two quantities: the first moment (mean of gradients, like momentum) and the second moment (mean of squared gradients, like RMSprop).

m = β₁ · m + (1 - β₁) · grad          # first moment estimate (momentum-like)
v = β₂ · v + (1 - β₂) · grad²         # second moment estimate (RMSprop-like)
m̂ = m / (1 - β₁ᵗ)                     # bias correction
v̂ = v / (1 - β₂ᵗ)                     # bias correction
x = x - α · m̂ / (√v̂ + ε)             # update

The default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) work well across a wide range of problems, which is why Adam is the "set it and forget it" optimizer. But let's understand why each piece is there.

The first moment m is a smoothed estimate of the gradient direction and magnitude — essentially momentum. It keeps the optimizer moving consistently even when individual mini-batch gradients are noisy. The second moment v tracks the scale of recent gradients for each parameter. Dividing by √v normalizes the update: parameters with large gradients get smaller steps, parameters with small gradients get larger steps. Together, m/√v gives you a direction and an appropriate scale.

The bias correction terms (dividing by 1 − β^t) are a subtle but important detail. At the start of training, m and v are initialized to zero. After the first step, m = (1 − β₁) · grad, which underestimates the true mean by a factor of about 10 (since β₁ = 0.9). The bias correction inflates the early estimates to account for the zero initialization. Without it, the first few updates are too small. By around step 10-20, the correction terms approach 1 and stop mattering.

Why does Adam work so well? It adapts the learning rate per-parameter (handling the different-scales problem), it has momentum (handling noise and saddle points), and the defaults are robust. For practitioners, this means less hyperparameter tuning than SGD with momentum, which is why Adam became the default for research projects where development speed matters.

import numpy as np

# Adam from scratch on f(x, y) = (x-3)^2 + 10*(y+2)^2
# The landscape is steep in y, shallow in x — different scales.
# Adam adapts the learning rate for each parameter independently.

def loss(params):
    x, y = params
    return (x - 3)**2 + 10 * (y + 2)**2

def grad(params):
    x, y = params
    return np.array([2*(x - 3), 20*(y + 2)])

params = np.array([10.0, 10.0])
m = np.zeros(2)       # first moment
v = np.zeros(2)       # second moment
lr = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(params)
    m = beta1 * m + (1 - beta1) * g          # update first moment
    v = beta2 * v + (1 - beta2) * g**2       # update second moment
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)               # bias correction
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"Final: x={params[0]:.4f}, y={params[1]:.4f}")  # ~(3.0, -2.0)
print(f"Loss:  {loss(params):.6f}")                      # ~0.0

Notice that the y-direction has gradients 10× larger than x, yet Adam handles both smoothly. Plain gradient descent with a single learning rate would either oscillate wildly in y or crawl in x. Adam's per-parameter scaling resolves this automatically.

The SGD vs. Adam Debate

I'm still developing my intuition for when to use which optimizer. This is one of those areas where the community hasn't fully converged, and I want to be honest about that.

Here's the tension. Adam converges faster and requires less tuning. SGD with momentum, when carefully tuned, often finds solutions that generalize better — lower test loss, not just lower training loss. This pattern has been observed repeatedly in computer vision (ResNets on ImageNet, for instance) and is robust enough that many production image classifiers still use SGD.

The leading explanation involves sharp vs. flat minima. Adam's per-parameter learning rates let it navigate into narrow, sharp valleys in the loss landscape. SGD's noise prevents it from settling into sharp valleys — the noise bounces it out — so it ends up in wider, flatter regions. Flat minima correspond to models whose predictions don't change much when the weights are slightly perturbed, and there's both theoretical and empirical evidence that this flatness correlates with better generalization.

Wilson et al. (2017) showed that adaptive methods (Adam, RMSprop, AdaGrad) can converge to solutions with worse generalization than SGD, even when the training loss is lower. The gap is real and measurable on standard benchmarks. This sparked a lot of follow-up work trying to get the best of both worlds.

One important fix is AdamW (Loshchilov and Hutter, 2019). The original Adam implementation combined L2 regularization with the adaptive gradient scaling in a way that didn't work as intended — the regularization was scaled down by the same adaptive factor, weakening it for parameters with large gradients. AdamW separates weight decay from the adaptive update, applying it directly to the parameters. This seemingly minor change significantly closes the generalization gap. AdamW is now the default optimizer for training transformers, including the GPT and BERT families.

The rough practical consensus in 2024: use AdamW for transformers and language models. Use SGD with momentum for CNNs and vision tasks if you're willing to tune the learning rate and schedule. Use Adam (or AdamW) if you want fast convergence with less fuss. And always evaluate on held-out data — the optimizer that reaches lower training loss isn't always the one you want.

Key Insight

The SGD-vs-Adam debate isn't about which is "better" — it's about the generalization properties of the minima they find. SGD's noise acts as implicit regularization, pushing it toward flat minima. Adam converges faster but can land in sharper minima. AdamW largely resolves this for transformers by decoupling weight decay. Know your architecture and your budget before choosing.

Learning Rate Schedules

So far we've treated the learning rate as a constant. In practice, nearly every serious training run changes the learning rate over time. The intuition is straightforward: early in training, you want to move fast — take big steps to get in the right neighborhood. Late in training, you want to move carefully — take small steps to settle precisely into the minimum without overshooting. A learning rate schedule (or learning rate decay) codifies this intuition.

The simplest approach is step decay: multiply the learning rate by some factor (say 0.1) every N epochs. This was the standard for years — train at 0.1 for 30 epochs, drop to 0.01 for 30 more, drop to 0.001 for the final stretch. It works, but the abrupt drops can cause instabilities, and choosing when to drop is another hyperparameter to tune.

Cosine annealing (Loshchilov and Hutter, 2017) smoothly decreases the learning rate following a cosine curve from some maximum value down to near zero over the course of training. There are no discontinuities, no sudden drops. The curve starts slow (learning rate barely changes), accelerates in the middle, and slows again at the bottom. Visually, it looks like the first half of a cosine wave. The smoothness turns out to be beneficial — it avoids the instabilities of step decay and tends to give slightly better results.

Warmup is a more recent addition, driven by transformer training. The idea is to increase the learning rate gradually during the first few hundred or thousand steps before beginning the decay. Why? At initialization, the network's parameters are random. The gradients in early training are large and poorly calibrated — they point "downhill" in a vague sense but can be wildly wrong in magnitude. A large learning rate applied to these initial wild gradients can destabilize training permanently. Warmup starts with a very small learning rate, giving the optimizer a few steps to "calibrate" before ramping up to the target rate.

The combination of warmup followed by cosine decay is now the standard schedule for transformer training. The one-cycle policy (Smith, 2018) is a variation that ramps the learning rate up to a maximum and then back down over a single cycle, also varying the momentum inversely. It often achieves competitive results in significantly fewer epochs.
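
Here's a minimal sketch of warmup followed by cosine decay as a pure function of the step count; the constants (peak rate, warmup length, total steps) are illustrative placeholders, not recommendations:

import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1000, total_steps=100_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to ~0

for s in (0, 500, 1000, 50_000, 100_000):
    print(s, f"{lr_at(s):.2e}")
# 0 at step 0, ramps to 3e-4 at step 1000, ~1.5e-4 mid-run, ~0 at the end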

A powerful practical tool is the learning rate range test (Smith, 2017). Start with a very small learning rate and gradually increase it over a few hundred steps, plotting the loss along the way. The loss will decrease for a range of learning rates, then start increasing when the rate gets too high. The best learning rate for training is typically somewhere in the middle of the decreasing region — large enough to learn fast, small enough not to diverge. This takes five minutes and saves hours of guessing.
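
Here's a sketch of the range test on our toy parabola from earlier — ramp the learning rate geometrically, take one training step per rate, and stop as soon as the loss rises; the ramp factor is an arbitrary choice:

x, lr = 10.0, 1e-4
prev_loss = (x - 3) ** 2

while lr < 10:
    x = x - lr * 2 * (x - 3)       # one training step at the current rate
    loss = (x - 3) ** 2
    if loss > prev_loss:           # loss went up: this rate is too high
        break
    prev_loss = loss
    lr *= 1.1                      # ramp the learning rate geometrically

print(f"Loss started rising near lr = {lr:.3f}")  # ~1.04, just above the true threshold of 1.0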

Watch Out

Warmup is not optional for large transformers. Without it, the early training steps can push the model into a region of parameter space from which it never recovers. If your transformer loss doesn't decrease after a few thousand steps, the first thing to check is whether you forgot warmup.

Saddle Points

For a long time, the conventional fear about non-convex optimization was local minima — valleys that aren't the deepest but trap the optimizer. It turns out the real enemy in high dimensions is the saddle point.

A saddle point is a critical point where the gradient is zero, but it's not a minimum or a maximum. The surface curves upward in some directions and downward in others. Think of an actual horse saddle: if you sit on it, the surface curves up toward the horse's head and tail (you'd roll off the sides) but curves down toward the ground on either side. In two dimensions it's easy to picture. In a million dimensions, saddle points are everywhere.

Here's the key argument, due to Dauphin et al. (2014). At any critical point (where the gradient is zero), the Hessian matrix — the matrix of second derivatives — tells you about the curvature in every direction. The Hessian's eigenvalues can be positive (curving up — minimum-like) or negative (curving down — maximum-like). For a point to be a true local minimum, all eigenvalues must be positive. For a saddle point, you need a mix of positive and negative.

In a model with n parameters, the Hessian has n eigenvalues. If each eigenvalue is positive or negative with roughly equal probability (a simplification, but the argument extends), the probability that all n eigenvalues are positive is about (1/2)^n. For a network with a million parameters, that's (1/2)^1,000,000 — so astronomically small that it essentially never happens. The vast majority of critical points in high-dimensional landscapes are saddle points, not local minima.

This is good news and bad news. The good news: local minima are rare, so the optimizer is unlikely to get permanently trapped in one. The bad news: saddle points can still slow training to a crawl. At a saddle point, the gradient is near zero, so gradient descent moves very slowly. The optimizer doesn't know whether to go left (where the surface curves up) or right (where it curves down). It just sees flatness.

Momentum helps because the velocity carries the optimizer through flat regions. SGD's noise helps because random perturbations in the gradient give the optimizer a "nudge" in the downhill directions. Together, they make saddle points a nuisance rather than a roadblock. But in very high dimensions, no one is completely certain about the full geometry of these landscapes — we have strong empirical evidence and partial theoretical results, but the complete picture is still an open research question.
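
To make the escape mechanics concrete, here's a sketch on the simplest possible saddle, f(x, y) = x² − y², starting a hair away from the saddle point at the origin (the starting offset and hyperparameters are illustrative):

import numpy as np

def steps_to_escape(beta=0.0, lr=0.01, max_steps=5000):
    p = np.array([1.0, 1e-8])                # almost exactly on the saddle at (0, 0)
    v = np.zeros(2)
    for t in range(1, max_steps + 1):
        g = np.array([2 * p[0], -2 * p[1]])  # gradient of x^2 - y^2
        v = beta * v + g
        p = p - lr * v
        if abs(p[1]) > 1:                    # escaped along the downhill direction
            return t
    return max_steps

print(steps_to_escape(beta=0.0))   # plain GD: ~930 steps crawling off the saddle
print(steps_to_escape(beta=0.9))   # momentum: escapes roughly 5x faster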

Dauphin et al. also observed something reassuring: the loss values at different critical points tend to be correlated with the "index" of the critical point (how many negative eigenvalues the Hessian has). High-loss critical points tend to be saddle points with many negative eigenvalues. Low-loss critical points tend to have mostly positive eigenvalues — they're near-minima. So the critical points you actually encounter during training (as the loss decreases) become progressively more minimum-like. The landscape is structured in your favor.

Key Insight

In high-dimensional optimization, saddle points outnumber local minima by an astronomical margin. A critical point with n parameters needs all n curvature directions to be positive to be a local minimum — the probability is exponentially small. SGD with momentum naturally escapes saddle points through noise and velocity. The real local minima you encounter tend to have similar loss values, so even getting "stuck" usually isn't catastrophic.

Loss Landscapes

Talking about "landscapes" with millions of dimensions is inherently hand-wavy. We can't visualize a million-dimensional surface. But researchers have found clever ways to project these surfaces into 2D, and what they found is striking.

Li et al. (2018) introduced filter-normalized loss landscape visualizations. The idea: pick two random directions in parameter space, normalize them so the visualization isn't dominated by scale differences between layers, then plot the loss along a 2D slice of the landscape. It's like taking a cross-section of a mountain range — you see one particular view, not the whole thing, but it reveals structure.

They found that architecture design dramatically affects the smoothness of the landscape. ResNets with skip connections produce smooth, nearly convex-looking loss surfaces. The same architectures without skip connections produce chaotic, rough surfaces full of sharp valleys and abrupt transitions. Skip connections don't just help gradient flow (the usual explanation) — they literally reshape the optimization terrain into something more navigable.

Mode connectivity (Garipov et al., 2018) is another fascinating finding. Take two independently trained networks that converge to different local minima. You might expect a ridge of high loss between them — after all, they found different valleys. But Garipov et al. showed that you can often find a path between the two minima along which the loss stays low. The valleys aren't isolated — they're connected by low-loss corridors. This suggests the loss landscape of typical neural networks is less "mountain range with isolated valleys" and more "hilly terrain with connected valley systems."

Draxler et al. (2018) extended this, showing that local minima in neural networks are often connected by paths of nearly constant loss. The landscape, at least for overparameterized networks, seems to have a single, connected low-loss basin with multiple access points. This helps explain why different random initializations and different hyperparameters often reach solutions with similar test performance — they're finding different entrances to the same valley system.

These visualizations and theoretical results have practical implications. If the loss landscape is smooth (as it is for well-designed architectures), gradient-based optimizers navigate it efficiently. If it's rough (as it is for poorly designed architectures), training becomes erratic and sensitive to hyperparameters. When you see someone say "I can't get this architecture to train," the landscape itself might be the problem — not the optimizer or the learning rate.

Constrained Optimization

Everything so far has been unconstrained: minimize f(x) with x free to roam anywhere. Real problems are rarely so permissive. Probabilities must sum to one. Budgets can't be exceeded. Physical quantities can't go negative. Resource allocations have upper bounds. Constrained optimization is where you minimize an objective while keeping your solution inside a feasible region defined by constraints.

The constraints come in two flavors: equality constraints (g(x) = 0, the solution must lie exactly on a surface) and inequality constraints (h(x) ≤ 0, the solution must stay on one side of a boundary). In standard form, we write:

minimize f(x)
subject to  g_i(x) = 0  for i = 1, ..., m      (equality constraints)
            h_j(x) ≤ 0  for j = 1, ..., p      (inequality constraints)

The Lagrangian Method

The core idea is elegant: turn a constrained problem into an unconstrained one by folding the constraints into the objective. For a problem with equality constraints, the Lagrangian is:

L(x, λ) = f(x) + Σ λ_i · g_i(x)

Each constraint gets its own Lagrange multiplier λ_i — a new variable that the optimizer must find alongside x. At the optimum, the gradient of f must be a linear combination of the gradients of the active constraints. That's what the multipliers capture: how strongly each constraint is "pulling" the solution.

Let's walk through a concrete example. Suppose we want to minimize f(x, y) = x² + y² (find the point closest to the origin) subject to the constraint x + y = 4 (the point must lie on the line x + y = 4). Without the constraint, the answer is (0, 0). With it, we're finding the point on the line x + y = 4 that's closest to the origin.

The Lagrangian is L(x, y, λ) = x² + y² + λ(x + y − 4). Taking partial derivatives and setting them to zero:

∂L/∂x = 2x + λ = 0  →  x = -λ/2
∂L/∂y = 2y + λ = 0  →  y = -λ/2
∂L/∂λ = x + y - 4 = 0

From the first two equations, x = y. Plugging into the constraint: 2x = 4, so x = y = 2 and λ = −4. The optimal point is (2, 2), which is indeed the closest point on the line to the origin. The multiplier λ = −4 tells us the sensitivity: if we relaxed the constraint from x + y = 4 to x + y = 4 + ε, the optimal objective would change by about −λε = 4ε — pushing the line farther from the origin raises the minimum distance. The magnitude of the multiplier is the "price" of the constraint.
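
Because the objective is quadratic and the constraint is linear, the stationarity and feasibility equations above form a linear system in (x, y, λ), which we can solve directly — a quick numerical check of the hand calculation:

import numpy as np

# Stationarity + constraint from the Lagrangian, as a linear system:
#   2x      + λ = 0
#        2y + λ = 0
#   x  +  y     = 4
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 4.0])

x, y, lam = np.linalg.solve(A, rhs)
print(x, y, lam)   # 2.0, 2.0, -4.0 — matches the hand calculation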

KKT Conditions

Equality constraints are tidy. Inequality constraints add complexity. The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange multipliers to handle both types. At an optimal solution (under mild regularity conditions), these four conditions hold:

1. Stationarity: The gradient of the Lagrangian with respect to x is zero. This is the familiar "gradient equals zero" condition, but now the Lagrangian includes both the objective and the constraint terms.

2. Primal feasibility: All constraints are satisfied. The solution must actually be in the feasible region — you can't minimize cost by spending money you don't have.

3. Dual feasibility: The multipliers for inequality constraints are non-negative (μ_j ≥ 0). This ensures that the inequality constraints act as penalties, not rewards. If a multiplier were negative, the constraint would be "helping" the objective rather than constraining it.

4. Complementary slackness: For each inequality constraint, either the constraint is tight (h_j(x) = 0, the constraint is active) or its multiplier is zero (μ_j = 0). You can't have both a loose constraint and a nonzero multiplier. The optimizer only "pays attention" to constraints that are actually binding.

Complementary slackness is the deepest of the four. It says that at the optimum, inactive constraints are irrelevant — their multipliers are zero, so they contribute nothing to the Lagrangian. Only the active constraints shape the solution. This is directly connected to how Support Vector Machines work: only the data points where the margin constraint is tight (the support vectors) determine the decision boundary. All other training points could be removed without changing the solution.

Duality

Every constrained optimization problem has a twin: the dual problem. The original is the primal. Where the primal minimizes over x with constraints, the dual maximizes over the Lagrange multipliers. The dual is derived from the Lagrangian:

Primal: minimize_x  f(x)  subject to constraints
Dual:   maximize_λ  min_x L(x, λ)  subject to λ ≥ 0

Weak duality always holds: the dual optimum is a lower bound on the primal optimum. You can't do better in the dual than in the primal. Strong duality says the two are equal — the gap is zero — and it holds whenever the primal problem is convex and satisfies mild regularity conditions (Slater's condition, which roughly says the feasible region has an interior point). When strong duality holds, solving the dual is equivalent to solving the primal, and sometimes the dual is easier.

SVMs are the textbook example. The primal SVM is a quadratic program over the weight vector w. The dual is a quadratic program over the Lagrange multipliers α_i, one per training point. Strong duality holds, so the two have the same solution. But the dual has a crucial advantage: it depends on the training data only through dot products x_i · x_j. This is what makes the kernel trick possible — you can replace the dot product with a kernel function K(x_i, x_j) that implicitly computes dot products in a high-dimensional feature space, without ever constructing the coordinates. The dual formulation doesn't just make SVMs solvable — it makes them powerful.

LP and QP

Constrained optimization problems come in standard forms, each with specialized solvers. Linear Programming (LP) has a linear objective and linear constraints — supply chain routing, resource allocation, diet problems. Quadratic Programming (QP) adds a quadratic objective with linear constraints — this is exactly the SVM formulation. Semidefinite Programming (SDP) constrains a matrix to be positive semidefinite, which appears in kernel learning and relaxations of combinatorial problems.

Interior point methods are the modern workhorse. Unlike the simplex method (which walks along edges of the feasible polytope), interior point methods walk through the interior, using barrier functions to stay away from constraint boundaries. They have polynomial-time complexity and scale well to large problems.

import numpy as np
from scipy.optimize import linprog, minimize

# Linear Programming: diet problem
# Minimize cost: 2*bread + 3.5*milk + 8*cheese
# Subject to:
#   6*bread + 2*milk + 1*cheese >= 10  (protein)
#   1*bread + 8*milk + 7*cheese >= 15  (calcium)
#   all quantities >= 0

c = [2, 3.5, 8]
A_ub = [[-6, -2, -1], [-1, -8, -7]]  # negate for >= to <=
b_ub = [-10, -15]
bounds = [(0, None), (0, None), (0, None)]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
print(f"Bread={result.x[0]:.2f}, Milk={result.x[1]:.2f}, "
      f"Cheese={result.x[2]:.2f}")
print(f"Minimum cost: ${result.fun:.2f}")

# Constrained nonlinear: minimize x^2 + y^2 subject to x + y = 4
# Using Lagrangian approach via scipy

def objective(xy):
    return xy[0]**2 + xy[1]**2

constraint = {'type': 'eq', 'fun': lambda xy: xy[0] + xy[1] - 4}
result = minimize(objective, [0, 0], constraints=constraint)
print(f"\nClosest point on x+y=4 to origin: ({result.x[0]:.2f}, {result.x[1]:.2f})")
print(f"Distance squared: {result.fun:.2f}")  # 8.0
Watch Out

When converting "≥" constraints for solvers that expect "≤", you negate both sides. This is the most common source of sign errors in constrained optimization. Always verify by plugging your solution back into the original constraints. A correct solution to the wrong problem is worse than useless.

Second-Order Methods

I haven't figured out a great way to build intuition for the Hessian. I know the theory, I can write the equations, but when someone asks me to "think about the curvature matrix" for a network with 100 million parameters, my mental image goes blank. So I'll give you what I understand, acknowledge where the intuition gets thin, and at least explain why these methods exist and why they're rarely used.

Gradient descent uses the gradient — first-order information. It knows which direction is downhill but nothing about how the slope changes. Second-order methods use the Hessian — the matrix of second partial derivatives — which tells you about the curvature of the landscape. The Hessian captures whether the slope is getting steeper or flatter, and in which directions.

The canonical second-order method is Newton's method. Instead of taking a step proportional to the gradient, Newton's method takes a step proportional to the inverse Hessian times the gradient:

x_new = x_old - H⁻¹ · ∇f(x_old)

Why? The Hessian tells you the "shape" of the local bowl. If the bowl is wide (gentle curvature), the Hessian says "take a bigger step." If it's narrow (steep curvature), it says "take a smaller step." Newton's method adjusts the step size and direction to match the local geometry, which means it can converge in far fewer steps — often in a single step for quadratic functions, since the quadratic approximation is exact.

For a pure quadratic like f(x) = (1/2)x^T A x − b^T x, Newton's method finds the exact minimum in one step, regardless of how badly conditioned A is. Gradient descent on the same problem might need thousands of steps if A has a wide spread of eigenvalues. This is the appeal of second-order methods: they're immune to the ill-conditioning that plagues first-order methods.
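
Here's a sketch of that contrast on an ill-conditioned quadratic (the matrix and step counts are illustrative choices):

import numpy as np

# f(x) = 0.5 * x^T A x - b^T x, with gradient A x - b and Hessian A
A = np.diag([1.0, 1000.0])             # condition number 1000
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)         # true minimum

# Newton's method: one step, regardless of conditioning
x = np.zeros(2)
x = x - np.linalg.solve(A, A @ x - b)  # H^{-1} grad, without forming H^{-1}
print(np.linalg.norm(x - x_star))      # 0.0 — exact in a single step

# Gradient descent: step size capped by the steepest direction
x, lr = np.zeros(2), 0.0019            # must stay below 2/1000 for stability
for _ in range(500):
    x = x - lr * (A @ x - b)
print(np.linalg.norm(x - x_star))      # ~0.39 — still far after 500 steps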

The problem is cost. The Hessian for a model with n parameters is an n × n matrix. For a network with 10 million parameters, that's 10^14 entries — roughly 400 terabytes at four bytes per entry. You can't compute it, store it, or invert it. Even approximate methods struggle with this scale.

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is the most practical second-order-ish method. It doesn't compute the Hessian at all. Instead, it approximates the inverse Hessian using the last m gradient evaluations (typically m = 5-20). The memory cost is O(mn) instead of O(n²), which is manageable. L-BFGS works well for problems where the number of parameters is moderate (thousands to low millions) and the function is smooth. It's the engine behind scipy's minimize when you pass method='L-BFGS-B', and it works beautifully for logistic regression, CRFs, and other classical models. But for deep learning with hundreds of millions of parameters and noisy mini-batch gradients, L-BFGS doesn't shine.
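
Using L-BFGS through scipy is a one-liner. A sketch on the classic Rosenbrock function, which scipy ships along with its analytic gradient:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)                       # start far from the minimum at (1, ..., 1)
result = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B')

print(result.x)      # ~[1, 1, 1, 1, 1]
print(result.nit)    # converges in a few dozen iterations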

K-FAC (Kronecker-Factored Approximate Curvature, Martens and Grosse, 2015) is designed specifically for neural networks. It approximates the Fisher information matrix (closely related to the Hessian for neural networks) using Kronecker products of smaller matrices, one per layer. This reduces the n² cost to something tractable. K-FAC has shown speedups over SGD on some benchmarks, but it adds significant implementation complexity and memory overhead. It's an active research area, not a mainstream tool.

The bottom line is this: first-order methods dominate deep learning because the Hessian is too expensive to compute, and the noise in mini-batch gradients undermines the accuracy of second-order approximations anyway. Adam is, in a sense, a cheap second-order approximation — its second-moment estimate v tracks the gradient scale of each parameter independently, a rough diagonal proxy for curvature. It captures the most important curvature information at negligible cost. The dream of practical second-order deep learning optimization remains open.

Wrap-Up

Thank you for sticking with this. We've covered a lot of ground — literally and figuratively, given all the landscape metaphors.

We started with the simplest possible picture: a blindfolded hiker feeling the slope, stepping downhill. That's gradient descent. We wrestled with the learning rate — that maddening single number that controls everything. We saw that convex landscapes are the friendly ones (one valley, guaranteed convergence) and that deep learning's non-convex landscapes are full of saddle points and deceptive valleys.

Momentum gave our hiker inertia — the ability to build speed on consistent slopes and coast through flat regions. Mini-batch SGD added beneficial noise, turning our hiker's blurry vision into a feature that helps find flat, generalizable minima. Adam combined momentum with per-parameter learning rates, and AdamW fixed a subtle bug that made Adam worse at generalizing than SGD.

We explored learning rate schedules (warmup for transformers, cosine annealing for smooth convergence), the surprising dominance of saddle points over local minima in high dimensions, and what loss landscapes actually look like when you project them into 2D. Constrained optimization gave us Lagrangians, KKT conditions, duality, and a connection to SVMs. Second-order methods showed us what we're missing (curvature information) and why we can't afford it (the Hessian is too big).

My hope is that the next time your training run goes sideways — the loss plateaus, the gradients explode, the model won't converge — you'll have the vocabulary and the mental models to diagnose what's happening. Not just "try a different learning rate" but "is this a saddle point problem? A curvature problem? A sharp-minima problem?" That's the difference between tuning and understanding.

Resources

Ruder's overview of gradient descent optimizers — Sebastian Ruder wrote the definitive blog-post-length survey. It covers SGD, momentum, AdaGrad, RMSprop, Adam, and their variants, with animations that make the differences visceral. Start here if you want a broader view than this section provides.

"Visualizing the Loss Landscape of Neural Nets" (Li et al., 2018) — The paper that introduced filter-normalized visualizations. The figures alone are worth the read. Seeing a ResNet's smooth landscape next to a VGG's chaotic one drives the point home better than any argument.

"Identifying and attacking the saddle point problem in high-dimensional non-convex optimization" (Dauphin et al., 2014) — The paper that shifted the community's attention from local minima to saddle points. Dense but rewarding. The argument about random matrix theory and the index of critical points is beautifully constructed.

Boyd and Vandenberghe, "Convex Optimization" (2004) — The bible of convex optimization, and it's free online. Heavier than what most ML practitioners need, but chapters 1-5 give you the theoretical foundation for everything in the constrained optimization section above. The exercises are genuinely enlightening.

"Decoupled Weight Decay Regularization" (Loshchilov and Hutter, 2019) — The AdamW paper. Short, clear, and practically important. Understanding the difference between L2 regularization and weight decay — and why they're not the same in adaptive optimizers — is one of those details that matters more than it seems.

"Super-Convergence" (Smith and Topin, 2018) — The one-cycle learning rate policy paper. Less theoretical, more practical. If you want to train models faster with less tuning, the techniques here are directly applicable.