Calculus

Chapter 2: Mathematical Foundations
TL;DR

Every neural network learns by nudging its parameters in the direction that makes the loss smaller. That direction comes from derivatives. The chain rule lets you propagate that nudge-signal backward through an entire network — and that is backpropagation. Gradients point uphill on a loss surface; we walk downhill. Taylor expansions justify why gradient descent works at all: we trust a local linear approximation of the loss. Reverse-mode automatic differentiation makes the whole thing computationally tractable, letting PyTorch differentiate through millions of parameters in a single backward pass. This section builds all of that from a single, tiny example.

I avoided calculus for longer than I'd like to admit. All through my early ML work, I treated loss.backward() as a magic incantation — type it, gradients appear, don't ask questions. Every time someone derived something on a whiteboard with ∂ symbols, I'd nod along and quietly move on. Eventually the discomfort of not knowing what was actually happening under the hood grew too great to ignore. This section is the dive I finally took.

Calculus is the branch of mathematics concerned with rates of change (derivatives) and accumulation (integrals). Newton and Leibniz developed it independently in the late 1600s. In ML, it's the machinery that makes learning possible — it tells every parameter which direction to move and by how much.

Before we start, a heads-up. We're going to be computing derivatives, building up to gradients, and eventually deriving the backpropagation algorithm from scratch. You don't need to remember any calculus from school. We'll add what we need, one piece at a time.

This isn't a short journey, but I hope you'll be glad you came.

The Nudge Test: What Is a Derivative?

Why ML Cares: The Loss Is a Function of the Weights

Differentiation Rules That Actually Matter

The Chain Rule — The Single Most Important Idea

Partial Derivatives and the Gradient Vector

Jacobians and Hessians: When One Number Isn't Enough

The Functions You'll Meet Everywhere: Sigmoid, ReLU, Tanh, Softmax

Integrals: The Other Half of Calculus

Taylor Series → Gradient Descent: Calculus Produces an Algorithm

Rest Stop

Backpropagation: The Chain Rule, Industrialized

When Gradients Go Wrong: Vanishing and Exploding

Automatic Differentiation: How the Machine Actually Does It

The Softmax + Cross-Entropy Miracle

Advanced Territory: Neural ODEs, Variational Calculus, and Beyond

Resources and Credits

The Nudge Test: What Is a Derivative?

Imagine you're building a model that predicts house prices from square footage. You have one weight, w. The model's prediction is: predicted_price = w × square_footage. You pick an initial weight — say, w = 100 — and the prediction is way off. The loss (the gap between your prediction and reality) is some number, say 5,000.

Now nudge w. Increase it by a tiny amount — from 100 to 100.001. Run the prediction again. The loss changes to, say, 4,998. That nudge of 0.001 in w caused a change of −2 in the loss. The ratio of those two changes — −2 / 0.001 = −2,000, how much the output wobbled per unit of input wobble — is (approximately) the derivative. Formally, if the loss is a function L(w), the derivative is L'(w) = lim(h→0) [L(w+h) − L(w)] / h. Geometrically, it's the slope of the tangent line at your current position.
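
Here's the nudge test as runnable code — a minimal sketch with its own invented numbers (a 1,200-square-foot house and a true price of 180,000; the loss values in the prose above were also made up):

def loss(w, sqft=1200.0, true_price=180000.0):
    """Squared error of the one-weight model: predicted = w * sqft."""
    predicted = w * sqft
    return (predicted - true_price) ** 2

w = 100.0
h = 0.001  # the nudge

# Poke the knob, watch the output, take the ratio
derivative = (loss(w + h) - loss(w)) / h
print(derivative)  # ≈ -1.44e8: the loss drops steeply as w increases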

I think of this as the nudge test. You poke one knob, you watch what happens to the output. The derivative is the sensitivity of the output to your poke. A large positive derivative means "when I turn this knob up, the loss shoots up — so I should turn it down." A negative derivative means the opposite. A derivative near zero means the loss barely notices this knob right now.

Why ML Cares: The Loss Is a Function of the Weights

That nudge test is the entire reason ML can learn. Training a neural network means minimizing a loss function that depends on millions of parameters. The derivative of the loss with respect to each parameter tells you exactly which direction to adjust it. Without derivatives, you'd be blind — randomly flailing knobs and hoping the loss goes down.

But there's a problem. Our model has one weight. A real network has millions. And the nudge test, one parameter at a time, would take forever. We need faster tools.

Differentiation Rules That Actually Matter

You don't need to go back to the limit definition every time you need a derivative. A handful of rules cover almost everything you'll encounter. The power rule says: if f(x) = x^n, then f'(x) = n × x^(n−1). The derivative of x³ is 3x². The derivative of x (which is x¹) is 1. The derivative of a constant is zero — constants don't move when you nudge the input. Back to our house price model: if the loss happened to be L(w) = (w − 150)², the power rule (plus the chain rule, which we'll get to) gives L'(w) = 2(w − 150). At w = 100, that's −100, a strong signal to increase w.

The product rule handles multiplication of two functions: if f(x) = g(x) × h(x), then f'(x) = g'(x)×h(x) + g(x)×h'(x). You differentiate one factor at a time and add the results. The quotient rule handles division, but in practice, you can rewrite a quotient as a product with a negative exponent and use the product rule instead. These rules are the vocabulary of differentiation — small, learnable, and enough to disassemble nearly any function you'll meet in ML.
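
You can check any of these rules mechanically with a symbolic math library. A quick sketch using SymPy (which comes up again in the automatic differentiation section):

import sympy as sp

w, x = sp.symbols('w x')

print(sp.diff(x**3, x))              # 3*x**2 — the power rule
print(sp.diff((w - 150)**2, w))      # 2*w - 300, i.e. 2(w - 150) — power + chain rule
print(sp.diff(x**2 * sp.sin(x), x))  # x**2*cos(x) + 2*x*sin(x) — the product rule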

But here's what these rules can't do alone: they handle single-layer functions. Our house price model is one multiplication. A neural network is dozens of functions nested inside each other, layer after layer. To differentiate through that nesting, we need something more powerful.

The Chain Rule — The Single Most Important Idea

I'll be honest: for the longest time, I thought backpropagation was some separate, clever algorithm that someone invented. I didn't realize it was the chain rule. The same chain rule from introductory calculus. That realization changed everything for me.

Let's build up to it. Extend our house model so it has two steps. First, we compute a raw score: z = w × square_footage. Then we squash it through an activation function to get our final output: a = sigmoid(z). The loss depends on a, which depends on z, which depends on w. Functions nested inside functions — a composition.

The chain rule tells you how to differentiate a composition. If y = f(g(x)), then dy/dx = f'(g(x)) × g'(x). You differentiate the outer function evaluated at the inner function, then multiply by the derivative of the inner function. In Leibniz notation: dy/dx = (dy/du) × (du/dx), where u = g(x). It's like asking: "how sensitive is the final output to my knob?" You break the question into hops. How sensitive is the output to the intermediate value? How sensitive is the intermediate value to my knob? Multiply those sensitivities together.

For our two-step model: dL/dw = dL/da × da/dz × dz/dw. Three factors, each a local derivative that's easy to compute. dL/da tells you how the loss reacts to the activation output. da/dz tells you how the activation reacts to its input. dz/dw tells you how the raw score reacts to the weight. The chain rule multiplies these local sensitivities into the global sensitivity — how the loss at the very end reacts to the weight at the very beginning.
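
Here are those three hops as code — a minimal sketch with invented numbers (square footage scaled down to 1.2 so the sigmoid doesn't saturate, and a squared-error loss so every local derivative stays simple):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, sqft, target = 0.5, 1.2, 0.8

# Forward pass: w -> z -> a -> L
z = w * sqft
a = sigmoid(z)
L = (a - target) ** 2

# Backward pass: three local sensitivities, multiplied
dL_da = 2 * (a - target)   # how the loss reacts to the activation
da_dz = a * (1 - a)        # sigmoid's derivative (derived in a later section)
dz_dw = sqft               # how the raw score reacts to the weight
dL_dw = dL_da * da_dz * dz_dw

# Sanity check against the nudge test
h = 1e-6
numerical = ((sigmoid((w + h) * sqft) - target) ** 2 - L) / h
print(dL_dw, numerical)  # the two agree to several decimal places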

Now scale this up. A neural network with 50 layers is 50 functions nested inside each other, with a loss function on top. The chain rule lets you peel them apart, one layer at a time, multiplying local derivatives as you go. That multiplication — from the output all the way back to each parameter — is backpropagation. Nothing more, nothing less. The chain rule, applied systematically from the end back to the beginning. We'll formalize backpropagation later, but the essence is right here.

The limitation is subtle but important: the chain rule gives you exact derivatives, but it says nothing about how those derivatives behave when you multiply dozens of them together. That question — what happens when you chain 50 or 100 local derivatives — is the source of some of deep learning's worst headaches. We'll get there.

Partial Derivatives and the Gradient Vector

Our house price model has been limping along with one weight. Time to add a second feature: number of bedrooms. Now the prediction is predicted_price = w₁ × square_footage + w₂ × bedrooms. Two knobs to tune. The loss depends on both: L(w₁, w₂).

A partial derivative is the nudge test applied to one knob while holding all others fixed. ∂L/∂w₁ means: nudge w₁ by a tiny amount, keep w₂ frozen, and measure how the loss changes. You compute it exactly like an ordinary derivative, pretending the other variable is a constant. If L(w₁, w₂) = (w₁×1200 + w₂×3 − 350000)², then ∂L/∂w₁ = 2×1200×(w₁×1200 + w₂×3 − 350000). The partial derivative for w₂ is similar, but with a 3 instead of 1200. Notice w₁'s partial derivative is much larger in magnitude — the loss is far more sensitive to the square footage weight than the bedroom weight. That makes intuitive sense: a unit nudge to w₁ moves the prediction by 1,200, while the same nudge to w₂ moves it by only 3.

Now stack all the partial derivatives into a vector: ∇L = [∂L/∂w₁, ∂L/∂w₂]. That vector is the gradient. It points in the direction of steepest ascent of the loss function — the direction that makes the loss increase fastest. Its magnitude tells you how steep the slope is.

Here's the key insight for optimization: to minimize the loss, walk in the opposite direction of the gradient. If the gradient points northeast, you step southwest. That's gradient descent in one sentence. A directional derivative tells you the rate of change in any direction — it's the dot product of the gradient with your chosen direction vector. The gradient direction itself gives the maximum directional derivative. It's the steepest way uphill, which is why the negative gradient is the steepest way downhill.
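
A small numerical sketch of that claim, reusing the two-weight loss from above: the rate of change along any unit direction d is the dot product ∇L · d, and no choice of d beats the negative gradient.

import numpy as np

# L(w1, w2) = (w1*1200 + w2*3 - 350000)^2 at an arbitrary starting point
w = np.array([100.0, 10.0])
residual = w[0]*1200 + w[1]*3 - 350000
g = np.array([2*1200*residual, 2*3*residual])  # the gradient

for name, d in [("negative gradient", -g / np.linalg.norm(g)),
                ("northeast", np.array([1.0, 1.0]) / np.sqrt(2)),
                ("only w2", np.array([0.0, 1.0]))]:
    print(f"{name}: {np.dot(g, d):.3e}")  # directional derivative along d
# The negative-gradient direction yields the most negative rate of change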

I like to think of this as hiking in fog. You can't see the valley floor, but you can feel the slope under your feet. The gradient is that slope. Step downhill, feel the new slope, step again. Eventually, you reach the bottom. This hiking analogy will come back — the terrain of the loss function turns out to be far more interesting than a simple valley.

But our gradient has two components. A real network has millions. Computing the gradient means computing millions of partial derivatives. That sounds brutally expensive — and it would be, if we did each one separately. The chain rule will rescue us, but first we need to talk about what happens when the output isn't a single number.

Jacobians and Hessians: When One Number Isn't Enough

So far, our loss has been a single number — a scalar. The gradient is a vector of partial derivatives, one per parameter. But what about a layer that takes a vector in and produces a vector out? The softmax layer, for instance, takes a vector of raw scores and produces a vector of probabilities. Every output depends on every input. One number can't capture that. You need a matrix.

The Jacobian matrix is that matrix. If a function f maps n inputs to m outputs, the Jacobian is an m×n matrix where entry (i,j) is ∂f_i/∂x_j — how much output i changes when you nudge input j. Each row is the gradient of one output component. During backpropagation, when you need to push gradients backward through a vector-valued layer, you multiply by the Jacobian (or, more precisely, its transpose). In practice, frameworks never build the full Jacobian explicitly — they compute vector-Jacobian products (VJPs), producing the product of the incoming gradient with the Jacobian directly, without ever materializing the matrix. This is one of those engineering decisions that makes deep learning at scale possible.
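
For a layer small enough to inspect, you can build the full Jacobian and look at it. A sketch using PyTorch's torch.autograd.functional.jacobian on a three-class softmax (fine at this size; real frameworks stick to VJPs):

import torch

z = torch.tensor([1.0, 2.0, 3.0])
J = torch.autograd.functional.jacobian(lambda v: torch.softmax(v, dim=0), z)
print(J)             # 3×3 matrix: entry (i, j) is ∂softmax_i / ∂z_j
print(J.sum(dim=0))  # each column sums to ~0, because the outputs always sum to 1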

The Hessian matrix captures something different: curvature. Its entry (i,j) is ∂²L/(∂w_i ∂w_j) — a second-order partial derivative. Where the gradient tells you the slope of the terrain, the Hessian tells you how the slope itself is changing. Is the ground curving into a bowl (local minimum)? A ridge (saddle point)? A dome (local maximum)? The eigenvalues of the Hessian answer this: all positive means a bowl, all negative means a dome, and a mix of positive and negative means a saddle point.

I'll be honest — I still find Hessians intimidating. They're n×n for n parameters, so for a model with 100 million parameters, the Hessian has 10¹⁶ entries. You cannot store it, let alone invert it. This is why deep learning almost exclusively uses first-order methods — the gradient is O(n) and that's all you can afford. Second-order information (curvature) is strictly better in theory, giving you both direction and appropriate step size. But the computational cost makes it a luxury. Approximations like L-BFGS and Kronecker-factored methods try to capture some curvature cheaply, and they work well for smaller models. For modern deep networks, the gradient alone has to be enough.
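
For a function tiny enough to afford it, though, the Hessian's eigenvalues really do read the terrain off directly. A sketch classifying the textbook saddle f(x, y) = x² − y²:

import torch

def f(v):
    return v[0]**2 - v[1]**2  # bowl along x, dome along y

v = torch.tensor([0.0, 0.0])
H = torch.autograd.functional.hessian(f, v)
print(H)                         # [[2, 0], [0, -2]]
print(torch.linalg.eigvalsh(H))  # one negative, one positive eigenvalue -> saddle point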

The Functions You'll Meet Everywhere

Before we can talk about backpropagation through a real network, we need to know the derivatives of the functions that networks are actually made of. There are only a handful, and each one has a story.

Sigmoid

The sigmoid function, σ(z) = 1/(1 + exp(−z)), squashes any real number into the range (0, 1). For years it was the default activation function — its output looks like a probability, which felt right. Its derivative has an elegant property: σ'(z) = σ(z) × (1 − σ(z)). The derivative is expressed entirely in terms of the function value you already computed during the forward pass. That's computationally convenient — the backward pass is nearly free.

But look at the numbers. The maximum value of σ(z) × (1 − σ(z)) is 0.25, which occurs at z = 0. For any other input, the derivative is smaller. Now imagine stacking 20 sigmoid layers. The gradient flowing backward is a product of 20 numbers, each at most 0.25. That product is 0.25²⁰ ≈ 10⁻¹². The gradient effectively disappears. This is the vanishing gradient problem, and it's why sigmoid fell out of favor for hidden layers. It still shows up at the output layer for binary classification, where you genuinely want a probability.
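
The arithmetic of that disappearance takes a few lines to verify (a quick sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 10001)
print((sigmoid(z) * (1 - sigmoid(z))).max())  # 0.25, attained at z = 0

# Best case for a gradient surviving 20 stacked sigmoid layers:
print(0.25 ** 20)  # ~9.1e-13 — effectively zero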

ReLU

The ReLU (Rectified Linear Unit) is f(z) = max(0, z). Its derivative is 1 when z > 0 and 0 when z < 0. (At z = 0 exactly, the derivative is technically undefined, but implementations pick 0 or 1 — it doesn't matter since z landing exactly on zero has probability zero for continuous inputs.) ReLU was the fix for vanishing gradients. When a neuron is active (z > 0), the gradient passes through completely unattenuated — multiplied by 1. No shrinkage at all.

The tradeoff: if a neuron's input is consistently negative, its gradient is always zero and the neuron stops learning entirely. It's dead. This is the dying ReLU problem. Variants like Leaky ReLU (which uses a small positive slope, say 0.01, for negative inputs) keep a whisper of gradient flowing even for inactive neurons, keeping them alive. In our hiking analogy, ReLU makes the trail either flat or sloped — never the gradual leveling-off that sigmoid creates.

Tanh

The tanh function maps inputs to (−1, 1), centered at zero. Its derivative is tanh'(z) = 1 − tanh(z)². Like sigmoid, it saturates at extreme values and causes vanishing gradients. But because its output is zero-centered — negative inputs produce negative outputs, positive inputs produce positive outputs — it helps gradient flow compared to sigmoid, whose output is always positive. Tanh was the go-to activation before ReLU took over. You'll still find it inside LSTM and GRU gates, where the (−1, 1) range has a specific purpose.

Softmax

The softmax function converts a vector of raw scores (called logits) into a probability distribution. For a vector z of length K: softmax(z)_i = exp(z_i) / Σ_j exp(z_j). Every output is positive, and they sum to one. The Jacobian of softmax is a dense K×K matrix: ∂softmax_i/∂z_j = softmax_i × (δ_ij − softmax_j), where δ_ij is 1 when i = j and 0 otherwise. In practice, you almost never compute this Jacobian directly, because softmax is nearly always paired with cross-entropy loss, and their combined gradient collapses into something beautifully simple. We'll derive that later — it's one of the most satisfying results in all of ML calculus.

One practical hazard: computing exp(z_i) for large z_i causes overflow. The log-sum-exp trick fixes this by subtracting the max before exponentiating: log(Σ exp(z_i)) = max(z) + log(Σ exp(z_i − max(z))). After subtraction, the largest exponent is zero, so nothing overflows. Every production framework does this internally.
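
The hazard and the fix, side by side (a sketch — production frameworks bake the stable version into their softmax and log-softmax ops):

import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - z.max())  # largest exponent becomes 0 — nothing overflows
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(z))   # [nan nan nan] — exp(1000) overflows to infinity
print(softmax_stable(z))  # [0.090 0.245 0.665] — a well-defined distribution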

Integrals: The Other Half of Calculus

Derivatives tell you about rates of change. Integrals tell you about accumulation — adding up infinitely many infinitely small pieces. The definite integral ∫[a,b] f(x) dx computes the area under the curve f(x) between a and b. If derivatives are the speedometer, integrals are the odometer.

ML uses integrals constantly, though you may not see them written out. Every probability density function (PDF) must integrate to 1 over its entire domain — that's what makes it a valid distribution. The probability that a random variable lands in some range is the integral of the PDF over that range. The expected value of a function g(X) under distribution p(x) is E[g(X)] = ∫ g(x) p(x) dx. Loss functions in probabilistic models are expectations. Marginal distributions require integrating out variables. Bayesian inference requires integrating over the entire parameter space to compute the evidence.

The Fundamental Theorem of Calculus connects the two halves: differentiation and integration are inverse operations. If you can find a function whose derivative equals the integrand, you can compute the integral exactly by evaluating that function at the endpoints. When such a closed-form antiderivative exists, integration is straightforward. When it doesn't — and in ML, it often doesn't — you resort to Monte Carlo approximation: replace the integral with an average over random samples. This is the workhorse behind variational inference, reinforcement learning policy gradients, and much of probabilistic ML.
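
Monte Carlo in miniature: to estimate E[g(X)] = ∫ g(x) p(x) dx, draw samples from p and average g over them. A sketch estimating E[X²] under a standard normal, where the exact answer is 1 (the variance):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

estimate = (samples ** 2).mean()  # the integral becomes an average over samples
print(estimate)                   # ≈ 1.0, tightening as the sample count grows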

In higher dimensions, multiple integrals compute volumes and probabilities over joint distributions. To get the marginal p(x) from a joint p(x, y), you integrate out y: p(x) = ∫ p(x, y) dy. In Bayesian inference, computing the posterior means integrating over a parameter space that might have millions of dimensions. The cost grows exponentially with dimension — the curse of dimensionality. This is the core challenge that drives variational inference and Markov Chain Monte Carlo methods. It's worth noting that derivatives dominate the day-to-day practice of training neural networks, while integrals dominate the theory of probabilistic models. Both halves matter, but for different reasons.

Taylor Series → Gradient Descent: Calculus Produces an Algorithm

This is where all the pieces come together, and calculus directly hands us an algorithm.

A Taylor series approximates a function as a sum of polynomial terms built from its derivatives at a point: f(x) ≈ f(a) + f'(a)(x−a) + f''(a)(x−a)²/2! + ... The more terms, the better the approximation near a. We only need the first two terms — the linear approximation.

Take the loss function L(w). We're standing at our current weights w and wondering where to step next. The first-order Taylor expansion says: L(w + δ) ≈ L(w) + ∇L(w) · δ. The loss at a nearby point is approximately the current loss plus the dot product of the gradient with our step. We want to choose δ so that L(w + δ) is as small as possible. The dot product ∇L(w) · δ is most negative when δ points in the opposite direction of the gradient. So we set δ = −α × ∇L(w), where α is a small positive number called the learning rate.

The update rule becomes: w_new = w − α × ∇L(w). That's gradient descent. Derived directly from a linear approximation of the loss.

The learning rate α controls how far we trust that linear approximation. Too large and we step beyond where the approximation is valid — the loss might actually increase instead of decrease, and training oscillates or diverges. Too small and we inch forward painfully, wasting thousands of iterations. Think of it this way: the Taylor expansion is a tangent plane touching the loss surface at one point. That plane is a good guide for a few steps in any direction, but wander too far and the real surface curves away from the plane. The learning rate keeps you within the zone where the approximation holds.

import numpy as np

def gradient_descent(grad_f, x0, lr=0.01, steps=100):
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad_f(x)  # step against the gradient
    return x

# Our house price model: L(w1, w2) = (w1*1200 + w2*3 - 350000)^2
def grad_L(w):
    residual = w[0]*1200 + w[1]*3 - 350000
    return np.array([2*1200*residual, 2*3*residual])

# The learning rate must respect the curvature: the dominant Hessian
# eigenvalue here is 2*(1200^2 + 3^2) ≈ 2.9e6, so lr has to stay below
# 2/2.9e6 ≈ 7e-7 or the iterates diverge.
w_start = np.array([0.0, 0.0])
w_final = gradient_descent(grad_L, w_start, lr=1e-7, steps=1000)
print(f"Learned weights: w1={w_final[0]:.1f}, w2={w_final[1]:.1f}")
# Prints w1=291.7, w2=0.7 — w1 dominates because the loss is far more
# sensitive to the square footage weight

The second-order Taylor expansion adds the curvature term: L(w + δ) ≈ L(w) + ∇L · δ + ½ δᵀHδ, where H is the Hessian. Setting the gradient of this quadratic to zero gives Newton's method: δ = −H⁻¹ × ∇L. Newton's method accounts for curvature and converges much faster — quadratically near the optimum, compared to gradient descent's linear convergence. But inverting the Hessian costs O(n³) for n parameters. For a network with 100 million parameters, that's 10²⁴ operations. Unthinkable. This is why deep learning stops at first-order methods. The gradient is O(n) — that's all you can afford at scale. Quasi-Newton methods like L-BFGS approximate the Hessian cheaply and work beautifully for smaller models, like logistic regression or small neural networks.
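
On a quadratic loss, Newton's method is almost anticlimactic: the second-order Taylor expansion is exact, so a single step lands on the minimum. A one-dimensional sketch with our earlier L(w) = (w − 150)²:

w = 100.0
grad = 2 * (w - 150)  # L'(w) = -100
hess = 2.0            # L''(w): the 1-D "Hessian"
w = w - grad / hess   # Newton step: delta = -H⁻¹ × gradient
print(w)              # 150.0 — the exact minimum, in one step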

Convex Functions: The Easy Case

A function is convex if the line segment connecting any two points on its graph lies on or above the graph. Picture a bowl. Convex functions have a property that makes optimization almost trivial: every local minimum is a global minimum. There's only one bowl, and gradient descent will find its bottom. Linear regression's squared loss is convex. Logistic regression's loss is convex. This is why these models train reliably — you can't get trapped.

Neural network losses are emphatically not convex. The loss landscape has countless saddle points, flat regions, and local minima. Yet gradient descent still finds good solutions. Understanding convexity helps you appreciate why: classical ML problems are easy because they're convex. Neural networks work despite non-convexity, and the reasons why — wide minima, high-dimensional geometry, stochastic noise — remain an active area of research. I'm still developing my intuition for why large networks seem to find good solutions so reliably. No one has a complete answer.

Constrained Optimization

Lagrange multipliers handle optimization with constraints. When you want to minimize f(x) subject to g(x) = 0, you introduce a new variable λ and find points where ∇f = λ × ∇g. The geometric picture: at the optimum, the gradient of the objective is parallel to the gradient of the constraint. If they weren't parallel, you could slide along the constraint and still improve the objective. In ML, Lagrange multipliers are the backbone of the support vector machine derivation — the entire SVM dual formulation falls out of this framework. The extension to inequality constraints gives the Karush-Kuhn-Tucker (KKT) conditions, which add complementary slackness: each inequality constraint's multiplier is nonnegative and can be nonzero only while the constraint is active; if the constraint is slack, the multiplier must be zero.
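
A minimal sketch of the machinery on a toy problem: minimize f(x, y) = x² + y² subject to x + y = 1. Solving the stationarity condition ∇f = λ∇g together with the constraint (here symbolically, with SymPy):

import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = x**2 + y**2  # objective
g = x + y - 1    # constraint, written as g = 0

equations = [sp.diff(f, x) - lam * sp.diff(g, x),  # ∂f/∂x = λ ∂g/∂x
             sp.diff(f, y) - lam * sp.diff(g, y),  # ∂f/∂y = λ ∂g/∂y
             g]                                    # the constraint itself
print(sp.solve(equations, [x, y, lam]))  # {x: 1/2, y: 1/2, lam: 1}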

Rest Stop

Congratulations on making it this far. You can stop here if you want.

You now have a mental model that covers the core of calculus for ML: derivatives measure sensitivity. The chain rule lets you differentiate through compositions — which is what neural networks are. The gradient is a vector of partial derivatives that points uphill; you walk downhill. A first-order Taylor expansion directly produces the gradient descent algorithm. And the learning rate controls how far you trust that local approximation.

That's a solid foundation. It won't tell you how backpropagation is actually implemented in PyTorch, or why gradients sometimes explode or vanish, or how frameworks differentiate through millions of parameters in seconds. But it's enough to understand the core loop of training a neural network.

The short version of what's ahead: backpropagation is the chain rule applied to a computational graph, reverse-mode autodiff is what makes it fast, and the softmax + cross-entropy gradient simplifies to ŷ − y, which is one of the most elegant results in all of ML. There. You're 70% of the way.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Backpropagation: The Chain Rule, Industrialized

Chain Rule Through Layers

Backpropagation is the algorithm for computing the gradient of the loss with respect to every parameter in a neural network. Despite the mystique, it's the chain rule applied systematically — nothing more. Consider a network with L layers. The forward pass computes a chain: z₁ = W₁x + b₁, a₁ = activation(z₁), z₂ = W₂a₁ + b₂, a₂ = activation(z₂), and so on, up to the loss. Each layer is a function whose output feeds into the next.

The backward pass starts at the loss and works in reverse. At each layer, you compute two things: the local gradient with respect to the layer's parameters (which you store for the update step), and the local gradient with respect to the layer's input (which you pass backward to the previous layer). The incoming gradient from above gets multiplied by each local gradient — that's the chain rule doing its thing. When you reach the bottom, every parameter has its gradient.

The remarkable thing is the cost. One forward pass to compute the loss, one backward pass to compute all the gradients. The backward pass costs a small constant multiple of the forward pass — roughly 2–4× — regardless of how many parameters the network has. For a network with 100 million parameters, that's 100 million gradients in one shot. Compare this to the nudge test, which would require 100 million separate forward passes.
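
Here is that two-quantities-per-layer bookkeeping in full, for the smallest network that shows it — a minimal sketch (one hidden layer, invented shapes and data, squared-error loss on a linear output):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 3 hidden units (sigmoid) -> 1 output
x = rng.standard_normal((2, 1))
y = np.array([[0.5]])
W1, b1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
loss = ((z2 - y) ** 2).sum()

# Backward pass: per layer, one gradient to store, one to pass back
dz2 = 2 * (z2 - y)          # dL/dz2
dW2, db2 = dz2 @ a1.T, dz2  # stored for the update step
da1 = W2.T @ dz2            # passed backward to layer 1
dz1 = da1 * a1 * (1 - a1)   # through sigmoid: multiply by its local derivative
dW1, db1 = dz1 @ x.T, dz1   # stored for the update step

print(dW1.shape, dW2.shape)  # (3, 2) (1, 3) — a gradient for every parameter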

Computational Graphs

Modern frameworks don't think of networks as sequences of layers. They think of the forward pass as a computational graph: a directed acyclic graph where each node is an elementary operation (add, multiply, exp, matrix multiply) and edges carry tensors. During the forward pass, the framework records every operation in this graph. During the backward pass, it traverses the graph in reverse, applying the chain rule at each node.

The original TensorFlow built the graph once and reused it — a static graph. You had to define the entire computation upfront, which made debugging painful. PyTorch took a different approach: dynamic graphs, rebuilt from scratch on every forward pass. This means you can use normal Python if-statements and for-loops, and the graph adapts. You can print intermediate values. You can set breakpoints. Dynamic graphs made prototyping so much easier that they became the standard — even TensorFlow eventually added an eager execution mode. The flexibility of dynamic graphs is one reason PyTorch dominates research.
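
The dynamic-graph point is easy to see firsthand: ordinary Python control flow participates in differentiation because the graph is re-recorded on every forward pass. A small sketch:

import torch

def forward(x):
    # A plain Python if-statement — the recorded graph differs per input
    if x.sum() > 0:
        return (x ** 2).sum()
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x).backward()
print(x.grad)  # tensor([2., 4.]) — the gradient of the branch actually taken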

When Gradients Go Wrong: Vanishing and Exploding

Remember our hiking analogy — following the slope downhill to find the valley? Now imagine the terrain is strange. In some directions, the slope gets flatter and flatter as you walk, until you can barely feel it under your feet. You're stuck on an almost-flat plateau, unable to tell which way is downhill. That's the vanishing gradient problem.

When a network has many layers, the gradient at each layer is a product of the local derivatives from every layer above it. If each local derivative is less than 1, the product shrinks exponentially. Sigmoid activations are the worst offender — their maximum derivative is 0.25. Twenty sigmoid layers give you a gradient multiplied by 0.25²⁰ ≈ 10⁻¹². The early layers receive a gradient signal so faint that they can't learn. The first time I saw my loss flatline after 10 epochs and couldn't figure out why, it was vanishing gradients. I spent hours checking my data pipeline before I thought to check the gradient magnitudes.

The fixes are now well-established. ReLU activations pass gradients through at full strength (multiplied by 1) when active. Residual connections add a shortcut that lets gradients skip layers entirely — they literally add a highway for the gradient signal. Think of it as adding an express lane to our hiking trail. Careful initialization (He initialization for ReLU, Xavier for tanh) sets the initial weights so that the variance of activations stays roughly constant across layers, preventing gradients from shrinking or growing right from the start.

The mirror-image problem is exploding gradients: if local derivatives exceed 1, the product grows exponentially. Weights get enormous updates, the loss spikes, and training collapses. The standard fix is gradient clipping: if the gradient's norm exceeds a threshold, scale it down proportionally before applying the update. This is particularly important in recurrent neural networks, where backpropagation through time (BPTT) unrolls the network across every timestep. A sequence of length 100 means the gradient passes through 100 copies of the same weight matrix — 100 matrix multiplications. If the largest eigenvalue of that weight matrix exceeds 1, the gradient explodes. If it's less than 1, the gradient vanishes. This razor-thin balance is why vanilla RNNs struggle with long sequences, and why LSTMs and GRUs were invented — their gating mechanisms explicitly control gradient flow, opening and closing gates to let information (and gradients) through selectively.
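
In PyTorch, clipping is one line between the backward pass and the optimizer step. A sketch (the tiny model, data, and threshold here are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = ((model(x) - y) ** 2).mean()

optimizer.zero_grad()
loss.backward()
# If the total gradient norm exceeds 1.0, scale every gradient down proportionally
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()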

Automatic Differentiation: How the Machine Actually Does It

For years I assumed PyTorch was doing symbolic math under the hood — rewriting my loss function as a formula and differentiating it the way you would on paper. I was wrong. And the truth is more clever.

There are three ways a computer can compute derivatives. Symbolic differentiation manipulates expressions using rules (power rule, chain rule) to produce a new expression for the derivative. SymPy and Mathematica do this. The result is exact, but expressions can explode in size — the derivative may be far more complex than the original function. And this approach breaks down entirely for functions defined by programs with loops and conditionals.

Numerical differentiation uses finite differences: f'(x) ≈ [f(x+h) − f(x−h)] / (2h). It's dead simple, works for any function you can evaluate, and requires no special implementation. But for n parameters, you need 2n function evaluations — two per parameter. For a model with a million parameters, that's two million forward passes. Unacceptable for training. But excellent for testing: you can use numerical gradients to verify that your analytical gradients are correct, a technique called gradient checking.

import numpy as np

def gradient_check(f, analytical_grad, x, h=1e-5):
    """Verify analytical gradients against numerical approximation."""
    numerical = np.zeros_like(x)
    for i in range(len(x)):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += h; x_minus[i] -= h
        numerical[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    
    analytical = analytical_grad(x)
    diff = np.linalg.norm(numerical - analytical) / (np.linalg.norm(numerical) + np.linalg.norm(analytical) + 1e-8)
    print(f"Relative difference: {diff:.2e}")  # Should be < 1e-5
    return diff

# Verify: f(x,y) = x^2 + 3xy. Gradient: [2x+3y, 3x]
f = lambda x: x[0]**2 + 3*x[0]*x[1]
grad_f = lambda x: np.array([2*x[0] + 3*x[1], 3*x[0]])
gradient_check(f, grad_f, np.array([1.0, 2.0]))

Forward-Mode vs Reverse-Mode

Automatic differentiation (AD) is the third approach and the one that actually powers deep learning. It doesn't manipulate expressions and it doesn't use finite differences. Instead, it decomposes the function into a sequence of elementary operations — add, multiply, exp, log — each of which has a known derivative. Then it applies the chain rule through this sequence. The result is exact (up to floating-point precision) and efficient.

Forward-mode AD propagates derivatives from inputs to outputs. For each input, you track both the value and its derivative through every operation. To get the derivative with respect to all n inputs, you need n forward passes. This is efficient when you have few inputs and many outputs.

Reverse-mode AD propagates derivatives backward from outputs to inputs. One backward pass gives you the derivative of the output with respect to every input — all of them, in one shot. If you have many inputs and one output — which is exactly the situation in ML, where you have millions of parameters and a single scalar loss — reverse mode is spectacularly efficient.

This asymmetry is the reason deep learning works at the scale it does. One forward pass computes the loss. One backward pass computes every gradient. The backward pass costs roughly 2–4× the forward pass, regardless of the number of parameters. A network with a billion parameters gets its full gradient in one backward pass. This is reverse-mode AD. This is what loss.backward() does in PyTorch. And this is why backpropagation, as an algorithm, is identical to reverse-mode AD applied to a computational graph.

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Forward pass — PyTorch silently builds the computational graph
loss = x[0]**2 + 3*x[0]*x[1]

# Backward pass — reverse-mode AD through the recorded graph
loss.backward()

print(f"Gradient: {x.grad}")  # tensor([8., 3.]) — both gradients in one pass
# Analytical: ∂L/∂x₁ = 2(1) + 3(2) = 8, ∂L/∂x₂ = 3(1) = 3

The same machinery scales without modification from a two-element toy to a real model:

import torch
import torch.nn as nn

# Back to our running example, now as a proper neural network
model = nn.Sequential(
    nn.Linear(2, 8),    # 2 features → 8 hidden units
    nn.ReLU(),
    nn.Linear(8, 1)     # 8 hidden → 1 output (price)
)

# Fake batch: 4 houses, 2 features each (sq_ft normalized, bedrooms)
x = torch.randn(4, 2)
y = torch.randn(4, 1)

pred = model(x)
loss = ((pred - y)**2).mean()
loss.backward()

# Every parameter now has a .grad — computed via reverse-mode AD
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, grad_norm={param.grad.norm():.4f}")

The Softmax + Cross-Entropy Miracle

This derivation is so central to classification that it deserves its own section. And it has my favorite ending of any derivation in ML.

You have a network that outputs raw logits z. Softmax converts them to predicted probabilities: ŷ_i = exp(z_i) / Σ_j exp(z_j). The true label is a one-hot vector y — all zeros except a 1 at the correct class k. The cross-entropy loss is L = −Σ_i y_i × log(ŷ_i). Since y is one-hot, this collapses to L = −log(ŷ_k): the negative log-probability assigned to the correct class. If the model is confident and correct, ŷ_k is close to 1 and the loss is close to 0. If the model puts low probability on the correct class, the loss is large.

Now for the gradient ∂L/∂z_i. You need to differentiate through the log and the softmax. This involves the softmax Jacobian, which is a dense K×K matrix — it sounds messy. But something beautiful happens. For the correct class: ∂L/∂z_k = ŷ_k − 1. For every other class: ∂L/∂z_i = ŷ_i. Combining with the one-hot vector: ∇_z L = ŷ − y.

The gradient of cross-entropy loss with respect to the logits is the predicted probability minus the true label. That's it. The Jacobian, the logarithm, the summation — everything cancels into ŷ − y. I'll be honest: the first time I worked through this derivation and saw it collapse, I didn't believe it. I re-derived it twice.

This elegance isn't an accident — the cross-entropy loss was designed to pair with softmax. And it's not just aesthetically pleasing; it's numerically essential. Computing log(softmax(z)) naively is catastrophically unstable — softmax can produce values near zero, and log of near-zero is negative infinity. Frameworks provide a combined log_softmax or cross_entropy that uses the log-sum-exp trick internally for stability. Always pass raw logits to the loss function. Never compute softmax yourself and then take the log. The combined implementation gives you the elegant ŷ − y gradient and numerical stability in one package.
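
You can watch the collapse happen numerically — a sketch comparing autograd's gradient of the combined loss against ŷ − y computed by hand:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
target = torch.tensor(0)  # the correct class is index 0

# cross_entropy takes raw logits and applies log-softmax internally
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

y_hat = torch.softmax(logits.detach(), dim=0)
y = torch.tensor([1.0, 0.0, 0.0])  # one-hot true label
print(logits.grad)  # autograd's answer
print(y_hat - y)    # identical: the whole derivative collapses to ŷ − y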

Advanced Territory

L2 Regularization as Weight Decay

L2 regularization adds a penalty (λ/2) × ||w||² to the loss. Take the derivative: the gradient of the regularized loss is ∇L(w) + λw. The update rule becomes w_new = w − α(∇L(w) + λw) = (1 − αλ)w − α∇L(w). That factor (1 − αλ) shrinks every weight slightly before each update — hence the name weight decay. It discourages any single weight from growing too large, acting as a soft constraint toward simpler models. For standard SGD, L2 regularization and weight decay are mathematically identical. For adaptive optimizers like Adam, they differ — which is why AdamW was introduced to decouple them.
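
The SGD equivalence is easy to confirm numerically. A sketch comparing PyTorch's weight_decay flag against the (1 − αλ)w − α∇L update applied by hand:

import torch
import torch.nn as nn

torch.manual_seed(0)
lam, lr = 0.1, 0.01
model = nn.Linear(3, 1)
w_before = model.weight.detach().clone()

x, y = torch.randn(8, 3), torch.randn(8, 1)
loss = ((model(x) - y) ** 2).mean()

opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=lam)
opt.zero_grad()
loss.backward()
grad = model.weight.grad.detach().clone()  # gradient of the data loss alone
opt.step()  # weight_decay adds lam * w to the gradient inside the step

manual = (1 - lr * lam) * w_before - lr * grad
print(torch.allclose(model.weight.detach(), manual))  # True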

OLS Closed-Form Solution

Our house price model, in its simplest linear form, has an exact solution. The ordinary least squares loss is ||Xw − y||². Take the gradient: ∇L = 2XᵀXw − 2Xᵀy. Set to zero: XᵀXw = Xᵀy. If XᵀX is invertible, the solution is w = (XᵀX)⁻¹Xᵀy — the normal equation. No gradient descent needed; one matrix inversion gives you the answer. But inverting XᵀX costs O(d³) for d features. When d is large, gradient descent with its O(nd) per-step cost is more practical. Closed-form solutions are elegant; iterative methods scale.
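
Both routes in a few lines — a sketch on made-up data (100 houses, 2 features, invented true weights):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
true_w = np.array([250.0, 10.0])
y = X @ true_w + rng.standard_normal(100)  # noisy prices

# Normal equation: solve XᵀXw = Xᵀy (a linear solve, not an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ≈ [250, 10] — recovered up to noise, no iteration needed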

Neural ODEs

Residual networks compute h_{t+1} = h_t + f(h_t), which is an Euler step for the ordinary differential equation dh/dt = f(h). Neural ODEs take this literally — they parameterize a continuous transformation and solve it with a numerical ODE solver. The backward pass uses the adjoint method, solving another ODE in reverse to compute gradients with constant memory cost regardless of depth. Neural ODEs enable continuous-depth networks, continuous normalizing flows, and time-series models with irregular sampling intervals.

Variational Calculus

Variational calculus extends optimization from finding optimal numbers to finding optimal functions. In ML, it appears in variational inference: approximating an intractable posterior p(z|x) with a simpler distribution q(z) by minimizing KL divergence, or equivalently maximizing the evidence lower bound (ELBO). Variational autoencoders (VAEs) use this, optimizing encoder and decoder jointly with the reparameterization trick to backpropagate through random sampling.

Information Geometry

Information geometry treats probability distributions as points on a curved surface (a manifold), with the Fisher information matrix as its metric. Natural gradient descent uses this metric to take steps that respect the geometry of distribution space rather than flat Euclidean space, making optimization invariant to how you parameterize the model. It's theoretically beautiful but computationally expensive — the Fisher matrix is as big as the Hessian.

Resources and Credits

If you're still with me, thank you. I hope it was worth it.

We started with the nudge test — poking one weight and watching the loss change. From there we built up the chain rule to handle compositions, gradients to handle multiple parameters, and Taylor expansions to derive gradient descent itself. Then we industrialized the chain rule into backpropagation, saw how gradients can vanish or explode through deep networks, and discovered that reverse-mode automatic differentiation is what makes computing gradients through millions of parameters feasible. Along the way, the softmax + cross-entropy gradient collapsed into ŷ − y, which still feels like a magic trick.

My hope is that the next time you see loss.backward(), instead of treating it as a black box, you'll picture the computational graph being traversed in reverse, the chain rule being applied at every node, and the gradient of every parameter being accumulated — having a pretty darn good mental model of what's going on under the hood.