Backpropagation
Backpropagation is the chain rule from calculus, applied backward through a neural network's computational graph. It answers one question: for each weight, how much did it contribute to the error? The answer comes in a single backward pass — no matter whether the network has nine parameters or nine billion. Each node computes a tiny local derivative, and multiplying these local derivatives together through the chain rule gives every weight its personal "blame signal." That signal is the gradient, and it's all the optimizer needs to know which direction to nudge. The algorithm is the reason deep learning works. Without something this efficient, training anything deeper than a toy network would be computationally impossible.
Why This Matters — The Problem That Demands Backprop
I avoided deriving backpropagation by hand for longer than I'd like to admit. Every time a textbook said "and then we apply the chain rule," I'd nod along and move on. I could write loss.backward() in PyTorch and watch the loss go down, and for a while that was enough. But there was always this nagging discomfort — I didn't truly know what was happening between the loss and the weight update. Finally, the discomfort grew too strong to ignore, and I took the dive. What follows is that dive, written down.
Backpropagation is the algorithm that computes the gradient of a loss function with respect to every parameter in a neural network. It was first described as reverse-mode automatic differentiation by Seppo Linnainmaa in 1970, applied to neural networks by Paul Werbos in his 1974 PhD thesis, and popularized by Rumelhart, Hinton, and Williams in 1986. Every neural network trained since — every ResNet, every Transformer, every GPT — uses it.
Before we start, a heads-up. We're going to be working through derivatives, the chain rule, and computational graphs. You don't need to remember any of it beforehand. We'll add what we need, one piece at a time.
This isn't a short journey, but I hope you'll be glad you came.
Contents
The Brute Force Approach (And Why It Fails)
Forward Pass: Building Something to Backpropagate Through
The Chain Rule — Signal Through Stages
Backward Pass: Tracing the Blame
The Universal Pattern
Rest Stop
Three Ways to Compute a Derivative
Computational Graphs and the Machinery of Autograd
Forward Mode vs. Reverse Mode — Why Reverse Wins
When the Signal Dies: Vanishing and Exploding Gradients
PyTorch Autograd in Practice
The Details That Bite You: Accumulation, Detaching, Checkpointing
What Backpropagation Does NOT Do
Wrap-Up
Resources
The Brute Force Approach (And Why It Fails)
Here's the situation. You have a neural network with millions of parameters. You feed in an input, get an output, and it's wrong. The loss function hands you a single number — how wrong. Now you need to answer the hardest operational question in deep learning: which of these millions of weights should change, and by how much?
The most obvious approach: wiggle each weight one at a time. Increase w₁ by a tiny amount, re-run the forward pass, see if the loss goes down. Then reset w₁ and repeat for w₂. Then w₃. This is called numerical differentiation — approximating the gradient by finite differences. For a network with n parameters, it requires n + 1 forward passes to get one gradient vector.
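Here's what that looks like in code, as a minimal sketch (the toy function and parameter values are mine, purely for illustration):

def numerical_gradient(f, params, eps=1e-5):
    """Finite-difference gradient: one extra forward pass per parameter."""
    base = f(params)                        # 1 baseline forward pass
    grad = []
    for i in range(len(params)):            # n more forward passes
        nudged = list(params)
        nudged[i] += eps
        grad.append((f(nudged) - base) / eps)
    return grad

# f(w) = w1*w2 + w3; the gradient at (2, 3, 1) is approximately (3, 2, 1)
print(numerical_gradient(lambda w: w[0] * w[1] + w[2], [2.0, 3.0, 1.0]))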
GPT-3 has 175 billion parameters. That's 175 billion forward passes per training step. A single forward pass takes seconds on a massive GPU cluster. The math doesn't work out. The universe doesn't have enough time.
Backpropagation computes the gradient of the loss with respect to every weight in the network in two passes — one forward, one backward. The total cost is roughly 2× a single forward pass, regardless of whether the network has nine parameters or nine billion. That efficiency gap is not a minor optimization detail. It's the reason deep learning exists as a field.
Forward Pass: Building Something to Backpropagate Through
Before we can backpropagate, we need something to backpropagate through. Let's build a deliberately tiny network and trace every computation by hand. Imagine we're predicting whether a customer will buy a product (1 = yes, 0 = no) based on two features: how long they browsed the website (x₁) and how many items they viewed (x₂).
Two inputs, two hidden neurons with ReLU activation, one output neuron with sigmoid, binary cross-entropy loss. Small enough to fit in your head, large enough to contain every pattern that scales to real networks.
Inputs: x₁ = 1.0 (browsing time, normalized)
x₂ = 0.5 (items viewed, normalized)
Target: y = 1 (did buy)
Hidden layer weights and biases:
w₁₁ = 0.3 w₁₂ = 0.7 b₁ = 0.1 (→ hidden neuron h₁)
w₂₁ = −0.2 w₂₂ = 0.5 b₂ = 0.0 (→ hidden neuron h₂)
Output layer weights and bias:
v₁ = 0.6 v₂ = −0.4 b₃ = 0.1
Nine parameters in total. Six weights, three biases. Let's push our input through.
First, each hidden neuron computes a weighted sum of the inputs and adds its bias. This is the pre-activation — the raw value before the activation function touches it:
z₁ = w₁₁·x₁ + w₁₂·x₂ + b₁ = 0.3(1.0) + 0.7(0.5) + 0.1 = 0.75
z₂ = w₂₁·x₁ + w₂₂·x₂ + b₂ = −0.2(1.0) + 0.5(0.5) + 0.0 = 0.05
Now ReLU — a function that passes positive values through unchanged and clamps negative values to zero. Both z₁ and z₂ happen to be positive, so they pass through untouched:
h₁ = ReLU(0.75) = 0.75
h₂ = ReLU(0.05) = 0.05
The output neuron takes these hidden activations, computes its own weighted sum, and applies sigmoid — a function that squashes any real number into the range (0, 1), which we interpret as a probability:
z₃ = v₁·h₁ + v₂·h₂ + b₃ = 0.6(0.75) + (−0.4)(0.05) + 0.1 = 0.53
ŷ = σ(z₃) = 1 / (1 + e⁻⁰·⁵³) ≈ 0.6295
Finally, the binary cross-entropy loss measures how far this prediction is from the truth. When the target is 1, the formula simplifies to −log(ŷ):
L = −[y·log(ŷ) + (1−y)·log(1−ŷ)] = −log(0.6295) ≈ 0.4629
The network predicted 0.63 when the answer was 1. Not terrible, but wrong enough that the loss is 0.4629. Every intermediate value we computed — z₁, z₂, h₁, h₂, z₃, ŷ — will be needed again during the backward pass. This is important: the forward pass doesn't only produce a prediction, it also produces the raw materials for computing derivatives.
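If you want to check those numbers yourself, the entire forward pass fits in a few lines of plain Python. A sketch mirroring the values above:

import math

x1, x2, y = 1.0, 0.5, 1.0

z1 = 0.3 * x1 + 0.7 * x2 + 0.1         # pre-activation: 0.75
z2 = -0.2 * x1 + 0.5 * x2 + 0.0        # pre-activation: 0.05
h1, h2 = max(0.0, z1), max(0.0, z2)    # ReLU: both positive, pass through
z3 = 0.6 * h1 + (-0.4) * h2 + 0.1      # output pre-activation: 0.53
y_hat = 1 / (1 + math.exp(-z3))        # sigmoid: ~0.6295
loss = -math.log(y_hat)                # BCE with y = 1: ~0.4629
print(y_hat, loss)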
The Chain Rule — Signal Through Stages
Before we go backward through the network, we need the chain rule to be bone-deep intuitive, because everything that follows is applying it over and over.
Imagine a factory assembly line with three stations. Station A shapes the metal. Station B paints it. Station C adds a label. A defective product rolls off the end. You want to know: if we made the raw material slightly better, how much would the final product improve?
The chain rule says: multiply the local effects at each stage. If Station A amplifies a change by a factor of 2 (better raw material → much better shaped piece), Station B preserves it (factor of 1), and Station C dampens it slightly (factor of 0.5), then the total amplification from raw material to final product is 2 × 1 × 0.5 = 1. A small improvement at the input produces an equal-sized improvement at the output.
Mathematically, if you have three nested functions — output = f(g(h(x))) — and you want to know how output changes when you nudge x, the chain rule says:
d(output)/dx = f'(g(h(x))) · g'(h(x)) · h'(x)
Each function contributes its own local derivative — how much it amplifies or dampens what passes through it. The total derivative is the product of all these local amplifications. That product is the gradient.
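You can verify that product rule numerically in a few lines. A sketch with three functions of my choosing (squaring, cosine, sine):

import math

# output = f(g(h(x))) with h(x) = x**2, g = cos, f = sin
def chain_rule_derivative(x):
    h = x ** 2
    g = math.cos(h)
    # product of the three local derivatives: f'(g) * g'(h) * h'(x)
    return math.cos(g) * (-math.sin(h)) * (2 * x)

def finite_difference(x, eps=1e-7):
    F = lambda t: math.sin(math.cos(t ** 2))
    return (F(x + eps) - F(x - eps)) / (2 * eps)

print(chain_rule_derivative(1.3))   # analytic product of local derivatives
print(finite_difference(1.3))       # numerical check: nearly identical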
A neural network is a composition of functions. The weighted sum is one function. ReLU is another. Sigmoid is another. The loss is another. The chain rule tells you exactly how a tiny change in any weight ripples through all subsequent layers to affect the final loss.
That's backpropagation. Start at the loss, compute the local derivative at each operation, and multiply backward. Each weight gets its own gradient — its own blame signal for the error.
We'll keep returning to this assembly line. When gradients vanish, it's because one of the stations has a near-zero amplification factor, and that zero multiplied through the product kills the signal for everything upstream. When gradients explode, the amplification factors are all greater than one, and their product grows out of control. The chain rule is a product, and products are at the mercy of their factors.
Backward Pass: Tracing the Blame
Let's walk backward through our example. We have L ≈ 0.4629 and we need ∂L/∂w for every one of the nine parameters. We'll start at the loss and work our way back, applying the chain rule at every operation.
From Loss to Output Pre-activation
For binary cross-entropy combined with sigmoid, there's a clean simplification that took me a while to trust the first time I saw it. Instead of computing ∂L/∂ŷ and ∂ŷ/∂z₃ separately and multiplying, the product collapses to:
∂L/∂z₃ = ŷ − y = 0.6295 − 1.0 = −0.3705
I'll be honest — when I first saw this, I didn't believe it was this clean. But if you expand the derivatives of −log(σ(z)) and apply the chain rule, the sigmoid derivative σ(z)(1−σ(z)) cancels perfectly with terms in the cross-entropy derivative. The math checks out. I still find this cancellation one of the more satisfying things in all of machine learning.
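If you want to see the cancellation with your own eyes, here is the expansion for our case y = 1, where L = −log(ŷ) and ŷ = σ(z₃):

∂L/∂ŷ = −1/ŷ (derivative of −log(ŷ))
∂ŷ/∂z₃ = ŷ(1 − ŷ) (sigmoid's derivative)
∂L/∂z₃ = (−1/ŷ) · ŷ(1 − ŷ) = −(1 − ŷ) = ŷ − 1 = ŷ − y

The ŷ in the numerator and denominator cancel, and what remains is exactly the prediction minus the target.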
The sign tells us something real. Negative means the loss decreases if z₃ increases. That makes sense: ŷ = 0.63 is too low (target is 1), so we need z₃ to be larger to push sigmoid's output closer to 1. This number — −0.3705 — is the error signal that will ripple backward through the entire network.
Gradients for the Output Layer
The output pre-activation was z₃ = v₁·h₁ + v₂·h₂ + b₃. To get the gradient for each weight, we multiply the upstream error signal by the local derivative — which, for a linear combination, is the input that weight connects to:
∂L/∂v₁ = (∂L/∂z₃) × h₁ = (−0.3705)(0.75) = −0.2779
∂L/∂v₂ = (∂L/∂z₃) × h₂ = (−0.3705)(0.05) = −0.0185
∂L/∂b₃ = (∂L/∂z₃) × 1 = −0.3705
Look at the magnitudes. Weight v₁ connects to h₁ = 0.75, a substantial activation, so v₁ gets a large gradient — it had significant influence on the output and therefore significant blame for the error. Weight v₂ connects to h₂ = 0.05, which barely fired, so v₂ gets a tiny gradient. That neuron contributed almost nothing to the prediction, so adjusting v₂ wouldn't help much. The math captures exactly the intuition you'd have if you thought about it.
Propagating Error to the Hidden Layer
To get gradients for the hidden layer weights, we first need the error signal at each hidden neuron — how much did h₁ and h₂ each contribute to the loss? The weights between layers act as routing coefficients, distributing the upstream error signal:
∂L/∂h₁ = (∂L/∂z₃) × v₁ = (−0.3705)(0.6) = −0.2223
∂L/∂h₂ = (∂L/∂z₃) × v₂ = (−0.3705)(−0.4) = +0.1482
Something interesting happens here. h₁ gets a negative gradient (we want it to increase), but h₂ gets a positive gradient (we want it to decrease). Why? Because v₂ is negative. A negative weight flips the sign of the blame. This is how the network's own structure shapes the gradient flow — the weights between layers don't only route signals forward, they route blame backward.
Through the ReLU Gate
ReLU's derivative is either 1 (if the input was positive) or 0 (if the input was negative or zero). It's a gate — either fully open or fully shut. Both z₁ = 0.75 and z₂ = 0.05 were positive, so the gate is open and the gradient passes through unchanged:
∂L/∂z₁ = (∂L/∂h₁) × ReLU'(z₁) = (−0.2223) × 1 = −0.2223
∂L/∂z₂ = (∂L/∂h₂) × ReLU'(z₂) = (+0.1482) × 1 = +0.1482
If z₂ had been negative — say, −0.3 — then ReLU would have output 0, and its derivative would be 0. The gradient ∂L/∂z₂ would be zero, and every weight feeding into that neuron would get zero gradient. That neuron is dead. It contributes nothing to the output, backprop assigns it zero blame, and zero blame means zero update. The neuron never recovers. This is called the dying ReLU problem. Back at our assembly line, it's as if one station's amplification factor is permanently zero — no signal from upstream can ever pass through it. Leaky ReLU and other variants exist specifically to keep this gate slightly ajar even for negative inputs.
Gradients for the Hidden Layer
Same pattern as before — upstream error signal times the input to that weight:
∂L/∂w₁₁ = (∂L/∂z₁) × x₁ = (−0.2223)(1.0) = −0.2223
∂L/∂w₁₂ = (∂L/∂z₁) × x₂ = (−0.2223)(0.5) = −0.1112
∂L/∂b₁ = (∂L/∂z₁) × 1 = −0.2223
∂L/∂w₂₁ = (∂L/∂z₂) × x₁ = (+0.1482)(1.0) = +0.1482
∂L/∂w₂₂ = (∂L/∂z₂) × x₂ = (+0.1482)(0.5) = +0.0741
∂L/∂b₂ = (∂L/∂z₂) × 1 = +0.1482
Every parameter has a gradient now. One backward pass. Nine gradients. Let's see the full picture:
Parameter Value Gradient What it means
───────── ───── ──────── ──────────────────────────
w₁₁ 0.300 −0.2223 increase it (reduces loss)
w₁₂ 0.700 −0.1112 increase it
b₁ 0.100 −0.2223 increase it
w₂₁ −0.200 +0.1482 decrease it (reduces loss)
w₂₂ 0.500 +0.0741 decrease it
b₂ 0.000 +0.1482 decrease it
v₁ 0.600 −0.2779 increase it
v₂ −0.400 −0.0185 increase it (make less negative)
b₃ 0.100 −0.3705 increase it
The gradient points in the direction of steepest ascent. To reduce the loss, the optimizer subtracts some fraction (the learning rate) of each gradient from the corresponding weight. That's gradient descent — but that's the optimizer's job, not backprop's. Backprop's entire contribution was computing these nine numbers in one backward pass.
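You don't have to take my arithmetic on faith. Here's a sketch that rebuilds the same nine parameters as scalar tensors and lets PyTorch's autograd reproduce the gradients (variable names are mine):

import torch

w11 = torch.tensor(0.3, requires_grad=True)
w12 = torch.tensor(0.7, requires_grad=True)
b1  = torch.tensor(0.1, requires_grad=True)
w21 = torch.tensor(-0.2, requires_grad=True)
w22 = torch.tensor(0.5, requires_grad=True)
b2  = torch.tensor(0.0, requires_grad=True)
v1  = torch.tensor(0.6, requires_grad=True)
v2  = torch.tensor(-0.4, requires_grad=True)
b3  = torch.tensor(0.1, requires_grad=True)

x1, x2 = 1.0, 0.5                        # inputs; target y = 1

# Forward pass, mirroring the hand computation
z1 = w11 * x1 + w12 * x2 + b1
z2 = w21 * x1 + w22 * x2 + b2
h1, h2 = torch.relu(z1), torch.relu(z2)
z3 = v1 * h1 + v2 * h2 + b3
y_hat = torch.sigmoid(z3)
loss = -torch.log(y_hat)                 # BCE with target 1

loss.backward()
print(f"{loss.item():.4f}")              # 0.4629
print(f"{v1.grad.item():.4f}")           # -0.2779
print(f"{w11.grad.item():.4f}")          # -0.2223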
The Universal Pattern
After tracing through that example, a pattern has probably started to crystallize. At every node in the backward pass, the same thing happened:
gradient for a weight = (upstream error signal) × (local input to that weight)
gradient for an input = (upstream error signal) × (weight connecting to that input)
That first line — upstream signal times local input — is the only formula you need to remember for the backward pass of a linear layer. It shows up everywhere because the derivative of z = w·x + b with respect to w is x, and the derivative with respect to x is w. The chain rule packages the "blame from downstream" into the upstream signal, and the local derivative tells you how much this particular weight or input contributed.
Activation functions add their own local derivative as a multiplier. ReLU multiplies by 1 or 0. Sigmoid multiplies by σ(z)(1−σ(z)). Tanh multiplies by 1−tanh²(z). Each one shapes the error signal as it passes through — amplifying, dampening, or killing it entirely.
Once you see this pattern, the backward pass through any layer — convolutional, recurrent, attention, normalization — is the same game. Compute the local derivatives. Multiply by the upstream signal. Pass the result further upstream. Every architecture ever invented plays by these rules.
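As a minimal sketch (scalar case, names mine), the entire backward rule for a linear node z = w·x + b fits in a few lines:

def linear_backward(upstream, x, w):
    """Backward through z = w*x + b. upstream is dL/dz from downstream."""
    grad_w = upstream * x    # dz/dw = x: upstream signal × local input
    grad_b = upstream        # dz/db = 1
    grad_x = upstream * w    # dz/dx = w: blame routed further upstream
    return grad_w, grad_b, grad_x

# Output layer of our example: upstream = -0.3705, input h1 = 0.75, weight v1 = 0.6
print(linear_backward(-0.3705, x=0.75, w=0.6))
# (-0.2779, -0.3705, -0.2223) after rounding: dL/dv1, dL/db3, dL/dh1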
Rest Stop
If you've followed along to here, you understand backpropagation. Not the hand-wavy version — the real one. You can trace a forward pass, compute the loss, walk backward through each operation multiplying local derivatives, and arrive at a gradient for every weight. You understand why the cost is O(n) — one backward pass, all gradients — and why that efficiency is what makes deep learning computationally feasible.
That's a solid 80% of what matters. You could stop here, write loss.backward() in PyTorch with genuine understanding of what's happening under the hood, and be ahead of most practitioners.
What we haven't covered yet: how frameworks actually implement this (computational graphs, automatic differentiation), why there are two modes of automatic differentiation and which one wins for neural networks, what happens when the gradient signal degrades through many layers (vanishing and exploding gradients), and the practical PyTorch details that trip people up in production.
But if the discomfort of not knowing what's underneath is nagging at you, read on.
Three Ways to Compute a Derivative
Before we look at how PyTorch implements backprop, it helps to understand why it chose the approach it did. There are three fundamentally different ways to compute a derivative, and they trade off in ways that matter enormously at scale.
Numerical differentiation is the brute-force approach we started with: wiggle each input by a tiny amount ε, measure the change in output, divide. The formula is f'(x) ≈ (f(x+ε) − f(x)) / ε. It's dead-simple to implement and it works for any function, differentiable or not. But it requires one forward pass per parameter, and it's approximate — too small an ε and floating-point rounding errors dominate, too large and the approximation is inaccurate. It has one practical use: gradient checking. When you write a custom backward pass and want to verify it, you compare your analytical gradients against numerical ones. That's about all it's good for.
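Here's what gradient checking looks like in practice: compare autograd's analytical gradients against central differences (a two-sided variant of the formula above, accurate to O(ε²)). A sketch with a toy function of my choosing:

import torch

def f(x):
    return torch.sin(x ** 2).sum()

x = torch.randn(5, dtype=torch.double, requires_grad=True)
f(x).backward()
analytical = x.grad.clone()

eps = 1e-6
numerical = torch.zeros_like(x)
with torch.no_grad():
    for i in range(x.numel()):
        x_plus, x_minus = x.clone(), x.clone()
        x_plus[i] += eps
        x_minus[i] -= eps
        numerical[i] = (f(x_plus) - f(x_minus)) / (2 * eps)

print(torch.allclose(analytical, numerical, atol=1e-5))   # True

PyTorch's torch.autograd.gradcheck automates this exact comparison for custom operations.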
Symbolic differentiation is what Wolfram Alpha does. Given the expression sin(x²), it applies differentiation rules to produce the expression 2x·cos(x²). It manipulates symbols, not numbers. The result is exact and gives you a closed-form formula. The problem is expression swell — for deeply nested functions, each application of the chain rule can double the size of the expression. A 50-layer network would produce a derivative expression so large it would choke any computer algebra system. Symbolic differentiation works beautifully for single equations. It falls apart for programs.
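You can watch the swell happen with a symbolic library like SymPy. A sketch (the nesting depth is arbitrary):

import sympy as sp

x = sp.symbols('x')
expr = x
for _ in range(5):                      # nest sin(t**2) five levels deep
    expr = sp.sin(expr ** 2)
derivative = sp.diff(expr, x)
# The derivative expression dwarfs the original
print(sp.count_ops(expr), sp.count_ops(derivative))

Each extra level of nesting can double the derivative's size, and a real network nests hundreds of operations deep.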
Automatic differentiation (AD) is the sweet spot that deep learning frameworks use. It doesn't produce expressions — it produces values. During the forward pass, it records every operation that happened (multiply, add, relu, etc.) in a trace. During the backward pass, it replays that trace in reverse, applying the chain rule to actual numbers at each step. No expression swell because there's no expression — there are numbers being multiplied at each node. Exact to machine precision because it uses analytical local derivatives, not finite differences. And it handles arbitrary program structure: loops, conditionals, recursion. If Python can execute it, autograd can differentiate through it.
Computational Graphs and the Machinery of Autograd
When you write y = torch.relu(x @ W + b) in PyTorch, something subtle happens beyond the obvious computation. PyTorch builds a computational graph — a directed acyclic graph (DAG) where each node represents an operation (matrix multiply, add, relu) and each edge represents a tensor flowing between operations. Every tensor with requires_grad=True participates in this graph. The tensor itself stores a reference back to the operation that created it — that's the .grad_fn attribute you see when you inspect a computed tensor.
This graph is the "tape" in tape-based automatic differentiation. PyTorch records the forward pass as it executes, building the tape on-the-fly. This is what people mean by a dynamic computational graph — the graph is constructed fresh every time the forward pass runs. If your code has an if statement that takes a different branch on different inputs, the graph structure changes accordingly. Old TensorFlow (1.x) used a static graph that you defined once and then ran — you couldn't use normal Python control flow. PyTorch's dynamic approach won the community over because it means you can use any Python you want and the gradients still work.
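You can see the tape by inspecting any computed tensor. A quick sketch:

import torch

x = torch.randn(1, 2)
W = torch.randn(2, 2, requires_grad=True)
b = torch.zeros(2, requires_grad=True)

y = torch.relu(x @ W + b)
print(y.grad_fn)                  # <ReluBackward0 ...>: the op that made y
print(y.grad_fn.next_functions)   # the add node it consumed, and so on up the graph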
When you call loss.backward(), PyTorch walks this graph in reverse topological order — from the loss node back to the leaf parameters. At each node, it performs a vector-Jacobian product (VJP): take the incoming gradient vector from downstream, multiply it by the local Jacobian (the derivative of this operation with respect to its inputs), and pass the result upstream. The framework doesn't compute or store the full Jacobian matrix — that would be enormous for operations like matrix multiplication. Instead, each operation has a custom backward function that directly computes the VJP efficiently. The result for each leaf parameter accumulates in its .grad attribute.
I'm still developing my full intuition for why the vector-Jacobian product framing is the right abstraction. But the key realization is this: at each node, the "vector" flowing backward is the gradient of the loss with respect to that node's output. The Jacobian captures how that node's output depends on its inputs. Multiplying them gives you the gradient of the loss with respect to that node's inputs. That's the chain rule, expressed as a matrix-vector product, applied one node at a time.
Forward Mode vs. Reverse Mode — Why Reverse Wins
Automatic differentiation comes in two flavors, and the choice between them has a dramatic impact on efficiency.
Forward mode propagates derivatives from inputs to outputs. You pick one input parameter, seed it with a derivative of 1, and trace how that derivative flows forward through every operation to every output. At the end, you have the derivative of all outputs with respect to that one input. To get the gradient with respect to a different input, you start over. For a function with n inputs and m outputs, forward mode costs n forward passes to get the full gradient.
Reverse mode propagates derivatives from outputs to inputs. You start with one output (the loss), seed it with a derivative of 1, and trace how that derivative flows backward through every operation to every input. At the end, you have the derivative of that one output with respect to all inputs. For n inputs and m outputs, reverse mode costs m backward passes.
Neural networks have millions of input parameters and one scalar loss. Forward mode would require millions of passes. Reverse mode requires one. The efficiency gap isn't 2× or 10× — it's a factor of millions.
Reverse-mode automatic differentiation is backpropagation. They are the same algorithm. The name "backpropagation" comes from the neural network community. The name "reverse-mode AD" comes from the numerical computing community. They converged on the same insight independently: when you have many inputs and few outputs, work backward.
Forward mode isn't useless — it's ideal when you have few inputs and many outputs, like computing the Jacobian of a physics simulation with three control knobs and a million mesh points. But for training neural networks, reverse mode is the only game in town.
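You can feel the asymmetry directly: torch.autograd.functional exposes both modes. A sketch (the function f is mine):

import torch
from torch.autograd.functional import jvp, vjp

def f(w):                           # many inputs, one scalar output, like a loss
    return (w ** 2).sum()

w = torch.randn(1000)

# Forward mode: one call gives ONE directional derivative.
# Recovering the full gradient would take 1000 calls, one per basis vector.
e0 = torch.zeros(1000)
e0[0] = 1.0
_, dw0 = jvp(f, w, e0)              # dL/dw[0] only

# Reverse mode: one call gives the ENTIRE gradient at once.
_, grad = vjp(f, w, torch.tensor(1.0))
print(dw0, grad.shape)              # one scalar vs. all 1000 components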
When the Signal Dies: Vanishing and Exploding Gradients
The chain rule computes a gradient as a product of local derivatives. And products are fragile things. If any factor is small, the whole product shrinks. If many factors are small, the product vanishes.
Think back to the assembly line. Each station has an amplification factor — its local derivative. If Station A's factor is 0.25, and so is Station B's, and Station C's, and Station D's, the error signal that reaches the first station is 0.25⁴ = 0.004. That's 99.6% of the signal gone after four layers. After ten layers? 0.25¹⁰ ≈ 0.000001. The first layers of the network get essentially no gradient. They can't learn. This is the vanishing gradient problem.
Here's why it haunted early deep learning. The sigmoid activation function has a maximum derivative of 0.25 — it occurs when the input is exactly zero, and the derivative gets smaller in both directions from there. If every layer uses sigmoid, the gradient signal decays by at least a factor of 4 at every layer. Hochreiter documented this in 1991, and it's the core reason people couldn't train networks deeper than a few layers for decades.
The mirror problem is exploding gradients. If the local derivatives are consistently greater than 1, the product grows exponentially. After 50 layers with amplification factors of 2, the gradient is 2⁵⁰ ≈ 10¹⁵. Weight updates become enormous, the loss oscillates wildly, and training diverges. This is particularly common in recurrent neural networks where backpropagation through time (BPTT) unrolls the same recurrence for hundreds of time steps — the "depth" is the sequence length, and the same weight matrix gets multiplied into the gradient product at every step.
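The decay is easy to measure. A sketch comparing the first layer's gradient in a deep sigmoid stack against the same stack with ReLU (exact numbers vary with initialization, but the gap spans orders of magnitude):

import torch
import torch.nn as nn

def first_layer_grad_norm(act, depth=20, width=64):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    net = nn.Sequential(*layers)
    net(torch.randn(8, width)).sum().backward()
    return net[0].weight.grad.norm().item()

print(first_layer_grad_norm(nn.Sigmoid))   # tiny: the signal barely arrives
print(first_layer_grad_norm(nn.ReLU))      # orders of magnitude larger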
The solutions that unlocked deep learning all attack this product-of-many-numbers problem:
ReLU (2011, popularized by Glorot et al.) has a derivative of exactly 1 for positive inputs. Unlike sigmoid's maximum of 0.25, ReLU doesn't systematically shrink the gradient. The chain rule product stays closer to 1 through many layers. This single change — swapping sigmoid for ReLU — was a turning point.
Careful weight initialization keeps the variance of activations and gradients stable across layers. Xavier initialization (Glorot, 2010) was designed for sigmoid/tanh. He initialization (2015) was designed for ReLU. Both set the initial weight scale based on the fan-in and fan-out of each layer, so that the expected magnitude of signals neither grows nor shrinks as they pass through.
Residual connections (He et al., 2015) add the input of a block directly to its output: output = block(x) + x. During backprop, the gradient flows through the addition unchanged — the derivative of x + something with respect to x is always 1, regardless of what the block does. This creates a "gradient highway" that bypasses the block entirely. Even if the block's gradients vanish, the residual path keeps the signal alive. This is how ResNets train hundreds of layers where plain networks collapse.
Batch normalization (Ioffe and Szegedy, 2015) normalizes each layer's inputs to have zero mean and unit variance. This prevents activations from drifting into the saturated regions of sigmoid/tanh where derivatives are tiny.
Gradient clipping caps the gradient norm at a threshold during training. If the gradient vector's norm exceeds the threshold, the entire vector is scaled down. This doesn't prevent vanishing gradients but it tames exploding ones, and it's standard practice for training RNNs and Transformers.
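In PyTorch, clipping is one line between backward() and step():

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale if norm > 1.0
optimizer.step()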
PyTorch Autograd in Practice
With all that theory behind us, let's look at what you actually write. The computational graph, VJPs, and reverse-mode traversal are all invisible. What you see in PyTorch is this:
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(2, 2),
    nn.ReLU(),
    nn.Linear(2, 1),
    nn.Sigmoid()
)
loss_fn = nn.BCELoss()
x = torch.tensor([[1.0, 0.5]])
y = torch.tensor([[1.0]])
y_hat = model(x) # forward pass — builds the computational graph
loss = loss_fn(y_hat, y) # compute loss — extends the graph
loss.backward() # backward pass — computes ALL gradients
for name, param in model.named_parameters():
    print(f"{name:15s} grad_norm = {param.grad.norm().item():.4f}")
model(x) runs the forward pass and records the tape. loss.backward() walks the tape backward and populates .grad on every parameter with requires_grad=True (which is the default for anything inside nn.Module). After that, the optimizer reads those gradients and updates the weights.
A complete training loop has three lines at its core:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(1000):
    y_hat = model(x)
    loss = loss_fn(y_hat, y)
    optimizer.zero_grad()   # clear old gradients
    loss.backward()         # compute new gradients
    optimizer.step()        # update weights using those gradients
zero_grad(), backward(), step(). Every training loop you'll ever write has this skeleton. Why zero_grad is necessary — we'll get to that in a moment.
During inference, you don't need gradients. Turning off graph construction saves memory and speeds things up:
with torch.no_grad():
    predictions = model(test_inputs)

# Or, for an entire function:
@torch.inference_mode()
def predict(model, inputs):
    return model(inputs)
The Details That Bite You: Accumulation, Detaching, Checkpointing
Gradient Accumulation — The Most Common Backprop Bug
PyTorch accumulates gradients by default. Each call to .backward() adds to existing .grad tensors instead of replacing them. Forget zero_grad(), and gradients grow with every batch. Weight updates become enormous. Loss explodes. I still occasionally get tripped up by this when prototyping quickly.
# BUG: gradients accumulate and grow every iteration
for batch in dataloader:
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
# FIX: reset gradients before each backward pass
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
But accumulation isn't a design flaw — it's a deliberate feature. It lets you simulate larger batch sizes when GPU memory is limited. Compute gradients on four small batches, accumulate them, then take one optimizer step. The effective batch size is 4× the actual batch size:
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = compute_loss(model, batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Detaching Tensors from the Graph
.detach() creates a tensor that shares the same data but is disconnected from the computational graph. Gradients will not flow through it. This is essential when you need to use a computed value without it influencing gradient computation — target networks in reinforcement learning, teacher models in distillation, stop-gradient tricks in self-supervised learning:
# RL: target Q-values must NOT influence gradient computation
with torch.no_grad():
    target_q = target_network(next_state).max(dim=1).values
loss = F.mse_loss(current_q, target_q.detach())
# Self-supervised learning (SimSiam): stop gradient on one branch
z1 = projector(encoder(x1))
z2 = projector(encoder(x2)).detach()
loss = -F.cosine_similarity(z1, z2).mean()
The difference between .detach() and torch.no_grad(): detach() disconnects a specific tensor from the graph while the rest of the graph stays intact. torch.no_grad() prevents graph construction entirely for all operations inside the context manager. Use no_grad() for inference. Use detach() when part of the computation needs gradients and part doesn't.
Gradient Checkpointing — Trading Compute for Memory
During the forward pass, PyTorch stores every intermediate activation because the backward pass needs them to compute gradients. For a network with 100 layers, that's 100 layers' worth of tensors sitting in memory. For large models with long sequences, this memory cost dwarfs the model weights themselves.
Gradient checkpointing attacks this by storing only a subset of activations — the "checkpoints." During the backward pass, when a non-stored activation is needed, PyTorch re-runs the forward pass from the nearest checkpoint to recompute it. Memory drops from O(n) to O(√n) at the cost of roughly one additional forward pass. In practice, this translates to about 30% more compute for 60% less activation memory.
from torch.utils.checkpoint import checkpoint
class DeepModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = HeavyBlock()
        self.block2 = HeavyBlock()
        self.block3 = HeavyBlock()

    def forward(self, x):
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        x = self.block3(x)
        return x
This is standard practice for LLM fine-tuning and training deep models on limited hardware. The tradeoff is almost always worth it.
What Backpropagation Does NOT Do
Backprop is powerful, but its scope is narrower than people think. Three things frequently get misattributed to it.
Backprop does not update weights. It computes gradients, nothing more. The actual update — subtracting lr × gradient, maintaining momentum, adapting per-parameter learning rates — is the optimizer's job. SGD, Adam, AdaGrad, LAMB — these are entirely separate algorithms that consume backprop's output. Backprop doesn't know or care which one you use.
Backprop does not find the global minimum. The loss landscape of a neural network is non-convex, riddled with local minima, saddle points, and flat plateaus. Backprop gives you the gradient at your current location — the direction of steepest local descent. Whether following that direction leads to a good solution depends on initialization, learning rate schedule, optimizer choice, and the geometry of the loss surface. My favorite thing about neural network loss landscapes is that, aside from high-level explanations about wide minima generalizing better, no one is completely certain why SGD converges to solutions that work as well as they do.
Backprop does not choose your architecture. It works on any differentiable computational graph — feed-forward, convolutional, recurrent, attention-based, graph networks. It computes gradients for whatever structure you give it. Whether that structure is well-suited to your problem is an entirely separate question that backprop has no opinion on.
Forward pass: input → prediction → loss. What the network currently does.
Backward pass (backprop): loss → gradients for every parameter. How each weight contributed to the error.
Optimizer step: update weights using their gradients. How to fix the error.
Backprop is step 2. That's its entire job.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a problem — millions of weights, one loss number, and the question of which knobs to turn. We traced a forward pass through a tiny network by hand, computed every intermediate value, then walked backward through the chain rule, multiplying local derivatives to produce a gradient for all nine parameters in one pass. We saw the universal pattern — upstream signal times local input — and recognized it as the engine behind every architecture ever trained. We lifted the hood on automatic differentiation, understood why reverse mode wins by a factor of millions for neural networks, confronted the vanishing and exploding gradient problems that kept deep learning stuck for decades, and landed on the practical PyTorch code that makes all of this three lines in a training loop.
My hope is that the next time you write loss.backward(), instead of treating it as a magic incantation that makes the loss go down, you'll picture that backward walk through the computational graph — the error signal splitting at branches, routing through weights, passing through ReLU gates, accumulating blame at every parameter — and have a solid mental model of what's actually happening under the hood.
Resources
The 1986 Rumelhart, Hinton & Williams paper — the paper that launched a thousand networks. Shorter and more readable than you'd expect.
Andrej Karpathy's micrograd video — builds backprop from scratch in Python. The best "build it yourself to understand it" resource I've found.
Stanford CS231n lecture notes on backprop — the computational graph walkthrough is wildly clear. The gradient checking section is also insightful for building trust in your own implementations.
PyTorch Autograd Mechanics docs — the official deep dive into how the tape, grad_fn, and backward engine work. Denser than the tutorials, but comprehensive.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult" (1994) — the definitive analysis of vanishing gradients. Still worth reading to understand why this problem was so fundamental.
Baydin et al., "Automatic Differentiation in Machine Learning: A Survey" (2018) — the most complete treatment of forward-mode, reverse-mode, and everything in between. Indispensable if you want the full picture.
What You Should Now Be Able To Do
- Explain backpropagation as "the chain rule applied in reverse through a computational graph" — not as a memorized phrase, but grounded in having traced the backward pass step by step and seen how multiplying local derivatives gives every weight its gradient.
- Trace both the forward and backward pass through a small network by hand. Compute weighted sums, apply activations, compute the loss forward, then walk backward multiplying upstream signal by local derivative at each node.
- State the universal pattern: gradient for a weight = (upstream error signal) × (input to that weight). Verify that gradient signs make intuitive sense — negative means "increase this to reduce loss."
- Explain the three types of differentiation (numerical, symbolic, automatic) and why automatic differentiation dominates for deep learning — no expression swell, exact to machine precision, handles arbitrary program structure.
- Articulate why reverse-mode AD wins over forward-mode for neural networks: one loss scalar, millions of parameters, so reverse mode's cost of one backward pass beats forward mode's cost of millions of forward passes.
- Explain vanishing and exploding gradients as a consequence of the chain rule being a product — small factors compound to kill the signal, large factors compound to blow it up. Name the solutions: ReLU, proper initialization, residual connections, batch norm, gradient clipping.
- Write the zero_grad() → backward() → step() training loop and explain every line, including why accumulation is both a common bug and a deliberate feature.
- Use .detach() and torch.no_grad() correctly, and explain gradient checkpointing as trading ~30% more compute for ~60% less memory.
- Clearly distinguish what backprop does (computes gradients) from what it doesn't (update weights, find global minima, choose architectures).