Recurrent Models: RNN, LSTM, GRU

Chapter 10: Sequence Models & Attention
Memory through time · Gates · The gradient problem

I avoided recurrent neural networks for longer than I'd like to admit. Every time I saw those loop-arrow diagrams — the ones with the hidden state feeding back into itself — I'd nod politely, mutter something about LSTMs having gates, and move on to whatever shiny transformer paper had landed that week. Recurrent models felt like the flip phones of deep learning: important historically, safe to ignore now. But I kept running into them. Time-series problems at work. Legacy codebases. Interview questions that asked why attention was needed, which you can't answer without understanding what came before. Finally, the discomfort of not knowing what's really happening inside that loop grew too great, and I dove in. This chapter is that dive.

Recurrent Neural Networks were the dominant paradigm for sequence modeling from the early 1990s through 2017. The vanilla RNN was formalized in the late 1980s. The Long Short-Term Memory (LSTM) arrived in 1997, courtesy of Hochreiter and Schmidhuber. The Gated Recurrent Unit (GRU) followed in 2014 from Cho et al. Together, these architectures powered machine translation, speech recognition, and language modeling for over two decades before transformers swept the field.

Before we start, a heads-up. We're going to be working through hidden state updates, matrix multiplications, and gradient flow — but you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation. If you're comfortable with how a basic neural network does a forward pass, you have everything you need.

This isn't a short journey, but I hope you'll be glad you came.

What we'll cover

Why order matters
The sticky note — hidden state as memory
Walking through a vanilla RNN by hand
Unrolling through time
Backpropagation through time (BPTT)
The vanishing gradient catastrophe
Rest stop
LSTM — the conveyor belt
The forget gate
The input gate
The output gate
The key line — additive cell state
Peephole connections and practical tips
GRU — same idea, fewer moving parts
Bidirectional RNNs
The sequential bottleneck
Wrap-up and resources

Why Order Matters

Imagine you're building a tiny weather predictor. You have three days of observations: hot, hot, cold. Your job is to predict tomorrow. If someone shuffled those observations into cold, hot, hot, the prediction should change — the trend reversed. The order carries meaning.

Language works the same way. "Dog bites man" and "Man bites dog" use identical words. Swap the order and the meaning flips entirely. So does audio — the same frequencies in a different sequence produce a different word. Stock prices, DNA, sensor readings — all sequential. The identity of each element depends on where it sits relative to its neighbors.

A standard feedforward network takes in one fixed-size vector and produces one output. It has no notion of "before" or "after." You could flatten a sequence into a single long vector — concatenate all three days into one input — but that locks you into a fixed length. What happens when the sequence is four days? Or a hundred? And it throws away the temporal structure that makes the data meaningful in the first place.

We need a network that can read inputs one at a time, in order, and remember what it's seen. That's the idea behind recurrence.

The Sticky Note — Hidden State as Memory

Picture yourself watching those weather observations arrive one day at a time. You have a sticky note. Each morning, you read the new temperature, glance at what's written on your sticky note from yesterday, and write a new summary that combines both. Then you hand the updated sticky note to your tomorrow-self. That sticky note is the hidden state.

More precisely, the hidden state is a vector — a list of numbers — that encodes a compressed summary of everything the network has seen so far. It gets updated at every time step. It's the network's sole mechanism for carrying information from the past into the future.

The entire vanilla RNN is captured by a single equation:

hₜ = tanh(W_x · xₜ + W_h · hₜ₋₁ + b)

Here, xₜ is the input at time step t — today's weather observation. hₜ₋₁ is the hidden state from the previous step — yesterday's sticky note. W_x is a weight matrix that transforms the input, W_h is a weight matrix that transforms the previous hidden state, and b is a bias. The tanh activation squashes the result into the range [-1, 1].

The critical detail: W_x and W_h are the same matrices at every time step. The network reuses the same parameters whether it's processing day 1 or day 100. This weight sharing is what allows RNNs to handle sequences of any length with a fixed number of parameters.

Walking Through a Vanilla RNN by Hand

Let's make this concrete with the smallest possible example. Our weather station encodes temperatures as single numbers: hot = 1.0, mild = 0.5, cold = 0.0. Our hidden state is also a single number (one-dimensional, so we can trace everything by hand). We start with h₀ = 0 — a blank sticky note.

Suppose after training, the network has learned these weights: W_x = 0.5, W_h = 0.8, b = 0. Now let's feed in the sequence [hot, hot, cold] — that's [1.0, 1.0, 0.0].

Day 1 (hot, x₁ = 1.0): We combine the new input with the blank sticky note.

h₁ = tanh(0.5 × 1.0 + 0.8 × 0.0) = tanh(0.5) ≈ 0.46

The sticky note now reads 0.46 — the network has registered one hot day.

Day 2 (hot, x₂ = 1.0): Another hot day. The previous state carries forward.

h₂ = tanh(0.5 × 1.0 + 0.8 × 0.46) = tanh(0.5 + 0.37) = tanh(0.87) ≈ 0.70

The sticky note is now 0.70 — higher than before. Two consecutive hot days have pushed the hidden state up. The network is encoding a "hot streak."

Day 3 (cold, x₃ = 0.0): Temperature drops to zero.

h₃ = tanh(0.5 × 0.0 + 0.8 × 0.70) = tanh(0.56) ≈ 0.51

The hidden state dropped, but it didn't reset to zero. The memory of those hot days is still there, fading but present — the 0.8 multiplier on the previous state kept some of the old information alive. That final hidden state, 0.51, is what we'd feed into an output layer to make our prediction for tomorrow.

That's the entire forward pass of a vanilla RNN. One equation, applied repeatedly.
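
If you want to verify the arithmetic, the whole walk-through fits in a few lines of Python:

import math

W_x, W_h, b = 0.5, 0.8, 0.0              # the trained weights from above
h = 0.0                                  # h₀ = 0, a blank sticky note
for x in [1.0, 1.0, 0.0]:                # [hot, hot, cold]
    h = math.tanh(W_x * x + W_h * h + b)
    print(f"{h:.2f}")                    # prints 0.46, then 0.70, then 0.51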

Unrolling Through Time

When we draw an RNN as a box with an arrow looping back into itself, it looks compact but hides the actual computation. Unrolling means we copy that box once for each time step and lay the copies out left to right, making the information flow explicit.

For our three-day weather sequence, the unrolled view looks like this:

  x₁(hot)      x₂(hot)      x₃(cold)
    ↓             ↓             ↓
 [RNN Cell] → [RNN Cell] → [RNN Cell] → h₃ → prediction
  h₀=0  →  h₁=0.46  →  h₂=0.70  →  h₃=0.51

Every box is the same cell with the same weights. The arrows between them are the hidden state being passed forward — our sticky note traveling through time. This unrolled picture is what we actually compute during a forward pass, and it's the graph we backpropagate through during training.

In code, unrolling is literally a for loop:

import torch, torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.W_x = nn.Linear(input_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.b   = nn.Parameter(torch.zeros(hidden_dim))
        self.fc  = nn.Linear(hidden_dim, output_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x):                          # x: (batch, seq_len, input_dim)
        batch, seq_len, _ = x.size()
        h = torch.zeros(batch, self.hidden_dim, device=x.device)
        for t in range(seq_len):                   # the unrolling
            h = torch.tanh(self.W_x(x[:, t]) + self.W_h(h) + self.b)
        return self.fc(h)                          # predict from final hidden state

That for loop is the unrolling. Each iteration applies the same weights to a new input and the carried-forward hidden state. Clean, elegant — and, as we're about to see, deeply flawed.

Backpropagation Through Time

Training an RNN means computing gradients with respect to those shared weights so we can update them. Because the computation graph is the unrolled chain of cells, backpropagation flows backward through each time step. This is called Backpropagation Through Time, or BPTT.

The key quantity is: how does a change in the hidden state at an early time step affect the loss at the end? To answer that, we need the chain rule. The gradient of the loss with respect to the hidden state at step k passes through every intermediate step between k and the final step T:

∂L/∂hₖ = ∂L/∂h_T · ∏ⱼ₌ₖ₊₁ᵀ ∂hⱼ/∂hⱼ₋₁

Each of those intermediate derivatives, ∂hⱼ/∂hⱼ₋₁, involves multiplying by the weight matrix W_h and by the derivative of tanh. Concretely, from the update equation, ∂hⱼ/∂hⱼ₋₁ = diag(1 − hⱼ²) · W_h, since the derivative of tanh(u) is 1 − tanh²(u). So the full gradient is a product of many such terms — one for each time step between k and T.

This is where the trouble starts.

The Vanishing Gradient Catastrophe

I'll be honest — I didn't truly understand the vanishing gradient problem until I stopped reading about eigenvalues and did some arithmetic on paper. Let me show you what made it click for me.

The derivative of tanh has a maximum value of 1.0, and in practice it's usually much smaller — around 0.1 to 0.5 for typical activations. Let's say the effective multiplier at each time step is 0.9. That sounds close to 1. But here's what happens when you multiply 0.9 by itself repeatedly:

After  5 steps: 0.9⁵  = 0.59
After 10 steps: 0.9¹⁰ = 0.35
After 25 steps: 0.9²⁵ = 0.072
After 50 steps: 0.9⁵⁰ = 0.0052

After 50 steps, the gradient has shrunk to half a percent of its original size. It's gone. The network cannot learn that something it saw 50 steps ago matters for the output right now. It's not that the information is lost from the hidden state — it might still be there, faintly. The problem is that the gradient signal needed to learn that relationship has evaporated.

This is the vanishing gradient problem. The gradient of the loss with respect to early hidden states decays exponentially as you backpropagate through time.
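
You can watch this happen with autograd. Here's a minimal sketch (the hidden size, sequence length, and the 0.1 weight scale are arbitrary choices that keep the per-step multiplier below 1):

import torch

torch.manual_seed(0)
hidden_dim, T = 16, 50
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1    # small recurrent weights
x = torch.randn(T, hidden_dim)

h = torch.zeros(hidden_dim)
states = []
for t in range(T):
    h = torch.tanh(x[t] + W_h @ h)                 # vanilla RNN update (W_x omitted for brevity)
    h.retain_grad()                                # keep per-step gradients for inspection
    states.append(h)

states[-1].sum().backward()                        # a "loss" that depends only on the final state
for t in [0, 9, 24, 49]:
    print(f"gradient norm at step {t+1:2d}: {states[t].grad.norm():.1e}")
# the norms shrink by orders of magnitude as you move back toward step 1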

The flip side exists too. If that per-step multiplier is greater than 1 — say 1.1 — then 1.1⁵⁰ ≈ 117. The gradient explodes to infinity, your loss becomes NaN, and training collapses. This is the exploding gradient problem, and it's fixable with a technique called gradient clipping: if the gradient norm exceeds a threshold, you scale it down. Clipping handles explosions, but it does nothing for vanishing. You can't clip a gradient back into existence.
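
In PyTorch, clipping is one line between the backward pass and the optimizer step (model, loss, and optimizer here stand for the usual training-loop objects):

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if the norm exceeds 1
optimizer.step()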

Think of it like a game of telephone. One person whispers a message to the next, who whispers to the next. With each relay, the message degrades. After 50 people, the original message is unrecognizable. The vanilla RNN is playing telephone with its gradient signal across time. And the longer the sequence, the worse the distortion.

In practice, vanilla RNNs can learn dependencies spanning roughly 10 to 20 time steps. Beyond that, they have effective amnesia. This isn't a problem with the optimizer or the learning rate. It's architectural. The multiplicative nature of the hidden state update guarantees that gradients either vanish or explode — there's no stable middle ground.

The field lived with this limitation for years. Everyone knew it was a problem. The question was how to fix it.

Rest stop — you can stop here if you want

Congratulations on making it this far. You now have a solid mental model: an RNN carries a sticky note (hidden state) forward through a sequence, updating it at each step by combining the new input with the old note using shared weights. Training happens by unrolling this loop and backpropagating through it (BPTT). The fundamental flaw is that gradients decay exponentially, limiting the network to short-range memory.

That's a genuinely useful understanding. If someone asks you in an interview what the vanishing gradient problem is and why it matters, you can now explain it mechanically — not as a vague handwave about "gradients getting small," but as the consequence of multiplying a number less than 1 by itself dozens of times.

What comes next is the solution: gates and additive cell states. The LSTM introduces a second channel — a conveyor belt for long-term memory — that bypasses the multiplicative bottleneck. The GRU achieves something similar with less machinery. If the discomfort of not knowing how they actually work is nagging at you, read on.

LSTM — The Conveyor Belt

I'll confess: the LSTM equations intimidated me for years. Four lines of sigmas and tanh and element-wise products, with variables named things like C̃ₜ. I'd stare at diagrams with boxes and circles and colored arrows and come away more confused than before. What finally made it click was a physical metaphor.

Picture a conveyor belt in a factory. Items ride along the belt from one end to the other. At each station along the belt, workers can do three things: remove items from the belt (forget), place new items onto the belt (input), and inspect what's on the belt to make a report (output). The belt itself moves forward steadily. Items stay on the belt unless a worker actively removes them.

That conveyor belt is the LSTM's cell state. It's a separate vector that runs alongside the hidden state through time. The critical insight — the reason the LSTM was such a breakthrough — is that the cell state is updated using addition, not multiplication. Items are placed on or removed from the belt, but the belt itself doesn't get squeezed through a bottleneck at each station. This is what lets information (and gradients) travel long distances without decay.

The Long Short-Term Memory network (Hochreiter & Schmidhuber, 1997) controls this conveyor belt with three gates. Let's build them up one at a time, using our weather station.

The Forget Gate

Back at our weather station, suppose the cell state has been tracking "we're in a warm spell." Then a cold front arrives. The network needs a way to say: "that warm spell information? Let it go. It's no longer relevant." That's the forget gate.

fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)

The forget gate looks at the previous hidden state (hₜ₋₁) and the current input (xₜ), concatenates them, multiplies by a learned weight matrix W_f, adds a bias, and passes the result through a sigmoid function (written σ). Sigmoid outputs a value between 0 and 1 for each element. A value of 1 means "keep this entirely." A value of 0 means "erase this completely." Values in between mean partial forgetting.

This gate produces a vector the same size as the cell state. It will be multiplied element-wise with the previous cell state — like the conveyor belt workers deciding, for each item on the belt, whether to let it continue or toss it.

A practical detail that tripped me up when I first implemented an LSTM: the forget gate bias is typically initialized to 1.0, not 0.0. Jozefowicz et al. (2015) showed that this matters enormously. When the bias starts at zero, σ(0) = 0.5, meaning the network forgets half of its memory from the very first step of training. Initializing to 1 gives σ(1) ≈ 0.73, so the network defaults to remembering. It can learn to forget later. This one initialization trick is the difference between an LSTM that works and one that struggles to learn long-range patterns.
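
With PyTorch's built-in nn.LSTM, applying the trick takes a short loop. The four gates' biases are stacked in a single vector in the order input, forget, cell, output, so the forget block is the second quarter:

import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)
for name, p in lstm.named_parameters():
    if "bias" in name:                    # covers both bias_ih_l0 and bias_hh_l0
        n = p.size(0) // 4
        p.data[n:2 * n].fill_(1.0)        # the forget-gate slice
# nn.LSTM keeps two bias vectors (ih and hh) that are summed inside the cell,
# so this starts the effective forget bias at 2.0, firmly in "remember" territory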

The Input Gate

Now the network needs to decide what new information to write onto the conveyor belt. This happens in two parts.

First, the network generates a candidate memory — a proposal for new information to add:

C̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)

This uses tanh, which produces values between -1 and 1. Think of it as the network composing a new sticky note: "today was cold, after two hot days — that's a temperature drop."

Second, the input gate decides how much of that candidate to actually write:

iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)

Again sigmoid, again values between 0 and 1. The input gate is the conveyor belt worker deciding: "should I put this new item on the belt, and how much of it?" If the input is routine and unremarkable, the gate stays low. If the input is novel and important — like a sudden temperature drop — the gate opens wide.

The Key Line — Additive Cell State

Now comes the single most important equation in this entire chapter. The cell state update:

Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ

The symbol ⊙ means element-wise multiplication. Read this equation left to right, and you can hear the conveyor belt in action. First, take the old cell state (Cₜ₋₁) and multiply it element-wise by the forget gate (fₜ) — this erases the stuff we decided to forget. Then add the new candidate memory (C̃ₜ) scaled by the input gate (iₜ) — this places new items on the belt.

That plus sign is everything. Compare this to the vanilla RNN, where the hidden state update was tanh(W_h · h + W_x · x) — a completely multiplicative transformation. Here, the cell state update is additive. When the forget gate is close to 1 and the input gate is close to 0, the equation becomes approximately Cₜ ≈ Cₜ₋₁. The cell state passes through unchanged. And because it passes through unchanged, gradients during backpropagation also pass through unchanged — no exponential decay, no vanishing.

I think of it this way: the vanilla RNN forces every piece of information through a narrow doorway (matrix multiplication + tanh) at every step. The LSTM provides an express lane — the cell state — that bypasses that doorway. Information on the express lane travels as far as it needs to without being squeezed. The gates are the on-ramps and off-ramps that control what enters and exits the express lane.

The Output Gate

The cell state carries the long-term memory, but we don't always want to expose all of it. The output gate controls what portion of the cell state becomes visible as the hidden state — which is what gets passed to the next layer or used for prediction.

oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
hₜ = oₜ ⊙ tanh(Cₜ)

The cell state is passed through tanh (to normalize it back to [-1, 1]) and then filtered by the output gate. Back at the weather station: the cell state might remember both the recent temperature trend and the season. But for today's prediction, maybe the trend matters more than the season. The output gate lets the network selectively reveal only the relevant parts.

To put all four equations together in one place:

fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)        # forget gate: what to erase
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)        # input gate: how much new info to write
C̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)     # candidate: the new info itself
Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ              # cell state update (THE KEY LINE)
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)        # output gate: what to reveal
hₜ = oₜ ⊙ tanh(Cₜ)                     # hidden state: the visible output

Notice the pattern. Every gate has exactly the same structure: take the concatenation of the previous hidden state and current input, multiply by a learned weight matrix, add a bias, and pass through sigmoid. They differ only in their learned weights — each gate learns independently what information to care about. The candidate memory is the same structure but with tanh instead of sigmoid, because it's producing content rather than a filter.
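
To see the pattern in one place, here's the full step as a from-scratch PyTorch module. This is a sketch that mirrors the six equations line for line (in real code you'd reach for nn.LSTM, which fuses these operations and runs much faster):

import torch, torch.nn as nn

class LSTMCellScratch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # each gate: a linear layer over the concatenation [h_{t-1}, x_t]
        self.W_f = nn.Linear(hidden_dim + input_dim, hidden_dim)  # forget gate
        self.W_i = nn.Linear(hidden_dim + input_dim, hidden_dim)  # input gate
        self.W_c = nn.Linear(hidden_dim + input_dim, hidden_dim)  # candidate
        self.W_o = nn.Linear(hidden_dim + input_dim, hidden_dim)  # output gate
        nn.init.ones_(self.W_f.bias)              # forget bias starts at 1 (see above)

    def forward(self, x_t, h_prev, C_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)     # [h_{t-1}, x_t]
        f = torch.sigmoid(self.W_f(hx))           # what to erase
        i = torch.sigmoid(self.W_i(hx))           # how much new info to write
        C_tilde = torch.tanh(self.W_c(hx))        # the new info itself
        C = f * C_prev + i * C_tilde              # THE KEY LINE: additive update
        o = torch.sigmoid(self.W_o(hx))           # what to reveal
        h = o * torch.tanh(C)                     # visible hidden state
        return h, C

Calling this in a loop over time steps, exactly like the VanillaRNN earlier, gives you a full LSTM.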

Peephole Connections and Practical Tips

In the standard LSTM, the gates base their decisions on the previous hidden state and the current input. But Gers, Schmidhuber, and Cummins (2000) pointed out something that, in hindsight, seems obvious: shouldn't the gates also be able to peek at the cell state itself? After all, the cell state is where the long-term memory lives. The hidden state is a filtered view of it — the gates are making decisions based on a secondhand summary.

Peephole connections add the cell state as an additional input to each gate. I haven't figured out a great way to build intuition for exactly when peepholes help, but empirically they improve tasks that require precise timing or counting — like learning to produce a signal after exactly N time steps.
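
In equation form, one common variant of the forget gate adds a learned element-wise weight V_f on the previous cell state (V_f is my notation here; the other gates get analogous terms):

fₜ = σ(W_f · [hₜ₋₁, xₜ] + V_f ⊙ Cₜ₋₁ + b_f)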

In practice, standard LSTMs without peepholes are the default. Peephole variants are a niche optimization. If someone asks you about them in an interview, the answer is: "They let the gates look at the cell state directly, which helps for timing-sensitive tasks, but they're rarely used because the improvement is marginal for most applications."

A few more practical notes from hard-won experience:

Gradient clipping is non-negotiable for recurrent networks. Even with LSTMs, gradients can occasionally spike. Setting the maximum gradient norm to 1.0 or 5.0 is standard practice.

Orthogonal initialization for the hidden-to-hidden weight matrix (W_h) helps maintain gradient norms during backpropagation. An orthogonal matrix has eigenvalues with magnitude 1, which means it neither amplifies nor shrinks — exactly what we want for long-range gradient flow.

Truncated BPTT is a compromise for very long sequences. Instead of backpropagating through the entire sequence, you chop it into windows (say 100 steps) and backpropagate within each window. You lose the ability to learn dependencies longer than the window, but you save memory and computation. Most real-world LSTM training uses some form of truncation.
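
Here's a minimal sketch of all three tips in one training loop (the data, loss, and 100-step window are placeholders):

import torch, torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
opt = torch.optim.Adam(lstm.parameters(), lr=1e-3)

for name, p in lstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(p)                       # orthogonal hidden-to-hidden init

x = torch.randn(8, 1000, 128)                        # a batch of long sequences
state = None
for start in range(0, 1000, 100):                    # truncated BPTT: 100-step windows
    out, state = lstm(x[:, start:start + 100], state)
    loss = out.pow(2).mean()                         # placeholder loss
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=1.0)  # clip gradient spikes
    opt.step()
    state = tuple(s.detach() for s in state)         # cut the graph between windows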

GRU — Same Idea, Fewer Moving Parts

The LSTM works. It solved the vanishing gradient problem convincingly and dominated the field for nearly two decades. But it has a lot of machinery — three gates, a separate cell state, four weight matrices to learn. In 2014, Cho et al. asked a natural question: can we get the same benefit with less complexity?

The Gated Recurrent Unit merges the cell state and hidden state into a single state vector, and reduces three gates to two. Let's build it up.

The update gate decides how much of the old state to keep versus how much new information to write. It plays the combined role of the LSTM's forget and input gates:

zₜ = σ(W_z · [hₜ₋₁, xₜ] + b_z)

The reset gate controls how much of the previous hidden state to use when computing the candidate for the new state. When the reset gate is close to 0, the candidate is computed almost entirely from the current input — as if the network is starting fresh:

rₜ = σ(W_r · [hₜ₋₁, xₜ] + b_r)

The candidate state uses the reset gate to optionally suppress the previous hidden state:

h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ] + b)

And the final state is an interpolation between old and new:

hₜ = (1 - zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ

Look at that last line. When zₜ is close to 0, we keep the old hidden state almost entirely — equivalent to the LSTM's forget gate being 1 and input gate being 0. When zₜ is close to 1, we replace the old state with the candidate — equivalent to forgetting everything and writing something new. The (1 - z) and z terms are coupled — they're constrained to sum to 1 — which means the GRU can't independently control forgetting and updating. The LSTM can: its forget and input gates are independent, so it can simultaneously erase old memory and write new memory, or keep old memory and also write new memory. The GRU forces a trade-off.

Is this coupling a problem? I'm still developing my intuition for this. Empirically, on most benchmarks, GRUs match LSTMs — and on small datasets, GRUs sometimes do better because they have about 25% fewer parameters and are less prone to overfitting. But on tasks with very long-range dependencies and plenty of training data, LSTMs occasionally edge ahead. In practice, the difference is rarely large enough to be decisive. Start with GRU if you want speed. Switch to LSTM if you see a gap.

Back at our conveyor belt: the GRU is like a factory that decided the belt and the inspection station can be the same thing. Instead of a separate belt for long-term memory and a separate output for short-term use, the GRU has one state that serves both purposes. It's a more compact design that handles most of the same workloads, at the cost of some flexibility.
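
To make "fewer moving parts" concrete, here's the GRU step in the same from-scratch style as the LSTM cell above (again a sketch; use nn.GRU in practice):

import torch, torch.nn as nn

class GRUCellScratch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_z = nn.Linear(hidden_dim + input_dim, hidden_dim)   # update gate
        self.W_r = nn.Linear(hidden_dim + input_dim, hidden_dim)   # reset gate
        self.W   = nn.Linear(hidden_dim + input_dim, hidden_dim)   # candidate

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        z = torch.sigmoid(self.W_z(hx))                            # keep old vs. write new
        r = torch.sigmoid(self.W_r(hx))                            # how much past to consult
        h_tilde = torch.tanh(self.W(torch.cat([r * h_prev, x_t], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde                      # one state, two gates

Note there's no separate cell state and no output gate: the single returned vector is both the memory and the visible output.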

Bidirectional RNNs

So far, our weather predictor reads observations left to right — from past to future. That makes sense for prediction: you can't use tomorrow's data to predict tomorrow. But for other tasks, the constraint is artificial.

Consider classifying the sentiment of a weather report: "The forecast was grim, but then the sun broke through." If you only read left to right, by the time you reach "grim" you don't know about the "but" that reverses everything. If you also read right to left, the word "grim" gets context from both directions — the gloom that precedes it and the reversal that follows it.

A bidirectional RNN runs two separate RNNs over the same sequence: one forward (left to right) and one backward (right to left). At each time step, the two hidden states are concatenated to form a richer representation that captures context from both past and future.

Forward:    h₁→  h₂→  h₃→  h₄→
Backward:   ←h₁  ←h₂  ←h₃  ←h₄
Combined:  [h₁→;←h₁]  [h₂→;←h₂]  [h₃→;←h₃]  [h₄→;←h₄]

The combined hidden state at each position has twice the dimension. If each directional RNN has a hidden size of 256, the concatenated output is 512. The two RNNs have separate weights — they learn different things. The forward RNN learns patterns based on what came before. The backward RNN learns patterns based on what comes after. Together they give each position a full-context representation.

Bidirectional RNNs are workhorses for tasks where the entire sequence is available upfront: sentiment analysis, named entity recognition, part-of-speech tagging, and the encoder half of sequence-to-sequence models. They cannot be used for real-time generation — you can't peek at future tokens when you're generating them one at a time.
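
In PyTorch, bidirectionality (and stacking, which we'll get to in a moment) is a constructor flag: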

encoder = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=2,           # stacked: 2 layers deep
    bidirectional=True,     # forward + backward
    batch_first=True,
    dropout=0.3,            # dropout between layers
)
# Output hidden dim per step: 256 × 2 = 512
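
A quick shape check (the batch of 4 and length 50 are arbitrary):

x = torch.randn(4, 50, 128)
out, (h_n, c_n) = encoder(x)
print(out.shape)                  # torch.Size([4, 50, 512]) = 256 forward + 256 backward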

Stacking layers — putting one RNN on top of another — is the other common extension. The hidden state of layer l at time t becomes the input to layer l+1 at the same time step. Lower layers tend to capture local patterns (today's temperature), while higher layers capture more abstract patterns (seasonal trends). Google's Neural Machine Translation system (2016) was an 8-layer stacked LSTM. Adding residual connections when going beyond 2–3 layers helps gradient flow — the same principle as in deep feedforward networks.

The Sequential Bottleneck

The LSTM and GRU solved the vanishing gradient problem. That was an enormous achievement. But they left another problem untouched: the sequential bottleneck.

Look back at the unrolled RNN. Computing h₂ requires h₁. Computing h₃ requires h₂. Every step depends on the one before it. You cannot compute them in parallel. For a sequence of length 500, you must perform 500 sequential operations. On a GPU that can do thousands of operations simultaneously, this is heartbreaking — you're using a fraction of the available power.

It gets worse. In a sequence-to-sequence model for translation, the encoder compresses the entire input sentence into a single fixed-length vector — the final hidden state. Everything the decoder needs to know about a 50-word sentence must fit into one vector. That vector becomes a bottleneck, and information gets lost. We'll see in the next section how the attention mechanism alleviates this by letting the decoder look back at every encoder hidden state, not just the final one. And in the section after that, we'll see how the transformer eliminates the sequential dependency entirely.

Recurrent models dominated sequence processing for over two decades. Then, in 2017, Vaswani et al. published "Attention Is All You Need," and the era of recurrence began to close. The sequential bottleneck — the inability to parallelize — was the main reason. Attention connects every position to every other position directly, with no sequential chain, and training becomes massively parallel.

But knowing why recurrence was abandoned makes the motivation for attention click instantly. That's why this chapter exists.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a question — how do you give a neural network memory? — and built the answer from scratch. First, the sticky note: a hidden state that gets updated at each time step and carried forward, giving the network a compressed summary of everything it's seen. We walked through a vanilla RNN by hand, three days of weather data, and saw how the hidden state accumulates information. Then we unrolled the loop, saw how backpropagation travels through it, and watched the gradients vanish — 0.9 multiplied by itself 50 times shrinks to nothing. That failure motivated the conveyor belt: the LSTM's additive cell state, controlled by three gates that learn when to forget, what to write, and what to reveal. The GRU showed that two gates and a single state can achieve most of the same benefit with less machinery. Bidirectional variants gave us context from both directions. And finally, the sequential bottleneck — the inability to parallelize recurrent computation — set the stage for what comes next.

My hope is that the next time you encounter an LSTM in a legacy codebase, or get asked in an interview why transformers replaced recurrent models, instead of reaching for vague hand-waves about "vanishing gradients" and "attention is better," you'll be able to trace the full story — from the sticky note to the conveyor belt to the bottleneck — with a solid mental model of what's going on under the hood.

Resources and Credits

Colah's "Understanding LSTM Networks" (2015) — Still the single best visual explanation of LSTM gates. If you read one thing after this section, make it this. The diagrams are unforgettable.

Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997) — The O.G. paper. Dense and formal, but worth reading the introduction to feel the frustration that motivated the invention.

Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014) — The GRU paper. Surprisingly short and readable for something that spawned an entire variant family.

Jozefowicz et al., "An Empirical Exploration of Recurrent Network Architectures" (2015) — Wildly helpful for practical insights. This is where the forget gate bias initialization trick came from.

Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks" (2015) — A brilliant blog post that showcases what character-level RNNs can learn. Insightful and fun — the Shakespeare-generating RNN is a classic.

Pascanu et al., "On the difficulty of training Recurrent Neural Networks" (2013) — The rigorous treatment of vanishing and exploding gradients. Essential if you want the full mathematical story.