Seq2Seq & the Birth of Attention

Chapter 10: Sequence Models & Attention
The bottleneck problem · Bahdanau · Luong · Dynamic context

I avoided the attention mechanism for longer than I care to admit. I could use LSTMs. I could build encoder-decoder models that sort of worked on short sentences. Every time someone mentioned "Bahdanau attention" I'd nod slowly and change the subject. Eventually the discomfort of not knowing what was actually happening inside these models grew too great to ignore. This chapter is that dive.

The sequence-to-sequence (seq2seq) architecture was introduced in 2014 by Sutskever, Vinyals, and Le, alongside parallel work by Cho et al. It's the workhorse behind machine translation, text summarization, and chatbot response generation — any task where both the input and output are sequences of variable length. The attention mechanism, introduced by Bahdanau, Cho, and Bengio in 2015, fixed the most crippling flaw of the original design, and in doing so, planted the seed that grew into the Transformer.

Before we start, a heads-up. We'll be working through matrix multiplications, probability distributions, and a bit of neural network machinery. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

A tiny translation problem

Two RNNs stitched together

The sticky-note bottleneck

Training the decoder: teacher forcing

Choosing words at inference: beam search and friends

Rest stop and an off-ramp

Letting the decoder peek: Bahdanau attention

A faster way to score: Luong attention

The three-step pattern you'll see everywhere

Reading the alignment map

The full picture: seq2seq with attention

From cross-attention to self-attention

Wrap-up

Resources

A Tiny Translation Problem

Imagine we're building a model that translates short Spanish cooking instructions into English. We'll start absurdly small — a vocabulary of seven Spanish words and seven English words — so we can trace every moving part by hand.

Our training data has three sentences:

"corta las cebollas"   →  "chop the onions"
"mezcla las especias"  →  "mix the spices"
"cocina las cebollas"  →  "cook the onions"

Both sides are sequences, but they don't have to be the same length. A Spanish sentence might be three words; the English translation might be four. Or vice versa. We can't do this with a single RNN that maps one input token to one output token at each time step. We need something with two stages: one that reads the Spanish, and one that writes the English.

That two-stage setup is the heart of everything in this section. Let's build it.

Two RNNs Stitched Together

The idea, published independently by Sutskever et al. and Cho et al. in 2014, is elegantly simple. Take two RNNs and connect them through a single vector.

The first RNN is the encoder. It reads the Spanish input one token at a time — "corta," then "las," then "cebollas" — and after each token, updates its hidden state. Think of the hidden state as the encoder's running summary of everything it's seen so far. After the last token, the final hidden state is supposed to capture the meaning of the entire input sentence. This final vector is called the context vector.

The second RNN is the decoder. It receives the context vector as its starting state and generates the English output one token at a time. It predicts "chop," feeds "chop" back as input, predicts "the," feeds "the" back, predicts "onions," and finally emits a special end-of-sequence token that says "I'm done."

That's it. Two RNNs, one vector passed between them. The architecture is called encoder-decoder, or seq2seq (sequence-to-sequence).

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        embedded = self.embedding(src)
        # all_states: hidden state at every position (we'll need these later)
        # (hidden, cell): the final states — our "context vector"
        all_states, (hidden, cell) = self.rnn(embedded)
        return all_states, hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, cell):
        embedded = self.embedding(token)
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden, cell
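
Before moving on, a quick shape check helps make the two pieces concrete. The token indices and dimensions below are made up purely for illustration (a real vocabulary would also reserve indices for special tokens like <SOS> and <EOS>):

encoder = Encoder(vocab_size=10, embed_dim=16, hidden_dim=32)
decoder = Decoder(vocab_size=10, embed_dim=16, hidden_dim=32)

src = torch.tensor([[2, 5, 3]])           # "corta las cebollas" as (hypothetical) indices
all_states, hidden, cell = encoder(src)
print(all_states.shape)                   # torch.Size([1, 3, 32]) — one state per input token
print(hidden.shape)                       # torch.Size([1, 1, 32]) — the final state, our context vector

tok = torch.tensor([[1]])                 # a single start token, shape (batch, 1)
pred, hidden, cell = decoder(tok, hidden, cell)
print(pred.shape)                         # torch.Size([1, 10]) — scores over the target vocabulary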

The encoder reads "corta las cebollas" and compresses it into a vector. The decoder takes that vector and produces "chop the onions," one word at a time.

Let's trace through our tiny example. After the encoder processes "corta," its hidden state holds some representation of "corta." After "las," the state blends in "las." After "cebollas," the final hidden state — a vector of, say, 256 numbers — is supposed to represent the entire sentence "corta las cebollas." The decoder starts from that vector, and at each step it predicts the next English word.

Elegant. And deeply flawed.

The Sticky-Note Bottleneck

Here's the flaw. That context vector is a fixed-size vector. Maybe 256 dimensions. Maybe 1024. Doesn't matter. It's fixed.

For our three-word cooking instructions, this is fine. 256 dimensions is more than enough to represent "chop the onions." But imagine we're translating a 50-word recipe paragraph. All 50 words, their relationships, which ingredients go with which verbs, the order of the steps — crammed into the same 256-dimensional vector. Now imagine a 200-word recipe. Same vector size.

I'll be honest — when I first saw BLEU scores tanking on long sentences, I assumed it was a data problem. Not enough long examples in the training set, maybe. It took me an embarrassingly long time to realize the problem was structural: one vector is fundamentally not enough to represent a variable-length sequence.

Picture it this way. Someone reads you a full page of cooking instructions in Spanish. Then you have to recite the English version. But the only thing you're allowed to carry from the reading to the reciting is one sticky note. A sticky note of fixed size. Short instructions? You scribble the key words and you're fine. A full page? You're going to lose details. The quantities, the timing, the specific technique for julienning vs. dicing — gone.

What you want is to keep the original Spanish text on the table and glance back at the relevant line while you're translating each part. Not all of it at once. The line that matters for whatever word you're currently producing.

That's attention. But before we get there, we need to talk about how we train the decoder in the first place, because the training story has its own plot twist.

Training the Decoder: Teacher Forcing

During inference, the decoder feeds its own previous prediction as input to the next step. If it predicts "chop" at step 1, "chop" goes into step 2 to predict the next word. This sounds reasonable until you think about what happens during training.

Early in training, the model's predictions are garbage. Suppose it predicts "onions" when it should have said "chop." Now step 2 receives "onions" as context and has to somehow predict "the" based on the information that the sentence starts with "onions." It can't. The prediction at step 2 is garbage too. And step 3 receives that garbage, and it compounds. The errors cascade down the sequence like dominoes. Training becomes agonizingly slow.

The fix is called teacher forcing, and the name captures the idea well. During training, we ignore the model's own predictions. Instead, we feed the correct previous token — the ground truth — at every step. The model always sees "chop" before predicting "the," and always sees "the" before predicting "onions," regardless of what it would have predicted on its own. This is like a teacher who corrects each step before the student moves on. Convergence is dramatically faster.

But teacher forcing creates a subtle problem. During training, the decoder always sees perfect previous tokens. During inference, it sees its own imperfect predictions. The model has practiced translating in a world where it never makes mistakes. Then we drop it into a world where it makes mistakes constantly. This train/inference mismatch is called exposure bias, and it's especially damaging for long outputs where a single early error can derail the entire sequence.

I still occasionally mix up which direction the fix goes, so let me be explicit. Scheduled sampling (Bengio et al., 2015) starts training with 100% teacher forcing — all ground truth, all the time. Then, as training progresses, it gradually increases the probability of feeding the model's own predictions back in. By the end of training, the model has practiced recovering from its own errors. In code, it's a coin flip at each time step:

import random

def train_decoder_step(decoder, encoder_hidden, target_seq,
                       teacher_forcing_ratio=0.5):
    hidden, cell = encoder_hidden
    input_token = target_seq[:, 0:1]   # start with <SOS> token
    outputs = []

    for t in range(1, target_seq.size(1)):
        prediction, hidden, cell = decoder(input_token, hidden, cell)
        outputs.append(prediction)

        # coin flip: ground truth or own prediction?
        if random.random() < teacher_forcing_ratio:
            input_token = target_seq[:, t:t+1]
        else:
            input_token = prediction.argmax(dim=-1, keepdim=True)

    return torch.stack(outputs, dim=1)

The ratio typically follows a schedule — 1.0 at the start of training (pure teacher forcing), decaying toward 0.5 or lower as the model improves. The exact schedule is one of those things you tune by experiment.
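
As a concrete example (purely illustrative — the numbers and shape of the decay aren't from any paper), here's a linear decay from full teacher forcing down to a floor:

def teacher_forcing_schedule(epoch, total_epochs, floor=0.5):
    # linearly decay from 1.0 (pure teacher forcing) toward `floor`
    progress = epoch / max(total_epochs - 1, 1)
    return max(1.0 - (1.0 - floor) * progress, floor)

# epoch 0 of 10 -> 1.0, epoch 9 of 10 -> 0.5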

Choosing Words at Inference: Beam Search and Friends

At inference time, the decoder produces a probability distribution over the entire vocabulary at each step. It needs a strategy for choosing which token to emit. This choice matters more than most people realize — the same trained model can produce wildly different outputs depending on how you decode.

The most obvious strategy is greedy decoding: at each step, pick the single highest-probability token and move on. It's fast, but it's short-sighted. A word that looks best right now might lead to a dead end two steps later. Imagine the model is 60% sure the next word is "cook" and 40% sure it's "prepare." Greedy picks "cook." But maybe "prepare the spiced mixture" is a much better translation, and the model never gets a chance to explore that path.

Beam search fixes this by keeping multiple candidates alive at once. The key parameter is the beam width, often called k. Let's trace through beam search with k=2 on our cooking example, starting from the <SOS> token.

Step 1: the model predicts probabilities for the first word. Suppose "chop" gets probability 0.6 and "cook" gets 0.3. We keep both — these are our two beams. Step 2: for each beam, we expand to all possible next words and multiply probabilities along the path. "chop the" might score 0.6 × 0.8 = 0.48, "chop a" might score 0.6 × 0.1 = 0.06, "cook the" might score 0.3 × 0.7 = 0.21. Every continuation of every beam is a candidate, but we only keep the top 2 overall: "chop the" (0.48) and "cook the" (0.21). We continue expanding and pruning until both beams produce an end token. The beam with the highest cumulative score wins.

Beam search dominates when there's a single "right" answer, as in translation. A beam width of 4 to 10 is typical. Wider beams find better translations but cost proportionally more compute, and they tend to produce repetitive, "safe" text — the kind of bland translation that is technically correct but lacks any spark.
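
Here's a minimal sketch of the expand-and-prune loop. It assumes a hypothetical step_fn(prefix) that returns (token_id, probability) pairs for the next position — a stand-in for running the decoder — and it sums log-probabilities rather than multiplying raw probabilities, which is what real implementations do to avoid numerical underflow:

import math

def beam_search(step_fn, sos_id, eos_id, beam_width=2, max_len=20):
    beams = [([sos_id], 0.0)]                  # (prefix, cumulative log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token_id, prob in step_fn(prefix):
                candidates.append((prefix + [token_id], score + math.log(prob)))

        # prune: keep only the best `beam_width` candidates overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))   # this beam is done
            else:
                beams.append((prefix, score))
        if not beams:
            break

    return max(finished + beams, key=lambda c: c[1])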

For tasks where you want diversity — story generation, dialogue, creative writing — sampling methods win. Top-k sampling picks randomly from the k most likely tokens. Nucleus sampling (also called top-p) picks from the smallest set of tokens whose cumulative probability exceeds a threshold p, typically 0.9 to 0.95. This adapts the candidate set size dynamically — confident predictions get a narrow set, uncertain ones get a wider pool. Temperature is a knob that sharpens or flattens the distribution before sampling: low temperature (0.2–0.5) makes the model more confident, high temperature (1.0+) makes it more adventurous.
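
A sketch of those three knobs together — temperature, top-k, and nucleus sampling — operating on a single vector of logits (my own minimal rendering, not a reference implementation):

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: 1-D tensor of unnormalized scores over the vocabulary
    logits = logits / temperature                          # sharpen (<1) or flatten (>1)

    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))

    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        # drop tokens once the mass *before* them already exceeds top_p
        # (this always keeps at least the single most likely token)
        drop = cumulative - probs > top_p
        sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()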

All of these decoding strategies work with any seq2seq model — with or without attention. They're about what happens after the model produces probabilities. But the quality of those probabilities is only as good as the architecture that produces them, and our vanilla seq2seq architecture has that sticky-note bottleneck.

Rest Stop and an Off-Ramp

Congratulations on making it this far. If you want, you can stop here.

You now have a working mental model of the seq2seq architecture: an encoder that reads, a decoder that writes, a context vector that connects them. You understand why that context vector is a bottleneck for long sequences. You know how teacher forcing speeds up training and how scheduled sampling patches its train/inference gap. You know how beam search explores multiple candidate outputs.

That's a solid foundation. It doesn't tell the complete story — we haven't addressed the bottleneck yet — but if you're in a hurry, here's the one-sentence version of what's coming: attention lets the decoder look back at every encoder state instead of relying on one compressed vector, and the specific way it does the looking is a three-step pattern (score, normalize, aggregate) that becomes the foundation of the entire Transformer architecture.

There. You're 70% of the way there.

But if the discomfort of not knowing what's underneath is nagging at you, read on.

Letting the Decoder Peek: Bahdanau Attention

In 2015, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio published "Neural Machine Translation by Jointly Learning to Align and Translate." The title is modest. The impact was seismic.

The idea: instead of forcing the decoder to work from one sticky note, let it look at the full stack of encoder notes — one for every input position — and decide which notes matter right now.

Let's trace through it with our cooking example. The encoder reads "corta las cebollas" and produces a hidden state at each position: h₁ for "corta," h₂ for "las," h₃ for "cebollas." In the original paper the encoder is bidirectional — it reads the input left-to-right and right-to-left, then concatenates the two hidden states at each position — so each hᵢ captures context from both directions. We now have three vectors sitting there, waiting to be consulted.

The decoder is about to predict its first English word. Its current hidden state is s₀ (initialized from the encoder). The question is: which of the three encoder states should it pay attention to?

The first time I saw the scoring formula, it looked like three random operations stitched together. But there's a logic to it. We want a single number — a score — that says "how relevant is encoder state hᵢ to what the decoder is trying to do right now?" Bahdanau's solution is to run both vectors through a small neural network:

eₜᵢ = vᵀ tanh(W₁ sₜ₋₁ + W₂ hᵢ)

Let me unpack that. W₁ projects the decoder state into a shared space. W₂ projects the encoder state into the same space. We add them together — that's why this is called additive attention. The tanh squashes the sum into the range [-1, 1]. Then v (a learned vector) collapses the result into a single scalar score. Three operations, but each one has a clear job: project, combine, score.

We compute this score for every encoder position, giving us a list of scores: e₁, e₂, e₃ for our three-word input. Now we normalize them into a probability distribution using softmax:

αₜᵢ = exp(eₜᵢ) / (exp(eₜ₁) + exp(eₜ₂) + exp(eₜ₃))

These α values are the attention weights. They sum to 1 and tell us: "When generating the current output token, pay this fraction of your attention to each input position."

Suppose when generating "chop," the weights come out as α₁=0.85 (strong focus on "corta"), α₂=0.05 ("las"), α₃=0.10 ("cebollas"). The model is telling us it's mostly looking at "corta" — the verb — to produce the English verb "chop." That makes intuitive sense.

The final step: we compute a weighted sum of the encoder hidden states.

cₜ = α₁ h₁ + α₂ h₂ + α₃ h₃

This weighted sum cₜ is the context vector for this specific decoder step. And here's what makes it powerful: it's different at every step. When the decoder moves on to predict "the," it recomputes new scores, gets new weights (maybe more spread out, since "las" is an article that maps to "the"), and produces a new context vector. When it predicts "onions," it shifts attention to "cebollas." Each output token gets its own custom view of the input.

The sticky note is gone. The decoder has the full Spanish text on the table and glances at the relevant line for each word.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super().__init__()
        self.W1 = nn.Linear(decoder_dim, attention_dim, bias=False)
        self.W2 = nn.Linear(encoder_dim, attention_dim, bias=False)
        self.v = nn.Linear(attention_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state:   (batch, decoder_dim)
        # encoder_outputs: (batch, src_len, encoder_dim)

        query = self.W1(decoder_state).unsqueeze(1)   # (batch, 1, attn_dim)
        keys = self.W2(encoder_outputs)               # (batch, src_len, attn_dim)

        # additive scoring: project, combine, score
        scores = self.v(torch.tanh(query + keys))     # (batch, src_len, 1)
        scores = scores.squeeze(-1)                   # (batch, src_len)

        weights = F.softmax(scores, dim=-1)           # (batch, src_len)

        # weighted sum of encoder states
        context = torch.bmm(
            weights.unsqueeze(1), encoder_outputs
        ).squeeze(1)                                  # (batch, enc_dim)

        return context, weights

One detail that tripped me up for a while: Bahdanau computes attention before the decoder RNN step. The decoder state used for scoring is sₜ₋₁ — the state from the previous step. The resulting context vector gets concatenated with the current input embedding and fed into the decoder RNN to produce sₜ. This ordering matters because it means attention influences the decoder's hidden state, not the other way around.

The limitation? That small neural network inside the scoring function. It has learnable parameters (W₁, W₂, v), which makes it flexible but also somewhat slow. For every decoder step, we compute a forward pass through this mini-network for every encoder position. With a 100-token input and a 50-token output, that's 5,000 forward passes through the scoring network in a single sentence. People tolerated this cost for a year, but everyone suspected there had to be a leaner way.

A Faster Way to Score: Luong Attention

Later in 2015, Minh-Thang Luong, Hieu Pham, and Christopher Manning published "Effective Approaches to Attention-based Neural Machine Translation" with a streamlined alternative. The question they asked: do we really need a neural network to compute relevance scores, or can we get away with something cheaper?

Their most popular variant replaces Bahdanau's mini-network with a plain dot product:

eₜᵢ = sₜᵀ hᵢ

That's it. Multiply the decoder state and the encoder state element-wise, sum the products. No learned parameters at all — the dot product is free. This is called multiplicative attention (because scores arise from multiplying vectors, as opposed to Bahdanau's additive combination).

Let's make this concrete with our cooking example. Suppose at decoder step 2, the decoder state sₜ is a vector [0.3, -0.5, 0.8, 0.1] and the encoder state for "cebollas" is h₃ = [0.2, -0.4, 0.9, 0.0]. The dot product is (0.3×0.2) + (-0.5×-0.4) + (0.8×0.9) + (0.1×0.0) = 0.06 + 0.20 + 0.72 + 0.0 = 0.98. High score — the decoder considers "cebollas" highly relevant at this step. Compare that with h₂ for "las," which might score 0.15. After softmax, "cebollas" gets most of the attention weight. Same three-step pattern: score, normalize, weighted sum.
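
The same arithmetic in code, with the numbers copied from above:

s_t = torch.tensor([0.3, -0.5, 0.8, 0.1])   # decoder state at step 2
h_3 = torch.tensor([0.2, -0.4, 0.9, 0.0])   # encoder state for "cebollas"
print(torch.dot(s_t, h_3))                  # ≈ 0.98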

Luong proposed three scoring variants in total:

The dot variant (sₜᵀ hᵢ) has zero extra parameters, which makes it fast and memory-efficient. The catch: it requires the encoder and decoder hidden dimensions to match — you can't take a dot product of vectors with different sizes.

The general variant (sₜᵀ W hᵢ) inserts a learnable weight matrix W between the two vectors. This allows different dimensions and gives the model one knob to turn, adding expressiveness without the full overhead of Bahdanau's neural network.

The concat variant (vᵀ tanh(W[sₜ; hᵢ])) concatenates the two states and runs them through the same architecture as Bahdanau. It's equivalent in expressiveness, included for completeness.

There's a second structural difference between Bahdanau and Luong that's easy to miss. Bahdanau computes attention before the decoder RNN step, using the previous state sₜ₋₁. Luong computes attention after the RNN step, using the current state sₜ. The Luong decoder first runs the RNN to get sₜ, then uses sₜ to attend over the encoder, then combines the context vector with sₜ to make the final prediction. In practice, both orderings work well. The Luong approach is slightly more common in modern implementations because the dot product scoring integrates cleanly with the RNN output.

class LuongDotAttention(nn.Module):
    """Simplest Luong variant: dot-product scoring."""
    def forward(self, decoder_output, encoder_outputs):
        # decoder_output:  (batch, hidden_dim)
        # encoder_outputs: (batch, src_len, hidden_dim)

        scores = torch.bmm(
            encoder_outputs,
            decoder_output.unsqueeze(2)
        ).squeeze(2)                        # (batch, src_len)

        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(
            weights.unsqueeze(1),
            encoder_outputs
        ).squeeze(1)                        # (batch, hidden_dim)

        return context, weights
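
For comparison, here's a minimal sketch of the general variant — the same three steps with one learned matrix inserted so the encoder and decoder dimensions no longer need to match (my own rendering, not the paper's code):

class LuongGeneralAttention(nn.Module):
    """Luong's "general" variant: score(s, h) = s^T W h."""
    def __init__(self, decoder_dim, encoder_dim):
        super().__init__()
        self.W = nn.Linear(encoder_dim, decoder_dim, bias=False)

    def forward(self, decoder_output, encoder_outputs):
        # decoder_output:  (batch, decoder_dim)
        # encoder_outputs: (batch, src_len, encoder_dim)
        projected = self.W(encoder_outputs)               # (batch, src_len, decoder_dim)
        scores = torch.bmm(
            projected, decoder_output.unsqueeze(2)
        ).squeeze(2)                                      # (batch, src_len)

        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(
            weights.unsqueeze(1), encoder_outputs
        ).squeeze(1)                                      # (batch, encoder_dim)
        return context, weights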

The dot product scoring is not only simpler — it's a direct ancestor of what happens inside the Transformer. When Vaswani et al. designed scaled dot-product attention in 2017, they took Luong's dot product and added one tweak: dividing by √d_k (the square root of the key dimension) to keep the scores from growing too large in high dimensions. The lineage is direct.

But I'm getting ahead of myself. Let's step back and name the pattern that both Bahdanau and Luong share.

The Three-Step Pattern You'll See Everywhere

Every attention mechanism — Bahdanau, Luong, self-attention, cross-attention, multi-head attention — does exactly three things:

First, it computes a score between a query and each key. The query is what you're trying to match. The keys are the candidates you're searching through. The score says: "how relevant is this key to this query?" Bahdanau uses a neural network to compute the score. Luong uses a dot product. The Transformer uses a scaled dot product. Different mechanisms, same purpose.

Second, it normalizes the scores into weights that sum to 1 using softmax. This turns raw scores into a probability distribution. "I'm 85% interested in position 1, 5% in position 2, 10% in position 3."

Third, it computes a weighted sum of the values using those weights. In Bahdanau and Luong attention, the values are the encoder hidden states — the same vectors used as keys. In the Transformer, values get their own separate learned projection, which gives the model more flexibility.

Score. Normalize. Aggregate. Query, key, value. That's the entire machinery. If an interviewer asks you "what is attention?" — that's the answer. Everything else is details about where the Q, K, and V come from and how you compute the score.
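
To see how little machinery that really is, here's the pattern as one generic function, with the scoring rule passed in as an argument — a sketch, but it captures the fact that Bahdanau, Luong, and the Transformer differ only in that one argument:

def attend(query, keys, values, score_fn):
    # query:  (batch, query_dim)
    # keys:   (batch, src_len, key_dim)
    # values: (batch, src_len, value_dim)
    scores = score_fn(query, keys)                        # 1. score      (batch, src_len)
    weights = F.softmax(scores, dim=-1)                   # 2. normalize  (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), values)     # 3. aggregate  (batch, 1, value_dim)
    return context.squeeze(1), weights

# Luong-style dot-product scoring, for example:
dot_score = lambda q, k: torch.bmm(k, q.unsqueeze(2)).squeeze(2)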

In the attention we've built so far, the query comes from the decoder (it's asking "what should I focus on?"), and the keys and values both come from the encoder (they're offering "here's what I have"). Because the query and key/value sources are different sequences, this is called cross-attention. We'll revisit that naming when we get to the Transformer, where queries, keys, and values can all come from the same sequence.

I'm still developing my intuition for why this particular formulation works so extraordinarily well. The mathematical structure is a soft version of a hash table lookup — the query finds the most relevant keys, and returns a blend of their associated values. My favorite thing about it is that, aside from high-level explanations like the one I gave, no one is completely certain why this pattern generalizes so powerfully across domains — language, vision, audio, protein structure. It works. Beautifully.

Reading the Alignment Map

One unexpectedly beautiful side effect of attention: the weights form a matrix you can look at with your eyes.

Every time the decoder generates an output token, it produces a set of attention weights — one weight per input position. Stack all those weight vectors into rows, and you get a 2D grid. Columns are input (Spanish) positions. Rows are output (English) positions. Each cell's intensity shows how much the decoder focused on that input word when generating that output word.

For our cooking example "corta las cebollas" → "chop the onions," the alignment matrix might look something like this:

import numpy as np

# Rows = English output tokens, Columns = Spanish input tokens
#                 corta   las    cebollas
alignment = np.array([
    [0.88,  0.04,  0.08],    # "chop"   — focuses on "corta" (the verb)
    [0.06,  0.82,  0.12],    # "the"    — focuses on "las" (the article)
    [0.03,  0.07,  0.90],    # "onions" — focuses on "cebollas" (the noun)
])

The pattern is nearly diagonal — "chop" aligns with "corta," "the" aligns with "las," "onions" aligns with "cebollas." Spanish and English share similar word order here, so the alignment is tidy. For languages with different word order — say, English to Japanese, where the verb moves to the end — you'd see the alignment pattern shift and cross. Adjective-noun order flips. Subject-object-verb rearrangements. All visible in the matrix.
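
If you want to look at a map like this with your own eyes, a few lines of matplotlib (assuming it's installed) are enough — the alignment array is the toy one above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.imshow(alignment, cmap="Greys")
ax.set_xticks(range(3))
ax.set_xticklabels(["corta", "las", "cebollas"])
ax.set_yticks(range(3))
ax.set_yticklabels(["chop", "the", "onions"])
ax.set_xlabel("source (Spanish)")
ax.set_ylabel("output (English)")
plt.show()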

These visualizations are not decorative. They're a genuine debugging tool. If your model is failing on long sentences, plot the attention. If you see the weights collapsing toward uniform (every input position gets roughly equal attention), the model has given up on aligning and is spreading its bets. If you see the decoder fixating on one input position for many output steps, it's probably stuck in a repetition loop — generating the same word over and over because it can't look away from one source token.

I haven't figured out a great way to explain the subtlety of what "good" attention looks like without showing many examples, but the quick heuristic is: sharp attention (most weight concentrated on one or two source positions per output step) usually indicates the model has learned meaningful alignment. Fuzzy, spread-out attention usually indicates it hasn't.

The Full Picture: Seq2Seq with Attention

Let's wire everything together. The complete architecture: an encoder that stores a hidden state at every input position, a Bahdanau attention module that computes a fresh context vector at every decoder step, and a decoder that uses both the context and its own state to predict each output token.

class Seq2SeqWithAttention(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim,
                 hidden_dim, attn_dim):
        super().__init__()
        self.encoder = Encoder(src_vocab, embed_dim, hidden_dim)
        self.attention = BahdanauAttention(
            hidden_dim, hidden_dim, attn_dim
        )
        self.dec_embedding = nn.Embedding(tgt_vocab, embed_dim)
        self.dec_rnn = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.output_proj = nn.Linear(hidden_dim * 2, tgt_vocab)

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch_size, tgt_len = tgt.size()
        enc_outputs, hidden, cell = self.encoder(src)
        h = hidden.squeeze(0)
        c = cell.squeeze(0)

        outputs = []
        input_token = tgt[:, 0]

        for t in range(1, tgt_len):
            embedded = self.dec_embedding(input_token)

            # attention: decoder asks "what should I look at?"
            context, _ = self.attention(h, enc_outputs)

            # feed both the embedded token and context into decoder
            rnn_input = torch.cat([embedded, context], dim=1)
            h, c = self.dec_rnn(rnn_input, (h, c))

            # predict from decoder state + context
            prediction = self.output_proj(
                torch.cat([h, context], dim=1)
            )
            outputs.append(prediction)

            # teacher forcing decision
            if random.random() < teacher_forcing_ratio:
                input_token = tgt[:, t]
            else:
                input_token = prediction.argmax(dim=-1)

        return torch.stack(outputs, dim=1)

The difference from vanilla seq2seq is concentrated in one line: the attention call inside the loop. Instead of starting from one compressed vector and hoping for the best, the decoder consults the full set of encoder states at every step. It's the difference between memorizing a recipe and cooking with the recipe open on the counter.
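
A quick smoke test, with made-up vocabulary sizes and token indices — in practice you'd wrap this in a training loop, computing cross-entropy loss against the shifted target tokens as sketched here:

model = Seq2SeqWithAttention(src_vocab=10, tgt_vocab=10,
                             embed_dim=16, hidden_dim=32, attn_dim=24)

src = torch.tensor([[2, 5, 3]])            # "corta las cebollas" as (hypothetical) indices
tgt = torch.tensor([[1, 4, 6, 7, 9]])      # <SOS> chop the onions <EOS>, also hypothetical

logits = model(src, tgt)
print(logits.shape)                         # torch.Size([1, 4, 10]) — one prediction per step after <SOS>

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),    # (batch * steps, vocab)
    tgt[:, 1:].reshape(-1)                  # targets, shifted by one
)
loss.backward()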

On the original WMT translation benchmarks, this architecture improved BLEU scores dramatically, especially for sentences longer than 20 tokens — exactly where vanilla seq2seq collapsed. The bottleneck was gone.

From Cross-Attention to Self-Attention

Look at what attention gave us. The decoder no longer relies on a compressed sticky note. It has direct access to the encoder's full memory through a dynamic, learned lookup. The cooking recipe stays open on the counter.

Here's the question that unlocks the next revolution: what if we applied this same lookup not between two separate sequences, but within a single sequence?

In everything we've built so far, the query comes from the decoder and the keys and values come from the encoder. Two different sequences. But nothing about the score-normalize-aggregate pattern requires that. We could make the query, keys, and values all come from the same sequence.

Consider the sentence "The chef seasoned the steak before it was served." What does "it" refer to? A human reader knows "it" means "the steak," not "the chef." If we let every word attend to every other word in the same sentence, the representation of "it" can incorporate information from "steak" — regardless of how far apart they are. No recurrence needed. No hidden state threading the information through step by step. Direct connection.

That's self-attention. Each position in the sequence attends to all positions, including itself. The query, key, and value all come from the same sequence. The scoring pattern — dot product, softmax, weighted sum — is exactly the same machinery you've been building intuition for this entire section.
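
To make that concrete, here's the smallest possible version of the idea — one self-attention pass over a single sequence, with no learned projections, no scaling, and no multiple heads (all of which the Transformer adds on top):

def self_attention(x):
    # x: (batch, seq_len, dim) — queries, keys, and values are all just x
    scores = torch.bmm(x, x.transpose(1, 2))     # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)          # each row is a distribution over positions
    return torch.bmm(weights, x)                 # (batch, seq_len, dim)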

In 2017, Vaswani et al. published "Attention Is All You Need" and built the Transformer architecture entirely on self-attention, discarding recurrence altogether. The timeline is direct: Sutskever (2014) gave us the encoder-decoder framework. Bahdanau (2015) eliminated the bottleneck with cross-attention. Luong (2015) simplified the scoring to a dot product. Vaswani (2017) asked "what if attention is all we need?" and proved it was. Each step is a consequence of the one before it.

We'll build the Transformer from scratch in the next section. But the hardest part — understanding what attention actually does and why it works — that's behind us.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with the simplest possible problem — translating three-word cooking instructions between languages — and used it to expose the fundamental flaw in the original seq2seq design: the sticky-note bottleneck. We built the encoder-decoder architecture, grappled with the training challenges of teacher forcing and exposure bias, explored how beam search navigates the space of possible outputs, and then watched attention dissolve the bottleneck by letting the decoder consult the full input at every step. We traced the three-step pattern — score, normalize, aggregate — through Bahdanau's neural network and Luong's dot product, and saw how this same pattern points directly toward self-attention and the Transformer.

My hope is that the next time you encounter the word "attention" in a paper or a codebase, instead of nodding vaguely and hoping context makes it clear, you'll picture the decoder scanning through encoder states, computing scores, softmaxing them into weights, and blending values — and you'll have a pretty darn good mental model of what's going on under the hood.

Resources

Sutskever, Vinyals, Le — "Sequence to Sequence Learning with Neural Networks" (2014). The O.G. seq2seq paper. Worth reading for the input-reversal trick alone — they found that reversing the source sentence improved LSTM-based translation by reducing the effective dependency distance.

Bahdanau, Cho, Bengio — "Neural Machine Translation by Jointly Learning to Align and Translate" (2015). The paper that birthed attention. The alignment visualizations in Figure 3 are unforgettable.

Luong, Pham, Manning — "Effective Approaches to Attention-based Neural Machine Translation" (2015). Streamlined attention scoring and the global vs. local attention distinction. Cleaner writing than most papers in this space.

Jay Alammar — "Visualizing A Neural Machine Translation Model." Wildly helpful animated walkthrough of seq2seq with attention. If you learn better from visuals, start here.

Lilian Weng — "Attention? Attention!" (2018). An insightful survey that covers every major attention variant in one place, with consistent notation. I keep this bookmarked.

Vaswani et al. — "Attention Is All You Need" (2017). Where all of this leads. The Transformer paper. Everything we built in this section is the backstory to that paper's opening line.