Self-Attention & the Transformer
I avoided a proper deep dive into transformers for longer than I'd like to admit. Every time someone mentioned "queries, keys, and values," I'd nod along, maybe sketch a box-and-arrow diagram from memory, and change the subject. I could use transformers — fine-tune BERT, prompt GPT — but I couldn't build one from nothing, couldn't trace the numbers through the attention mechanism by hand. Finally the discomfort of not knowing what's happening under the hood grew too great for me. Here is that dive.
The Transformer was introduced in 2017 by Vaswani et al. in a paper titled "Attention Is All You Need." It was designed for sequence transduction — converting one sequence of symbols into another, like translating English to German. Within two years it had taken over machine translation, language modeling, speech recognition, and eventually computer vision. Every frontier model you've heard of — GPT, BERT, LLaMA, Claude — is a Transformer or a close descendant.
Before we start, a heads-up. We're going to be doing a lot of matrix multiplications, walking through dot products, and building up an architecture piece by piece. You don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
Why a sequence needs to talk to itself
The filing cabinet: Query, Key, and Value from first principles
Scaled dot-product attention — with actual numbers
Why we divide by √d_k (and what goes wrong if we don't)
Multi-head attention: letting the model look in many directions at once
The order problem: positional encoding
Sinusoidal encoding
Learned embeddings
RoPE — the modern default
Rest stop
The residual stream: how information flows through a Transformer
Layer normalization and the Pre-Norm revolution
The feed-forward network: where knowledge lives
Wiring it all up: the encoder block
The decoder block and causal masking
Cross-attention: the bridge between encoder and decoder
The full architecture
Putting it together in PyTorch
Wrap-up and resources
Why a Sequence Needs to Talk to Itself
Let's start with a concrete problem. Imagine we're building a small translation system. Our input is the English sentence "The cat sat on the mat because it was tired." We want to translate it to Spanish.
When the model reaches the word "it," something important needs to happen. The model needs to figure out what "it" refers to. Is it the cat? The mat? The act of sitting? In Spanish, the pronoun has to agree in gender with the referent — él for the cat (el gato, masculine), ella for the mat (la alfombra, feminine). Getting this wrong produces a grammatically broken translation.
In an RNN, the information about "cat" has to survive a chain of hidden states — one hop per word — slowly decaying as it travels forward through "sat," "on," "the," "mat," "because." By the time the RNN reaches "it," the signal from "cat" six words back is faint. The signal from "mat," two words back, is stronger. The model might pick "mat" not because it's semantically correct, but because it's closer. That's a failure of architecture, not of learning.
Self-attention offers a different approach. Instead of passing messages one word at a time along a chain, what if every word could directly look at every other word in the sentence, all at once? When processing "it," the model computes a relevance score between "it" and every other word — "cat," "mat," "sat," "tired," all of them — in a single operation. If the learned weights are good, "cat" gets a high relevance score and "it" absorbs information from "cat" directly. One hop. No chain. No decay.
That's the core idea. Replace the sequential chain with a fully-connected graph where every word can attend to every other word in one step. The rest of this section is about how to make that idea precise and practical.
The Filing Cabinet: Query, Key, and Value from First Principles
To build up the attention mechanism, I want to start with a physical analogy that I'll keep coming back to throughout this section.
Imagine a room full of filing cabinets. Each cabinet drawer has a label on the front — something like "Animals," "Actions," "Locations." Inside the drawer are actual documents: detailed notes, data, context. You walk into the room holding a question written on a slip of paper: "What kind of creature is being discussed?"
You compare your question against every drawer label. "Animals" — strong match, you pull that drawer open. "Locations" — weak match, you peek inside but don't take much. "Weather" — no match, you skip it entirely. Then you take a weighted mix of the contents: mostly from the "Animals" drawer, a little from "Locations," nothing from "Weather."
In self-attention, every word in the sentence plays all three roles at the same time. Every word is a drawer with a label and contents. And every word is also a person walking through the room with a question.
The question is called a Query (Q). The drawer label is called a Key (K). The contents inside the drawer are called a Value (V).
Here's the part that confused me for a long time: all three — Q, K, and V — come from the same input. The same word embedding gets transformed three different ways through three different learned weight matrices. Each matrix creates a different "view" of the same word.
Let's make this concrete with our translation sentence. Suppose each word starts as a vector of numbers — an embedding. The word "cat" might be represented as some vector x_cat. We create three versions of it:
q_cat = x_cat × W_Q (the question "cat" asks of other words)
k_cat = x_cat × W_K (what "cat" advertises about itself)
v_cat = x_cat × W_V (what "cat" actually provides when attended to)
W_Q, W_K, and W_V are weight matrices that the model learns during training. Through backpropagation, the model figures out what makes a good question (W_Q), what makes a useful label (W_K), and what information is worth passing along (W_V). The entire attention mechanism is learned linear projections plus a dot product. No hand-crafted features. No special-purpose rules.
I'll be honest — when I first encountered Q, K, V, I thought they must be three fundamentally different kinds of information. It took me a while to internalize that they're three learned perspectives on the same input. The model decides, through training, what to ask, what to advertise, and what to share. That's what makes it so flexible.
Scaled Dot-Product Attention — With Actual Numbers
The full attention equation fits on one line:
Attention(Q, K, V) = softmax(Q K⊤ / √d_k) V
Four operations, chained together. Rather than explain them abstractly, let's trace through them with actual numbers. We'll use three words from our translation sentence — "The," "cat," "sat" — each represented as a 4-dimensional embedding. Small enough to work through by hand.
Our input matrix X has three rows (one per word) and four columns (one per dimension):
X = [[1.0, 0.0, 1.0, 0.0], ← "The"
[0.0, 2.0, 0.0, 2.0], ← "cat"
[1.0, 1.0, 1.0, 1.0]] ← "sat"
We need three weight matrices, W_Q, W_K, and W_V, each 4×4. In a real model these are learned. Here, we'll pick specific values so the math stays clean:
import torch, math
X = torch.tensor([[1.,0.,1.,0.], [0.,2.,0.,2.], [1.,1.,1.,1.]])
W_Q = torch.tensor([[1.,0.,1.,0.],[0.,1.,0.,1.],[1.,0.,0.,1.],[0.,1.,1.,0.]])
W_K = torch.tensor([[0.,1.,1.,0.],[1.,0.,0.,1.],[0.,1.,0.,1.],[1.,0.,1.,0.]])
W_V = torch.tensor([[1.,0.,0.,1.],[0.,1.,1.,0.],[1.,1.,0.,0.],[0.,0.,1.,1.]])
Q = X @ W_Q # each word's "question"
K = X @ W_K # each word's "label"
V = X @ W_V # each word's "contents"
After the matrix multiplications, we get:
Q = [[2, 0, 1, 1], ← "The" is asking this question
[0, 4, 2, 2], ← "cat" is asking this question
[2, 2, 2, 2]] ← "sat" is asking this question
K = [[0, 2, 1, 1], ← "The" advertises this
[4, 0, 2, 2], ← "cat" advertises this
[2, 2, 2, 2]] ← "sat" advertises this
V = [[2, 1, 0, 1], ← "The" will provide this
[0, 2, 4, 2], ← "cat" will provide this
[2, 2, 2, 2]] ← "sat" will provide this
Now, the first real operation: compute Q × K⊤. This gives us a 3×3 matrix where entry (i, j) is the dot product of word i's query with word j's key. A large value means "word i is looking for something that word j has."
scores = Q @ K.T
scores = [[ 2, 12, 8], ← "The"'s question matched most strongly with "cat" (12)
[12, 8, 16], ← "cat"'s question matched most strongly with "sat" (16)
[ 8, 16, 16]] ← "sat"'s question matched equally strongly with "cat" and "sat" (16 each)
Every word has been compared against every other word. That's a full pairwise comparison — three words means nine scores. For a 1,000-word document, that's one million scores. For a 100,000-token context window, it's ten billion. This is both the power and the cost of self-attention.
But we can't feed these raw scores into softmax yet. The numbers are too large. Here's why that matters.
Why We Divide by √d_k
I'll be honest — I glossed over this scaling factor for months. "Divide by square root of the dimension." Okay, sure, whatever. It seemed like a minor implementation detail. It's not. It's the difference between a model that learns and a model that's frozen.
Here's the problem. When we compute the dot product of two vectors, the result's magnitude depends on how many dimensions those vectors have. If Q and K each have elements drawn from a standard normal distribution (mean 0, variance 1), the dot product of two d_k-dimensional vectors has a variance of d_k. Not 1. Not something small. d_k.
Let me make this concrete. With d_k = 4 (our toy example), the standard deviation of the dot products is √4 = 2. Our scores range from 2 to 16 — spread out, but manageable. Now imagine d_k = 64, a typical value in real models. The standard deviation becomes √64 = 8. Scores might range from -20 to +25.
Feed those into softmax and watch what happens:
softmax([20, 1, -15]) ≈ [1.000, 0.000, 0.000]
One score dominates completely. The softmax has saturated — it's pushed all its probability mass onto a single position. Every other position gets essentially zero weight. The gradient of softmax at these extreme values is vanishingly small. The model can't learn to redistribute attention. It's stuck in whatever pattern it fell into early in training.
Dividing by √d_k rescales the variance back to approximately 1, regardless of the dimension:
Variance of raw dot product: d_k
Variance after dividing by √d_k: d_k / d_k = 1
Now the softmax operates in a healthy region where multiple positions can share attention and gradients actually flow. Let's see this with our example:
scaled_scores = scores / √4 = scores / 2
scaled_scores = [[1.0, 6.0, 4.0],
[6.0, 4.0, 8.0],
[4.0, 8.0, 8.0]]
Apply softmax row by row. Each row becomes a probability distribution that sums to 1:
weights = softmax(scaled_scores, dim=-1)
weights ≈ [[0.006, 0.876, 0.118], ← "The" puts 88% on "cat," 12% on "sat"
[0.117, 0.016, 0.867], ← "cat" puts 87% on "sat," 12% on "The"
[0.009, 0.495, 0.495]] ← "sat" splits evenly between "cat" and itself
Without scaling, "The"'s raw scores [2, 12, 8] would softmax to approximately [0.00005, 0.982, 0.018] — nearly 100% on "cat" with almost nothing left for "sat." With scaling, "sat" gets a meaningful 12% weight. That matters because the gradient now flows to that position too, and the model can learn finer-grained attention patterns.
In real models with d_k = 64 or d_k = 128, the effect without scaling is catastrophic. The Vaswani et al. paper puts it plainly: "We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."
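To see the effect outside our toy example, here is a quick check you can run. It's a sketch with random unit-variance vectors, not values from any trained model:
import torch
torch.manual_seed(0)
d_k = 64
q = torch.randn(1000, d_k)                      # 1,000 random queries with unit-variance entries
k = torch.randn(1000, d_k)                      # 1,000 random keys
raw = (q * k).sum(-1)                           # 1,000 dot products
print(raw.std())                                # ≈ 8, i.e. √64
print((raw / d_k ** 0.5).std())                 # ≈ 1 after dividing by √d_k
print(torch.softmax(torch.tensor([20., 1., -15.]), dim=0))  # ≈ [1., 0., 0.]: saturated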
The last operation: multiply the attention weights by V. Each output row is a weighted blend of all the value vectors:
output = weights @ V
"The" ≈ 0.006·[2,1,0,1] + 0.880·[0,2,4,2] + 0.114·[2,2,2,2]
≈ [0.24, 1.99, 3.75, 1.99]
"cat" ≈ 0.035·[2,1,0,1] + 0.483·[0,2,4,2] + 0.483·[2,2,2,2]
≈ [1.04, 1.97, 2.90, 1.97]
"sat" ≈ [1.00, 2.00, 3.00, 2.00]
Look at what happened to "The." It started as [1, 0, 1, 0] and ended up as approximately [0.25, 1.99, 3.74, 1.99]. It absorbed information heavily from "cat" — roughly 88% of "cat"'s value vector is blended in. The output for "The" is no longer about "The" alone. It's a context-aware representation — a blend of the entire sentence, weighted by what the model judged to be relevant.
That's self-attention. The filing cabinet analogy holds: "The" walked through the room, compared its question against every drawer label, and pulled out a weighted mix of contents. Let's keep that analogy in mind — we'll come back to it.
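For completeness, here is the whole computation packed into one small function. It's a minimal sketch that reuses the toy Q, K, V tensors we built above and reproduces the numbers we just traced by hand:
import math
import torch.nn.functional as F
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarity, rescaled
    weights = F.softmax(scores, dim=-1)                # one probability distribution per row
    return weights @ V                                 # weighted blend of value vectors

print(attention(Q, K, V)[0])  # ≈ [0.25, 1.99, 3.74, 1.99]: "The" after attending to the sentence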
Multi-Head Attention: Looking in Many Directions at Once
There's a limitation in what we built so far. A single attention computation produces one set of weights — one probability distribution per word. That means each word can express only one pattern of relevance at a time.
But consider our sentence again: "The cat sat on the mat because it was tired." When processing "it," the model simultaneously needs to figure out several things. It needs coreference — "it" refers to "cat." It needs syntactic structure — "it" is the subject of "was." It needs semantic role — "tired" is a predicate about "it." Asking one softmax distribution to capture all of these relationships at once is like asking a single person to look left, right, and behind them simultaneously.
Multi-head attention solves this by running multiple attention operations in parallel, each with its own set of W_Q, W_K, W_V weight matrices. Each parallel computation is called a head. Each head gets to develop its own notion of what makes a good question, a good label, and a good response.
The mechanics work like this. Instead of one attention operation with d_model = 512 dimensions, we split into h = 8 heads, each operating on d_k = 512 / 8 = 64 dimensions. Each head has its own projection matrices, runs its own attention computation, and produces its own output. Then all the head outputs get concatenated back into a 512-dimensional vector and multiplied by one final weight matrix W_O.
MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ) × W_O
where headᵢ = Attention(X × Wᵢ_Q, X × Wᵢ_K, X × Wᵢ_V)
Going back to our filing cabinet analogy: a single head is one person searching the room with one question. Multi-head attention sends eight people into the room, each with a different question. One asks about coreference. Another asks about syntactic role. A third asks about positional proximity. They each come back with their own findings, and the results are combined.
Research has confirmed that heads do specialize. Clark et al. (2019) visualized BERT's attention heads and found that specific heads consistently track subject-verb agreement, others track coreference chains, and others attend primarily to adjacent tokens for local syntax. I still can't always predict which patterns different heads will learn for a given task — the model figures that out on its own. But the specialization is real and measurable.
An important detail that surprised me: the total computation cost of multi-head attention is essentially the same as single-head attention with the full d_model dimensions. You're not paying more — you're reorganizing the same computation to be more expressive. Each head works in a 64-dimensional subspace instead of the full 512, so the per-head cost is smaller. Eight heads at 64 dimensions equals one head at 512 dimensions, plus one extra matrix multiplication (W_O) at the end. It's a free expressivity upgrade.
In practice, you don't create eight separate weight matrices. You use one big projection of shape (512, 512), then reshape and transpose the tensor to split it into heads. Mathematically identical, but much faster on a GPU because it's one large matrix operation instead of eight small ones.
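Here is roughly what that reshape looks like, as a minimal sketch with made-up batch and sequence sizes, just to show where the head dimension comes from:
import torch
import torch.nn as nn
d_model, n_heads = 512, 8
d_k = d_model // n_heads                      # 64 dimensions per head
x = torch.randn(2, 10, d_model)               # (batch, seq_len, d_model)
W_q = nn.Linear(d_model, d_model)             # one projection covering all heads
Q = W_q(x)                                    # (2, 10, 512)
Q = Q.view(2, 10, n_heads, d_k)               # carve the 512 dims into 8 chunks of 64
Q = Q.transpose(1, 2)                         # (batch, n_heads, seq_len, d_k)
print(Q.shape)                                # torch.Size([2, 8, 10, 64])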
The Order Problem: Positional Encoding
We've built something powerful, but it has a fatal flaw. Look back at the attention formula: softmax(QK⊤/√d_k) V. Where in that formula does the model know that "cat" comes before "sat" which comes before "on"?
Nowhere.
Self-attention is permutation-equivariant. If you shuffle the input words, the output words get shuffled in the same way — the attention weights between any pair of words don't change. "Cat sat mat" and "mat cat sat" produce identical attention patterns. For language, where word order is everything — "dog bites man" versus "man bites dog" — this is a disaster.
The fix: inject positional information into the word embeddings before they enter the attention layers. The model's input becomes token_embedding + positional_encoding, and now every word carries a signal about where it sits in the sequence.
Three approaches have emerged, each an improvement on the last.
Sinusoidal Positional Encoding
The original Transformer paper encoded each position using a pattern of sine and cosine waves at different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Each dimension oscillates at a different frequency. The first dimensions change rapidly — like the seconds hand on a clock. The last dimensions change slowly — like the hour hand. Together, they give every position a unique "fingerprint" of wave values.
Why sines and cosines? There's an elegant mathematical property: for any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos). This means the model can learn to attend to "the word 3 positions to my left" by learning a fixed linear combination of the positional dimensions. The relative position between any two words is encoded as a rotation in the sine/cosine space — a kind of foreshadowing of the approach that would eventually replace this one.
The other advantage: since the formula works for any position, the model can handle sequences longer than anything it saw during training. No lookup table to run off the end of.
import torch, math
def sinusoidal_pe(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_pe(100, 16)
# every row of pe is a distinct pattern of wave values: each position gets a unique fingerprint
# pe[pos + k] is a fixed linear function (a rotation) of pe[pos], for any offset k
Learned Positional Embeddings
BERT and GPT-2 took a more direct approach: make the positions trainable parameters. Create an embedding table with one row per possible position, and let backpropagation figure out what each position vector should look like.
This works well — often slightly better than sinusoidal when training data is plentiful, because the model can learn whatever positional patterns are useful rather than being constrained to sine waves. The catch: learned embeddings can't extrapolate. If you trained with max_len = 512, position 513 doesn't exist. You'd need to retrain or fine-tune to handle longer sequences.
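A minimal sketch of the learned approach, with hypothetical sizes (it is the same nn.Embedding trick the encoder at the end of this post uses):
import torch
import torch.nn as nn
max_len, d_model = 512, 256
pos_emb = nn.Embedding(max_len, d_model)        # one trainable vector per position
positions = torch.arange(128).unsqueeze(0)      # a 128-token sequence
x = torch.randn(1, 128, d_model)                # token embeddings (stand-in values)
x = x + pos_emb(positions)                      # inject position by addition
# pos_emb(torch.tensor([[512]])) would raise an IndexError: the table only has 512 rows (0-511)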
RoPE — Rotary Position Embeddings
Rotary Position Embeddings (RoPE), introduced by Su et al. in 2021, are what most modern large language models use — LLaMA, Mistral, Qwen, and many others.
The insight is subtle and beautiful. Instead of adding a position signal to the embedding, RoPE rotates the query and key vectors by an angle that depends on position. Pairs of dimensions are treated as 2D coordinates and rotated by pos × θᵢ, where θᵢ varies across dimension pairs.
In complex number notation, if we treat each pair of dimensions as a complex number z_k, then RoPE at position p computes: z_k' = z_k · e^(iθ_k·p). That's a rotation in the complex plane.
The beauty shows up in the dot product. When we compute the attention score between positions m and n, the rotation at position m and the conjugate rotation at position n partially cancel, leaving a factor of e^(iθ_k(m-n)). The result depends only on the relative distance (m - n), not on the absolute positions. RoPE naturally encodes relative position without anyone explicitly computing it.
I'm still building my intuition for why rotation specifically is such a natural fit here. The mathematical argument is clean — the dot product only sees the difference in angles — but there's something deeper about geometry and position that I suspect connects to much older ideas in signal processing. What I can say with confidence is that RoPE extrapolates to longer sequences far better than learned embeddings, and it's the default choice if you're building a Transformer today.
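If you want to see the rotation in code, here is a minimal sketch of RoPE applied to a (seq_len, d) tensor. It follows the original interleaved-pair convention; the function name and the self-check at the end are mine, not from any particular library:
import torch
def apply_rope(x, base=10000.0):
    # x: (seq_len, d) with d even; rotate each consecutive pair of dimensions
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)          # (d/2,) frequencies
    angles = torch.arange(seq_len).float().unsqueeze(1) * theta   # (seq_len, d/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                               # the two halves of each pair
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.reshape(seq_len, d)

q_vec, k_vec = torch.randn(64), torch.randn(64)
q_rot = apply_rope(q_vec.repeat(10, 1))   # the same query content placed at positions 0..9
k_rot = apply_rope(k_vec.repeat(10, 1))   # the same key content placed at positions 0..9
print(torch.allclose(q_rot[2] @ k_rot[5], q_rot[6] @ k_rot[9], atol=1e-4))  # True: only the offset (3) matters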
Congratulations on making it this far. You can stop here if you want.
You now have a working mental model of the core attention mechanism. You understand that self-attention lets every word look at every other word in one step. You understand the Q/K/V framework — three learned projections that create questions, labels, and content. You can trace through the math: dot products for similarity, scaling by √d_k to prevent softmax saturation, softmax for weights, and a weighted blend of values. You know that multi-head attention runs this in parallel to capture different relationship types. And you know that positional encoding solves the order-blindness problem.
That's enough to hold your own in most conversations about transformers. If someone asks "what is attention?" you have a complete, honest answer.
But there's a gap. We've described the attention mechanism — the engine. We haven't described the car. How do you actually wire attention into a network that you can stack, train, and scale to billions of parameters? That's the Transformer architecture, and it involves a few more ideas: residual connections, layer normalization, feed-forward networks, and the encoder-decoder split.
If the discomfort of not knowing what's underneath is nagging at you, read on.
The Residual Stream: How Information Flows Through a Transformer
Before we assemble the architecture, I want to introduce a mental model that changed how I think about Transformers.
Picture a river. The token embeddings — the initial numerical representations of our words — enter at the headwaters and flow downstream through every layer of the network. This river is the residual stream. Each attention layer and each feed-forward layer sits on the riverbank. They reach in, scoop out a copy of the water, process it, and pour their result back in. The river keeps flowing with its original water plus whatever each layer added.
Mechanically, this is the residual connection: the output of each sub-layer is x + SubLayer(x), not SubLayer(x) alone. The sub-layer doesn't replace the representation — it adds a correction, a delta, on top of it. The original embedding travels through the entire network essentially untouched, with each layer contributing its refinement.
Why does this matter so much? Two reasons.
First, during backpropagation, the gradient flows through two paths: through the sub-layer (which may shrink the gradient) and directly through the skip connection (which preserves it perfectly). Our river analogy holds — even if one tributary dries up, the main channel keeps flowing. This is why you can stack 96 Transformer layers (like GPT-3) and still train successfully. Without residual connections, gradients vanish and deep networks collapse.
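You can watch this directly. In the tiny sketch below, the sub-layer contributes nothing at all (a worst case), and the gradient still arrives intact through the skip path:
import torch
x = torch.ones(4, requires_grad=True)
def dead_sublayer(v):
    # a worst-case sub-layer whose own gradient contribution is zero
    return 0.0 * v

y = (x + dead_sublayer(x)).sum()   # residual form: x + SubLayer(x)
y.backward()
print(x.grad)                      # tensor([1., 1., 1., 1.]): the skip path preserved the gradient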
Second, it gives us a way to think about what each layer does. The attention layer reads from the stream, identifies relevant information from other positions, and writes an update back. The feed-forward layer reads from the stream, processes each position independently, and writes its update. Each layer is a small contribution. The final output is the sum of all contributions plus the original input. This "residual stream" interpretation, popularized by the mechanistic interpretability community, turns out to be deeply useful for understanding how information moves through Transformers.
Our river will come back. For now, let's look at what sits on the banks.
Layer Normalization and the Pre-Norm Revolution
If you stack many layers deep and each one adds its update to the residual stream, the magnitudes of the activations can drift — growing larger and larger, or oscillating wildly. Something needs to keep them in check. That something is layer normalization.
Layer normalization works on each token independently. It takes the d_model-dimensional vector for a single token, computes its mean and variance across those dimensions, and normalizes to zero mean and unit variance. Then it applies learned scale (γ) and shift (β) parameters to let the model restore any useful scaling.
Unlike batch normalization (which normalizes across the batch), layer norm doesn't depend on what other sentences happen to be in the same training batch. This makes it stable for variable-length sequences and small batch sizes — the regime Transformers operate in.
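Here is the operation spelled out: a quick sketch showing that a hand-rolled normalization over the last dimension matches nn.LayerNorm at its default initialization (γ = 1, β = 0):
import torch
import torch.nn as nn
x = torch.randn(2, 5, 512)                       # (batch, seq_len, d_model)
mean = x.mean(-1, keepdim=True)                  # per-token mean over the 512 dimensions
var = x.var(-1, unbiased=False, keepdim=True)    # per-token variance
manual = (x - mean) / torch.sqrt(var + 1e-5)     # normalize each token's vector
ln = nn.LayerNorm(512)                           # γ = 1, β = 0 at initialization
print(torch.allclose(ln(x), manual, atol=1e-5))  # True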
Now, here's a detail that seems minor but changed the course of Transformer development.
The original 2017 paper places the normalization after the residual addition:
output = LayerNorm(x + SubLayer(x)) ← Post-Norm
Most modern models flip it — normalization happens before the sub-layer:
output = x + SubLayer(LayerNorm(x)) ← Pre-Norm
In Pre-Norm, the residual connection is a clean identity shortcut — the gradient flows through it without passing through any normalization. The practical effect is dramatic. Pre-Norm models train reliably without learning-rate warmup tricks, even at depths of 100+ layers. Post-Norm models are notoriously finicky about warmup and initialization — skip either one and the first few training steps produce NaN losses.
GPT-2, GPT-3, LLaMA — they all use Pre-Norm. The original Post-Norm Transformer is like a classic car: historically important, but you wouldn't choose it for a road trip today unless you enjoy tinkering with the engine at every rest stop.
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))  # clean skip connection
In that code, notice how x passes straight through the addition untransformed. The normalization only affects what goes into the sub-layer. The skip connection is an unimpeded highway for both information and gradients.
The Feed-Forward Network: Where Knowledge Lives
After attention has gathered information from across the sequence, each token's representation needs to be processed — transformed, combined, refined. That's the job of the position-wise feed-forward network (FFN).
"Position-wise" means the same two-layer network is applied independently to each token. There's no interaction between positions in the FFN — that already happened in the attention layer. The FFN is a per-token operation.
FFN(x) = W₂ · activation(W₁ · x + b₁) + b₂
W₁ expands the dimension — typically by a factor of 4. In the original paper, d_model = 512 expands to d_ff = 2048. W₂ projects it back down to d_model. The expansion creates a high-dimensional space where the model can represent complex feature combinations; the contraction squeezes it back down.
Going back to our filing cabinet analogy: the attention layer is like walking through the room and collecting documents from various drawers. The FFN is like sitting down at a desk and actually reading them — synthesizing, cross-referencing, extracting the useful parts.
Here's something that genuinely fascinated me. Geva et al. (2021) showed that FFN layers in Transformers act as key-value memories. The first linear layer (W₁) functions as a set of "keys" — patterns the network has learned to recognize. The activation function selects which keys match the current input. The second linear layer (W₂) functions as "values" — the information associated with each matched key. Research from Meng et al. (2022) further demonstrated that factual knowledge — things like "The capital of France is Paris" — is primarily stored in FFN weights, not in attention patterns. You can surgically edit these weights to change specific facts without retraining the whole model.
My favorite thing about this finding is that, aside from the high-level explanation I described, no one is completely certain why factual knowledge concentrates in the FFN rather than in attention. The attention layer has access to all the same information. Yet empirically, the knowledge gravitates to the feed-forward weights. It's one of those results that we can measure and exploit, even as the deeper "why" remains open.
The original paper used ReLU as the activation function. Modern models have moved on. GELU (used in GPT-2, BERT) is smoother. SwiGLU (used in LLaMA, PaLM) goes further — it uses a gated mechanism where the network learns which dimensions to activate, with the gate controlled by a Swish function (x · sigmoid(x)). SwiGLU consistently outperforms ReLU and GELU in practice, at a modest additional computational cost. If you see a modern Transformer's FFN layer with three weight matrices instead of two, that's SwiGLU.
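For reference, here is a minimal sketch of what a SwiGLU feed-forward layer looks like. This is my own condensed version, not code lifted from LLaMA or PaLM; the third matrix is the gate:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # project back to d_model

    def forward(self, x):
        # silu(x) = x * sigmoid(x), i.e. Swish; the gate decides which d_ff dimensions pass through
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))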
Wiring It All Up: The Encoder Block
We have all the pieces. Time to wire them together.
Each encoder block wires together two sub-layers: multi-head self-attention and the position-wise FFN. Each sub-layer is wrapped in a residual connection and a layer normalization. Using Pre-Norm ordering:
x₁ = x + MultiHeadAttention(LayerNorm(x))
x₂ = x₁ + FFN(LayerNorm(x₁))
In our river analogy: the water (token representations) flows past the attention station, which reaches in, reads the water, and pours its processed result back. Then the water continues to the FFN station, which does the same. The main channel — the residual stream — keeps flowing through, accumulating the contributions of each station.
Stack N of these blocks. The original paper used N = 6. Each block refines the representations further — early layers tend to capture syntactic relationships, later layers capture more abstract semantic ones. The output of the final block is a set of contextualized representations: every token's vector now encodes not only that token but also its relationship to every other token in the sequence, refined through six rounds of attention and processing.
The Decoder Block and Causal Masking
The decoder generates output one token at a time. It needs to know two things: what it has already generated, and what the input says. This gives it three sub-layers instead of the encoder's two.
x₁ = x + MaskedSelfAttention(LayerNorm(x)) ← look at past output tokens
x₂ = x₁ + CrossAttention(LayerNorm(x₁), encoder_out) ← look at input
x₃ = x₂ + FFN(LayerNorm(x₂)) ← process everything
The first sub-layer is self-attention with a twist: a causal mask that prevents each position from attending to future positions. When the decoder is working at position 5, it can attend to positions 1 through 5 and no further, because the tokens beyond it haven't been generated yet.
The mask is a lower-triangular matrix:
import torch
causal_mask = torch.tril(torch.ones(5, 5))
# [[1, 0, 0, 0, 0], ← word 1 sees only itself
# [1, 1, 0, 0, 0], ← word 2 sees words 1-2
# [1, 1, 1, 0, 0], ← word 3 sees words 1-3
# [1, 1, 1, 1, 0], ← word 4 sees words 1-4
# [1, 1, 1, 1, 1]] ← word 5 sees everything so far
Positions with a zero get their attention scores set to -∞ before softmax, which drives those weights to exactly zero. No information leaks from the future.
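Continuing the snippet above, here is how those zeros turn into -∞ scores and then into exactly-zero attention weights:
scores = torch.randn(5, 5)                                   # stand-in attention scores
masked = scores.masked_fill(causal_mask == 0, float('-inf')) # block the future
weights = torch.softmax(masked, dim=-1)
print(weights[0])  # tensor([1., 0., 0., 0., 0.]): position 1 attends only to itself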
During training, something clever happens. You feed the entire target sequence at once (a technique called teacher forcing) and let the mask enforce causality. All positions are computed in parallel, but each position can only see the past. This preserves the autoregressive property — "I can only condition on what came before me" — while still getting the full parallelism advantage of self-attention. It's the best of both worlds: the generation semantics of an RNN with the training speed of a Transformer.
A classic bug is forgetting the causal mask or applying it incorrectly. Your model "cheats" during training by looking at future tokens, achieves suspiciously low training loss, and then produces garbage at inference time (where future tokens don't exist). Always verify your mask by checking that the attention weights form a lower-triangular pattern.
Cross-Attention: The Bridge Between Encoder and Decoder
The second sub-layer of the decoder is cross-attention. The mechanism is identical to self-attention, with one difference: where do Q, K, and V come from?
In self-attention, all three come from the same sequence. In cross-attention, the queries come from the decoder — "what does the decoder need to know?" — but the keys and values come from the encoder's output — "what information is available from the input?"
Back to our translation example. The encoder has processed "The cat sat on the mat because it was tired" and produced a set of contextualized representations. The decoder is generating the Spanish translation and has produced "El gato se sentó en la alfombra porque" so far. When it reaches the next position, it uses cross-attention to ask "what should come next?" The query (from the decoder) routes to the encoder representation of "it" and "tired," retrieves the information it needs, and produces "estaba" (was) or "cansado" (tired).
The filing cabinet analogy applies here too, but now the person with the question (the decoder) is searching through a different room of cabinets (the encoder's outputs). The questions are about what to generate next. The drawers and their contents are about what the input said.
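Here is the difference in code, as a minimal sketch with random stand-in tensors. The only change from self-attention is where Q versus K and V come from:
import torch
import torch.nn.functional as F
import math

d = 64
decoder_states = torch.randn(1, 7, d)      # 7 decoder positions generated so far (stand-in values)
encoder_output = torch.randn(1, 10, d)     # 10 contextualized input tokens (stand-in values)
W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
Q = decoder_states @ W_q                   # questions come from the decoder
K = encoder_output @ W_k                   # labels come from the encoder
V = encoder_output @ W_v                   # contents come from the encoder
weights = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
out = weights @ V                          # shape (1, 7, 64): one output per decoder position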
The Full Architecture
Stack N encoder blocks and N decoder blocks. The original paper used N = 6 for each. Here's the complete flow:
INPUT TOKENS
→ Token Embedding + Positional Encoding
→ Encoder Block 1 (self-attention → FFN)
→ Encoder Block 2
→ ...
→ Encoder Block N
→ [Encoder output: contextualized representations]
↓ (keys & values flow to every decoder cross-attention layer)
OUTPUT TOKENS (shifted right by one)
→ Token Embedding + Positional Encoding
→ Decoder Block 1 (masked self-attn → cross-attn → FFN)
→ Decoder Block 2
→ ...
→ Decoder Block N
→ Linear projection to vocabulary size
→ Softmax
→ Next-token probabilities
The "shifted right" detail matters. During training, the decoder's input is the target sequence shifted by one position — at each step, it sees the correct previous tokens and learns to predict the next one. This is teacher forcing in action.
One more thing that tripped me up for a while: the same encoder output gets fed as keys and values to every decoder block's cross-attention layer. The encoder runs once. The decoder reads from it repeatedly, asking different questions at each layer.
The original Transformer has both encoder and decoder. But most modern models use only one half. BERT is encoder-only — no decoder, no cross-attention, no causal mask. Great for classification, named entity recognition, understanding. GPT is decoder-only — no encoder, no cross-attention, only causal self-attention. Great for generation. T5 and BART use the full encoder-decoder. Decoder-only models have dominated since GPT-3 because they unify understanding and generation into a single paradigm. When someone says "transformer" in 2024, they almost always mean decoder-only with causal masking.
Putting It Together in PyTorch
With all the pieces understood, let's wire up a complete Transformer encoder in code. Multi-head attention, FFN, Pre-Norm residuals, positional embeddings — the whole thing. No hidden abstractions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # One big projection for all heads, then reshape
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        B, N, _ = x.shape
        # Project, then split into heads: (B, N, d_model) → (B, n_heads, N, d_k)
        Q = self.W_q(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = self.dropout(F.softmax(scores, dim=-1))
        # Weighted sum, merge heads, final projection
        out = (attn @ V).transpose(1, 2).contiguous().view(B, N, -1)
        return self.W_o(out)

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class EncoderBlock(nn.Module):
    """Pre-Norm Transformer encoder block."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, n_heads, dropout)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff, dropout)

    def forward(self, x, mask=None):
        x = x + self.attn(self.norm1(x), mask)  # attention + residual
        x = x + self.ffn(self.norm2(x))         # FFN + residual
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6,
                 d_ff=2048, max_len=512, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.layers = nn.ModuleList([
            EncoderBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.final_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids, mask=None):
        B, N = token_ids.shape
        positions = torch.arange(N, device=token_ids.device).unsqueeze(0)
        x = self.tok_emb(token_ids) * math.sqrt(self.d_model)
        x = x + self.pos_emb(positions)
        x = self.dropout(x)
        for layer in self.layers:
            x = layer(x, mask)
        return self.final_norm(x)
Every line maps to a concept we've discussed. The view and transpose calls in the attention module are splitting the d_model dimensions into separate heads. The * math.sqrt(self.d_model) in the encoder scales up the embeddings so they aren't dwarfed by the positional encodings (a detail from the original paper). The final LayerNorm after the last block is the Pre-Norm convention's finishing touch.
encoder = TransformerEncoder(vocab_size=30000)
dummy = torch.randint(0, 30000, (2, 128))
output = encoder(dummy)
print(output.shape) # torch.Size([2, 128, 512])
total_params = sum(p.numel() for p in encoder.parameters())
print(f"Parameters: {total_params:,}") # ~39M for this config
39 million parameters for a 6-layer, 512-dimension encoder. GPT-3 has 175 billion. The architecture is the same — the difference is scale. More layers, wider dimensions, more heads, more data. That scaling is what made the Transformer the foundation of the current era.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with a specific frustration: RNNs force information to travel one hop per word, and signals decay. We built self-attention from first principles — three learned projections (Q, K, V) and a scaled dot product — to give every word a direct line to every other word. We traced the math by hand with three tiny vectors and watched "The" absorb 88% of "cat"'s information in one step. We added multiple heads so the model could look in many directions at once. We confronted the order-blindness problem and saw three generations of solutions — sinusoidal, learned, and rotary encodings. Then we assembled the full architecture: residual connections as a river that carries information forward, layer normalization to keep things stable, feed-forward networks where factual knowledge lives, and the encoder-decoder split that wires it all together.
My hope is that the next time you see "Attention(Q, K, V) = softmax(QK⊤/√d_k)V" written on a whiteboard, instead of nodding along and changing the subject, you'll be able to trace every operation, explain why the √d_k is there, and describe how this single equation — repeated across heads, layers, and billions of parameters — became the engine behind nearly every frontier AI system in the world.
Resources
"Attention Is All You Need" — Vaswani et al. (2017). The O.G. paper. Surprisingly readable. The architecture diagrams alone are worth the visit.
"Transformers from Scratch" — Brandon Rohrer's blog post. Builds from one-hot encoding to the full Transformer with a voice-controlled computer example. Wildly helpful for building intuition from absolute zero.
"The Illustrated Transformer" — Jay Alammar. Gorgeous visual walkthrough. If you learn better from diagrams than equations, start here.
"RoFormer: Enhanced Transformer with Rotary Position Embedding" — Su et al. (2021). The RoPE paper. Concise and elegant.
"Transformer Feed-Forward Layers Are Key-Value Memories" — Geva et al. (2021). Changes how you think about what FFN layers do. Insightful and well-written.
"What Does BERT Look At?" — Clark et al. (2019). Visualization of what different attention heads learn. Fascinating empirical work showing heads specializing in syntax, coreference, and positional patterns.