LLM Families & Architecture
I avoided the LLM model zoo for an embarrassingly long time. Every week some lab would release a new model — LLaMA-something, Mistral-whatever, yet another GPT variant — and I'd nod along in meetings pretending I understood how they differed. "Oh yeah, that one uses GQA," I'd say, having no earthly idea what that meant at the hardware level. I'd scroll past architecture diagrams on Twitter, register a vague sense of guilt, and move on. Finally the discomfort of not knowing what's actually under the hood of these models — the ones I use every day — grew too great. Here is that dive.
Large Language Models come in families, much the way car manufacturers release model lineups. GPT is a lineage. LLaMA is a lineage. Mistral, Claude, Gemini — each is a family with its own design philosophy, its own set of architectural bets, and its own evolutionary arc from first release to current flagship. What's remarkable is that despite coming from different labs with different budgets and different goals, they've all converged on a surprisingly similar set of internal components. Understanding those shared components — and where each family diverges — is what this section is about.
Before we start, a heads-up. We're going to be talking about attention variants, position encodings, activation functions, and some matrix-level mechanics. You don't need to be an expert in any of it beforehand. We'll build each concept from a concrete example before giving it its proper name. If you've read the transformer and attention sections earlier in this book, you'll have a head start, but we'll re-derive what we need as we go.
This isn't a short journey, but I hope you'll be glad you came.
What we'll cover:
The decoder-only bet
The GPT lineage — scaling as a strategy
The LLaMA revolution — open weights change everything
Inside the modern LLM block — RMSNorm, SwiGLU, and why they won
How RoPE encodes position through rotation
Rest stop
The KV cache problem and Grouped-Query Attention
Mistral's tricks — sliding windows and Mixture-of-Experts
The broader landscape — Claude, Gemini, DeepSeek, Phi
Open vs. closed — the great divide
Model cards and responsible release
Wrap-up
The Decoder-Only Bet
Let me set up a scenario we'll return to throughout. Imagine you're an ML engineer at a small startup called TinyChat. Your job: build a chatbot that answers customer questions for a restaurant chain. You have a modest GPU budget and a launch deadline. You open Hugging Face, type "language model," and get hit with thousands of results. Where do you even start?
The first decision that narrows the field is architecture type. The original 2017 transformer had two halves — an encoder that reads input and a decoder that produces output. After that paper, the field split into three camps, each keeping different halves of the blueprint.
Encoder-only models (BERT and its descendants) kept the encoder half. They read the entire input at once — every token can see every other token, in both directions. This bidirectional view makes them exceptional at understanding text: classifying sentiment, tagging named entities, computing sentence embeddings. But they have no built-in mechanism for generating new tokens one after another.
Encoder-decoder models (T5, BART) kept both halves. The encoder reads the input, the decoder generates the output. This is the most faithful descendant of the original transformer. For a while, T5's elegant "text-to-text" framing — where every task is reformulated as "translate this input into that output" — seemed like it might win.
Decoder-only models (GPT and its descendants) kept the decoder half. They process tokens left-to-right, each token attending only to what came before it. This causal attention pattern — "I can see the past but not the future" — is what makes autoregressive generation possible. You predict one token, append it, predict the next, and so on.
For your TinyChat project, you need a model that generates answers. That rules out encoder-only. So the real question is: encoder-decoder or decoder-only?
Decoder-only won, and the reasons are worth understanding because they shaped the entire LLM landscape. First, one stack is easier to scale than two. When you're pushing model size from 1 billion to 100 billion parameters, every architectural simplification matters for distributed training. Second, the autoregressive training objective — predict the next token — turns out to be a universal task interface. Want translation? Format the prompt as "Translate to French: [text]". Want classification? "Is this review positive or negative? [text]". You don't need an encoder-decoder boundary when prompting handles the routing. Third, decoder-only models get a KV cache — during generation, you compute each token's key-value pair once and reuse it for all subsequent tokens. This makes inference efficient. Encoder-decoder models need to re-process the full input through the encoder, which is wasteful for long conversations.
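To make that "predict one token, append it, predict the next" loop concrete, here's a toy sketch of autoregressive decoding. The next_token_logits function is a made-up stand-in for illustration, not a real model; the point is the shape of the loop: score, pick, append, repeat.

# A toy sketch of the autoregressive loop every decoder-only model runs at
# inference time. `next_token_logits` is a made-up stand-in, not a real model.
def next_token_logits(tokens, vocab_size=10):
    # Stand-in for a forward pass: strongly prefer "last token + 1".
    scores = [0.1] * vocab_size
    scores[(tokens[-1] + 1) % vocab_size] = 5.0
    return scores

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)                             # uses everything so far
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)                                      # append, then repeat
    return tokens

print(generate([3, 4]))  # [3, 4, 5, 6, 7, 8, 9]
# In a real model, each pass reuses cached key/value vectors (the KV cache)
# for tokens already processed instead of recomputing them.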
I'll be honest — I resisted this for a while. Encoder-decoder felt more "principled" to me, more like the clean machine translation architecture from the original paper. But engineering won over elegance. One stack, scaled far enough, with a flexible enough training objective, ate the world.
That said, encoder-only models aren't dead — they're specialized. If TinyChat needed to classify customer complaints into categories (billing, food quality, delivery), a 110M-parameter DeBERTa would outperform a 7B decoder model at a fraction of the cost. Right tool, right problem.
The GPT Lineage — Scaling as a Strategy
With decoder-only established as the winning architecture type, the question became: what do you do with it? OpenAI's answer was startling in its simplicity. Make it bigger. Train it longer. That's it.
The GPT story is a story about removing constraints, one generation at a time. Let's trace it through TinyChat's eyes — imagine you're building your chatbot in each era and see what each GPT generation would have given you.
GPT-1 (2018, 117 million parameters) proved a single idea: if you pre-train a decoder-only transformer on a large book corpus using next-token prediction, and then fine-tune it on a specific task, it outperforms models that were trained from scratch on that task alone. The architecture was modest — 12 layers, 12 attention heads, 768-dimensional hidden states. If you were building TinyChat in 2018, you'd download GPT-1, fine-tune it on your restaurant Q&A data, and get reasonable results. But you'd need that fine-tuning step. The base model alone couldn't answer questions — it would ramble incoherently about whatever came next in its training distribution.
GPT-2 (2019, 1.5 billion parameters) changed that calculus. The paper's title said it all: "Language Models are Unsupervised Multitask Learners." Scale the model 13× and train it on a larger, more diverse web corpus (WebText, ~40GB), and something unexpected happens — the model starts performing tasks it was never explicitly trained for. Summarization, translation, question answering, all at respectable quality, without a single gradient update on task-specific data. This was zero-shot capability emerging from scale. For TinyChat in 2019, you could prompt GPT-2 with "Q: What are your hours? A:" and get a coherent answer. Not great, but coherent. Without any restaurant-specific training.
OpenAI initially withheld GPT-2's full weights, worried about misuse. That decision — and the debate it sparked — was the first serious public conversation about responsible AI release. We'll come back to that.
GPT-3 (2020, 175 billion parameters) made the leap to few-shot in-context learning. The key discovery: if you give the model a few examples of the task you want, right there in the prompt, it figures out the pattern and continues accordingly. No fine-tuning. No gradient updates. The task specification lives entirely in the prompt.
# The GPT-3 breakthrough: few-shot in-context learning
# No training required — the examples ARE the task specification
prompt = """
Customer: What time do you close on Sundays?
Answer: We close at 9 PM on Sundays.
Customer: Do you have vegan options?
Answer: Yes! We have a full vegan menu including our popular jackfruit tacos.
Customer: Is there parking available?
Answer:"""
# GPT-3 completes: "Yes, we have a free parking lot behind the restaurant
# that can accommodate up to 30 vehicles."
#
# It inferred the format (Q&A), the domain (restaurant), the tone
# (helpful, specific), and generated a plausible answer — all from
# three examples in the prompt. That's in-context learning.
GPT-3's training cost was estimated at $4.6 million. It had 96 layers, 96 attention heads, and a hidden dimension of 12,288. It created the API-as-a-product business model — you couldn't download the weights, but you could pay per token to use it through an API. For TinyChat, this was transformative. You didn't need GPUs, training expertise, or ML infrastructure. You needed a credit card and a good prompt.
InstructGPT / GPT-3.5 (2022) added the human alignment layer. Raw GPT-3 was capable but erratic — it would sometimes be helpful, sometimes offensive, sometimes confidently wrong. Reinforcement Learning from Human Feedback (RLHF) fixed this by fine-tuning the model to follow instructions and be helpful, harmless, and honest. This was the model that powered early ChatGPT. The revelation: alignment didn't make the model dumber — it made it more useful. A model that follows your intent is worth more than a model that's smarter but unpredictable.
GPT-4 (2023) and GPT-4o (2024) extended the trajectory to multimodal inputs (images, audio, text in a single model) and dramatically improved reasoning. GPT-4 passes the bar exam. It writes production code. Its architecture has never been officially disclosed, though it's widely rumored to use a Mixture-of-Experts design — a concept we'll unpack later. GPT-4o added native audio and vision processing, making it omni-modal at twice the speed and half the cost of GPT-4 Turbo.
Here's the throughline that matters: the architecture barely changed across six years. Each generation removed a constraint by adding scale or training signal. GPT-1 proved pre-training transfers. GPT-2 showed you can skip fine-tuning. GPT-3 showed that a few examples in the prompt can stand in for task-specific training entirely. InstructGPT showed you can align models. GPT-4 added modalities. OpenAI's core thesis was always: a good enough architecture, scaled far enough, will develop capabilities that no amount of architectural cleverness at small scale can match.
Whether that thesis holds forever is one of the most debated questions in the field right now. But from 2018 to 2024, the scaling camp won every round. And every competitor — from Google to Meta to Mistral — had to respond.
The LLaMA Revolution — Open Weights Change Everything
The response came from an unexpected direction. In February 2023, Meta released LLaMA-1 — a family of models (7B, 13B, 33B, 65B parameters) that were competitive with GPT-3-class systems. The weights were restricted to researchers, but within a week they leaked. And the open-source LLM explosion began.
What made LLaMA-1 different wasn't a breakthrough architecture — it was a breakthrough training philosophy. A 2022 paper from DeepMind called "Chinchilla" had shown that most large models were undertrained. The field had been making models bigger and bigger, but not training them on enough data. Chinchilla demonstrated that a smaller model trained on more tokens could match or beat a larger model trained on fewer tokens. LLaMA-1 applied this aggressively: its 65B model was trained on 1.4 trillion tokens of publicly available data. For context, GPT-3 (175B parameters — nearly 3× larger) was trained on about 300 billion tokens.
Think of it like this. The scaling labs had been building bigger engines and putting in the same amount of fuel. LLaMA showed that a smaller engine with a full tank goes further.
LLaMA-1's architecture introduced the combination of components that would become the modern standard: RoPE for position encoding, RMSNorm for normalization, SwiGLU for the feed-forward activation, and pre-normalization throughout. We'll unpack each of these in detail. For now, the important point is that these weren't novel inventions — each had been proposed in earlier papers — but LLaMA was the first widely-available model to package them all together into a clean, reproducible recipe.
LLaMA-2 (July 2023) shipped with a commercial license — the first time a frontier-class open model was available for businesses. It added Grouped-Query Attention (GQA) for the 70B variant (a memory optimization we'll dig into later) and included chat-tuned variants fine-tuned with RLHF, making it a drop-in alternative to API-based chatbots.
LLaMA-3 (April 2024) jumped to 15 trillion training tokens — more than 10× LLaMA-1 — with an expanded vocabulary of 128,000 tokens (up from 32,000). It launched in 8B and 70B variants, with a 405B variant following a few months later in the 3.1 release. The 8B model rivals LLaMA-2's 70B. The 405B model competes with GPT-4-class systems on many benchmarks. All with open weights.
For your TinyChat startup, the LLaMA moment was pivotal. Before LLaMA, building a capable chatbot meant either paying OpenAI's API costs forever or spending millions training your own model. After LLaMA, you could download a 7B model, fine-tune it on your restaurant data for a few hundred dollars of GPU time, and deploy it on your own servers. No API dependency. No per-token costs. Full control over the model's behavior. The economics of AI changed overnight.
I still find it remarkable that a 7B-parameter model you can run on a gaming laptop can carry on a coherent conversation about restaurant menus. In 2020 that required 175 billion parameters and a data center. The engine got smaller. The fuel got better.
Inside the Modern LLM Block — RMSNorm and SwiGLU
Now we need to open the hood. Every model we've discussed — LLaMA, Mistral, Qwen, Phi — is built from the same type of building block: a transformer decoder layer. Stack 32 of these blocks and you get a 7B model. Stack 80 and you get a 70B model. The differences between model families come down to the specific components inside each block.
Let's build a single block from scratch, using our TinyChat model as the motivating example. Suppose our model has a hidden dimension of 4 — absurdly small, but it lets us see every number. A real model like LLaMA-2 7B uses a hidden dimension of 4,096, but the mechanics are identical.
A transformer decoder layer does two things: it lets tokens talk to each other (attention), and then it processes each token's representation independently (the feed-forward network, or FFN). The original 2017 transformer did these sequentially, with normalization after each step:
# Post-norm (original 2017 transformer, GPT-1, BERT):
# Token comes in as vector x (dimension 4 in our toy example)
x = x + Attention(x) # tokens talk to each other
x = LayerNorm(x) # normalize
x = x + FFN(x) # each token processed independently
x = LayerNorm(x) # normalize again
This worked for models with 12 layers. But when people tried to stack 60, 80, 96 layers — the kind of depth you need for a 70B model — post-norm became a minefield. Gradients flowing backward through dozens of LayerNorm operations would accumulate instabilities. Training runs would collapse without warning. You needed extremely careful learning rate warmup schedules, and even then, some runs would diverge unpredictably at step 50,000.
The fix was disarmingly straightforward: put the normalization before each sub-layer instead of after.
# Pre-norm (LLaMA, Mistral, every modern LLM):
x = x + Attention(RMSNorm(x)) # normalize BEFORE attention
x = x + FFN(RMSNorm(x)) # normalize BEFORE FFN
Why does this help? Look at the residual path. In pre-norm, the main highway — the addition x = x + ... — is a clean sum. Gradients flow straight through the addition without passing through any normalization. The normalization happens on a side road, inside the sub-layer, where it stabilizes the computation without disrupting the gradient highway. In post-norm, the normalization sits on the highway. Every layer forces gradients through another nonlinear operation, and that compounds over 80 layers.
I'll be honest — when I first saw this change, I thought it was trivially cosmetic. Moving a layer norm up by one line? That fixes deep training? But the gradient math is unambiguous: pre-norm preserves the residual stream identity, and that's what makes 80-layer training stable.
Now, about that normalization function itself. Standard LayerNorm computes the mean and variance across a token's hidden dimensions, subtracts the mean, and divides by the standard deviation. RMSNorm — Root Mean Square Normalization — drops the mean-centering step entirely and normalizes by the root-mean-square alone:
# LayerNorm: center then scale
# mean = (x₁ + x₂ + x₃ + x₄) / 4
# variance = Σ(xᵢ - mean)² / 4
# LayerNorm(x) = (x - mean) / √(variance + ε) * γ
#
# RMSNorm: scale only (no centering)
# rms = √((x₁² + x₂² + x₃² + x₄²) / 4)
# RMSNorm(x) = x / rms * γ
#
# For our tiny vector x = [2.0, -1.0, 0.5, 1.5]:
# rms = √((4.0 + 1.0 + 0.25 + 2.25) / 4) = √(1.875) ≈ 1.37
# RMSNorm(x) ≈ [1.46, -0.73, 0.37, 1.10] * γ
#
# The mean subtraction in LayerNorm costs extra compute and,
# empirically, doesn't help training quality in deep transformers.
# RMSNorm saves ~10-15% of normalization compute for free.
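If you want to poke at the numbers yourself, here's a minimal NumPy sketch of both normalizations, reproducing the toy vector above. The learned scale γ is set to 1 for clarity; real models learn a per-dimension γ.

# Minimal NumPy sketch: LayerNorm vs. RMSNorm on the toy vector above.
import numpy as np

def layer_norm(x, gamma=1.0, eps=1e-6):
    mean = x.mean()
    var = x.var()
    return (x - mean) / np.sqrt(var + eps) * gamma

def rms_norm(x, gamma=1.0, eps=1e-6):
    rms = np.sqrt(np.mean(x ** 2) + eps)   # no mean subtraction
    return x / rms * gamma

x = np.array([2.0, -1.0, 0.5, 1.5])
print(rms_norm(x))    # ~[1.46, -0.73, 0.37, 1.10]
print(layer_norm(x))  # centered first, so the values differ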
Now for the feed-forward network. The original transformer used a simple two-layer MLP with ReLU: take the input, project it to a wider dimension (4× wider, typically), apply ReLU, project back down. It works. But it treats every dimension identically — the activation function applies element-wise, with no interaction between dimensions.
SwiGLU (Shazeer, 2020) adds a gating mechanism. Think of it as a dimmer switch. Instead of every dimension passing through with the same on/off treatment, SwiGLU lets the network learn which dimensions to amplify and which to suppress, for each token individually:
# Original FFN:
# FFN(x) = ReLU(x @ W₁) @ W₂
# Two weight matrices. Simple. Each dimension independent.
#
# SwiGLU FFN:
# SwiGLU(x) = (Swish(x @ W_gate) ⊙ (x @ W₁)) @ W₂
#
# Three weight matrices now: W_gate, W₁, W₂
# ⊙ means element-wise multiplication (the "gating")
# Swish(z) = z × sigmoid(z) — a smooth, non-monotonic activation
#
# Walkthrough with our tiny 4-dim example, x = [2.0, -1.0, 0.5, 1.5]:
#
# gate_values = Swish(x @ W_gate) → e.g., [0.9, 0.1, 0.7, 0.3]
# main_values = x @ W₁ → e.g., [1.5, -2.0, 0.8, 1.2]
# gated = gate_values ⊙ main_values → [1.35, -0.2, 0.56, 0.36]
# ↑ amplified ↑ suppressed
# output = gated @ W₂ → back to model dimension
#
# The gate decides: "Let this dimension through strongly,
# choke that one off." The network learns WHAT to gate
# during training. Different tokens get different gates.
My favorite thing about SwiGLU is that, aside from high-level intuitions like "gating helps," no one is completely certain why it works so well. Noam Shazeer — the author — tested a dozen gated activation variants. SwiGLU came out on top empirically, giving ~1–3% better performance across benchmarks for the same compute budget. The field adopted it because the numbers spoke for themselves, not because we have an airtight theoretical explanation.
The cost of the gating mechanism is a third weight matrix (W_gate). To keep the total parameter count roughly the same as the old FFN, modern models shrink the inner dimension from 4×d_model to approximately (8/3)×d_model. It's a parameter-neutral swap that buys better performance. Everyone takes that deal.
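Here's a minimal NumPy sketch of a SwiGLU FFN with the (8/3)× inner dimension. The weights are random stand-ins; in a real model W_gate, W₁, W₂ are learned and d_ff is rounded up to a hardware-friendly size.

# Minimal NumPy sketch of a SwiGLU feed-forward block (random stand-in weights).
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))           # z * sigmoid(z)

def swiglu_ffn(x, W_gate, W_1, W_2):
    gate = swish(x @ W_gate)                 # how open each inner channel is
    main = x @ W_1                           # the "content" path
    return (gate * main) @ W_2               # element-wise gating, project back

d_model = 4
d_ff = int(8 / 3 * d_model)                  # roughly parameter-neutral vs. 4x + ReLU
rng = np.random.default_rng(0)
W_gate = rng.normal(size=(d_model, d_ff))
W_1 = rng.normal(size=(d_model, d_ff))
W_2 = rng.normal(size=(d_ff, d_model))

x = np.array([2.0, -1.0, 0.5, 1.5])
print(swiglu_ffn(x, W_gate, W_1, W_2).shape)  # (4,): back to the model dimension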
How RoPE Encodes Position Through Rotation
Attention has no built-in sense of order. If you shuffle the tokens in a sentence before feeding them to the attention mechanism, the attention scores don't change — each token attends to every other token based on content alone, with no notion of "this word came before that one." This is why transformers need some form of position encoding: a way to inject information about where each token sits in the sequence.
Early approaches tried two things. Learned absolute position embeddings (used by BERT, GPT-2) assign a trainable vector to each position — position 0 gets one vector, position 1 gets another, up to the maximum sequence length. This works, but it has a fatal flaw: the model has never seen position 513 during training (if trained on 512-token sequences), so it falls off a cliff the moment you try to process longer text.
Sinusoidal position embeddings (the original transformer) use fixed sine and cosine waves at different frequencies. They don't need to be learned, but they encode absolute positions. The model has to learn on its own that position 5 and position 8 are "3 apart" — that relative relationship isn't baked in.
What we really want is for the attention mechanism to naturally understand relative positions. "The word 3 positions back" matters far more than "the word at position 847." Rotary Position Embeddings — RoPE (Su et al., 2021) — achieve this through a beautiful geometric trick.
Here's the core idea, and I want to build it from the ground up because the math is more visual than it looks. Imagine a 2D vector — two numbers. We can visualize it as an arrow on a flat plane. Now imagine rotating that arrow by some angle. A rotation in 2D is a well-defined operation: multiply by a rotation matrix.
# A 2D vector [x₁, x₂] rotated by angle θ:
#
# ┌ cos(θ) -sin(θ) ┐ ┌ x₁ ┐ ┌ x₁·cos(θ) - x₂·sin(θ) ┐
# │ │ × │ │ = │ │
# └ sin(θ) cos(θ) ┘ └ x₂ ┘ └ x₁·sin(θ) + x₂·cos(θ) ┘
#
# RoPE's insight: use position as the rotation angle.
# Token at position m → rotate by angle m·θ₀
# Token at position n → rotate by angle n·θ₀
#
# So the query at position 5 gets rotated by 5·θ₀
# And the key at position 3 gets rotated by 3·θ₀
Now here's where it gets elegant. When the query at position m and the key at position n compute their dot product (which is how attention scores work), the two rotations combine. The dot product of two rotated vectors depends on the difference in their rotation angles: m·θ₀ − n·θ₀ = (m−n)·θ₀. The attention score becomes a function of (m−n) — the relative distance — not of m or n individually.
That's it. That's the entire trick. Relative position encoding, for free, from rotation.
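Don't take my word for it. Here's a tiny NumPy check with made-up vectors and an arbitrary θ, showing that the dot product of the rotated query and key depends only on the position difference.

# Check: rotate q by m*theta and k by n*theta; the dot product depends only on (m - n).
import numpy as np

def rotate(v, angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([v[0] * c - v[1] * s, v[0] * s + v[1] * c])

theta = 0.1
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Positions (5, 3) and (105, 103) have the same relative offset of 2...
score_a = rotate(q, 5 * theta) @ rotate(k, 3 * theta)
score_b = rotate(q, 105 * theta) @ rotate(k, 103 * theta)
print(np.isclose(score_a, score_b))  # True: only (m - n) matters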
For vectors longer than 2D (and real models use 128-dimensional head vectors), RoPE pairs up dimensions: (dim 1, dim 2), (dim 3, dim 4), and so on. Each pair gets rotated at a different frequency, similar to how a clock has a fast second hand and a slow hour hand. The different frequencies let the model distinguish positions at different scales — nearby tokens through high-frequency rotations, distant tokens through low-frequency ones.
# RoPE for a d-dimensional vector (d=128 in LLaMA):
# Pair dimensions: (0,1), (2,3), (4,5), ..., (126,127)
# Each pair i gets frequency: θᵢ = 10000^(-2i/d)
#
# Position m rotates pair i by angle m · θᵢ
#
# Low i (early pairs): high frequency — changes fast with position
# → good for distinguishing nearby tokens
# High i (late pairs): low frequency — changes slowly
# → good for distinguishing distant tokens
#
# Like a clock:
# Second hand (high freq) = tells you "3 seconds ago"
# Hour hand (low freq) = tells you "2 hours ago"
# Both together = full position information
The practical superpower of RoPE is context length extension. Because positions are encoded as rotation frequencies, you can scale them to handle sequences longer than the training length. The simplest approach — linear interpolation — divides all position indices by a factor (to go from 4K to 16K context, divide positions by 4). More sophisticated approaches like NTK-aware scaling adjust the base frequency, and YaRN (Yet another RoPE extensioN) combines scaling with attention temperature adjustments for each dimension. This is why LLaMA-3 can train at 8K context but extend to 128K with continued pre-training — the mathematical structure of RoPE supports stretching gracefully.
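Here's a small sketch of that frequency ladder and of the simplest extension trick, linear position interpolation. The head dimension and scale factor are illustrative, not taken from any particular model.

# RoPE frequency ladder plus linear position interpolation (illustrative values).
import numpy as np

d_head = 8                                 # real models use e.g. 128
i = np.arange(d_head // 2)                 # one frequency per dimension pair
freqs = 10000.0 ** (-2.0 * i / d_head)     # fast "second hands" down to slow "hour hands"

position = 1000
angles = position * freqs                  # rotation angle for each pair at this position

# Linear interpolation: squeeze positions by a factor so a 4x-longer sequence
# stays inside the range of angles the model saw during training.
scale = 4
angles_extended = (position / scale) * freqs

print(freqs)                               # decreasing frequencies across pairs
print(angles[:2], angles_extended[:2])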
There's an alternative approach worth knowing: ALiBi (Attention with Linear Biases), used in models like BLOOM and MPT. Instead of rotating vectors, ALiBi adds a simple linear penalty to attention scores based on distance — the further apart two tokens are, the more their attention score gets reduced. ALiBi is simpler to implement and extrapolates well to long contexts, but it's less expressive than RoPE. As of 2024, RoPE has won the popularity contest: LLaMA, Mistral, Qwen, Phi, and most new models use it. ALiBi appears in fewer new architectures.
Rest Stop
Congratulations on making it this far. You can stop here if you want.
You now have a solid mental model of the LLM landscape. You know why decoder-only won. You can trace the GPT lineage and explain what changed at each generation. You know that LLaMA democratized open-weight models and introduced the modern architectural recipe. You understand that recipe at the component level: pre-norm with RMSNorm for stable training, SwiGLU for expressive gating in the FFN, and RoPE for relative position encoding through rotation.
That's a genuinely useful picture. You could walk into an interview and explain why every modern LLM looks similar under the hood, and you'd be right. The short version is: the field converged on a set of components that work well together — RMSNorm, SwiGLU, RoPE, pre-norm — and the interesting differences between models are about training data, scale, alignment methods, and a handful of attention/efficiency tricks.
There. You're about 60% of the way through this section's knowledge.
But if the discomfort of not knowing what those attention tricks actually are — GQA, sliding window attention, Mixture-of-Experts, and the design philosophies of Claude, Gemini, and DeepSeek — is nagging at you, read on.
The KV Cache Problem and Grouped-Query Attention
Let's go back to TinyChat. You've deployed a LLaMA-2 7B model and it's working well. Customers love it. Your restaurant chain wants to expand the chatbot to handle longer conversations — full booking flows, menu exploration, dietary restriction discussions. You increase the context length from 4K to 32K tokens. And your GPU runs out of memory.
The culprit isn't the model weights — those haven't changed. It's the KV cache. During autoregressive generation, the model stores the key and value vectors for every token it's already processed, so it doesn't have to recompute them when generating the next token. That cache grows linearly with sequence length and with the number of attention heads.
Let's do the arithmetic with real numbers. Standard Multi-Head Attention (MHA) gives every attention head its own query, key, and value projections. For a 70B model (like LLaMA-2 70B) with 64 attention heads, a head dimension of 128, 80 layers, at 32K sequence length in float16:
# KV cache = 2 (keys + values) × num_heads × head_dim × seq_len × layers × bytes
#
# MHA (64 KV heads, the standard):
# 2 × 64 × 128 × 32768 × 80 × 2 bytes
# = ~85 GB
#
# An A100 GPU has 80 GB of memory total.
# The KV cache alone doesn't fit. And we haven't loaded
# the model weights yet (~140 GB in fp16).
#
# For TinyChat, this means: with standard attention,
# you literally cannot serve a 70B model with 32K context
# on anything less than 4 GPUs.
The first attempt to fix this was Multi-Query Attention (MQA): make all 64 query heads share a single set of key-value projections. KV cache drops by 64×. But the quality drop is noticeable — with only one key and one value to attend to, the attention mechanism loses expressiveness.
Grouped-Query Attention (GQA) is the compromise that stuck. Instead of 64 KV heads (MHA) or 1 KV head (MQA), you use a small number of KV head groups. Each group serves multiple query heads:
# Think of it like a restaurant kitchen (our TinyChat theme):
#
# MHA = 64 chefs, each with their own pantry (64 pantries)
# Expensive. Every chef stores their own ingredients.
#
# MQA = 64 chefs, one shared pantry (1 pantry)
# Cheap, but everyone's fighting over the same spice rack.
# Quality drops.
#
# GQA = 64 chefs, 8 shared pantries (8 groups of 8 chefs)
# 8 chefs share one well-stocked pantry. Enough variety,
# way less storage.
#
# In attention terms:
# MHA: Q₁→KV₁ Q₂→KV₂ ... Q₆₄→KV₆₄ (64 KV heads)
# GQA: Q₁ thru Q₈ → KV₁, Q₉ thru Q₁₆ → KV₂, ... (8 KV heads)
# MQA: Q₁ thru Q₆₄ → KV₁ (1 KV head)
#
# LLaMA-2 70B: 64 query heads, 8 KV heads → 8× KV cache reduction
# KV cache drops from 85 GB to ~10.5 GB. Fits on one GPU.
The quality impact of GQA is negligible — within noise of full MHA on every benchmark that's been tested. The KV cache savings are enormous. This is why GQA is now standard: LLaMA-2 70B, LLaMA-3 (all sizes), Mistral-7B (8 KV groups for 32 query heads), and most new models use it.
I still find it slightly counterintuitive that you can share key-value projections across 8 query heads and lose essentially nothing. My working intuition is that the KV projections capture "what information is available at each position," which doesn't need as much diversity as the queries, which capture "what am I looking for." The queries need to be diverse — different heads should search for different patterns. The keys and values can be shared because the underlying information at each position is the same regardless of which query is asking.
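If you want to redo the arithmetic for your own model, here's a small calculator with defaults matching the 70B-class example above. The only knob that changes between MHA, GQA, and MQA is the number of KV heads.

# KV cache size calculator for the arithmetic above (sizes in decimal GB).
def kv_cache_gb(num_kv_heads, head_dim=128, seq_len=32_768,
                num_layers=80, bytes_per_value=2):
    # 2x for keys and values
    total = 2 * num_kv_heads * head_dim * seq_len * num_layers * bytes_per_value
    return total / 1e9

print(kv_cache_gb(64))  # MHA, 64 KV heads: ~85.9
print(kv_cache_gb(8))   # GQA,  8 KV heads: ~10.7 (the "fits on one GPU" case)
print(kv_cache_gb(1))   # MQA,  1 KV head:  ~1.3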
There's one more innovation worth flagging: Multi-head Latent Attention (MLA), introduced by DeepSeek-V2. Instead of sharing KV heads in groups, MLA compresses the entire key-value representation into a low-dimensional latent space before caching it. It's a more aggressive compression that achieves even larger memory savings. MLA is newer and less widely adopted, but it signals the direction the field is heading — finding increasingly creative ways to shrink the KV cache bottleneck.
Mistral's Tricks — Sliding Windows and Mixture-of-Experts
In September 2023, a small French startup called Mistral AI released a 7B model that beat LLaMA-2 13B on every benchmark — at half the parameter count. They did it with two architectural innovations, both aimed at the same goal: doing more with less memory.
Sliding Window Attention
Standard causal attention lets every token attend to all previous tokens. Token 8,000 looks back at tokens 0 through 7,999. That's 8,000 attention entries per token, and it grows linearly (making the total computation quadratic in sequence length).
Mistral-7B's insight: restrict each token to a local window of W = 4,096 previous tokens. Token 8,000 attends to tokens 3,904 through 7,999 and ignores everything before that.
Your immediate objection — mine too, when I first read it — is "doesn't that mean the model forgets everything beyond the window?" The answer is more subtle, and it involves thinking about what happens across layers.
# Layer 1: Token 8000 attends to tokens [3904, 7999]
# Token 5000 attends to tokens [904, 4999]
#
# Layer 2: Token 8000 attends to tokens [3904, 7999]
# BUT those tokens already saw [0, 7999] in layer 1.
# So token 8000 now has INDIRECT access to [0, 7999].
#
# Layer 3: The receptive field widens further...
#
# After L layers: effective receptive field = L × W
# Mistral-7B: 32 layers × 4096 window = 131K tokens
#
# Think of it like binoculars (window) vs. telescope (full attention):
# Binoculars see a limited field, but if you hand what you see
# to the next person up a tower (next layer), and they use
# their binoculars too... the person at the top of a 32-story
# tower has information from the entire horizon.
The information from early tokens reaches late tokens — it propagates upward through the layer stack, getting compressed at each layer. It's lossy compared to full attention, but empirically the quality difference is small. And the memory benefit is dramatic: Mistral pairs sliding window attention with a rolling KV cache — a fixed-size circular buffer. Positions older than the window get overwritten. Memory usage is constant regardless of how long the sequence gets. You never run out of KV cache memory during inference.
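Here's a toy sketch of the two mechanics: the sliding-window causal mask, and the rolling buffer that keeps the cache at a constant size. The window and sequence lengths are toy values.

# Sliding-window causal mask plus rolling KV cache slots (toy sizes).
import numpy as np

def sliding_window_mask(seq_len, window):
    # mask[i, j] is True when token i may attend to token j
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)     # causal AND within the last `window` tokens

print(sliding_window_mask(6, 3).astype(int))
# Each row has at most 3 ones: the token itself plus the 2 before it.

# Rolling KV cache: position p's keys/values live in slot p % window,
# overwriting whatever is more than `window` positions old.
window = 4
print([p % window for p in range(10)])     # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]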
That's the binoculars analogy in action: each layer's window is narrow, but the stack as a whole covers the horizon. Keep it in mind when we compare efficiency tricks at the end of the section.
Mixture-of-Experts (MoE)
Mixture-of-Experts keeps appearing in this chapter — Mixtral, GPT-4 (rumored), Gemini. Let's nail down what it actually is, because it's the key to understanding how a model can have 47 billion parameters but run at the speed of a 13 billion parameter model.
In a standard transformer, every token passes through the same FFN in each layer. The FFN is the biggest parameter hog — it typically accounts for about two-thirds of the total parameters. Mixture-of-Experts replaces that single FFN with N parallel FFNs (the "experts"), and a small router network selects which experts process each token:
# Standard FFN: Every token goes through the same kitchen.
# y = FFN(x)
#
# MoE FFN: Multiple kitchens, each with a specialty.
# A host (router) directs each dish (token) to the right kitchen.
#
# router_scores = softmax(x @ W_router) → scores for each expert
# top_2 = pick the 2 highest-scoring experts
# y = weight₁ × Expert₁(x) + weight₂ × Expert₂(x)
#
# Mixtral 8×7B:
# 8 expert FFNs per layer, 2 active per token
# Total parameters: ~46.7B (all experts counted)
# Active parameters: ~13B per token (only 2 experts fire)
#
# Result: The model has the knowledge of a 47B model
# but the inference speed of a 13B model.
The router learns to specialize experts during training. In analysis of trained MoE models, researchers have found that different experts activate for different types of content — one handles code, another mathematical reasoning, another conversational text. It's not perfectly clean specialization, and I'm still developing my intuition for exactly how the router learns to partition the input space, but the empirical results are consistent.
The main engineering challenge is load balancing. Without intervention, the router tends to collapse — it finds two favorite experts and sends everything to them, leaving the other six undertrained. The fix is an auxiliary loss that penalizes uneven expert usage, encouraging the router to spread the load. There's also a communication overhead during distributed training: if experts live on different GPUs (which they have to, at scale), tokens need to be routed across the network.
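Here's a minimal NumPy sketch of top-2 routing for a single token, with a simplified version of a load-balancing penalty. The expert weights, the router, and the exact penalty formula are illustrative stand-ins, not Mixtral's actual implementation.

# Top-2 expert routing for one token, plus a simplified load-balancing signal.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts = 4, 8
W_router = rng.normal(size=(d_model, num_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]  # stand-ins for FFNs

def moe_ffn(x):
    scores = x @ W_router
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()                    # softmax over experts
    top2 = np.argsort(probs)[-2:]                  # the 2 highest-scoring experts
    weights = probs[top2] / probs[top2].sum()      # renormalize their weights
    out = sum(w * (x @ experts[e]) for w, e in zip(weights, top2))
    return out, probs

x = np.array([2.0, -1.0, 0.5, 1.5])
y, probs = moe_ffn(x)
print(y.shape)  # (4,): same output shape as a dense FFN, but only 2 of 8 experts ran

# Simplified balancing signal: if the router's average probabilities concentrate
# on a few experts this grows; spread evenly across all experts it stays near 1.
batch = [moe_ffn(rng.normal(size=d_model))[1] for _ in range(32)]
avg_probs = np.mean(batch, axis=0)
print(num_experts * np.sum(avg_probs ** 2))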
For TinyChat, MoE models offer an incredible deal: Mixtral 8×7B gives you GPT-3.5-class quality at a fraction of the GPU cost, because only 13B parameters are active per token. The catch is that the full 47B parameters still need to fit in GPU memory — you're saving compute, not memory. But with quantization (covered in a later section), even that becomes manageable.
The Broader Landscape — Claude, Gemini, DeepSeek, Phi
GPT, LLaMA, and Mistral don't exist in a vacuum. Several other model families have introduced distinctive ideas worth knowing, because they show up in interviews, in production decisions, and in the evolution of the field.
Claude (Anthropic) is built on Constitutional AI (CAI). Most LLMs are aligned via RLHF: human labelers rank model outputs, and the model is fine-tuned to match those preferences. Claude's twist is that the model critiques its own outputs against a written set of principles — a "constitution" — and revises them. Then it trains on its own revisions. This self-critique loop reduces the dependence on expensive human labelers and makes the alignment criteria explicit and auditable. Claude 3 (2024) came in three sizes — Haiku (fast, cheap), Sonnet (balanced), Opus (maximum capability) — with 200K token context. The safety-first philosophy means Claude tends to be more cautious, but also more reliable for sensitive applications.
Gemini (Google) is natively multimodal. While GPT-4 was trained primarily on text and had vision capabilities added (likely through a separate vision encoder), Gemini was trained from the ground up on interleaved text, images, audio, and video. The practical difference: Gemini can reason across modalities more fluidly — understanding a diagram in a document, or answering questions about a video's content. Gemini 1.5 Pro pushed the context window to 1 million tokens using a Mixture-of-Experts architecture. Google also released Gemma (2B and 7B) as an open-weight derivative, using the same architectural innovations but optimized for community use.
DeepSeek (China) introduced Multi-head Latent Attention (MLA) in DeepSeek-V2, compressing the KV cache through a low-dimensional latent projection. DeepSeek-V3 combined MLA with MoE for a model that's remarkably efficient at inference. These models demonstrate that architectural innovation is happening globally, not only in US labs.
Phi (Microsoft) challenged the "scale is all you need" thesis head-on. Phi-1 (1.3B parameters) was trained primarily on textbook-quality data and synthetic examples generated by larger models. It matched models 10× its size on coding benchmarks. Phi-2 (2.7B) and Phi-3 (3.8B) continued the thesis: data quality matters more than data quantity. If you're deploying on a phone, a laptop, or at the edge — anywhere compute is limited — Phi demonstrates that a small, well-trained model can be surprisingly capable.
Qwen (Alibaba) and Yi (01.AI) are leading Chinese open-weight families with strong multilingual and coding capabilities. Qwen-2 ships in sizes from 0.5B to 72B and competes with LLaMA-3 class models on many benchmarks.
The thing that strikes me about this landscape is the convergence. Strip away the names and marketing, and Claude, Gemini, LLaMA, Mistral, Qwen, and DeepSeek all share the same engine block: decoder-only transformer, pre-norm with RMSNorm, SwiGLU FFN, RoPE positions. They differ in fuel (training data), tuning (alignment methods), turbochargers (MoE, MLA), and paint job (API vs. open weights). The engine is the same. We'll make that explicit in a moment.
Open vs. Closed — The Great Divide
The LLM world has a fundamental fault line, and if you're building TinyChat (or any production system), understanding it is more important than understanding any single architecture choice.
Closed models — GPT-4, Claude, Gemini — push the frontier of capabilities. You access them through APIs, paying per token. You don't see the weights. You can't fine-tune them (mostly). You can't deploy them on your own infrastructure. You're subject to the provider's usage policies, pricing decisions, and uptime guarantees. If OpenAI changes their content policy, your chatbot's behavior changes overnight.
Open-weight models — LLaMA, Mistral, Qwen, Phi — give you the weights. You can fine-tune them on your data. You can quantize them to run on cheaper hardware. You can deploy them on-premise, in any cloud, on a laptop. You control the model's behavior completely. If your restaurant chain handles health-related dietary restrictions, you can fine-tune specifically for food allergy accuracy in a way that API-based models don't allow.
The capability gap between closed and open has been shrinking rapidly. LLaMA-3 405B approaches GPT-4 on many benchmarks. Mistral's models punch far above their weight. For most production use cases — and I say this after spending too long agonizing over the decision for my own projects — the question isn't "which model has the highest benchmark score?" It's "what are my constraints?" Latency, cost, data privacy, deployment control, and fine-tuning needs often matter more than the last few points on MMLU.
For TinyChat: if you're prototyping, use GPT-4's API. It's the fastest path to a working demo. If you're going to production at scale — hundreds of thousands of restaurant customers generating conversations daily — an open-weight model you own and operate will be dramatically cheaper and more controllable. There's no universal right answer. There's a right answer for your specific constraints.
I still don't have a clean framework for this decision. Every project I've been involved with ends up in a spreadsheet weighing latency, cost, privacy, and quality, and the answer is always "it depends." That ambiguity is honest. Anyone who tells you to always use open models or always use closed APIs is selling something.
Model Cards and Responsible Release
We've been talking about architecture and capabilities, but there's a dimension we haven't addressed: the information that ships alongside a model. This matters more than most technical sections acknowledge.
A model card is a document that accompanies a model release, describing what the model is, how it was trained, what it's good at, what it's bad at, and what risks it carries. The concept was formalized in a 2019 paper by Margaret Mitchell et al., and it's become the standard practice for responsible model releases.
A good model card includes the training data composition (or at least a high-level description), benchmark results with methodology, known limitations and failure modes, intended use cases and out-of-scope uses, bias evaluations, and environmental impact (compute used). LLaMA-3's model card, for example, explicitly lists failure modes around code generation in certain languages, limitations in mathematical reasoning, and the fact that the model may produce culturally biased outputs for non-English languages.
The practice matters for TinyChat because deploying a model without understanding its limitations is how you end up with a restaurant chatbot that confidently tells a customer with a peanut allergy that the pad thai is nut-free (when it isn't). Model cards won't prevent every failure, but they're the minimum baseline for responsible deployment — a checklist that forces you to think about what could go wrong before it does.
The open vs. closed divide shows up here too. Open-weight releases like LLaMA ship detailed model cards. Closed models like GPT-4 publish a "system card" that's substantially less detailed about training data and architecture. Some releases — particularly from smaller labs racing to publish — ship with minimal documentation. The trend is toward more transparency, but we're not there yet.
The Converged Recipe
After all that, let's make the convergence explicit. If you're reading a new model release in 2024 and the paper doesn't call out a specific design choice, it's almost certainly using this default stack:
The 2024 LLM Standard Recipe:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Architecture: Decoder-only transformer
Normalization: Pre-norm with RMSNorm
FFN Activation: SwiGLU (d_ff ≈ 8/3 × d_model)
Position Enc: RoPE (Rotary Position Embeddings)
Attention: GQA (Grouped-Query Attention)
Bias terms: None (removed from attention and FFN)
Vocabulary: 32K–128K BPE tokens
Context window: 8K–128K (RoPE enables extension)
Training: AdamW optimizer, cosine LR schedule
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
When you read a new paper, the deviations from this recipe
are the interesting parts. Everything else is assumed.
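To connect the recipe to something you'd actually see on Hugging Face, here's roughly what it looks like as a config. The field names mirror common Hugging Face conventions; the values describe a hypothetical 7B-class model rather than any specific release, so check the real model card before relying on them.

# Illustrative config for a hypothetical 7B-class model following the recipe.
config = {
    "model_type": "llama",              # decoder-only transformer
    "hidden_size": 4096,                # d_model
    "num_hidden_layers": 32,
    "num_attention_heads": 32,          # query heads
    "num_key_value_heads": 8,           # GQA: 8 KV groups
    "intermediate_size": 11008,         # ~8/3 x hidden_size, rounded
    "hidden_act": "silu",               # the Swish half of SwiGLU
    "rms_norm_eps": 1e-5,               # RMSNorm, applied pre-norm
    "rope_theta": 10000.0,              # RoPE base frequency
    "vocab_size": 32000,
    "max_position_embeddings": 8192,
    "attention_bias": False,            # no bias terms
}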
This recipe is the engine block. The engine is the same across almost every family. LLaMA, Mistral, Qwen, Phi, Gemma — they all run this engine. The differences that matter are:
Training data — How much, how clean, what languages, what domains. This is probably the single biggest differentiator and also the most opaque. Most labs won't tell you exactly what's in the training set.
Scale — More parameters and more tokens generally help, up to a point. But Phi showed that data quality can compensate for scale, and Mistral showed that architectural efficiency can compensate for raw size.
Alignment method — RLHF (OpenAI), Constitutional AI (Anthropic), DPO (used by many open models), or various combinations. This determines how the model behaves, not what it knows.
Efficiency innovations — MoE (Mixtral, Gemini), sliding window attention (Mistral), MLA (DeepSeek). These are the turbochargers — they don't change what the engine does, but they change how fast and cheap it runs.
When you understand this recipe, reading new model papers becomes dramatically faster. You scan for deviations. DeepSeek-V2's MLA replaces GQA — that's interesting, read that section. Mamba's state-space layers replace the entire attention mechanism — that's a major departure. Jamba mixes Mamba and transformer layers in one model — hybrid approach. The recipe is your baseline. Deviations from it are the signal.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started by understanding why decoder-only transformers won the architecture wars — simplicity, scaling friendliness, and the KV cache advantage. We traced the GPT lineage from a 117M-parameter proof of concept to GPT-4's rumored trillion-parameter MoE, watching each generation remove a constraint by adding scale. We lived through the LLaMA revolution that brought frontier models to everyone, and we opened the hood to understand the components that make them tick — RMSNorm's simplified normalization, SwiGLU's learned gating, RoPE's rotational position encoding. We confronted the KV cache bottleneck and saw how GQA solves it. We watched Mistral fit full attention quality into a sliding window, and we decoded how Mixture-of-Experts gives large-model knowledge at small-model speed. And we surveyed the broader landscape — Claude's constitutional approach, Gemini's native multimodality, DeepSeek's latent attention, Phi's data-quality bet — and recognized that beneath the different paint jobs, they all run the same engine.
My hope is that the next time you see a new model release on Hugging Face or hear about the latest LLM in a meeting, instead of nodding along and pretending you understand the differences, you'll be able to read the model card, identify the deviations from the standard recipe, and form a genuine opinion about whether this model is interesting for your use case — having a pretty clear mental model of what's actually going on under the hood.
Resources
"Attention Is All You Need" (Vaswani et al., 2017) — The O.G. transformer paper. Everything in this section descends from it.
"LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023) — The paper that launched the open-source LLM revolution. Wildly influential.
"RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021) — The RoPE paper. The math is more approachable than you'd expect.
"GLU Variants Improve Transformer" (Shazeer, 2020) — Where SwiGLU comes from. A short, empirical paper that changed every model's FFN.
"GQA: Training Generalized Multi-Query Attention" (Ainslie et al., 2023) — The GQA paper. Clear, concise, and the KV cache analysis is insightful.
"Model Cards for Model Reporting" (Mitchell et al., 2019) — The paper that formalized responsible model documentation. Still essential reading.