BERT & GPT: The Two Paradigms

Chapter 10: Sequence Models & Attention
Encoder vs decoder · MLM vs next-token · When to use which

I avoided understanding the real difference between BERT and GPT for longer than I'd like to admit. I knew the surface-level version — BERT fills in blanks, GPT predicts the next word, something something bidirectional. Every time I saw a comparison table, I'd skim it, nod, and move on. But when an interviewer asked me why the masking strategy creates fundamentally different models, I froze. The discomfort of not really knowing what was happening under the hood finally grew too great to ignore, so I took the deep dive. This chapter is that dive.

BERT and GPT were both published in 2018, both built on the Transformer architecture from the 2017 "Attention Is All You Need" paper. Same engine, wildly different vehicles. BERT became the workhorse for understanding tasks — classification, named entity recognition, question answering. GPT became the engine behind text generation, chatbots, and eventually the entire large language model revolution. Together, they define the two paradigms of modern NLP.

Before we start, a heads-up. We're going to walk through attention masks, training objectives, scaling laws, and a fair amount of architecture reasoning. You don't need to know any of it beforehand. We'll build every concept from scratch, one piece at a time, with explanation.

This isn't a short journey, but I hope you'll be glad you came.

Contents

  The fork in the road
  A tiny language for a big idea
  BERT: the fill-in-the-blank reader
  Why masking needs the 80/10/10 trick
  Next Sentence Prediction — and why it failed
  Dressing up the input: [CLS], [SEP], and three embeddings
  Fine-tuning: swapping the head
  Rest stop
  GPT: the storyteller who can't peek ahead
  Next-token prediction: the deceptively powerful objective
  The GPT scaling story: 117M to ??? parameters
  In-context learning: the prompt is the program
  Emergent abilities — real or mirage?
  Chinchilla: the data diet breakthrough
  Rest stop
  The BERT family tree: RoBERTa, ALBERT, DeBERTa
  Encoder vs. decoder: the paradigm comparison
  The hybrid workflow
  Wrap-up
  Resources

The Fork in the Road

Both BERT and GPT start with the same Transformer self-attention mechanism. Query, key, value — the whole machinery we built in the previous section. The divergence comes from one design decision that sounds almost trivial but changes everything: which tokens can attend to which?

Let every token see every other token — past and future, left and right — and you get BERT. Restrict each token to seeing only the tokens that came before it, and you get GPT. That's it. That's the fork. Everything else — training objectives, downstream uses, scaling behavior, the entire generation-vs-understanding divide — flows from that one choice about the attention mask.

I'll be honest: the first time someone told me this, I didn't believe such a small structural change could produce such different models. It felt like being told that the difference between a detective and a novelist is whether they read case files forwards or backwards. But the analogy turns out to be more apt than I expected. Let's see why.

A Tiny Language for a Big Idea

We need a concrete example to carry through this entire section. Let's build one.

Imagine we're building a movie review classifier. Our entire training corpus consists of five reviews:

Review 1: "The movie was dark but the ending was bright"     → mixed
Review 2: "The acting felt flat and lifeless"                 → negative
Review 3: "A bright and moving performance"                   → positive
Review 4: "The dark theater made the movie feel immersive"    → positive
Review 5: "Flat acting ruined an otherwise bright script"     → negative

Notice the word "dark" in reviews 1 and 4. In review 1, "dark" describes the movie's tone — probably negative. In review 4, "dark" describes the theater — neutral, maybe even positive in context. The word "bright" pulls the same trick: in review 1 it could be the ending's literal look or its hopeful tone, while in reviews 3 and 5 it's clearly a quality judgment. And "flat" — is it about the acting (bad) or could it be about something else entirely?

A model that reads these reviews needs to figure out what each word means in context. The question is: how much context does it get to see?

That question is exactly where BERT and GPT diverge. Let's start with the one that gets to see everything.

BERT: The Fill-in-the-Blank Reader

Think about how you'd study for a reading comprehension exam. You'd read the entire passage first — beginning, middle, end — and then answer questions about it. You wouldn't cover up the second half of the passage and try to answer questions using only the first half. That would be needlessly handicapping yourself.

BERT works the same way. It's an encoder-only Transformer, which means it takes the decoder from the original Transformer architecture and throws it away. No causal mask, no left-to-right restriction. Every token in the input attends to every other token — both left context and right context, simultaneously, at every layer.

Back to our movie reviews. When BERT processes the word "dark" in "The movie was dark but the ending was bright," it sees "movie" to the left and "ending" and "bright" to the right. By layer 2 or 3, the representation of "dark" has absorbed information from the entire sentence and settled into a region of embedding space that means "negative tone." When it processes "dark" in "The dark theater made the movie feel immersive," it sees "theater" to the right, and the representation shifts to mean "physically dark — not a value judgment." Same word, completely different internal representations, because BERT reads the whole sentence before making any decisions.
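You can watch this happen with a few lines of Hugging Face code. Here's a minimal sketch, assuming the bert-base-uncased checkpoint (the same one we'll fine-tune later); the helper name is mine, and it assumes the target word tokenizes to a single WordPiece, which "dark" does:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    """Return the final-layer hidden state for `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = inputs["input_ids"][0].tolist().index(word_id)
    return hidden[position]

dark_tone = contextual_vector("The movie was dark but the ending was bright", "dark")
dark_room = contextual_vector("The dark theater made the movie feel immersive", "dark")

# Same word, two different vectors: cosine similarity well below 1.0
print(torch.cosine_similarity(dark_tone, dark_room, dim=0).item())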

This is the defining advantage of bidirectional context. But it creates a training problem that took a clever trick to solve.

Why Masking Needs the 80/10/10 Trick

Here's the paradox. If BERT can see every token in the input, and you ask it to predict a specific token... it can cheat. It can look directly at the answer. There's no challenge, and the model learns nothing.

The solution is what the original paper called Masked Language Modeling, or MLM. Take a sentence, hide some of the words, and ask the model to figure out what's behind the curtain. It's a fill-in-the-blank test, scaled to billions of sentences.

Let's trace through exactly what happens with one of our reviews. Take "The movie was dark but the ending was bright." BERT randomly selects about 15% of the tokens — say it picks "dark" and "ending." Now, here's where it gets subtle. If BERT always replaced selected tokens with a special [MASK] symbol, we'd create a mismatch. During pre-training, the model sees [MASK] everywhere. During fine-tuning and real inference, there are no [MASK] tokens. The model might learn to activate its prediction machinery only when it spots that special symbol.

The fix is the 80/10/10 strategy. Of the 15% of tokens selected for prediction, 80% get replaced with [MASK] — the straightforward case. 10% get replaced with a random word from the vocabulary — maybe "dark" becomes "purple" or "seventeen." And the remaining 10% are left completely unchanged — "dark" stays "dark," but the model still has to predict it.

Let me walk through why each piece matters, using our review. The original sentence is:

Original: "The movie was dark but the ending was bright"

Selected tokens: "dark" (position 4), "ending" (position 7)

80% case → "The movie was [MASK] but the [MASK] was bright"
10% case → "The movie was purple but the ending was bright"
10% case → "The movie was dark but the ending was bright"  (unchanged!)

The random replacement ("purple") forces BERT to maintain accurate representations for every position, not only positions where it sees the [MASK] flag. It can never be certain which tokens have been tampered with, so it has to keep an honest internal model of the whole sentence. The unchanged case prevents a different shortcut — without it, the model could learn "if the input looks normal, don't bother predicting anything." With the unchanged case, even a perfectly normal-looking sentence might contain tokens that need to be predicted. BERT has to stay vigilant everywhere.

The loss function only counts the selected tokens — BERT isn't penalized for the 85% it wasn't asked about. But the 80/10/10 mixture ensures it can't rely on any cheap signals.

import torch

def apply_mlm_masking(token_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """The 80/10/10 strategy that makes BERT's training work."""
    labels = token_ids.clone()
    masked_input = token_ids.clone()

    # Select ~15% of positions for prediction
    selected = torch.bernoulli(
        torch.full(token_ids.shape, mask_prob)
    ).bool()

    # Only compute loss on selected positions
    labels[~selected] = -100

    # Within selected positions: 80% become [MASK]
    mask_indices = (
        torch.bernoulli(torch.full(token_ids.shape, 0.8)).bool()
        & selected
    )
    masked_input[mask_indices] = mask_token_id

    # 10% become a random token (the remaining selected, split 50/50)
    random_indices = (
        torch.bernoulli(torch.full(token_ids.shape, 0.5)).bool()
        & selected & ~mask_indices
    )
    masked_input[random_indices] = torch.randint(
        0, vocab_size, token_ids.shape
    )[random_indices]

    # The last 10% stay unchanged — already correct in masked_input
    return masked_input, labels

That code is the entire masking pipeline. The -100 label tells PyTorch's cross-entropy loss to ignore those positions, so the model is only trained on the tokens it was asked to predict.
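A toy run makes the output format concrete. The ids below are made up purely for illustration (a vocabulary of 10 tokens, with id 9 standing in for [MASK]):

torch.manual_seed(0)
ids = torch.tensor([[2, 5, 7, 3, 8, 4, 1, 6]])   # made-up token ids
masked, labels = apply_mlm_masking(ids, vocab_size=10, mask_token_id=9)
print(masked)   # a few positions become 9 ([MASK]) or a random id
print(labels)   # -100 everywhere except the selected positions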

I'll be honest — when I first read the BERT paper, I glossed over the 80/10/10 split as a minor implementation detail. It took me a while to appreciate that without it, the whole pre-training scheme falls apart. The details that seem small often carry the most weight.

Next Sentence Prediction — and Why It Failed

BERT's second pre-training objective was called Next Sentence Prediction, or NSP. The idea: take two text segments A and B, and train the model to predict whether B actually follows A in the original document, or whether B is a random sentence pulled from somewhere else. 50% of the time it's a real pair, 50% it's a random pair. The [CLS] token's representation gets fed to a binary classifier: IsNext or NotNext.

The motivation was reasonable — understanding the relationship between sentences matters for tasks like question answering ("Does this paragraph actually answer this question?") and natural language inference ("Does sentence B follow from sentence A?").

It turned out to be mostly useless.

RoBERTa, published in 2019, ran a careful ablation study: they trained BERT with and without NSP, and the version without NSP performed better on downstream tasks. The problem was that distinguishing a real next sentence from a random sentence is too easy. A random sentence typically comes from a completely different topic — maybe one sentence is about cooking and the other about astrophysics. The model ends up learning shallow topic matching ("these two sentences are both about food, so IsNext") rather than genuine discourse reasoning.

I spent a week implementing NSP carefully before reading the RoBERTa paper and learning it had been dropped. That sting is why I remember the lesson so well: not every idea in a landmark paper turns out to be essential. ALBERT replaced NSP with Sentence Order Prediction (SOP) — predicting whether sentences A and B are in the correct order rather than whether B is random. SOP is harder and more useful, because both sentences come from the same document, so the model can't fall back on topic matching. DeBERTa dropped sentence-level objectives entirely. The field moved on.
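To make the contrast concrete, here is a minimal sketch of how training pairs for the two objectives could be constructed. The function names are mine, not from either paper, and real pipelines do this at the segment level rather than single sentences:

import random

def nsp_pair(doc, corpus):
    """NSP: is B the real next sentence, or a random one from the corpus?"""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], "IsNext"
    return doc[i], random.choice(corpus), "NotNext"   # often a topic giveaway

def sop_pair(doc):
    """SOP: both sentences are adjacent; only their order is in question."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], "InOrder"
    return doc[i + 1], doc[i], "Swapped"              # same topic, harder task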

Dressing Up the Input: [CLS], [SEP], and Three Embeddings

Before BERT can read our movie review, we need to package it in a specific format. There are several pieces that work together, and each one exists for a reason.

WordPiece tokenization breaks text into subword units. BERT doesn't operate on whole words. The word "lifeless" from our review might become ["life", "##less"], where ## means "this piece continues the previous token." This gives BERT a fixed vocabulary of about 30,000 tokens that can represent any English word — including words it never saw during training — by composing them from known pieces. Think of it like Lego bricks: you don't need a unique brick for every possible shape if you have enough small pieces that snap together.

Special tokens frame the input. Every sequence starts with [CLS], a classification token whose final hidden state becomes the aggregate representation for the entire sequence. Sentences are separated by [SEP]. For classifying a single review: [CLS] The movie was dark [SEP]. For comparing two sentences: [CLS] Sentence A [SEP] Sentence B [SEP].

Why does [CLS] work as a summary? Because self-attention lets it attend to every other token. Through 12 layers of attention, information from the entire sequence flows into that one position. The model is trained (via NSP and later via fine-tuning) to make the [CLS] representation useful for classification decisions. It's a learned aggregation point — not a magic token, but a designated spot where the model learns to gather what matters.

The final input to BERT is the sum of three separate embeddings:

# Concrete example: classifying "The acting felt flat"
# after tokenization and special token insertion

tokens    = ["[CLS]", "The", "acting", "felt", "flat", "[SEP]"]
token_ids = [  101,   1996,   3772,   2371,  4257,   102 ]
segment   = [    0,      0,      0,      0,     0,     0 ]   # all segment A
position  = [    0,      1,      2,      3,     4,     5 ]

# What BERT actually receives:
# input = token_embedding[101]  + segment_embedding[0] + position_embedding[0]
#       + token_embedding[1996] + segment_embedding[0] + position_embedding[1]
#       + ... and so on for each position

Token embeddings are the learned vectors for each WordPiece token — the content of what's at each position. Segment embeddings tell the model which sentence a token belongs to — sentence A gets embedding 0, sentence B gets embedding 1. These matter for tasks that compare two texts. Position embeddings encode where each token sits in the sequence. BERT uses learned (not sinusoidal) position embeddings for positions 0 through 511, which means it can handle sequences up to 512 tokens.

All three are added element-wise — they live in the same 768-dimensional space, and the model learns to disentangle the signals during training.
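The sum itself is just three embedding lookups added together. A minimal sketch with freshly initialized weights, using bert-base dimensions:

import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768   # bert-base-uncased sizes
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)          # sentence A vs. sentence B
pos_emb = nn.Embedding(max_len, hidden)    # learned, not sinusoidal

token_ids = torch.tensor([[101, 1996, 3772, 2371, 4257, 102]])
segments  = torch.zeros_like(token_ids)                     # all segment A
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # 0..5

# Element-wise sum: all three signals live in the same 768-dim space
x = tok_emb(token_ids) + seg_emb(segments) + pos_emb(positions)
print(x.shape)   # torch.Size([1, 6, 768])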

Fine-Tuning: Swapping the Head

Pre-training teaches BERT general language understanding. Fine-tuning adapts it to your specific task. The recipe: take the pre-trained model, attach a small task-specific layer on top, and train on your labeled data for a few epochs.

Let's fine-tune BERT on our movie review classification task. We need to go from the [CLS] token's 768-dimensional hidden state to one of three labels: positive, negative, or mixed. That's a single linear layer — 768 inputs, 3 outputs — followed by softmax.

from transformers import BertForSequenceClassification, BertTokenizer
from transformers import Trainer, TrainingArguments

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3   # positive, negative, mixed
)

training_args = TrainingArguments(
    output_dir="./review_classifier",
    num_train_epochs=3,            # more than 3 and you risk overfitting
    per_device_train_batch_size=16,
    learning_rate=2e-5,            # the critical number
    weight_decay=0.01,
    warmup_steps=100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
trainer.train()

The learning rate of 2e-5 deserves a moment. This number isn't arbitrary — it's the result of thousands of practitioners independently converging on the same narrow range. Go higher (say 1e-4) and you catastrophically overwrite the pre-trained weights. The model "forgets" everything it learned about language during pre-training. Go lower (say 1e-6) and the model barely adapts to your task at all. The sweet spot of 2e-5 to 5e-5 threads the needle: large enough to learn your task, small enough to preserve the pre-trained knowledge. Three epochs is typically sufficient. Beyond that, you start memorizing your training data instead of generalizing.

The beauty of this setup is its versatility. For a different task — say, named entity recognition (finding person and organization names in text) — you swap the head. Instead of one classification per sequence, you add a linear layer that classifies each token individually: is this token a person name, an organization, a location, or none of the above? The pre-trained body stays the same. For extractive question answering, the head predicts the start and end positions of the answer span in the passage. Same pre-trained weights, different head, different task.
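In code, swapping the head is a one-line change: the transformers library ships a head class per task. A sketch (the num_labels value here is illustrative):

from transformers import (
    BertForTokenClassification,   # per-token labels, e.g. NER
    BertForQuestionAnswering,     # answer start/end positions
)

# Same pre-trained body, different heads
ner_model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9   # e.g. BIO tags for PER/ORG/LOC/MISC
)
qa_model = BertForQuestionAnswering.from_pretrained(
    "bert-base-uncased"   # head outputs two logits per token: start and end
)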

This pattern — one expensive pre-training, many cheap fine-tunings — is what made BERT so transformative. Train the base model once at enormous cost, then fine-tune it for your specific task with a few hundred labeled examples and a GPU for an hour.

🛑 Rest Stop

Congratulations on making it this far. You can stop if you want.

You now have a working mental model of BERT: an encoder-only Transformer that sees all tokens bidirectionally, trained by predicting masked tokens with the 80/10/10 trick, then fine-tuned by attaching a task-specific head. You understand why [CLS] collects sequence-level information, how the three embeddings (token, segment, position) combine, and why the learning rate during fine-tuning is so critical.

That model is useful. It covers about half the landscape. But it doesn't tell the story of the other paradigm — the one that powers ChatGPT, code generation, and the entire large language model revolution.

If the discomfort of not knowing what's on the other side of the fork is nagging at you, read on.

GPT: The Storyteller Who Can't Peek Ahead

Imagine writing a novel. You start with the first word and build the sentence forward, one word at a time. You can look back at everything you've already written — the characters you've introduced, the tone you've established, the plot threads you've set up. But you can't look ahead. The next word depends entirely on what came before.

GPT works exactly like that. It's a decoder-only Transformer with a causal attention mask. When processing the 5th token, it can attend to tokens 1 through 4. Token 6 doesn't exist yet — it hasn't been written. This constraint is enforced by a triangular mask that sets all future-position attention scores to negative infinity before the softmax, guaranteeing they receive zero attention weight.

Let's trace this through our movie review example. Consider the review "The acting felt flat and lifeless." When GPT processes the word "flat" at position 4, it can attend to "The" (position 1), "acting" (position 2), and "felt" (position 3). It does not get to see "and" or "lifeless" — those come later. Compare this to BERT, which sees the entire sentence simultaneously. When BERT processes "flat," it already knows about "lifeless" to the right, giving it a much richer context for understanding what "flat" means here.

So why would you ever choose GPT's restricted view? Because generation is inherently sequential. When you're producing text, the words after the cursor don't exist yet. There's nothing to peek at. GPT's training mirrors its inference process — at both training and generation time, each token is predicted from only the preceding tokens. This alignment between how the model learns and how it's used is one of GPT's deepest strengths.

import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    """The core of GPT: self-attention that can't see the future."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)

    # The causal mask: ones strictly above the diagonal mark the future
    seq_len = scores.size(-1)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float('-inf'))

    weights = F.softmax(scores, dim=-1)
    return weights @ V

# For a 4-token sequence, the attention pattern looks like:
# Token 1 sees: [1, -, -, -]        (only itself)
# Token 2 sees: [1, 2, -, -]        (itself and token 1)
# Token 3 sees: [1, 2, 3, -]        (first three tokens)
# Token 4 sees: [1, 2, 3, 4]        (everything so far)

That torch.triu call (with diagonal=1) builds a strictly upper-triangular matrix of ones, one entry for every future position, and masked_fill replaces those positions with negative infinity. After softmax, negative infinity becomes zero. Future tokens are invisible. That's the entire mechanism that separates GPT from BERT — one line of code.
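At inference time, that mask is why generation is a loop. Here's a minimal greedy-decoding sketch using the publicly available gpt2 checkpoint — small and old, but the mechanics are identical to its larger descendants:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tok("The movie was dark but", return_tensors="pt").input_ids
for _ in range(5):                       # one forward pass per new token
    with torch.no_grad():
        logits = model(ids).logits       # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()     # greedy: take the most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))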

Next-Token Prediction: The Deceptively Powerful Objective

GPT's training objective is startlingly bare. Given a sequence of tokens, predict the next one. That's it. No masking strategy, no auxiliary objectives, no sentence pairs. For every position in every training sequence, compute the cross-entropy loss between the model's predicted distribution and the actual next token.

Formally: given tokens [t₁, t₂, ..., tₙ], minimize the negative log-likelihood:

L = −Σᵢ log P(tᵢ | t₁, t₂, ..., tᵢ₋₁)

Let me trace through what this means for one of our reviews. Take "The acting felt flat and lifeless." During training, GPT tries to predict each token from its predecessors:

Given ["The"]                          → predict "acting"
Given ["The", "acting"]                → predict "felt"
Given ["The", "acting", "felt"]        → predict "flat"
Given ["The", "acting", "felt", "flat"] → predict "and"
...and so on

At each step, the model produces a probability distribution over its entire vocabulary (say, 50,000 tokens). The loss penalizes it for assigning low probability to the correct next token. Over billions of training examples, the model learns grammar, facts, reasoning patterns, writing styles — anything that helps predict what comes next.
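In code, the whole objective is one cross-entropy with a shift: the distribution produced at position i is scored against the token at position i + 1. A sketch, with random logits standing in for a real model's output:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 50000, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)   # stand-in for model output

# Shift by one: logits at position i predict the token at position i + 1
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),   # predictions for all but last
    token_ids[:, 1:].reshape(-1),             # the actual next tokens
)
print(loss)   # average negative log-likelihood per predicted token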

The insight that makes this work is subtle: predicting the next token in a truly diverse corpus requires understanding almost everything about language. To predict the next word in a legal brief, you need legal reasoning. To predict the next token in Python code, you need programming logic. To predict the next word in a story, you need narrative structure. The objective is breathtakingly simple. The capability required to do it well is not.

My favorite thing about next-token prediction is that, aside from high-level intuitions like the one I gave, no one is completely certain why it works so well. There's a gap between "predicting the next word should help with language understanding" (plausible) and "predicting the next word produces a system that can pass the bar exam" (astonishing). We're still closing that gap.

The GPT Scaling Story: 117M to ??? Parameters

GPT's history is a story about what happens when you take a straightforward idea and scale it relentlessly. Each generation didn't do old things better — it unlocked capabilities that smaller models could not do at all.

GPT-1 arrived in June 2018 with 117 million parameters, trained on BookCorpus — about 7,000 unpublished books. Its contribution wasn't the model architecture (a 12-layer Transformer decoder, nothing exotic). The contribution was the paradigm: pre-train on a large text corpus with next-token prediction, then fine-tune on your downstream task. This was the same recipe BERT would use a few months later. At 117M parameters, GPT-1 was a proof of concept. It needed fine-tuning for every task, same as BERT.

GPT-2 appeared in February 2019, scaled up to 1.5 billion parameters — about 13× larger — and trained on 40GB of internet text. OpenAI initially withheld the full model, claiming it could generate misleading text too convincingly. The real technical bombshell: GPT-2 could perform tasks without any fine-tuning. Phrase your input as a question, and it answers. Give it a few translation examples, and it translates. No weight updates, no gradient descent — the prompt alone is enough. This was the first concrete evidence that scale unlocks qualitatively new behaviors. A 117M-parameter model can't do this no matter how you prompt it. A 1.5B-parameter model can, even though nobody trained it to.

GPT-3 landed in June 2020 with 175 billion parameters — 100× larger than GPT-2 — trained on 300 billion tokens. This is the model that changed the industry. GPT-3 demonstrated few-shot learning: put a handful of input-output examples in the prompt, and the model generalizes to new inputs. No weight updates. Give it three sentiment-labeled reviews, and it correctly classifies the fourth. Give it five English-to-French translations, and it translates the sixth. The cost of training: an estimated $4.6 million.

GPT-4 arrived in March 2023. OpenAI stopped publishing architecture details, but the capabilities spoke for themselves: it processes images alongside text, passes the bar exam in the 90th percentile, writes sophisticated code, and follows complex multi-step instructions. There are credible reports it uses a mixture-of-experts architecture — multiple specialized sub-networks that activate selectively — though OpenAI hasn't confirmed this.

The pattern across these four generations is what matters most. GPT-1 proved pre-training works. GPT-2 proved zero-shot works. GPT-3 proved few-shot works. GPT-4 proved multimodal reasoning works. Each generation crossed a threshold that smaller models couldn't reach regardless of clever prompting or engineering.

In-Context Learning: The Prompt Is the Program

This is the capability that turned GPT from an interesting research result into a platform. You don't fine-tune the model. You don't update any weights. You write instructions and examples directly in the prompt, and the model follows them.

Let's return to our movie reviews. Here are three modes of prompting, each giving the model a different amount of guidance:

Zero-shot — describe the task, provide no examples:

Classify this movie review as positive, negative, or mixed.

Review: "The movie was dark but the ending was bright"
Sentiment:

The model has to figure out the task format and the answer from the description alone. With GPT-3 and larger models, this often works.

Few-shot — provide a handful of labeled examples first:

Classify each movie review.

"The acting felt flat and lifeless" → negative
"A bright and moving performance" → positive
"Flat acting ruined an otherwise bright script" → negative

"The movie was dark but the ending was bright" →

The model sees the pattern — reviews followed by labels — and continues it. This works remarkably well, even for tasks the model was never explicitly trained on.

Many-shot — with modern context windows stretching to 100K+ tokens, you can include dozens or hundreds of examples. Performance keeps climbing with more examples, sometimes approaching the accuracy of a fine-tuned BERT model.
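Since the prompt is literally the program here, assembling it is plain string manipulation. A toy helper, entirely illustrative:

def few_shot_prompt(examples, query):
    """Build a few-shot classification prompt from labeled examples."""
    lines = ["Classify each movie review.", ""]
    for text, label in examples:
        lines.append(f'"{text}" → {label}')
    lines.extend(["", f'"{query}" →'])
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("The acting felt flat and lifeless", "negative"),
     ("A bright and moving performance", "positive")],
    "The movie was dark but the ending was bright",
)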

Why does in-context learning work at all? There are two competing theories, and I find both compelling. The first is implicit gradient descent: research has shown that Transformer attention layers can implement something mathematically similar to a single step of gradient descent in their forward pass. The examples in the prompt act like a tiny training set, and the model internally "fits" to them — not by changing weights, but by adjusting its activations. The second theory is task recognition: the model saw thousands of task formats during pre-training (Q&A pairs, classification examples, translation pairs), and few-shot examples don't teach it anything new. They tell the model which of its existing behaviors to activate.

Both theories have experimental support. I'm still developing my intuition for which explanation is closer to the truth — or whether the distinction matters at all. In practice, the result is the same: one model, infinite tasks, no retraining.

Emergent Abilities — Real or Mirage?

Some abilities seem to appear from nowhere as models get bigger. A 1-billion-parameter model can't do three-digit addition. A 10-billion-parameter model still can't. A 100-billion-parameter model suddenly can. This isn't gradual improvement — it looks like a phase transition, like water suddenly turning to ice.

Chain-of-thought reasoning is a vivid example. Append "Let's think step by step" to a math problem, and large models start decomposing the problem into intermediate steps and arriving at the correct answer. Small models given the same prompt produce plausible-looking but wrong reasoning chains. The ability to decompose problems into steps wasn't an explicit training objective. It was present in the training data — humans write step-by-step explanations — and the model learned to leverage it, but only at sufficient scale.

There's a genuine debate about whether this is real. Schaeffer, Miranda, and Koyejo published a provocative paper in 2023 titled "Are Emergent Abilities of Large Language Models a Mirage?" Their argument: the apparent phase transition is an artifact of how we measure. If you use binary exact-match accuracy (right or wrong, no partial credit), you see a sharp jump. Switch to a continuous metric like the token-level log-likelihood, and the improvement is smooth and gradual all along. The model was getting better continuously — the all-or-nothing metric couldn't detect it until a threshold was crossed.

This doesn't fully settle things. Some capabilities — like in-context learning itself — do seem genuinely absent at small scales. A 117M-parameter model doesn't do few-shot learning badly; it doesn't do it at all. But Schaeffer's work is a critical reminder: how you measure determines what you see. I keep this in mind every time I read a claim about "breakthrough" capabilities at a new scale.

Chinchilla: The Data Diet Breakthrough

For years, the scaling strategy was straightforward: make the model bigger. More parameters, better performance. GPT-3 was 175 billion parameters trained on 300 billion tokens. Most of the compute budget went into parameters.

In 2022, DeepMind's Chinchilla paper upended this. The key finding: you should scale data and parameters together. Specifically, the optimal ratio is roughly 20 training tokens per parameter. By that rule, GPT-3's 175B parameters needed about 3.5 trillion tokens, not 300 billion. GPT-3 was massively undertrained.
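The arithmetic is simple enough to check in two lines, which is part of why the rule of thumb spread so fast:

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

print(chinchilla_tokens(175e9) / 1e12)   # GPT-3: ≈ 3.5 trillion tokens needed
print(300e9 / 1e12)                      # ...versus the 0.3 trillion it saw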

DeepMind proved the point by training Chinchilla — 70 billion parameters on 1.4 trillion tokens. It outperformed Gopher, DeepMind's own 280-billion-parameter model trained on 300 billion tokens. A model 4× smaller won because it saw 4.7× more data. The total training compute was the same — they allocated it differently.

Think of it like exercise and nutrition. For years, the field was going to the gym more and more (bigger models) while eating the same small diet (same amount of data). Chinchilla showed that you get better results by proportionally increasing both. You can't out-train a bad diet, and you can't out-data a tiny model. Balance matters.

This insight echoes what happened with BERT. Remember RoBERTa? Same architecture, 10× more data, much better results. The lesson repeats: many "architectural breakthroughs" are actually training discipline breakthroughs. How much data you feed the model, and for how long, matters as much as how many parameters it has.

🛑 Rest Stop

You can stop here if you'd like. You now have both halves of the picture: BERT, the bidirectional encoder trained with masked language modeling, and GPT, the autoregressive decoder trained with next-token prediction. You understand the GPT scaling story — how each generation unlocked new capabilities — and the Chinchilla insight about balancing data and parameters.

The short version of what's ahead: BERT spawned a family of improved variants (RoBERTa, ALBERT, DeBERTa), the paradigm comparison has practical implications for which model to use when, and there's a hybrid workflow that gets you the best of both worlds. That's the summary — you're about 75% of the way there.

But if you want the full picture — including the architectural tricks that made BERT's descendants stronger than the original, and a concrete framework for choosing between the paradigms — read on.

The BERT Family Tree: RoBERTa, ALBERT, DeBERTa

The original BERT paper left performance on the table. Not because the architecture was wrong, but because the training was incomplete. Several teams figured this out and published improvements. Let me walk through the three most important ones, because each teaches a different lesson.

RoBERTa (Robustly Optimized BERT Pretraining Approach, 2019) changed nothing about the architecture. Zero architectural modifications. It trained the same model with better discipline: 10× more training data, removed NSP (as we discussed), used dynamic masking (re-randomize which tokens get masked each epoch, instead of fixing the mask pattern during preprocessing), larger batch sizes, and trained for longer. The result: massive improvements across every benchmark.

The lesson from RoBERTa is humbling. The original BERT wasn't trained hard enough. The architecture was fine — the training recipe was incomplete. This pattern repeats throughout deep learning: before you redesign the architecture, try training the existing one properly.

ALBERT (A Lite BERT, 2019) went in the opposite direction — it focused on making BERT smaller and more efficient. Two key ideas. First, cross-layer parameter sharing: instead of each of the 12 Transformer layers having its own independent set of weights, all layers share the same weights. The model applies the same transformation 12 times. This reduces the parameter count dramatically (from 110M to as low as 12M) with surprisingly modest performance loss. Second, factorized embedding parameterization: instead of one large embedding matrix that maps vocabulary tokens directly to the hidden dimension (30,000 × 768 = 23M parameters), ALBERT first maps to a smaller intermediate dimension (30,000 × 128) and then projects up (128 × 768). This decouples vocabulary size from hidden dimension, saving millions of parameters. ALBERT also replaced NSP with Sentence Order Prediction, which as we saw is a harder and more useful training signal.
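The savings from factorization are easy to verify with back-of-the-envelope arithmetic, using the numbers above:

V, E, H = 30000, 128, 768          # vocab, intermediate dim, hidden dim

direct     = V * H                 # 23,040,000 params: vocab → hidden
factorized = V * E + E * H         # 3,938,304 params: vocab → 128 → hidden

print(f"{direct:,} vs {factorized:,}")   # roughly a 6x reduction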

DeBERTa (Decoding-enhanced BERT with Disentangled Attention, 2020) made the most significant architectural change and produced the strongest results. The core innovation is disentangled attention: instead of computing attention scores from a single combined embedding (which mixes content and position information), DeBERTa keeps content and position separate.

In standard BERT, the attention score between tokens i and j is computed from embeddings that already combine "what the token is" with "where the token is." DeBERTa untangles these two signals. The attention score becomes a sum of three components: content-to-content (what is token i, and what is token j?), content-to-position (what is token i, and where is token j?), and position-to-content (where is token i, and what is token j?). Each component uses separate learned projections.

I haven't figured out a great way to visualize disentangled attention intuitively, but here's a crude attempt. In our movie review "The acting felt flat," standard attention computes: "how relevant is 'flat' to 'acting'?" as a single score mixing the identity of both words with their positions. DeBERTa asks three separate questions: "Is the concept 'flat' relevant to the concept 'acting'?" (content-to-content), "Is the concept 'flat' relevant to whatever is at position 2?" (content-to-position), and "Is whatever is at position 4 relevant to the concept 'acting'?" (position-to-content). These three scores are summed. The model can learn, for example, that adjectives at certain relative positions are more relevant than adjectives far away — separately from learning that "flat" is semantically relevant to "acting."
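In the same spirit, here's a deliberately simplified sketch of the three-way score decomposition. Real DeBERTa uses relative-position embeddings with distance bucketing and shares projections in specific ways; this toy version only shows how the three terms combine:

import torch

d, n = 64, 4                          # head dim, sequence length
content  = torch.randn(n, d)          # "what each token is"
position = torch.randn(n, d)          # "where each token is" (simplified)

Wq_c, Wk_c = torch.randn(d, d), torch.randn(d, d)   # content projections
Wq_p, Wk_p = torch.randn(d, d), torch.randn(d, d)   # position projections

c2c = (content @ Wq_c) @ (content @ Wk_c).T    # what-to-what
c2p = (content @ Wq_c) @ (position @ Wk_p).T   # what-to-where
p2c = (position @ Wq_p) @ (content @ Wk_c).T   # where-to-what

scores = (c2c + c2p + p2c) / (3 * d) ** 0.5    # DeBERTa scales by sqrt(3d)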

DeBERTa surpassed the human baseline on SuperGLUE, the premier language understanding benchmark, and remains the strongest model in the BERT family. Its success suggests that separating "what" from "where" in attention is a genuinely important architectural principle.

Encoder vs. Decoder: The Paradigm Comparison

We've now seen both sides. Time to put them face to face.

Think back to our movie review classifier. If we want to classify "The movie was dark but the ending was bright" as positive, negative, or mixed, we need the model to understand the entire review before making a decision. The word "but" is doing critical work — it signals a contrast between "dark" and "bright." A model that reads left-to-right and reaches "dark" might tentatively guess "negative" — and it won't see the correction until later. BERT reads the whole thing at once and catches the contrast from the start. This is the BERT advantage for understanding tasks.

Now imagine we want the model to write a review. "Generate a mixed sentiment review about a sci-fi movie." The model has to produce text word by word, and it can't look ahead because the words don't exist yet. This is the GPT advantage for generation tasks.

Here's where the rubber meets the road — the practical differences that matter when you're choosing a model for a production system:

| Dimension | BERT (Encoder) | GPT (Decoder) |
|---|---|---|
| Attention pattern | Bidirectional — every token sees every other token | Causal — each token sees only past tokens |
| Training objective | Masked language modeling (fill in blanks) | Next-token prediction (continue the sequence) |
| Natural strength | Understanding: classification, NER, similarity, extractive QA | Generation: text, dialogue, code, translation |
| How you adapt it | Fine-tune on labeled data with a task-specific head | Prompt engineering, few-shot, or fine-tune |
| Inference speed | Fast — one forward pass for the entire input | Slower — one forward pass per generated token |
| Labeled data? | Required for fine-tuning | Optional (zero/few-shot works for many tasks) |
| Parameter range | 66M – 1.5B (practical range) | 117M – trillions (and growing) |
| Output format | Classification labels, token labels, spans | Free-form text (anything) |
| Deterministic? | Yes, with fixed weights | Depends on sampling temperature |

Here's the uncomfortable truth for the BERT paradigm: GPT-style models are increasingly competitive on understanding tasks too. GPT-4 does sentiment classification, named entity recognition, and question answering through prompting — no fine-tuning, no labeled data. For many tasks, the accuracy gap between a prompted GPT-4 and a fine-tuned DeBERTa has closed.

But "closed" isn't "eliminated." I still occasionally reach for BERT-style models over GPT, and here's when. If I need millisecond latency — BERT processes the entire input in a single forward pass, while GPT generates one token at a time. If I need cheap inference at scale — a DistilBERT model is 66 million parameters, GPT-4 is orders of magnitude larger. If I need guaranteed output format — a classification head outputs exactly one of N labels, always. GPT can hallucinate anything. If I need deterministic, reproducible results — a fine-tuned BERT with fixed weights gives the same output every time. GPT with temperature sampling does not.

For production classification pipelines processing millions of items per hour, a fine-tuned DistilBERT running at 2ms per input, costing fractions of a cent per thousand calls, is still the pragmatic choice.

The Hybrid Workflow

In practice, the most effective teams don't choose one paradigm over the other. They use both, in sequence.

The workflow goes like this: start with GPT for prototyping. Use prompting to validate that your task is solvable — can a language model distinguish positive, negative, and mixed reviews from the prompt alone? If the answer is yes, use GPT to generate labeled data by classifying a large batch of unlabeled examples. Then take those GPT-generated labels, fine-tune a small BERT-style model (DistilBERT, RoBERTa) on them, and deploy the small model to production.

# Phase 1: GPT generates labels (prototyping)
from openai import OpenAI
client = OpenAI()

def label_with_gpt(review_text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Classify this movie review as positive, negative, "
                "or mixed. Reply with one word only.\n\n"
                f"Review: {review_text}"
            )
        }],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

# Phase 2: Fine-tune a small model on GPT's labels (production)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Train on the GPT-labeled data → deploy a 66M-param model
# that runs in 2ms and costs 100× less per call

The result: you get GPT's flexibility for exploration and data generation, and BERT's speed and cost efficiency for deployment. The paradigms aren't competitors — they're complements. GPT is your research assistant. BERT is your production engineer.

This brings our running example full circle. We started with five movie reviews and a question about how models handle ambiguous words like "dark" and "bright." BERT reads the full review and resolves the ambiguity using both left and right context. GPT reads left-to-right and resolves it using only what came before — but it compensates with scale, in-context learning, and the sheer volume of language patterns it absorbed during training. For classifying those reviews in production, a small fine-tuned BERT model is the efficient choice. For exploring new classification schemes or generating training data, GPT is the flexible choice.

Wrap-Up

If you're still with me, thank you. I hope it was worth it.

We started with a single architectural choice — which tokens can see which — and traced how it splits the Transformer into two fundamentally different paradigms. We watched BERT learn to fill in blanks using the 80/10/10 masking trick, and saw how NSP was a well-intentioned idea that didn't survive contact with better training. We crossed the fork and watched GPT learn to write one token at a time, then scale from 117 million to hundreds of billions of parameters, picking up zero-shot learning, few-shot learning, and chain-of-thought reasoning along the way. We saw RoBERTa prove that training discipline matters as much as architecture, ALBERT prove that you can share parameters across layers without disaster, and DeBERTa prove that separating content from position in attention is a significant win. And we ended where practice begins: choosing the right tool for the job, and combining both paradigms when that makes sense.

My hope is that the next time you see someone claim "GPT has made BERT obsolete" or "BERT is all you need for NLP," instead of nodding along, you'll know exactly where each paradigm shines and where it struggles — having a pretty good mental model of what's going on under the hood of both.

Resources

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) — the O.G. paper. Remarkably readable for a paper that spawned an entire paradigm.

Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) — the GPT-1 paper. Short, unpretentious, and historically important.

Language Models are Few-Shot Learners (Brown et al., 2020) — the GPT-3 paper. Long, but the few-shot learning demonstrations are unforgettable.

RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019) — proves that training discipline matters more than architecture changes. A humbling read.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention (He et al., 2020) — the strongest BERT-style model. The disentangled attention mechanism is worth understanding in detail.

Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) — the Chinchilla paper. Changed how everyone thinks about scaling. Short and wildly influential.

Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al., 2023) — the counterpoint to the emergence narrative. Insightful regardless of where you land on the debate.