LLM Training — Pretraining, Scaling & Fundamentals
I avoided digging into how LLMs are actually trained for longer than I’d like to admit. I could use them, prompt them, even fine-tune them. But every time someone asked “how is the base model actually made?” I’d wave my hands and mumble something about “next-token prediction on lots of text.” That answer is technically correct and practically useless. It’s like saying a skyscraper is made from “stacking stuff up.” Finally the discomfort of not knowing what actually happens between “download Common Crawl” and “you have a working chatbot” grew too great for me. Here is that dive.
LLM pretraining is the process of teaching a randomly initialized neural network to understand and generate language by exposing it to enormous quantities of text. The model learns no specific task — it learns language itself, developing internal representations of syntax, semantics, world knowledge, and reasoning patterns. Pretraining was popularized by GPT (2018) and BERT (2018), and has since become the foundation of every major language AI system.
Before we start, a heads-up. We’re going to talk about web crawling, hash functions, tokenization algorithms, loss functions, scaling power laws, GPU clusters, and numerical precision. You don’t need to know any of it beforehand. We’ll add the concepts we need one at a time, with explanation.
This isn’t a short journey, but I hope you’ll be glad you came.
Where does the training data come from?
Cleaning the mess
Building a vocabulary from scratch
The pretraining objective: what does the model actually learn?
Three flavors of prediction
Rest stop
How big should the model be? Scaling laws
The Chinchilla correction
Compute-optimal training and the data wall
The physical machine: clusters and interconnects
When training goes wrong: loss spikes and stability
Feeding the beast: curriculum and data mixing
Wrap-up
Resources and credits
Where Does the Training Data Come From?
Imagine we want to train a small language model for a recipe website. We want it to complete sentences like “Preheat the oven to” or “Stir the mixture until.” Where do we get the training text?
For our tiny recipe model, we might scrape three cooking blogs. We’d end up with maybe 10,000 sentences. For a real LLM, the answer is the same — scrape the web — but at a scale that’s hard to internalize. The primary source for most open LLMs is Common Crawl, a nonprofit that has been crawling the web monthly since 2008. Each monthly snapshot is tens of terabytes of raw HTML. All of it. News articles, blog posts, forum threads, product listings, cookie consent banners, spam, and an astonishing number of pages that consist entirely of the phrase “lorem ipsum.”
Common Crawl alone gives us hundreds of trillions of words. But raw web text is like a river after a flood: there’s water, but also driftwood, plastic bags, and the occasional shopping cart. I’ll be honest — when I first looked at raw Common Crawl data, I was genuinely surprised by how much of it is garbage. Not edge-case garbage. Majority garbage.
The major open pretraining datasets each take a different approach to taming this flood:
The Pile, built by EleutherAI, took a curation-first approach. Instead of starting with one giant web crawl, they assembled 22 distinct subsets: Wikipedia, ArXiv papers, PubMed biomedical literature, GitHub code, Stack Exchange Q&A, Project Gutenberg books, YouTube subtitles, even Enron emails and Ubuntu IRC chat logs. Each subset was chosen because it represented a different kind of language the model should learn. The total: 825 GB of text, deliberately diverse.
RedPajama, from Together AI, aimed to reproduce the data recipe behind LLaMA. It combined filtered Common Crawl with Wikipedia, books, ArXiv, Stack Exchange, PubMed, and GitHub code — roughly 1.2 trillion tokens. The key word is “reproduce” — they wanted to show that an open dataset could match what Meta used behind closed doors.
RefinedWeb, built for the Falcon models, took the opposite philosophy. Instead of curating many sources, they focused entirely on Common Crawl but applied brutally aggressive filtering. The thesis: if you clean web data well enough, you don’t need curated sources. A bold claim, and Falcon’s strong performance backed it up.
Back to our recipe model. We scraped three blogs and got 10,000 sentences. That’s our “Common Crawl.” Now we need to clean it. And that brings us to the hardest part of data preparation — the part that doesn’t get enough credit.
Cleaning the Mess
Our recipe scrape contains duplicates. Lots of them. The same chocolate chip cookie recipe appeared on all three blogs (they all copied it from the same source, naturally). There are also navigation menus, ad text, comment sections full of “great recipe!!!”, and one blog that auto-generated 500 pages of keyword-stuffed SEO nonsense about “best recipe for recipe recipes.”
Cleaning happens in two stages: deduplication and quality filtering. Both matter more than most people realize.
Deduplication
I assumed web data was mostly unique. I was wrong by an embarrassing margin. Studies consistently find that 30–50% of web-crawled text is near-duplicate content. Not vaguely similar — near-identical. Boilerplate footers, syndicated articles, copy-pasted forum posts, scraped mirrors of the same sites.
Training on duplicate data is worse than wasteful. It causes the model to memorize those passages rather than generalize from them. If the same paragraph appears a thousand times, the model learns to recite it verbatim, which is bad for both generalization and privacy.
The standard deduplication approach uses a technique called MinHash with Locality-Sensitive Hashing (LSH). Here’s how it works, traced through our recipe example. Suppose we have three documents:
Doc A: "Preheat oven to 350°F. Mix flour, sugar, and butter."
Doc B: "Preheat oven to 350°F. Mix flour, sugar, and butter. Add eggs."
Doc C: "Sauté onions in olive oil until translucent."
First, we break each document into overlapping chunks of words called n-grams (typically 5-grams). Doc A becomes the set {“Preheat oven to 350°F Mix”, “oven to 350°F Mix flour”, ...}. Doc B produces a very similar set with a few extras at the end.
Next, we apply multiple hash functions to each n-gram set and keep only the minimum hash value for each function. This produces a compact fingerprint called a MinHash signature — maybe 128 numbers that summarize the entire document. The mathematical property that makes this work: the probability that two documents share a MinHash value equals their Jaccard similarity (the overlap of their n-gram sets divided by the union).
Then LSH groups documents with similar fingerprints into buckets. Within each bucket, we compute exact Jaccard similarity. Documents above a threshold (commonly 0.8) are flagged as near-duplicates, and all but one copy gets removed.
For our recipe example, Docs A and B would land in the same LSH bucket, and their Jaccard similarity (high, because most n-grams overlap) would exceed 0.8. We’d keep one and discard the other. Doc C has completely different n-grams, so it stays.
At production scale, the LLaMA team applied exact deduplication at the line level (removing identical lines across the corpus) plus fuzzy deduplication with MinHash at the document level. This alone cut their data by roughly a third.
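To make the mechanics concrete, here's a pure-Python sketch of MinHash signatures and estimated Jaccard similarity on the three recipe docs above. It's illustrative only: real pipelines use optimized implementations, and I've left out the LSH bucketing step that makes this scale.

```python
import hashlib
from itertools import combinations

def shingles(text, n=5):
    """Overlapping word n-grams, as in the Doc A/B/C example."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set, num_hashes=128):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles; the result is the document fingerprint."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = {
    "A": "Preheat oven to 350°F. Mix flour, sugar, and butter.",
    "B": "Preheat oven to 350°F. Mix flour, sugar, and butter. Add eggs.",
    "C": "Sauté onions in olive oil until translucent.",
}
sigs = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
for x, y in combinations(docs, 2):
    print(x, y, round(estimated_jaccard(sigs[x], sigs[y]), 2))
# A-B comes out high (near-duplicates); A-C and B-C come out near zero.
```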
Quality Filtering
After deduplication, we still have a lot of low-quality text. The goal is to keep the Wikipedia articles and cooking instructions while removing the SEO spam and cookie consent banners.
Three approaches, often used in combination:
Perplexity-based filtering trains a small, cheap language model (often a KenLM n-gram model) on a reference corpus of high-quality text, such as Wikipedia. Then every document in the crawl gets scored: how surprised is the quality model by this text? A well-written article gets low perplexity (not surprising). A page of garbled text or keyword spam gets high perplexity (very surprising). Documents above a perplexity threshold get removed. This is the core of the CCNet pipeline, developed by Facebook for processing Common Crawl.
Heuristic rules catch patterns that statistical models miss. Remove documents shorter than 200 characters. Remove documents where more than 30% of characters are special symbols. Remove documents with a word-to-symbol ratio below a threshold. Remove documents with more than 3 consecutive repeated lines. The RefinedWeb paper documents dozens of such rules, each discovered by manually inspecting failure cases.
Classifier-based filtering trains a fast classifier (often fastText) on a small set of human-rated examples. “Is this high quality? Yes or no.” Then it scores the entire crawl. This is more expensive than heuristics but catches subtler quality issues.
Our recipe model’s equivalent: we read through our 10,000 scraped sentences, threw out the navigation menus and comment spam, and kept the actual recipes. At web scale, you can’t read anything manually, so you build automated versions of yourself.
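For a flavor of what those automated versions look like, here's a toy heuristic filter in the spirit of the rules above. The thresholds echo the examples mentioned earlier but are illustrative, not the exact values from RefinedWeb or any other published pipeline.

```python
def passes_heuristics(doc: str) -> bool:
    """Return True if the document survives a few simple quality rules."""
    if len(doc) < 200:                                   # too short to be useful
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in doc)
    if symbols / len(doc) > 0.30:                        # too many special characters
        return False
    lines = [line.strip() for line in doc.splitlines() if line.strip()]
    for a, b, c, d in zip(lines, lines[1:], lines[2:], lines[3:]):
        if a == b == c == d:                             # more than 3 repeated lines in a row
            return False
    return True

print(passes_heuristics("Preheat the oven to 350°F. " * 20))   # True
print(passes_heuristics("best recipe for recipe recipes!!!"))  # False: too short
```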
Here’s something that took me a while to internalize: the quality of your data is a scaling law. Bad data doesn’t get diluted by more compute. It shifts the entire loss curve upward. A well-curated 1T token dataset can produce a better model than a poorly filtered 5T token dataset. Microsoft’s Phi models demonstrated this dramatically — they trained small models on “textbook quality” data and got performance that rivaled models 10× larger trained on noisier corpora.
Building a Vocabulary from Scratch
We have clean text. Before we can train a model on it, we need to convert words into numbers. But which numbers? And what counts as a “word”?
Our recipe corpus has words like “preheat,” “350°F,” “tablespoon,” and “sauté.” If we treat each unique word as a token, we get a vocabulary of maybe 5,000 entries. That works for a tiny model, but real English has hundreds of thousands of words, plus every language we want to support, plus code, plus mathematical notation. A word-level vocabulary would be enormous and still fail on any word it hasn’t seen before.
The solution is subword tokenization, and the dominant algorithm is Byte Pair Encoding (BPE). The core idea: start with individual characters (or bytes), then repeatedly merge the most frequent pair of adjacent tokens into a new token. After enough merges, common words become single tokens while rare words get split into familiar pieces.
Let’s walk through BPE on a tiny vocabulary. Suppose our entire corpus is:
Corpus: "heat heat heat preheat preheat"
Start with character-level tokens:
h e a t h e a t h e a t p r e h e a t p r e h e a t
Frequency of adjacent pairs:
(h, e): 5 (e, a): 5 (a, t): 5 (p, r): 2 (r, e): 2
Merge #1: (h, e) → "he" (most frequent pair, tied — pick one)
he a t he a t he a t p r e he a t p r e he a t
Merge #2: (a, t) → "at"
he at he at he at p r e he at p r e he at
Merge #3: (he, at) → "heat"
heat heat heat p r e heat p r e heat
Merge #4: (r, e) → "re", then (p, re) → "pre" ... and so on
After enough merges, “heat” is a single token and “preheat” is two tokens: “pre” + “heat.” A word the model has never seen, like “reheat,” would get split into “re” + “heat” — meaningful pieces that the model knows.
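Here's a toy implementation of the merge loop, just to make the procedure concrete. Ties between equally frequent pairs are broken arbitrarily, so the exact merge order may differ from the walk-through above:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Tiny BPE trainer: repeatedly merge the most frequent adjacent token pair."""
    words = [list(w) for w in corpus.split()]   # start with character-level tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]     # most frequent pair wins
        merges.append((a, b))
        for w in words:                          # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, tokenized = train_bpe("heat heat heat preheat preheat", 4)
print(merges)     # learned merge rules, e.g. starting with ('h', 'e')
print(tokenized)  # "heat" collapses to a single token; "preheat" to a few pieces
```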
The vocabulary size — how many merges you perform — is a design choice with real tradeoffs. LLaMA-2 uses 32,000 tokens (via SentencePiece). GPT-4 uses about 100,000 tokens (via tiktoken). A larger vocabulary means each word is represented by fewer tokens (saving sequence length, which saves compute in the attention mechanism), but it also means a larger embedding matrix (the lookup table that converts token IDs to vectors) and a larger output layer.
At LLM scale, the tokenizer is typically trained on a representative sample of the training data, not the full corpus. You want the tokenizer to reflect the distribution of text the model will see. Train the tokenizer on English Wikipedia and your model will struggle with Python code. Train it on a balanced mix and it handles both.
One thing that still trips me up: the tokenizer is trained separately from the model, before model training begins. Once set, it’s frozen. You cannot change the vocabulary mid-training. This means tokenizer design decisions made months before training completes have permanent consequences. A mismatch between the tokenizer and the training data is one of those silent bugs that degrades quality without ever throwing an error.
The Pretraining Objective: What Does the Model Actually Learn?
We have clean data and a tokenizer. Now for the question that determines everything: what task do we give the model during training? There are no labels, no human annotations. The model has to learn from the text itself. The trick is designing a self-supervised objective — a task the data provides for free.
Let’s return to our recipe corpus. We have the sentence: “Preheat the oven to 350°F and grease the pan.” What can we ask the model to do with this, without any human labeling?
Causal Language Modeling: Guess What Comes Next
The most powerful idea in modern AI might also be the most underwhelming when you first hear it: predict the next token. That’s it. Given everything the model has read so far, what token comes next?
For our recipe sentence, the model sees “Preheat” and must predict “the.” Then it sees “Preheat the” and predicts “oven.” Then “Preheat the oven” → “to.” And so on, sliding forward one token at a time. At every position, the model outputs a probability distribution over the entire vocabulary — all 32,000 tokens — and gets penalized for assigning low probability to the actual next token.
Input tokens: Preheat the oven to 350 °F and
↓ ↓ ↓ ↓ ↓ ↓ ↓
Model predicts: the oven to 350 °F and grease
↕ ↕ ↕ ↕ ↕ ↕ ↕
Actual next: the oven to 350 °F and grease ✓ ✓ ✓ ✓ ✓ ✓ ✓
Loss = cross-entropy at EVERY position, averaged across the sequence
This is called causal language modeling (CLM) because each prediction is causal: position 5 can attend to positions 1 through 4 but never to position 6 or beyond. The model reads left to right, like us. This is enforced by a causal attention mask — a triangular matrix that blocks each position from seeing the future:
Causal Mask (5 tokens):
Preheat the oven to 350
Preheat [ 1 0 0 0 0 ] sees only itself
the [ 1 1 0 0 0 ] sees Preheat, the
oven [ 1 1 1 0 0 ] sees Preheat, the, oven
to [ 1 1 1 1 0 ] sees positions 1-4
350 [ 1 1 1 1 1 ] sees everything so far
1 = can attend, 0 = masked (set to -∞ before softmax)
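In code, the mask is just a lower-triangular matrix. A minimal PyTorch sketch, with a random tensor standing in for real attention logits:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))        # 1 on/below diagonal, 0 above

scores = torch.randn(seq_len, seq_len)                        # stand-in attention logits
scores = scores.masked_fill(causal_mask == 0, float("-inf"))  # block attention to the future
weights = torch.softmax(scores, dim=-1)                       # future positions get weight 0
print(weights)                                                # upper triangle is all zeros
```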
The implementation has a critical detail called the shift trick. The model’s output at position t is its prediction for position t+1. So to compute the loss, we compare logits at positions 1 through N−1 against the actual tokens at positions 2 through N:
import torch.nn.functional as F
logits = model(input_ids)               # shape: (batch, seq_len, vocab_size)
shift_logits = logits[:, :-1, :]        # predictions for positions 2..N
shift_labels = input_ids[:, 1:]         # actual tokens at positions 2..N
# cross_entropy expects (num_predictions, vocab) and (num_predictions,), so flatten
loss = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                       shift_labels.reshape(-1))
Off-by-one errors in this shift are one of the most common bugs in training code. I’ve seen experienced engineers lose days to it.
During training, we use teacher forcing: the model always receives the actual ground-truth tokens as input, not its own predictions. At inference time, there is no ground truth, so the model feeds its own outputs back in, generating one token at a time autoregressively.
Why does this absurdly minimal objective produce such capable models? Because language encodes everything. To accurately predict the next word in a Wikipedia article about quantum mechanics, the model must understand physics. To predict the next line of a Python function, it must understand programming logic. To predict the next sentence in a novel, it must understand human psychology. The objective is trivial. The knowledge required to minimize it is not. That’s the deep insight behind GPT and every decoder-only model since.
CLM powers GPT, LLaMA, Mistral, Claude, and effectively every chatbot you’ve interacted with. The industry has converged on it because generation is the killer app, and decoder-only models with CLM scale cleanly: one architecture, one objective, one pipeline.
Three Flavors of Prediction
CLM isn’t the only pretraining objective. There are two important alternatives, and understanding all three reveals why the field converged on CLM — and what we gave up along the way.
Masked Language Modeling: Fill in the Blanks
BERT (2018) took a completely different approach. Instead of reading left to right and predicting the next word, it randomly hides some tokens and asks the model to fill them back in — using context from both sides.
Our recipe sentence becomes:
Original: "Preheat the oven to 350°F and grease the pan"
Masked: "Preheat the [MASK] to 350°F and [MASK] the pan"
The model must predict:
[MASK] → "oven" (using left context "Preheat the" AND right context "to 350°F...")
[MASK] → "grease" (using "...and" AND "the pan")
The superpower here is bidirectional context. When predicting “oven,” the model sees both “Preheat the” and “to 350°F.” CLM only sees the left side. This makes masked language modeling (MLM) produce richer representations for understanding tasks — classification, named entity recognition, question answering.
BERT masks 15% of tokens using an 80/10/10 protocol: 80% are replaced with the special [MASK] token, 10% are replaced with a random word from the vocabulary, and 10% are left unchanged but still predicted. The random replacements and unchanged tokens prevent the model from learning a shortcut that only activates when it sees [MASK] — a token that never appears during fine-tuning or real use.
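Here's a toy sketch of that 80/10/10 protocol. Real implementations work on token IDs and vectorize this, but the logic is the same:

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, mask_token="[MASK]"):
    """BERT-style 80/10/10 masking over a list of tokens."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                       # this position contributes to the loss
            roll = random.random()
            if roll < 0.8:
                inputs.append(mask_token)            # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep it, but still predict it
        else:
            inputs.append(tok)
            labels.append(None)                      # no gradient signal here
    return inputs, labels

sentence = "Preheat the oven to 350°F and grease the pan".split()
print(mlm_mask(sentence, vocab=["oven", "pan", "flour", "stir", "grease"]))
```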
The tradeoff is severe: MLM models cannot generate text. There’s no left-to-right factorization, no way to autoregressively sample. And there’s a sample efficiency problem: only 15% of tokens contribute to the loss in each training step. The other 85% provide context but generate zero gradient signal. That’s a lot of wasted compute.
Prefix Language Modeling and Span Corruption
Between CLM and MLM lies a family of hybrid objectives. Prefix language modeling, used in models like UniLM, lets the model attend bidirectionally within a “prefix” section and then predict what follows autoregressively. Think of it as reading an entire question bidirectionally, then generating the answer left-to-right.
Span corruption, used by T5, generalizes MLM by masking contiguous spans of tokens rather than individual words. Each span is replaced with a single sentinel token, and the decoder reconstructs the missing spans:
Original: "Preheat the oven to 350°F and grease the pan"
Corrupted: "Preheat <X0> 350°F and <X1> pan"
Target: "<X0> the oven to <X1> grease the"
The encoder sees the corrupted input (shorter — saves compute).
The decoder generates only the missing pieces (compact targets).
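A toy version of the corruption step. The span boundaries are hand-picked here purely to reproduce the example; real pipelines sample them randomly:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption sketch: each masked span becomes a sentinel
    in the input; the target lists sentinels followed by the missing spans."""
    inp, tgt, next_tok = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<X{i}>"
        inp += tokens[next_tok:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        next_tok = end
    inp += tokens[next_tok:]
    return inp, tgt

tokens = "Preheat the oven to 350°F and grease the pan".split()
corrupted, target = span_corrupt(tokens, [(1, 4), (6, 8)])
print(corrupted)  # ['Preheat', '<X0>', '350°F', 'and', '<X1>', 'pan']
print(target)     # ['<X0>', 'the', 'oven', 'to', '<X1>', 'grease', 'the']
```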
T5 cast every NLP task as text-to-text: same architecture, same loss, every task. Pretraining was span corruption. Fine-tuning was “translate English to German: The house is blue” → “Das Haus ist blau.” Elegant and practical, but the encoder-decoder architecture is more complex to scale than a pure decoder.
Here’s the mapping that’s worth internalizing:
| Architecture | Objective | Attention | Strengths | Examples |
|---|---|---|---|---|
| Encoder-only | MLM | Bidirectional | Classification, retrieval, embeddings | BERT, RoBERTa, ELECTRA |
| Decoder-only | CLM | Causal (left-to-right) | Generation, chat, code, reasoning | GPT, LLaMA, Mistral |
| Encoder-Decoder | Span corruption / prefix LM | Bidirectional encoder + causal decoder | Translation, summarization | T5, BART, UL2 |
The industry converged on decoder-only CLM not because it’s theoretically best for every task, but because it’s the simplest to scale and generation turned out to be the killer capability. One architecture, one objective, one training pipeline. Encoder-only models remain king for embeddings and classification. Encoder-decoder models are strong for structured sequence-to-sequence problems. But when your model is big enough, decoder-only does everything well enough, and “well enough at everything” beat “best at one thing.”
Rest Stop
Congratulations on making it this far. If you want to stop here, you can.
You now have a solid mental model of LLM pretraining: raw text gets scraped from the web, cleaned through deduplication and quality filtering, converted to tokens by a BPE tokenizer, and fed into a transformer that learns to predict the next token. That’s the foundation of GPT, LLaMA, and every major language model.
This understanding doesn’t tell the complete story. It doesn’t answer “how big should the model be?” or “how many GPUs do you need?” or “what happens when the training run blows up at 3 AM?” Those are the questions that separate someone who understands pretraining from someone who can actually plan and execute one.
The short version: there are mathematical laws that tell you the optimal model size for your budget, you need hundreds of GPUs connected by specialized networking hardware, and training runs are fragile beasts that require careful babysitting. There. You’re 60% of the way.
But if the discomfort of not knowing what’s underneath is nagging at you, read on.
How Big Should the Model Be? Scaling Laws
For years, training LLMs was expensive gambling. You picked a model size, picked a dataset size, spent millions of dollars, and hoped it worked. Then in 2020, a team at OpenAI discovered something that turned gambling into engineering.
Kaplan and colleagues trained hundreds of transformer models ranging from a few hundred parameters to 1.5 billion parameters. When they plotted loss against model size on a log-log chart, they got a straight line. Same for loss against dataset size. Same for loss against total compute.
Three clean power-law relationships:
Loss vs. Parameters: L(N) ∝ N^(-0.076)
Loss vs. Data: L(D) ∝ D^(-0.095)
Loss vs. Compute: L(C) ∝ C^(-0.050)
Loss (log scale)
│
│╲
│ ╲
│ ╲ ← straight line on log-log
│ ╲ (this is a power law)
│ ╲
│──────────── Parameters or Tokens (log scale)
The profound implication: every time you 10× your parameters, you get a fixed reduction in loss. The returns never stop. They get more expensive, but they never stop. No one has found a ceiling yet.
Kaplan’s key conclusion: given a fixed compute budget, prioritize model size over data. Make the model bigger rather than training on more data. The recommendation: if you 10× your compute, 5× your parameters but only 2× your data.
This directly shaped the “make it bigger” era. GPT-3 was 175 billion parameters trained on 300 billion tokens. Massive model, modest data. The thinking was: parameters matter most.
This turned out to be wrong.
The Chinchilla Correction
In March 2022, DeepMind published a paper that reshuffled the entire field overnight. The title was dry: “Training Compute-Optimal Large Language Models.” The finding was devastating: almost every existing large model was undertrained.
Hoffmann’s team ran a proper sweep: over 400 models from 70 million to 16 billion parameters, trained on varying amounts of data, all with fixed compute budgets. The picture that emerged was different from Kaplan’s.
The core finding: for compute-optimal training, scale parameters and data equally. The optimal ratio is roughly 20 tokens per parameter:
Chinchilla Rule: D ≈ 20 × N
N = number of model parameters
D = number of training tokens
Given compute C ≈ 6 × N × D:
Optimal N ∝ √C
Optimal D ∝ √C (both scale equally with compute)
To prove it, they ran the most convincing experiment possible. DeepMind had recently trained Gopher: 280B parameters, 300B tokens. Classic Kaplan-era — huge model, modest data. Then they trained Chinchilla: 70B parameters (4× smaller) but on 1.4 trillion tokens (4.7× more data). Same compute budget. The result: Chinchilla won on nearly every benchmark. A model 4× smaller, trained on more data, was flat-out better. And because it was smaller, it was cheaper to deploy.
I still have to look up the exact Chinchilla ratio every time I do a back-of-envelope calculation. So here’s the lookup table I keep coming back to:
| Parameters | Optimal Tokens (≈20N) | Approx FLOPs |
|---|---|---|
| 400M | 8B | 2 × 10¹⁹ |
| 1B | 20B | 1.2 × 10²⁰ |
| 7B | 140B | 5.9 × 10²¹ |
| 13B | 260B | 2 × 10²² |
| 70B | 1.4T | 5.9 × 10²³ |
GPT-3 (175B parameters, 300B tokens) was trained on roughly 12× fewer tokens than Chinchilla would recommend. Massively undertrained by modern standards.
A quick Python function I use for budget planning:
import math

def chinchilla_optimal(flops_budget):
    """Given a FLOPs budget, estimate compute-optimal model size and tokens."""
    # C ≈ 6 * N * D, with D ≈ 20 * N
    # So C ≈ 120 * N², therefore N = sqrt(C / 120)
    N = math.sqrt(flops_budget / 120)
    D = 20 * N
    return N, D

# Example: 10^22 FLOPs
n, d = chinchilla_optimal(1e22)
print(f"≈{n/1e9:.1f}B params, ≈{d/1e9:.0f}B tokens")
# → ≈9.1B params, ≈183B tokens
Compute-Optimal Training and the Data Wall
Chinchilla changed how labs allocate compute. But it also exposed a looming problem: we might run out of training data.
A 1-trillion-parameter model, by the Chinchilla rule, would need 20 trillion tokens. The entire publicly available internet — Common Crawl, books, code, Wikipedia, everything — is estimated at 5 to 15 trillion tokens of reasonable quality. We’re already scraping the edges.
This “data wall” has driven several responses. Meta’s LLaMA deliberately over-trains: LLaMA-7B was trained on 1 trillion tokens, roughly 7× the Chinchilla-optimal amount. The reasoning is pragmatic. Chinchilla optimizes for training compute, but inference cost depends on model size, not on how long you trained it. A smaller model trained longer is cheaper to deploy. When you’re serving millions of users, the few extra weeks of training compute pay for themselves many times over in cheaper inference.
Synthetic data — models generating training data for other models — is another escape route. It works surprisingly well for code and mathematics, where generated outputs can be verified for correctness. But it carries the risk of model collapse: distributional biases compound over generations, like repeatedly photocopying a photocopy until only the darkest marks survive.
Multi-epoch training (reusing data multiple times) works with appropriate regularization, but returns diminish after 4–8 passes. Multimodal data — images, video, audio — is partly a strategy to escape the text data wall by tapping into the enormous information content of visual and auditory signals.
No one is completely certain about the right path forward. My favorite thing about this corner of the field is how open the questions still are.
The Physical Machine: Clusters and Interconnects
Let’s ground all this theory in hardware. Training a 7B-parameter model requires multiple GPUs working in lockstep. Training a 70B model requires hundreds of them. The physical infrastructure is as much a part of the story as the algorithms.
A modern training cluster is built from nodes, each containing 8 high-end GPUs (typically NVIDIA H100s). Within a node, GPUs communicate via NVLink, a proprietary interconnect that provides up to 900 GB/s of bidirectional bandwidth per GPU. That’s fast enough that GPUs within a node can share data almost as if they were accessing their own memory.
Between nodes, the interconnect is InfiniBand, which provides 400 Gbps with latencies around 1–2 microseconds. Fast by networking standards, but roughly 18× slower than NVLink. This asymmetry shapes everything about distributed training strategy.
To get an intuition for these numbers: if NVLink were a fire hose, InfiniBand would be a garden hose. Both move water, but you plan very differently depending on which one you have between two GPUs.
Distributed training uses three types of parallelism, layered according to the bandwidth they need:
Tensor Parallelism (TP) splits individual weight matrices across GPUs within a node. Each GPU computes a slice of the attention or feedforward layer, then they combine results. This requires constant communication between GPUs, so it uses the fastest link: NVLink within a node.
Pipeline Parallelism (PP) assigns different transformer layers to different groups of nodes. GPU group 0 runs layers 1–8, group 1 runs layers 9–16, and so on. Data flows through groups like an assembly line. The downside is “pipeline bubbles” — GPUs sit idle waiting for activations from earlier stages. This tolerates moderate latency, so it spans a few nodes connected by InfiniBand.
Data Parallelism (DP / FSDP) replicates the training across many groups, each processing different batches. Gradients are averaged across groups after each step. This requires the least frequent communication, so it spans the widest network distances.
Example: LLaMA-70B on 256 GPUs
TP = 8 (within each node, NVLink — splits weight matrices)
× PP = 4 (across 4 node groups — splits layers)
× DP = 8 (across 8 pipeline replicas — splits data)
= 8 × 4 × 8 = 256 GPUs
Think of it as 8 identical assembly lines (DP),
each with 4 stages (PP),
each stage using 8 GPUs in lockstep (TP).
The fire-hose-and-garden-hose analogy comes back here. TP needs the fire hose (NVLink, within a node). PP can work with the garden hose (InfiniBand, across a few nodes). DP can work with a trickle (InfiniBand across the wider network), because it only needs to sync gradients periodically.
The key memory optimization that makes all of this practical is FSDP (Fully Sharded Data Parallel), PyTorch’s implementation of ZeRO-3 from DeepSpeed. In standard data parallelism, every GPU holds a complete copy of the model, optimizer states, and gradients — massive redundancy. FSDP shards everything: each GPU stores only 1/N of the parameters, optimizer states, and gradients. During the forward pass, parameters are gathered from other GPUs right before they’re needed, used, then discarded. This can reduce per-GPU memory from over 100 GB to under 15 GB for a 7B model across 8 GPUs.
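To give a sense of what this looks like in practice, here's a minimal sketch of wrapping a model in FSDP, assuming the job is launched with one process per GPU (e.g. via torchrun). `build_model()` is a hypothetical stand-in for constructing your transformer; real configs add wrapping policies, activation checkpointing, and more.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

dist.init_process_group("nccl")                       # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model()                                 # hypothetical: returns the transformer

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # ZeRO-3 style: shard params, grads, optimizer state
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,                   # BF16 parameters during compute
        reduce_dtype=torch.bfloat16,                  # BF16 gradient reduction
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```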
When Training Goes Wrong: Loss Spikes and Stability
The first time I saw a loss spike in a training run, I thought the run was dead. Loss had been decreasing smoothly for days, and then — a sudden upward jump, like a heart rate monitor during a horror movie. Twenty minutes of panic. Then the loss recovered, as if nothing happened.
Loss spikes are the most common instability in LLM training. The loss is declining peacefully, then shoots upward before (usually) recovering over the next few hundred steps. Causes include:
A bad data batch — a batch containing highly unusual content (extremely long documents, garbled Unicode, code with pathological nesting) can produce extreme gradient values that shove the model into a bad region of parameter space. The fix: better data filtering before training, or detecting and skipping batches with anomalous loss during training.
A learning rate that’s too high — the optimizer takes a step that’s too large and overshoots into a steep valley. The fix: reduce peak learning rate, use longer warmup, or both.
Numerical instability — FP16 overflow, division by very small numbers in layer normalization, or softmax overflow. The fix: use BF16 (which has the same dynamic range as FP32, making overflow extremely rare), add small epsilon values to denominators, and use numerically stable softmax implementations.
Google’s PaLM team documented a practical recovery approach: when a loss spike occurs, roll back to a checkpoint from roughly 100 steps before the spike started, and skip the data batches that triggered it. It’s crude, but effective. Some teams also use checkpoint averaging — averaging weights from the last 5–20 checkpoints to smooth out catastrophic updates and return the model to a stable region.
The universal safety net is gradient clipping: before each optimizer step, compute the global norm of all gradients and, if it exceeds a threshold (typically 1.0), scale all gradients down proportionally. This preserves the direction of the update while capping its magnitude. Think of it as saying “go in the same direction, but take a smaller step.” Global norm clipping, rather than per-parameter clipping, is critical because per-parameter clipping distorts the gradient direction.
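In PyTorch this is one call between the backward pass and the optimizer step. A toy loop with a stand-in model, just to show where it sits:

```python
import torch

model = torch.nn.Linear(16, 16)                       # toy stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for _ in range(10):
    loss = model(torch.randn(4, 16)).pow(2).mean()    # toy loss
    loss.backward()
    # Rescale ALL gradients together if their global norm exceeds 1.0:
    # same direction, smaller step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```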
Other stability techniques that have proven essential at scale:
BF16 mixed precision — store most tensors in 16-bit bfloat16 (which trades mantissa precision for FP32-level dynamic range), but maintain an FP32 copy of the weights for the optimizer update. The optimizer accumulates many small gradient updates, and doing this in BF16 would cause rounding errors that compound over thousands of steps. This mixing of precisions is why it’s called “mixed precision.”
QK-Norm — applying layer normalization to queries and keys before computing attention scores. Without this, attention logits can grow unboundedly as training progresses, eventually causing instability. Introduced for the ViT-22B training run and since adopted by several large-model training recipes.
z-loss — a small penalty on the log-partition function of the output softmax, preventing output logits from growing too large. Used in PaLM training.
Learning rate warmup — starting with a near-zero learning rate and linearly increasing to the peak over the first 1–5% of training. At initialization, the model’s gradients are wildly unpredictable. A full-strength learning rate at this stage can launch the model into a bad region it never recovers from. Warmup lets the optimizer cautiously explore the landscape before committing to large steps. After warmup, the learning rate follows a cosine decay curve down to roughly 10% of the peak.
Learning Rate
│
│ peak (e.g., 3e-4)
│ ╱‾‾‾‾‾‾‾╲
│ ╱ ╲
│ ╱ ╲ cosine decay
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲____ min (e.g., 3e-5)
│╱
│───────────────────────── Training Steps
warmup main training
(1-5%) (95-99%)
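The schedule above is easy to write down directly. Here's a sketch with illustrative numbers; the peak, minimum, and warmup fraction are typical choices, not a universal recipe:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_frac=0.02):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))       # goes 1 → 0 over main training
    return min_lr + (peak_lr - min_lr) * cosine

# e.g. a 100k-step run: near zero at step 0, 3e-4 after warmup, ~3e-5 at the end
for s in (0, 2_000, 50_000, 99_999):
    print(s, f"{lr_at_step(s, 100_000):.2e}")
```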
Feeding the Beast: Curriculum and Data Mixing
So far, we’ve treated the training data as a single homogeneous stream of tokens. In practice, it’s a careful blend of many sources, and the proportions of that blend affect the resulting model profoundly.
Return to our recipe model. Suppose we have three sources: recipe instructions (60%), food science articles (30%), and restaurant reviews (10%). If we oversample restaurant reviews, the model gets great at generating “the ambiance was delightful” but worse at “preheat the oven to 350.” The mix matters.
Real LLMs face this at enormous scale:
| Source | Typical Share | What It Teaches |
|---|---|---|
| Filtered web crawl | 60–70% | Breadth: conversational text, general knowledge |
| Code (GitHub) | 5–15% | Structured reasoning, logic, programming |
| Books | 5–10% | Long-form coherence, deep knowledge |
| Wikipedia | 3–5% | Factual knowledge, structured writing |
| Academic papers | 3–5% | Scientific reasoning, technical language |
| Conversational data | 2–5% | Dialog patterns, Q&A structure |
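Mechanically, the mix is just a weighted sampling decision made for every document (or batch) drawn during training. A toy sketch, with weights that loosely follow the table above and are illustrative only:

```python
import random
from collections import Counter

# Illustrative weights only; real recipes are tuned (and usually secret).
mixture = {
    "web": 0.65, "code": 0.10, "books": 0.08, "wikipedia": 0.05,
    "papers": 0.04, "dialogue": 0.03, "other": 0.05,
}

def sample_source(mix):
    """Pick which source the next training document comes from."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

print(Counter(sample_source(mixture) for _ in range(10_000)))
# roughly 6,500 web docs, 1,000 code docs, and so on
```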
There’s growing evidence that code data improves reasoning even on non-code tasks. Code is highly structured, logically precise, and demands sequential thinking. Models trained with significant code exposure (like GPT-3.5 and GPT-4, which descend from Codex) consistently outperform text-only models on reasoning benchmarks. This was a genuine surprise to the field.
Getting the data mix right is as much art as science. DoReMi, a method from Google, attempts to learn optimal mixing proportions automatically. It frames the problem as minimax optimization: a small proxy model is trained while the domain weights are adjusted to upweight the domains where the proxy’s loss lags furthest behind a reference model, and the learned weights are then used to train the full-size model. The result often differs from hand-tuned proportions in ways that are hard to predict — for example, DoReMi might upweight a small, high-quality source that humans would have undersampled.
Curriculum learning goes further, changing the data order during training. One approach: start with simpler, cleaner text (Wikipedia, well-edited books) and gradually introduce noisier, more complex data (web crawl, code). The intuition is that early training establishes foundational representations, and disrupting those foundations with noisy data too early is like trying to learn calculus before arithmetic. In practice, the benefits of curriculum learning for LLMs are mixed — some teams report improvements, others find that random shuffling works as well. I’m still developing my intuition for when curriculum helps and when it doesn’t.
The honest truth about data mixing: no one is completely certain about the optimal proportions. Labs treat their data recipes as closely guarded secrets, and the few ablation studies that exist show that small changes in the mix can produce surprisingly large changes in downstream behavior. It’s one of those areas where a lot of practical knowledge exists that hasn’t been formalized into theory yet.
Wrap-Up
If you’re still with me, thank you. I hope it was worth it.
We started with a deceptively mundane question — “where does the training data come from?” — and traced a path through web crawling and quality filtering, BPE tokenization, the surprisingly powerful next-token prediction objective, the scaling laws that turned expensive gambling into engineering, the physical infrastructure of GPU clusters connected by NVLink and InfiniBand, the fragile stability of multi-month training runs, and the art of mixing data from diverse sources.
My hope is that the next time someone mentions “pretraining a large language model,” instead of picturing a mysterious black box, you’ll picture the full pipeline: messy web data being filtered and deduplicated, a tokenizer carving text into subword pieces, a transformer reading left-to-right and guessing what comes next, and a team of engineers watching loss curves at 3 AM, ready to roll back to a checkpoint if something spikes. That’s what’s under the hood. It’s engineering all the way down.
Resources and Credits
The resources below are the ones I found most illuminating while building this understanding.
- Kaplan et al., “Scaling Laws for Neural Language Models” (2020) — the O.G. paper that discovered the power laws. Dense but beautifully empirical.
- Hoffmann et al., “Training Compute-Optimal Large Language Models” (2022) — the Chinchilla paper. The single most impactful result in LLM training methodology.
- Touvron et al., “LLaMA: Open and Efficient Foundation Language Models” (2023) — a masterclass in practical training decisions, with more implementation details than most papers dare include.
- Penedo et al., “The RefinedWeb Dataset for Falcon LLM” (2023) — an insightful look at aggressive web data filtering. Changed how I think about data quality.
- Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways” (2022) — the most detailed account of training stability issues at extreme scale. The loss spike section alone is worth the read.
- Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” (2020) — wildly helpful for understanding distributed training memory optimization.