Reasoning & Inference-Time Scaling
I avoided digging into inference-time reasoning for longer than I'd like to admit. Every time someone mentioned "chain-of-thought" or "thinking tokens," I nodded along, vaguely aware that it meant "make the model show its work." But when OpenAI's o1 scored 83% on the AIME math competition — an invitational exam for the strongest high school math students in the United States — while GPT-4o managed 13% on the same questions, I couldn't keep nodding. Those are the same transformer layers underneath. The difference is how much the model thinks before answering. So I finally dug in. Here's what I found.
Reasoning at inference time is a collection of techniques — some you control through prompts, some baked into the model's training — that let a language model spend more computation on harder problems before committing to an answer. The field exploded in 2022–2024, starting with chain-of-thought prompting (Wei et al., 2022) and culminating in models like OpenAI's o1/o3 and DeepSeek-R1 that are trained to reason internally. The core promise: a smaller model that thinks longer can outperform a larger model that answers immediately.
Before we start, a heads-up. We'll touch on reinforcement learning, tree search, and a bit of probability along the way — but you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.
This isn't a short journey, but I hope you'll be glad you came.
Contents
The Autoregressive Bottleneck
Chain-of-Thought: Giving the Model a Scratchpad
Self-Consistency: The Wisdom of Noisy Crowds
Rest Stop
Tree of Thoughts: When Reasoning Branches
Thinking Tokens and the o1 Paradigm
The Fourth Scaling Axis: Inference-Time Compute
Verifiers: Process Reward Models vs Outcome Reward Models
MCTS for Reasoning: Search Meets Language
The Self-Improvement Flywheel: STaR and GRPO
Measuring Progress: Mathematical Reasoning Benchmarks
Wrap-Up
Resources
The Autoregressive Bottleneck
Let's start with a scenario we'll carry through the whole section. Imagine we're building a math tutoring bot — something a student can ask word problems and get worked solutions back. Our first attempt is straightforward: feed the question in, get the answer out.
Here's a problem we'd like it to solve:
A bakery has 45 croissants. They sell 12 in the morning,
receive a delivery of 30 at noon, then sell 18 in the afternoon.
How many croissants remain?
If we give this to a standard language model — no special prompting, no tricks — it does a single forward pass. The question tokens flow through, say, 96 transformer layers, and out comes a prediction for the next token. Then another. Then another. Each forward pass takes the same amount of computation, whether the question is "What is 2 + 2?" or "Prove that there are infinitely many primes."
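To make "the same amount of computation" concrete, here's the generation loop stripped to its skeleton. This is a minimal sketch: `model` and `tokenizer` are hypothetical stand-ins, not any particular library's API.

# Minimal sketch of autoregressive decoding with hypothetical objects.
def generate(model, tokenizer, prompt, max_new_tokens=50):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)        # one pass through every layer, every time
        next_token = logits[-1].argmax()      # greedy: take the single most likely token
        tokens.append(next_token)
        if next_token == tokenizer.eos_id:    # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)
# Whether the prompt is "What is 2 + 2?" or a competition proof, each output
# token gets exactly one forward pass. There is no knob for "think longer."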
That's the bottleneck. Think of it like a chess player forced to play blitz — three seconds per move, no exceptions. For simple positions, three seconds is plenty. But for a complicated middle-game where you need to calculate five moves deep? Three seconds is not enough, no matter how talented you are. The player doesn't need more talent. They need more time.
A standard LLM is a blitz player. Every question gets the same fixed amount of computation. And for hard reasoning problems — multi-step arithmetic, logic chains, anything that requires holding intermediate results in mind — that fixed budget is not enough. The model might get it right sometimes (pattern matching from training data can be surprisingly powerful), but it will be unreliable. It'll solve the bakery problem correctly one run and output "55" the next.
I'll be honest — when I first encountered this framing, I didn't buy it. "What do you mean fixed computation? The model processes the whole prompt." But the key isn't the prompt. It's what happens between reading the question and producing the answer. In a standard generation, there's nothing between them. The answer starts immediately. That's the gap these techniques fill.
Chain-of-Thought: Giving the Model a Scratchpad
Back to our bakery tutor. Instead of asking for the answer directly, what if we ask the model to show its work first?
Prompt (direct):
Q: A bakery has 45 croissants. They sell 12 in the morning,
receive 30 at noon, sell 18 in the afternoon.
How many remain?
A:
Model output: "55" ← wrong
Prompt (chain-of-thought):
Q: A bakery has 45 croissants. They sell 12 in the morning,
receive 30 at noon, sell 18 in the afternoon.
How many remain?
A: Let's work through this step by step.
Model output:
"Start with 45 croissants.
Morning: 45 - 12 = 33
Noon delivery: 33 + 30 = 63
Afternoon: 63 - 18 = 45
The bakery has 45 croissants remaining." ← correct
That's chain-of-thought prompting, or CoT. Wei et al. formalized it in 2022, and Kojima et al. showed the same year that even the five words "Let's think step by step" can double or triple accuracy on arithmetic and logic benchmarks. Five words. No fine-tuning, no new architecture, no additional training data.
Why does this work? Here's the mechanical explanation. Remember our blitz chess player? Each token the model generates is another forward pass through the full transformer stack — another 96 layers of computation. When the model writes "45 - 12 = 33" before moving on to the next operation, that intermediate result is now in the context window. The next forward pass can attend to it. The model doesn't have to hold "33" in some invisible internal register; it wrote it down.
It's a scratchpad. You're giving the model a piece of paper to do its work on, rather than demanding it compute everything in its head.
Another way to think about it: each reasoning token is like upgrading our chess player from three-second blitz to a longer time control. The model that generates 50 tokens of reasoning before answering gets 50 additional forward passes — 50 × 96 layers of computation — compared to the model that answers immediately. That's not a small difference. It's a fundamentally different amount of work.
There are two flavors of CoT that matter in practice. Zero-shot CoT is what we showed above — append an instruction like "Let's think step by step" and let the model figure out what "showing work" means. Few-shot CoT is more powerful: you provide worked examples with explicit reasoning traces, so the model learns both the format and the depth of reasoning you expect.
# Few-shot CoT for our bakery tutor
# We provide one worked example, then the real question
prompt = """
Q: A store has 20 oranges. They sell 8 before lunch and
buy 15 more after lunch. How many oranges do they have?
A: The store starts with 20 oranges.
They sell 8: 20 - 8 = 12 oranges.
They buy 15: 12 + 15 = 27 oranges.
The answer is 27.
Q: A bakery has 45 croissants. They sell 12 in the morning,
receive 30 at noon, sell 18 in the afternoon.
How many remain?
A:"""
The model sees the format — break the problem into operations, compute each one, state the answer — and follows it. On the GSM8K benchmark (a collection of grade-school math word problems), few-shot CoT pushes accuracy from around 50–60% to over 80% on large models. That's the difference between a tutor that's wrong half the time and one that's wrong a fifth of the time.
But there's a catch. CoT isn't free. Every reasoning token costs money (if you're using an API) and adds latency. Worse, on problems that don't need multi-step reasoning — sentiment analysis, factual lookup, simple classification — CoT can actually hurt. The model talks itself into overthinking a question it would have gotten right with a snap judgment. Our blitz chess player doesn't need more time for a forced checkmate in one move. Sometimes the immediate answer is the right one.
The deeper issue, though, is reliability. Even with CoT, the model can make mistakes at any step. One arithmetic error in step 2 propagates through every subsequent step. And we have no way to catch it — the chain marches forward in one direction, never reconsidering.
We need a way to handle that.
Self-Consistency: The Wisdom of Noisy Crowds
Here's something that tripped me up for a while: if you ask the same model the same question ten times (with temperature > 0, so there's randomness in the generation), you get different reasoning chains. Not different wordings of the same chain — genuinely different approaches, different intermediate steps, sometimes different final answers.
Wang et al. (2022) had a beautifully simple insight about this. If the model has, say, a 60% chance of reaching the correct answer on any single attempt, then the errors are somewhat random — the model might miscalculate in step 3 on one attempt, use a wrong formula on another, get confused by the wording on a third. But the correct reasoning paths all converge on the same answer. So if you sample 10 chains and take the majority vote on the final answers, the correct answer tends to win.
This is self-consistency, and the math behind it is surprisingly clean. Let's work through it with our bakery tutor.
# We ask our tutor the bakery problem 10 times
# Each attempt generates a different reasoning chain
attempt_1: "45 - 12 = 33, 33 + 30 = 63, 63 - 18 = 45" → 45 ✓
attempt_2: "45 - 12 = 33, 33 + 30 = 63, 63 - 18 = 45" → 45 ✓
attempt_3: "45 - 12 = 32, 32 + 30 = 62, 62 - 18 = 44" → 44 ✗
attempt_4: "45 - 12 = 33, 33 + 30 = 63, 63 - 18 = 45" → 45 ✓
attempt_5: "45 + 30 = 75, 75 - 12 - 18 = 45" → 45 ✓
attempt_6: "45 - 12 = 33, 33 + 30 = 63, 63 - 18 = 45" → 45 ✓
attempt_7: "45 - 12 - 18 + 30 = 45" → 45 ✓
attempt_8: "45 - 12 = 33, 33 + 30 = 53, 53 - 18 = 35" → 35 ✗
attempt_9: "45 - 12 = 33, 33 + 30 = 63, 63 - 18 = 45" → 45 ✓
attempt_10: "45 - 12 = 33, 33 + 30 = 63, 63 - 18 = 45" → 45 ✓
Majority vote: 45 appears 8 times → answer is 45 ✓
Look at what happened. Attempt 3 made a subtraction error. Attempt 8 botched the addition. But those errors are different — they don't agree with each other. Meanwhile, the correct answer shows up 8 out of 10 times. The signal drowns out the noise.
The binomial math makes this intuitive. If the single-attempt accuracy is p and the attempts are independent, the probability that a strict majority of N attempts is correct is a binomial tail sum. For p = 0.6, that works out to roughly 63% at N = 10 and about 75% at N = 20. And that's the pessimistic version: in practice the wrong answers scatter (44 on one run, 35 on another), so the correct answer usually only needs a plurality, not a strict majority. You're trading compute for reliability — more attempts cost more, and the accuracy climbs.
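If you want to check those numbers yourself, the tail sum is a few lines of Python (same simplification as above: attempts are independent, and we require a strict majority):

from math import comb

def majority_correct(p, n):
    """P(strictly more than half of n independent attempts are correct)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 10, 20):
    print(n, round(majority_correct(0.6, n), 3))
# 1   0.6
# 5   0.683
# 10  0.633   <- dips below n=5 because a 5-5 tie counts as a loss here;
# 20  0.755      real plurality voting over scattered wrong answers does better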
The returns diminish, though. Going from 1 to 5 samples gives you the biggest jump. Going from 20 to 40 barely moves the needle. And there's a hard limit: if the model has a systematic blind spot — it consistently misunderstands a concept, not randomly — more samples won't help. You'll get the same wrong answer ten times. The samples from a single model aren't truly independent; they share the same training biases, the same conceptual gaps.
That systematic-error problem is real, and it's where self-consistency breaks down. For our bakery tutor, self-consistency works beautifully on arithmetic problems where errors are random. It struggles on problems where the model fundamentally misunderstands what's being asked. We need something more structured for those.
Rest stop. If you've made it this far, congratulations. You now have a working mental model of the two most practically important inference-time reasoning techniques: chain-of-thought (give the model a scratchpad) and self-consistency (sample multiple chains, majority vote). Together, these two ideas will handle the large majority of production reasoning use cases. If you're building something right now and need to ship, this is your toolkit.
It doesn't tell the complete story, though. We haven't talked about what happens when reasoning genuinely branches — when there are multiple valid approaches and you need to explore them. We haven't talked about models that are trained to think, not prompted. And we haven't talked about the most provocative finding: that a small model with a big thinking budget can outperform a large model that answers immediately.
If the discomfort of not knowing what's underneath is nagging at you, read on.
Tree of Thoughts: When Reasoning Branches
Chain-of-thought is linear. One step leads to the next leads to the next, like following a single path through a forest. But real problem-solving isn't always linear. Sometimes you reach a fork and need to try both paths. Sometimes you hit a dead end and need to back up. Sometimes the first approach you try is wrong, and the right answer requires a completely different starting move.
Let's give our bakery tutor a harder problem to see where linearity fails.
Using the numbers 3, 7, 8, and 2, and any arithmetic
operations (+, -, ×, ÷), make the number 24.
Each number must be used exactly once.
This is the Game of 24 — a classic puzzle that Yao et al. (2023) used to demonstrate Tree of Thoughts (ToT). A chain-of-thought approach would pick an operation, apply it, and keep going. But if the first operation was wrong — say, 3 + 7 = 10 — the model has no way to reconsider. It's stuck on that branch.
Tree of Thoughts treats the reasoning process as a search tree. Each node is a partial solution. At each node, the model generates several possible next steps (branches). Then — and this is the critical part — a separate evaluation judges which branches look promising. The search proceeds down the best-looking branches, and can abandon dead ends.
# Conceptual trace of Tree of Thoughts on Game of 24
# Numbers: 3, 7, 8, 2
Root: {3, 7, 8, 2} → goal: 24
Branch A: 8 × 3 = 24... but we still have 7 and 2 unused.
→ Dead end. Need to use all four numbers.
Branch B: 7 - 3 = 4, remaining {4, 8, 2}
Branch B1: 4 × 8 = 32, remaining {32, 2}
→ 32 - 2 = 30 ≠ 24. Dead end.
Branch B2: 8 × 2 = 16, remaining {4, 16}
→ 4 + 16 = 20 ≠ 24. Dead end.
Branch B3: 8 - 2 = 6, remaining {4, 6}
→ 4 × 6 = 24. ✓ Solution found!
Branch C: 8 ÷ 2 = 4, remaining {3, 7, 4}
Branch C1: 3 × 7 = 21, remaining {21, 4}
→ 21 + 4 = 25 ≠ 24. Close but wrong.
...
The search found the solution through Branch B3: (7 - 3) × (8 - 2) = 4 × 6 = 24. A linear chain-of-thought that happened to start with Branch A or B1 would have failed and produced a wrong answer. The tree allowed backtracking.
The search strategy matters. Breadth-first search (BFS) explores all branches at each level before going deeper — good when you're not sure which direction is right. Depth-first search (DFS) follows one branch all the way down before backing up — good when solutions are deep and you want to find one fast. In practice, ToT often uses a beam search: keep the top-k most promising branches at each step, discard the rest.
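Here's a minimal sketch of that beam-search flavor of ToT. The helpers `propose_steps` ("suggest a few candidate next steps") and `score_state` ("rate how promising this partial solution looks") stand in for LLM calls; the names are illustrative, not any particular library's API.

# Beam-search Tree of Thoughts, as a bare-bones sketch.
def tree_of_thoughts(problem, propose_steps, score_state, beam_width=3, max_depth=4):
    beam = [problem]                                   # each entry: problem + steps so far
    for _ in range(max_depth):
        candidates = []
        for state in beam:
            for step in propose_steps(state):          # branch: a few candidate next steps
                candidates.append(state + "\n" + step)
        # Evaluate every candidate and keep only the most promising few
        candidates.sort(key=score_state, reverse=True)
        beam = candidates[:beam_width]                 # prune the rest
    return beam[0]                                     # the highest-scoring chain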
I'll be honest — I still find it a bit strange to think of text generation as tree search. The analogy that helped me: imagine writing an essay where you're unsure about your thesis. Instead of committing to one thesis and writing the whole essay (chain-of-thought), you draft three different opening paragraphs with three different theses, evaluate which one is most promising, then continue from that one. If you get stuck later, you can go back and try thesis #2. That's Tree of Thoughts.
The cost is significant, though. Each branch requires LLM calls to generate, and each evaluation requires another LLM call (or a separate evaluator). A tree with branching factor 3 and depth 5 could require dozens of LLM calls for a single problem. For our bakery tutor answering simple arithmetic? Massive overkill. For solving complex puzzles, planning tasks, or problems with genuine decision points? It can be the difference between failure and success.
Thinking Tokens and the o1 Paradigm
Everything we've discussed so far is external — techniques you apply through prompting and sampling strategies. The model itself hasn't changed. Chain-of-thought, self-consistency, Tree of Thoughts — these are all things we do to coax better reasoning out of a standard language model.
But what if the model itself learned to reason? What if, instead of us telling it to "think step by step," it had been trained to generate an extended internal reasoning chain before producing its answer?
That's the paradigm shift behind OpenAI's o1 (September 2024) and o3 (announced December 2024), and it's the idea behind what people now call thinking tokens. These models generate a long chain of reasoning — sometimes hundreds or thousands of tokens — in an internal "thinking" phase before producing the visible response. The thinking happens in special tokens; the user sees a summary of the reasoning, not the raw chain.
Let's bring back our chess analogy. Standard CoT is like giving our blitz player a piece of paper during the game — they can jot down calculations, but they're still playing blitz with whatever skills they brought to the table. Thinking tokens are different. The player has been trained to think deeply. They've practiced analyzing positions, considering counterarguments, and revising their evaluations. They automatically take more time on complex positions and less on simple ones. The deliberation is internalized.
The results speak for themselves. On the 2024 AIME (American Invitational Mathematics Examination), o1 scored 83%. GPT-4o scored 13%. Same underlying transformer architecture, broadly the same scale of model. The difference is almost entirely in how much computation happens at inference time — and the fact that o1 was trained to use that computation effectively.
To see why this works mechanically, consider what a forward pass actually does. The question flows through the transformer's layers — let's say 96 of them — and comes out the other side. That's a fixed amount of computation. When the model generates a thinking token, that token flows through all 96 layers again. Then the next thinking token flows through again. A model that generates 500 thinking tokens before answering has done 500 × 96 layers of computation on the problem, compared to 1 × 96 for a model that answers immediately. The model is giving itself a variable-depth computation graph.
# Fixed computation (standard generation):
question → [96 layers] → answer_token_1 → [96 layers] → answer_token_2
# Each token gets exactly 96 layers of "thinking"
# Variable computation (thinking tokens):
question → [96 layers] → think_1 → [96 layers] → think_2 → ...
... → think_500 → [96 layers] → answer_token_1
# The answer benefits from 500 × 96 layers of prior computation
# Previous thinking tokens are in the context, attended to
This is profound, and I'm still developing my intuition for the full implications. The model dynamically allocates compute. An easy question gets 10 thinking tokens. A competition math problem gets 5,000. The cost scales with problem difficulty, not model size. And because the model was trained with reinforcement learning to use those thinking tokens effectively — not to ramble, not to repeat itself, but to actually make progress on the problem — the extra computation translates into better answers.
Anthropic's Claude has a similar capability with extended thinking. The implementation details differ, but the core insight is the same: let the model do variable-depth computation, and train it to use that depth wisely.
The Fourth Scaling Axis: Inference-Time Compute
For years, the recipe for better language models was: more parameters, more data, more training compute. Kaplan et al. (2020) showed that loss scales as a power law with these three axes. Hoffmann et al. (2022, the "Chinchilla" paper) refined the recipe: for a given training compute budget, there's an optimal balance between model size and data size. These are training-time scaling laws, and they drove the arms race of bigger models and larger datasets.
The 2024 breakthrough — formalized by Snell et al. in "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" — is that there's a fourth axis: inference-time compute. You can improve performance not only by making a better model, but by giving the same model more time to think.
This fourth axis manifests in all the techniques we've discussed: more reasoning tokens (chain-of-thought, thinking tokens), more samples (self-consistency, best-of-N), and search over reasoning paths (Tree of Thoughts, beam search). They all trade inference compute for better answers.
And here's the finding that reshapes how you think about deployment: inference-time compute can substitute for training-time compute. Snell et al. showed that a 7B-parameter model with smart search strategies can match or beat a model 14 times larger answering directly — on hard reasoning tasks. Not on every task. On easy questions, the big model answering in one shot still wins. But on the problems where reasoning matters, the small-model-that-thinks-harder holds its own.
This changes the economics. The old question was: "What's the biggest model I can afford to run?" The new question is: "What's the optimal split between model size and thinking budget?"
# The tradeoff, illustrated
# Option A: 70B model, direct answer
# Cost per query: ~140 TFLOPs
# Accuracy on hard math: 65%
# Option B: 7B model, 10 reasoning chains, pick the best
# Cost per query: ~14 TFLOPs × 10 + verification ≈ 155 TFLOPs
# Accuracy on hard math: 72%
# Similar total compute. Different allocation.
# Option B wins on hard tasks.
# Option A wins on easy tasks (no reasoning overhead needed).
# The practical takeaway: route easy queries to fast single-pass
# inference, and hard queries to extended reasoning pipelines.
# A difficulty classifier at the front of the pipeline saves money.
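A sketch of that routing idea, with `estimate_difficulty`, `fast_model`, and `reasoning_pipeline` as hypothetical stand-ins (the difficulty classifier could be a small fine-tuned model or even a crude heuristic):

def answer(question, fast_model, reasoning_pipeline, estimate_difficulty,
           hard_threshold=0.5):
    """Send easy questions through cheap single-pass inference and hard
    ones through an extended reasoning pipeline (CoT, sampling, verification)."""
    if estimate_difficulty(question) < hard_threshold:
        return fast_model.generate(question)        # one pass per token, low latency
    return reasoning_pipeline.solve(question)       # thinks longer, costs more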
My favorite thing about this result is how it democratizes access. If you can't afford to host a 70B model, you're not locked out of high-quality reasoning. A well-orchestrated 7B model with verification can get you there — at the cost of latency and some engineering complexity.
No one is completely certain where the ceiling of inference-time scaling lies. Training-time scaling laws have been studied for years and have clear empirical characterization. Inference-time scaling is newer, and the interaction between the two — how much training prepares a model to benefit from inference compute — is still being mapped out.
Verifiers: Process Reward Models vs Outcome Reward Models
We've been generating multiple reasoning chains. But how do we decide which one to trust? The majority-vote approach from self-consistency works when answers are discrete (a number, a multiple-choice option). But what about open-ended reasoning? What about chains that arrive at the right answer for the wrong reasons?
This is the verification problem, and it turns out to be at least as important as the generation problem. Think of it through our draft-and-revise analogy from earlier: generating reasoning chains is like writing multiple drafts of an essay. Verification is the editing — figuring out which draft is good and where the bad drafts went wrong.
There are two fundamentally different approaches to building a verifier.
An Outcome Reward Model (ORM) looks at the final answer and assigns a score. Did you get the right number? Is the code output correct? The ORM sees the complete solution and judges the endpoint. Training an ORM is straightforward — you need final-answer labels (correct or incorrect), and those are easy to get for math and code.
A Process Reward Model (PRM) does something much more granular. It evaluates each step of the reasoning process. "Step 1: correctly identified the formula. Step 2: correctly substituted values. Step 3: made an arithmetic error — 5² is 25, not 10." The PRM pinpoints where the reasoning went wrong.
Let's trace through an example from our bakery tutor to see the difference.
# A student asks: "What's the area of a circle with radius 5?"
reasoning_chain = [
"Step 1: The area formula is A = π × r²", # correct
"Step 2: Substituting r = 5: A = π × 5²", # correct
"Step 3: 5² = 10, so A = 10π", # ERROR (5² = 25)
"Step 4: 10π ≈ 31.4", # follows from error
"Final answer: 31.4" # wrong (should be 78.5)
]
# ORM evaluation:
# Sees "31.4" → correct answer is 78.5 → score: 0.0
# That's all it knows. No idea WHERE the reasoning failed.
# PRM evaluation:
# Step 1: score 0.97 (correct formula recall)
# Step 2: score 0.95 (correct substitution)
# Step 3: score 0.03 (ERROR FLAGGED: 5² ≠ 10)
# Step 4: score 0.10 (logically follows from step 3, but built on error)
# PRM knows the chain derailed at step 3.
Why does this matter for search? If you're generating multiple reasoning chains and want to pick the best one, the ORM approach is: generate 10 complete chains, check all the final answers, pick the best. The PRM approach is: generate step by step, evaluate after each step, abandon chains that go wrong early. Instead of wasting computation finishing a chain that made an error in step 2, you stop it there and redirect compute to more promising branches.
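Here's a minimal sketch of that PRM-guided style of generation, with `generate_next_step` and `prm_score` as assumed stand-ins for the generator LLM and a trained process reward model:

def generate_with_prm_pruning(problem, generate_next_step, prm_score,
                              n_chains=8, max_steps=10, min_step_score=0.2):
    chains = [[] for _ in range(n_chains)]
    weakest = [1.0] * n_chains          # the lowest step score each chain has received
    alive = [True] * n_chains
    for _ in range(max_steps):
        for i in range(n_chains):
            if not alive[i]:
                continue                           # already derailed: spend nothing more on it
            step = generate_next_step(problem, chains[i])
            score = prm_score(problem, chains[i], step)
            if score < min_step_score:
                alive[i] = False                   # the PRM flagged this step -- abandon the chain
            else:
                chains[i].append(step)
                weakest[i] = min(weakest[i], score)
    # Trust the surviving chain whose weakest step is strongest
    best = max(range(n_chains), key=lambda i: weakest[i] if alive[i] else 0.0)
    return chains[best]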
Lightman et al. (2023), in "Let's Verify Step by Step," showed that PRM-guided search significantly outperforms ORM-guided search on math reasoning. The reason is intuitive: you're not wasting tokens on doomed chains.
The cost of PRMs is the catch. To train an ORM, you need problems paired with correct/incorrect final answers — easy to obtain. To train a PRM, you need step-level annotations — a human (or automated system) marking each reasoning step as valid or invalid. That's much more labor-intensive. Some approaches estimate step correctness using Monte Carlo methods (run many completions from each step, see what fraction reach the correct final answer), which helps reduce the human labeling burden. But PRMs remain more expensive to build.
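Here's a sketch of that Monte Carlo labeling trick, assuming hypothetical helpers `complete_from` (sample a full solution continuing from a step prefix) and `extract_answer`:

# The "value" of a partial chain is the fraction of sampled completions
# that reach the correct final answer.
def estimate_step_value(problem, steps_so_far, correct_answer,
                        complete_from, extract_answer, n_rollouts=16):
    hits = 0
    for _ in range(n_rollouts):
        completion = complete_from(problem, steps_so_far)   # one sampled rollout
        if extract_answer(completion) == correct_answer:
            hits += 1
    return hits / n_rollouts   # approximately P(correct | chain so far): a soft step label

Every prefix of every training chain can be labeled this way with no human in the loop, at the price of a lot of rollouts per step.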
I'm still building my intuition for exactly when the PRM payoff justifies the training cost. For math and code, where steps are cleanly delineated and correctness is verifiable, the evidence is strong. For fuzzier domains — legal reasoning, medical diagnosis, open-ended analysis — the picture is less clear.
MCTS for Reasoning: Search Meets Language
If you've heard of AlphaGo, you've encountered Monte Carlo Tree Search (MCTS). It's the algorithm that lets a computer explore a massive game tree — far too large to search exhaustively — by strategically sampling promising branches and using random playouts to estimate their value.
The same idea applies to reasoning. Each reasoning state is a node. Each possible next step is a branch. The question is: which branch should we explore next?
MCTS balances exploration (trying branches we haven't visited much) with exploitation (deepening branches that have looked promising so far). It does this through four repeating phases: select a promising node, expand it by generating a new reasoning step, simulate a playout to see if the path leads somewhere good, and backpropagate the result to update our confidence in earlier choices.
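Here's the shape of those four phases in code. This is a bare-bones sketch, not a faithful AlphaGo reimplementation: `propose_step`, `rollout`, and `is_correct` are assumed LLM-backed helpers, and the selection rule is the standard UCT formula, which balances average reward against visit counts.

import math

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps            # reasoning steps taken to reach this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0              # sum of rollout rewards backed up through here

def uct(node, c=1.4):
    # Unvisited children are tried first; otherwise balance the average reward
    # below this node against how rarely we've explored it.
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(problem, propose_step, rollout, is_correct, iterations=100):
    root = Node(steps=[])
    for _ in range(iterations):
        # 1. Select: walk down the tree, always following the highest-UCT child
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expand: add one new candidate reasoning step below the selected node
        child = Node(node.steps + [propose_step(problem, node.steps)], parent=node)
        node.children.append(child)
        # 3. Simulate: finish the solution from here and check the final answer
        reward = 1.0 if is_correct(rollout(problem, child.steps)) else 0.0
        # 4. Backpropagate: update statistics along the path back to the root
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # The most-visited first step is the one the search ended up trusting most
    return max(root.children, key=lambda n: n.visits).steps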
I'll be the first to admit that visualizing tree search over text feels strange. It's easier when we think of it concretely. Let's say our bakery tutor is solving a harder problem:
# Problem: "A baker uses 3/4 of the flour to make bread and
# 1/3 of the remainder to make cookies. If 5 kg of flour is
# left, how much did the baker start with?"
# MCTS exploration (simplified)
# Node 0 (root): Problem statement
# → Expand: generate 3 possible first steps
# Branch A: "Let's call the starting flour x. After bread: x - 3x/4 = x/4"
# Simulate: the playout solves x/4 = 5 → 20 kg (it forgot the cookie
# step). Check: 20 × 1/4 = 5 left after bread, then 5 × 1/3 ≈ 1.67
# for cookies, leaving ≈3.33 ≠ 5 → WRONG
# Backpropagate: low score for Branch A
# Branch B: "Let total flour = x. Bread uses 3x/4. Remaining: x/4.
# Cookies use 1/3 of remainder = x/12."
# Simulate: "Left over = x/4 - x/12 = 3x/12 - x/12 = 2x/12 = x/6.
# x/6 = 5, so x = 30." → CHECK: 30 × 3/4 = 22.5 bread,
# 7.5 remaining, 7.5 × 1/3 = 2.5 cookies, 5 left ✓
# Backpropagate: high score for Branch B → SOLUTION FOUND
# Branch C: "First find what fraction of flour remains..."
# (not explored further — Branch B already succeeded)
The power of MCTS is that it doesn't commit to a single path. It allocates exploration budget proportional to how promising each branch looks. DeepMind's AlphaProof system uses MCTS combined with language models to search over mathematical proof steps — a domain where wrong turns are common and backtracking is essential.
For most production applications, MCTS is overkill. The overhead of maintaining a search tree and running rollouts makes it impractical for low-latency scenarios. But for hard reasoning tasks where accuracy matters more than speed — formal verification, theorem proving, competition math — it represents the state of the art.
The Self-Improvement Flywheel: STaR and GRPO
So far, we've either prompted the model to reason (CoT, self-consistency, ToT) or used search at inference time (MCTS, PRM-guided beam search). But there's a third angle: train the model to be a better reasoner.
The naive approach is supervised fine-tuning on human-written reasoning traces. Get humans to write step-by-step solutions to math problems, fine-tune on those. It works, but it's expensive — and the model can only be as good as the human demonstrations. What if the model could teach itself?
STaR — Self-Taught Reasoner (Zelikman et al., 2022) — creates a self-improvement loop. It goes like this: the model tries to solve problems, generating reasoning chains. We keep the chains that arrive at the correct answer and discard the rest. Then we fine-tune the model on the successful chains. The improved model tries again. Better model → better chains → better training data → even better model.
# The STaR loop (one iteration)
good_chains = []
for problem, correct_answer in training_set:
# Model generates a reasoning chain
chain = model.generate_reasoning(problem)
if chain.final_answer == correct_answer:
good_chains.append((problem, chain))
else:
# "Rationalization": give the model a hint
# "The answer is {correct_answer}. Now show the reasoning."
hinted_chain = model.generate_with_hint(problem, correct_answer)
if hinted_chain.final_answer == correct_answer:
good_chains.append((problem, hinted_chain))
# Fine-tune model on the good chains
model = finetune(model, good_chains)
# Repeat for multiple iterations
The rationalization step is clever. When the model fails a problem, instead of throwing it away entirely, we tell the model the correct answer and ask it to work backward — "given that the answer is 30, generate the reasoning." This generates training data even from problems the model initially couldn't solve, bootstrapping its way out of its own limitations.
GRPO — Group Relative Policy Optimization, used by DeepSeek to train DeepSeek-R1 — takes this even further by framing reasoning improvement as reinforcement learning. For each problem, generate a group of candidate solutions. The ones that reach the correct answer get positive reinforcement; the incorrect ones get negative reinforcement. No separate reward model needed. The "reward" is whether the answer is right or wrong.
# GRPO: the key idea in simplified form
import numpy as np

solutions = [model.generate(problem) for _ in range(8)]
rewards = np.array([1.0 if is_correct(s) else 0.0 for s in solutions])
# Normalize rewards within the group (the "group relative" part).
# The epsilon guards against a zero std when every sample gets the same
# reward (all right or all wrong) -- such a group carries no learning signal.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# Update: make above-average solutions more likely,
# below-average solutions less likely
for solution, advantage in zip(solutions, advantages):
    policy_gradient_update(model, solution, advantage)
The elegance is that you only need a correctness signal. Right or wrong. No human-written reasoning traces. No learned reward model. For domains where you can check answers programmatically — math, code, formal logic — this is enormously powerful. It's how DeepSeek-R1 was trained, and the results are competitive with models trained on far more human-annotated data.
The pattern across STaR, GRPO, and related methods (ReST, ReST-EM) is the same flywheel: generate → evaluate → keep the good ones → train → repeat. The model bootstraps its own reasoning ability. Each iteration produces better chains, which produce better training data, which produces a better model.
The risk is reward hacking. If your correctness signal has loopholes — the model discovers a shortcut that produces the right answer for the wrong reasons — RL will exploit it ruthlessly. In math, where ground truth is clean and unambiguous, this is less of a problem. In open-ended domains, defining "correct" is much harder, and the model will find every crack in your reward definition.
Measuring Progress: Mathematical Reasoning Benchmarks
How do we know any of this actually works? The field has converged on a few standard benchmarks, and watching the numbers climb tells the story of inference-time reasoning better than any abstract description.
GSM8K (Grade School Math 8K) is a collection of 8,500 grade-school math word problems, each requiring 2–8 steps of reasoning. When it was introduced, state-of-the-art models scored around 50–60%. With chain-of-thought prompting, that jumped to 80%+. With self-consistency, into the 90s. As of 2024, the best models are at 96–97%, and the benchmark is widely considered saturated — too easy for frontier models. That saturation itself is a testament to how far reasoning techniques have come.
MATH is a collection of 12,500 competition-level math problems — high school competitions through early undergraduate. This is much harder. General-purpose models sit around 70–75% as of 2024, with reasoning-trained models like o1 pushing higher. The gap between GSM8K (essentially saturated) and MATH (where a meaningful fraction of problems still goes unsolved) shows that while we've made enormous progress on multi-step arithmetic, genuine mathematical reasoning — creative problem decomposition, novel proof strategies, abstract thinking — remains a frontier.
AIME (American Invitational Mathematics Examination) is where the o1 result lives: 83% versus GPT-4o's 13%. These are competition problems designed to challenge top math students — not solvable by pattern matching alone, requiring genuine creative problem-solving.
There's an important caution about GSM8K in particular: data contamination. The benchmark has been publicly available for years, and there's evidence that some models have seen the test problems (or very similar ones) during training. Newer benchmarks like GSM1k — freshly generated problems with the same difficulty distribution — show that some models' "real" accuracy is 5–8 percentage points lower than their GSM8K scores. The skills are genuine, but the leaderboard numbers are slightly inflated.
Wrap-Up
If you're still with me, thank you. I hope it was worth it.
We started with the autoregressive bottleneck — a model that gives every question the same fixed amount of computation, like a chess player forced into permanent blitz. We gave it a scratchpad with chain-of-thought, let it vote across multiple attempts with self-consistency, and allowed it to explore and backtrack with Tree of Thoughts. Then we crossed into a different paradigm: models that are trained to think, allocating variable computation through thinking tokens. We discovered that inference-time compute is a fourth scaling axis — one that can substitute for model size itself. We learned to verify reasoning step by step with process reward models, to search over reasoning paths with MCTS, and to create self-improving reasoning loops with STaR and GRPO. And we measured it all against benchmarks that have tracked the field's remarkable climb from 50% to 97% on grade-school math in a few short years.
My hope is that the next time you see a model "reasoning" — whether it's a chain-of-thought prompt, an o1 thinking trace, or a self-consistency pipeline — instead of treating it as a black box, you'll have a pretty clear mental model of what's happening underneath. The blitz player got a longer time control. And it turns out that changes everything.
Resources
Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022) — the O.G. paper that started it all. Clean experiments, clear writing, and the results that made everyone rethink what prompting could do.
Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (2022) — the majority-vote idea. Beautifully simple, and the binomial analysis of why it works is insightful.
Lightman et al., "Let's Verify Step by Step" (2023) — the process reward model paper from OpenAI. The step-level verification results are striking, and the data collection methodology is worth studying.
Snell et al., "Scaling LLM Test-Time Compute Optimally" (2024) — the paper that formalized inference-time scaling laws. The finding that a small model with enough test-time search can beat a ~14× larger model answering directly is the headline, but the adaptive compute allocation framework is the real contribution.
Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (2023) — Tree of Thoughts. The Game of 24 experiments are wildly fun to read through.
Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning" (2022) — the self-taught reasoner paper. The rationalization trick (hindsight relabeling) is the unforgettable idea here.