1 Understanding reasoning models
This chapter introduces reasoning as the next major step in large language models, framing it as the ability of an LLM to produce intermediate steps before giving a final answer. These step-by-step explanations, often called chain-of-thought reasoning, can improve performance on complex tasks such as math, coding, and logic puzzles. However, the chapter emphasizes that LLM reasoning is not the same as human or symbolic reasoning: models generate tokens probabilistically based on learned patterns, so their reasoning steps can appear convincing without being guaranteed logically correct.
The chapter also explains the standard LLM training pipeline that reasoning methods build upon. Conventional LLMs are first pre-trained on massive text datasets to predict the next token, which gives them broad language capabilities and some emergent skills. They are then post-trained through instruction tuning and preference tuning so they can better follow user requests and produce responses aligned with human preferences. Reasoning improvements are typically added after these stages through three broad approaches: inference-time compute scaling, reinforcement learning with verifiable rewards or environments, and distillation from stronger models into smaller ones.
A key theme is the distinction between pattern matching and logical reasoning. Standard LLMs often appear to reason because they have learned many reasoning-like patterns from training data, but they do not explicitly apply formal rules or track contradictions like a symbolic system would. This makes them powerful in familiar situations but less reliable when problems are novel or require complex multi-step inference. The chapter argues that building reasoning models from scratch is valuable because it reveals how these methods work, clarifies their trade-offs, and helps practitioners decide when reasoning models are worth their higher cost, verbosity, and risk of overthinking.
A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
Overview of a typical LLM training pipeline. The process begins with a randomly initialized model, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.
Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.
Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initial model training, pre-training, and post-training with instruction and preference tuning).
Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "Penguin can fly." This conclusion conflicts with the established fact "Penguin cannot fly," which results in a contradiction.
An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
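The decoding loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real LLM: `TOY_MODEL`, `next_token_probs`, and `generate` are hypothetical names, and the lookup table conditions only on the last token, whereas an actual model conditions on the full sequence using a neural network.

```python
import random

# Hypothetical toy "model": maps the last token to candidate next tokens
# with probabilities. This table is only a stand-in for a trained network.
TOY_MODEL = {
    "<bos>": {"The": 1.0},
    "The": {"capital": 1.0},
    "capital": {"of": 1.0},
    "of": {"Germany": 1.0},
    "Germany": {"is": 1.0},
    "is": {"Berlin": 0.9, "Munich": 0.1},
    "Berlin": {"<eos>": 1.0},
    "Munich": {"<eos>": 1.0},
}


def next_token_probs(tokens):
    """Return a next-token distribution for the sequence generated so far."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, greedy=True, seed=0):
    """Append one token per step until <eos> or the token budget is reached."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if greedy:
            # Greedy decoding: always pick the most probable token.
            token = max(probs, key=probs.get)
        else:
            # Sampling: draw a token in proportion to its probability.
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<eos>":
            break
        # The new token becomes part of the input for the next step.
        tokens.append(token)
    return tokens


print(generate(["<bos>", "The", "capital", "of"]))
# ['<bos>', 'The', 'capital', 'of', 'Germany', 'is', 'Berlin']
```

With `greedy=False`, the same loop samples from the distribution instead, so repeated runs with different seeds may end in a different token.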
A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.
Summary
- Conventional LLM training occurs in several stages:
  - Pre-training, where the model learns language patterns from vast amounts of text.
  - Instruction fine-tuning, which improves the model's responses to user prompts.
  - Preference tuning, which aligns model outputs with human preferences.
- Reasoning methods are applied on top of a conventional LLM.
- Reasoning in LLMs involves systematically solving multi-step tasks using intermediate steps (chain-of-thought).
- Reasoning in LLMs differs from rule-based reasoning and likely also from human reasoning; the prevailing view is that it relies on statistical pattern matching.
- Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
- Improving reasoning in LLMs can be achieved through:
  - Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
  - Reinforcement learning, training models explicitly with reward signals.
  - Supervised fine-tuning and distillation, using examples from stronger reasoning models.
- Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.
FAQ
What does “reasoning” mean in the context of LLMs?
In this chapter, reasoning means that an LLM explains intermediate steps before giving a final answer. These step-by-step explanations are often called chain-of-thought (CoT). The goal is not to claim that LLMs reason like humans, but to describe models that produce reasoning-like traces that can improve accuracy on complex tasks such as math, coding, and logical puzzles.
What is chain-of-thought reasoning?
Chain-of-thought, or CoT, is a method where a model generates intermediate steps or explanations before producing its final answer. Instead of jumping directly to the result, the model appears to “think” through the problem step by step. This can make the answer easier to follow and can improve performance on multi-step tasks.
Do LLMs reason the same way humans do?
No. The chapter emphasizes that LLM reasoning can look similar to human reasoning because the model may articulate intermediate steps. However, current LLMs generate text probabilistically, one token at a time, based on statistical patterns in training data. Humans can deliberately apply logical rules, use world models, and consciously manipulate concepts, while LLMs rely on learned correlations.
How are conventional LLMs usually trained?
Conventional LLM training usually has two major stages: pre-training and post-training. During pre-training, the model learns to predict the next token from massive amounts of unlabeled text. During post-training, supervised instruction fine-tuning and preference tuning (often implemented with RLHF) help the model follow instructions and produce responses that better match user preferences.
What is the difference between pre-training and post-training?
Pre-training teaches an LLM raw language prediction by exposing it to large-scale text and training it to predict the next token. This gives the model basic language capabilities and emergent abilities such as translation or code generation. Post-training then refines the model so it can better follow user instructions and produce more helpful, preferred, and stylistically appropriate responses.
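To make the next-token objective concrete, the sketch below builds input/target pairs by shifting a token sequence by one position and computes the average cross-entropy loss. It is a simplified illustration in plain Python: the token IDs, the three-token vocabulary, and the hand-written logits are all made up, and the helper names are hypothetical. A real pre-training run would get its logits from a neural network and use a deep learning framework.

```python
import math


def make_training_pairs(token_ids):
    """Inputs are all tokens but the last; targets are shifted by one."""
    return token_ids[:-1], token_ids[1:]


def softmax(logits):
    """Convert raw scores into probabilities (shifted by max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, targets):
    """Average negative log-probability assigned to the correct next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


token_ids = [0, 2, 1]  # made-up token IDs from a 3-token vocabulary
inputs, targets = make_training_pairs(token_ids)  # [0, 2] and [2, 1]

# Made-up logits: one row of vocabulary scores per input position.
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]

print(next_token_loss(uniform_logits, targets))    # equals ln(3), about 1.0986
print(next_token_loss(confident_logits, targets))  # much smaller loss
```

A model that spreads probability uniformly over the vocabulary incurs a loss of ln(3) here, while one that concentrates probability on the correct next tokens incurs a much lower loss. Pre-training adjusts the weights to push the loss in that direction.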
What are the main approaches for improving reasoning in LLMs?
The chapter groups reasoning-improvement methods into three broad categories: inference-time compute scaling, reinforcement learning, and distillation. Inference-time compute scaling improves performance during inference without changing model weights. Reinforcement learning updates model weights based on reward signals. Distillation transfers reasoning patterns from larger, stronger models into smaller or more efficient models.
What is inference-time compute scaling?
Inference-time compute scaling refers to methods that improve a model’s reasoning performance when the model is being used, without retraining or modifying its weights. The basic idea is to spend more computation during inference to get better answers. Examples include chain-of-thought prompting, generating multiple candidate solutions, sampling strategies, tool use, or using verifiers.
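One simple way to spend more computation at inference time is to sample several candidate answers and keep the most frequent one, a technique sometimes called majority voting or self-consistency. The sketch below illustrates the idea under stated assumptions: `sample_answer` is a hypothetical stand-in for a sampled LLM response, with made-up answer probabilities, and is not a call into any real model.

```python
import random
from collections import Counter


def sample_answer(question, rng):
    """Hypothetical stand-in for sampling one final answer from an LLM.

    The candidate answers and their weights are made up for illustration.
    """
    return rng.choices(["42", "41", "24"], weights=[0.6, 0.2, 0.2])[0]


def majority_vote(answers):
    """Return the most common answer and the fraction of samples agreeing."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def answer_with_voting(question, num_samples=5, seed=0):
    """Sample several answers for one question and aggregate them by vote."""
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(num_samples)]
    return majority_vote(answers)


print(majority_vote(["42", "41", "42"]))  # winner is "42" with 2 of 3 votes
print(answer_with_voting("What is 6 * 7?", num_samples=11))
```

The model weights never change in this setup. The only cost is the extra sampling, which grows linearly with `num_samples`.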
How does reinforcement learning for reasoning differ from RLHF?
Both use reinforcement learning, but they differ in the reward source. RLHF, used in preference tuning, relies on human judgments or rankings to guide the model toward preferred behavior. Reinforcement learning for reasoning often uses automated or environment-based rewards, such as whether a math answer is correct or whether code passes tests. This makes reasoning rewards more directly verifiable in some tasks.
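A verifiable reward for a math task can be as simple as checking the model's final answer against a known reference. The following is a minimal sketch, assuming a hypothetical output convention in which the response ends with `Answer: <number>`. The function names and the binary 0/1 reward are illustrative choices, not a description of any particular training setup.

```python
import re


def extract_final_answer(response):
    """Pull the number after 'Answer:' (an assumed output convention)."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    return match.group(1) if match else None


def verifiable_reward(response, reference):
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        # No parsable answer: treat as incorrect.
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0


print(verifiable_reward("2 + 2 is 4. Answer: 4", "4"))  # 1.0
print(verifiable_reward("I think it is 5. Answer: 5", "4"))  # 0.0
print(verifiable_reward("Not sure.", "4"))  # 0.0
```

No human judgment is needed to compute this reward, which is what distinguishes it from the preference-based signal used in RLHF.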
How is reasoning different from pattern matching?
Pattern matching means the model generates text that is statistically consistent with patterns learned during training. For example, answering “Berlin” to “The capital of Germany is…” reflects a strong learned association. Logical reasoning, by contrast, involves systematically deriving conclusions from premises, tracking implications, and noticing contradictions. LLMs can simulate this behavior, but they do not necessarily execute explicit rule-based logic internally.
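For contrast, here is what explicit rule-based inference looks like, using the penguin example from this chapter. This is a small sketch of forward chaining with a contradiction check, written for illustration only. The fact and rule encodings are assumptions, and nothing here reflects how an LLM works internally.

```python
# Each fact is (subject, predicate, truth_value).
facts = {
    ("penguin", "is_bird", True),
    ("penguin", "can_fly", False),  # established fact
}

# Rule "All birds can fly": if X is_bird is True, then X can_fly is True.
rules = [(("is_bird", True), ("can_fly", True))]


def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (if_pred, if_val), (then_pred, then_val) in rules:
            for subject, pred, val in list(derived):
                if pred == if_pred and val == if_val:
                    new_fact = (subject, then_pred, then_val)
                    if new_fact not in derived:
                        derived.add(new_fact)
                        changed = True
    return derived


def find_contradictions(facts):
    """Return (subject, predicate) pairs asserted as both True and False."""
    return sorted({(s, p) for s, p, v in facts if (s, p, not v) in facts})


derived = forward_chain(facts, rules)
print(find_contradictions(derived))  # [('penguin', 'can_fly')]
```

The symbolic system detects the conflict mechanically because it tracks every derived fact. An LLM has no such bookkeeping and can only flag the contradiction if doing so matches patterns it has learned.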
Why build reasoning models from scratch?
Building reasoning models from scratch helps reveal how reasoning methods actually work and what trade-offs they involve. Reasoning models can be powerful for complex tasks, but they are often more expensive, more verbose, and sometimes prone to overthinking. Implementing the methods step by step makes it easier to understand when reasoning is useful, how it improves performance, and why it may not be necessary for simpler tasks such as summarization or translation.