1 Understanding reasoning models
This chapter introduces the book’s practical focus: extending large language models with reasoning capabilities. It defines reasoning in LLMs as generating intermediate steps—often called chain-of-thought—to arrive at answers, and contrasts this probabilistic, token-by-token process with deterministic, rule-based logic engines. The goal is to give practitioners a hands-on understanding of how to design, implement, and evaluate reasoning methods, clarifying both their strengths for complex tasks and their limitations compared to human-style reasoning.
The chapter reviews the standard LLM pipeline—pre-training for next-token prediction followed by post-training via instruction and preference tuning—and explains why these stages largely yield pattern matching rather than structured inference. Using simple examples, it shows how LLMs can appear to reason by leveraging statistical associations learned at scale, yet may miss contradictions or fail on novel or complex logical problems. This sets up a clear distinction: conventional models simulate reasoning when the patterns exist in data, whereas robust reasoning requires methods that promote explicit intermediate steps and verifiable outcomes.
Finally, the chapter outlines three complementary approaches for improving reasoning: inference-time compute scaling (e.g., step-by-step prompting and sampling strategies without changing weights), reinforcement learning (using rewards to learn reasoning strategies), and supervised fine-tuning/distillation (transferring reasoning behavior from stronger models). It stresses practical trade-offs—reasoning often increases cost and verbosity and isn’t always needed—and argues that building these systems from scratch is the best way to understand when to apply them. The chapter closes with a roadmap: start from a conventional LLM, establish evaluation methods, then implement inference and training techniques that reliably enhance reasoning.
A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.

Overview of a typical LLM training pipeline. The process begins with an initial model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.

In the pre-training stage, LLMs are trained on massive amounts (many terabytes) of unlabeled text, which includes books, websites, research articles, and many other sources. The pre-training objective for the LLM is to learn to predict the next word (or token) in these texts.
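To make the next-token prediction objective concrete, here is a minimal sketch in PyTorch. It assumes a `model` that maps a batch of token IDs to next-token logits; the function name and tensor shapes are illustrative, not the book's actual training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch_size, seq_len) tensor of token IDs.
    # Inputs are all tokens except the last; targets are the same
    # sequence shifted one position to the left.
    inputs = token_ids[:, :-1]
    targets = token_ids[:, 1:]

    logits = model(inputs)  # (batch_size, seq_len - 1, vocab_size)

    # Cross-entropy between the predicted next-token distribution
    # and the actual next token at every position.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    return loss
```

Minimizing this loss over large text corpora is all that pre-training amounts to; there is no explicit reasoning signal, which is why the resulting behavior is best described as pattern matching.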

Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.

Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "A penguin can fly." This conclusion conflicts with the established fact "A penguin cannot fly," which results in a contradiction.
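For contrast, a deterministic rule engine makes this contradiction explicit instead of smoothing over it. The toy forward-chaining sketch below assumes a hand-written encoding of the premises; the tuple format and the single rule are illustrative only.

```python
# Facts and rules for the penguin example, encoded explicitly.
facts = {("penguin", "is_a", "bird"), ("penguin", "cannot", "fly")}
rules = [
    # If X is a bird, then X can fly (the flawed premise).
    lambda f: {(x, "can", "fly") for (x, rel, y) in f
               if rel == "is_a" and y == "bird"},
]

# Forward chaining: apply the rules until no new facts are derived.
derived = set(facts)
changed = True
while changed:
    new = set().union(*(rule(derived) for rule in rules)) - derived
    changed = bool(new)
    derived |= new

# A rule-based engine flags the conflict deterministically.
contradictions = {
    x for (x, rel, y) in derived
    if rel == "can" and y == "fly" and (x, "cannot", "fly") in derived
}
print(contradictions)  # {'penguin'}
```

An LLM, by contrast, produces whatever continuation is most probable given its training data, so it may or may not surface the inconsistency.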

An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.

Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and supervised fine-tuning/distillation) are typically applied after the conventional training stages (model initialization, pre-training, and post-training with instruction and preference tuning).
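As a concrete example of inference-time compute scaling, the sketch below combines chain-of-thought prompting with self-consistency: several reasoning chains are sampled, and the most frequent final answer wins. It assumes a placeholder `generate(prompt, temperature=...)` function for the LLM call and the simplifying convention that the final line of each completion contains the answer; both are illustrative assumptions.

```python
from collections import Counter

def self_consistency_answer(generate, question, num_samples=5):
    # Chain-of-thought prompt: ask the model to reason step by step.
    prompt = f"{question}\nLet's think step by step."

    answers = []
    for _ in range(num_samples):
        # `generate` stands in for any sampling-based LLM call
        # (temperature > 0 so that each reasoning chain can differ).
        completion = generate(prompt, temperature=0.8)
        lines = completion.strip().splitlines()
        # Assume the final line of the completion holds the answer.
        answers.append(lines[-1] if lines else "")

    # Majority vote over the sampled answers (self-consistency).
    return Counter(answers).most_common(1)[0][0]
```

Note that the model weights never change; the approach simply spends more compute at inference time.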

Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
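The following sketch shows this decoding loop with greedy selection, assuming a PyTorch `model` that returns logits for every position; real systems typically sample from the distribution rather than always taking the argmax.

```python
import torch

@torch.no_grad()
def generate_greedy(model, token_ids, max_new_tokens, eos_id=None):
    # token_ids: (1, seq_len) tensor holding the tokenized prompt.
    for _ in range(max_new_tokens):
        logits = model(token_ids)          # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :]     # distribution over the next token
        next_id = next_logits.argmax(dim=-1, keepdim=True)  # greedy pick

        # Append the new token and feed the extended sequence back in.
        token_ids = torch.cat([token_ids, next_id], dim=1)

        if eos_id is not None and next_id.item() == eos_id:
            break
    return token_ids
```

Reasoning-focused models use the same loop; they simply spend more of these steps on intermediate reasoning tokens before emitting the final answer.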

A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced by the reasoning methods in stages 3 and 4.

Summary
- Conventional LLM training occurs in several stages:
  - Pre-training, where the model learns language patterns from vast amounts of text.
  - Instruction fine-tuning, which improves the model's responses to user prompts.
  - Preference tuning, which aligns model outputs with human preferences.
- Reasoning methods are applied on top of a conventional LLM.
- Reasoning in LLMs involves systematically solving multi-step tasks using intermediate steps (chain-of-thought).
- Reasoning in LLMs differs from rule-based reasoning and likely also works differently from human reasoning; the current consensus is that LLM reasoning relies on statistical pattern matching.
- Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
- Improving reasoning in LLMs can be achieved through:
  - Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
  - Reinforcement learning, training models explicitly with reward signals.
  - Supervised fine-tuning and distillation, using examples from stronger reasoning models (see the sketch after this list).
- Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.
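As a sketch of the distillation idea mentioned above, the snippet below collects chain-of-thought responses from a stronger teacher model and stores them in a simple instruction/output format for supervised fine-tuning. Here, `teacher_generate` is a hypothetical placeholder for the teacher-model call, and the dataset format is one common convention, not the book's prescribed one.

```python
import json

def build_distillation_dataset(teacher_generate, questions, out_path):
    """Collect chain-of-thought answers from a stronger teacher model
    to use as supervised fine-tuning targets for a smaller student."""
    examples = []
    for question in questions:
        # `teacher_generate` stands in for a call to the stronger
        # reasoning model; it is assumed to return the full reasoning
        # trace plus the final answer as one string.
        response = teacher_generate(
            f"{question}\nPlease reason step by step before answering."
        )
        examples.append({"instruction": question, "output": response})

    # Store in a simple instruction/output format commonly used for SFT.
    with open(out_path, "w") as f:
        json.dump(examples, f, indent=2)
    return examples
```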