Overview

1 Understanding reasoning models

This chapter introduces reasoning as the next stage in large language models, defining it as the practice of making intermediate steps explicit before delivering an answer. Often called chain-of-thought, this behavior can boost accuracy on complex tasks, but it remains fundamentally probabilistic: models generate token by token based on patterns learned during training rather than executing guaranteed, rule-based logic. The book takes a hands-on, code-first approach aimed at engineers and researchers, focusing on how to implement and evaluate practical reasoning methods from scratch rather than debating philosophical definitions.

To ground the discussion, the chapter reviews the conventional LLM pipeline: large-scale pre-training for next-token prediction followed by post-training via instruction tuning and preference tuning to align with user intents. It contrasts pattern matching—strong statistical associations that produce fluent text—with logical reasoning that requires structured intermediate steps and contradiction handling. While modern LLMs can often simulate reasoning by leveraging familiar patterns from their data, they are most brittle when scenarios are novel or require deep, multi-step inference, underscoring the need for targeted methods that go beyond surface correlations.
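To make the pre-training objective concrete, here is a minimal sketch of next-token prediction expressed as a cross-entropy loss, assuming a PyTorch-style causal language model; `model` is a placeholder for any network that maps token IDs to vocabulary logits, not a specific implementation from the book.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # Next-token prediction: inputs are all tokens except the last, and the
    # targets are the same sequence shifted one position to the left.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        targets.reshape(-1),                  # matching target token IDs
    )
```

Instruction fine-tuning later applies the same token-prediction objective to curated prompt-response pairs, while preference tuning optimizes a separate alignment objective.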

The chapter then outlines three complementary strategies for improving reasoning: inference-time compute scaling (trading extra generation and sampling for better answers without changing weights), reinforcement learning for reasoning (training with objective, verifiable rewards to shape solution strategies, distinct from preference-tuning RLHF), and distillation (transferring high-quality reasoning traces from stronger models to smaller ones via supervised fine-tuning). It frames the practical trade-offs—reasoning often costs more due to longer outputs and multi-call workflows—and emphasizes using the right tool for the task, since “thinking longer” is not always beneficial. Finally, it previews the book’s roadmap: start from a conventional base model, establish evaluation methods, apply inference techniques to elicit reasoning behaviors, and then use training methods to build dedicated reasoning models.
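As a concrete illustration of inference-time compute scaling, the sketch below samples several chain-of-thought completions and keeps the most common final answer (often called self-consistency). The `generate_answer` function is a hypothetical stand-in for any sampling-based generation call that returns a final answer string; it is not an API defined in the book.

```python
from collections import Counter

def self_consistent_answer(generate_answer, prompt, num_samples=8):
    # Elicit step-by-step reasoning, then sample multiple candidate solutions.
    cot_prompt = prompt + "\nLet's think step by step."
    answers = [generate_answer(cot_prompt, temperature=0.8)
               for _ in range(num_samples)]
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0]
```

Note that this spends num_samples times as many tokens as a single call, which is exactly the cost trade-off described above.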

Figures in this chapter:

  • A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
  • Overview of a typical LLM training pipeline. The process begins with a model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to better follow human instructions and align with human preferences.
  • Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer that does not directly follow the instruction. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response with a friendly tone and engaging language, making the answer more relatable and user-centered.
  • Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initialization, pre-training, and post-training with instruction and preference tuning).
  • Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "A penguin can fly." This conclusion conflicts with the established fact "Penguins cannot fly," which results in a contradiction.
  • An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
  • Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark, depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models; a minimal decoding loop sketch follows this list.
  • A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.
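The token-by-token generation described above can be sketched as a simple greedy decoding loop. This is an illustrative, minimal version assuming a PyTorch-style causal model that returns next-token logits; `model` and `eos_token_id` are placeholders.

```python
import torch

@torch.no_grad()
def greedy_decode(model, token_ids, max_new_tokens=50, eos_token_id=None):
    # token_ids has shape (1, seq_len); each iteration appends one new token.
    for _ in range(max_new_tokens):
        logits = model(token_ids)                                # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        token_ids = torch.cat([token_ids, next_id], dim=1)       # extend the sequence
        if eos_token_id is not None and next_id.item() == eos_token_id:
            break  # stop once the model emits the end-of-sequence token
    return token_ids
```

Reasoning models use the same loop; they simply generate many more intermediate tokens before the final answer.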

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs involves systematically solving multi-step tasks using intermediate steps (chain-of-thought).
  • Reasoning in LLMs differs from rule-based reasoning and likely works differently from human reasoning; the current consensus is that it relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models.
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.

FAQ

What does “reasoning” mean for LLMs in this book?
Reasoning means the model shows its intermediate steps before giving a final answer (often called chain-of-thought). Producing these steps can improve accuracy on complex tasks like coding, logic puzzles, and multi-step math. The book uses “reasoning” and “thinking” in the common LLM sense, without claiming human-like cognition.

How do LLM reasoning steps differ from symbolic or human reasoning?
Symbolic engines follow deterministic, rule-based procedures with guarantees. LLMs generate tokens autoregressively based on probabilities learned from data, so their “reasoning steps” can look persuasive but aren’t guaranteed to be logically sound or consistent.

What are the standard training stages of an LLM before reasoning methods are added?
Two stages: pre-training (next-token prediction on massive text to learn general language ability) and post-training, which includes instruction tuning (SFT) to follow instructions and preference tuning (often RLHF) to align outputs with human preferences. A chatbot experience adds an extra orchestration layer on top of these.

What are the main approaches to improve LLM reasoning after conventional training?
The chapter groups them into three categories:
  • Inference-time compute scaling (test-time scaling, CoT prompting, diverse sampling), which doesn’t change the model weights.
  • Reinforcement learning, which updates weights using rewards from tasks or verifiers.
  • Distillation, i.e., SFT on high-quality data generated by stronger teacher models (sketched below).

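To illustrate the distillation category, the sketch below collects reasoning traces from a stronger teacher model and packages them as supervised fine-tuning examples for a smaller student. `teacher_generate` and `fine_tune` are hypothetical placeholders, not functions from a specific library.

```python
def build_distillation_dataset(teacher_generate, problems):
    # Ask the teacher for worked solutions that include intermediate steps,
    # then store them as (prompt, completion) pairs for supervised fine-tuning.
    dataset = []
    for problem in problems:
        trace = teacher_generate(problem + "\nShow your reasoning step by step.")
        dataset.append({"prompt": problem, "completion": trace})
    return dataset

# The student is then trained on these pairs with ordinary SFT, e.g.
# fine_tune(student_model, dataset)  # hypothetical training call
```
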
How is RL for reasoning different from RLHF used in preference tuning?
Both use RL, but the reward source differs. RLHF uses human judgments and rankings to align style and preferences. Reasoning-focused RL typically uses automated or verifiable rewards (e.g., correctness in math or coding), encouraging strategies that solve tasks rather than merely align with human preferences.

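A minimal sketch of a verifiable reward, assuming the model is prompted to end its solution with a "#### <answer>" marker (an illustrative convention, not a fixed standard): the reward is 1 if the extracted answer matches the ground truth and 0 otherwise, with no human preference model involved.

```python
def correctness_reward(model_output: str, ground_truth: str) -> float:
    # Take the text after the final "####" marker as the model's answer.
    answer = model_output.rsplit("####", 1)[-1].strip()
    # Binary, automatically checkable reward: correct or not.
    return 1.0 if answer == ground_truth.strip() else 0.0
```
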
How do pattern matching and logical reasoning differ, and what does the “penguin” example show?
Pattern matching predicts likely continuations from training correlations. Logical reasoning applies rules, tracks intermediate steps, and detects contradictions. In a closed-world reading of “All birds can fly; a penguin is a bird,” the valid conclusion is that penguins can fly. With open-world knowledge (penguins can’t fly), a reasoning system should flag the contradiction; LLMs typically don’t track this explicitly.

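The penguin example can be made concrete with a toy rule-based check: derive the closed-world conclusion from the premises, then compare it against known facts and flag the mismatch. This only illustrates what explicit contradiction tracking looks like; it is not how an LLM operates internally.

```python
premises = {"all birds can fly": True, "a penguin is a bird": True}
known_facts = {"penguins can fly": False}

# Closed-world inference: the conclusion follows if both premises hold.
derived = {"penguins can fly": premises["all birds can fly"]
           and premises["a penguin is a bird"]}

for statement, value in derived.items():
    if statement in known_facts and known_facts[statement] != value:
        print(f"Contradiction: derived '{statement}' = {value}, "
              f"but the known fact says {known_facts[statement]}")
```
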
Why do non-reasoning models sometimes appear to reason correctly?
They often simulate reasoning via strong statistical associations learned at scale (e.g., “penguins” ↔ “cannot fly”). This works in familiar contexts but is brittle for novel scenarios or problems requiring intricate, multi-step logic.

When should I use a reasoning model, and what are the trade-offs?
Use reasoning models for complex tasks (puzzles, advanced math, difficult coding). For simpler tasks (summarization, translation, fact-based Q&A), conventional models are often sufficient. Reasoning models can be more verbose, slower, costlier, and sometimes prone to “overthinking.”

Why are reasoning models more expensive to run?
Two main reasons:
  • Longer outputs include intermediate steps, meaning more tokens and more forward passes.
  • Many workflows sample multiple candidates, call tools, or run verifiers, multiplying token usage and inference calls.

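A rough back-of-the-envelope calculation makes the multiplication effect visible. The token counts and the price per 1,000 output tokens below are made-up numbers used purely for illustration.

```python
price_per_1k_tokens = 0.01   # hypothetical price per 1,000 output tokens

standard_tokens = 200        # direct answer, no intermediate steps
reasoning_tokens = 1_500     # answer plus chain-of-thought
num_samples = 8              # candidates sampled in a multi-call workflow

cost_standard = standard_tokens / 1000 * price_per_1k_tokens
cost_reasoning = reasoning_tokens * num_samples / 1000 * price_per_1k_tokens
print(f"standard: ${cost_standard:.4f}  reasoning with sampling: ${cost_reasoning:.4f}")
# Longer outputs and multiple samples multiply token usage, hence the higher cost.
```
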
What roadmap does the book propose for building reasoning models from scratch?
Stage 1: start with a conventional LLM (already pre-trained and instruction-tuned). Stage 2: set up evaluation to track reasoning improvements. Stage 3: apply inference-time techniques to boost reasoning behavior. Stage 4: use training methods (e.g., RL) to develop explicit reasoning capabilities.
