Overview

1 Understanding reasoning models

Large language models are entering a new phase centered on reasoning: making their intermediate steps explicit before giving an answer. In this context, a “reasoning model” is an LLM adapted to think step by step, which often boosts accuracy on complex tasks like math, coding, and logic puzzles. While these behaviors resemble human thought, they arise from probabilistic next-token prediction rather than deterministic, rule-based logic. This book takes a practical, code-first approach: starting from a pre-trained, instruction-tuned base model and implementing the components that enable clear, multi-step reasoning, with an emphasis on what works in practice for engineers and researchers.

The chapter first situates reasoning within the standard LLM pipeline: pre-training learns broad language patterns via next-token prediction; post-training (instruction and preference tuning) aligns the model to follow directions and human preferences. It then outlines three major avenues for enhancing reasoning beyond that baseline: inference-time compute scaling (e.g., chain-of-thought prompting, diverse sampling) that increases thinking at runtime without changing weights; reinforcement learning that updates weights using verifiable rewards distinct from RLHF’s human preference signals; and distillation that transfers reasoning behavior from larger models to smaller ones via supervised fine-tuning. The text clarifies how LLMs primarily match patterns rather than execute formal logic, which explains both their impressive fluency and their limitations on novel or deeply multi-step problems, and why they can appear to “reason” when training data contains many similar patterns.
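To make the inference-time compute scaling idea concrete, here is a minimal sketch of chain-of-thought prompting. The generate function, the prompts, and the example question are illustrative placeholders rather than the book's actual code; only the contrast between a direct prompt and a step-by-step prompt matters here.

def generate(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder: in the book this would call the instruction-tuned base model.
    return "<model output>"

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompt: the model answers immediately with few output tokens.
direct_answer = generate(f"Q: {question}\nA:")

# Chain-of-thought prompt: the same model is nudged to write intermediate steps
# first, trading extra output tokens (inference-time compute) for accuracy.
cot_answer = generate(
    f"Q: {question}\nLet's think step by step, then state the final answer.\nA:"
)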

Finally, the chapter motivates building reasoning models from scratch: industry momentum is shifting toward models that know when to think more or less, but reasoning adds cost and latency because it produces longer outputs and often requires multiple sampled solutions or verification passes. Reasoning is therefore best applied selectively, using the right tool for the task. The roadmap for the book is to load a capable base LLM, establish robust evaluations, then improve reasoning with inference-time methods followed by training approaches such as RL and distillation. By the end, readers will understand the trade-offs and be able to design, implement, and assess reasoning techniques that make LLMs more reliable on complex tasks.
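A rough back-of-envelope calculation illustrates why reasoning is best applied selectively. The token counts below are made-up, illustrative numbers, not measurements.

# Hypothetical output-token budgets (illustrative numbers only).
short_answer_tokens = 50        # a direct, non-reasoning reply
reasoning_trace_tokens = 800    # a reply that spells out intermediate steps
num_samples = 5                 # e.g., sample several solutions and pick one

direct_cost = short_answer_tokens
reasoning_cost = reasoning_trace_tokens * num_samples

print(f"direct: {direct_cost} tokens, reasoning: {reasoning_cost} tokens "
      f"({reasoning_cost / direct_cost:.0f}x more output to generate)")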

A simplified illustration of how a conventional, non-reasoning LLM might respond to a question with a short answer.
A simplified illustration of how a reasoning LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model combines several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
Overview of a typical LLM training pipeline. The process begins with an initial model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.
Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.
Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initial model training, pre-training, and post-training with instruction and preference tuning).
Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "Penguin can fly." This conclusion conflicts with the established fact "Penguin cannot fly," which results in a contradiction.
An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
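This decoding loop can be sketched in a few lines of Python. Here next_token_logits is a stand-in for one forward pass of the model, the vocabulary size and end-of-sequence token ID are arbitrary placeholders, and greedy argmax selection is used for simplicity (real systems typically sample).

VOCAB_SIZE = 50_000   # illustrative vocabulary size
EOS_TOKEN_ID = 1      # illustrative end-of-sequence token ID

def next_token_logits(token_ids: list[int]) -> list[float]:
    # Placeholder: a real model returns one logit per vocabulary entry.
    return [0.0] * VOCAB_SIZE

def generate_greedy(prompt_ids: list[int], max_new_tokens: int = 32) -> list[int]:
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)                     # full sequence so far
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)  # greedy pick
        token_ids.append(next_id)                                 # append and repeat
        if next_id == EOS_TOKEN_ID:
            break
    return token_ids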
A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs refers to improving a model so that it explicitly generates intermediate steps (chain-of-thought) before producing a final answer, which often increases accuracy on multi-step tasks such as math problems, logical puzzles, and coding challenges.
  • Reasoning in LLMs differs from rule-based reasoning and likely also differs from human reasoning; the current consensus is that it relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models.
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.

FAQ

What does “reasoning” mean in the context of LLMs?
Reasoning means the model makes its intermediate steps explicit before giving a final answer. Generating these steps (often called chain-of-thought) tends to improve accuracy on complex tasks like coding, logic puzzles, and multi-step math, though it does not imply human-like thinking.
Do LLMs actually think like humans when they show chain-of-thought?
No. “Reasoning” and “thinking” are used colloquially. LLMs generate text probabilistically, one token at a time, based on statistical patterns from training data; they do not apply human-like, rule-based reasoning or maintain an internal world model the way people do.
How is LLM “reasoning” different from traditional rule-based logic or theorem proving?
Rule-based systems are deterministic and follow explicit, verifiable rules that ensure consistency. LLMs are probabilistic autoregressive generators: their intermediate “steps” are not guaranteed to be logically sound, even if they look convincing.
What is the standard LLM training pipeline before adding reasoning methods?
It has two main stages: pre-training (next-token prediction on massive unlabeled text) and post-training, which includes supervised fine-tuning (instruction tuning) to follow tasks and preference tuning (e.g., RLHF) to align outputs with human preferences. A chat interface adds orchestration on top but is not part of the base model.
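As a rough sketch of what the pre-training objective looks like in code (assuming PyTorch, with random tensors standing in for a real model and real data, not the book's actual training loop):

import torch
import torch.nn.functional as F

# Next-token prediction: predict token t+1 from tokens 0..t.
# Shapes and tensors are placeholders; a real pipeline would compute model(inputs).
batch_size, seq_len, vocab_size = 2, 8, 50_000
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len + 1))

inputs = token_ids[:, :-1]   # positions 0 .. T-1
targets = token_ids[:, 1:]   # positions 1 .. T (shifted by one)

logits = torch.randn(batch_size, seq_len, vocab_size)   # stand-in for model(inputs)

# Cross-entropy over the vocabulary at every position is the pre-training loss.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))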
What are the main approaches to improving LLM reasoning?
Three broad categories: (1) inference-time compute scaling (test-time methods like chain-of-thought sampling and multi-sample selection), (2) reinforcement learning for reasoning (updating weights using verifiable rewards), and (3) distillation (SFT on high-quality reasoning traces from stronger models).
How does inference-time compute scaling help without retraining the model?
It trades more compute at inference for better performance. Examples include prompting the model to show steps, sampling multiple candidate solutions, using verifiers, and selecting the best answer, all without changing model weights.
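A minimal sketch of one such test-time method, self-consistency, appears below: sample several chain-of-thought solutions and keep the most common final answer. The helpers sample_solution and extract_final_answer are illustrative placeholders, and the "Final answer:" format is an assumption for the sketch.

from collections import Counter

def sample_solution(question: str) -> str:
    # Placeholder for sampling one chain-of-thought solution from the model.
    return "... intermediate steps ... Final answer: 42"

def extract_final_answer(solution: str) -> str:
    return solution.rsplit("Final answer:", 1)[-1].strip()

def self_consistency(question: str, num_samples: int = 8) -> str:
    answers = [extract_final_answer(sample_solution(question))
               for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority vote; weights unchanged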
How does reinforcement learning for reasoning differ from RLHF used in preference tuning?
Both use RL, but they differ in rewards. RLHF uses human judgments to shape style and helpfulness. Reasoning-focused RL typically relies on automated, verifiable rewards (e.g., correctness in math or code), directly training the model to solve tasks.
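As a toy illustration of a verifiable reward (in contrast to a learned human-preference reward), the check below scores an output by exact match against a known-correct answer. The "Final answer:" output format is an assumption made for the sketch.

def extract_final_answer(model_output: str) -> str:
    # Assumes the model ends its solution with "Final answer: <value>".
    return model_output.rsplit("Final answer:", 1)[-1].strip()

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    # 1.0 if the extracted answer matches the known-correct one, else 0.0;
    # no human preference labels are involved.
    return 1.0 if extract_final_answer(model_output) == reference_answer else 0.0

reward = verifiable_reward("... steps ... Final answer: 12", reference_answer="12")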
What’s the difference between pattern matching and logical reasoning in LLMs?
Pattern matching is next-token prediction from statistical correlations (e.g., “The capital of Germany is … Berlin”). Logical reasoning involves rule-based inference and tracking contradictions. LLMs often simulate reasoning via learned patterns rather than executing explicit logic.
Why build dedicated reasoning models if non-reasoning LLMs can sometimes “simulate” reasoning?
Pattern-based simulation works on familiar cases but tends to fail on novel or highly complex, multi-step problems. Dedicated reasoning methods improve robustness, accuracy, and generalization beyond what pure pattern matching provides.
When should I use reasoning models, and what are the cost trade-offs?
Use them for complex tasks (hard math, puzzles, challenging coding). They are often unnecessary for simpler tasks (summaries, translations, fact retrieval). Costs are higher because reasoning outputs are longer (more tokens, more forward passes) and workflows often require multiple samples or tool/verifier calls.
