1 Overview
Reinforcement learning from human feedback (RLHF) is a method for injecting human preferences into AI systems to handle objectives that are hard to specify explicitly. It rose to prominence by enabling conversational models to move from raw next-token completion to helpful, safe, and engaging dialogue, becoming a cornerstone of modern “post-training.” At a high level, RLHF helps models internalize the style, tone, and behavioral norms people prefer across domains, turning broadly capable base models into reliable, general-purpose assistants.
The canonical RLHF pipeline has three stages: instruction/supervised fine-tuning to teach basic instruction following, reward modeling from human preference data, and reinforcement learning to optimize model responses against the learned reward. In today’s post-training stack, RLHF largely occupies the preference fine-tuning stage, complemented by instruction tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Unlike per-token instruction tuning, RLHF adjusts behavior at the whole-response level using contrastive signals (better vs. worse responses), which tends to generalize more robustly across tasks. This power comes with costs and risks: reward models are proxy objectives, training can over-optimize against them or induce biases (such as length bias), and the process is compute-, data-, and operations-intensive. As a result, RLHF is most effective as part of a multi-stage post-training recipe built on a strong base model.
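To make the last two stages concrete, the standard formulation (developed in Chapters 5 and 6) first fits a reward model $r_\phi$ to pairwise preferences with a Bradley-Terry style loss, then optimizes the policy $\pi_\theta$ against that reward while a KL penalty keeps it close to a reference model $\pi_{\text{ref}}$. The notation below is a sketch of the common form rather than the exact equations used later:

$$
\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x,\, y_c,\, y_r) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_c) - r_\phi(x, y_r)\big)\Big]
$$

$$
\max_{\pi_\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathcal{D}_{\text{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)
$$

where $y_c$ and $y_r$ are the chosen (preferred) and rejected responses in a preference pair, $\sigma$ is the logistic function, and $\beta$ trades off reward maximization against drift from the reference model.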
The field has evolved from early instruction-tuned, vibes-driven open models to broader acceptance of preference optimization, including simpler algorithms like Direct Preference Optimization (DPO), while closed labs advanced multi-stage post-training at scale. A useful mental model is the “elicitation” view: pretraining yields substantial latent ability, and post-training—especially RLHF—extracts and shapes it for interactive use, going beyond superficial style changes to influence reasoning and response quality. This book distills best practices and trade-offs across the pipeline, offers hands-on implementations, and situates RLHF within the larger post-training toolkit, preparing readers to navigate ongoing advances in areas like RLVR and reasoning-focused training.
Figure: A rendition of the early, three-stage RLHF process: first training via supervised fine-tuning (SFT, Chapter 4), then building a reward model (RM, Chapter 5), and finally optimizing with reinforcement learning (RL, Chapter 6).
Future of RLHF
With the growth of investment in language modeling, many variations on the traditional RLHF methods have emerged, and RLHF has colloquially become an umbrella term for multiple overlapping approaches. RLHF is a subset of preference fine-tuning (PreFT) techniques, which also include Direct Alignment Algorithms (see Chapter 8): the class of methods downstream of DPO that solve the preference-learning problem by taking gradient steps directly on preference data rather than learning an intermediate reward model. RLHF is the tool most associated with rapid progress in “post-training” of language models, which encompasses all training after the large-scale autoregressive pretraining on primarily web data. This book is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.
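As a concrete instance of a Direct Alignment Algorithm, DPO (covered in Chapter 8) replaces the explicit reward model with a loss computed directly on preference pairs; it is shown here in its standard form as a sketch, with $\pi_{\text{ref}}$ the frozen reference policy and $\beta$ again controlling the strength of the implicit KL constraint:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,\, y_c,\, y_r) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)}\right)\right]
$$

A gradient step on this loss increases the relative likelihood of chosen responses over rejected ones, which is exactly the sense in which these methods skip the intermediate reward model.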
As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may fall more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it is a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into the systems we use on a regular basis? This book hopes to be a foundation for decades of research and lessons on these problems.