The RLHF Book

Reinforcement learning from human feedback, alignment, and post-training LLMs

Nathan Lambert

MEAP began November 2025
Last updated February 2026
Publication in Summer 2026 (estimated)

ISBN 9781633434301
225 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Korean


table of contents

1 Introduction

1.1 What Does RLHF Do?

1.2 An Intuition for Post-Training

1.3 How We Got Here

1.4 Future of RLHF

2 A Tiny History of RLHF

2.1 Origins to 2018: RL on Preferences

2.2 2019 to 2022: RL from Human Preferences on Language Models

2.3 2023 to Present: ChatGPT Era

3 Training Overview

3.1 Problem Formulation

3.1.1 A Simple Example: The Thermostat

3.1.2 Classic RL Example: CartPole

3.1.3 Manipulating the Standard RL Setup

3.1.4 Fine-tuning and Regularization

3.1.5 Optimization Tools

3.2 Canonical Training Recipes

3.2.1 InstructGPT

3.2.2 Tülu 3

3.2.3 DeepSeek R1

4 Instruction Fine-tuning

4.1 Chat templates and the structure of instructions

4.2 Best practices of instruction tuning

4.3 Implementation

5 Reward Modeling

5.1 Training Reward Models

5.2 Architecture

5.3 Implementation Example

5.4 Variants

5.4.1 Preference Margin Loss

5.4.2 Balancing Multiple Comparisons Per Prompt

5.4.3 K-wise Loss Function

5.5 Outcome Reward Models

5.6 Process Reward Models

5.7 Reward Models vs. Outcome RMs vs. Process RMs vs. Value Functions

5.7.1 Inference Differences

5.8 Generative Reward Modeling

5.9 Further Reading

6 Reinforcement Learning

6.1 Policy Gradient Algorithms

6.1.1 Vanilla Policy Gradient

6.1.2 REINFORCE

6.1.3 REINFORCE Leave One Out (RLOO)

6.1.4 Proximal Policy Optimization (PPO)

6.1.5 Group Relative Policy Optimization (GRPO)

6.1.6 Group Sequence Policy Optimization (GSPO)

6.1.7 Clipped Importance Sampling Policy Optimization (CISPO)

6.1.8 Comparing Algorithms

6.2 Implementation

6.2.1 Policy Gradient Basics

6.2.2 Loss Aggregation

6.2.3 Asynchronicity

6.2.4 Proximal Policy Optimization

6.2.5 Group Relative Policy Optimization

6.3 Auxiliary Topics

6.3.1 Generalized Advantage Estimation (GAE)

6.3.2 Double Regularization

6.3.3 Further Reading

7 Reasoning & Inference-Time Scaling

7.1 The Origins of New Reasoning Models

7.1.1 Why Does RL Work Now?

7.1.2 RL Training vs. Inference-time Scaling

7.1.3 The Future (Beyond Reasoning) of RLVR

7.2 Understanding Reasoning Training Methods

7.2.1 Reasoning Research Pre OpenAI’s o1 or DeepSeek R1

7.2.2 Early Reasoning Models

7.2.3 Common Practices in Training Reasoning Models

7.3 Looking Ahead

8 Direct Alignment Algorithms

8.1 Direct Preference Optimization (DPO)

8.1.1 How DPO Works

8.1.2 DPO Derivation

8.2 Numerical Concerns, Weaknesses, and Alternatives

8.3 Implementation Considerations

8.4 DAAs with Synthetic Preference Data

8.5 DAAs vs. RL: Online vs. Offline Data

9 Rejection Sampling

9.1 Training Process

9.1.1 Generating Completions

9.1.2 Scoring Completions

9.1.3 Fine-tuning

9.2 Implementation Details

9.3 Related: Best-of-N Sampling

10 The Nature of Preferences

10.1 The Origins of RLHF and Preferences

10.2 Specifying objectives: from logic of utility to reward functions

10.3 Tools for optimizing utility

10.4 Complexity of optimizing preferences

11 Preference Data

11.1 Why We Need Preference Data

11.2 Collecting Preference Data

11.2.1 Interface

11.2.2 Rankings vs. Ratings

11.2.3 Multi-turn Data

11.2.4 Structured Preference Data

11.2.5 Sourcing and Contracts

11.3 Bias: Things to Watch Out For in Data Collection

11.4 Open Questions in RLHF Preference Data

12 Synthetic Data

12.1 Distillation

12.2 AI Feedback

12.2.1 Balancing AI and Human Feedback Data

12.2.2 Specific LLMs for Judgement

12.3 Constitutional AI

12.3.1 Further Reading on CAI

12.4 Rubrics: AI Feedback for Training

13 Tool Use & Function Calling

13.1 Interweaving Tool Calls in Generation

13.2 Multi-step Tool Reasoning

13.3 Model Context Protocol (MCP)

13.4 Implementation

14 Over Optimization

14.1 Qualitative Over-optimization

14.1.1 Managing Proxy Objectives

14.1.2 Over-refusal and “Too Much RLHF”

14.2 Quantitative over-optimization

14.3 Misalignment and the Role of RLHF

15 Regularization

15.1 KL Divergences in RL Optimization

15.1.1 Reference Model to Generations

15.1.2 Implementation Example

15.2 Pretraining Gradients

15.3 Other Regularization

16 Evaluation

16.1 Prompting Formatting: From Few-shot to Zero-shot to CoT

16.2 Why Many External Evaluation Comparisons are Unreliable

16.3 How Labs Actually use Evaluations Internally to Improve Models

16.4 Contamination

16.5 Tooling

17 Product, UX, and Model Character

17.1 Character Training

17.2 Model Specifications

17.3 Product Cycles, UX, and RLHF

Appendix

Appendix A: Definitions

A.1 Language Modeling Overview

A.2 ML Definitions

A.3 NLP Definitions

A.4 RL Definitions

A.5 RLHF Only Definitions

A.6 Extended Glossary

Appendix B: Beyond ‘Just Style’

Appendix C: Practical Issues

C.1 Compute Costs of Post-Training

C.2 Evaluation Variance

C.3 Managing Training Performance Variance

C.4 Identifying Bad Training Jobs

Overview

10 The Nature of Preferences

Reinforcement learning from human feedback treats human preferences as the fuel and target of learning when explicit reward design is infeasible. But what counts as “better” is often subjective and context-laden—as simple as preferring one poem over another—so RLHF inherits the ambiguity of human judgment. The chapter situates RLHF within an interdisciplinary lineage spanning philosophy, psychology, economics, decision theory, optimal control, and modern deep learning. Because each field brings its own assumptions about what preferences are and how they can be optimized, RLHF can never be a fully solved problem; in practice it emphasizes empirical alignment on concrete behaviors rather than perfect value calibration. Ongoing work therefore explores pluralistic alignment across populations and personalization to respect divergent values while maintaining practical performance.

Historically, ideas about utility and choice—from early probability and decision theory through utilitarian calculus to formal treatments of uncertainty—paved the way for modeling preference as something that could be quantified. Reinforcement learning then fused operant conditioning’s notion of “reward” with control theory’s utility-to-go, culminating in the Bellman equation and the MDP framework, and later in TD learning, Q-learning, and deep RL successes in games and control. These methods assume stable, well-defined objectives and stationary environments, assumptions that clash with messy, open-ended human judgments. Although inverse reinforcement learning is closely related to learning rewards from behavior, RLHF largely evolved through an engineering path focused on reward modeling from comparative feedback at scale. The result is a powerful toolkit optimized for utility under uncertainty, yet strained when compressing many heterogeneous, shifting human desiderata into a single scalar signal.

The chapter details why this compression is fragile: human preferences drift over time, depend on presentation and context, and resist neat Markovian assumptions or full observability common in RL formalisms. While the Von Neumann–Morgenstern utility theorem licenses representing preferences with expected utility, its prerequisites are routinely violated in real interfaces and social settings. Social choice results (like impossibility theorems) underscore that not all fairness desiderata can be satisfied simultaneously, and attempts at interpersonal utility comparison or principal–agent formulations introduce new tensions, including corrigibility concerns. Practically, RLHF data can exhibit intransitivity, framing effects, reliance on noisy proxies, and low inter-annotator agreement, all of which complicate aggregation and optimization. Hence RLHF is best viewed as an iterative, empirically grounded process: it aligns models to observed human feedback for specific tasks while acknowledging unresolved normative trade-offs, motivating better dataset engineering, evaluation, and personalization rather than a final, universal solution.
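
The aggregation problem described above can be made concrete with a small sketch. The three rankings below are hypothetical, not data from the book; they are chosen to form a classic Condorcet cycle, where each annotator is individually consistent but pairwise majority voting yields an intransitive group preference. The function names are illustrative assumptions.

```python
from itertools import combinations, permutations

# Hypothetical rankings from three annotators, best first (illustrative only).
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y, rankings):
    """Return True if a strict majority of annotators rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

def majority_edges(rankings):
    """List (winner, loser) pairs under pairwise majority vote."""
    items = sorted(rankings[0])
    edges = []
    for x, y in combinations(items, 2):
        if majority_prefers(x, y, rankings):
            edges.append((x, y))
        elif majority_prefers(y, x, rankings):
            edges.append((y, x))
    return edges

def is_transitive(edges, items):
    """True if some single ordering of items agrees with every majority edge."""
    return any(
        all(order.index(w) < order.index(l) for w, l in edges)
        for order in permutations(items)
    )

edges = majority_edges(rankings)
consistent = is_transitive(edges, ["A", "B", "C"])
print(edges)       # majority says A > B, C > A, B > C
print(consistent)  # False: no single ranking satisfies all three
```

No scalar score per option can reproduce a cycle like this, which is one way to see why compressing group preferences into a single reward loses information.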

The timeline of the integration of various subfields into the modern version of RLHF. The direct links are continuous developments of specific technologies, and the arrows indicate motivations and conceptual links.

Summary

RLHF sits at the intersection of philosophy, economics, psychology, reinforcement learning, and deep learning – each bringing its own assumptions about what preferences are and how they can be optimized.
Reinforcement learning was designed for domains with stable, deterministic reward functions, but human preferences are noisy, context-dependent, temporally shifting, and not always transitive – a fundamental mismatch that shapes the limitations of RLHF.
The Von Neumann-Morgenstern utility theorem provides theoretical license for modeling preferences as scalar functions, but its assumptions (transitivity, comparability, stability) are routinely violated in practice. Impossibility theorems in social choice theory further show that no single aggregation method over preferences can satisfy all fairness criteria simultaneously.
These challenges explain why RLHF will never be fully “solved,” but they do not prevent it from being useful. In practice, RLHF operates on more tractable problems of style and performance rather than attempting to resolve the full complexity of human values.
The practical mechanics of collecting and structuring preference data in light of RLHF’s complex motivations are covered in Chapter 11.

FAQ

What does “the nature of preferences” mean in this chapter, and why does it matter for RLHF?

Human preferences are complex, context-dependent, and shaped by philosophy, psychology, economics, and culture. RLHF tries to model these preferences to guide learning, but their variability and ambiguity make both measurement and optimization inherently challenging.

Why will RLHF never be a fully solved problem?

Because preferences drift over time, differ across people and contexts, and cannot be perfectly aggregated into a single objective. Social choice impossibility results, measurement biases, and practical trade-offs ensure there’s no one definitive solution—only better approximations.

How do reward, utility, and preference relate in RLHF?

RLHF uses learned reward models as proxies for human preferences, drawing on decision theory’s utility concepts. While utility/reward provide clear optimization targets in control and RL, compressing rich, plural human preferences into a single scalar reward inevitably loses nuance.

What is “reward-to-go” and why is a discount factor used?

Reward-to-go is the expected cumulative future reward from a state or step. RL typically applies a discount factor to weight near-term rewards more than distant ones, stabilizing optimization. This framework fits well-defined tasks but strains when modeling open-ended, shifting human preferences.
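
As a minimal sketch of the idea, the function below computes the discounted reward-to-go for a single sampled trajectory. The name `reward_to_go`, the default `gamma`, and the example rewards are assumptions for illustration; the expected reward-to-go would average this quantity over many trajectories.

```python
def reward_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards at each step of one trajectory.

    G_t = r_t + gamma * G_{t+1}, computed backward from the final step.
    """
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Hypothetical trajectory with a single reward at the final step.
returns = reward_to_go([0.0, 0.0, 1.0], gamma=0.5)
print(returns)  # [0.25, 0.5, 1.0]
```

With `gamma` below 1, the same final reward counts for less the further away it is, which is the near-term weighting described above.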

Which RL/control assumptions break down when modeling human preferences?

Key assumptions such as stationary rewards, determinism, transitivity, a single objective, and the Markov property often fail. Human judgments are context-sensitive, non-stationary, and multi-objective, making standard RL formulations an imperfect fit.

Why is the poem example relevant to RLHF?

Judging “which poem is better” lacks a single correct answer, unlike factual questions. RLHF relies on human comparisons in such open-ended tasks to approximate desirable behavior, but the subjectivity highlights ambiguity and the need for careful data and evaluation design.

How do MDPs and the Bellman equation relate to RLHF, and what are their limits?

RLHF borrows tools from RL that assume Markov Decision Processes and use the Bellman equation for optimality. Language tasks are partially observed and non-stationary, limiting these assumptions and weakening theoretical guarantees.

What measurement pitfalls affect preference data in RLHF?

Presentation effects (how options are displayed), order and fatigue effects, proxy signals (e.g., dwell time), and low agreement across annotators can skew data. Such biases can mislead reward models and propagate through training and deployment.
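
One simple way to quantify low agreement is the raw pairwise agreement rate, sketched below. The labels are hypothetical and the function name is an assumption. Raw agreement does not correct for chance, so chance-corrected statistics such as Cohen's kappa are often reported alongside it.

```python
from itertools import combinations

# Hypothetical labels: for each comparison, which response each annotator chose.
labels = {
    "pair_1": ["A", "A", "A"],
    "pair_2": ["A", "B", "A"],
    "pair_3": ["B", "A", "B"],
}

def pairwise_agreement(labels):
    """Fraction of annotator pairs that made the same choice, over all items."""
    agree = 0
    total = 0
    for votes in labels.values():
        for u, v in combinations(votes, 2):
            agree += int(u == v)
            total += 1
    return agree / total

rate = pairwise_agreement(labels)
print(round(rate, 3))  # 0.556
```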

How does RLHF relate to inverse reinforcement learning (IRL)?

IRL infers a reward function from observed behavior; RLHF typically trains a reward model from direct human feedback (e.g., pairwise comparisons). They are closely related, and insights from IRL may improve RLHF, especially for complex, open-ended domains.
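
To illustrate what training on pairwise comparisons can look like, here is a minimal sketch of a logistic (Bradley-Terry-style) pairwise loss. This is a commonly used formulation and an assumption here, not something this page specifies; the function names and example scores are illustrative.

```python
import math

def preference_probability(r_chosen, r_rejected):
    """Modeled probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood -log(sigmoid(r_chosen - r_rejected)).

    Uses log1p and branches on the sign of the difference so that
    large score gaps do not overflow.
    """
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical scalar scores from a reward model for two responses.
loss_correct = pairwise_loss(2.0, 0.0)  # model agrees with the human label
loss_wrong = pairwise_loss(0.0, 2.0)    # model disagrees with the human label
print(loss_correct < loss_wrong)        # True
```

The loss depends only on the difference between the two scores, so the reward model is trained on relative judgments rather than absolute ratings.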

What is pluralistic alignment and how is it addressed?

Pluralistic alignment aims to respect diverse preferences across populations rather than enforcing a single utility. Approaches include better datasets, explicit aggregation strategies, and personalization—each balancing fairness, coherence, and corrigibility trade-offs.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $23.99

you save $24.00 (50%)

include audio $24.99 $12.49
