Reinforcement Learning from Human Feedback (RLHF) is introduced as a way to incorporate human preferences into AI systems, especially for tasks where goals are hard to specify. It rose from early control settings to mainstream prominence with conversational language models, and it now sits as a central component of post-training, the multi-stage process that turns a pretrained base model into a helpful assistant. At a high level, the RLHF pipeline consists of training an instruction-following model, fitting a reward model to human preference data, and optimizing the policy with reinforcement learning. Together these stages align model behavior with what people actually want, and they frame the book's practical decisions and implementation recipes.
Beyond enabling reliable question-answering, RLHF primarily shapes a model’s style and higher-level behavior—making answers concise, warm, reliable, safer, and better formatted—while also improving generalization relative to instruction fine-tuning alone. Conceptually, instruction tuning adjusts token-by-token predictions toward desired features, whereas RLHF evaluates whole responses and learns from comparisons of better versus worse outputs (a contrastive signal), including negative feedback. This flexibility brings challenges: training robust reward models from noisy data, avoiding over-optimization of proxy rewards, managing biases such as length, and budgeting the added compute and data costs. Consequently, RLHF works best alongside other post-training stages and depends on a strong base model.
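The contrastive signal described above is typically realized as a Bradley-Terry pairwise loss when fitting the reward model: the chosen response should score higher than the rejected one. A minimal sketch, assuming scalar reward scores (the function name and inputs are illustrative; real reward models score full prompt-response pairs):

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    The reward model is trained so that the chosen response outscores
    the rejected one; the loss is -log(sigmoid(r_chosen - r_rejected)).
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A tied pair costs -log(0.5) (~0.693); a clear win costs much less.
tied = pairwise_loss(1.0, 1.0)
clear_win = pairwise_loss(2.0, -2.0)
```

Note that the loss depends only on the *difference* in scores, which is why reward models trained this way capture relative preferences rather than absolute quality.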
The chapter places RLHF in the broader trajectory of open and closed model development: early instruction-only recipes captured “vibes” but plateaued; direct preference optimization simplified preference learning and spread widely; and newer reinforcement-learning approaches with verifiable rewards are pushing reasoning and agentic capabilities while consuming growing compute. Framed by an elicitation view, post-training extracts latent ability from base models—much like refining a car around a fixed chassis—often yielding large gains without changing pretraining. Looking ahead, RLHF remains the core of preference fine-tuning and a bridge to broader RL methods for reasoning, with the field converging on multi-stage post-training as the route to state-of-the-art systems.
A rendition of the early, three-stage RLHF process: first training via supervised fine-tuning (SFT, chapter 4), then building a reward model (RM, chapter 5), and finally optimizing with reinforcement learning (RL, chapter 6).
Summary
RLHF incorporates human preferences into AI systems to solve problems that are hard to specify programmatically, and it became widely known through ChatGPT’s breakout, which made the capabilities of language models more approachable.
The basic RLHF pipeline has three steps: instruction fine-tuning to teach the model to follow the question-answering format, training a reward model on human preferences, and optimizing the model with RL against that reward.
RLHF is known to primarily change the style, tone, and format of model responses – making them more helpful, warm, and engaging. But it’s not “just style transfer”: RLHF also improves benchmark performance, though over-optimization (e.g., excessive length or chattiness) can harm capabilities in other domains.
The elicitation theory of post-training suggests that base models contain latent potential, and post-training’s job is to extract and cultivate that intelligence into useful behaviors.
RLHF is one component of modern post-training, alongside instruction fine-tuning (IFT/SFT) and reinforcement learning with verifiable rewards (RLVR), used together in an intertwined manner to craft particular training recipes.
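The third step of the pipeline above, optimizing against the reward model, is commonly formulated as maximizing the reward-model score minus a KL penalty that keeps the policy close to the reference (SFT) model. A minimal sketch of the per-sample shaped reward, with illustrative names and a hypothetical `beta` coefficient:

```python
def kl_shaped_reward(rm_score: float,
                     logprob_policy: float,
                     logprob_ref: float,
                     beta: float = 0.1) -> float:
    """Shaped reward for the RL step of RLHF.

    The KL penalty (estimated here per sample as the policy/reference
    log-probability gap) discourages the policy from drifting far from
    the reference model while chasing the reward-model score.
    """
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# When the policy matches the reference, the penalty vanishes;
# assigning higher probability than the reference reduces the reward.
on_reference = kl_shaped_reward(1.0, -2.0, -2.0)
drifted = kl_shaped_reward(1.0, -1.0, -2.0)
```

This penalty is one of the main guards against the over-optimization of proxy rewards mentioned above: without it, the policy can exploit reward-model weaknesses far from the data the reward model was trained on.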
FAQ
What is Reinforcement Learning from Human Feedback (RLHF)?
RLHF is a technique that incorporates human preference information into AI systems to solve hard-to-specify objectives. It became prominent as a way to guide models toward behaviors users actually want, especially in interactive settings where preferences are nuanced or hard to write as rules.

Why did RLHF become important for modern language models?
RLHF helps bridge the gap between raw next-token prediction and helpful, conversational question-answering. It enabled models like ChatGPT to be reliable, warm, and engaging, accelerating the shift from research prototypes to broadly useful products.

What are the core steps in the classic RLHF pipeline?
The three-stage recipe is: (1) instruction/supervised fine-tuning to make the model follow prompts, (2) training a reward model from human preference data, and (3) optimizing the policy with RL using reward-model scores on generated responses.

How does RLHF change a model’s outputs compared to a base model?
Base models complete text generically, while RLHF-tuned models provide concise, user-oriented answers with improved tone, format, and helpfulness. RLHF shapes response-level behavior (style, structure, safety) rather than just next-token features.

Where does RLHF fit within modern post-training?
Post-training commonly includes: (1) instruction/supervised fine-tuning (IFT/SFT) to teach format and basic behaviors, (2) preference fine-tuning (PreFT) such as RLHF to align with human preferences, and (3) reinforcement learning with verifiable rewards (RLVR) to boost performance on tasks with checkable outcomes. RLHF dominates the second stage.

What does RLHF optimize that instruction tuning alone does not?
Instruction tuning optimizes per-token imitation of good examples; RLHF optimizes whole responses using relative preferences or reward signals. This contrastive, response-level feedback helps models generalize better across domains and capture subtle, hard-to-specify preferences.

What are the main challenges and costs of doing RLHF?
Key challenges include building reliable reward models, avoiding over-optimization to proxy rewards, controlling length bias, and maintaining base capabilities. RLHF is more expensive than simple SFT in data, compute, and engineering time.

What is the “Elicitation Theory” of post-training?
It’s the view that post-training largely extracts and amplifies capabilities already present in the pretrained base model, like refining an F1 car’s performance after the chassis is built. With careful post-training (SFT, RLHF, RLVR), teams can rapidly unlock substantial gains without changing pretraining.

How did open post-training methods evolve (Alpaca → DPO → beyond)?
Early open efforts used small human datasets plus synthetic data to imitate ChatGPT-style behavior. Skepticism of RLHF gave way to direct preference optimization (DPO) and other “direct alignment” methods that simplified preference learning, while closed labs advanced multi-stage post-training at larger scales.

What’s next for RLHF and related methods?
RLHF is a core part of preference fine-tuning and a bridge to broader RL-based post-training like RLVR and reasoning-focused training. Looking forward, larger-scale RL methods will likely play a growing role, while RLHF remains central to mapping human values and objectives into everyday AI systems.
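The DPO simplification mentioned above folds the reward model into the policy itself: the same pairwise objective is applied directly to policy log-probabilities measured relative to a reference model, skipping the explicit reward model and RL loop. A minimal sketch with illustrative scalar inputs (real implementations sum log-probabilities over response tokens):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO objective for one preference pair.

    The implicit reward of a response is beta times its log-probability
    gap versus the reference model; the loss is the same pairwise
    -log(sigmoid(margin)) used for explicit reward models.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Starting at the reference policy, the loss is -log(0.5); raising the
# chosen response's probability relative to the reference lowers it.
at_reference = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-0.5, -1.0, -1.0, -1.0)
```

The appeal is operational: a single supervised-style loss over preference pairs, with `beta` playing the role of the KL-strength hyperparameter from the RL formulation.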