Overview

1 Introduction

Reinforcement Learning from Human Feedback (RLHF) is presented as a practical way to imbue AI systems with human preferences, enabling models to solve tasks that are hard to specify directly. The chapter outlines the now-classic three-step pipeline: first teach a pretrained base model to follow instructions with supervised instruction tuning, then learn a reward model from human preference data, and finally optimize the policy with an RL objective guided by that reward. RLHF sits within modern post-training alongside instruction/supervised finetuning (feature and format learning) and reinforcement finetuning (performance on verifiable domains), and it has been central to turning large pretrained models into effective, general-purpose assistants. Conceptually, RLHF operates at the response level with contrastive signals, teaching models what better and worse answers look like, which improves stylistic quality, faithfulness to user intent, and cross-domain generalization beyond what instruction tuning alone provides.

The chapter emphasizes both the power and the pitfalls of RLHF. While it integrates subtle stylistic and behavioral features that enhance usefulness and reliability, it introduces difficult engineering questions: how to train robust reward models, how to avoid over-optimization against proxy objectives, how to control biases such as length preferences, and how to regularize optimization to stay in safe regions of behavior. RLHF is costlier than simple instruction tuning in data, compute, and iteration time, and it works best when built on strong base models. The author frames post-training through an “elicitation” lens: much of the value comes from extracting and amplifying latent capabilities already present from pretraining at scale, not merely applying superficial style. With the right data and objectives, post-training can reshape reasoning behavior and boost performance in challenging domains, and newer reinforcement finetuning methods extend this frontier further.

Historically, the field moved from early open instruction-tuned models (e.g., Alpaca, Vicuna) and skepticism about RLHF to a surge of preference-tuning methods such as Direct Preference Optimization, catalyzed by better hyperparameters and datasets like UltraFeedback. Meanwhile, closed labs advanced multi-stage post-training programs combining instruction tuning, RLHF, and careful prompt and data design at large scale, with open efforts like Tülu providing a comprehensive foundation for research. Today, post-training interleaves objectives to target specific capabilities, and innovation is accelerating in reinforcement finetuning, reasoning training, and AI/constitutional feedback. The book positions itself as a concise, practice-oriented guide to the canonical RLHF workflow—reward modeling, regularization, instruction tuning, rejection sampling, policy-gradient RL, and direct alignment methods—augmented by advanced topics such as tool use, synthetic data, and evaluation, aimed at readers with basic ML and RL backgrounds.

Figure: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.

Future of RLHF

With the investment in language modeling, many variations on the traditional RLHF methods have emerged. Colloquially, RLHF has become synonymous with several overlapping approaches. RLHF is a subset of preference fine-tuning (PreFT) techniques, which also include Direct Alignment Algorithms (see Chapter 12). RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training after the large-scale autoregressive pretraining on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning and the other implementation details needed to set up a model for RLHF training.

As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may shine more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it provides a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to be the foundation of decades of research and lessons on these problems.

FAQ

What is RLHF and why did it emerge? Reinforcement Learning from Human Feedback (RLHF) is a technique to incorporate human preferences into AI systems, originally developed to tackle hard-to-specify objectives. It became widely known through ChatGPT and is central to turning large language models into useful, general-purpose tools.
What are the three main steps in the RLHF pipeline? 1) Train a language model to follow instructions (instruction/supervised finetuning). 2) Collect human preference data and train a reward model that scores responses. 3) Optimize the language model using an RL method by sampling responses and rating them with the reward model.
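As a concrete reference point for step 2, the reward model is commonly trained with a pairwise (Bradley-Terry) loss over chosen and rejected responses. The notation below is a minimal sketch: r_θ is the reward model, y_c and y_r are the chosen and rejected responses for a prompt x drawn from a preference dataset D, and σ is the logistic (sigmoid) function.

```latex
% Pairwise (Bradley-Terry) reward-model loss: the reward model is trained to
% score the chosen response above the rejected one for each prompt.
\mathcal{L}(\theta) =
  -\,\mathbb{E}_{(x,\,y_c,\,y_r)\sim\mathcal{D}}
  \left[ \log \sigma\!\left( r_\theta(x, y_c) - r_\theta(x, y_r) \right) \right]
```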
How does RLHF differ from instruction finetuning (SFT/IFT)? Instruction finetuning updates the model per token to match specific examples and formats. RLHF evaluates whole responses, learning what makes one response better than another (including what to avoid) via a contrastive signal. This tends to generalize better across domains.
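To make the per-token versus response-level distinction concrete, here is a minimal PyTorch sketch (illustrative, not from the book; function names and tensor shapes are assumptions): the SFT loss matches a reference completion token by token, while the RL-style loss scores a whole sampled response with a single scalar and scales its log-probability accordingly.

```python
# Minimal sketch of the two training signals (illustrative shapes and names).
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Instruction finetuning: per-token cross-entropy against a reference completion.
    logits: [batch, seq_len, vocab]; target_ids: [batch, seq_len]."""
    return F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())

def response_level_loss(logprobs: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style response-level signal: the summed log-probability of each
    sampled response is weighted by a scalar score (a reward or advantage),
    pushing up responses judged better and down responses judged worse.
    logprobs: [batch, seq_len] log-probs of sampled tokens; scores: [batch]."""
    return -(scores * logprobs.sum(dim=-1)).mean()

# Toy usage with random tensors, just to show the interfaces.
logits = torch.randn(2, 8, 100)              # [batch, seq_len, vocab]
targets = torch.randint(0, 100, (2, 8))      # reference token ids
logprobs = torch.log_softmax(torch.randn(2, 8, 100), dim=-1).max(dim=-1).values
print(sft_loss(logits, targets))
print(response_level_loss(logprobs, torch.tensor([0.7, -0.2])))
```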
Where does RLHF fit within modern post-training? Post-training comprises three methods: instruction/supervised finetuning (features and formatting), preference finetuning (aligning to human style and preferences, often via RLHF), and reinforcement finetuning (boosting performance on verifiable tasks). RLHF is a key component of preference finetuning.
What does RLHF actually change about model outputs? Primarily style and behavior: it makes answers more concise, helpful, and aligned with user expectations. Compared to a pretrained model that may ramble or include web-like metadata, an RLHF-trained model responds directly and appropriately to the user’s query.
What are the main challenges and risks of RLHF? Key challenges include training a good reward model (practices vary by domain), avoiding over-optimization to a proxy reward (requiring regularization), and managing biases such as length bias. RLHF is also more expensive than simple instruction finetuning in compute, data, and time, and it needs a strong starting model.
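The regularization mentioned above is most often a KL penalty that keeps the optimized policy close to the model it started from, so optimization cannot drift into regions where the reward model's scores are unreliable. A standard form of the objective is sketched below (notation assumed for illustration: π_θ is the policy being trained, π_ref the reference model, r_φ the learned reward, β the penalty strength).

```latex
% KL-regularized RLHF objective: maximize the learned reward while staying
% close to the reference (pre-RLHF) policy.
\max_{\pi_\theta}\;
\mathbb{E}_{x\sim\mathcal{D}}\Big[\,
  \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]
  \;-\; \beta\,\mathrm{D}_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\,\Big]
```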
Why do small alignment datasets sometimes have big effects, and is alignment only “style”? Small datasets can noticeably shift behavior (e.g., formatting and style), but the book argues that post-training effects aren’t merely superficial. With the right data and scaling, post-training can significantly influence capabilities and behavior beyond style alone.
How has the field evolved in open vs. closed settings? Open efforts began with instruction-tuned models (e.g., Alpaca-era) and skepticism about RLHF, then shifted to preference-tuning breakthroughs (e.g., DPO-based models) once training nuances were found. Closed labs, meanwhile, use complex, large-scale multi-stage post-training pipelines that go beyond what most open recipes have explored.
What topics and techniques does this book cover? The book covers canonical RLHF steps and trade-offs: reward modeling, regularization, instruction tuning, rejection sampling, RL/policy gradients, and direct alignment algorithms. Advanced topics include constitutional AI and AI feedback, reasoning and reinforcement finetuning, tool use/function calling, synthetic data, and evaluation, plus open questions like over-optimization and the role of style.
Who is the book for and how should it be used? It targets readers with entry-level experience in language modeling, reinforcement learning, and ML. It’s a concise, practical reference to get started or build toy implementations rather than an exhaustive textbook, and it invites web-first contributions and corrections.
