1 Overview

Reinforcement learning from human feedback (RLHF) is a method for injecting human preferences into AI systems to handle objectives that are hard to specify explicitly. It rose to prominence by enabling conversational models to move from raw next-token completion to helpful, safe, and engaging dialogue, becoming a cornerstone of modern “post-training.” At a high level, RLHF helps models internalize the style, tone, and behavioral norms people prefer across domains, turning broadly capable base models into reliable, general-purpose assistants.

The canonical RLHF pipeline has three stages: instruction/supervised fine-tuning to teach basic instruction following, reward modeling from human preference data, and reinforcement learning to optimize model responses against the learned reward. In today’s post-training stack, RLHF largely occupies the preference fine-tuning stage, complemented by instruction/supervised fine-tuning and reinforcement learning with verifiable rewards (RLVR). Unlike per-token instruction tuning, RLHF adjusts behavior at the whole-response level using contrastive signals (better vs. worse responses), which tends to generalize more robustly across tasks. This power comes with costs and risks: reward models are proxy objectives, training can over-optimize them or induce biases (such as length bias), and the process is compute-, data-, and operations-intensive. As a result, RLHF is most effective as part of a multi-stage post-training recipe built on a strong base model.
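To make the reward-modeling stage concrete, here is a minimal sketch, assuming a PyTorch setup, of the pairwise preference loss commonly used to train reward models from better-vs-worse comparisons; the function name and the toy scores are illustrative assumptions, not an excerpt from the book's implementations.

import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(s_chosen - s_rejected),
    which is minimized when the chosen response outscores the rejected one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: scalar scores a reward model might assign to two chosen/rejected pairs.
chosen_scores = torch.tensor([1.5, 0.2])
rejected_scores = torch.tensor([0.3, 0.9])
print(preference_loss(chosen_scores, rejected_scores))

In practice both scores come from the same reward model applied to the chosen and rejected responses for one prompt, so the gradient pushes the two scores apart rather than targeting any absolute value.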

The field has evolved from early instruction-tuned, vibes-driven open models to broader acceptance of preference optimization, including simpler algorithms like Direct Preference Optimization (DPO), while closed labs advanced multi-stage post-training at scale. A useful mental model is the “elicitation” view: pretraining yields substantial latent ability, and post-training—especially RLHF—extracts and shapes it for interactive use, going beyond superficial style changes to influence reasoning and response quality. This book distills best practices and trade-offs across the pipeline, offers hands-on implementations, and situates RLHF within the larger post-training toolkit, preparing readers to navigate ongoing advances in areas like RLVR and reasoning-focused training.

Figure: A rendition of the early, three-stage RLHF process: first training via supervised fine-tuning (SFT, Chapter 4), then building a reward model (RM, Chapter 5), and finally optimizing with reinforcement learning (RL, Chapter 6).

Future of RLHF

With the scale of investment in language modeling, many variations on the traditional RLHF methods have emerged, and “RLHF” has colloquially become shorthand for several overlapping approaches. Strictly, RLHF is one subset of preference fine-tuning (PreFT) techniques; another is the class of Direct Alignment Algorithms (see Chapter 8), the methods downstream of DPO that solve the preference-learning problem by taking gradient steps directly on preference data rather than learning an intermediate reward model. RLHF is also the tool most associated with rapid progress in “post-training” of language models, which encompasses all training after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.

As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, even if the spotlight falls more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it offers a lens for studying the grand problems facing modern AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to be a foundation for decades of research and lessons on these problems.

FAQ

What is RLHF and why did it become important?
Reinforcement Learning from Human Feedback (RLHF) incorporates human preferences into AI training to solve tasks that are hard to specify with explicit rules. It became prominent because it helps systems reflect nuanced human preferences in real-world interactions and powered breakthroughs like ChatGPT, accelerating the rise of general-purpose language models.

What are the three main steps in the classic RLHF pipeline?
1) Train a base instruction-following model via supervised fine-tuning (SFT/IFT). 2) Collect human preference data and train a reward model (RM). 3) Optimize the policy with an RL algorithm by sampling responses and scoring them with the reward model (e.g., policy gradients); a sketch of the reward used in this stage appears after this FAQ.

How does RLHF differ from instruction fine-tuning (SFT/IFT)?
SFT learns token-level features by imitating reference answers, while RLHF optimizes whole responses using preference signals. RLHF uses contrastive objectives (better vs. worse responses), captures style and behavior, and generally transfers better across domains than pure instruction fine-tuning.

What does RLHF change about a model’s responses in practice?
It shapes style, tone, and helpfulness. Compared to a base model that continues text generically, an RLHF-tuned model tends to answer concisely, directly, and supportively in a question-answer format, aligning with user expectations for clarity and warmth.

Where does RLHF fit within modern post-training?
RLHF sits in preference fine-tuning (PreFT), which is one stage of post-training. A typical post-training stack includes: 1) Instruction/Supervised Fine-tuning (IFT/SFT), 2) Preference Fine-tuning (RLHF and related methods), and 3) Reinforcement Learning with Verifiable Rewards (RLVR).

What are the main challenges and costs of RLHF?
Key challenges include training reliable reward models, avoiding over-optimization to proxy rewards, and handling biases (e.g., length bias). RLHF is more expensive than SFT in data, compute, and engineering time, and it works best when starting from a strong base model.

What is the “elicitation” interpretation of post-training?
Post-training is viewed as eliciting and amplifying capabilities already latent in the pretrained model, similar to refining a car’s performance around a fixed chassis. Significant gains can be achieved by shaping behavior and format without redoing most of pretraining.

What is the Superficial Alignment Hypothesis, and how is it critiqued here?
It claims most alignment is about style and can be achieved with small datasets. The book argues this is incomplete: while small datasets can shift style and improve narrow benchmarks, preference-based and RL methods can also shape deeper behavior and reasoning, going beyond superficial changes.

How did open post-training evolve, and what role did DPO play?
Early open efforts (e.g., Alpaca, Vicuna) relied on instruction tuning and synthetic data, leading some to doubt RLHF’s necessity. Direct Preference Optimization (DPO) later simplified preference optimization and, with careful hyperparameters, powered notable open models (e.g., Zephyr-Beta, Tülu 2), marking a step forward in open preference tuning.

What are the current frontiers beyond RLHF?
Reinforcement Learning with Verifiable Rewards (RLVR) and broader reasoning training are advancing rapidly. Modern post-training interleaves IFT/SFT, preference tuning (including RLHF and DPO-like methods), prompt design, and RLVR to target specific capabilities across domains.
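As a companion to the pipeline steps above, here is a minimal sketch, under assumed per-response quantities, of how the reward is often shaped in the final RL stage: the reward model’s score minus a KL penalty that keeps the policy close to the frozen reference (SFT) model. The function, the coefficient beta, and the toy numbers are illustrative assumptions rather than the book’s recipe.

import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprob: torch.Tensor,
                  ref_logprob: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """KL-penalized reward for one sampled response: the reward-model score
    minus beta times a per-response KL estimate (policy log-prob minus
    reference log-prob of the same tokens)."""
    kl_estimate = policy_logprob - ref_logprob
    return rm_score - beta * kl_estimate

# Toy usage: a response the reward model scores at 1.2 but whose log-probability
# under the policy exceeds the reference model's by 5.0 nats.
print(shaped_reward(torch.tensor(1.2), torch.tensor(-35.0), torch.tensor(-40.0)))  # tensor(0.7000)

The penalty is what lets the reward model serve as a proxy objective without the policy drifting into degenerate, over-optimized outputs, connecting directly to the over-optimization risks discussed earlier in this chapter.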
