1 Introduction
Reinforcement Learning from Human Feedback (RLHF) is presented as a practical way to imbue AI systems with human preferences, enabling models to solve tasks that are hard to specify directly. The chapter outlines the now-classic three-step pipeline: first train a prompt-following base model with supervised instruction tuning, then learn a reward model from human preference data, and finally optimize the policy with an RL objective guided by that reward. RLHF sits within modern post-training alongside instruction/supervised finetuning (feature and format learning) and reinforcement finetuning (performance on verifiable domains), and it has been central to turning large pretrained models into effective, general-purpose assistants. Conceptually, RLHF operates at the response level with contrastive signals—teaching models what better and worse answers look like—which improves stylistic quality, faithfulness to user intent, and cross-domain generalization beyond what instruction tuning alone provides.
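To make the three steps concrete, a standard formulation, which the book develops in detail in later chapters, fits a reward model $r_\theta$ on pairwise comparisons between a chosen response $y_c$ and a rejected response $y_r$, and then optimizes the policy $\pi_\phi$ against that reward while penalizing divergence from a reference policy $\pi_{\text{ref}}$ (typically the instruction-tuned model). The notation below is a sketch of these common objectives rather than the chapter's exact statement:

$$
\mathcal{L}_{\text{RM}}(\theta) = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}\left[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\right]
$$

$$
\max_{\pi_\phi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\phi(\cdot \mid x)}\left[r_\theta(x, y)\right] \;-\; \beta\, \mathcal{D}_{\text{KL}}\!\big(\pi_\phi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)
$$

Here $\sigma$ is the logistic function and $\beta$ controls the strength of the KL regularization; this penalty is what keeps the optimized policy close to the reference model so that the learned reward, which is only a proxy for human preferences, is not over-optimized.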
The chapter emphasizes both the power and the pitfalls of RLHF. While it integrates subtle stylistic and behavioral features that enhance usefulness and reliability, it introduces difficult engineering questions: how to train robust reward models, how to avoid over-optimization against proxy objectives, how to control biases such as length preferences, and how to regularize optimization to stay in safe regions of behavior. RLHF is costlier than simple instruction tuning in data, compute, and iteration time, and it works best when built on strong base models. The author frames post-training through an “elicitation” lens: much of the value comes from extracting and amplifying latent capabilities already present from pretraining at scale, not merely applying superficial style. With the right data and objectives, post-training can reshape reasoning behavior and boost performance in challenging domains, and newer reinforcement finetuning methods extend this frontier further.
Historically, the field moved from early open instruction-tuned models (e.g., Alpaca, Vicuna) and skepticism about RLHF to a surge of preference-tuning methods such as Direct Preference Optimization, catalyzed by better hyperparameters and datasets like UltraFeedback. Meanwhile, closed labs advanced multi-stage post-training programs combining instruction tuning, RLHF, and careful prompt and data design at large scale, with open efforts like Tülu providing a comprehensive foundation for research. Today, post-training interleaves objectives to target specific capabilities, and innovation is accelerating in reinforcement finetuning, reasoning training, and AI/constitutional feedback. The book positions itself as a concise, practice-oriented guide to the canonical RLHF workflow—reward modeling, regularization, instruction tuning, rejection sampling, policy-gradient RL, and direct alignment methods—augmented by advanced topics such as tool use, synthetic data, and evaluation, aimed at readers with basic ML and RL backgrounds.
Figure: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.
Future of RLHF
With the growing investment in language modeling, many variations on the traditional RLHF methods have emerged, and colloquially RLHF has become synonymous with multiple overlapping approaches. RLHF is a subset of preference fine-tuning (PreFT) techniques, which also include Direct Alignment Algorithms (see Chapter 12). RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning and the other implementation details needed to set up a model for RLHF training.
As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may fall more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, at its core RLHF is a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to be the foundation of decades of research and lessons on these problems.