Overview

10 The Nature of Preferences

Reinforcement learning from human feedback treats human preferences as the fuel and target of learning when explicit reward design is infeasible. But what counts as “better” is often subjective and context-laden—as simple as preferring one poem over another—so RLHF inherits the ambiguity of human judgment. The chapter situates RLHF within an interdisciplinary lineage spanning philosophy, psychology, economics, decision theory, optimal control, and modern deep learning. Because each field brings its own assumptions about what preferences are and how they can be optimized, RLHF can never be a fully solved problem; in practice it emphasizes empirical alignment on concrete behaviors rather than perfect value calibration. Ongoing work therefore explores pluralistic alignment across populations and personalization to respect divergent values while maintaining practical performance.

Historically, ideas about utility and choice—from early probability and decision theory through utilitarian calculus to formal treatments of uncertainty—paved the way for modeling preference as something that could be quantified. Reinforcement learning then fused operant conditioning’s notion of “reward” with control theory’s utility-to-go, culminating in the Bellman equation and the MDP framework, and later in TD learning, Q-learning, and deep RL successes in games and control. These methods assume stable, well-defined objectives and stationary environments, assumptions that clash with messy, open-ended human judgments. Although inverse reinforcement learning is closely related to learning rewards from behavior, RLHF largely evolved through an engineering path focused on reward modeling from comparative feedback at scale. The result is a powerful toolkit optimized for utility under uncertainty, yet strained when compressing many heterogeneous, shifting human desiderata into a single scalar signal.

The chapter details why this compression is fragile: human preferences drift over time, depend on presentation and context, and resist neat Markovian assumptions or full observability common in RL formalisms. While the Von Neumann–Morgenstern utility theorem licenses representing preferences with expected utility, its prerequisites are routinely violated in real interfaces and social settings. Social choice results (like impossibility theorems) underscore that not all fairness desiderata can be satisfied simultaneously, and attempts at interpersonal utility comparison or principal–agent formulations introduce new tensions, including corrigibility concerns. Practically, RLHF data can exhibit intransitivity, framing effects, reliance on noisy proxies, and low inter-annotator agreement, all of which complicate aggregation and optimization. Hence RLHF is best viewed as an iterative, empirically grounded process: it aligns models to observed human feedback for specific tasks while acknowledging unresolved normative trade-offs, motivating better dataset engineering, evaluation, and personalization rather than a final, universal solution.

Figure: The timeline of the integration of various subfields into the modern version of RLHF. Direct links are continuous developments of specific technologies; arrows indicate motivations and conceptual links.

Summary

  • RLHF sits at the intersection of philosophy, economics, psychology, reinforcement learning, and deep learning – each bringing its own assumptions about what preferences are and how they can be optimized.
  • Reinforcement learning was designed for domains with stable, deterministic reward functions, but human preferences are noisy, context-dependent, temporally shifting, and not always transitive – a fundamental mismatch that shapes the limitations of RLHF.
  • The Von Neumann-Morgenstern utility theorem provides theoretical license for modeling preferences as scalar functions, but its assumptions (transitivity, comparability, stability) are routinely violated in practice. Impossibility theorems in social choice theory further show that no single aggregation method over preferences can satisfy all fairness criteria simultaneously.
  • These challenges explain why RLHF will never be fully “solved,” but they do not prevent it from being useful. In practice, RLHF operates on more tractable problems of style and performance rather than attempting to resolve the full complexity of human values.
  • The practical mechanics of collecting and structuring preference data in light of RLHF’s complex motivations are covered in Chapter 11.

FAQ

What does “the nature of preferences” mean in this chapter, and why does it matter for RLHF?
Human preferences are complex, context-dependent, and shaped by philosophy, psychology, economics, and culture. RLHF tries to model these preferences to guide learning, but their variability and ambiguity make both measurement and optimization inherently challenging.

Why will RLHF never be a fully solved problem?
Because preferences drift over time, differ across people and contexts, and cannot be perfectly aggregated into a single objective. Social choice impossibility results, measurement biases, and practical trade-offs ensure there’s no one definitive solution—only better approximations.

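The aggregation problem can be made concrete with a hypothetical three-annotator example: even when every individual ranking is transitive, the majority preference can cycle (a Condorcet cycle). A minimal Python sketch with invented rankings:

```python
# Three annotators, each with an individually transitive ranking over
# responses A, B, and C (best first). The rankings below are invented
# purely for illustration.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of annotators rank x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

# Pairwise majority verdicts form a cycle: A beats B, B beats C,
# yet C beats A -- so no single "best" response exists by majority vote.
print(majority_prefers("A", "B"))  # True
print(majority_prefers("B", "C"))  # True
print(majority_prefers("C", "A"))  # True
```

This is exactly the kind of intransitivity that makes compressing group preferences into one scalar objective lossy.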
How do reward, utility, and preference relate in RLHF?
RLHF uses learned reward models as proxies for human preferences, drawing on decision theory’s utility concepts. While utility/reward provide clear optimization targets in control and RL, compressing rich, plural human preferences into a single scalar reward inevitably loses nuance.

What is “reward-to-go” and why is a discount factor used?
Reward-to-go is the expected cumulative future reward from a state or step. RL typically applies a discount factor to weight near-term rewards more than distant ones, stabilizing optimization. This framework fits well-defined tasks but strains when modeling open-ended, shifting human preferences.

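A minimal sketch of discounted reward-to-go, assuming scalar per-step rewards and a fixed discount factor gamma:

```python
def reward_to_go(rewards, gamma=0.99):
    """Discounted return from each timestep: G_t = sum_k gamma^k * r_{t+k}.

    Computed with a single backward pass over the reward sequence.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With gamma = 0.5, the return at t=0 is 1 + 0.5*1 + 0.25*1 = 1.75.
print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With gamma near 1 distant rewards matter almost as much as immediate ones; with gamma near 0 the agent is effectively myopic.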
Which RL/control assumptions break down when modeling human preferences?
Key assumptions such as stationary rewards, determinism, transitivity, a single objective, and the Markov property often fail. Human judgments are context-sensitive, non-stationary, and multi-objective, making standard RL formulations an imperfect fit.

Why is the poem example relevant to RLHF?
Judging “which poem is better” lacks a single correct answer, unlike factual questions. RLHF relies on human comparisons in such open-ended tasks to approximate desirable behavior, but the subjectivity highlights ambiguity and the need for careful data and evaluation design.

How do MDPs and the Bellman equation relate to RLHF, and what are their limits?
RLHF borrows tools from RL that assume Markov Decision Processes and use the Bellman equation for optimality. Language tasks are partially observed and non-stationary, limiting these assumptions and weakening theoretical guarantees.

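For reference, the Bellman optimality equation underlying these tools can be written (for an optimal value function over states s, actions a, reward r, transition probabilities P, and discount factor gamma) as:

```latex
V^{*}(s) = \max_{a} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s') \Big]
```

The recursion presumes that the state s summarizes everything relevant to the future (the Markov property) — precisely the assumption that long, context-dependent human judgments tend to violate.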
What measurement pitfalls affect preference data in RLHF?
Presentation effects (how options are displayed), order and fatigue effects, proxy signals (e.g., dwell time), and low agreement across annotators can skew data. Such biases can mislead reward models and propagate through training and deployment.

How does RLHF relate to inverse reinforcement learning (IRL)?
IRL infers a reward function from observed behavior; RLHF typically trains a reward model from direct human feedback (e.g., pairwise comparisons). They are closely related, and insights from IRL may improve RLHF, especially for complex, open-ended domains.

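As an illustration of how pairwise feedback can map to scalar rewards, a Bradley-Terry-style model (a common choice in reward modeling, though not spelled out in this chapter) relates the probability that one response is preferred to the difference of their scalar rewards:

```python
import math

def preference_probability(reward_a, reward_b):
    """P(a preferred over b) under a Bradley-Terry model:
    a sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Equal rewards give a 50/50 preference; a larger reward gap pushes the
# preference probability toward 1.
print(preference_probability(2.0, 2.0))  # 0.5
print(preference_probability(3.0, 1.0))  # ~0.88
```

Training a reward model then amounts to fitting the scalar rewards so that these predicted probabilities match the observed human comparisons — which is exactly where intransitive or noisy judgments strain the model.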
What is pluralistic alignment and how is it addressed?
Pluralistic alignment aims to respect diverse preferences across populations rather than enforcing a single utility. Approaches include better datasets, explicit aggregation strategies, and personalization—each balancing fairness, coherence, and corrigibility trade-offs.
