Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is the dominant method for aligning language models with human preferences following pretraining. It bridges the gap between a model that predicts text and one that behaves helpfully, harmlessly, and honestly.

Motivation

A pretrained language model is a next-token predictor trained on internet text. Its outputs reflect the distribution of that data — including harmful, incorrect, and low-quality content. It has no intrinsic notion of helpfulness or user intent. RLHF provides a mechanism to shape model behaviour toward human-preferred outputs without requiring dense, token-level supervision.

The RLHF Pipeline

Supervised Fine-Tuning

The pretrained model is fine-tuned on a dataset of high-quality demonstrations: prompt-response pairs written or curated by humans. This bootstraps the model toward instruction-following behaviour and provides a starting point for preference learning.
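The SFT objective is ordinary next-token cross-entropy, usually masked so that only the response tokens (not the prompt) contribute to the loss. A minimal NumPy sketch, where the function name and masking convention are illustrative rather than taken from any particular library:

```python
import numpy as np

def sft_loss(logits, targets, loss_mask):
    """Cross-entropy averaged over response tokens only.

    logits:    (T, V) unnormalised next-token scores from the model
    targets:   (T,)   ground-truth next-token ids from the demonstration
    loss_mask: (T,)   1.0 for response tokens, 0.0 for prompt tokens
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Average only over the demonstrated response.
    return float((nll * loss_mask).sum() / loss_mask.sum())
```

Masking the prompt means the model is trained to produce the response, not to reproduce the user's input.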

Reward Model Training

Human annotators compare pairs of model outputs for the same prompt and indicate which they prefer. A reward model $r_\phi(x, y)$ is trained to predict this preference:

$$ \mathcal{L}_\text{RM}(\phi) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log\sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right] $$

where $y_w$ is the preferred (winning) response and $y_l$ the rejected (losing) response.
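In code, this Bradley-Terry-style objective reduces to a logistic loss on the score margin. A minimal sketch, assuming the two scores have already been computed as scalars (the function name is illustrative):

```python
import numpy as np

def rm_pairwise_loss(r_w, r_l):
    """-log sigma(r_w - r_l), averaged over a batch of preference pairs.

    r_w, r_l: array-like scalar scores for the preferred and rejected
    responses, i.e. r_phi(x, y_w) and r_phi(x, y_l).
    """
    margin = np.asarray(r_w, dtype=float) - np.asarray(r_l, dtype=float)
    # -log sigmoid(m) == log(1 + exp(-m)); logaddexp keeps this stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

The loss only depends on the score difference, so reward model scores are identifiable up to an additive constant.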

PPO Optimisation

The SFT model initialises a policy $\pi_\theta$ that is optimised using Proximal Policy Optimisation (PPO) to maximise expected reward:

$$ \mathcal{L}_\text{PPO}(\theta) = \mathbb{E}\!\left[r_\phi(x, y)\right] - \beta\, D_\text{KL}\!\left[\pi_\theta \,\|\, \pi_\text{SFT}\right] $$

The KL penalty keeps the policy close to the SFT model. Without it, the policy drifts into reward hacking: generating text that scores highly on the reward model but is low quality in ways the reward model fails to capture.

Limitations

  • Reward hacking: the policy exploits weaknesses in the reward model rather than genuinely satisfying human intent
  • Scalable oversight: human annotators cannot reliably evaluate outputs in domains requiring deep expertise
  • Annotation cost: collecting high-quality preference data at scale is expensive
  • Distribution shift: the reward model was trained on SFT outputs; it may misrank outputs from a more capable policy

Variants

| Variant | Key idea |
|---|---|
| DPO | Eliminates the explicit reward model; treats preference learning as classification directly on the policy |
| [[constitutional-ai|Constitutional AI]] / RLAIF | Replaces human labels with AI-generated labels guided by written principles |
| Process Reward Models | Reward intermediate reasoning steps, not only final outputs |
| Best-of-N sampling | Generates $N$ responses and selects the highest-scoring; simple but expensive |
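The DPO row can be made concrete: the policy's implicit reward for a response is $\beta(\log\pi_\theta(y) - \log\pi_\text{ref}(y))$, and training is logistic classification on the winner-loser margin, with no separate reward model. A minimal sketch, assuming per-response log-probabilities are already summed over tokens (argument names are illustrative):

```python
import numpy as np

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * margin) for one preference pair.

    Each argument is the summed log-probability of the winning (w) or
    losing (l) response under the policy (pi) or frozen reference (ref).
    """
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    # Numerically stable -log sigmoid(beta * margin).
    return float(np.logaddexp(0.0, -beta * margin))
```

As in RLHF proper, the reference model anchors the policy: the loss rewards raising the winner's likelihood only relative to how the reference scores it.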