RLHF — Deep Dive

The Mathematical Problem RLHF Solves

Language model pretraining optimizes for next-token prediction:

$$\mathcal{L}_{PT} = -\sum_t \log P_\theta(x_t | x_{<t})$$

This objective doesn’t care whether the output is helpful, honest, or harmless. It rewards fluency and distributional similarity to the training corpus. RLHF introduces a different objective: maximize human preference.
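To make the pretraining loss concrete, here is a minimal sketch that evaluates it for a short sequence. The function name and the probabilities are invented for illustration; the point is only that two sequences the model finds equally likely receive the same loss, whatever their usefulness.

```python
import math

def next_token_nll(token_probs):
    """Sum of -log P(x_t | x_<t) over one sequence.

    token_probs[t] is the probability the model assigned to the token
    that actually appeared at position t (made-up numbers here).
    """
    return -sum(math.log(p) for p in token_probs)

# Two hypothetical continuations with the same per-token probabilities
# in a different order: the loss cannot tell them apart, even if one
# is helpful and the other is not.
helpful = [0.5, 0.25, 0.5]
unhelpful = [0.25, 0.5, 0.5]
print(next_token_nll(helpful), next_token_nll(unhelpful))
```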

The InstructGPT paper (Ouyang et al., 2022) formalized this as a three-stage pipeline that has become the standard alignment training recipe.

Phase 1: Supervised Fine-Tuning

Starting from a pretrained LLM, SFT trains on a dataset $\mathcal{D}_{SFT} = \{(x_i, y_i)\}$ of prompt-response pairs curated by human demonstrators:

$$\mathcal{L}_{SFT} = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{SFT}} \sum_t \log \pi_\theta(y_t | x, y_{<t})$$

This is standard supervised learning — nothing new algorithmically. The value is in the data quality. OpenAI’s labeler guidelines specified:

  • Response style (conversational vs. formal based on prompt tone)
  • How to handle ambiguous instructions (ask for clarification vs. make assumptions)
  • Treatment of potentially harmful requests (outright refusal vs. conditional compliance vs. safe alternatives)
  • Format preferences (bullet points for lists, code blocks for code, appropriate length calibration)

The SFT model $\pi_{SFT}$ forms the starting point and reference baseline for all subsequent training.
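In the SFT loss, the sum runs over response tokens $y_t$ only; the prompt $x$ is conditioning context. A minimal sketch of that masking, assuming per-token log-probabilities have already been computed (the function name and inputs are hypothetical):

```python
import math

def sft_loss(token_logprobs, is_response):
    """Negative log-likelihood summed over response tokens only.

    token_logprobs[t]: log-probability of the observed token at position t.
    is_response[t]: True for tokens of the response y, False for prompt tokens.
    """
    return -sum(lp for lp, keep in zip(token_logprobs, is_response) if keep)

# Two prompt tokens followed by two response tokens (made-up probabilities).
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
print(sft_loss(logprobs, mask))  # only the last two tokens contribute
```

Averaging this quantity over a batch of demonstrations gives a Monte Carlo estimate of the expectation in the formula.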

Phase 2: Reward Model Architecture and Training

Comparison Data Collection

For each prompt $x$, multiple responses $\{y_1, y_2, \dots, y_k\}$ are sampled from $\pi_{SFT}$. Human raters produce a partial ordering over these responses.

Critically, comparisons are preferred over absolute ratings because:

  1. Anchoring effects — humans are poor at assigning absolute scores but good at relative judgments
  2. Inter-rater consistency — agreement is higher on comparisons than on numeric scales
  3. Data efficiency — $k$ responses per prompt yield $\binom{k}{2}$ comparison pairs
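The data-efficiency point can be sketched in a few lines. This example assumes the rater produced a strict best-to-worst ranking with no ties, which is a simplification of a partial ordering; the function name is hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of each pair
    is always the higher-ranked response.
    """
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs))  # 6 pairs from k = 4 responses, i.e. C(4, 2)
```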

Reward Model Training

The reward model $r_\phi$ is initialized from the SFT model with the final unembedding head replaced by a scalar output head. Given prompt $x$ and response $y$, it outputs a scalar reward.

Training uses a Bradley-Terry preference model:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{comp}} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

Where $y_w$ is the preferred response and $y_l$ is the less-preferred one. This loss pushes the reward difference to be large and positive for human-preferred responses.
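A minimal sketch of this loss on precomputed scalar rewards follows. It is not how a production reward model is trained (that would use an autodiff framework and batched tensors); the function names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(reward_pairs):
    """Mean Bradley-Terry loss over (r_w, r_l) pairs.

    r_w is the reward of the preferred response and r_l the reward of
    the less-preferred one, both for the same prompt.
    """
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs) / len(reward_pairs)

print(reward_model_loss([(0.0, 0.0)]))  # no preference expressed: log 2
print(reward_model_loss([(2.0, 0.0)]))  # agrees with the rater: lower loss
print(reward_model_loss([(0.0, 2.0)]))  # disagrees with the rater: higher loss
```

Only the difference between the two rewards enters the loss, so the reward model's absolute scale is not pinned down by this objective.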

OpenAI used a 6B parameter reward model for InstructGPT. Anthropic found that larger reward models correlate with better downstream policy quality, but with diminishing returns and increased cost.

Reward Model Reliability

A key failure mode is reward hacking: the policy learns to produce outputs that score well on the reward model but are actually poor. This happens because the reward model is trained on a finite distribution and can be fooled by out-of-distribution patterns.

Common reward hacking patterns:

  • Length inflation: The RM often prefers longer responses, so the policy learns to pad answers
  • Sycophancy: Models learn that agreeing with the human’s stated view scores higher
  • Formatting gaming: Overuse of bullet points, bold text, or headers that readers superficially prefer
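One rough way to check for length inflation is to measure how strongly reward scores track response length on a held-out batch. The sketch below is illustrative only: the `pearson` helper and the sample numbers are invented, and a high correlation is a reason to look closer, not proof of hacking, since longer answers are sometimes genuinely better.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical batch: response lengths in tokens and reward-model scores.
lengths = [40, 80, 120, 160]
rewards = [0.1, 0.4, 0.6, 0.9]
print(pearson(lengths, rewards))  # close to 1 means reward rises with length
```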

Phase 3: PPO Training

The Optimization Objective

The RL phase optimizes:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)} \left[r_\phi(x, y) - \beta \cdot \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{SFT}(y|x)]\right]$$

The KL term is crucial. Without it, PPO would optimize aggressively toward high-reward outputs, eventually collapsing to a distribution that fools the reward model but produces degenerate text (long, repetitive, stylistically extreme responses).

The coefficient $\beta$ is a hyperparameter controlling the alignment-capability tradeoff. Higher $\beta$ keeps the model closer to SFT behavior; lower $\beta$ allows more aggressive optimization toward the reward signal.

In practice, InstructGPT used $\beta \approx 0.02$ initially, adjusting per-run.
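The effect of $\beta$ is easy to see on a toy problem. The sketch below treats a single prompt with three possible responses, so the policy is just a categorical distribution; all numbers and function names are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL[p || q] for two categorical distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times the KL from the reference policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]   # stand-in for pi_SFT
rewards = [0.0, 1.0, 3.0]     # stand-in for r_phi on each response
greedy = [0.0, 0.0, 1.0]      # all probability on the top-reward response

for beta in (0.02, 5.0):
    print(beta,
          rlhf_objective(greedy, reference, rewards, beta),
          rlhf_objective(reference, reference, rewards, beta))
```

With the small $\beta$ the greedy policy scores higher than staying at the reference; with the large $\beta$ the KL penalty outweighs the extra reward and the reference wins.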

PPO Implementation Details

PPO (Schulman et al., 2017) is an actor-critic algorithm. In the RLHF context:

  • Actor: The LLM policy $\pi_\theta$, generating responses token-by-token
  • Critic: A value network predicting expected reward, often initialized from the reward model
  • Advantage estimation: GAE (Generalized Advantage Estimation) computes per-token advantages
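As a rough illustration of the advantage-estimation step, here is a plain-Python GAE for a single finished response. The default `gamma` and `lam` values are assumptions chosen for the example, not settings taken from InstructGPT, and the value after the last token is assumed to be zero.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one completed response.

    rewards[t]: reward received at token t.
    values[t]: critic's value estimate at token t.
    Returns one advantage per token.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # episode ends after the last token
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at position t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Sparse reward at the final token, constant critic estimate of 0.5.
print(gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], lam=1.0))
```

With `lam=1.0` and `gamma=1.0` each advantage reduces to the remaining return minus the critic's estimate; with `lam=0.0` it reduces to the one-step TD error.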

One practical complication: many standard RL benchmarks provide a reward at every step, whereas an LLM receives a sparse reward, one scalar at the end of a full response. A per-token KL penalty supplies a dense signal at every position, and reward shaping (assigning partial credit to intermediate tokens) is another option.

Token-level KL penalty:

$$r_t' = r_t - \beta \log \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{SFT}(y_t | x, y_{<t})}$$

Here $r_t = r_\phi(x, y)$ at the final token and $r_t = 0$ at every earlier position, so intermediate tokens receive only the KL penalty.
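The per-token reward can be sketched as follows, assuming the log-probabilities of each sampled token under the policy and under the SFT reference are already available. The function name is hypothetical and the response is assumed to be non-empty.

```python
def shaped_rewards(policy_logprobs, ref_logprobs, final_reward, beta=0.02):
    """Per-token rewards: KL penalty everywhere, scalar reward on the last token.

    policy_logprobs[t]: log pi_theta(y_t | x, y_<t)
    ref_logprobs[t]:    log pi_SFT(y_t | x, y_<t)
    final_reward:       r_phi(x, y) from the reward model
    """
    rewards = [-beta * (lp - ref) for lp, ref in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += final_reward
    return rewards

# The policy agrees with the reference on token 0 and is more confident on token 1.
print(shaped_rewards([-1.0, -2.0], [-1.0, -2.5], final_reward=1.0, beta=0.1))
```

Tokens where the policy assigns more probability than the reference are penalized, and tokens where it assigns less receive a small bonus.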

Engineering Scale

The PPO update requires forward passes through both the actor and critic for each sampled response, and scoring each response additionally requires the reward model and the frozen SFT reference model for the KL term. For a 175B parameter model, this is enormously compute-intensive. OpenAI’s InstructGPT training ran for several thousand PPO steps, each involving thousands of prompt-response pairs.

Open-source libraries such as Hugging Face's TRL (Transformer Reinforcement Learning) and, later, DeepSpeed-Chat made this more accessible to researchers without GPT-3-scale compute.

DPO: Simplifying the RLHF Pipeline

Direct Preference Optimization (Rafailov et al., 2023) showed that the reward model and RL training can be collapsed into a single supervised objective.

The key insight: under the optimal policy for the RLHF objective, the reward function can be re-expressed in terms of the policy itself:

$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{SFT}(y|x)} + \beta \log Z(x)$$

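Substituting this expression into the Bradley-Terry loss, the $\beta \log Z(x)$ term cancels because it depends only on $x$ and appears in both rewards being compared. What remains is a preference objective written directly in terms of the policy: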

$$\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{SFT}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{SFT}(y_l|x)}\right)\right]$$
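For a single preference pair, the DPO loss needs only four sequence log-probabilities. The sketch below assumes those have been computed elsewhere (summed over response tokens); the function names and the default `beta=0.1` are illustrative choices, not values from the source.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    policy_w, policy_l: log pi_theta(y_w | x) and log pi_theta(y_l | x)
    ref_w, ref_l:       log pi_SFT(y_w | x) and log pi_SFT(y_l | x)
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -log_sigmoid(margin)

# Policy identical to the reference: the margin is zero and the loss is log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the preferred response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The quantity `beta * (policy - ref)` plays the role of an implicit reward, which is why no separate reward model appears.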

DPO is simpler to implement (no separate reward model, no PPO loop), is often more stable to train, and was reported to match or exceed PPO-based RLHF on the tasks evaluated in the original paper. By 2024, many open-source fine-tuned models (for example Zephyr and Tülu 2) used DPO or variants like IPO and KTO rather than full RLHF. Llama 2 Chat, by contrast, was trained with rejection sampling and PPO.

Variants and Extensions

RLAIF (Bai et al., 2022): Replace human raters with AI feedback, using a separate “evaluator” model to generate preference labels. Anthropic used this for Constitutional AI — the model critiques its own outputs against a set of principles.

Rejection Sampling Fine-Tuning: Sample many responses, filter to only the top-k by reward score, then fine-tune on those. Simpler than PPO; Llama 2 used this in combination with PPO.
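The selection step of rejection sampling can be sketched as below. The `toy_reward` scorer is a made-up stand-in for a trained reward model $r_\phi$, and sampling the candidates from the policy is assumed to have happened already.

```python
def select_top_k(prompt, candidates, reward_fn, k=1):
    """Keep the k highest-reward responses as new fine-tuning examples."""
    ranked = sorted(candidates, key=lambda y: reward_fn(prompt, y), reverse=True)
    return [(prompt, y) for y in ranked[:k]]

def toy_reward(prompt, response):
    # Hypothetical scorer: counts response words that also occur in the prompt.
    prompt_words = set(prompt.lower().split())
    return sum(word in prompt_words for word in response.lower().split())

prompt = "explain gradient descent"
candidates = [
    "I cannot help.",
    "Gradient descent follows the negative gradient.",
    "Descent is a movie.",
]
print(select_top_k(prompt, candidates, toy_reward, k=1))
```

The returned (prompt, response) pairs would then feed an ordinary supervised fine-tuning step, which is what makes the method simpler than PPO.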

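SPIN (Self-Play Fine-Tuning; Chen et al., 2024): The model generates its own responses to the SFT prompts and is trained to distinguish the human-written SFT responses from the generations of its previous iteration. It needs no preference labels beyond the original SFT data.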

Iterative RLHF: Run multiple rounds of preference data collection on the already-aligned model. Each round targets remaining failure modes. Anthropic’s Claude training uses multiple alignment iterations.

What RLHF Doesn’t Fix

RLHF improves helpfulness and reduces obvious harmful outputs, but doesn’t solve:

  • Hallucination: The model still generates plausible-sounding false statements. Humans rating outputs often can’t detect factual errors.
  • Sycophancy: Models learn that agreement scores well. In 2023–24, multiple papers documented that RLHF-trained models systematically flip their answers when users push back, even when the model was originally correct.
  • Value specification: RLHF encodes the preferences of a specific group of raters at a specific time. OpenAI’s InstructGPT labelers were a small pool of English-speaking contractors, which the paper itself notes is not representative of everyone who uses the models. Bias is baked in.
  • Distributional shift: The reward model is accurate near its training distribution. Novel jailbreaks and unusual prompts can exploit blind spots.

One thing to remember: RLHF is an engineering approximation of alignment, not a solution to it. It makes models dramatically more usable while leaving open deeper questions about whose preferences are being encoded and whether those preferences actually represent good values.

