RLHF — Deep Dive

The full technical picture of RLHF: reward modeling, PPO implementation details, KL penalties, mode collapse risks, and why DPO is replacing the original approach.

The Mathematical Problem RLHF Solves

Language model pretraining optimizes for next-token prediction:

$$\mathcal{L}_{PT} = -\sum_t \log P_\theta(x_t | x_{<t})$$

This objective doesn’t care whether the output is helpful, honest, or harmless. It rewards fluency and distributional similarity to the training corpus. RLHF introduces a different objective: maximize human preference.

The InstructGPT paper (Ouyang et al., 2022) formalized this as a three-stage pipeline that has become the standard alignment training recipe.

Phase 1: Supervised Fine-Tuning

Starting from a pretrained LLM, SFT trains on a dataset $\mathcal{D}_{SFT} = \{(x_i, y_i)\}$ of prompt-response pairs curated by human demonstrators:

$$\mathcal{L}_{SFT} = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{SFT}} \sum_t \log \pi_\theta(y_t | x, y_{<t})$$

This is standard supervised learning — nothing new algorithmically. The value is in the data quality. OpenAI’s labeler guidelines specified:

Response style (conversational vs. formal based on prompt tone)
How to handle ambiguous instructions (ask for clarification vs. make assumptions)
Treatment of potentially harmful requests (outright refusal vs. conditional compliance vs. safe alternatives)
Format preferences (bullet points for lists, code blocks for code, appropriate length calibration)

The SFT model $\pi_{SFT}$ forms the starting point and reference baseline for all subsequent training.
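The SFT loss above can be sketched in a few lines. This is a minimal illustration, assuming the per-token log-probabilities have already been computed by the model; the function name `sft_loss`, the mask convention, and the numbers are hypothetical. It shows that only response tokens $y_t$ contribute to the sum, while prompt tokens are masked out.

```python
import math

def sft_loss(token_log_probs, response_mask):
    """Negative log-likelihood summed over response tokens only.

    token_log_probs[t] is log pi_theta(token_t | previous tokens).
    response_mask[t] is 1 for response tokens y_t and 0 for prompt tokens.
    """
    if len(token_log_probs) != len(response_mask):
        raise ValueError("log-probs and mask must have the same length")
    return -sum(lp * m for lp, m in zip(token_log_probs, response_mask))

# Made-up 5-token sequence: 2 prompt tokens followed by 3 response tokens.
log_probs = [math.log(0.9), math.log(0.8),
             math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(log_probs, mask)  # depends only on the last three entries
```

In a real training loop the same masking is applied to a batch of logits and averaged over examples; the scalar version here is only meant to mirror the formula.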

Phase 2: Reward Model Architecture and Training

Comparison Data Collection

For each prompt $x$, multiple responses $\{y_1, y_2, \ldots, y_k\}$ are sampled from $\pi_{SFT}$. Human raters produce a partial ordering over these responses.

Critically, comparisons are preferred over absolute ratings because:

Anchoring effects — humans are poor at assigning absolute scores but good at relative judgments
Inter-rater consistency — agreement is higher on comparisons than on numeric scales
Data efficiency — $k$ responses per prompt yield $\binom{k}{2}$ comparison pairs
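The data-efficiency point can be made concrete with a short sketch. It assumes the rater returns a strict best-to-worst ranking with no ties; the helper name `ranking_to_pairs` and the response labels are hypothetical.

```python
from itertools import combinations
from math import comb

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of each pair
    always ranks above the second.
    """
    return list(combinations(ranked_responses, 2))

# One prompt, k = 4 sampled responses, already ranked by a rater.
ranked = ["resp_a", "resp_b", "resp_c", "resp_d"]
pairs = ranking_to_pairs(ranked)
num_pairs = len(pairs)  # equals comb(4, 2)
```

Pairs that come from the same ranking are correlated, so how they are batched during reward model training is a design choice worth considering.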

Reward Model Training

The reward model $r_\phi$ is initialized from the SFT model with the final unembedding head replaced by a scalar output head. Given prompt $x$ and response $y$, it outputs a scalar reward.

Training uses a Bradley-Terry preference model:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{comp}} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

Where $y_w$ is the preferred response and $y_l$ is the less-preferred one. This loss pushes the reward difference to be large and positive for human-preferred responses.
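A minimal sketch of this loss, assuming the reward model has already produced scalar scores for each pair. The function names are illustrative and the scores in the usage lines are made up.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(score_pairs):
    """Mean Bradley-Terry loss over (r_w, r_l) pairs.

    r_w is the reward for the preferred response, r_l for the other one.
    """
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs) / len(score_pairs)

# The loss falls as the preferred response is scored further above the other.
tied = reward_model_loss([(0.0, 0.0)])       # equals log(2)
correct = reward_model_loss([(2.0, 0.0)])    # smaller than the tied case
inverted = reward_model_loss([(0.0, 2.0)])   # larger than the tied case
```

Only the difference between the two scores enters the loss, so the reward model's absolute scale is not pinned down by this objective.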

OpenAI used a 6B parameter reward model for InstructGPT. Anthropic found that larger reward models correlate with better downstream policy quality, but with diminishing returns and increased cost.

Reward Model Reliability

A key failure mode is reward hacking: the policy learns to produce outputs that score well on the reward model but are actually poor. This happens because the reward model is trained on a finite distribution and can be fooled by out-of-distribution patterns.

Common reward hacking patterns:

Length inflation: The RM often prefers longer responses, so the policy learns to pad answers
Sycophancy: Models learn that agreeing with the human’s stated view scores higher
Formatting gaming: Overuse of bullet points, bold text, or headers that readers superficially prefer
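One simple diagnostic for the length-inflation pattern is to check how strongly reward correlates with response length on a held-out sample. The sketch below is an assumption about how one might do this, not a procedure taken from any of the cited papers, and the numbers are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)

# Made-up token counts and reward model scores for five sampled responses.
lengths = [50, 120, 200, 310, 400]
rewards = [0.1, 0.4, 0.5, 0.9, 1.2]
length_bias = pearson(lengths, rewards)
```

A high correlation does not prove reward hacking, since longer answers can be better, but a value that rises over the course of training is a reason to inspect samples by hand.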

Phase 3: PPO Training

The Optimization Objective

The RL phase optimizes:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot | x)} \left[r_\phi(x, y) - \beta \cdot \text{KL}\left[\pi_\theta(y|x) \,\|\, \pi_{SFT}(y|x)\right]\right]$$

The KL term is crucial. Without it, PPO would optimize aggressively toward high-reward outputs, eventually collapsing to a distribution that fools the reward model but produces degenerate text (long, repetitive, stylistically extreme responses).

The coefficient $\beta$ is a hyperparameter controlling the alignment-capability tradeoff. Higher $\beta$ keeps the model closer to SFT behavior; lower $\beta$ allows more aggressive optimization toward the reward signal.

In practice, the InstructGPT paper reports $\beta = 0.02$; the right value depends on the scale of the reward model's outputs and is typically tuned per setup.
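The effect of $\beta$ is easiest to see on a toy problem. For a finite set of candidate responses, the KL-regularized objective has the closed-form maximizer $\pi^*(y|x) \propto \pi_{SFT}(y|x)\exp(r(x,y)/\beta)$. The sketch below assumes three candidate responses with made-up reference probabilities and rewards; all names are illustrative.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Maximizer of E[r] - beta * KL(pi || ref) over a finite response set.

    Assumes every reference probability is strictly positive.
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [0.7, 0.2, 0.1]      # reference (SFT) probabilities, made up
rewards = [0.0, 1.0, 2.0]  # reward model scores, made up

for beta in (10.0, 1.0, 0.1):
    policy = kl_regularized_optimum(ref, rewards, beta)
    print(beta, [round(p, 3) for p in policy],
          round(kl_divergence(policy, ref), 3))
```

With a large $\beta$ the optimum stays close to the reference distribution; as $\beta$ shrinks, probability mass concentrates on the highest-reward response and the KL divergence from the reference grows.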

PPO Implementation Details

PPO (Schulman et al., 2017) is an actor-critic algorithm. In the RLHF context:

Actor: The LLM policy $\pi_\theta$, generating responses token-by-token
Critic: A value network predicting expected reward, often initialized from the reward model
Advantage estimation: GAE (Generalized Advantage Estimation) computes per-token advantages
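The advantage computation can be sketched as a backward pass over one response. This is a minimal illustration of GAE, assuming per-token rewards and critic values are already available; the defaults `gamma=1.0` and `lam=0.95` are illustrative choices, not values taken from the InstructGPT paper.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for a single response.

    rewards[t] is the reward at token t and values[t] is the critic's
    estimate at token t. The value after the final token is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error at token t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse reward at the last token, critic predicting 0.5 everywhere.
adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

Setting `lam=0` reduces each advantage to the one-step TD error, while `lam=1` gives the Monte Carlo return minus the value baseline.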

One practical complication: many standard RL benchmarks provide a dense reward signal (a score at each step), whereas an LLM receives a sparse reward: one scalar at the end of a full response. Techniques like reward shaping (assigning partial credit to intermediate tokens) and a KL penalty applied at each token address this.

Token-level KL penalty:

$$r_t' = r_t - \beta \log \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{SFT}(y_t | x, y_{<t})}$$

Here $r_t$ is zero for every token except the last, which receives the scalar reward $r_\phi(x, y)$. Intermediate tokens therefore receive only the KL penalty.
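The per-token rewards fed to PPO can be assembled as follows. This is a sketch under the assumption that the log-probabilities of the sampled tokens under both the policy and the SFT reference are already available; the function name and the numbers are hypothetical.

```python
def shaped_rewards(policy_log_probs, ref_log_probs, final_reward, beta):
    """Per-token rewards r'_t: a KL penalty at every token, plus the
    reward model's scalar score added at the last token only."""
    if not policy_log_probs or len(policy_log_probs) != len(ref_log_probs):
        raise ValueError("need two non-empty sequences of equal length")
    rewards = [-beta * (lp - lr)
               for lp, lr in zip(policy_log_probs, ref_log_probs)]
    rewards[-1] += final_reward
    return rewards

# Two-token response: the policy is more confident than the reference on
# the first token and matches the reference on the second.
per_token = shaped_rewards([-1.0, -2.0], [-1.5, -2.0],
                           final_reward=3.0, beta=0.1)
```

A token the policy likes more than the reference does is penalized, and a token it likes less receives a small bonus, which pulls the policy back toward $\pi_{SFT}$.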

Engineering Scale

Each PPO iteration requires generating responses with the actor and then running forward passes through the actor, the critic, the frozen reward model, and the frozen SFT reference policy for every sampled response. For a 175B parameter model, this is enormously compute-intensive. OpenAI’s InstructGPT training ran for several thousand PPO steps, each involving thousands of prompt-response pairs.

Open-source libraries such as Hugging Face’s TRL (Transformer Reinforcement Learning) and, later, DeepSpeed-Chat made this more accessible to researchers without GPT-3-scale compute.

DPO: Simplifying the RLHF Pipeline

Direct Preference Optimization (Rafailov et al., 2023) showed that the reward model and RL training can be collapsed into a single supervised objective.

The key insight: under the optimal policy for the RLHF objective, the reward function can be re-expressed in terms of the policy itself:

$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{SFT}(y|x)} + \beta \log Z(x)$$

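Here $Z(x)$ is a normalizing constant that depends only on the prompt, so it cancels when the rewards of two responses to the same prompt are subtracted inside the Bradley-Terry loss. This lets you write the preference learning objective directly in terms of the policy: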

$$\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{SFT}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{SFT}(y_l|x)}\right)\right]$$

DPO is simpler to implement (no separate reward model, no PPO) and is generally more stable to train, and the original paper reports results comparable to or better than PPO-based RLHF on its benchmarks. By 2024, many open-weight chat models (for example Zephyr, Tulu 2 DPO, and Mixtral Instruct) used DPO or variants like IPO and KTO rather than PPO-based RLHF.
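The DPO loss for a single preference pair can be sketched directly from the formula. The example assumes sequence-level log-probabilities (sums of per-token log-probs) under the trainable policy and the frozen SFT reference; the function names and the default `beta=0.1` are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of responses.

    Each argument is log pi(y | x) summed over the tokens of a response.
    """
    chosen_log_ratio = policy_logp_w - ref_logp_w
    rejected_log_ratio = policy_logp_l - ref_logp_l
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))

# At initialization the policy equals the reference, so the loss is log(2).
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the preferred response's likelihood relative to the reference
# lowers the loss.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

The quantity inside the sigmoid is the difference of two implicit rewards $\beta \log \frac{\pi_\theta(y|x)}{\pi_{SFT}(y|x)}$, which is why no separate reward model is needed.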

Variants and Extensions

RLAIF (Bai et al., 2022): Replace human raters with AI feedback, using a separate “evaluator” model to generate preference labels. Anthropic used this for Constitutional AI — the model critiques its own outputs against a set of principles.

Rejection Sampling Fine-Tuning: Sample many responses, filter to only the top-k by reward score, then fine-tune on those. Simpler than PPO; Llama 2 used this in combination with PPO.
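The selection step of rejection sampling can be sketched as below. The reward function here is a deliberately trivial stand-in (it counts distinct words) so the example runs on its own; in practice it would be the learned reward model $r_\phi$. All names are hypothetical.

```python
def select_top_k(prompt, candidates, reward_fn, k=1):
    """Keep the k highest-scoring responses as new fine-tuning examples."""
    ranked = sorted(candidates, key=lambda y: reward_fn(prompt, y), reverse=True)
    return [(prompt, y) for y in ranked[:k]]

def toy_reward(prompt, response):
    """Placeholder for a reward model: the number of distinct words."""
    return float(len(set(response.split())))

# Candidate responses would normally be sampled from the current policy.
samples = ["a a a", "a b c d", "a b"]
sft_examples = select_top_k("p", samples, toy_reward, k=2)
```

The selected pairs are then used for ordinary supervised fine-tuning, so no policy-gradient machinery is needed for this stage.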

SPIN (Self-Play Fine-Tuning; Chen et al., 2024): The model is trained to distinguish the human-written responses in its SFT dataset from responses generated by its own previous iteration, so it needs no preference labels beyond the original demonstrations.

Iterative RLHF: Run multiple rounds of preference data collection on the already-aligned model. Each round targets remaining failure modes. Anthropic’s Claude training uses multiple alignment iterations.

What RLHF Doesn’t Fix

RLHF improves helpfulness and reduces obvious harmful outputs, but doesn’t solve:

Hallucination: The model still generates plausible-sounding false statements. Humans rating outputs often can’t detect factual errors.
Sycophancy: Models learn that agreement scores well. In 2023–24, multiple papers documented that RLHF-trained models systematically flip their answers when users push back, even when the model was originally correct.
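Value specification: RLHF encodes the preferences of a specific group of raters at a specific time. The InstructGPT paper notes that its roughly 40 labelers were mostly English-speaking contractors based in the United States or Southeast Asia, so their judgments are not representative of all users. Bias is baked in.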
Distributional shift: The reward model is accurate near its training distribution. Novel jailbreaks and unusual prompts can exploit blind spots.

One thing to remember: RLHF is an engineering approximation of alignment, not a solution to it. It makes models dramatically more usable while leaving open deeper questions about whose preferences are being encoded and whether those preferences actually represent good values.
