RLHF — Core Concepts

The three-phase training loop that turned language models into AI assistants — and why human preferences are surprisingly hard to automate.

Why Language Models Needed a Second Training Phase

Large language models (LLMs) are trained on massive text datasets — billions of web pages, books, and code. This gives them broad knowledge and fluent text generation. But there’s a fundamental mismatch: the internet is not curated for helpfulness.

Training on internet text teaches a model to mimic all writing styles, including unhelpful, verbose, biased, or outright wrong ones. A model that excels at predicting the next word in a Reddit thread doesn’t automatically know how to give someone a clear medication dosage answer.

RLHF (Reinforcement Learning from Human Feedback) is the technique OpenAI used to bridge this gap. The method itself predates chat assistants, but InstructGPT (the precursor to ChatGPT), published in early 2022, was the first major instruction-following model trained this way. The results were striking: human raters preferred outputs from a 1.3 billion parameter InstructGPT model over outputs from the 175 billion parameter GPT-3, despite InstructGPT having over 100x fewer parameters.

The Three Phases

Phase 1: Supervised Fine-Tuning (SFT)

The base LLM is fine-tuned on a curated dataset of prompt-response pairs written by human contractors. These are examples of ideal responses — clear, helpful, honest, appropriately scoped.

OpenAI employed a team of labelers who followed detailed guidelines covering how to handle sensitive topics, when to decline requests, and how to format responses for different question types. An SFT dataset is typically on the order of tens of thousands of examples; InstructGPT used roughly 13,000 training prompts.

This phase gives the model a baseline behavior shift — it starts trying to be helpful rather than just completing text patterns.
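
A minimal sketch of the Phase 1 objective, assuming the common setup in which the loss is next-token cross-entropy computed only over the response tokens, with the prompt tokens masked out. The function name `sft_loss` and the toy probabilities are hypothetical; a real pipeline would compute this from model logits inside a deep learning framework.

```python
import math

def sft_loss(token_probs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_probs:   probability the model assigned to each correct next token
    response_mask: 1 where the token belongs to the labeler-written response,
                   0 where it belongs to the prompt (assumed to be masked out)
    """
    losses = [-math.log(p) for p, m in zip(token_probs, response_mask) if m == 1]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Toy example: two prompt tokens followed by three response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(probs, mask)
print(f"SFT loss on the response tokens: {loss:.4f}")
```

Lowering this loss means assigning higher probability to the words the labeler actually wrote, which is what shifts the model toward the demonstrated behavior.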

Phase 2: Reward Model Training

Now you need a way to automatically evaluate whether a response is good. This is where the reward model (RM) comes in.

Human raters are shown the same prompt with multiple AI-generated responses, and they rank them from best to worst. You might have 4–9 responses per prompt, and the rater orders them: response B > response A > response D > response C.
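
One reason a single ranking is so useful: a ranking of K responses implies K(K-1)/2 pairwise comparisons, so 4 responses yield 6 pairs and 9 responses yield 36. The sketch below expands the example ordering from the text into pairs, assuming a strict ranking with no ties. The helper name `ranking_to_pairs` is hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one the rater ranked higher.
    return list(combinations(ranking, 2))

ranking = ["B", "A", "D", "C"]  # B > A > D > C, as in the example above
pairs = ranking_to_pairs(ranking)
print(pairs)
print(len(pairs))
```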

These ranking comparisons become training data for a separate neural network, the reward model. It learns to assign each response a single numeric score such that responses humans preferred score higher than the ones they rejected. After training on tens of thousands of comparisons, the RM can score new responses without human involvement.
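
The text does not say which loss the reward model is trained with. A common choice, assumed here, is a pairwise logistic (Bradley-Terry style) loss: the negative log of the sigmoid of the score difference between the preferred and the rejected response. The function name and the scores are illustrative only.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    diff = score_preferred - score_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

agree = pairwise_preference_loss(2.0, 0.0)     # RM agrees with the rater
tie = pairwise_preference_loss(1.0, 1.0)       # RM cannot tell them apart
disagree = pairwise_preference_loss(0.0, 2.0)  # RM prefers the rejected one
print(agree, tie, disagree)
```

The loss is small when the reward model already ranks the pair the way the rater did and grows when it ranks them the other way, so minimizing it pushes the scores toward the human ordering.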

This is crucial: ranking is much easier than writing. It’s faster for humans to say “this answer is better than that one” than to write the perfect answer from scratch.

Phase 3: Reinforcement Learning (PPO)

Now you use the reward model to further train the LLM via reinforcement learning — specifically, an algorithm called Proximal Policy Optimization (PPO).

The LLM generates a response. The reward model scores it. That score becomes a reward signal. PPO updates the LLM’s weights to make high-scoring responses more likely.
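
As a rough illustration of the update rule, here is the per-sample clipped surrogate objective that gives PPO its name. It is a sketch, not a training loop: the advantage (how much better the response scored than expected) is assumed to be supplied by the caller, and `clip_eps=0.2` is an illustrative default rather than a value taken from the text.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized)."""
    # Probability ratio between the updated policy and the policy that
    # generated the sample.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio far
    # outside the [1 - clip_eps, 1 + clip_eps] band in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)

# A good response whose probability rose by 50%: the gain is capped at 1.2x.
capped_gain = ppo_clipped_objective(math.log(0.3), math.log(0.2), advantage=1.0)
# A bad response whose probability was halved: the objective is capped at 0.8x.
capped_drop = ppo_clipped_objective(math.log(0.1), math.log(0.2), advantage=-1.0)
print(capped_gain, capped_drop)
```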

There’s a critical constraint: the RL training includes a KL divergence penalty that prevents the model from drifting too far from the SFT baseline. Without this, the model would “game” the reward model by finding edge cases that score well but are actually terrible responses, a failure often called reward hacking. The penalty keeps it grounded.
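
The constraint can be sketched as a shaped reward: the reward model's score minus a coefficient times an estimate of how far the current policy has moved from the SFT model on this response. The coefficient `beta=0.1` and all numbers below are hypothetical, and the log-probability difference is a single-sample estimate of the KL term rather than the full divergence.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting from the SFT model.

    logp_policy / logp_sft: log-probability of the sampled response under
    the current policy and under the frozen SFT model.
    """
    kl_estimate = logp_policy - logp_sft  # single-sample estimate of the KL term
    return rm_score - beta * kl_estimate

# A response the RM loves but the SFT model finds wildly unlikely
# (a candidate reward hack) versus a slightly lower-scoring, ordinary one.
gamed = kl_penalized_reward(rm_score=5.0, logp_policy=-10.0, logp_sft=-40.0)
ordinary = kl_penalized_reward(rm_score=4.0, logp_policy=-20.0, logp_sft=-21.0)
print(gamed, ordinary)
```

With these toy numbers the ordinary response ends up with the higher shaped reward even though its raw score is lower, which is the grounding effect described above.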

The Human Labeler Problem

RLHF sounds clean in theory but gets messy in practice. The quality of the reward model depends entirely on the quality and consistency of human raters.

Different people have different values. A rater in one country might consider a direct refusal appropriate; another might find it unhelpful. OpenAI’s labeler guidelines ran to hundreds of pages trying to standardize preferences. Scale AI and Surge AI built businesses around providing trained rater pools for exactly this kind of work.

In January 2023, TIME reported that some Kenyan contractors labeling harmful content for OpenAI through an outsourcing firm were paid less than $2/hour and exposed to disturbing content without adequate support. The human infrastructure behind RLHF carries real ethical weight.

Common Misconception: RLHF Doesn’t Create Values

A frequent misunderstanding is that RLHF “teaches the AI what’s right and wrong.” It doesn’t. It teaches the model to produce outputs that a specific group of human raters, following specific guidelines, at a specific point in time, tend to prefer.

That’s not ethics — it’s preference learning. The distinction matters enormously for questions about AI safety and bias.

What Came After RLHF

RLHF is expensive and complex. From late 2022 onward, researchers developed alternatives and variants:

RLAIF (RL from AI Feedback): Use another AI model as the rater instead of humans — cheaper but inherits that model’s biases.
DPO (Direct Preference Optimization): A mathematical reformulation that trains the model directly on preference pairs, achieving similar results without a separate reward model or an RL loop. Since its publication in 2023, many models, particularly open-weight ones, have used DPO or variants.
Constitutional AI (Anthropic): The model critiques and revises its own outputs against a set of principles before the preference learning stage.
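
To make the DPO entry above concrete, here is a sketch of its per-pair loss. It compares how much the policy has raised the log-probability of the chosen response, relative to a frozen reference model, against how much it has raised the rejected one. The function name, `beta=0.1`, and the log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen shift - rejected shift)))."""
    chosen_shift = logp_chosen - ref_logp_chosen        # policy vs reference
    rejected_shift = logp_rejected - ref_logp_rejected  # policy vs reference
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as a numerically stable softplus.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference has been learned yet.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(untrained, improved)
```

The reference model plays the role the KL penalty plays in PPO-based RLHF: the loss rewards shifts relative to the starting point, not absolute probabilities.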

RLHF opened the door; the field has been improving the approach ever since.

One thing to remember: RLHF’s real innovation wasn’t a new algorithm — it was figuring out how to encode human preferences at scale so they could be used as a training signal.
