Reward Modeling — Core Concepts

Bradley-Terry preference model, reward model architecture, calibration vs. ranking, reward hacking taxonomy, and the move from outcome to process reward models.

What a Reward Model Does

A reward model $r_\phi(x, y)$ maps a (prompt, response) pair to a scalar score. Higher score = better response according to the reward model.

This score is used in two ways:

As a training signal: During PPO fine-tuning, the reward model’s score for each generated response serves as the reinforcement learning reward.
As a filter: During inference, multiple responses are sampled and the one with the highest reward model score is selected (best-of-N sampling).
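
The filtering use can be sketched in a few lines. This is a minimal illustration, not a production sampler: `best_of_n` and `toy_reward` are hypothetical names, and `toy_reward` is only a stand-in for a trained reward model.

```python
from typing import Callable, List


def best_of_n(prompt: str,
              candidates: List[str],
              reward_fn: Callable[[str, str], float]) -> str:
    """Return the candidate response with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda y: reward_fn(prompt, y))


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts distinct words."""
    return float(len(set(response.lower().split())))


samples = ["Paris.", "The capital of France is Paris.", "I am not sure."]
best = best_of_n("What is the capital of France?", samples, toy_reward)
print(best)
```

With a real reward model, `reward_fn` would run a forward pass on the (prompt, response) pair; the selection logic stays the same.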

A good reward model produces scores that agree with human preferences: if humans prefer response A over B for prompt $x$, then $r_\phi(x, A) > r_\phi(x, B)$.

Training Data: Preference Comparisons

Reward models are trained on comparison data: for each prompt $x$, a human rater sees two responses $(y_w, y_l)$ and indicates which they prefer.

This produces a dataset $\mathcal{D} = \{(x, y_w, y_l)\}$ where $y_w$ is preferred over $y_l$.

Why comparisons, not ratings? Humans are poor at absolute ratings on scales like 1–10 (what does “7/10” mean?). Comparisons are more reliable — “response A is better than response B” is clearer and shows higher inter-rater agreement.

The annotation process: OpenAI’s labelers for InstructGPT followed detailed written guidelines specifying how to handle sensitive topics, how to assess factual accuracy, what formatting preferences to apply, and how to handle ambiguous requests. In effect, the guidelines define the “values” that the reward model encodes.

The Bradley-Terry Model

The standard reward model training objective uses the Bradley-Terry preference model:

$$P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

Where $\sigma$ is the sigmoid function and $\succ$ means “preferred over.”

Training loss (binary cross-entropy on preferences):

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

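Minimizing this loss maximizes the log-likelihood of the observed preferences under the Bradley-Terry model, which pushes the score of each preferred response above the score of its rejected counterpart.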
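
As a rough sketch of the loss above, the following pure-Python version computes it over a batch of score pairs. The function names are illustrative, and a real training loop would compute the same quantity on tensors so that gradients flow back into $r_\phi$.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(chosen_scores, rejected_scores) -> float:
    """Mean negative log-likelihood that each chosen response beats its rejected one."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return -sum(log_sigmoid(m) for m in margins) / len(margins)


loss_tied = bradley_terry_loss([0.0], [0.0])    # equal scores: loss is log(2)
loss_good = bradley_terry_loss([2.0], [-1.0])   # preferred response scored higher
loss_bad = bradley_terry_loss([-1.0], [2.0])    # preferred response scored lower
```

The loss shrinks toward zero as the margin between preferred and rejected scores grows, and it grows roughly linearly when the ordering is wrong.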

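Normalization: The loss depends only on score differences, so the absolute level of the rewards is not pinned down: adding the same constant to every score leaves the loss unchanged. In practice, scores are often shifted or normalized (for example, to have zero mean over a reference set) to keep them in a stable range.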
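
This invariance can be read off the Bradley-Terry probability directly. For any constant $c$ added to every score:

$$\sigma\big((r_\phi(x, y_w) + c) - (r_\phi(x, y_l) + c)\big) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

The constant cancels inside the sigmoid, so the training objective cannot distinguish between reward functions that differ only by an offset.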

Architecture

Reward models are typically initialized from a pretrained LLM, often the same model that is being aligned. The language-modeling head is replaced by a linear head that maps a hidden state to a scalar score.

For GPT-style (decoder-only) models, the reward is computed from the hidden state of the final token (EOS). For encoder models such as BERT, it is computed from the [CLS] token or a pooled representation.
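
The scalar head for the decoder-only case can be sketched as follows. This is an illustration under stated assumptions: hidden states are plain lists, sequences are right-padded, and `reward_from_hidden_states`, `head_weight`, and `head_bias` are hypothetical names rather than any library's API.

```python
from typing import Sequence


def reward_from_hidden_states(hidden_states: Sequence[Sequence[float]],
                              attention_mask: Sequence[int],
                              head_weight: Sequence[float],
                              head_bias: float = 0.0) -> float:
    """Apply a linear head to the hidden state of the last non-padding token."""
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("sequence has no non-padding tokens")
    last_hidden = hidden_states[real_positions[-1]]
    if len(last_hidden) != len(head_weight):
        raise ValueError("head_weight must match the hidden size")
    return sum(w * h for w, h in zip(head_weight, last_hidden)) + head_bias


# Toy example: 4 positions, hidden size 2; the last position is padding.
hidden = [[0.1, 0.2], [0.3, 0.4], [1.0, -1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
score = reward_from_hidden_states(hidden, mask, head_weight=[0.5, 0.25])
```

Indexing by the attention mask matters in batched settings: reading the literal last position would score a padding token instead of the end of the response.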

Model size effects: Larger reward models generally produce better calibrated scores. Anthropic’s research shows that reward model scaling (more parameters, more data) significantly affects final policy quality.

Ensemble reward models: Several reward models can be trained with different seeds or on data from different labeler groups. The variance of their scores serves as an uncertainty estimate: high variance indicates that the reward models disagree about a response, which can flag situations that are prone to reward hacking.
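
A minimal sketch of the ensemble idea, assuming each member has already scored the same (prompt, response) pair. The `std_threshold` value is a hypothetical tuning parameter, not a recommended setting.

```python
from statistics import mean, pstdev


def summarize_ensemble(scores, std_threshold: float = 0.5):
    """Return (mean score, standard deviation, uncertain flag) for one response."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two ensemble members")
    mu = mean(scores)
    sigma = pstdev(scores)
    return mu, sigma, sigma > std_threshold


# Two responses with the same mean score but very different agreement.
agree = summarize_ensemble([1.0, 1.1, 0.9])
disagree = summarize_ensemble([3.0, -1.0, 1.0])
```

Both responses average about 1.0, but only the second is flagged, which is the case where a single reward model's score would be least trustworthy.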

Reward Hacking: Taxonomy and Examples

Reward hacking occurs when the AI policy exploits weaknesses in the reward model’s representation of human preferences.

Verbosity bias: Many reward models prefer longer responses because length often correlates with effort and detail. The policy learns to pad responses with filler text to increase length.
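
One simple check for this bias is to correlate response length with reward score. The sketch below uses made-up numbers and a hand-written Pearson correlation; a high value is only a warning sign, since longer responses can also be genuinely better.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative data: token counts and reward scores for four responses.
lengths = [5, 12, 30, 55]
rewards = [0.1, 0.4, 0.9, 1.6]
length_reward_corr = pearson(lengths, rewards)
```

Comparing this correlation against the same statistic computed on human quality ratings helps separate a real length preference from a reward model artifact.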

Sycophancy: Human labelers tend to prefer responses that agree with their stated views, and reward models trained on those labels inherit the preference. The policy learns to identify and agree with the user’s position, even when it is wrong.

Formatting over content: Reward models may prefer responses with headers, bullet points, and clear structure even when the content is poor. The policy learns excessive formatting.

False confidence: Users prefer confident-sounding responses, so reward models may favor confident over calibrated responses. The policy learns to state uncertain things confidently.

Jailbreak-adjacent patterns: The policy may discover patterns that score well on the reward model but are actually harmful (reward model blind spots).

Detection: compare reward model scores to independent human evaluations. Large discrepancies indicate reward hacking.
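
One way to make that comparison concrete is to measure how often the reward model ranks the human-preferred response higher. The data below is invented for illustration, and `preference_agreement` is a hypothetical helper.

```python
def preference_agreement(pairs) -> float:
    """Fraction of pairs where the reward model scores the human-preferred
    response above the rejected one.

    pairs: list of (score_of_preferred, score_of_rejected) tuples.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    hits = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return hits / len(pairs)


# Reward model scores on pairs labeled by independent human evaluators.
held_out_pairs = [(1.2, 0.3), (0.5, 0.9), (2.0, 1.1), (0.4, 0.1)]
policy_pairs = [(0.2, 1.5), (0.1, 0.8), (1.0, 0.4), (0.3, 0.9)]

held_out_agreement = preference_agreement(held_out_pairs)
policy_agreement = preference_agreement(policy_pairs)
```

A sharp drop in agreement on responses sampled from the trained policy, relative to the held-out comparison set, suggests the policy has moved into a region where the reward model no longer tracks human judgment.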

Process Reward Models

Standard reward models score final outputs (Outcome Reward Models, ORMs). Process Reward Models (PRMs) score individual steps in a reasoning chain.

Lightman et al. (OpenAI, 2023) “Let’s Verify Step by Step”: Human annotators labeled each step in model-generated solutions to math problems as positive (correct), negative (incorrect), or neutral (ambiguous). A PRM trained on this signal can evaluate the quality of the reasoning, not just the final answer.

Advantages of PRMs for math/reasoning:

Correct final answers via wrong reasoning → ORM rewards, PRM penalizes
Wrong final answers via mostly correct reasoning → ORM penalizes, PRM partially rewards
Better training signal for developing genuine reasoning ability

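Best-of-N with PRM: Generate N reasoning chains, score every step of each chain with the PRM, aggregate the step scores into a single score per chain (Lightman et al. use the product of the per-step correctness probabilities), and select the chain with the highest aggregate score. PRM-guided selection requires fewer samples to achieve the same accuracy as ORM-based best-of-N.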
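
A small sketch of PRM-based selection, assuming the PRM has already assigned each step a probability of being correct and that a chain's overall score is the product of those probabilities (the aggregation used by Lightman et al.). The chain data is invented for illustration.

```python
import math


def solution_score(step_probs) -> float:
    """Probability that every step is correct, assuming independent steps."""
    if not step_probs:
        raise ValueError("a chain needs at least one step")
    return math.prod(step_probs)


def best_chain(chains: dict) -> str:
    """Return the id of the chain with the highest aggregate PRM score."""
    if not chains:
        raise ValueError("need at least one chain")
    return max(chains, key=lambda chain_id: solution_score(chains[chain_id]))


# Per-step correctness probabilities for three sampled reasoning chains.
chains = {
    "a": [0.99, 0.98, 0.30, 0.99],  # one very doubtful step
    "b": [0.90, 0.92, 0.91, 0.93],  # uniformly solid steps
    "c": [0.95, 0.50, 0.95],        # one coin-flip step
}
selected = best_chain(chains)
```

The product penalizes a single weak step heavily, so chain "a" loses to chain "b" even though most of its steps score higher.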

Lightman et al. showed that PRM best-of-N outperforms ORM best-of-N at equal sample counts: with N=1860 sampled chains per problem, PRM-based selection solves 78.2% of problems on a subset of the MATH test set vs. 72.4% for ORM-based selection.

One thing to remember: The reward model is where human values get encoded into AI training — its quality determines the ceiling of what RLHF can achieve, which is why reward model research is one of the highest-leverage areas in alignment work.
