Reward Modeling — Deep Dive
Reward Model Overoptimization
Gao et al. (2023) “Scaling Laws for Reward Model Overoptimization” is the defining paper on what happens when you optimize too hard against a reward model.
Setup: A “gold-standard” oracle reward model $r^*$ (or human evaluations) represents true quality. The proxy reward model $r_{proxy}$ approximates it. A policy $\pi$ is trained to maximize $r_{proxy}$.
As the policy becomes better at optimizing $r_{proxy}$, it increasingly exploits the gaps between $r_{proxy}$ and $r^*$. Beyond an optimal KL divergence between $\pi$ and the reference policy, true quality (measured by $r^*$) begins to decrease while $r_{proxy}$ continues increasing.
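The scaling law: Gao et al. found empirically that gold reward, as a function of how far the policy has moved from the reference policy, is well approximated by the following form (the one they fit for best-of-n sampling):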
$$r^* \approx r^* |_{KL=0} + a\sqrt{d} - b \cdot d$$
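Here $d$ is the KL divergence from the reference policy, and $a$ and $b$ are empirically fitted coefficients. The term $a\sqrt{d}$ represents initial improvement; $-b \cdot d$ represents degradation from overoptimization. For policies trained with RL, the paper fits a closely related form in which the penalty term also carries a logarithmic factor.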
The optimal KL depends on reward model capacity (better reward models → higher optimal KL before degradation begins).
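As a worked illustration of the formula above, setting the derivative $a/(2\sqrt{d}) - b$ to zero gives an optimal KL of $d^* = (a/2b)^2$ and a peak gain of $a^2/(4b)$ over the reference policy. The sketch below evaluates this numerically. The coefficient values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the paper.

```python
import math

def gold_reward(d, r0, a, b):
    """Gold reward at KL divergence d under r0 + a*sqrt(d) - b*d."""
    return r0 + a * math.sqrt(d) - b * d

def optimal_kl(a, b):
    """KL that maximizes gold reward: solve a / (2*sqrt(d)) - b = 0 for d."""
    return (a / (2.0 * b)) ** 2

# Hypothetical coefficients, for illustration only.
r0, a, b = 0.0, 1.0, 0.1

d_star = optimal_kl(a, b)
peak = gold_reward(d_star, r0, a, b)

# Gold reward rises up to d_star and falls afterwards.
for d in (5.0, d_star, 60.0):
    print(f"KL={d:6.1f}  gold reward={gold_reward(d, r0, a, b):.3f}")
```

With these values, a smaller $b$ (a proxy that degrades more slowly) pushes $d^*$ outward, which matches the observation that better reward models tolerate more optimization.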
Implications:
- The KL penalty in PPO ($\beta \cdot KL(\pi || \pi_{ref})$) is directly motivated by this — it limits optimization against the proxy reward
- Larger reward models support more optimization (higher optimal KL)
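- Evaluating a policy against several independently trained reward models (or held-out human judgments) helps detect overoptimization: a proxy score that keeps rising while the independent scores stall or fall is a warning sign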
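The KL penalty from the first implication can be sketched as a per-response reward adjustment. This is a minimal sketch, assuming the KL term is estimated from the sampled response as the sum of per-token log-probability differences; the function name and the numbers are hypothetical, and real PPO implementations often apply the penalty per token rather than per response.

```python
def kl_penalized_reward(proxy_reward, policy_logprobs, ref_logprobs, beta):
    """Proxy reward minus beta times a sample-based KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the policy and the reference policy.
    """
    # Single-sample estimate of KL(pi || pi_ref) for this response.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return proxy_reward - beta * kl_estimate

# Hypothetical values: the policy assigns higher probability to its own
# sample than the reference does, so the KL estimate is positive.
penalized = kl_penalized_reward(
    proxy_reward=2.0,
    policy_logprobs=[-0.5, -1.0, -0.2],
    ref_logprobs=[-1.0, -1.5, -0.7],
    beta=0.1,
)
print(penalized)
```

A larger `beta` keeps the policy closer to the reference and so limits how far it can travel along the overoptimization curve.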
Reward Model Calibration
A calibrated reward model produces scores where differences correspond to meaningful preference magnitudes — not just ordinal rankings.
Calibration test: For a reward model that assigns score difference $\Delta r$ to a pair, P(human prefers higher-scored response) should equal $\sigma(\Delta r / \tau)$ for some temperature $\tau$.
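One way to run this test is to bin held-out comparisons by score difference and compare the predicted preference rate $\sigma(\Delta r / \tau)$ with the observed human preference rate in each bin. The sketch below assumes each record is a pair `(delta_r, human_prefers_higher)` with `delta_r >= 0`; the function names, bin width, and data are hypothetical.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def calibration_table(pairs, tau=1.0, bin_width=1.0):
    """Return rows of (bin_index, predicted_rate, observed_rate, count).

    pairs: iterable of (delta_r, human_prefers_higher), where delta_r is the
    non-negative score gap and the flag says whether humans agreed with the
    reward model's ranking.
    """
    bins = defaultdict(list)
    for delta_r, prefers_higher in pairs:
        bins[int(delta_r // bin_width)].append((delta_r, prefers_higher))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        predicted = sum(sigmoid(d / tau) for d, _ in items) / len(items)
        observed = sum(1 for _, agreed in items if agreed) / len(items)
        rows.append((idx, predicted, observed, len(items)))
    return rows

# Tiny hypothetical sample; a real check needs many comparisons per bin.
sample_pairs = [(0.2, True), (0.4, False), (1.5, True), (1.8, True)]
table = calibration_table(sample_pairs)
for idx, predicted, observed, count in table:
    print(f"bin {idx}: predicted={predicted:.2f} observed={observed:.2f} n={count}")
```

Large gaps between the predicted and observed columns indicate miscalibration; refitting $\tau$ on held-out data is one simple correction.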
Well-calibrated reward models enable:
- Reliable best-of-N sampling (select the response with highest reward — meaningful if scores are calibrated)
- Uncertainty estimation (low confidence in comparisons reflects genuinely similar responses)
- Reward model comparison across different training runs
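Best-of-N sampling from the list above can be sketched in a few lines. Here `sample` and `reward` are hypothetical callables standing in for the policy and the reward model, and the canned responses exist only so the example runs without a model.

```python
import itertools

def best_of_n(prompt, sample, reward, n):
    """Draw n candidate responses and return the one with the highest reward."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))

# Stand-ins for a policy and a reward model (assumptions for illustration).
canned = itertools.cycle(["ok", "a fuller answer", "meh"])
toy_sample = lambda prompt: next(canned)
toy_reward = lambda prompt, response: len(response)

pick = best_of_n("q", toy_sample, toy_reward, n=3)
print(pick)
```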
Calibration failures: Reward models often have poorly calibrated absolute scores. Two reward models may produce different score scales that correlate but aren’t directly comparable. Fine-tuning on task-specific data can shift calibration away from general preferences.
Reward Aggregation: Multi-Principle Models
Training one reward model on all human preference data conflates different evaluation dimensions. A response can be:
- Helpful (addresses the request)
- Harmless (not dangerous or offensive)
- Honest (factually accurate, not hallucinating)
- Formatting-appropriate (well-structured, appropriate length)
These dimensions can conflict. A highly helpful response might provide information that could be misused (helpfulness vs. harmlessness tradeoff).
Multi-objective reward modeling: Train separate reward models for each dimension, then combine via weighted aggregation:
$$r_{combined} = w_1 r_{helpfulness} + w_2 r_{harmlessness} + w_3 r_{honesty}$$
The weights $w_i$ can be:
- Fixed: reflect policy choices about tradeoffs
- Task-dependent: different weights for different query types
- Learned: from additional preference data about the tradeoff
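The fixed and task-dependent options can be sketched as a lookup of weight vectors followed by a weighted sum. The task names and weight values below are hypothetical and stand in for whatever tradeoff a deployment actually chooses.

```python
# Hypothetical weight tables; the numbers encode a policy choice, not a result.
WEIGHTS_BY_TASK = {
    "default": {"helpfulness": 0.5, "harmlessness": 0.3, "honesty": 0.2},
    "medical": {"helpfulness": 0.3, "harmlessness": 0.4, "honesty": 0.3},
}

def combined_reward(scores, task="default"):
    """Weighted sum of per-dimension reward scores for the given task type."""
    weights = WEIGHTS_BY_TASK.get(task, WEIGHTS_BY_TASK["default"])
    return sum(weights[name] * scores[name] for name in weights)

# Scores from three separate reward models for one response (made up).
scores = {"helpfulness": 2.0, "harmlessness": -1.0, "honesty": 1.0}
default_score = combined_reward(scores)
medical_score = combined_reward(scores, task="medical")
print(default_score, medical_score)
```

The same response scores lower under the medical weights because its harmlessness penalty counts for more, which is the kind of tradeoff the weights are meant to express. A weighted sum also assumes the per-dimension scores are on comparable scales.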
Constitutional AI partially addresses this by specifying dimensions explicitly in the constitution, allowing the AI to reason about tradeoffs.
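Pareto reward optimization: Instead of collapsing the dimensions into one score, search for Pareto-optimal policies, that is, policies for which no dimension can be improved without making another worse. In practice this means, for example, accepting an update that improves helpfulness only if it does not reduce harmlessness. Multi-reward PPO with Pareto-style constraints aims to keep any single dimension from degrading.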
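The Pareto criterion itself is easy to state in code. This is a minimal sketch of a dominance check over per-dimension scores, not a multi-reward PPO implementation; the policy names and scores are hypothetical.

```python
def dominates(candidate, baseline):
    """True if candidate is no worse on every dimension and better on at least one."""
    no_worse = all(candidate[k] >= baseline[k] for k in baseline)
    strictly_better = any(candidate[k] > baseline[k] for k in baseline)
    return no_worse and strictly_better

def pareto_front(policies):
    """Return the names of policies that no other policy dominates."""
    return [
        name
        for name, scores in policies.items()
        if not any(
            dominates(other, scores)
            for other_name, other in policies.items()
            if other_name != name
        )
    ]

# Hypothetical evaluation scores for three policy checkpoints.
policies = {
    "A": {"helpfulness": 0.6, "harmlessness": 0.9},
    "B": {"helpfulness": 0.8, "harmlessness": 0.9},
    "C": {"helpfulness": 0.9, "harmlessness": 0.5},
}
front = pareto_front(policies)
print(front)
```

Checkpoint B dominates A, while B and C represent different tradeoffs; choosing between them is still a policy decision.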
RLAIF: Replacing Human Raters with AI
RLAIF (Reinforcement Learning from AI Feedback) uses an AI model (typically a powerful LLM) rather than human raters to generate preference data.
Setup:
- Sample prompt $x$, generate two responses $(y_1, y_2)$
- Prompt AI evaluator: “Which response better follows these principles?”
- Use AI’s preference label to train the reward model
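The three steps above can be sketched as a function that turns one AI judgment into a preference record for reward model training. The `evaluator` callable is a hypothetical stand-in for an LLM call, and the prompt template and record fields are assumptions for illustration, not the format used in any particular paper.

```python
def label_pair(prompt, y1, y2, evaluator, principles):
    """Ask an AI evaluator which response is better and build a preference record.

    evaluator: hypothetical callable taking the evaluation prompt as a string
    and returning "1" or "2". Returns None if the answer cannot be parsed.
    """
    question = (
        f"{principles}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response 1: {y1}\n\n"
        f"Response 2: {y2}\n\n"
        "Which response better follows these principles? Answer 1 or 2."
    )
    verdict = evaluator(question).strip()
    if verdict not in ("1", "2"):
        return None  # Discard unparseable judgments rather than guess.
    chosen, rejected = (y1, y2) if verdict == "1" else (y2, y1)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Stub evaluator so the example runs without a model; it always answers "2".
record = label_pair(
    prompt="Explain KL divergence.",
    y1="It is a number.",
    y2="It measures how one distribution differs from another.",
    evaluator=lambda question: " 2\n",
    principles="Prefer the more informative and accurate response.",
)
print(record["chosen"])
```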
Bai et al. (Anthropic, 2022) “Constitutional AI” used this approach. Lee et al. (Google, 2023) “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” showed RLAIF can match RLHF quality on summarization tasks at significantly lower cost.
Advantages:
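- Far cheaper per label than human annotation (figures as high as 1000x are sometimes quoted, but the actual ratio depends on model and labor pricing)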
- Faster iteration
- Consistent application of specified principles (no rater fatigue, no biases from personal experience)
Disadvantages:
- AI evaluator inherits biases of its own training
- Can’t reliably evaluate factual accuracy, safety edge cases, or novel situations
- AI evaluator may prefer AI-generated writing styles over human writing styles (distributional bias)
Chain-of-thought evaluation: Asking the AI evaluator to reason before giving a preference (“let’s think about which response is better…”) can improve evaluation quality; gains on the order of 10–20% are quoted, though the size of the effect varies with the task and the evaluator. This parallels how CoT improves generation quality.
Reward Model Generalization
Reward models are trained on in-distribution preference data. When the policy generates out-of-distribution responses (unusual styles, formats, or topic areas), the reward model may produce unreliable scores.
Domain shift effects: A reward model trained predominantly on conversational English may give low scores to valid mathematical notation, code, or non-English text — not because these are worse, but because they’re unfamiliar.
Length generalization: Reward models trained on responses of 100–500 words may score very long responses poorly (unfamiliar length) even when quality is maintained. Normalizing scores for length during training helps.
Active learning for reward model improvement: Identify inputs where the reward model is uncertain (ensemble disagreement, low margin in comparisons) and collect new human preference data for those inputs. Iteratively targeted data collection improves reward model generalization.
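Ensemble disagreement, one of the uncertainty signals mentioned above, can be sketched as ranking candidate inputs by the spread of scores across several reward models. The reward models here are hypothetical callables and the toy scoring rules exist only to make the example runnable.

```python
import statistics

def select_uncertain(candidates, reward_models, k):
    """Return the k (prompt, response) pairs with the highest ensemble disagreement.

    candidates: list of (prompt, response) tuples.
    reward_models: list of callables (prompt, response) -> float.
    """
    scored = []
    for prompt, response in candidates:
        scores = [rm(prompt, response) for rm in reward_models]
        # Population standard deviation across the ensemble as the uncertainty proxy.
        scored.append((statistics.pstdev(scores), prompt, response))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(prompt, response) for _, prompt, response in scored[:k]]

# Toy ensemble: one model rewards length, the other returns a constant.
toy_models = [
    lambda prompt, response: 0.1 * len(response),
    lambda prompt, response: 0.5,
]
toy_candidates = [("q", "short"), ("q", "a much longer answer")]
to_label = select_uncertain(toy_candidates, toy_models, k=1)
print(to_label)
```

The selected pairs would then be sent to human raters, and the new labels added to the reward model's training set.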
Reward model fine-tuning: When deploying RLHF for a specific domain (medical, legal, coding), fine-tune the base reward model on domain-specific preference data. Relatively small amounts of targeted domain data (on the order of 10,000 examples) can substantially improve domain-specific reward model quality.
One thing to remember: The reward model is both the most critical and the most fragile component of RLHF — its success requires sufficient coverage of the distribution the policy will explore, calibration that reflects real human preferences, and careful monitoring for overoptimization signals during policy training.