Reward Modeling — Deep Dive

Reward model overoptimization and KL penalties, Constitutional AI's reward model alternative, reward model generalization across distributions, and multi-principle reward models.

Reward Model Overoptimization

Gao et al. (2023) “Scaling Laws for Reward Model Overoptimization” is the defining paper on what happens when you optimize too hard against a reward model.

Setup: A “gold-standard” oracle reward model $r^*$ (or human evaluations) represents true quality. The proxy reward model $r_{proxy}$ approximates it. A policy $\pi$ is trained to maximize $r_{proxy}$.

As the policy becomes better at optimizing $r_{proxy}$, it increasingly exploits the gaps between $r_{proxy}$ and $r^*$. Beyond an optimal KL divergence between $\pi$ and the reference policy, true quality (measured by $r^*$) begins to decrease while $r_{proxy}$ continues increasing.

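The scaling law: Gao et al. found empirically that gold reward, as a function of how far the policy has moved from the reference policy, approximately follows:

$$r^*(d) \approx r^*(0) + a\sqrt{d} - b \cdot d$$

Where $d$ is the KL divergence from the reference policy and $r^*(0)$ is the gold reward of the reference policy itself. The term $a\sqrt{d}$ represents initial improvement; $-b \cdot d$ represents degradation from overoptimization. This square-root-minus-linear shape is the form the paper fits for best-of-n sampling; for RL-trained policies it reports a variant in which the degradation term carries a logarithmic factor.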

The optimal KL depends on reward model capacity (better reward models → higher optimal KL before degradation begins).
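As a worked illustration of the functional form above: setting the derivative $a/(2\sqrt{d}) - b$ to zero gives a peak at $d^* = (a/2b)^2$ with gain $a^2/(4b)$ over the reference policy. The sketch below evaluates this in Python. The coefficient values are made up for illustration and are not fitted values from the paper.

```python
import math


def gold_reward(kl, a, b, r0=0.0):
    """Predicted gold reward r0 + a*sqrt(kl) - b*kl for a KL divergence kl >= 0."""
    if kl < 0:
        raise ValueError("KL divergence must be non-negative")
    return r0 + a * math.sqrt(kl) - b * kl


def optimal_kl(a, b):
    """KL at which the predicted gold reward peaks, from a/(2*sqrt(kl)) - b = 0."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive for an interior maximum")
    return (a / (2.0 * b)) ** 2


# Illustrative coefficients only (assumption, not taken from any fit).
a, b = 1.0, 0.1
kl_star = optimal_kl(a, b)          # (1 / 0.2)^2 = 25
peak = gold_reward(kl_star, a, b)   # a^2 / (4b) = 2.5
print(f"optimal KL = {kl_star:.1f}, peak gold reward = {peak:.2f}")
```

A smaller $b$ (slower degradation) pushes the peak to a larger KL, which matches the claim that better reward models tolerate more optimization.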

Implications:

The KL penalty in PPO ($\beta \cdot KL(\pi || \pi_{ref})$) is directly motivated by this — it limits optimization against the proxy reward
Larger reward models support more optimization (higher optimal KL)
Every policy should be compared against multiple independent reward models to detect overoptimization
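A minimal sketch of the KL-penalized reward from the first point, assuming per-token log-probabilities of one sampled response are available from both the policy and the reference model. The function names and the numbers are hypothetical; real PPO implementations typically apply the penalty per token and may adapt $\beta$ during training.

```python
def sequence_kl_estimate(logp_policy, logp_ref):
    """Single-sample estimate of KL(pi || pi_ref) for one response sampled
    from the policy: sum over tokens of log pi(token) - log pi_ref(token)."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must have the same length")
    return sum(p - q for p, q in zip(logp_policy, logp_ref))


def penalized_reward(proxy_reward, logp_policy, logp_ref, beta):
    """Proxy reward minus beta times the estimated KL from the reference policy."""
    return proxy_reward - beta * sequence_kl_estimate(logp_policy, logp_ref)


# Toy per-token log-probabilities (assumed values for illustration).
logp_policy = [-0.2, -0.1, -0.3]
logp_ref = [-0.5, -0.4, -0.9]

kl_hat = sequence_kl_estimate(logp_policy, logp_ref)                # about 1.2
shaped = penalized_reward(2.0, logp_policy, logp_ref, beta=0.1)     # about 1.88
print(kl_hat, shaped)
```

Raising `beta` makes drifting from the reference policy more expensive, which caps how far the policy can chase the proxy reward.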

Reward Model Calibration

A calibrated reward model produces scores where differences correspond to meaningful preference magnitudes — not just ordinal rankings.

Calibration test: For a reward model that assigns score difference $\Delta r$ to a pair, P(human prefers higher-scored response) should equal $\sigma(\Delta r / \tau)$ for some temperature $\tau$.
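One way to run this test, sketched under assumptions: bin held-out pairs by score difference, then compare the mean predicted preference probability $\sigma(\Delta r / \tau)$ in each bin against the observed rate at which humans preferred the higher-scored response. The equal-count binning and the summary error below are illustrative choices, not a prescribed procedure.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibration_table(deltas, human_prefers_higher, tau=1.0, n_bins=5):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin.

    deltas: score differences (higher-scored minus lower-scored), each >= 0.
    human_prefers_higher: 1 if the human preferred the higher-scored response, else 0.
    """
    pairs = sorted(zip(deltas, human_prefers_higher))
    n = len(pairs)
    rows = []
    for i in range(n_bins):
        # Equal-count bins over the pairs sorted by score difference.
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue
        predicted = sum(sigmoid(d / tau) for d, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((predicted, observed, len(chunk)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted mean absolute gap between predicted and observed rates."""
    total = sum(count for _, _, count in rows)
    return sum(count * abs(p - o) for p, o, count in rows) / total


# Toy data (assumed): two near-ties and two clear wins.
rows = calibration_table([0.0, 0.0, 2.0, 2.0], [1, 0, 1, 1], tau=1.0, n_bins=2)
ece = expected_calibration_error(rows)
print(rows, ece)
```

Repeating the computation over a grid of `tau` values and keeping the one with the lowest error is a simple way to find the temperature the test refers to.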

Well-calibrated reward models enable:

Reliable best-of-N sampling (select the response with highest reward — meaningful if scores are calibrated)
Uncertainty estimation (low confidence in comparisons reflects genuinely similar responses)
Reward model comparison across different training runs
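Best-of-N sampling from the list above can be sketched in a few lines. Here `reward_fn` is a hypothetical stand-in for a reward model call, and the toy scores are invented. The helper `best_of_n_kl` uses the expression $\log n - (n-1)/n$, a commonly quoted analytic estimate of how far best-of-n moves from the base policy in KL terms; treat it as an approximation rather than an exact value.

```python
import math


def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def best_of_n_kl(n):
    """Commonly used analytic estimate of KL(best-of-n || base): log n - (n-1)/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log(n) - (n - 1) / n


# Toy scores standing in for reward-model outputs (assumed values).
toy_scores = {"draft A": 0.1, "draft B": 0.7, "draft C": 0.4}
winner = best_of_n(list(toy_scores), toy_scores.get)
print(winner, best_of_n_kl(16))
```

Because the estimate grows only logarithmically in `n`, best-of-N explores a much narrower KL range than RL training for the same compute, which is one reason it is often used as a baseline.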

Calibration failures: Reward models often have poorly calibrated absolute scores. Two reward models may produce different score scales that correlate but aren’t directly comparable. Fine-tuning on task-specific data can shift calibration away from general preferences.
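When two reward models differ only in score scale, one simple remedy is to standardize each model's scores on a shared evaluation set before comparing them. The sketch below is an assumption about how one might do that. It removes differences in offset and scale only, and does not by itself make either model calibrated.

```python
import statistics


def standardize(scores):
    """Z-score a list of reward scores: subtract the mean, divide by the std."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        raise ValueError("scores are constant; cannot standardize")
    return [(s - mean) / std for s in scores]


# Two hypothetical reward models scoring the same three responses.
scores_a = [1.0, 2.0, 3.0]
scores_b = [10.0, 20.0, 30.0]

z_a = standardize(scores_a)
z_b = standardize(scores_b)
print(z_a, z_b)  # identical after standardization, despite different raw scales
```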

Reward Aggregation: Multi-Principle Models

Training one reward model on all human preference data conflates different evaluation dimensions. A response can be:

Helpful (addresses the request)
Harmless (not dangerous or offensive)
Honest (factually accurate, not hallucinating)
Formatting-appropriate (well-structured, appropriate length)

These dimensions can conflict. A highly helpful response might provide information that could be misused (helpfulness vs. harmlessness tradeoff).

Multi-objective reward modeling: Train separate reward models for each dimension, then combine via weighted aggregation:

$$r_{combined} = w_1 r_{helpfulness} + w_2 r_{harmlessness} + w_3 r_{honesty}$$

The weights $w_i$ can be:

Fixed: reflect policy choices about tradeoffs
Task-dependent: different weights for different query types
Learned: from additional preference data about the tradeoff
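The fixed and task-dependent options can be sketched as a lookup of weight vectors applied to per-dimension scores. The task names, weights, and scores below are hypothetical and chosen only to show the mechanics.

```python
# Hypothetical weight tables; the values encode a policy choice, not a recommendation.
WEIGHTS_BY_TASK = {
    "default": {"helpfulness": 0.5, "harmlessness": 0.3, "honesty": 0.2},
    "medical": {"helpfulness": 0.3, "harmlessness": 0.4, "honesty": 0.3},
}


def combined_reward(scores, task="default"):
    """Weighted sum of per-dimension reward scores for the given task type."""
    weights = WEIGHTS_BY_TASK.get(task, WEIGHTS_BY_TASK["default"])
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(w * scores[name] for name, w in weights.items())


# Toy per-dimension scores from three separate reward models (assumed values).
scores = {"helpfulness": 0.9, "harmlessness": 0.2, "honesty": 0.8}
r_default = combined_reward(scores)             # 0.45 + 0.06 + 0.16 = 0.67
r_medical = combined_reward(scores, "medical")  # 0.27 + 0.08 + 0.24 = 0.59
print(r_default, r_medical)
```

The same response scores lower under the medical weights because harmlessness counts for more there, which is the tradeoff the weights are meant to express.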

Constitutional AI partially addresses this by specifying dimensions explicitly in the constitution, allowing the AI to reason about tradeoffs.

Pareto reward optimization: Instead of aggregating, optimize for Pareto-optimal policies, for example improving helpfulness without reducing harmlessness. Multi-reward PPO with Pareto constraints aims to keep any single dimension from degrading while others improve.
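The notion of Pareto optimality used here can be made concrete with a dominance check: one policy dominates another if it is at least as good on every dimension and strictly better on at least one. The sketch below filters a set of candidate policies down to the non-dominated ones. The score tuples are invented (helpfulness, harmlessness) pairs.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (helpfulness, harmlessness) scores for four candidate policies.
candidates = [(0.9, 0.2), (0.6, 0.7), (0.5, 0.6), (0.3, 0.9)]
front = pareto_front(candidates)
print(front)  # (0.5, 0.6) is dropped because (0.6, 0.7) beats it on both axes
```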

RLAIF: Replacing Human Raters with AI

RLAIF (Reinforcement Learning from AI Feedback) uses an AI model (typically a powerful LLM) rather than human raters to generate preference data.

Setup:

Sample prompt $x$, generate two responses $(y_1, y_2)$
Prompt AI evaluator: “Which response better follows these principles?”
Use AI’s preference label to train the reward model
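The three steps above can be sketched as follows. The `evaluator` argument is a hypothetical callable standing in for a call to an LLM, and the prompt template is an assumption rather than the wording used in any cited paper. The loss shown is the standard pairwise (Bradley-Terry style) objective $-\log \sigma(r_{chosen} - r_{rejected})$ commonly used to train reward models on preference labels.

```python
import math

# Assumed prompt wording; real systems tune this carefully.
PROMPT_TEMPLATE = (
    "Principles:\n{principles}\n\n"
    "Prompt:\n{prompt}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Which response better follows these principles? Answer with A or B."
)


def label_pair(prompt, y1, y2, principles, evaluator):
    """Ask the AI evaluator for a preference and return a training record.

    evaluator: hypothetical callable (str -> str) wrapping an LLM call.
    Returns None if the answer cannot be parsed as A or B.
    """
    query = PROMPT_TEMPLATE.format(principles=principles, prompt=prompt, a=y1, b=y2)
    answer = evaluator(query).strip().upper()
    if answer not in ("A", "B"):
        return None
    chosen, rejected = (y1, y2) if answer == "A" else (y2, y1)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as log(1 + exp(-margin))."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def toy_evaluator(query):
    """Stub that always prefers response B, used only to exercise the code."""
    return "B"


record = label_pair("Explain KL divergence.", "short answer", "detailed answer",
                    "Be helpful and honest.", toy_evaluator)
print(record, pairwise_loss(1.5, 0.5))
```

In practice each pair is often queried in both orders and the results combined, since LLM evaluators can favor whichever response appears first.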

Bai et al. (Anthropic, 2022) “Constitutional AI” used this approach. Lee et al. (Google, 2023) “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” showed RLAIF can match RLHF quality on summarization tasks at significantly lower cost.

Advantages:

Far cheaper than human labeling (the exact ratio depends on the evaluator model and on annotation rates)
Faster iteration
Consistent application of specified principles (no rater fatigue, no biases from personal experience)

Disadvantages:

AI evaluator inherits biases of its own training
Can’t reliably evaluate factual accuracy, safety edge cases, or novel situations
AI evaluator may prefer AI-generated writing styles over human writing styles (distributional bias)

Chain-of-thought evaluation: Asking the AI evaluator to reason before giving a preference (“let’s think about which response is better…”) improves evaluation quality by 10–20%. This parallels how CoT improves generation quality.
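A minimal sketch of how a chain-of-thought evaluation might be wired up: the evaluator is asked to reason first and then end with a fixed-format verdict line, which is parsed from the output. The instruction text and the `Preferred: A` / `Preferred: B` convention are assumptions made for this example.

```python
import re

# Assumed instruction appended to the comparison prompt.
COT_SUFFIX = (
    "Let's think about which response is better. Explain your reasoning, "
    "then finish with a line of the form 'Preferred: A' or 'Preferred: B'."
)

VERDICT_PATTERN = re.compile(r"^\s*Preferred:\s*([AB])\s*$", flags=re.MULTILINE)


def parse_cot_verdict(text):
    """Return 'A' or 'B' from the last verdict line, or None if there is none."""
    matches = VERDICT_PATTERN.findall(text)
    return matches[-1] if matches else None


# Toy evaluator output (invented for illustration).
sample_output = (
    "Response A is fluent but makes an unsupported claim.\n"
    "Response B is shorter and cites its source.\n"
    "Preferred: B"
)
verdict = parse_cot_verdict(sample_output)
print(verdict)
```

Taking the last matching line guards against the model mentioning a tentative preference partway through its reasoning.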

Reward Model Generalization

Reward models are trained on in-distribution preference data. When the policy generates out-of-distribution responses (unusual styles, formats, or topic areas), the reward model may produce unreliable scores.

Domain shift effects: A reward model trained predominantly on conversational English may give low scores to valid mathematical notation, code, or non-English text — not because these are worse, but because they’re unfamiliar.

Length generalization: Reward models trained on responses of 100–500 words may score very long responses poorly (unfamiliar length) even when quality is maintained. Normalizing scores for length during training helps.
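One simple reading of "normalizing for length" is to remove the linear trend between score and response length, so that what remains reflects quality at a given length. The sketch below does this after the fact on a batch of scores, which is an assumption on my part; the text refers to normalization during training, where the same idea would be applied to the training signal instead.

```python
import statistics


def length_debiased(scores, lengths):
    """Subtract the least-squares linear effect of length from each score."""
    if len(scores) != len(lengths):
        raise ValueError("scores and lengths must have the same length")
    mean_len = statistics.fmean(lengths)
    mean_score = statistics.fmean(scores)
    var_len = sum((n - mean_len) ** 2 for n in lengths)
    if var_len == 0:
        return list(scores)  # all lengths equal: nothing to remove
    slope = sum((n - mean_len) * (s - mean_score)
                for n, s in zip(lengths, scores)) / var_len
    return [s - slope * (n - mean_len) for s, n in zip(scores, lengths)]


# Toy batch (assumed): scores that rise purely with word count.
scores = [1.0, 2.0, 3.0]
lengths = [100, 200, 300]
adjusted = length_debiased(scores, lengths)
print(adjusted)  # all three land on the batch mean once the length trend is removed
```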

Active learning for reward model improvement: Identify inputs where the reward model is uncertain (ensemble disagreement, low margin in comparisons) and collect new human preference data for those inputs. Iteratively targeted data collection improves reward model generalization.
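The selection step can be sketched with an ensemble of reward models: for each candidate pair, compute every member's margin $r(y_1) - r(y_2)$, then rank pairs by how small the mean margin is relative to the spread across members. This signal-to-noise score is one hypothetical way to combine the two uncertainty signals named above (ensemble disagreement and low margin); other combinations are equally plausible.

```python
import statistics


def select_uncertain_pairs(margins_by_pair, k, eps=1e-8):
    """Return indices of the k pairs the ensemble is least sure about.

    margins_by_pair: one list per pair, holding each ensemble member's
    margin r(y1) - r(y2). A low |mean| / std ratio means the members
    disagree or the preference is close to a tie.
    """
    def confidence(item):
        _, margins = item
        spread = statistics.pstdev(margins)
        return abs(statistics.fmean(margins)) / (spread + eps)

    ranked = sorted(enumerate(margins_by_pair), key=confidence)
    return [idx for idx, _ in ranked[:k]]


# Toy margins from a three-member ensemble (assumed values).
margins_by_pair = [
    [2.0, 2.1, 1.9],    # clear, consistent preference
    [0.5, -0.6, 0.1],   # members disagree on the sign
    [0.3, 0.2, 0.25],   # consistent but small margin
]
to_label = select_uncertain_pairs(margins_by_pair, k=2)
print(to_label)
```

The selected indices are the pairs that would be sent to human raters in the next round of data collection.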

Reward model fine-tuning: When deploying RLHF for a specific domain (medical, legal, coding), fine-tune the base reward model on domain-specific preference data. A relatively small amount of targeted domain data (on the order of 10,000 examples) can substantially improve domain-specific reward model quality.

One thing to remember: The reward model is both the most critical and the most fragile component of RLHF — its success requires sufficient coverage of the distribution the policy will explore, calibration that reflects real human preferences, and careful monitoring for overoptimization signals during policy training.
