Reward Shaping — Deep Dive
Formal foundation
Consider an MDP (S, A, T, R, γ). A shaping function F(s, a, s') modifies the reward:
R'(s, a, s') = R(s, a, s') + F(s, a, s')
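Ng's theorem (Ng, Harada and Russell, 1999): if F(s, a, s') = γΦ(s') - Φ(s) for some bounded real-valued potential function Φ: S → ℝ, then every optimal policy under R' is also optimal under R, and vice versa. This form is called potential-based reward shaping (PBRS). A shaping function of any other form can change the optimal policy for some choice of transition dynamics and rewards.
This is the theoretical bedrock. Shaping schemes that guarantee policy invariance either use PBRS directly or are shown to be equivalent to it.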
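The invariance claim can be checked numerically on a toy problem. The sketch below is illustrative only: it builds a small random tabular MDP (the sizes, the seed, and the helper names `value_iteration` and `greedy` are assumptions, not part of the theorem), solves it with and without a potential-based shaping term, and compares the greedy policies. It also checks the known relation that the shaped optimal Q-values equal the original ones minus Φ(s).

```python
import random


def value_iteration(n_states, n_actions, T, R, gamma, iters=2000):
    """Return optimal Q-values for a tabular MDP.

    T[s][a][s2] is a transition probability, R[s][a][s2] the reward.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = [max(q) for q in Q]
        Q = [
            [
                sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return Q


def greedy(Q):
    """Index of the best action in each state."""
    return [max(range(len(q)), key=q.__getitem__) for q in Q]


random.seed(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition probabilities, normalised per (state, action) pair.
T = []
for s in range(n_states):
    row = []
    for a in range(n_actions):
        weights = [random.random() for _ in range(n_states)]
        total = sum(weights)
        row.append([w / total for w in weights])
    T.append(row)

R = [[[random.uniform(-1, 1) for _ in range(n_states)]
      for _ in range(n_actions)] for _ in range(n_states)]

# An arbitrary potential function, one value per state.
phi = [random.uniform(-5, 5) for _ in range(n_states)]

# Shaped reward: R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s)
R_shaped = [[[R[s][a][s2] + gamma * phi[s2] - phi[s]
              for s2 in range(n_states)]
             for a in range(n_actions)] for s in range(n_states)]

Q = value_iteration(n_states, n_actions, T, R, gamma)
Q_shaped = value_iteration(n_states, n_actions, T, R_shaped, gamma)

same_policy = greedy(Q) == greedy(Q_shaped)
# Largest deviation from the relation Q'(s, a) = Q(s, a) - phi(s).
max_gap = max(abs(Q_shaped[s][a] - (Q[s][a] - phi[s]))
              for s in range(n_states) for a in range(n_actions))
print(same_policy, max_gap)
```

Because the shaped Q-values differ from the originals only by a per-state offset, the ranking of actions within each state is unchanged, which is why the greedy policy is preserved.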
Implementing PBRS as a Gymnasium wrapper
import gymnasium as gym
import numpy as np


class PotentialShapingWrapper(gym.Wrapper):
    """Add potential-based reward shaping to any environment."""

    def __init__(self, env: gym.Env, potential_fn, gamma: float = 0.99):
        super().__init__(env)
        self.potential_fn = potential_fn
        self.gamma = gamma
        self._prev_potential = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Potential of the initial state, used by the first step
        self._prev_potential = self.potential_fn(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        current_potential = self.potential_fn(obs)
        # F(s, a, s') = gamma * Phi(s') - Phi(s)
        shaping = self.gamma * current_potential - self._prev_potential
        self._prev_potential = current_potential
        # Keep both components available for logging
        info["raw_reward"] = reward
        info["shaping_reward"] = shaping
        return obs, reward + shaping, terminated, truncated, info
Usage with a distance-based potential:
import gymnasium as gym


def negative_distance_potential(obs):
    # For MountainCar: obs[0] is position, goal is at 0.5
    return -abs(obs[0] - 0.5)


env = gym.make("MountainCar-v0")
env = PotentialShapingWrapper(env, negative_distance_potential, gamma=0.99)
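To see what this potential does to individual transitions, the arithmetic can be worked through without running the environment. The sketch below is a simplification: it takes a scalar position instead of the full observation, and the helper name `shaping` and the sample positions are assumptions chosen for illustration.

```python
def negative_distance_potential(position, goal=0.5):
    # Same potential as above, but on a scalar position for brevity
    return -abs(position - goal)


def shaping(prev_pos, next_pos, gamma=0.99):
    # F = gamma * Phi(s') - Phi(s)
    return (gamma * negative_distance_potential(next_pos)
            - negative_distance_potential(prev_pos))


toward = shaping(-0.5, -0.4)  # moving toward the goal: about +0.109
away = shaping(-0.5, -0.6)    # moving away from the goal: about -0.089
still = shaping(-0.5, -0.5)   # not moving: about +0.010
print(toward, away, still)
```

Note that standing still earns a small positive shaping term, because the potential is negative and γ < 1. This does not change the optimal policy, but it is worth knowing when reading logged shaping rewards.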
Handling terminal states
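At terminal states, the potential should be zero (the episode is over, so there is no future to discount). This applies to true termination only: when an episode is merely truncated by a time limit, the state still has a future and its potential should be kept. Replace the wrapper's step method with: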
# Inside PotentialShapingWrapper
def step(self, action):
    obs, reward, terminated, truncated, info = self.env.step(action)
    if terminated:
        # True terminal state: no future, so the potential is zero
        current_potential = 0.0
    else:
        # Includes truncated episodes, where the state still has a future
        current_potential = self.potential_fn(obs)
    shaping = self.gamma * current_potential - self._prev_potential
    self._prev_potential = current_potential
    info["raw_reward"] = reward
    info["shaping_reward"] = shaping
    return obs, reward + shaping, terminated, truncated, info
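The reason for zeroing the terminal potential shows up in the telescoping sum of shaping terms: over an episode of length T, the discounted shaping rewards add up to γ^T Φ(s_T) - Φ(s_0). The sketch below checks this on two made-up trajectories (the potential values and the helper name `discounted_shaping_sum` are assumptions for illustration).

```python
def discounted_shaping_sum(potentials, gamma):
    """Sum of gamma^t * (gamma * phi[t+1] - phi[t]) along one trajectory."""
    total = 0.0
    for t in range(len(potentials) - 1):
        total += gamma ** t * (gamma * potentials[t + 1] - potentials[t])
    return total


gamma = 0.99
# Two hypothetical trajectories from the same start state (phi = -1.0)
# that end in terminal states with different raw potentials.
path_a = [-1.0, -0.8, -0.3, 2.0]
path_b = [-1.0, -1.2, -0.9, -0.5, -0.2, -3.0]

# Using the raw terminal potential: the total depends on where and when
# the episode ends.
raw_a = discounted_shaping_sum(path_a, gamma)
raw_b = discounted_shaping_sum(path_b, gamma)

# Forcing the terminal potential to zero: the total is -phi(s_0) on any path.
zeroed_a = discounted_shaping_sum(path_a[:-1] + [0.0], gamma)
zeroed_b = discounted_shaping_sum(path_b[:-1] + [0.0], gamma)
print(raw_a, raw_b, zeroed_a, zeroed_b)
```

With the terminal potential at zero, every trajectory from the same start state receives the same total shaping, so shaping cannot favour one way of ending the episode over another.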
Curiosity-driven exploration: ICM implementation
The Intrinsic Curiosity Module (Pathak et al., 2017) has three components:
- Feature encoder — maps raw observations to a compact embedding.
- Forward model — predicts the next embedding given the current embedding and action.
- Inverse model — predicts the action given two consecutive embeddings (regularises the feature space).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.inverse_model = nn.Sequential(
            nn.Linear(feat_dim * 2, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs, next_obs, action_onehot):
        phi = self.encoder(obs)
        phi_next = self.encoder(next_obs)
        # Inverse model: predict the action from consecutive embeddings
        action_pred = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(action_pred, action_onehot.argmax(dim=-1))
        # Forward model: predict the next embedding
        phi_next_pred = self.forward_model(torch.cat([phi, action_onehot], dim=-1))
        # Per-sample prediction error, shape (batch,)
        forward_error = F.mse_loss(
            phi_next_pred, phi_next.detach(), reduction="none"
        ).mean(dim=-1)
        forward_loss = forward_error.mean()
        # Intrinsic reward = per-sample forward prediction error, so each
        # transition in the batch gets its own bonus
        intrinsic_reward = forward_error.detach()
        return intrinsic_reward, forward_loss, inverse_loss
During training, add intrinsic_reward * η (a scaling coefficient, typically 0.01–0.1) to the environment reward. Tune η carefully — too high and the agent explores forever; too low and curiosity has no effect.
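As a minimal sketch of that combination step, the function below adds a scaled intrinsic bonus to each environment reward. The function name `combine_rewards`, the default η of 0.05, and the sample values are assumptions for illustration, and plain Python lists stand in for tensors.

```python
def combine_rewards(extrinsic, intrinsic, eta=0.05):
    """Per-step total reward: environment reward plus eta * curiosity bonus."""
    return [r_ext + eta * r_int for r_ext, r_int in zip(extrinsic, intrinsic)]


# Hypothetical batch: sparse environment reward, larger bonus on novel steps.
extrinsic = [0.0, 0.0, 1.0]
intrinsic = [2.0, 0.5, 0.1]
total = combine_rewards(extrinsic, intrinsic)
print(total)
```

With η at this scale, the bonus provides a gradient while the environment reward is zero, but a single real reward still outweighs it.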
Random Network Distillation (RND)
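RND (Burda et al., 2018) is simpler than ICM. It needs no action input and no inverse model: a predictor network is trained to match the output of a fixed, randomly initialised target network, and the prediction error serves as the intrinsic reward.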
class RND(nn.Module):
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed random target network, never trained
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False
        # Predictor network, trained to match the target
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # The reward is only a signal for the agent, so no gradients are needed
        with torch.no_grad():
            target_feat = self.target(obs)
            pred_feat = self.predictor(obs)
        # Per-sample error, shape (batch,)
        return F.mse_loss(pred_feat, target_feat, reduction="none").mean(dim=-1)

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        # Training loss for the predictor only
        target_feat = self.target(obs).detach()
        pred_feat = self.predictor(obs)
        return F.mse_loss(pred_feat, target_feat)
Novel states produce high prediction error because the predictor has never seen them. As the predictor trains on visited states, their intrinsic reward drops.
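The effect can be reproduced with a deliberately tiny stand-in for the two networks. The sketch below is not RND itself: it swaps the neural networks for two-dimensional linear maps (an assumption made so the example runs with the standard library only), and the state choices, seed, and learning rate are illustrative.

```python
import random

random.seed(0)

# Fixed "target network": a random linear map that is never updated.
target_w = [random.uniform(0.5, 1.5), random.uniform(0.5, 1.5)]
# "Predictor network": starts at zero and is trained to match the target.
pred_w = [0.0, 0.0]


def output(w, x):
    return w[0] * x[0] + w[1] * x[1]


def intrinsic_reward(x):
    # Squared prediction error between predictor and target
    return (output(pred_w, x) - output(target_w, x)) ** 2


visited = (1.0, 0.0)  # lies in the region the agent explores
novel = (0.0, 1.0)    # lies in a direction the agent never visits
reward_before = intrinsic_reward(visited)

lr = 0.1
for _ in range(500):
    # The agent only ever sees states whose second coordinate is zero.
    x = (random.uniform(-1.0, 1.0), 0.0)
    err = output(pred_w, x) - output(target_w, x)
    for i in range(2):
        pred_w[i] -= lr * 2.0 * err * x[i]  # gradient step on squared error

reward_visited = intrinsic_reward(visited)
reward_novel = intrinsic_reward(novel)
print(reward_before, reward_visited, reward_novel)
```

After training, the error on the visited state collapses towards zero, while the error on the unvisited direction stays at its initial level because no gradient ever touched the corresponding weight.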
Normalising intrinsic rewards
RND rewards can vary wildly in scale. Maintain a running mean and standard deviation:
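class RunningMeanStd:
    """Running mean and variance over batches of 1-D torch tensors."""

    def __init__(self):
        self.mean = 0.0
        self.var = 1.0
        self.count = 1e-4  # small prior count avoids division by zero

    def update(self, x):
        batch_mean = x.mean().item()
        # The merge formula below assumes population variance; torch's default
        # is the unbiased estimate, which is also NaN for a single sample
        batch_var = x.var(unbiased=False).item()
        batch_count = x.shape[0]
        self._update_from_moments(batch_mean, batch_var, batch_count)

    def _update_from_moments(self, mean, var, count):
        # Parallel (pairwise) combination of two sets of moments
        delta = mean - self.mean
        total = self.count + count
        self.mean += delta * count / total
        m_a = self.var * self.count
        m_b = var * count
        m2 = m_a + m_b + delta ** 2 * self.count * count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / (self.var ** 0.5 + 1e-8)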
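The moment-merging formula is easy to get subtly wrong, so it is worth checking against a direct computation. The sketch below restates the same update as a standalone function (the name `merge_moments` and the sample batches are assumptions for illustration) and compares it with the population mean and variance of the concatenated data.

```python
import statistics


def merge_moments(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Combine the population mean and variance of two batches."""
    delta = mean_b - mean_a
    total = n_a + n_b
    mean = mean_a + delta * n_b / total
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / total
    return mean, m2 / total, total


batch_a = [0.2, 0.4, 0.9, 1.3]
batch_b = [5.0, 7.5, 6.1]

mean, var, count = merge_moments(
    statistics.mean(batch_a), statistics.pvariance(batch_a), len(batch_a),
    statistics.mean(batch_b), statistics.pvariance(batch_b), len(batch_b),
)

# Reference values computed directly on the concatenated data.
combined = batch_a + batch_b
ref_mean = statistics.mean(combined)
ref_var = statistics.pvariance(combined)
print(mean, ref_mean, var, ref_var)
```

The match only holds when each batch contributes its population variance (dividing by n); feeding in the sample variance (dividing by n - 1) would bias the running estimate.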
Automated reward search
Hand-crafting rewards is tedious. Recent approaches automate it:
Reward learning from demonstrations (IRL)
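Inverse RL infers a reward function from expert trajectories. Libraries like imitation (built on SB3) provide reward networks and adversarial algorithms such as AIRL and GAIL that train them:
from imitation.algorithms.adversarial.airl import AIRL
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet

# obs_space and act_space are the environment's observation and action spaces
reward_net = BasicRewardNet(obs_space, act_space)
# Train reward_net using AIRL or GAIL on expert demonstrations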
LLM-based reward design
A recent trend uses large language models to propose reward functions from natural language task descriptions. The LLM generates Python code for a reward function, the agent trains, and metrics feed back to the LLM for iterative refinement.
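The propose, train, and refine loop can be sketched as plain control flow. Everything below is hypothetical scaffolding: `refine_reward`, `fake_propose`, and `fake_evaluate` are invented names, the "LLM" is a stub that cycles through hand-written candidates, and the evaluation metric is a placeholder for actually training an agent and measuring task success.

```python
from typing import Callable, List, Tuple

RewardFn = Callable[[float], float]
History = List[Tuple[str, float]]


def refine_reward(
    propose: Callable[[str, History], Tuple[str, RewardFn]],
    evaluate: Callable[[RewardFn], float],
    task: str,
    rounds: int = 3,
):
    """Propose a reward, score it, and feed the scores back into the next round."""
    history: History = []
    best_name, best_score = None, float("-inf")
    for _ in range(rounds):
        name, reward_fn = propose(task, history)
        score = evaluate(reward_fn)
        history.append((name, score))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, history


# Stand-in for an LLM call: returns the next hand-written candidate.
CANDIDATES = [
    ("sparse", lambda dist: 1.0 if dist < 0.05 else 0.0),
    ("negative_distance", lambda dist: -dist),
    ("negative_squared_distance", lambda dist: -dist ** 2),
]


def fake_propose(task: str, history: History):
    return CANDIDATES[len(history) % len(CANDIDATES)]


# Stand-in for "train an agent and measure success": a placeholder score of
# how strongly the reward separates a near-goal state from a far one.
def fake_evaluate(reward_fn: RewardFn) -> float:
    return reward_fn(0.1) - reward_fn(1.0)


best_name, best_score, history = refine_reward(
    fake_propose, fake_evaluate, task="reach the goal"
)
print(best_name, history)
```

In a real pipeline the evaluation step should use metrics that are independent of the proposed reward, otherwise the loop can select for rewards that are easy to score well on rather than rewards that produce the intended behaviour.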
Diagnosing reward problems
| Symptom | Likely cause | Fix |
|---|---|---|
| Agent exploits a loophole | Reward hacking | Add constraints or penalties; recast hand-added bonuses in PBRS form |
| Reward climbs but behaviour is nonsensical | Proxy gaming | Evaluate with held-out metrics the agent never sees |
| Training stalls at zero reward | Sparse reward, no exploration | Add intrinsic motivation (ICM/RND) or curriculum |
| Agent oscillates between strategies | Shaped reward creates local optima | Verify shaping satisfies PBRS conditions |
| Performance drops when shaping is removed | Shaping changed optimal policy | Switch to potential-based form |
Combining techniques
In practice, production RL pipelines layer multiple shaping strategies:
- PBRS for domain-knowledge-based guidance.
- Reward normalisation (running std) for stable gradients.
- Intrinsic motivation (RND or ICM) for early exploration.
- Curriculum to manage difficulty progression.
- Reward clipping as a final safety net.
The key is to anneal shaping signals over time. Start with strong guidance, then fade it so the agent optimises the true objective.
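One simple way to fade a shaping signal is a linear schedule on its coefficient. The sketch below is a minimal example under assumptions: the names `shaping_weight` and `total_reward`, the linear shape, and the start and end values are illustrative choices, not a prescribed schedule.

```python
def shaping_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly decay the shaping coefficient from start to end, then hold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def total_reward(raw_reward, shaping_reward, step, total_steps):
    """Environment reward plus the annealed shaping term."""
    return raw_reward + shaping_weight(step, total_steps) * shaping_reward


early = total_reward(0.0, 0.4, step=0, total_steps=100)     # full shaping
middle = total_reward(0.0, 0.4, step=50, total_steps=100)   # half shaping
late = total_reward(1.0, 0.4, step=100, total_steps=100)    # raw reward only
print(early, middle, late)
```

For a potential-based term, any fixed weight c corresponds to the potential cΦ, so each stage of the schedule on its own still preserves the optimal policy. Intrinsic bonuses carry no such guarantee, which is one reason to drive their weight all the way to zero.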
The one thing to remember: safe reward shaping follows the PBRS formula γΦ(s') - Φ(s), and everything else (curiosity, curricula, reward learning) is layered on top of that guarantee.