Reward Shaping — Deep Dive

Implement potential-based shaping, curiosity-driven exploration, and automated reward search in Python RL pipelines

Formal foundation

Consider an MDP (S, A, T, R, γ). A shaping function F(s, a, s') adds a term to the reward:

R'(s, a, s') = R(s, a, s') + F(s, a, s')

Ng's theorem (Ng, Harada and Russell, 1999): If F(s, a, s') = γΦ(s') - Φ(s) for some bounded real-valued function Φ: S → ℝ, then every policy that is optimal under R' is also optimal under R, and vice versa. Any other form of F may change the optimal policy.

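This is the theoretical bedrock. Shaping of this form is called potential-based reward shaping (PBRS). Shaping techniques that guarantee an unchanged optimal policy either use PBRS directly or are shown to be equivalent to it.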
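
The invariance claim can be checked numerically on a small tabular problem. The sketch below is only an illustration under stated assumptions: the MDP is random and hypothetical, `greedy_policy` is a helper name introduced here, and plain value iteration stands in for whatever solver a real pipeline would use.

```python
import numpy as np

def greedy_policy(T, R, gamma, iters=2000):
    """Value iteration on a tabular MDP; returns the greedy action per state.

    T[s, a, s2] is a transition probability, R[s, a, s2] a reward.
    """
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        # Q[s, a] = sum over s2 of T[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Hypothetical random MDP, used only for illustration.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions, n_states))

# Arbitrary potential; F[s, a, s2] = gamma * phi[s2] - phi[s].
phi = rng.normal(size=n_states)
F_pbrs = gamma * phi[None, None, :] - phi[:, None, None]

pi_plain = greedy_policy(T, R, gamma)
pi_shaped = greedy_policy(T, R + F_pbrs, gamma)
print(pi_plain, pi_shaped)
```

Under the theorem the shaped Q-values differ from the original ones only by Φ(s), which does not depend on the action, so the two greedy policies should coincide.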

Implementing PBRS as a Gymnasium wrapper

import gymnasium as gym
import numpy as np

class PotentialShapingWrapper(gym.Wrapper):
    """Add potential-based reward shaping to any environment."""

    def __init__(self, env: gym.Env, potential_fn, gamma: float = 0.99):
        super().__init__(env)
        self.potential_fn = potential_fn
        self.gamma = gamma
        self._prev_potential = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._prev_potential = self.potential_fn(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        current_potential = self.potential_fn(obs)
        shaping = self.gamma * current_potential - self._prev_potential
        self._prev_potential = current_potential

        info["raw_reward"] = reward
        info["shaping_reward"] = shaping
        return obs, reward + shaping, terminated, truncated, info

Usage with a distance-based potential:

import gymnasium as gym

def negative_distance_potential(obs):
    # For MountainCar: obs[0] is position, goal is at 0.5
    return -abs(obs[0] - 0.5)

env = gym.make("MountainCar-v0")
env = PotentialShapingWrapper(env, negative_distance_potential, gamma=0.99)

Handling terminal states

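At terminal states, the potential should be zero (the episode is over, so there is no future return to discount). A time-limit truncation is not a terminal state, so the potential is still evaluated there. Adjust the wrapper's step method: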

def step(self, action):
    obs, reward, terminated, truncated, info = self.env.step(action)
    if terminated:
        current_potential = 0.0
    else:
        current_potential = self.potential_fn(obs)
    shaping = self.gamma * current_potential - self._prev_potential
    self._prev_potential = current_potential
    info["raw_reward"] = reward
    info["shaping_reward"] = shaping
    return obs, reward + shaping, terminated, truncated, info
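
Why zero matters can be seen from the telescoping sum: over an episode, the discounted shaping terms add up to γ^T Φ(s_T) - Φ(s_0). With Φ(s_T) = 0 the total depends only on the start state, so it cannot favour one trajectory over another. A small sketch, where the potential values are made up for illustration and `discounted_shaping_sum` is a helper introduced here:

```python
def discounted_shaping_sum(potentials, gamma, terminal_potential_zero=True):
    """Sum gamma**t * F_t over one episode, with F_t = gamma*phi(s_{t+1}) - phi(s_t).

    `potentials` lists phi(s_0), ..., phi(s_T), where s_T is the terminal state.
    """
    phis = list(potentials)
    if terminal_potential_zero:
        phis[-1] = 0.0
    total = 0.0
    for t in range(len(phis) - 1):
        total += gamma ** t * (gamma * phis[t + 1] - phis[t])
    return total

gamma = 0.99
# Two hypothetical episodes: same start state, different lengths and endings.
episode_a = [-1.0, -0.6, -0.2, 3.0]
episode_b = [-1.0, -0.9, -0.8, -0.7, -5.0]

for name, episode in (("a", episode_a), ("b", episode_b)):
    zeroed = discounted_shaping_sum(episode, gamma)
    raw = discounted_shaping_sum(episode, gamma, terminal_potential_zero=False)
    print(name, zeroed, raw)
```

With the terminal potential zeroed, both episodes receive the same total shaping, -Φ(s_0). Without it, the totals differ by the discounted terminal potential, which is exactly the kind of trajectory-dependent bias PBRS is meant to avoid.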

Curiosity-driven exploration: ICM implementation

The Intrinsic Curiosity Module (Pathak et al., 2017) has three components:

Feature encoder — maps raw observations to a compact embedding.
Forward model — predicts the next embedding given the current embedding and action.
Inverse model — predicts the action given two consecutive embeddings (regularises the feature space).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.inverse_model = nn.Sequential(
            nn.Linear(feat_dim * 2, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs, next_obs, action_onehot):
        phi = self.encoder(obs)
        phi_next = self.encoder(next_obs)

        # Inverse model
        action_pred = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(action_pred, action_onehot.argmax(dim=-1))

        # Forward model: squared error per transition, averaged over features
        phi_next_pred = self.forward_model(torch.cat([phi, action_onehot], dim=-1))
        forward_error = F.mse_loss(
            phi_next_pred, phi_next.detach(), reduction="none"
        ).mean(dim=-1)
        forward_loss = forward_error.mean()

        # Intrinsic reward = per-transition forward prediction error
        intrinsic_reward = forward_error.detach()
        return intrinsic_reward, forward_loss, inverse_loss

During training, add intrinsic_reward * η (a scaling coefficient, typically 0.01–0.1) to the environment reward. Tune η carefully — too high and the agent explores forever; too low and curiosity has no effect.
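
To get a feel for how η shifts the balance, the sketch below mixes a sparse task reward with a dense curiosity signal and reports how much of the batch reward comes from curiosity. The reward values and the `mix_rewards` helper are assumptions made for illustration only.

```python
import numpy as np

def mix_rewards(extrinsic, intrinsic, eta=0.05):
    """Return extrinsic + eta * intrinsic for a batch of transitions."""
    extrinsic = np.asarray(extrinsic, dtype=np.float64)
    intrinsic = np.asarray(intrinsic, dtype=np.float64)
    return extrinsic + eta * intrinsic

# Hypothetical batch: sparse task reward, dense curiosity signal.
r_ext = np.array([0.0, 0.0, 0.0, 1.0])
r_int = np.array([2.0, 0.5, 0.1, 0.4])

for eta in (0.01, 0.1, 1.0):
    total = mix_rewards(r_ext, r_int, eta)
    share = eta * r_int.sum() / total.sum()
    print(f"eta={eta}: intrinsic share of batch reward = {share:.2f}")
```

Logging this share during training is one simple way to notice when curiosity starts to dominate the task signal.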

Random Network Distillation (RND)

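RND (Burda et al., 2018) is simpler than ICM: it needs no action input and no inverse model. A predictor network is trained to match the output of a fixed, randomly initialised target network, and the prediction error serves as the intrinsic reward: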

class RND(nn.Module):
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed random target network — never trained
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False

        # Predictor network — trained to match target
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-observation error; the reward signal needs no gradients
        with torch.no_grad():
            target_feat = self.target(obs)
            pred_feat = self.predictor(obs)
        return F.mse_loss(pred_feat, target_feat, reduction="none").mean(dim=-1)

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        target_feat = self.target(obs).detach()
        pred_feat = self.predictor(obs)
        return F.mse_loss(pred_feat, target_feat)

Novel states produce high prediction error because the predictor has never seen them. As the predictor trains on visited states, their intrinsic reward drops.
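
This effect can be reproduced in a toy setting. The sketch below is a NumPy stand-in, not the torch module above: it assumes a linear predictor, a tanh random-feature target, and two hypothetical clusters of observations, one visited and one never trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 4, 8

# Fixed random target features: never updated.
W_target = rng.normal(size=(obs_dim, feat_dim))
# Linear predictor (a simplification of the MLP above), trained to match the target.
W_pred = np.zeros((obs_dim, feat_dim))

def target(obs):
    return np.tanh(obs @ W_target)

def intrinsic_reward(obs):
    """Mean squared prediction error per observation."""
    return ((obs @ W_pred - target(obs)) ** 2).mean(axis=-1)

visited = rng.normal(scale=0.3, size=(256, obs_dim))          # seen during training
novel = rng.normal(loc=3.0, scale=0.3, size=(256, obs_dim))   # never seen

before = intrinsic_reward(visited).mean()
for _ in range(500):
    err = visited @ W_pred - target(visited)
    # Gradient of the squared error summed over features, averaged over the batch.
    W_pred -= 0.5 * (2.0 * visited.T @ err / len(visited))
after = intrinsic_reward(visited).mean()

print(f"visited before: {before:.4f}, visited after: {after:.4f}")
print(f"novel after:    {intrinsic_reward(novel).mean():.4f}")
```

The reward on visited observations shrinks as the predictor fits them, while observations far from the training data keep a large error.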

Normalising intrinsic rewards

RND rewards can vary wildly in scale. Maintain a running mean and standard deviation:

class RunningMeanStd:
    def __init__(self):
        self.mean = 0.0
        self.var = 1.0
        self.count = 1e-4

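    def update(self, x):
        # x: 1-D tensor of rewards. Population variance (unbiased=False) keeps
        # the moment merge exact and avoids NaN for a batch of size 1.
        batch_mean = x.mean().item()
        batch_var = x.var(unbiased=False).item()
        batch_count = x.shape[0]
        self._update_from_moments(batch_mean, batch_var, batch_count)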

    def _update_from_moments(self, mean, var, count):
        delta = mean - self.mean
        total = self.count + count
        self.mean += delta * count / total
        m_a = self.var * self.count
        m_b = var * count
        m2 = m_a + m_b + delta ** 2 * self.count * count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / (self.var ** 0.5 + 1e-8)
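
The merge rule in `_update_from_moments` is the parallel-variance formula of Chan et al. A quick way to build confidence in it is to merge several batches and compare against the statistics of the concatenated data. The sketch below does this in NumPy. `merge_moments` is a standalone helper written for this check, and the data are synthetic.

```python
import numpy as np

def merge_moments(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Combine two (mean, population variance, count) summaries into one."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / n
    return mean, m2 / n, n

rng = np.random.default_rng(0)
batches = [rng.normal(loc=5.0, scale=2.0, size=64) for _ in range(10)]

# np.var defaults to the population variance, which the formula expects.
mean, var, count = batches[0].mean(), batches[0].var(), batches[0].size
for batch in batches[1:]:
    mean, var, count = merge_moments(
        mean, var, count, batch.mean(), batch.var(), batch.size
    )

all_data = np.concatenate(batches)
print(np.isclose(mean, all_data.mean()), np.isclose(var, all_data.var()))
```

The merge is exact only when each batch variance is a population variance. Feeding it sample variances introduces a small bias that grows with the number of small batches.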

Automated reward search

Hand-crafting rewards is tedious. Recent approaches automate it:

Reward learning from demonstrations (IRL)

Inverse RL infers a reward function from expert trajectories. Libraries like imitation (built on SB3) provide reward networks and adversarial algorithms such as AIRL and GAIL:

from imitation.algorithms.adversarial.airl import AIRL
from imitation.rewards.reward_nets import BasicRewardNet

# obs_space and act_space are the environment's observation and action spaces
reward_net = BasicRewardNet(obs_space, act_space)
# Train reward_net with AIRL (or GAIL) on expert demonstrations

LLM-based reward design

A recent trend uses large language models to propose reward functions from natural language task descriptions. The LLM generates Python code for a reward function, the agent trains, and metrics feed back to the LLM for iterative refinement.
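
The loop described above can be sketched in a few lines. Everything here is a stand-in: `propose_reward_code` fakes the LLM call with canned responses, `evaluate` replaces agent training with a crude ranking check, and the task string is hypothetical. It shows the control flow only, not any particular published system.

```python
def propose_reward_code(task, feedback):
    """Stand-in for an LLM call (hypothetical): returns Python source for a reward."""
    if not feedback:
        return "def reward(obs):\n    return 0.0\n"
    return "def reward(obs):\n    return -abs(obs[0] - 0.5)\n"

def compile_reward(source):
    """Execute generated source in a fresh namespace and return its `reward` function."""
    namespace = {}
    exec(source, namespace)  # generated code is untrusted: sandbox it in real use
    return namespace["reward"]

def evaluate(reward_fn):
    """Stand-in for 'train an agent, then measure task success' (hypothetical)."""
    probe_states = [(-1.0,), (0.0,), (0.5,)]  # the last one is the goal
    scores = [reward_fn(s) for s in probe_states]
    # Crude proxy metric: does the reward rank the goal state strictly highest?
    return {"prefers_goal": scores[-1] > max(scores[:-1])}

feedback = []
for round_idx in range(3):
    source = propose_reward_code("reach x = 0.5", feedback)
    metrics = evaluate(compile_reward(source))
    feedback.append((source, metrics))  # fed back into the next proposal
    if metrics["prefers_goal"]:
        break

print(len(feedback), feedback[-1][1])
```

In a real pipeline the evaluation step would report training curves and task-success metrics, and those metrics should come from the true objective rather than from the generated reward itself.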

Diagnosing reward problems

Symptom	Likely cause	Fix
Agent exploits a loophole	Reward hacking	Add constraints or use PBRS
Reward climbs but behaviour is nonsensical	Proxy gaming	Evaluate with held-out metrics the agent never sees
Training stalls at zero reward	Sparse reward, no exploration	Add intrinsic motivation (ICM/RND) or curriculum
Agent oscillates between strategies	Shaped reward creates local optima	Verify shaping satisfies PBRS conditions
Performance drops when shaping is removed	Shaping changed optimal policy	Switch to potential-based form

Combining techniques

In practice, production RL pipelines layer multiple shaping strategies:

PBRS for domain-knowledge-based guidance.
Reward normalisation (running std) for stable gradients.
Intrinsic motivation (RND or ICM) for early exploration.
Curriculum to manage difficulty progression.
Reward clipping as a final safety net.

The key is to anneal shaping signals over time. Start with strong guidance, then fade it so the agent optimises the true objective.
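
One way to fade the auxiliary signals is a shared linear schedule applied before the final clip. The sketch below assumes a 100,000-step horizon, a single weight for both the shaping and intrinsic terms, and a clip range of ±10. All of these are placeholder choices, and the function names are introduced here for illustration.

```python
def linear_anneal(step, start=1.0, end=0.0, horizon=100_000):
    """Linearly interpolate from `start` to `end` over `horizon` steps, then hold."""
    frac = min(max(step / horizon, 0.0), 1.0)
    return start + frac * (end - start)

def combined_reward(r_env, r_shaping, r_intrinsic, step, eta=0.05, clip=10.0):
    """Fade the auxiliary terms, add them to the task reward, then clip."""
    w = linear_anneal(step)
    total = r_env + w * r_shaping + w * eta * r_intrinsic
    return max(-clip, min(clip, total))

for step in (0, 50_000, 100_000):
    print(step, linear_anneal(step), combined_reward(1.0, 0.5, 2.0, step))
```

Once the schedule reaches zero, the agent sees only the environment reward, so any remaining performance reflects the true objective.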

The one thing to remember: Policy-preserving reward shaping follows the PBRS formula γΦ(s') - Φ(s). Curiosity, curricula, and reward learning are layered on top of it but do not inherit that guarantee, so anneal them and check the result against the true objective.
