Reward Shaping — Core Concepts

Why reward shaping exists

Most interesting RL problems have sparse rewards: the agent only gets a signal at the very end (win/lose, success/failure). In a maze with 10,000 cells, a single reward at the exit means the agent must stumble onto the exit by chance before it can start learning. Reward shaping adds intermediate signals to guide exploration, ideally without changing the optimal solution.

The danger of naive shaping

Add an arbitrary bonus and you risk changing what the agent learns, not just how fast it learns. Classic failure: giving a robot a bonus for moving forward causes it to rock back and forth (forward step, bonus, backward step, forward step, bonus…) rather than actually reaching the goal.

This is called reward hacking — the agent optimises the reward you gave, not the reward you intended.
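The rocking failure can be reproduced with a few lines of arithmetic. The sketch below assumes a hypothetical one-dimensional corridor and a hypothetical `naive_bonus` that pays +1 for every forward step; neither comes from a real library. It only compares the total bonus collected by two fixed paths.

```python
GOAL = 5  # hypothetical goal cell at the end of a short corridor


def naive_bonus(prev_pos, new_pos):
    """Hypothetical naive shaping: +1 whenever the agent steps forward."""
    return 1.0 if new_pos > prev_pos else 0.0


def total_bonus(positions):
    """Sum the naive bonus over consecutive positions of a path."""
    return sum(naive_bonus(a, b) for a, b in zip(positions, positions[1:]))


direct = list(range(GOAL + 1))  # 0, 1, 2, 3, 4, 5: walks straight to the goal
rocking = [0, 1] * 10           # 0, 1, 0, 1, ...: never gets past cell 1

direct_bonus = total_bonus(direct)    # 5 forward steps
rocking_bonus = total_bonus(rocking)  # 10 forward steps out of 19 moves

print(direct_bonus, rocking_bonus)
```

Under these assumptions the rocking path collects twice the bonus of the direct path without ever reaching the goal, which is exactly the incentive the agent ends up optimising.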

Potential-based reward shaping (PBRS)

Andrew Ng, Daishi Harada and Stuart Russell proved in 1999 that if the shaping reward has a specific mathematical form, it is guaranteed not to change the optimal policy. The formula:

F(s, s') = γ · Φ(s') - Φ(s)

Where Φ is a potential function (a single number for each state) and γ is the discount factor. The shaped reward is added to the environment reward:

R_shaped = R_env + F(s, s')

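Why this works: along any trajectory the discounted shaping terms telescope, so their sum depends only on the potential of the first and last states, not on the actions taken in between. When the final term vanishes (Φ is 0 at terminal states, or Φ is bounded and the horizon is infinite), every policy starting from the same state is shifted by the same constant, so the ranking of policies and the optimal policy are unchanged. In the short run, though, the shaping term provides a gradient that points the agent toward high-potential states.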
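The telescoping argument can be checked numerically. The sketch below assumes a hypothetical one-dimensional state space and a hypothetical potential `phi` (negative distance to a goal at cell 10). It compares the discounted sum of F along a random walk against the closed form γ^T · Φ(s_T) - Φ(s_0).

```python
import random

GAMMA = 0.9  # discount factor, chosen arbitrarily for the illustration


def phi(state):
    """Hypothetical potential: negative distance to a goal at cell 10."""
    return -abs(10 - state)


def shaping(s, s_next, gamma=GAMMA):
    """F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def discounted_shaping_sum(states, gamma=GAMMA):
    """Discounted sum of F over every transition in the trajectory."""
    pairs = zip(states, states[1:])
    return sum((gamma ** t) * shaping(s, s_next, gamma)
               for t, (s, s_next) in enumerate(pairs))


def closed_form(states, gamma=GAMMA):
    """What the sum telescopes to: gamma^T * phi(s_T) - phi(s_0)."""
    horizon = len(states) - 1
    return (gamma ** horizon) * phi(states[-1]) - phi(states[0])


rng = random.Random(0)
trajectory = [3]
for _ in range(20):
    trajectory.append(trajectory[-1] + rng.choice([-1, 1]))

gap = abs(discounted_shaping_sum(trajectory) - closed_form(trajectory))
print(gap)  # should be zero up to floating-point rounding
```

Because the result depends only on the endpoints, two trajectories with the same start and the same final term receive the same total shaping, whatever happens in between.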

Choosing a potential function

Good potentials encode domain knowledge:

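  • Negative distance to goal: states closer to the goal have higher potential.
  • Heuristic value: the negative of a cost-to-go heuristic (like Manhattan distance) works. The heuristic does not have to be admissible for the policy-invariance guarantee to hold; a more accurate one simply guides the agent better.
  • Expert value function: if you have a rough estimate of state values from a previous run, use that as Φ.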
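As an illustration of the first two options, here is a minimal sketch of a distance-based potential on a grid. It assumes states are `(row, col)` tuples; `make_potential` and `shaped_reward` are hypothetical helper names, not part of any library.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def make_potential(goal):
    """Build phi(s) = -distance(s, goal), so phi is highest (0) at the goal."""
    def phi(state):
        return -float(manhattan(state, goal))
    return phi


def shaped_reward(r_env, s, s_next, phi, gamma=0.99):
    """R_shaped = R_env + gamma * phi(s') - phi(s)."""
    return r_env + gamma * phi(s_next) - phi(s)


phi = make_potential(goal=(4, 4))
toward = shaped_reward(0.0, (0, 0), (0, 1), phi)  # one step closer to the goal
away = shaped_reward(0.0, (0, 1), (0, 0), phi)    # one step further away

print(toward, away)
```

With this choice of Φ, a step toward the goal yields a positive shaped reward and a step away yields a negative one, even though the environment reward is zero in both cases.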

Intrinsic motivation

Instead of hand-crafting bonuses, let the agent generate its own curiosity signal:

Method                            | Signal                                                                      | Intuition
Curiosity (ICM)                   | Prediction error of a learned forward model                                 | Visit states the agent cannot yet predict
Random Network Distillation (RND) | Error of a trained predictor network against a fixed, random target network | Novel states produce high distillation error
Count-based                       | Bonus that shrinks with visit count (e.g. 1/√N)                             | Explore states visited least often

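Intrinsic rewards are added to the environment reward during training, usually scaled by a small coefficient. They naturally decay as the agent explores more of the state space, but unlike PBRS they carry no guarantee of leaving the optimal policy unchanged.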
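The count-based row is the easiest to sketch. The example below assumes hashable (for instance discretised) states and uses a bonus of β/√N(s), which is one common choice rather than the only one. The class name `CountBonus` and the parameter `beta` are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Hypothetical count-based intrinsic reward: beta / sqrt(visit count)."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = Counter()

    def __call__(self, state):
        # Count this visit first, so the first visit yields exactly beta.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])


bonus = CountBonus(beta=1.0)
history = [bonus("a") for _ in range(4)]  # the same state visited four times
fresh = bonus("b")                        # a state never seen before

print(history, fresh)
```

The bonus for a repeated state shrinks with every visit while an unseen state still earns the full β, which is the decay described above.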

Reward clipping and normalisation

Even without shaping, raw rewards often need adjustment:

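  • Clipping: bound rewards to [-1, 1] to stabilise gradient magnitudes. Atari benchmarks use this extensively. Note that clipping hides the difference between small and large rewards, so it can change what the agent prefers.
  • Normalisation: divide by a running standard deviation. Stable-Baselines3’s VecNormalize does this automatically when reward normalisation is enabled, using a running estimate computed from discounted returns.
  • Discounting: a lower γ makes the agent care more about immediate rewards; a higher γ makes it plan further ahead.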
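The first two adjustments can be sketched with the standard library alone. This is a simplified illustration, not the VecNormalize implementation: it normalises by the running standard deviation of the rewards themselves, and `clip_reward`, `RunningStd` and `normalise` are hypothetical names.

```python
import math


def clip_reward(r, low=-1.0, high=1.0):
    """Bound a raw reward to [low, high]."""
    return max(low, min(high, r))


class RunningStd:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples for a variance.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0


def normalise(r, stats, eps=1e-8):
    """Update the running statistics, then scale the reward by 1 / std."""
    stats.update(r)
    return r / (stats.std + eps)


stats = RunningStd()
scaled = [normalise(r, stats) for r in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]]
print(clip_reward(7.0), stats.std, scaled[-1])
```

Dividing by the standard deviation without subtracting the mean keeps the sign of each reward intact, which is usually what you want for a reward signal.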

Curriculum learning as implicit shaping

Starting with easy tasks and increasing difficulty is a form of reward shaping in disguise. The easy tasks provide denser rewards, bootstrapping the agent’s skill before sparse rewards dominate.

Example: training a robot gripper by starting with the object already near the gripper, then gradually increasing the starting distance.
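One way to automate the gripper curriculum is to widen the starting distance only once the agent succeeds reliably at the current one. The sketch below is an assumption about how such a schedule might look; the function name, the 0.8 success threshold and the step size are all hypothetical values chosen for illustration.

```python
def next_start_distance(current, success_rate, *, step=0.05,
                        max_distance=1.0, threshold=0.8):
    """Increase the starting distance once the agent succeeds often enough.

    All defaults are hypothetical; distances are in arbitrary units.
    """
    if success_rate >= threshold:
        return min(max_distance, current + step)
    return current  # keep practising at the current difficulty


distance = 0.1
# Hypothetical success rates measured after successive training rounds.
for rate in [0.9, 0.5, 0.85, 0.95]:
    distance = next_start_distance(distance, rate)

print(distance)
```

The task only gets harder after a round in which the success rate clears the threshold, so rewards stay dense enough for learning to continue.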

Common misconception

People think reward shaping is “cheating” because you are giving the agent extra information. In practice, every real RL deployment involves reward engineering. The choice is not whether to shape, but whether to do it carefully (with theoretical guarantees like PBRS) or carelessly (with arbitrary bonuses that introduce bugs).

The one thing to remember: Potential-based reward shaping is the safe, proven way to speed up learning — it guides the agent without silently changing what “winning” means.
