Reward Shaping — Core Concepts
Why reward shaping exists
Many interesting RL problems have sparse rewards: the agent only gets a signal at the very end (win/lose, success/failure). In a maze with 10,000 cells, a single reward at the exit means the agent must stumble onto the exit by chance before it can start learning. Reward shaping adds intermediate signals to guide exploration, ideally without changing the optimal solution.
The danger of naive shaping
Add an arbitrary bonus and you risk changing what the agent learns, not just how fast it learns. Classic failure: giving a robot a bonus for moving forward causes it to rock back and forth (forward step, bonus, backward step, forward step, bonus…) rather than actually reaching the goal.
This is called reward hacking — the agent optimises the reward you gave, not the reward you intended.
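The rocking failure can be reproduced with a few lines of arithmetic. The sketch below is a toy, not a real robot: the 1-D track, the goal position, the reward sizes, and the 200-step horizon are all assumed values chosen for illustration.

```python
GAMMA = 0.99
GOAL = 5              # hypothetical goal cell on a 1-D track that starts at cell 0
GOAL_REWARD = 10.0    # hypothetical environment reward for reaching the goal
FORWARD_BONUS = 1.0   # naive shaping: +1 for every forward step


def discounted_return(pattern, horizon=200):
    """Repeat an action pattern (+1 forward, -1 backward) and sum discounted rewards."""
    pos, total = 0, 0.0
    for t in range(horizon):
        action = pattern[t % len(pattern)]
        pos = max(0, pos + action)
        reward = FORWARD_BONUS if action == +1 else 0.0
        if pos == GOAL:
            # Reaching the goal pays the environment reward and ends the episode.
            total += (GAMMA ** t) * (reward + GOAL_REWARD)
            break
        total += (GAMMA ** t) * reward
    return total


go_to_goal = discounted_return([+1])      # walk straight to the goal
rock = discounted_return([+1, -1])        # step forward, step back, repeat

print(f"straight to goal: {go_to_goal:.2f}")
print(f"rocking in place: {rock:.2f}")
```

Under these assumed numbers the rocking pattern collects more discounted reward than walking to the goal, so an agent that maximises this shaped reward has no reason to finish the task.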
Potential-based reward shaping (PBRS)
Ng, Harada, and Russell (1999) proved that if your shaping reward has a specific mathematical form, it is guaranteed not to change the optimal policy. The formula:
F(s, s') = γ · Φ(s') - Φ(s)
Where Φ is a potential function (a single number for each state) and γ is the discount factor. The shaped reward is added to the environment reward:
R_shaped = R_env + F(s, s')
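Why this works: along any trajectory the discounted shaping terms telescope. Each γ · Φ(s') added at one step is subtracted again at the next, so the only thing left is the potential of the start state (plus a final-state term that vanishes when terminal states have zero potential or as the horizon grows). That leftover does not depend on the actions taken, so every policy's return from a given state shifts by the same constant and the ranking of policies, including which one is optimal, is unchanged. In the short run, though, the shaping terms reward moves toward high-potential states.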
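The telescoping argument can be checked numerically. The sketch below assumes the common convention that terminal states have potential 0; the two potential sequences are made-up values, not taken from any environment.

```python
GAMMA = 0.9


def shaping(phi_s, phi_next, gamma=GAMMA):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi_next - phi_s


def discounted_shaping_sum(potentials, gamma=GAMMA):
    """Sum gamma^t * F(s_t, s_{t+1}) along a trajectory given its potentials."""
    return sum(
        (gamma ** t) * shaping(potentials[t], potentials[t + 1], gamma)
        for t in range(len(potentials) - 1)
    )


# Two hypothetical trajectories: same start potential (2.0), both ending in a
# terminal state with potential 0.0, but different lengths and detours.
path_a = [2.0, 1.0, 4.0, 0.0]
path_b = [2.0, 3.0, 0.0]

sum_a = discounted_shaping_sum(path_a)
sum_b = discounted_shaping_sum(path_b)
print(sum_a, sum_b)  # both are -2.0 up to floating-point rounding
```

Both sums equal minus the start potential, regardless of the route taken, which is why the shaping term cannot make one policy look better than another.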
Choosing a potential function
Good potentials encode domain knowledge:
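- Negative distance to goal: states closer to the goal have higher potential.
- Heuristic value: a cost-to-go heuristic such as Manhattan distance works once negated, so that lower estimated cost means higher potential. Admissibility is not required for the policy-invariance guarantee; a more accurate heuristic simply gives better guidance.
- Expert value function: if you have a rough estimate of state values from a previous run, use that as Φ.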
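As a minimal sketch of the negative-distance idea, the code below uses negated Manhattan distance as Φ on a grid. The goal position, the `(x, y)` state encoding, and the function names are assumptions made for the example.

```python
GAMMA = 0.99
GOAL = (4, 4)  # hypothetical goal cell; states are (x, y) tuples


def potential(state, goal=GOAL):
    """Phi(s): negated Manhattan distance, so cells nearer the goal score higher."""
    (x, y), (gx, gy) = state, goal
    return -(abs(x - gx) + abs(y - gy))


def shaped_reward(env_reward, state, next_state, gamma=GAMMA):
    """R_shaped = R_env + gamma * Phi(s') - Phi(s)."""
    return env_reward + gamma * potential(next_state) - potential(state)


toward = shaped_reward(0.0, (0, 0), (1, 0))  # one step closer to the goal
away = shaped_reward(0.0, (1, 0), (0, 0))    # one step further from the goal
print(f"toward: {toward:+.2f}, away: {away:+.2f}")
```

With a zero environment reward, a step toward the goal yields a positive shaped reward and a step away yields a negative one, which is the dense signal the sparse task was missing.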
Intrinsic motivation
Instead of hand-crafting bonuses, let the agent generate its own curiosity signal:
| Method | Signal | Intuition |
|---|---|---|
| Curiosity (ICM) | Prediction error of a forward model | Visit states the agent cannot yet predict |
| Random Network Distillation (RND) | Error of a trained predictor network against a fixed, randomly initialised target network | Novel states produce high distillation error |
| Count-based | Inverse visit count | Explore states visited least often |
Intrinsic rewards are added to the environment reward during training. They naturally decay as the agent explores more of the state space.
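The count-based row is the easiest of the three to sketch. The version below uses a bonus of `beta / sqrt(N(s))`, one common variant of the inverse-count idea; the class name, the `beta` scale, and the requirement that states be hashable are assumptions of this example.

```python
import math
from collections import Counter


class CountBonus:
    """Intrinsic reward that shrinks as a state is visited more often."""

    def __init__(self, beta=0.1):
        self.beta = beta          # hypothetical scale of the exploration bonus
        self.counts = Counter()   # visit count per (hashable) state

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])


bonus = CountBonus(beta=1.0)
rewards = [bonus("cell_a") for _ in range(4)]
print(rewards)  # bonus for the same state falls with each visit

# During training the bonus would be added to the environment reward, e.g.
# total_reward = env_reward + bonus(state)
```

Exact counts only work for small, discrete state spaces; larger problems need approximations, which is where prediction-error methods such as ICM and RND come in.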
Reward clipping and normalisation
Even without shaping, raw rewards often need adjustment:
- Clipping: bound rewards to [-1, 1] to stabilise gradient magnitudes. Atari benchmarks use this extensively.
- Normalisation: divide by a running standard deviation. Stable-Baselines3’s `VecNormalize` does this automatically.
- Discounting: a lower γ makes the agent care more about immediate rewards; a higher γ makes it plan further ahead.
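Clipping and running normalisation can be sketched without any library. This is an illustrative stand-in, not Stable-Baselines3's implementation: the class and function names are hypothetical, and the running statistics here are computed over raw rewards with Welford's algorithm.

```python
import math


def clip_reward(reward, low=-1.0, high=1.0):
    """Bound a reward to the interval [low, high]."""
    return max(low, min(high, reward))


class RunningStd:
    """Running standard deviation via Welford's online algorithm."""

    def __init__(self, eps=1e-8):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are at least two samples.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

    def normalise(self, reward):
        """Update the statistics with this reward, then scale it."""
        self.update(reward)
        return reward / (self.std + self.eps)


stats = RunningStd()
for r in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(r)
print(stats.std)           # close to 2.0 for this sample
print(clip_reward(5.0))    # 1.0
```

Dividing by the standard deviation without subtracting the mean keeps the sign of each reward intact, so a penalty stays a penalty after scaling.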
Curriculum learning as implicit shaping
Starting with easy tasks and increasing difficulty is a form of reward shaping in disguise. The easy tasks provide denser rewards, bootstrapping the agent’s skill before sparse rewards dominate.
Example: training a robot gripper by starting with the object already near the gripper, then gradually increasing the starting distance.
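One way to schedule the gripper curriculum is to widen the starting distance only once the agent succeeds often enough at the current one. The function below is a sketch under assumed numbers: the 0.8 success threshold, the 0.05 step, and the 0.5 cap are hypothetical, with distances in arbitrary units.

```python
def next_start_distance(success_rate, current, step=0.05,
                        max_distance=0.5, threshold=0.8):
    """Return the starting distance to use for the next batch of episodes.

    The distance grows by `step` when the recent success rate reaches
    `threshold`, and never exceeds `max_distance`.
    """
    if success_rate >= threshold:
        return min(max_distance, current + step)
    return current


distance = 0.10
for success_rate in [0.9, 0.6, 0.85]:   # hypothetical evaluation results
    distance = next_start_distance(success_rate, distance)
print(distance)  # grew twice, held once
```

Gating on success rate keeps the agent in a regime where it still sees rewards regularly, which is the implicit-shaping effect described above.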
Common misconception
A common objection is that reward shaping is “cheating” because it gives the agent extra information. In practice, nearly every real RL deployment involves reward engineering. The choice is not whether to shape, but whether to do it carefully (with theoretical guarantees like PBRS) or carelessly (with arbitrary bonuses that introduce bugs).
The one thing to remember: Potential-based reward shaping is the safe, proven way to speed up learning — it guides the agent without silently changing what “winning” means.