Reward Shaping — ELI5

Imagine training a puppy. If you only give treats when the puppy perfectly fetches a ball from across the park, the puppy has no idea what you want. It tries random things — rolling, barking, chewing a stick — and never gets a treat.

But if you give a small treat when the puppy looks at the ball, a bigger treat when it walks toward the ball, and the biggest treat when it picks it up, the puppy learns fast. You are shaping the reward so the path to success is lined with clues.

Reward shaping is the same idea for computer programs that learn. The program gets a score after every action. If you only score the final goal (“you won the game!”), the program stumbles around for ages because it has no idea if it is getting warmer or colder.

By adding small bonus points along the way — closer to the goal, moved in the right direction, avoided a wall — you light up a trail of breadcrumbs. The program follows the trail and learns much faster.
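
To make the breadcrumb idea concrete, here is a minimal sketch in Python. It assumes a made-up world where the program walks along a number line and the goal sits at position 10; the names `GOAL`, `sparse_reward`, `shaped_reward`, and the bonus size of 0.1 are all invented for illustration and do not come from any particular library.

```python
GOAL = 10  # hypothetical goal position on a 1-D track (an assumption for this sketch)


def sparse_reward(position: int) -> float:
    """Score only the final goal: 1 point at the goal, nothing anywhere else."""
    return 1.0 if position == GOAL else 0.0


def shaped_reward(old_position: int, new_position: int, bonus: float = 0.1) -> float:
    """Goal score plus a small breadcrumb for getting closer.

    Moving one step closer earns +bonus, moving one step away costs -bonus.
    """
    progress = abs(GOAL - old_position) - abs(GOAL - new_position)
    return sparse_reward(new_position) + bonus * progress
```

With `sparse_reward`, a step from 3 to 4 scores exactly the same as a step from 3 to 2 (both zero), so the program cannot tell warmer from colder. With `shaped_reward`, the first step earns a small bonus and the second a small penalty, which is the "getting warmer" signal described above.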

There is a catch, though. If your bonus points accidentally reward the wrong behaviour, the program will happily learn that wrong behaviour and ignore the real goal, like a puppy that learns to spin in circles because you once rewarded spinning by accident. So getting the rewards right is more art than science.
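
The spinning-puppy problem can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a recipe: the number-line world, the goal at position 10, and the names `sloppy_bonus`, `careful_bonus`, and `total_bonus` are all hypothetical. The sloppy version pays for every step toward the goal but charges nothing for stepping back, so a program can farm treats by wiggling in place. The careful version takes back the bonus when progress is undone, which is the basic idea behind what researchers call potential-based reward shaping.

```python
GOAL = 10  # hypothetical goal position on a 1-D track (an assumption for this sketch)


def sloppy_bonus(old: int, new: int) -> float:
    """Pays for every step toward the goal, charges nothing for stepping away."""
    return 0.1 if abs(GOAL - new) < abs(GOAL - old) else 0.0


def careful_bonus(old: int, new: int) -> float:
    """Pays for progress and takes the same amount back when progress is undone."""
    return 0.1 * (abs(GOAL - old) - abs(GOAL - new))


def total_bonus(bonus_fn, path):
    """Add up the bonus collected over each consecutive pair of positions."""
    return sum(bonus_fn(a, b) for a, b in zip(path, path[1:]))


# Step forward, step back, repeat. This path never gets near the goal.
wiggle = [3, 4, 3, 4, 3, 4, 3]

sloppy_total = total_bonus(sloppy_bonus, wiggle)    # grows with every forward step
careful_total = total_bonus(careful_bonus, wiggle)  # forward and back cancel out
```

On the wiggle path the sloppy bonus keeps paying out (three forward steps, three treats) even though the program ends where it started, while the careful bonus adds up to nothing. A program trained on the sloppy version has every reason to wiggle forever instead of walking to the goal.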

This is why many researchers call reward design one of the hardest parts of reinforcement learning. The learning recipe (the algorithm) is often fine; the real question is usually: “did I ask for the right thing?”

The one thing to remember: Reward shaping is about leaving breadcrumbs so a learning program can find success faster, but sloppy breadcrumbs lead to shortcuts you never intended.

