Policy Gradient Methods — ELI5

Instead of scoring every move, what if the program just learned which moves feel right? That is the idea behind policy gradients.

Think about learning to throw darts. One approach is to calculate the exact angle and force for every throw (that is like Q-Learning — scoring every option). But most people do not throw darts that way. Instead, they throw, see where it lands, and adjust: “a little more to the left next time.”

Policy gradient methods work like that intuitive dart-throwing. Instead of building a giant scorecard for every possible move, the program has a built-in instinct — a set of tendencies that say “in this situation, I usually go left.” After each game, the program strengthens the tendencies that led to good outcomes and weakens the ones that led to bad outcomes.

Imagine the program played ten games. In three of them it tried jumping early and won. In seven it jumped late and lost. The program adjusts: “jumping early seems good — do more of that.” Over thousands of games, these small nudges add up and the program develops strong instincts.
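The nudging described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes a toy game with only two moves, made-up win chances (`win_chance`), and a simple REINFORCE-style update on a softmax policy. The names `prefs`, `train`, and `learning_rate` are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(prefs):
    """Turn raw preferences into probabilities that sum to 1."""
    biggest = max(prefs)
    exps = [math.exp(p - biggest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(num_games=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # The program's "instincts": one preference per move.
    # Index 0 = jump early, index 1 = jump late.
    prefs = [0.0, 0.0]
    # Assumption for this toy game: jumping early wins 80% of the time,
    # jumping late wins 20% of the time.
    win_chance = [0.8, 0.2]

    for _ in range(num_games):
        probs = softmax(prefs)
        # Pick a move according to the current instincts.
        action = 0 if rng.random() < probs[0] else 1
        # Play the game: +1 for a win, -1 for a loss.
        reward = 1.0 if rng.random() < win_chance[action] else -1.0
        # REINFORCE-style nudge: after a win, make the chosen move more
        # likely; after a loss, make it less likely.
        for a in range(len(prefs)):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += learning_rate * reward * grad_log_prob

    return softmax(prefs)


final_probs = train()
print("P(jump early), P(jump late):", final_probs)
```

Each single update is small, but because wins follow early jumps more often than late ones, the preference for jumping early should grow steadily over many games.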

The beautiful thing is that this works even when the actions are smooth, like steering a wheel or controlling a robot arm. Scorecard approaches struggle with smooth actions because there are infinitely many options to score. Policy gradients just nudge the tendency (“steer a little more left”) without listing every possible angle.
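A smooth action can be handled the same way. The sketch below assumes a one-number "steering angle", a made-up best angle (`target`), and a policy that picks angles from a bell curve around its current favorite (`mean`). It uses the same REINFORCE-style nudge plus a running-average baseline to keep the updates steady. All names and numbers here are illustrative assumptions.

```python
import random


def train_steering(num_tries=5000, learning_rate=0.005, seed=0):
    rng = random.Random(seed)
    target = 0.5   # assumption: the (unknown to the program) best angle
    mean = 0.0     # the program's current favorite angle
    sigma = 0.2    # how much it wobbles around that favorite
    baseline = 0.0  # running average of recent rewards

    for _ in range(num_tries):
        # Try an angle near the current favorite.
        angle = rng.gauss(mean, sigma)
        # Closer to the target is better (reward is never above 0).
        reward = -(angle - target) ** 2
        # Was this try better or worse than what we usually get?
        advantage = reward - baseline
        # Gradient of the log-probability of a Gaussian w.r.t. its mean.
        grad_log_prob = (angle - mean) / (sigma ** 2)
        # Nudge the favorite toward angles that did better than usual,
        # and away from angles that did worse.
        mean += learning_rate * advantage * grad_log_prob
        # Update the running average of rewards.
        baseline += 0.05 * (reward - baseline)

    return mean


learned_angle = train_steering()
print("learned favorite angle:", learned_angle)
```

Notice that the program never lists or scores every possible angle. It only keeps one number, its favorite angle, and shifts it a little after each try.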

This is why policy gradients are behind many well-known reinforcement learning results, from robot locomotion to game-playing agents. They learn by feel, not by bookkeeping.

The one thing to remember: Policy gradients teach a program good instincts by reinforcing moves that led to wins and discouraging moves that led to losses — learning by feel, not by scorecard.
