Policy Gradient Methods — Core Concepts
Why optimise the policy directly?
Value-based methods (like Q-Learning) learn a value function and derive a policy from it. Policy gradient methods skip the middleman — they parameterise the policy directly (usually as a neural network) and optimise it using gradient ascent on expected reward.
This has two major advantages:
- Continuous actions — a policy network can output a mean and standard deviation for a Gaussian distribution, naturally handling smooth control without discretisation.
- Stochastic policies — the agent can learn to randomise, which is optimal in some games (like rock-paper-scissors).
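To make the continuous-action point concrete, here is a minimal sketch of a Gaussian policy head in plain Python. The `mean` and `std` values stand in for the outputs of a policy network for one state; they, and the helper names `gaussian_log_prob` and `sample_action`, are illustrative assumptions rather than anything defined in this article.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of `action` under a Normal(mean, std) policy."""
    var = std ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - (action - mean) ** 2 / (2.0 * var)


def sample_action(mean, std, rng):
    """Draw one continuous action from the Gaussian policy."""
    return rng.gauss(mean, std)


# Hypothetical network outputs for a single state (illustrative values only).
mean, std = 0.3, 0.5
rng = random.Random(0)  # seeded so the sketch is reproducible

action = sample_action(mean, std, rng)
log_prob = gaussian_log_prob(action, mean, std)
print(action, log_prob)
```

The log-probability is the quantity the gradient acts on, so a real implementation would compute it with an autodiff library instead of plain floats.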
The policy gradient theorem
The core result: the gradient of expected reward with respect to policy parameters θ is:
∇J(θ) = E[∇ log π(a|s; θ) · R]
Translation: raise the log-probability of each sampled action in proportion to the return that followed it. Actions followed by high returns become more likely; actions followed by low or negative returns become less likely relative to the alternatives.
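A small worked example may help here. The sketch below assumes a stateless softmax policy over two actions whose parameters are just the logits, which is a simplification chosen for illustration. For that policy the gradient of log π(a) with respect to the logits is one_hot(a) - π, so a single update can be written by hand. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert action preferences into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_step(logits, action, ret, lr=0.1):
    """One sample-based ascent step: theta += lr * grad_log_prob * return."""
    grad = grad_log_prob(logits, action)
    return [x + lr * g * ret for x, g in zip(logits, grad)]


logits = [0.0, 0.0]  # start from a uniform policy
after_good = policy_gradient_step(logits, action=0, ret=+1.0)
after_bad = policy_gradient_step(logits, action=0, ret=-1.0)

print(softmax(after_good)[0])  # above 0.5: action 0 became more likely
print(softmax(after_bad)[0])   # below 0.5: action 0 became less likely
```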
REINFORCE: the simplest policy gradient
REINFORCE collects a full episode, computes the discounted return from each step onward, and updates:
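# After collecting an episode: states, actions, rewards
# (compute_discounted_returns and policy are assumed to be defined elsewhere)
returns = compute_discounted_returns(rewards, gamma=0.99)
loss = 0.0
for s, a, G in zip(states, actions, returns):
    log_prob = policy.log_prob(s, a)
    # Minimising -log_prob * G is gradient ascent on the expected return
    loss = loss - log_prob * G
# Backpropagate loss and take one optimiser step here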
The problem: REINFORCE has high variance. Returns can swing wildly between episodes, making the gradient noisy and training slow.
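The REINFORCE snippet calls a `compute_discounted_returns` helper that the text does not define. One plausible version, assuming `rewards` is a plain list of per-step rewards from a single episode, is sketched below.

```python
def compute_discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one already computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(compute_discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Working backwards keeps the computation linear in the episode length, since each return is built from the one after it.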
Baselines reduce variance
Subtract a baseline b(s) from the return:
∇J(θ) = E[∇ log π(a|s; θ) · (R - b(s))]
The baseline does not change the expected gradient, provided b(s) depends only on the state and not on the action taken, so the estimator stays unbiased while its variance can drop substantially. The most common baseline is the value function V(s), an estimate of how good the state is on average. The difference R - V(s) is called the advantage: how much better this action was compared to average.
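The effect of a baseline can be checked exactly on a toy problem. The sketch below assumes a single state, a softmax policy over three actions, and a deterministic return per action, all of which are illustrative choices. It enumerates every action to compute the exact mean and total variance of the gradient estimator, with and without the baseline V(s). The name `gradient_stats` is hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_stats(logits, rewards, baseline):
    """Exact mean and total variance of grad_log_pi(a) * (R(a) - baseline)."""
    probs = softmax(logits)
    n = len(logits)
    # Estimator value for each possible action: (one_hot(a) - pi) * (R(a) - b).
    samples = [
        [((1.0 if i == a else 0.0) - probs[i]) * (rewards[a] - baseline) for i in range(n)]
        for a in range(n)
    ]
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    variance = sum(
        probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, variance


logits = [0.0, 0.5, -0.5]    # illustrative policy parameters
rewards = [10.0, 11.0, 9.0]  # illustrative return for each action
probs = softmax(logits)
value = sum(p * r for p, r in zip(probs, rewards))  # V(s) under the current policy

mean_raw, var_raw = gradient_stats(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(logits, rewards, baseline=value)

print(mean_raw, mean_base)  # the two means agree up to rounding
print(var_raw, var_base)    # the variance is much smaller with the baseline
```

The returns here are all close to 10, so without a baseline most of the gradient signal is a large common offset; subtracting V(s) removes it.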
Actor-Critic
Instead of waiting for a full episode (like REINFORCE), actor-critic methods update at every step using:
- Actor — the policy network π(a|s; θ). Updated with policy gradients.
- Critic — the value network V(s; w). Updated to predict expected returns.
The advantage is estimated as:
A(s, a) ≈ r + γV(s') - V(s)
This TD advantage has lower variance than full returns (at the cost of some bias from the imperfect critic). The result is faster, more stable learning.
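The one-step advantage is simple enough to sketch directly. The functions below assume the critic's predictions are available as plain floats. The `done` flag, which drops the bootstrap term at episode end, is an addition not spelled out in the formula above, though it is the usual convention. Function names are hypothetical.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # At a terminal transition there is no next state, so its value is treated as 0.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def critic_target(reward, value_next, gamma=0.99, done=False):
    """Regression target for V(s): the one-step bootstrapped return."""
    return reward + (0.0 if done else gamma * value_next)


# Hypothetical critic outputs for a single transition.
adv = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.9)
target = critic_target(reward=1.0, value_next=3.0, gamma=0.9)
print(adv, target)  # adv = 1.0 + 0.9 * 3.0 - 2.0, target = 1.0 + 0.9 * 3.0
```

The actor would scale its log-probability gradient by `adv`, while the critic would be regressed toward `target`.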
PPO: the practical standard
Proximal Policy Optimisation (PPO) is one of the most widely used policy gradient algorithms. It adds two key ideas:
- Clipped objective — limits how much the policy can change in one update, preventing destructive large steps:
  L = min(ratio · A, clip(ratio, 1-ε, 1+ε) · A)
  where ratio = π_new(a|s) / π_old(a|s) and ε is typically 0.2.
- Multiple epochs — reuses collected data for several gradient steps (unlike vanilla policy gradient which uses each batch once).
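The clipped objective can be sketched per sample in a few lines. This version assumes the policy supplies log-probabilities, so the ratio is computed as exp(log π_new - log π_old), a common choice for numerical stability. The name `clipped_surrogate` is hypothetical, and a real implementation would average this quantity over a batch and maximise it.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# Ratio of 1.5 with a negative advantage: the unclipped, more pessimistic term wins.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)

print(capped, uncapped)
```

Taking the minimum makes the objective pessimistic: the policy gets no extra credit for moving further than the clip range in a helpful direction, but is still fully penalised for moving in a harmful one.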
PPO is a common starting point in Stable-Baselines3 and in many RL applications because it is relatively robust to hyperparameter choices and works on both discrete and continuous action spaces.
Algorithm family overview
| Algorithm | Type | Key idea | Typical use |
|---|---|---|---|
| REINFORCE | On-policy | Full episode returns | Teaching, simple tasks |
| A2C | On-policy, actor-critic | Synchronous advantage updates | Fast baseline |
| PPO | On-policy, actor-critic | Clipped ratio objective | General purpose |
| TRPO | On-policy, actor-critic | KL-divergence constraint | When strictly bounded policy updates matter |
| SAC | Off-policy, actor-critic | Entropy-regularised objective | Continuous control |
Common misconception
People think policy gradient methods are “better” than value methods. In reality, they are complementary. Pure policy gradients have high variance; pure value methods struggle with continuous actions. The best modern algorithms (PPO, SAC) combine both approaches via actor-critic architectures.
The one thing to remember: Policy gradients directly adjust the probability of actions based on how well they worked — increase what won, decrease what lost — and the clipped ratio in PPO keeps those adjustments safe.
See Also
- Python Environment Wrappers: How thin add-on layers let you change what a learning program sees and does without rewriting the game itself
- Python Monte Carlo Tree Search: The clever trick behind AlphaGo, how a program explores millions of possible moves by playing quick random games against itself
- Python Multi Agent Reinforcement: What happens when multiple programs learn together in the same world, including cooperation, competition, and emergent teamwork
- Python Openai Gym Environments: Why OpenAI Gym is the playground where robots and programs learn by trial and error, no prior coding knowledge needed
- Python Q Learning Implementation: How a program builds a cheat sheet of every situation and every action to figure out the best move, no teacher required