Policy Gradient Methods — Core Concepts

REINFORCE, baselines, actor-critic, and PPO — the family of algorithms that directly optimise the policy instead of value functions

Why optimise the policy directly?

Value-based methods (like Q-Learning) learn a value function and derive a policy from it. Policy gradient methods skip the middleman — they parameterise the policy directly (usually as a neural network) and optimise it using gradient ascent on expected reward.

This has two major advantages:

Continuous actions — a policy network can output a mean and standard deviation for a Gaussian distribution, naturally handling smooth control without discretisation.
Stochastic policies — the agent can learn to randomise, which is optimal in some games (like rock-paper-scissors).

The policy gradient theorem

The core result: the gradient of expected reward with respect to policy parameters θ is:

∇J(θ) = E[∇ log π(a|s; θ) · R]

Translation: increase the log-probability of actions proportionally to how good they were. Good actions become more likely; bad actions become less likely.

REINFORCE: the simplest policy gradient

REINFORCE collects a full episode, computes the total return for each step, and updates:

# After collecting an episode: states, actions, rewards
returns = compute_discounted_returns(rewards, gamma=0.99)
for s, a, G in zip(states, actions, returns):
    log_prob = policy.log_prob(s, a)
    loss = -log_prob * G  # negative because we do gradient ascent

The problem: REINFORCE has high variance. Returns can swing wildly between episodes, making the gradient noisy and training slow.

Baselines reduce variance

Subtract a baseline b(s) from the return:

∇J(θ) = E[∇ log π(a|s; θ) · (R - b(s))]

The baseline does not change the expected gradient (it is mathematically unbiased) but dramatically reduces variance. The most common baseline is the value function V(s) — an estimate of how good the state is on average. The difference R - V(s) is called the advantage: how much better this action was compared to average.

Actor-Critic

Instead of waiting for a full episode (like REINFORCE), actor-critic methods update at every step using:

Actor — the policy network π(a|s; θ). Updated with policy gradients.
Critic — the value network V(s; w). Updated to predict expected returns.

The advantage is estimated as:

A(s, a) ≈ r + γV(s') - V(s)

This TD advantage has lower variance than full returns (at the cost of some bias from the imperfect critic). The result is faster, more stable learning.

PPO: the practical standard

Proximal Policy Optimisation (PPO) is the most widely used policy gradient algorithm. It adds two key ideas:

Clipped objective — limits how much the policy can change in one update, preventing destructive large steps:
```
L = min(ratio · A, clip(ratio, 1-ε, 1+ε) · A)
```
where ratio = π_new(a|s) / π_old(a|s) and ε is typically 0.2.
Multiple epochs — reuses collected data for several gradient steps (unlike vanilla policy gradient which uses each batch once).

PPO is the default choice in Stable-Baselines3 and most RL applications because it is robust to hyperparameter choices and works on both discrete and continuous action spaces.

Algorithm family overview

Algorithm	Type	Key idea	Typical use
REINFORCE	On-policy	Full episode returns	Teaching, simple tasks
A2C	On-policy, actor-critic	Synchronous advantage updates	Fast baseline
PPO	On-policy, actor-critic	Clipped ratio objective	General purpose
TRPO	On-policy, actor-critic	KL-divergence constraint	When precision matters
SAC	Off-policy, actor-critic	Entropy-regularised objective	Continuous control

Common misconception

People think policy gradient methods are “better” than value methods. In reality, they are complementary. Pure policy gradients have high variance; pure value methods struggle with continuous actions. The best modern algorithms (PPO, SAC) combine both approaches via actor-critic architectures.

The one thing to remember: Policy gradients directly adjust the probability of actions based on how well they worked — increase what won, decrease what lost — and the clipped ratio in PPO keeps those adjustments safe.

pythonreinforcement-learningaipolicy-gradients