Policy Gradient Methods — Core Concepts

Why optimise the policy directly?

Value-based methods (like Q-Learning) learn a value function and derive a policy from it. Policy gradient methods skip the middleman — they parameterise the policy directly (usually as a neural network) and optimise it using gradient ascent on expected reward.

This has two major advantages:

  1. Continuous actions — a policy network can output a mean and standard deviation for a Gaussian distribution, naturally handling smooth control without discretisation.
  2. Stochastic policies — the agent can learn to randomise, which is optimal in some games (like rock-paper-scissors).
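
To make the continuous-action case concrete, here is a minimal sketch of sampling from a Gaussian policy head. The `mean` and `std` values stand in for the outputs of a policy network for one state; they, and the function names, are illustrative assumptions rather than part of any particular library.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at `action`."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def gaussian_policy_sample(mean, std, rng):
    """Sample a continuous action and return it together with its log-probability."""
    action = rng.gauss(mean, std)
    return action, gaussian_log_prob(action, mean, std)


rng = random.Random(0)
# Hypothetical network outputs for a single state (assumed values).
mean, std = 0.3, 0.5
action, log_prob = gaussian_policy_sample(mean, std, rng)
```

The log-probability is the quantity the policy gradient differentiates, which is why the sampler returns it alongside the action.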

The policy gradient theorem

The core result: the gradient of the expected return J(θ) with respect to the policy parameters θ is:

∇J(θ) = E[∇ log π(a|s; θ) · R]

Translation: raise the log-probability of each action in proportion to the return R that followed it. Actions followed by high returns become more likely; actions followed by low or negative returns become less likely.
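
As a small worked example, consider a softmax policy over three actions in a single state. For this parameterisation the gradient of log π(a) with respect to the logits is one_hot(a) - π. The sketch below applies one sampled update and compares the probability of the chosen action before and after. The function names and the learning rate are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_step(logits, action, ret, lr=0.1):
    """One sampled gradient-ascent step: logits += lr * grad log pi(a) * R."""
    g = grad_log_prob(logits, action)
    return [x + lr * gi * ret for x, gi in zip(logits, g)]


logits = [0.0, 0.0, 0.0]
p_before = softmax(logits)[1]
# Action 1 followed by a positive return becomes more likely ...
p_up = softmax(policy_gradient_step(logits, action=1, ret=+1.0))[1]
# ... and followed by a negative return becomes less likely.
p_down = softmax(policy_gradient_step(logits, action=1, ret=-1.0))[1]
```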

REINFORCE: the simplest policy gradient

REINFORCE collects a full episode, computes the discounted return from each step onward, and then updates the policy:

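# After collecting an episode: states, actions, rewards
# (policy, optimizer and compute_discounted_returns are assumed to be defined elsewhere)
returns = compute_discounted_returns(rewards, gamma=0.99)
loss = 0.0
for s, a, G in zip(states, actions, returns):
    log_prob = policy.log_prob(s, a)
    loss = loss - log_prob * G  # negated: the optimiser minimises, we want ascent
optimizer.zero_grad()
loss.backward()
optimizer.step()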

The problem: REINFORCE has high variance. Returns can swing wildly between episodes, making the gradient noisy and training slow.
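
The REINFORCE snippet above calls `compute_discounted_returns` without defining it. One plausible implementation, assuming `rewards` is a plain list of per-step rewards from a single episode, walks the episode backwards and accumulates the discounted sum:

```python
def compute_discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Iterate from the last step to the first so each return reuses the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


example = compute_discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# example == [1.5, 1.0, 2.0]
```

Because every return sums many random rewards, a single lucky or unlucky step late in the episode shifts the returns of all earlier steps, which is one way to see where the variance comes from.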

Baselines reduce variance

Subtract a baseline b(s) from the return:

∇J(θ) = E[∇ log π(a|s; θ) · (R - b(s))]

The baseline does not change the expected gradient, provided it depends only on the state and not on the action, so the estimate stays unbiased, but it can substantially reduce variance. The most common baseline is the value function V(s), an estimate of how good the state is on average. The difference R - V(s) is an estimate of the advantage: how much better this action turned out than the policy's average from that state.
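
Both claims can be checked exactly on a toy problem. The sketch below assumes a single-state, three-action bandit with deterministic rewards and a softmax policy (all numbers are made up for illustration). It enumerates every action to compute the exact mean and variance of the gradient estimator with and without a baseline equal to the expected reward.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_stats(logits, rewards, baseline):
    """Exact mean and total variance of g(a) = grad log pi(a) * (R(a) - b), a ~ pi."""
    probs = softmax(logits)
    n = len(logits)
    samples = []
    for a in range(n):
        # Score function of a softmax policy: one_hot(a) - pi.
        score = [(1.0 if i == a else 0.0) - probs[i] for i in range(n)]
        samples.append([s * (rewards[a] - baseline) for s in score])
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    var = sum(
        probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, var


logits = [0.2, -0.1, 0.4]
rewards = [10.0, 11.0, 12.0]  # assumed deterministic reward per action
value = sum(p * r for p, r in zip(softmax(logits), rewards))  # plays the role of V(s)

mean_plain, var_plain = gradient_stats(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(logits, rewards, baseline=value)
```

The two means agree up to floating-point error, while the variance with the baseline is far smaller, because the estimator now scales with deviations from the average reward rather than with the raw rewards.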

Actor-Critic

Instead of waiting for a full episode (like REINFORCE), actor-critic methods update at every step using:

  • Actor — the policy network π(a|s; θ). Updated with policy gradients.
  • Critic — the value network V(s; w). Updated to predict expected returns.

The advantage is estimated as:

A(s, a) ≈ r + γV(s') - V(s)

This TD advantage has lower variance than full returns (at the cost of some bias from the imperfect critic). The result is faster, more stable learning.
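
A minimal sketch of this one-step estimate follows. The `done` flag is an added assumption, not something stated above: it follows the usual convention that a terminal state has value zero, so no bootstrapping happens on the last step of an episode.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD advantage estimate: r + gamma * V(s') - V(s)."""
    # Assumed convention: do not bootstrap from V(s') when the episode has ended.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# 1.0 + 0.5 * 3.0 - 2.0 = 0.5
adv = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5)
# Terminal step: 1.0 + 0.0 - 2.0 = -1.0
adv_terminal = td_advantage(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5, done=True)
```

Only one reward and two critic predictions enter the estimate, which is why it is less noisy than a full return and also why errors in the critic show up as bias.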

PPO: the practical standard

Proximal Policy Optimisation (PPO) is one of the most widely used policy gradient algorithms. It adds two key ideas:

  1. Clipped objective — limits how much the policy can change in one update, preventing destructive large steps:

    L = min(ratio · A, clip(ratio, 1-ε, 1+ε) · A)

    where ratio = π_new(a|s) / π_old(a|s) and ε is typically 0.2.

  2. Multiple epochs — reuses collected data for several gradient steps (unlike vanilla policy gradient which uses each batch once).
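
The clipped objective from item 1 can be sketched for a single sample as follows. The function name and the use of log-probabilities as inputs are illustrative assumptions; real implementations average this quantity over a batch and maximise it, usually by minimising its negative.

```python
import math


def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


# Good action, ratio already 1.5: the objective is capped at 1.2 * A,
# so there is no incentive to push the ratio further.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# Bad action whose probability went up (ratio 1.5): the unclipped, more
# pessimistic term is kept, so the penalty is not softened.
penalised = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

Taking the minimum makes the surrogate a pessimistic bound: clipping only ever removes the reward for moving further from the old policy, never the penalty.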

PPO ships with Stable-Baselines3 and is a common first choice in RL applications because it is relatively robust to hyperparameter choices and works on both discrete and continuous action spaces.

Algorithm family overview

Algorithm | Type                     | Key idea                      | Typical use
REINFORCE | On-policy                | Full episode returns          | Teaching, simple tasks
A2C       | On-policy, actor-critic  | Synchronous advantage updates | Fast baseline
PPO       | On-policy, actor-critic  | Clipped ratio objective       | General purpose
TRPO      | On-policy, actor-critic  | KL-divergence constraint      | When precision matters
SAC       | Off-policy, actor-critic | Entropy-regularised objective | Continuous control

Common misconception

A common belief is that policy gradient methods are simply “better” than value methods. In reality, they are complementary. Pure policy gradients have high variance; pure value methods struggle with continuous actions. Many widely used modern algorithms (PPO, SAC) combine both approaches via actor-critic architectures.

The one thing to remember: Policy gradients directly adjust the probability of actions based on how well they worked — increase what won, decrease what lost — and the clipped ratio in PPO keeps those adjustments safe.
