Reinforcement Learning — Core Concepts

MDPs, Q-learning, policy gradient methods, value functions, and how DQN and PPO revolutionized RL by combining deep learning with classic RL algorithms.

Markov Decision Processes

Reinforcement learning is formally defined through Markov Decision Processes (MDPs). An MDP consists of:

State space $\mathcal{S}$: All possible states the environment can be in
Action space $\mathcal{A}$: Actions the agent can take
Transition function $P(s’ | s, a)$: Probability of transitioning to state $s’$ given current state $s$ and action $a$
Reward function $R(s, a, s’)$: Reward received after taking action $a$ in state $s$ and transitioning to $s’$
Discount factor $\gamma \in [0, 1]$: How much future rewards are worth relative to immediate rewards

The agent’s goal: find a policy $\pi(a|s)$ (mapping states to actions) that maximizes expected discounted return:

$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$

The Markov assumption: the next state depends only on the current state and action, not the full history.

Value Functions

State value function: $V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$ — expected return starting from state $s$ following policy $\pi$.

Action value function (Q-function): $Q^\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$ — expected return starting from state $s$, taking action $a$, then following policy $\pi$.

The Bellman equations decompose these recursively:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma V^\pi(s’)]$$

$$Q^\pi(s,a) = \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma \sum_{a’} \pi(a’|s’) Q^\pi(s’,a’)]$$

The optimal Q-function $Q^(s,a) = \max_\pi Q^\pi(s,a)$ is the maximum achievable return. The optimal policy simply takes the action with highest $Q^$ in each state.

Q-Learning: Off-Policy Value Learning

Q-learning (Watkins, 1989) directly learns $Q^*(s,a)$ from experience without knowing the transition model:

$$Q(s,a) \leftarrow Q(s,a) + \alpha [R + \gamma \max_{a’} Q(s’, a’) - Q(s,a)]$$

The term in brackets is the TD error — difference between current Q-estimate and the target (immediate reward + discounted future Q-value). Q-learning converges to $Q^*$ under mild conditions (all state-action pairs visited infinitely often, appropriate learning rate).

DQN (Deep Q-Network, Mnih et al., 2015): Replace the Q-table with a deep neural network $Q(s, a; \theta)$. Used a convolutional neural network taking raw Atari game frames as input.

Key innovations that made DQN work:

Experience replay: Store transitions $(s, a, r, s’)$ in a replay buffer, sample random mini-batches. Breaks temporal correlations in sequential experience.
Target network: A separate, slowly-updated copy of the Q-network provides stable training targets. Update $\theta_{target} \leftarrow \theta_{online}$ every $C$ steps.

DQN achieved human-level performance on 49 Atari games using pixel inputs and no game-specific features — a landmark result.

Policy Gradient Methods

Q-learning learns a value function and derives a policy from it. Policy gradient methods directly optimize the policy:

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)]$$

The policy gradient theorem states: the gradient of expected return with respect to policy parameters points in the direction that increases probability of actions that received high Q-values.

REINFORCE: The simplest policy gradient algorithm. Sample trajectories, compute returns, update policy to make high-return actions more likely:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t$$

Actor-Critic methods: Use a value function (critic) to reduce gradient variance. The critic estimates $V^\pi(s)$ or $Q^\pi(s,a)$; the actor uses these estimates as baselines.

Advantage function: $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ — how much better is action $a$ than the average action in state $s$. Using $A$ instead of $Q$ reduces variance while keeping the gradient direction correct.

PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) is the most widely used policy gradient algorithm in practice — used for RLHF training of GPT and Claude.

The core problem with policy gradient: large policy updates can be destabilizing (if you move too far in parameter space, performance can collapse). PPO constrains updates to stay close to the old policy:

$$L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t)]$$

Where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the probability ratio.

The clipping prevents the ratio from moving too far from 1 (either too much increase or too much decrease in action probability), providing stable training. $\epsilon = 0.2$ is the standard choice.

PPO is preferred over TRPO (its predecessor) because: simpler implementation, fewer hyperparameters, better empirical performance.

One thing to remember: RL’s core insight is that reward signals can guide learning toward complex behaviors without explicit programming — but deriving the right reward function and providing sufficient experience for learning remain the central challenges.

reinforcement-learningmdpq-learningpolicy-gradientdqnppo

Reinforcement Learning — Core Concepts

Markov Decision Processes

Value Functions

Q-Learning: Off-Policy Value Learning

Policy Gradient Methods

PPO: Proximal Policy Optimization

See Also

Related Topics