Reinforcement Learning — Core Concepts

Markov Decision Processes

Reinforcement learning is formally defined through Markov Decision Processes (MDPs). An MDP consists of:

  • State space $\mathcal{S}$: All possible states the environment can be in
  • Action space $\mathcal{A}$: Actions the agent can take
  • Transition function $P(s’ | s, a)$: Probability of transitioning to state $s’$ given current state $s$ and action $a$
  • Reward function $R(s, a, s’)$: Reward received after taking action $a$ in state $s$ and transitioning to $s’$
  • Discount factor $\gamma \in [0, 1]$: How much future rewards are worth relative to immediate rewards

The agent’s goal: find a policy $\pi(a|s)$ (mapping states to actions) that maximizes expected discounted return:

$$G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$

The Markov assumption: the next state depends only on the current state and action, not the full history.

Value Functions

State value function: $V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$ — expected return starting from state $s$ following policy $\pi$.

Action value function (Q-function): $Q^\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$ — expected return starting from state $s$, taking action $a$, then following policy $\pi$.

The Bellman equations decompose these recursively:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma V^\pi(s’)]$$

$$Q^\pi(s,a) = \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma \sum_{a’} \pi(a’|s’) Q^\pi(s’,a’)]$$

The optimal Q-function $Q^(s,a) = \max_\pi Q^\pi(s,a)$ is the maximum achievable return. The optimal policy simply takes the action with highest $Q^$ in each state.

Q-Learning: Off-Policy Value Learning

Q-learning (Watkins, 1989) directly learns $Q^*(s,a)$ from experience without knowing the transition model:

$$Q(s,a) \leftarrow Q(s,a) + \alpha [R + \gamma \max_{a’} Q(s’, a’) - Q(s,a)]$$

The term in brackets is the TD error — difference between current Q-estimate and the target (immediate reward + discounted future Q-value). Q-learning converges to $Q^*$ under mild conditions (all state-action pairs visited infinitely often, appropriate learning rate).

DQN (Deep Q-Network, Mnih et al., 2015): Replace the Q-table with a deep neural network $Q(s, a; \theta)$. Used a convolutional neural network taking raw Atari game frames as input.

Key innovations that made DQN work:

  • Experience replay: Store transitions $(s, a, r, s’)$ in a replay buffer, sample random mini-batches. Breaks temporal correlations in sequential experience.
  • Target network: A separate, slowly-updated copy of the Q-network provides stable training targets. Update $\theta_{target} \leftarrow \theta_{online}$ every $C$ steps.

DQN achieved human-level performance on 49 Atari games using pixel inputs and no game-specific features — a landmark result.

Policy Gradient Methods

Q-learning learns a value function and derives a policy from it. Policy gradient methods directly optimize the policy:

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)]$$

The policy gradient theorem states: the gradient of expected return with respect to policy parameters points in the direction that increases probability of actions that received high Q-values.

REINFORCE: The simplest policy gradient algorithm. Sample trajectories, compute returns, update policy to make high-return actions more likely:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t$$

Actor-Critic methods: Use a value function (critic) to reduce gradient variance. The critic estimates $V^\pi(s)$ or $Q^\pi(s,a)$; the actor uses these estimates as baselines.

Advantage function: $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ — how much better is action $a$ than the average action in state $s$. Using $A$ instead of $Q$ reduces variance while keeping the gradient direction correct.

PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) is the most widely used policy gradient algorithm in practice — used for RLHF training of GPT and Claude.

The core problem with policy gradient: large policy updates can be destabilizing (if you move too far in parameter space, performance can collapse). PPO constrains updates to stay close to the old policy:

$$L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t)]$$

Where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the probability ratio.

The clipping prevents the ratio from moving too far from 1 (either too much increase or too much decrease in action probability), providing stable training. $\epsilon = 0.2$ is the standard choice.

PPO is preferred over TRPO (its predecessor) because: simpler implementation, fewer hyperparameters, better empirical performance.

One thing to remember: RL’s core insight is that reward signals can guide learning toward complex behaviors without explicit programming — but deriving the right reward function and providing sufficient experience for learning remain the central challenges.

reinforcement-learningmdpq-learningpolicy-gradientdqnppo

See Also

  • Contrastive Learning How AI learns what things are like each other — and what they're not — without any labels, creating the representations behind image search and face recognition.
  • Data Augmentation How AI systems make do with less data by creating variations of what they have — the training trick that prevented ImageNet models from memorizing training examples.
  • Few Shot Learning How AI learned to learn from just a handful of examples — the technique that lets AI generalize like humans instead of needing millions of training samples.
  • Lora Fine Tuning How AI companies adapt massive models to specific tasks by training only a tiny fraction of the parameters — the technique making custom AI affordable.
  • Self Supervised Learning How AI learned to teach itself from unlabeled data — the technique that let GPT and BERT learn from the entire internet without any human labeling.