Reinforcement Learning — Deep Dive

Bellman Optimality and Dynamic Programming

The optimal value functions satisfy the Bellman optimality equations:

$$V^(s) = \max_a \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma V^(s’)]$$

$$Q^(s,a) = \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma \max_{a’} Q^(s’,a’)]$$

Value iteration: Iteratively apply the Bellman optimality operator until convergence:

$$V_{k+1}(s) = \max_a \sum_{s’} P(s’|s,a)[R(s,a,s’) + \gamma V_k(s’)]$$

This converges to $V^*$ for any initial $V_0$ because the Bellman operator is a contraction mapping (in $\ell_\infty$ norm with factor $\gamma$). Policy iteration alternates between policy evaluation (solving the system of equations for a fixed policy) and policy improvement (greedy update). Both are exact only when the full transition model $P$ is known.

Model-free temporal difference: When $P$ is unknown (the realistic case), TD methods estimate value functions from experience:

$$V(s_t) \leftarrow V(s_t) + \alpha \underbrace{[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]}_{\text{TD error}}$$

The TD error bootstraps — uses current estimates of future value, not complete trajectories. This introduces bias but reduces variance compared to Monte Carlo (which uses complete trajectory returns).

Rainbow DQN: Combining Improvements

van Hasselt et al. (2015) Double DQN, Schaul et al. (2016) Prioritized Experience Replay, and other improvements were combined in Hessel et al. (2018) “Rainbow: Combining Improvements in Deep Reinforcement Learning.”

Double DQN: Standard DQN overestimates Q-values because the same network selects and evaluates actions. Double DQN separates selection and evaluation: $$Y_t = R_{t+1} + \gamma Q(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta); \theta^-)$$

Online network $\theta$ selects actions; target network $\theta^-$ evaluates them.

Prioritized Experience Replay: Sample transitions with probability proportional to their TD error magnitude. Higher TD error → larger potential learning signal → higher sample priority.

Dueling Networks (Wang et al., 2016): Separate Q into value and advantage streams: $$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|}\sum_{a’} A(s,a’;\theta,\alpha)$$

Better generalization when actions don’t affect the value much.

Multi-step returns: $n$-step TD provides better bias-variance tradeoff than 1-step. Use $n=3$ to balance.

Distributional RL (Bellemare et al., 2017): Learn the full distribution $Z(s,a)$ over returns, not just the expectation. Better generalization and more stable training.

Noisy Networks (Fortunato et al., 2017): Replace $\epsilon$-greedy exploration with stochastic weights. The network itself has noise that provides state-dependent exploration.

Rainbow combined all six: +34% over the previous best on 57 Atari games, with significantly better sample efficiency.

Soft Actor-Critic: Maximum Entropy RL

Haarnoja et al. (2018) “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning” addressed the challenges of continuous action spaces.

Maximum entropy RL: Augment the reward with an entropy bonus: $$J(\pi) = \sum_t \mathbb{E}[R(s_t, a_t) + \alpha H(\pi(\cdot|s_t))]$$

This encourages exploration (high entropy = diverse actions) and provides a principled exploration-exploitation tradeoff. $\alpha$ is the temperature parameter controlling entropy weight.

SAC algorithm:

  • Actor: Maximizes entropy-augmented return
  • Critic: Two Q-networks (to reduce overestimation), soft Bellman update
  • Temperature: Learned automatically to target a desired entropy level

SAC is:

  • Off-policy: Samples from a replay buffer (sample efficient)
  • Continuous actions: Uses reparameterization trick through squashed Gaussian policy
  • Automatic temperature tuning: $\alpha$ is adjusted to target a desired entropy $\mathcal{H}^* = -|\mathcal{A}|$

On continuous control benchmarks (HalfCheetah, Ant, Humanoid in MuJoCo physics simulation), SAC achieves state-of-the-art with fewer environment interactions than on-policy methods.

AlphaZero: MCTS + Self-Play + Deep Networks

Silver et al. (DeepMind, 2017) “Mastering Chess and Shogi by Self-Play” showed that a single algorithm could master multiple games by combining Monte Carlo Tree Search with neural network self-play.

Components:

  • Policy network $p(s)$: Predicts move probabilities from board state
  • Value network $v(s)$: Predicts probability of winning from board state
  • MCTS: Uses $p$ and $v$ as heuristics to guide tree search

MCTS in AlphaZero: For each node in the search tree, select action using: $$a = \arg\max_a [Q(s,a) + c_{puct} P(s,a) \frac{\sqrt{N(s)}}{1 + N(s,a)}]$$

Where $Q(s,a)$ is the mean value of subtree, $P(s,a) = p_\theta(a|s)$ is prior policy, and $c_{puct}$ balances exploration. Evaluate leaf nodes with $v_\theta$. Backup values through the tree.

Self-play training: AlphaZero trains entirely by playing against itself. No human games, no domain knowledge (just the rules). Each training step:

  1. Play $N$ games of self-play using MCTS
  2. Use game outcomes as training targets for $v$
  3. Use MCTS move probabilities as training targets for $p$
  4. Update $(\theta_p, \theta_v)$ on these targets
  5. Repeat

Starting from random play, AlphaZero learned superhuman chess in 4 hours and superhuman Go in 8 hours. The self-play loop continuously generates training data at the frontier of the current capability.

Model-Based RL: Learning the Environment

Model-free RL (DQN, PPO, SAC) learns directly from environment interactions. Model-based RL first learns a model of the environment $\hat{P}(s’|s,a)$, then uses this model to plan:

Dyna-Q (Sutton, 1990): Combine real experience with simulated experience from the learned model. Each real step generates $n$ simulated steps from the model — improving sample efficiency.

MBPO (Janner et al., 2019): Model-Based Policy Optimization. Learn an ensemble of transition models (handles model uncertainty), generate short rollouts from these models, train a policy on the mixture of real and simulated experience.

DreamerV3 (Hafner et al., 2023): World model + latent dynamics + actor-critic in latent space. Train the world model to predict in a compact learned latent space; plan and train the policy entirely in the world model. Achieved state-of-the-art across diverse benchmarks (Atari, continuous control, 3D environments) with a single set of hyperparameters.

One thing to remember: The field of RL is much larger than its most famous applications — beneath AlphaGo, ChatGPT’s RLHF, and robot locomotion lies a rich mathematical framework for any sequential decision-making problem, and the gap between “RL in simulation” and “RL in the real world” remains one of the field’s defining challenges.

reinforcement-learningrainbow-dqnsacmodel-based-rlalphazeromcts

See Also

  • Contrastive Learning How AI learns what things are like each other — and what they're not — without any labels, creating the representations behind image search and face recognition.
  • Data Augmentation How AI systems make do with less data by creating variations of what they have — the training trick that prevented ImageNet models from memorizing training examples.
  • Few Shot Learning How AI learned to learn from just a handful of examples — the technique that lets AI generalize like humans instead of needing millions of training samples.
  • Lora Fine Tuning How AI companies adapt massive models to specific tasks by training only a tiny fraction of the parameters — the technique making custom AI affordable.
  • Self Supervised Learning How AI learned to teach itself from unlabeled data — the technique that let GPT and BERT learn from the entire internet without any human labeling.