Reinforcement Learning — Deep Dive

MDPs, Q-learning, policy gradients, RLHF — the full technical picture of how RL actually works and where it breaks.

Reinforcement learning has a deceptively clean mathematical foundation and an extremely messy practical reality. This is a field where the theory predicts convergence to optimal solutions, and the practice produces agents that discover bugs in the reward function you spent three weeks designing.

The Formal Framework: Markov Decision Processes

RL is almost always formalized as a Markov Decision Process (MDP), defined by the tuple (S, A, P, R, γ):

S — state space (all possible situations the agent can be in)
A — action space (all moves the agent can make)
P(s’|s, a) — transition probability (given state s and action a, probability of landing in state s’)
R(s, a, s’) — reward function (the scalar signal the agent receives)
γ (gamma) — discount factor ∈ [0,1] (how much future rewards are worth)

The agent seeks a policy π(a|s) — a probability distribution over actions given states — that maximizes expected cumulative discounted reward:

E_π[ Σ_{t=0}^{∞} γ^t R_t ]

The Markov property is the key assumption: the future depends only on the current state, not the full history. This makes the math tractable. In practice, many real-world problems aren’t truly Markovian (stock prices depend on history; a poker hand depends on what cards have been played), which is why state representation engineering matters enormously.
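To make the tuple concrete, here is a minimal Python sketch of an MDP as a data structure, plus the discounted return from the objective above. The `MDP` class, the two-state "cold/warm" example, and every number in it are invented for illustration; none of it comes from a particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    """Container for the tuple (S, A, P, R, gamma)."""
    states: tuple
    actions: tuple
    # transitions[(s, a)] is a list of (probability, next_state, reward) triples,
    # which folds P(s'|s, a) and R(s, a, s') into one table.
    transitions: dict
    gamma: float


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical toy problem (assumption, not from the article).
toy = MDP(
    states=("cold", "warm"),
    actions=("wait", "heat"),
    transitions={
        ("cold", "wait"): [(1.0, "cold", 0.0)],
        ("cold", "heat"): [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
        ("warm", "wait"): [(0.5, "warm", 1.0), (0.5, "cold", 0.0)],
        ("warm", "heat"): [(1.0, "warm", 0.5)],
    },
    gamma=0.9,
)
```

Because the next-state distribution is looked up from the current state and action alone, the Markov property is built into the table: there is nowhere to put history unless you fold it into the state.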

Value Functions

Rather than learning a policy directly, many RL algorithms learn value functions — estimates of how good it is to be in a state (or to take an action in a state).

State value function:

V^π(s) = E[Σ γ^t R_t | s_0 = s, policy π]

Action-value function (Q-function):

Q^π(s, a) = E[Σ γ^t R_t | s_0 = s, a_0 = a, policy π]

The Bellman optimality equation links the optimal value of a state to the optimal values of its successors:

V^*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^*(s')]

This recursive relationship is what makes dynamic programming approaches possible. The optimal policy π* simply picks whichever action maximizes Q*(s, a) at each state.
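The dynamic programming idea can be sketched as value iteration: apply the Bellman backup repeatedly until the values stop changing. This is a minimal sketch that assumes a small tabular problem with γ < 1; the `STATES`, `ACTIONS`, and `TRANSITIONS` tables are a made-up example, and the function names are hypothetical.

```python
def q_value(V, transitions, s, a, gamma):
    """One-step lookahead: expected reward plus discounted successor value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])


def value_iteration(states, actions, transitions, gamma, tol=1e-8):
    """Repeat the Bellman optimality backup until no value moves by tol or more."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(V, transitions, s, a, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(V, states, actions, transitions, gamma):
    """Pick the action with the highest one-step lookahead value in each state."""
    return {
        s: max(actions, key=lambda a: q_value(V, transitions, s, a, gamma))
        for s in states
    }


# Made-up example: "stay" pays 0.5 forever, "go" pays 1.0 once and ends.
# Each entry is a list of (probability, next_state, reward) triples.
STATES = ("start", "end")
ACTIONS = ("stay", "go")
TRANSITIONS = {
    ("start", "stay"): [(1.0, "start", 0.5)],
    ("start", "go"): [(1.0, "end", 1.0)],
    ("end", "stay"): [(1.0, "end", 0.0)],
    ("end", "go"): [(1.0, "end", 0.0)],
}
```

With γ = 0.9, collecting 0.5 forever is worth 0.5 / (1 - 0.9) = 5, so the greedy policy stays; with γ = 0.4 the same stream is worth less than the one-off reward of 1, so it goes. The discount factor alone flips the optimal policy.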

Q-Learning and Deep Q-Networks

Q-learning (Watkins, 1989) learns Q-values by iterative updates:

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]

The term in brackets is the temporal difference (TD) error — the difference between the current Q estimate and the “target” (actual reward plus discounted future value). This is off-policy: the update uses the greedy max action for the next state, regardless of what the agent actually did.
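Here is a minimal tabular sketch of that update rule, assuming a made-up five-state chain where the only reward is at the right end. The environment, the constants, and the function names are all hypothetical. The behaviour policy is uniformly random on purpose, to show the off-policy property: the agent acts randomly, yet the table converges toward the values of the greedy policy.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain: reward 1.0 only on reaching the right end."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice((LEFT, RIGHT))      # behaviour policy: uniformly random
            s2, r, done = step(s, a)
            # Target uses the greedy max over next actions (no bootstrap at terminal).
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])   # alpha times the TD error
            s = s2
    return Q
```

In this chain the learned Q[s][RIGHT] should approach 0.9^(3 - s), and moving right should beat moving left in every non-terminal state even though the agent never followed that policy while learning.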

For small state spaces, you can represent Q as a lookup table. For anything real-world — Atari games, Go boards, robot sensor readings — the state space is astronomically large. This is where Deep Q-Networks (DQN) come in.

DeepMind’s 2013/2015 DQN paper used a convolutional neural network to approximate Q(s, a), taking raw pixel frames as input. Two key tricks made this stable:

Experience replay: store past transitions (s, a, r, s’) in a buffer; sample random minibatches for training. This breaks temporal correlations that would destabilize neural network training.
Target network: use a separate, periodically-updated copy of the network to compute targets. Without this, you’re chasing a moving target, which causes oscillations.
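Both tricks are small in code. The sketch below uses only the standard library and stands in for the real thing: `ReplayBuffer`, `td_targets`, and `sync_target` are hypothetical names, `target_q` is assumed to be any callable that returns a list of Q-values for a state, and the "weights" are a plain dict rather than a neural network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma):
    """r + gamma * max_a' Q_target(s', a'), computed from the frozen copy only."""
    return [
        r if done else r + gamma * max(target_q(s_next))
        for (_, _, r, s_next, done) in batch
    ]


def sync_target(online_params, target_params, step, sync_every):
    """Overwrite the target copy with the online weights every sync_every steps."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(online_params)
```

Between syncs the targets are computed from weights that do not move, so the online network is regressing toward a fixed goal for a while instead of one that shifts with every gradient step.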

DQN was evaluated on 49 Atari games with a single architecture and a single set of hyperparameters, and reached human-level scores or better on more than half of them. Prior approaches required game-specific tuning.

Policy Gradient Methods

Q-learning learns value functions and derives a policy implicitly (pick the action with highest Q). Policy gradient methods optimize the policy directly.

The fundamental result (Policy Gradient Theorem, Sutton et al. 1999):

∇_θ J(θ) = E[∇_θ log π_θ(a|s) · Q^π(s,a)]

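In plain terms: make actions more likely in proportion to how good their returns turn out to be. Once a baseline is subtracted from the return, this becomes: increase the probability of actions that led to higher-than-expected returns, decrease the probability of actions that led to lower-than-expected returns.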

The simplest implementation is REINFORCE: run an episode, compute returns, update policy weights. This works but has very high variance — the returns from a single trajectory are noisy.
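As a minimal sketch, here is REINFORCE on the smallest possible problem: a two-armed bandit where each "episode" is a single pull. The softmax policy, the arm rewards, and the learning rate are assumptions made up for this example. For a softmax over preferences θ, the gradient of log π(a) with respect to θ_k is 1[k = a] - π(k), which is what the inner loop applies.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)                       # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards=(0.0, 1.0), steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)     # one preference per arm
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]   # sample from the policy
        G = arm_rewards[a]               # return of this one-step episode
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * G * grad_log_pi    # ascend G * grad log pi(a)
    return softmax(theta)
```

Rewards here are deterministic, so the variance problem is invisible. Replace `arm_rewards[a]` with a noisy sample and the same update starts to jitter, which is the issue the methods below are built to reduce.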

Modern policy gradient methods address this:

Actor-Critic: use a learned value function (the “critic”) as a baseline to reduce variance while keeping bias low
PPO (Proximal Policy Optimization, 2017): OpenAI’s workhorse algorithm; clips the probability ratio between the new and old policy inside the objective, so a single update cannot change the policy catastrophically. Surprisingly simple and robust across a huge range of tasks
SAC (Soft Actor-Critic, 2018): adds entropy regularization — the agent is explicitly rewarded for maintaining uncertainty, which improves exploration and prevents premature convergence
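The clipping in PPO from the list above fits in a few lines. This is a sketch of the per-sample clipped surrogate objective only, not a full training loop; the function name and the default `clip_eps=0.2` are choices made for this example.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one (state, action) sample; higher is better."""
    ratio = math.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gets no extra credit for moving the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + clip_eps, and for a negative advantage it stops growing once the ratio drops below 1 - clip_eps. In both cases there is no gradient pushing the policy further away from the one that collected the data.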

PPO is what OpenAI used to train the Dota 2 bot that defeated professional players in 2019. The agent trained for the equivalent of ~45,000 years of gameplay in 10 months of real time, distributed across thousands of CPUs.

The Exploration Problem in Depth

ε-greedy exploration (random action with probability ε) is adequate for small problems. For complex environments with sparse rewards, it fails completely.
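For reference, ε-greedy itself is only a few lines. This is a minimal sketch; `epsilon_greedy` is a name made up for this example and `q_values` is assumed to be a list with one estimate per action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

The random branch knows nothing about where the agent has already been, which is exactly why it struggles when reward only appears after a long, specific sequence of actions.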

Consider: teach an RL agent to explore a large maze and find a key that opens a door. With random exploration, the agent might never stumble on the key in a reasonable timeframe — the chance of randomly hitting the right sequence of actions is vanishingly small.

Solutions that actually work:

Intrinsic motivation / curiosity-driven exploration: reward the agent for encountering novel states. Pathak et al. (2017) used the agent’s prediction error about the next state as an intrinsic reward: states where the agent’s model is wrong are “surprising” and therefore worth exploring. This works until the environment contains a source of unpredictable change, such as an in-game TV showing random images, which supplies endless novelty for free. The agent gets “addicted” to the TV. This is real; it happened in experiments, and it is known as the noisy-TV problem.

Count-based exploration: maintain approximate counts of state visits; give bonuses for undervisited states. Works in theory; memory and approximation get messy in continuous spaces.

Go-Explore (Ecoffet et al., 2019): remember “interesting” states explicitly, then return to them and explore forward. Solved Montezuma’s Revenge (famously difficult for exploration) with superhuman performance. Conceptually simple; embarrassingly effective.
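Of the approaches above, count-based exploration is the easiest to sketch. The version below assumes states are hashable (for continuous observations you would discretize or hash them first, which is where the mess starts) and uses a bonus of β / sqrt(N(s)), a common choice; `CountBonus` and its default `beta` are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Exploration bonus that shrinks as a state is visited more often."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = Counter()

    def __call__(self, state):
        self.counts[state] += 1                       # record this visit
        return self.beta / math.sqrt(self.counts[state])
```

In training, the bonus is added to the environment reward before the usual update, for example `r_total = r + bonus(s_next)`, so rarely visited states look temporarily more valuable than they are.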

RLHF: How ChatGPT Gets Its Personality

Reinforcement Learning from Human Feedback is the technique that turned GPT-3-class base models (technically capable but difficult to steer) into assistants like ChatGPT (tuned toward human preferences).

The pipeline:

Supervised fine-tuning: start with a pre-trained LLM; fine-tune on curated human demonstrations of good responses
Reward model training: show human raters pairs of model responses; train a separate neural network to predict human preference scores
RL optimization: use PPO to fine-tune the LLM against the learned reward model; add a KL divergence penalty to prevent the model from drifting too far from the SFT checkpoint (prevents reward hacking)
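The reward model step in the pipeline above is usually trained with a pairwise (Bradley-Terry style) loss: the model should score the response the rater preferred above the one they rejected. A minimal sketch on scalar scores, with hypothetical function names:

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-probability that the chosen response beats the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)
```

Only the difference between the two scores matters, so the reward model learns a ranking rather than an absolute scale. That is one reason its raw outputs are hard to interpret on their own.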

The KL penalty is critical. Without it, the model learns to produce outputs that exploit the reward model’s weaknesses — generating coherent-looking nonsense that scores high on the learned metric but doesn’t actually match human intent.
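One common way to apply the penalty is to subtract a scaled estimate of the KL divergence from the reward model score before handing it to the RL algorithm. The sketch below assumes per-token log-probabilities of the sampled response under both the current policy and the frozen SFT reference; the function name and argument layout are made up, and real implementations often apply the penalty per token rather than per sequence.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta):
    """Reward model score minus beta times a sampled KL estimate.

    logp_policy / logp_ref: per-token log-probs of the same sampled response
    under the policy being trained and under the frozen reference model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate
```

When the policy assigns the response the same probability as the reference, the penalty is zero. The more the policy inflates the probability of a response relative to the reference, the more of the reward model score is taken away.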

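Direct Preference Optimization (DPO, 2023) is a newer approach that skips the explicit RL step entirely, reformulating the RLHF objective as a supervised learning problem over preference pairs. It targets the same KL-regularized objective with much simpler training, though whether it matches PPO-based RLHF in practice depends on the setup. Meta’s Llama 3 models, for example, report using DPO in post-training.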
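The DPO loss for a single preference pair is short enough to write out. This sketch works on sequence-level log-probabilities passed in as plain floats; the function names and the default `beta=0.1` are assumptions for illustration, and a real implementation would compute these log-probabilities from the model and batch the whole thing.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more likely each response is under the policy than the reference.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

There is no reward model and no sampling in the loop: the frozen reference model plays the role of the KL anchor, and β controls how far the policy is allowed to move from it.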

Real Failure Modes Engineers Hit

Reward shaping gone wrong: adding intermediate rewards to help the agent learn faster often creates unintended shortcuts. An agent trained to reach a goal flag in a 3D environment learned to spin rapidly next to the flag — the proximity reward accumulated faster than actual movement toward it.

Sim-to-real gap: train a robot in simulation, deploy in the real world, watch it fail. Simulation physics are approximate; real-world sensor noise is different; materials don’t behave identically. OpenAI’s Dactyl hand (Rubik’s cube solving robot) required massive domain randomization — randomizing every parameter of the simulation (friction, mass, lighting) during training — to make the policy robust enough to transfer.
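Domain randomization is mostly bookkeeping: draw a fresh set of physics and rendering parameters for every episode so the policy never sees the same simulator twice. The parameter names and ranges below are invented for illustration and are not Dactyl's actual settings.

```python
import random

# Hypothetical parameter ranges, chosen only for illustration.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
}


def sample_sim_params(rng, ranges=PARAM_RANGES):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def randomized_episodes(n_episodes, seed=0):
    """Yield a fresh parameter set for each training episode."""
    rng = random.Random(seed)
    for _ in range(n_episodes):
        yield sample_sim_params(rng)
```

Each yielded dict would be passed to whatever function resets your simulator. The hope is that the real world ends up looking like just one more sample from the training distribution.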

Catastrophic forgetting: when you train on new tasks, RL agents tend to forget old ones. The neural network weights that encoded old knowledge get overwritten. This is an open research problem; EWC (Elastic Weight Consolidation) is one partial solution.
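The core of EWC is a quadratic penalty that anchors each weight to its old value, scaled by how important that weight was for the old task (estimated with the diagonal of the Fisher information). A minimal sketch on flat lists of floats, with hypothetical function names:

```python
def ewc_penalty(params, old_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, fisher)
    )


def total_loss(new_task_loss, params, old_params, fisher, lam):
    """New-task loss plus the penalty for drifting from important old weights."""
    return new_task_loss + ewc_penalty(params, old_params, fisher, lam)
```

Weights with a large Fisher value are expensive to move, and weights with a value near zero are free to be repurposed. The solution is partial because the penalty is only a local, diagonal approximation of what the old task needed.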

Training instability: deep RL is notoriously brittle. The same hyperparameters that work in one environment fail in another. Reported results often don’t reproduce across different random seeds. Henderson et al. (2018) showed that changing nothing but the random seeds can produce significantly different learning curves for the same algorithm and hyperparameters, a problem that doesn’t affect supervised learning nearly as severely.

Benchmarks Worth Knowing

Atari 57: the 57-game Arcade Learning Environment suite, which extends the 49 games of the original DQN evaluation; most modern algorithms have surpassed human performance on most games
MuJoCo continuous control: locomotion tasks (hopper, half-cheetah, ant); standard for policy gradient benchmarking
OpenAI Gym / Gymnasium: standardized environment interfaces; the main research ecosystem
NetHack: a procedurally generated roguelike with combinatorial observation space; still largely unsolved; tests generalization

What RL Can’t Do (Yet)

RL requires either a simulator or a massive real-world interaction budget. Learning to drive purely by trial and error would mean an unacceptable number of real crashes, which is why nearly every practical application trains in simulation first. The sim-to-real gap remains an open engineering challenge.

RL also generalizes poorly. An agent trained on one Atari game is useless on another. The AlphaGo that beat Lee Sedol cannot play chess. This is in sharp contrast to how human learning works, and it’s one of the driving motivations behind multi-task RL and research into foundation models for decision-making (like Gato, a single model trained on many tasks, including multiple games).

One thing to remember: RL’s formal elegance — MDPs, Bellman equations, policy gradient theorem — is real and useful. But deploying RL in practice means fighting reward hacking, training instability, and the exploration problem. The algorithms that work at scale (PPO, SAC, DQN) are the ones that solved these engineering problems, not just the math.
