Q-Learning Implementation — Core Concepts
What Q-Learning does
Q-Learning finds the best action for every state without needing a model of how the environment works. It is model-free and off-policy, meaning the agent can learn the optimal strategy even while following a different (exploratory) strategy.
The Q-value
Q(s, a) represents the expected total future reward if the agent takes action a in state s and then follows the optimal policy forever after. The “Q” stands for quality — a higher Q means a better action.
The Bellman equation
The optimal Q-values satisfy:
Q*(s, a) = E[r + γ · max_a' Q*(s', a')]
Translation: the quality of taking action a in state s equals the expected immediate reward r plus the discounted best quality the agent can get from the next state s'. The expectation averages over any randomness in r and s'. This recursive definition is the heart of Q-Learning.
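To make the recursion concrete, the sketch below applies the right-hand side of the equation repeatedly on a hypothetical three-state chain. The states, rewards, `TRANSITIONS` table, and `bellman_backup` helper are all illustrative assumptions. Note that this is Q-value iteration, which assumes the transition model is known; Q-Learning itself never sees that model and relies on sampled transitions instead.

```python
# Hypothetical deterministic chain: 0 -> 1 -> 2, where state 2 is terminal.
# TRANSITIONS[state][action] = (next_state, reward)
GAMMA = 0.9
TRANSITIONS = {
    0: {"stay": (0, 0.0), "right": (1, 0.0)},
    1: {"stay": (1, 0.0), "right": (2, 1.0)},
}
TERMINAL = {2}


def bellman_backup(q, state, action):
    """Return r + gamma * max_a' Q(s', a') for a deterministic transition."""
    next_state, reward = TRANSITIONS[state][action]
    if next_state in TERMINAL:
        return reward  # no future value after a terminal state
    return reward + GAMMA * max(q[next_state].values())


# Start from all zeros and sweep over every state-action pair.
q = {s: {a: 0.0 for a in acts} for s, acts in TRANSITIONS.items()}
for _ in range(50):
    q = {s: {a: bellman_backup(q, s, a) for a in acts}
         for s, acts in TRANSITIONS.items()}
```

Once the values stop changing they satisfy the Bellman equation: `q[1]["right"]` settles at 1.0, `q[0]["right"]` and `q[1]["stay"]` at 0.9, and `q[0]["stay"]` at 0.81.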
The update rule
Since we do not know Q* in advance, we approximate it iteratively:
Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') - Q(s, a)]
- α (learning rate) — how much each new experience overrides the old estimate. Typical range: 0.01 to 0.5.
- γ (discount factor) — how much the agent cares about future rewards. 0.99 means it values long-term gains; 0.5 means it is impatient.
- r — the reward received after taking action a in state s.
- The term in brackets is the TD (temporal-difference) error: the gap between the bootstrapped target r + γ · max_a' Q(s', a') and the current estimate Q(s, a).
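A single update can be worked through by hand. The sketch below uses a hypothetical helper, `q_update`, and made-up numbers purely for illustration.

```python
def q_update(q_sa, reward, best_next_q, alpha=0.1, gamma=0.99):
    """Apply one Q-Learning update and return (new_q, td_error)."""
    td_target = reward + gamma * best_next_q   # what the agent now believes
    td_error = td_target - q_sa                # target minus old estimate
    return q_sa + alpha * td_error, td_error


# Illustrative numbers: current estimate 0.5, reward 1.0, best next value 2.0.
new_q, td_error = q_update(q_sa=0.5, reward=1.0, best_next_q=2.0)
```

With these numbers the target is 1.0 + 0.99 · 2.0 = 2.98, the TD error is 2.48, and the estimate moves a tenth of the way toward the target, from 0.5 to 0.748.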
Exploration: epsilon-greedy
If the agent always picks the best-known action, it never discovers better alternatives. Epsilon-greedy balances this:
- With probability ε, pick a random action (explore).
- With probability 1 - ε, pick the action with the highest Q-value (exploit).
A common pattern is to start ε high (e.g., 1.0) and decay it over time toward a small value (e.g., 0.01). This ensures broad exploration early and focused exploitation later.
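As a minimal sketch, action selection and the decay schedule can be written as two small functions. The names `epsilon_greedy` and `decayed_epsilon` are hypothetical, and the exponential schedule is only one of several reasonable choices (linear decay is also common).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for one state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value (first one wins on ties)
    return max(range(len(q_values)), key=q_values.__getitem__)


def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.995):
    """Exponential decay per episode, never dropping below `end`."""
    return max(end, start * decay ** episode)
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible.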
Tabular Q-Learning in code
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma, epsilon = 0.1, 0.99, 1.0
epsilon_decay, epsilon_min = 0.995, 0.01

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Bootstrap from the next state only if the episode did not terminate
        best_next = np.max(q_table[next_state])
        td_error = reward + gamma * best_next * (1 - terminated) - q_table[state, action]
        q_table[state, action] += alpha * td_error
        state = next_state

    # Decay exploration once per episode
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
After training, the learned policy is simply np.argmax(q_table[state]) for any state.
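To read off the policy for all states at once, take the argmax along the action axis. The sketch below uses a hypothetical helper, `greedy_policy`, and a made-up 3-state, 2-action table rather than the trained FrozenLake table.

```python
import numpy as np


def greedy_policy(q_table):
    """Return the greedy action index for every state (row) of a Q-table."""
    return np.argmax(q_table, axis=1)


# Hypothetical table used only for illustration.
example_q = np.array([[0.2, 0.8],
                      [0.5, 0.1],
                      [0.0, 0.0]])
policy = greedy_policy(example_q)
```

Note that `np.argmax` returns the first index on ties, so a row of all zeros (a state the agent never updated) maps to action 0 by default rather than to a learned choice.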
Convergence conditions
Q-Learning is guaranteed to converge to the optimal Q-values if:
- Every state-action pair is visited infinitely often.
- The learning rate decays appropriately (sum of α = ∞, sum of α² < ∞).
- The environment is a finite MDP.
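In practice these conditions are only approximated. A constant learning rate such as the α = 0.1 used above does not satisfy the decay condition, and a finite run cannot visit every pair infinitely often. For small problems, epsilon-greedy exploration over enough episodes usually gets close enough to the optimal Q-values.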
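One step-size schedule that does satisfy both sum conditions is α = 1 / N(s, a), where N(s, a) counts how many times a pair has been updated. The sketch below is an illustration only; `visit_counts` and `next_alpha` are hypothetical names.

```python
from collections import defaultdict

visit_counts = defaultdict(int)  # (state, action) -> number of updates so far


def next_alpha(state, action):
    """Return 1 / N(s, a): the harmonic series diverges, its squares converge."""
    visit_counts[(state, action)] += 1
    return 1.0 / visit_counts[(state, action)]


# Four consecutive updates to the same pair give 1, 1/2, 1/3, 1/4.
alphas = [next_alpha(0, 1) for _ in range(4)]
```

This schedule can be slow in practice because the step size shrinks quickly, which is one reason a small constant α is often used instead.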
When tabular Q-Learning breaks down
If the state space is continuous (like joint angles) or very large (like pixel images), the Q-table becomes impractically huge. This is where Deep Q-Networks (DQN) take over — replacing the table with a neural network that approximates Q-values.
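The idea of swapping the table for a parametric function can be sketched without a deep learning library. The example below uses a linear approximator as a deliberately simpler stand-in for a neural network; it is not a DQN (there is no replay buffer or target network), and the `LinearQ` class and the feature vector are assumptions made for illustration.

```python
import numpy as np


class LinearQ:
    """Q(s, a) is approximated by weights[a] . features(s)."""

    def __init__(self, n_features, n_actions):
        self.weights = np.zeros((n_actions, n_features))

    def q_values(self, features):
        return self.weights @ features  # one value per action

    def update(self, features, action, reward, next_features, terminated,
               alpha=0.1, gamma=0.99):
        best_next = 0.0 if terminated else np.max(self.q_values(next_features))
        td_error = reward + gamma * best_next - self.q_values(features)[action]
        # Semi-gradient step: only the chosen action's weights move.
        self.weights[action] += alpha * td_error * features
        return td_error


model = LinearQ(n_features=2, n_actions=2)
state = np.array([1.0, 0.5])  # hypothetical continuous state, e.g. joint angles
err = model.update(state, action=1, reward=1.0,
                   next_features=state, terminated=True)
```

The update is the same TD rule as in the tabular case; the difference is that it adjusts shared weights, so nearby states generalize to each other instead of each having its own table cell.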
Common misconception
People often confuse Q-Learning with SARSA. Both use a similar update, but Q-Learning bootstraps from max_a' Q(s', a') (the best possible next action), while SARSA bootstraps from Q(s', a') for the action a' the agent actually takes next. This makes Q-Learning off-policy and SARSA on-policy. The practical difference: Q-Learning learns the values of the greedy policy regardless of how the agent explores, provided exploration keeps visiting every state-action pair, whereas SARSA learns the values of the exploratory policy it is actually following.
The one thing to remember: Q-Learning’s update rule (nudge the Q-value toward the reward plus the discounted best future value) is simple, provably convergent in the tabular setting under the conditions above, and the foundation of DQN and other value-based deep RL methods.
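The difference between the two algorithms comes down to one term in the target. The sketch below uses hypothetical helper names and made-up Q-values to show the two targets side by side.

```python
def q_learning_target(reward, next_q_values, gamma=0.99):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma=0.99):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * next_q_values[next_action]


next_q = [0.0, 1.0]      # illustrative Q-values for the next state
explored_action = 0      # suppose epsilon-greedy picked the worse action
ql = q_learning_target(0.0, next_q)
sa = sarsa_target(0.0, next_q, explored_action)
```

Here the Q-Learning target is 0.99 while the SARSA target is 0.0: SARSA's estimate absorbs the cost of the exploratory move, and Q-Learning's does not.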