Q-Learning Implementation — Core Concepts

The Bellman equation, epsilon-greedy exploration, and the update rule that powers tabular Q-Learning in Python

What Q-Learning does

Q-Learning finds the best action for every state without needing a model of how the environment works. It is model-free and off-policy, meaning the agent can learn the optimal strategy even while following a different (exploratory) strategy.

The Q-value

Q(s, a) is the expected discounted return (the sum of future rewards, each weighted by a power of the discount factor γ) if the agent takes action a in state s and follows the optimal policy from then on. The "Q" stands for quality: a higher Q means a better action.

The Bellman equation

The optimal Q-values satisfy:

Q*(s, a) = E[r + γ · max_a' Q*(s', a')]

Translation: the quality of taking action a in state s equals the immediate reward r plus the discounted best quality the agent can get from the next state s'. This recursive definition is the heart of Q-Learning.
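To make the recursion concrete, here is a minimal sketch that applies the right-hand side of the Bellman equation repeatedly until the values stop changing. The 3-state chain, its rewards, and the names model and bellman_backup are hypothetical and chosen only for illustration. Unlike Q-Learning, this sketch assumes the transition model is known.

```python
# Hypothetical 3-state chain: states 0 and 1 are non-terminal, state 2 is the goal.
# Actions: 0 = left, 1 = right. Transitions are deterministic, so the expectation
# in the Bellman equation collapses to a single outcome.
GAMMA = 0.9

# model[(state, action)] = (next_state, reward, terminated)
model = {
    (0, 0): (0, 0.0, False),  # moving left from state 0 bumps into the wall
    (0, 1): (1, 0.0, False),
    (1, 0): (0, 0.0, False),
    (1, 1): (2, 1.0, True),   # reaching the goal pays 1 and ends the episode
}

q = {key: 0.0 for key in model}


def bellman_backup(q):
    """Apply r + gamma * max_a' Q(s', a') once to every state-action pair."""
    new_q = {}
    for (s, a), (s_next, r, terminated) in model.items():
        # A terminal next state has no future value.
        best_next = 0.0 if terminated else max(q[(s_next, b)] for b in (0, 1))
        new_q[(s, a)] = r + GAMMA * best_next
    return new_q


for _ in range(50):
    q = bellman_backup(q)

for key in sorted(q):
    print(key, round(q[key], 4))
```

In this chain, moving right from state 1 is worth 1.0, moving right from state 0 is worth 0.9 (one step of discounting), and either detour to the left is worth 0.81.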

The update rule

Since we do not know Q* in advance, we approximate it iteratively:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') - Q(s, a)]

α (learning rate) — how much each new experience overrides the old estimate. Typical range: 0.01 to 0.5.
γ (discount factor) — how much the agent cares about future rewards. 0.99 means it values long-term gains; 0.5 means it is impatient.
r — the reward received after taking action a in state s.
The term in brackets is the TD error: the gap between the bootstrapped target r + γ · max_a' Q(s', a') and the current estimate Q(s, a).
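A single update worked through by hand can help here. The numbers below are made up for illustration; only the formula comes from the update rule above.

```python
alpha, gamma = 0.1, 0.99

q_sa = 0.5                  # current estimate Q(s, a)
reward = 1.0                # reward r observed after taking a in s
q_next = [0.2, 0.8, 0.4]    # Q(s', a') for each action a' in the next state

td_target = reward + gamma * max(q_next)   # 1.0 + 0.99 * 0.8, about 1.792
td_error = td_target - q_sa                # about 1.292
q_sa_new = q_sa + alpha * td_error         # about 0.6292

print(td_target, td_error, q_sa_new)
```

With α = 0.1 the estimate moves one tenth of the way from 0.5 toward the target 1.792. A larger α would move it further in a single step.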

Exploration: epsilon-greedy

If the agent always picks the best-known action, it never discovers better alternatives. Epsilon-greedy balances this:

With probability ε, pick a random action (explore).
With probability 1 - ε, pick the action with the highest Q-value (exploit).

A common pattern is to start ε high (e.g., 1.0) and decay it over time toward a small value (e.g., 0.01). This ensures broad exploration early and focused exploitation later.
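The selection rule and the decay schedule can be sketched as two small functions. The names epsilon_greedy and decayed_epsilon are illustrative, and the multiplicative decay is only one of several reasonable schedules.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore
    return int(np.argmax(q_row))               # exploit


def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.995):
    """Multiplicative decay per episode, clipped at a floor."""
    return max(end, start * decay ** episode)


q_row = np.array([0.0, 1.0, 0.5])
print(epsilon_greedy(q_row, epsilon=0.0, rng=rng))   # always the greedy action
print(decayed_epsilon(0), decayed_epsilon(500), decayed_epsilon(5_000))
```

With a start of 1.0 and a decay of 0.995 per episode, ε reaches the 0.01 floor after roughly 920 episodes, so most of a long training run is spent near-greedy.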

Tabular Q-Learning in code

import numpy as np
import gymnasium as gym

# Deterministic 4x4 FrozenLake: discrete states and actions, so a table suffices.
env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 1.0
epsilon_decay, epsilon_min = 0.995, 0.01

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Bootstrap from the next state unless the episode truly terminated;
        # a time-limit truncation still bootstraps.
        best_next = np.max(q_table[next_state])
        td_error = reward + gamma * best_next * (1 - terminated) - q_table[state, action]
        q_table[state, action] += alpha * td_error
        state = next_state
    # Decay exploration once per episode, down to the floor.
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

After training, the learned policy is simply np.argmax(q_table[state]) for any state.
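Extracting the policy for all states at once is a single argmax over the action axis. The sketch below uses a small hand-written Q-table rather than a trained one, and the action names follow FrozenLake's encoding (0 = left, 1 = down, 2 = right, 3 = up).

```python
import numpy as np

# Hypothetical Q-table: 4 states (rows) by 4 actions (columns).
q_table = np.array([
    [0.1, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.3, 0.3, 0.1, 0.0],   # tie between actions 0 and 1
    [0.0, 0.0, 0.0, 0.0],   # never updated, e.g. a terminal state
])

action_names = ["left", "down", "right", "up"]

# Greedy policy: the best-known action for every state.
policy = np.argmax(q_table, axis=1)

for state, action in enumerate(policy):
    print(state, action_names[action])
```

np.argmax breaks ties by returning the first maximal index, so rows that were never updated map to action 0. That is harmless for terminal states but worth remembering when reading the policy of a partially trained table.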

Convergence conditions

Q-Learning is guaranteed to converge to the optimal Q-values if:

Every state-action pair is visited infinitely often.
The learning rate decays appropriately (sum of α = ∞, sum of α² < ∞).
The environment is a finite MDP.

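In practice these conditions are rarely met exactly: a constant α violates the second one, and no finite run visits anything infinitely often. Epsilon-greedy with a floor on ε keeps every action selectable, and a small constant α is usually close enough on small problems such as FrozenLake.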
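One schedule that does satisfy the learning-rate condition is α = 1/n, where n counts the visits to a given state-action pair. The sketch below is illustrative (the names visit_counts and learning_rate are made up) and checks the two sums numerically over 10,000 visits to a single pair.

```python
from collections import defaultdict

# Number of updates applied so far to each (state, action) pair.
visit_counts = defaultdict(int)


def learning_rate(state, action):
    """Return alpha = 1 / n for the n-th visit to (state, action)."""
    visit_counts[(state, action)] += 1
    return 1.0 / visit_counts[(state, action)]


alphas = [learning_rate(0, 1) for _ in range(10_000)]

sum_alpha = sum(alphas)                      # harmonic series: keeps growing
sum_alpha_sq = sum(a * a for a in alphas)    # stays below pi^2 / 6, about 1.645

print(sum_alpha, sum_alpha_sq)
```

The first sum grows without bound as visits accumulate, while the second stays bounded. A constant α such as 0.1 fails the second condition, because the sum of its squares also grows without bound.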

When tabular Q-Learning breaks down

If the state space is continuous (like joint angles) or very large (like pixel images), the Q-table becomes impractically huge. This is where Deep Q-Networks (DQN) take over — replacing the table with a neural network that approximates Q-values.
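A rough count shows how quickly the table grows. The 84 x 84 grayscale frame with 256 intensity levels is an illustrative assumption, not a figure from this article.

```python
import math

# FrozenLake 4x4: 16 states and 4 actions.
frozenlake_entries = 16 * 4

# Assumed image input: 84 x 84 pixels, each taking one of 256 values.
n_pixels = 84 * 84
# The number of distinct images is 256 ** n_pixels. Count its decimal digits
# with logarithms instead of building the integer itself.
digits = math.floor(n_pixels * math.log10(256)) + 1

print(frozenlake_entries)
print(f"distinct images: a number with {digits} decimal digits")
```

FrozenLake needs 64 table entries. The image case would need one row per distinct image, which is far beyond anything that can be stored or visited.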

Common misconception

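People often confuse Q-Learning with SARSA. Both use a similar update, but Q-Learning uses max Q(s', a') (the best possible next action), while SARSA uses the actual next action the agent took. This makes Q-Learning off-policy and SARSA on-policy. The practical difference: Q-Learning's target ignores which action the exploration strategy picks next, so it estimates the optimal Q-values under any exploration strategy that keeps visiting every state-action pair, whereas SARSA learns the value of the policy it is actually following, exploration included.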
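The difference is easiest to see by computing both targets for the same transition. The function names and the numbers below are illustrative.

```python
def q_learning_target(reward, gamma, q_next_row, terminated):
    """Off-policy target: bootstrap from the best next action."""
    if terminated:
        return reward
    return reward + gamma * max(q_next_row)


def sarsa_target(reward, gamma, q_next_row, next_action, terminated):
    """On-policy target: bootstrap from the action actually taken next."""
    if terminated:
        return reward
    return reward + gamma * q_next_row[next_action]


q_next_row = [0.2, 0.8, 0.4]   # Q(s', a') for each next action
next_action = 2                # suppose exploration picked a non-greedy action

print(q_learning_target(0.0, 0.9, q_next_row, terminated=False))            # about 0.72
print(sarsa_target(0.0, 0.9, q_next_row, next_action, terminated=False))    # about 0.36
```

When the next action happens to be the greedy one, the two targets coincide. They differ only on exploratory steps, which is why SARSA's estimates reflect the cost of exploring and Q-Learning's do not.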

The one thing to remember: Q-Learning's update rule (nudge the Q-value toward the reward plus the best future value) is simple, converges to the optimal Q-values in the tabular setting under the conditions above, and is the foundation of DQN and the deep value-based methods built on it.
