Q-Learning Implementation — ELI5

Imagine you are lost in a maze. You have no map, but you carry a notebook. Every time you turn a corner, you write down what happened. “Turned left at the red wall — hit a dead end. Score: bad.” “Turned right at the blue wall — found a cookie. Score: great!”

After wandering for a long time, your notebook becomes a cheat sheet. For every spot in the maze and every direction you could go, you have a score. Next time you visit a spot, you just check your notebook and pick the direction with the best score.

Q-Learning is that notebook. The program keeps a big table (the Q-table) with a row for every situation it might be in and a column for every action it could take. Each cell holds a score — how good that action is in that situation.
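Here is a minimal sketch of what that notebook could look like in Python. The maze spots and directions are made-up names for illustration, and a dictionary of dictionaries is just one convenient way to store the table (a 2-D array would work as well).

```python
# Hypothetical toy maze: four spots and two directions (names invented for this sketch).
STATES = ["start", "red_wall", "blue_wall", "cookie"]
ACTIONS = ["left", "right"]


def make_q_table(states, actions):
    """Build the notebook: one row per situation, one column per action, all scores 0.0."""
    return {state: {action: 0.0 for action in actions} for state in states}


q_table = make_q_table(STATES, ACTIONS)

# Writing one score into the notebook: "right at the blue wall was great".
q_table["blue_wall"]["right"] = 1.0
```

Looking up `q_table["blue_wall"]` then gives the scores for every direction at that spot, which is all the program needs to compare its options.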

At first, the table is full of zeros because the program knows nothing. It starts by picking random moves, sees what reward each one brings, and updates the scores. Over time the scores get more accurate, so the program can lean on the table more and on luck less.
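One common way to balance random moves against smart ones is called epsilon-greedy: with a small probability the program tries something random, and otherwise it trusts the notebook. The sketch below assumes that strategy. The function name `choose_action` and the `epsilon` parameter are illustrative choices, not something the text above prescribes.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Pick an action from one notebook row (a dict mapping action -> score).

    With probability `epsilon` explore with a random action;
    otherwise exploit by taking the action with the highest score.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_row))   # explore: try anything
    return max(q_row, key=q_row.get)     # exploit: best known score


row = {"left": 0.1, "right": 0.7}
careful_pick = choose_action(row, epsilon=0.0)   # never explores, so always the best score
```

A typical pattern is to start with a large `epsilon` and shrink it as the scores become more trustworthy.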

The magic ingredient is that Q-Learning does not just record what happened right now. It peeks at the best future score in the notebook and uses that to update the current score. So a move that leads to a great position later gets a high score even if the immediate reward is small.
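In code, that peek at the future is a single line of arithmetic. The sketch below uses the standard Q-Learning update, where `alpha` (how fast old scores are overwritten) and `gamma` (how much future scores count) are conventional parameter names with example values chosen here for illustration. The two-spot table is invented for the demonstration.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Nudge one notebook score toward (reward now + discounted best future score)."""
    best_future = max(q_table[next_state].values())   # peek at the best next score
    target = reward + gamma * best_future
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-spot maze: going right from "hall" leads to "door",
# and the notebook already knows that going right at "door" is worth 10.
q = {
    "hall": {"left": 0.0, "right": 0.0},
    "door": {"left": 0.0, "right": 10.0},
}
new_score = q_update(q, "hall", "right", reward=0.0, next_state="door")
# new_score is 0.0 + 0.5 * (0.0 + 0.9 * 10.0 - 0.0) = 4.5
```

Even though the immediate reward was zero, the move from "hall" picked up a positive score because it leads somewhere good. This sketch does not treat end-of-game spots specially; a fuller version would use a future score of zero once the game is over.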

After enough exploring of a small problem like this maze, the program can act nearly optimally by always choosing the action with the highest Q-value. No teacher, no instructions, just a notebook and persistence.
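To see the whole loop in one place, here is a small end-to-end sketch. Everything about the environment is an assumption made for illustration: a corridor of five cells with a cookie in the last one, purely random moves during training, and hand-picked values for `alpha`, `gamma`, and the number of episodes.

```python
import random

N_CELLS = 5                    # hypothetical corridor: cells 0..4, cookie in cell 4
ACTIONS = ["left", "right"]


def step(state, action):
    """Move one cell (the left wall keeps you in place). Returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    done = nxt == N_CELLS - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_CELLS)]       # the notebook starts blank
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(len(ACTIONS))        # explore with purely random moves
            nxt, reward, done = step(state, ACTIONS[a])
            best_future = 0.0 if done else max(q[nxt])   # no future once the cookie is found
            q[state][a] += alpha * (reward + gamma * best_future - q[state][a])
            state = nxt
    return q


def greedy_policy(q):
    """For every non-final cell, read off the action with the highest score."""
    return [ACTIONS[row.index(max(row))] for row in q[:-1]]


q = train()
policy = greedy_policy(q)
```

In this toy corridor the learned `policy` should point right in every cell, because that is the shortest way to the cookie. Larger or noisier problems usually need more episodes and a more careful exploration strategy than random moves alone.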

The one thing to remember: Q-Learning is a notebook that scores every possible move in every possible situation, and the program just picks the highest score.

