Monte Carlo Tree Search — Core Concepts
What problem MCTS solves
In games like Go or chess, the number of possible move sequences grows exponentially. Evaluating every branch is impossible. Traditional minimax with alpha-beta pruning works for chess but breaks down when the branching factor is huge (Go has ~250 legal moves per turn). MCTS handles this by sampling — it only explores the most promising branches deeply.
The four phases
Every MCTS iteration repeats four steps:
1. Selection
Starting from the root (the current game state), walk down the tree by choosing the child with the highest UCB1 score at each level until you reach a node that has unexplored children.
2. Expansion
Pick one unexplored child and add it to the tree. This new node represents a game state one move beyond its parent.
3. Simulation (rollout)
From the new node, play the game to completion using a fast, usually random, policy. The result is a win, loss, or draw.
4. Backpropagation
Walk back up from the new node to the root, updating every node along the path with the simulation result (total score and visit count).
After many iterations, the root’s child with the most visits (not highest average score — visits are more robust) is chosen as the best move.
UCB1: the exploration-exploitation formula
At each node, children are scored with:
UCB1(child) = win_rate(child) + C × √(ln(parent_visits) / child_visits)
- win_rate pulls the search toward moves that have worked well (exploitation).
- The square root term pulls toward moves that have been tried less (exploration).
- C is a tuning constant. Higher C means more exploration. √2 is the theoretical default; practical values range from 1.0 to 2.0.
This formula is what makes MCTS self-balancing. Good moves get explored more, but surprising moves always get a second chance.
When MCTS shines
| Strength | Why |
|---|---|
| No evaluation function needed | Random rollouts provide signal without domain expertise |
| Anytime algorithm | More iterations = better results, but any amount gives a legal move |
| Handles huge branching factors | Only explores promising branches deeply |
| Works for imperfect info games | Variants like Information Set MCTS handle hidden cards |
When MCTS struggles
- Deep tactical games — if the critical move is 50 turns away, random rollouts rarely find it.
- Real-time constraints — if you only get 10ms per move, there is no time for enough iterations.
- Continuous action spaces — the tree structure assumes discrete choices.
MCTS + neural networks (AlphaZero style)
AlphaZero replaced random rollouts with a neural network that evaluates positions directly. The network provides:
- A value estimate (how good is this position?) — replaces the simulation phase.
- A policy prior (which moves look promising?) — biases the selection phase.
This combination dramatically reduces the number of iterations needed. Where vanilla MCTS might need millions of rollouts, AlphaZero-style MCTS needs only hundreds because each evaluation is much more informative.
Common misconception
Many people think MCTS “looks ahead” like a human chess player planning moves. It does not plan in the traditional sense — it samples outcomes statistically. The “tree” is not a plan but a statistical record of which moves led to which results. The quality emerges from volume, not foresight.
The one thing to remember: MCTS builds a lopsided tree — deep where it matters, shallow where it does not — by balancing exploitation of known-good moves with exploration of unknown ones via UCB1.
See Also
- Python Environment Wrappers How thin add-on layers let you change what a learning program sees and does without rewriting the game itself
- Python Multi Agent Reinforcement What happens when multiple programs learn together in the same world — cooperation, competition, and emergent teamwork
- Python Openai Gym Environments Why OpenAI Gym is the playground where robots and programs learn by trial and error — no prior coding knowledge needed
- Python Policy Gradient Methods Instead of scoring every move, what if the program just learned which moves feel right? That is policy gradients
- Python Q Learning Implementation How a program builds a cheat sheet of every situation and every action to figure out the best move — no teacher required