Monte Carlo Tree Search — Core Concepts

The four phases of MCTS — selection, expansion, simulation, backpropagation — and where UCB1 balances exploration with exploitation

What problem MCTS solves

In games like Go or chess, the number of possible move sequences grows exponentially. Evaluating every branch is impossible. Traditional minimax with alpha-beta pruning works for chess but breaks down when the branching factor is huge (Go has ~250 legal moves per turn). MCTS handles this by sampling — it only explores the most promising branches deeply.

The four phases

Every MCTS iteration repeats four steps:

1. Selection

Starting from the root (the current game state), walk down the tree by choosing the child with the highest UCB1 score at each level until you reach a node that has unexplored children.

2. Expansion

Pick one unexplored child and add it to the tree. This new node represents a game state one move beyond its parent.

3. Simulation (rollout)

From the new node, play the game to completion using a fast, usually random, policy. The result is a win, loss, or draw.

4. Backpropagation

Walk back up from the new node to the root, updating every node along the path with the simulation result (total score and visit count).

After many iterations, the root’s child with the most visits (not highest average score — visits are more robust) is chosen as the best move.

UCB1: the exploration-exploitation formula

At each node, children are scored with:

UCB1(child) = win_rate(child) + C × √(ln(parent_visits) / child_visits)

win_rate pulls the search toward moves that have worked well (exploitation).
The square root term pulls toward moves that have been tried less (exploration).
C is a tuning constant. Higher C means more exploration. √2 is the theoretical default; practical values range from 1.0 to 2.0.

This formula is what makes MCTS self-balancing. Good moves get explored more, but surprising moves always get a second chance.

When MCTS shines

Strength	Why
No evaluation function needed	Random rollouts provide signal without domain expertise
Anytime algorithm	More iterations = better results, but any amount gives a legal move
Handles huge branching factors	Only explores promising branches deeply
Works for imperfect info games	Variants like Information Set MCTS handle hidden cards

When MCTS struggles

Deep tactical games — if the critical move is 50 turns away, random rollouts rarely find it.
Real-time constraints — if you only get 10ms per move, there is no time for enough iterations.
Continuous action spaces — the tree structure assumes discrete choices.

MCTS + neural networks (AlphaZero style)

AlphaZero replaced random rollouts with a neural network that evaluates positions directly. The network provides:

A value estimate (how good is this position?) — replaces the simulation phase.
A policy prior (which moves look promising?) — biases the selection phase.

This combination dramatically reduces the number of iterations needed. Where vanilla MCTS might need millions of rollouts, AlphaZero-style MCTS needs only hundreds because each evaluation is much more informative.

Common misconception

Many people think MCTS “looks ahead” like a human chess player planning moves. It does not plan in the traditional sense — it samples outcomes statistically. The “tree” is not a plan but a statistical record of which moves led to which results. The quality emerges from volume, not foresight.

The one thing to remember: MCTS builds a lopsided tree — deep where it matters, shallow where it does not — by balancing exploitation of known-good moves with exploration of unknown ones via UCB1.

pythonreinforcement-learningaisearch-algorithms