Summary

Monte Carlo Tree Search (MCTS) is a tree-search algorithm that estimates the value of game states by running random (Monte Carlo) playouts and backpropagating the results. It gained widespread recognition after its success in Go (AlphaGo’s predecessors) and has been applied to a wide range of games from board games to RTS and turn-based strategy.

MCTS is not universally successful. In games with large action spaces or narrow winning paths, naïve MCTS performs poorly without domain-specific modifications. The key techniques described here — hierarchical expansion, action set ordering, abstract simulation, and evaluation function shaping — address these failure modes.

Material drawn from Roelofs (Xenonauts 2, Chapter 16) and Roelofs et al. (Prismata, Chapter 18), see source-game-ai-pro-360-tactics-strategy.

“MCTS has a poor track record for tactical decision-making. More accurately, it struggles with problems that have a very narrow path of victory or success.” — Roelofs, Chapter 16


MCTS fundamentals

MCTS builds a partial game tree incrementally. Each iteration performs four phases:

1. Selection

Starting from the root, traverse the tree by selecting the child with the highest UCB1 score until reaching a node that is not fully expanded:

UCB1(node) = Q(node)/N(node) + C × √(ln(N(parent)) / N(node))

Where:

  • Q(node) = total value accumulated through this node (sum of playout outcomes)
  • N(node) = visit count
  • N(parent) = parent visit count
  • C = exploration constant (typically √2)

The first term exploits high-value nodes; the second term explores less-visited nodes. UCB1 balances exploitation and exploration without hyperparameter tuning of the exploration-exploitation trade-off.

2. Expansion

Add one or more children to the selected node by applying legal moves from the current game state.

3. Simulation (playout / rollout)

From the newly expanded node, play the game to completion (or a fixed depth) using a playout policy — typically random, but optionally guided by a lightweight heuristic.

4. Backpropagation

Propagate the outcome (win/loss/score) back up the tree, updating Q and N for each ancestor node.

After a fixed number of iterations (or a time budget), return the action with the highest visit count from the root.

// Minimal MCTS in C# (turn-based, two-player)
public class MCTSNode
{
    public IGameState State;
    public MCTSNode Parent;
    public List<MCTSNode> Children = new();
    public float TotalValue;
    public int Visits;
    public IGameAction ActionTaken;
 
    public float UCB1(float C = 1.414f) =>
        Visits == 0
            ? float.MaxValue
            : TotalValue / Visits + C * Mathf.Sqrt(Mathf.Log(Parent.Visits) / Visits);
}
 
public class MCTS
{
    private int _iterations;
 
    public IGameAction Search(IGameState rootState)
    {
        var root = new MCTSNode { State = rootState };
 
        for (int i = 0; i < _iterations; i++)
        {
            var node = Select(root);
            node = Expand(node);
            float result = Simulate(node.State);
            Backpropagate(node, result);
        }
 
        return root.Children.OrderByDescending(c => c.Visits).First().ActionTaken;
    }
 
    MCTSNode Select(MCTSNode node)
    {
        while (node.Children.Count > 0)
            node = node.Children.OrderByDescending(c => c.UCB1()).First();
        return node;
    }
 
    MCTSNode Expand(MCTSNode node)
    {
        var actions = node.State.GetLegalActions();
        if (actions.Count == 0) return node;
        var action = actions[UnityEngine.Random.Range(0, actions.Count)];
        var child = new MCTSNode
        {
            State = node.State.ApplyAction(action),
            Parent = node,
            ActionTaken = action
        };
        node.Children.Add(child);
        return child;
    }
 
    float Simulate(IGameState state)
    {
        // Random playout to terminal state or depth limit
        int depth = 0;
        while (!state.IsTerminal() && depth++ < 50)
            state = state.ApplyAction(state.GetRandomAction());
        return state.GetScore();
    }
 
    void Backpropagate(MCTSNode node, float result)
    {
        while (node != null)
        {
            node.Visits++;
            node.TotalValue += result;
            node = node.Parent;
            result = 1f - result; // flip for opponent
        }
    }
}

Why MCTS struggles in strategy games

Large action spaces

A single turn in an RTS or complex strategy game can have thousands of legal actions (move every unit to every cell, attack every combination of targets). The branching factor makes the tree too wide to explore meaningfully within a time budget.

Narrow victory paths

In tactical scenarios with a “correct” sequence of actions (e.g. flanking a chokepoint, timing a combined arms push), the probability that a random playout hits that path is near zero. MCTS fails to discover these paths because its simulation policy cannot distinguish them from the noise of random play.

“MCTS struggles with problems that have a very narrow path of victory or success. It has trouble finding those rare paths because the simulations tend to be more random.” — Roelofs, Chapter 16


Hierarchical expansion

Core idea

Rather than treating each turn as a single atomic choice from thousands of options, decompose each turn into a sequence of atomic actions and build the tree at the action level rather than the turn level.

Each tree level corresponds to one atomic decision (e.g. “choose the next unit to act” then “choose where to move it” then “choose what to attack”). The tree depth increases but the branching factor at each level is far smaller.

Action ordering

Not all actions are equally worth exploring first. Two ordering strategies:

Heuristic ordering: Evaluate each candidate action with a fast domain-specific heuristic (e.g. “prefer actions that move units toward the objective”). Expand high-scoring actions first.

Entropy-based ordering: Sort the candidate action set by the uncertainty in their outcome. Actions whose results are highly variable (high entropy in playout outcomes) are expanded first — they carry the most information about the position.

Computing entropy online has overhead; Roelofs (Chapter 16) notes this is an open problem — whether the information gain justifies the cost depends on the game.

Average playout value completion: When a turn is only partially specified (some units have been assigned actions, others have not), complete the remainder by averaging over playout values rather than making random choices. This gives more stable intermediate state evaluations.


Abstract simulation models

Full game simulation during playouts is expensive — simulating hundreds of units at full resolution for 50 turns is prohibitive. Abstract simulation replaces the full simulator with a faster approximation:

  • Groups of units become single aggregate “stacks” with summed health and damage.
  • Movement is replaced by per-region presence/absence (using a region graph — see strategic-ai-rts).
  • Combat is resolved using Lanchester models rather than per-unit simulation.

Abstract simulation is less accurate but enables 10–100× more playout depth within the same budget. Accuracy is calibrated to match the full simulator at the scale of strategic decisions (which army wins) rather than tactical precision (which unit survives).


Evaluation function shaping

For games with a narrow victory path, terminal-outcome evaluation (win=1, loss=0) provides almost no gradient signal for the search. Most random playouts end in loss regardless of how well the AI played.

Intermediate evaluation functions reward steady progress:

  • Unit survival ratio.
  • Objective proximity.
  • Territory held.
  • Economy ratio.

The challenge is avoiding evaluation functions that guide the AI toward a locally good state but strategically inferior outcome (Goodhart’s law in miniature). Roelofs (Chapter 16) describes this as “reward hacking” for the evaluation function — the AI maximises the proxy rather than actual victory.

A practical approach: use intermediate evaluation during early search to provide gradient, but switch to terminal outcome for final move selection.


See Hierarchical Portfolio Search (HPS) for the Prismata implementation. HPS wraps MCTS with a portfolio of Partial Player functions, replacing the raw action enumeration at each search node with a small set of tactically meaningful move generators. This addresses the branching factor problem at the architecture level rather than by modifying MCTS itself.


Applications in games

Game / SystemMCTS useModification
AlphaGo / AlphaZero (Go)Core algorithmNeural network policy + value functions replace random rollouts
Xenonauts 2Tactical turn-based combatHierarchical expansion; entropy-based action ordering
PrismataCard/unit strategyHPS (portfolio of partial players)
General Game Playing (GGP)Universal game playerDomain-agnostic MCTS with minimal heuristics
StarCraft II (AlphaStar)Strategic macroCombines MCTS-inspired search with RL policy

Trade-offs

ApproachBest whenLimitations
Naïve MCTSBoard games with moderate branching factorFails with large action spaces or narrow victory paths
Hierarchical expansionAction-decomposable turnsRequires domain structure; tree depth can grow unwieldy
Abstract simulationLarge maps / long horizonsAccuracy loss; calibration required
HPSComplex strategy gamesRequires handcrafted PP functions per phase
Alpha-beta (Minimax)Perfect-information two-playerExponential in depth; doesn’t scale to multiplayer or stochastic games

Evidence

  • Roelofs (Chapter 16, see source-game-ai-pro-360-tactics-strategy) describes the failure modes of naïve MCTS in Xenonauts 2 and the hierarchical expansion solution in detail.
  • Roelofs et al. (Chapter 18) describe Hierarchical Portfolio Search and the Prismata Master Bot, including the symmetric-play leaf evaluation and its robustness to balance patches.
  • The UCB1/UCT formula is due to Kocsis & Szepesvári (2006) and is the standard exploration–exploitation balance used in game MCTS.

Open questions

  • What is the overhead of computing entropy-based action ordering online vs. offline, and does the information gain justify it for tactical games?
  • Can abstract simulation accuracy be validated automatically (by comparing abstract vs. full-simulation outcomes on a held-out replay set) as part of game development?
  • How does MCTS interact with imperfect information (fog of war)? Determinisation (sample one possible state, run MCTS, repeat) is the standard approach — but determinisation is known to produce poor decisions in bluffing-sensitive games.

strategic-ai-rts · game-ai-agent-design · squad-ai-patterns · influence-maps · source-game-ai-pro-360-tactics-strategy