Summary

Machine learning (ML) gives game agents the ability to learn from experience rather than executing hand-authored rules. In games, ML is used for adaptive difficulty, procedural content generation, playtesting automation, and increasingly for NPC behaviour. The most relevant approaches for game AI are reinforcement learning (agents learn from reward signals) and genetic algorithms (populations evolve toward better solutions — see genetic-algorithms for detail). (Taulli, Artificial Intelligence Basics, see source-ai-basics; Yannakakis and Togelius, Artificial Intelligence and Games, see source-ai-and-games)

(Prof Charles, CRE341 Wks 7–9, see source-cre341-lectures)


AI, ML, and deep learning

These three terms are often conflated. They are nested concepts:

Artificial Intelligence (broadest)
  └── Machine Learning
        └── Deep Learning (narrowest)
  • AI — concerned with intelligent behaviour in artefacts: perception, reasoning, learning, communicating, and acting in complex environments. (Nils Nilsson, Artificial Intelligence: A New Synthesis)
  • Machine learning — AI systems that learn from data rather than executing hand-authored rules
  • Deep learning — ML using multi-layer neural networks; particularly effective for pattern recognition, image classification, and sequential prediction

Types of AI

TypeDescriptionExample
Weak AIFocused on a single narrow taskControlling flow in a pipe; a chess engine
Strong AICan adapt, problem-solve, or learn across domainsNot yet achieved reliably
AGI (Artificial General Intelligence)More human-like; can generalise and perform a range of different tasksResearch goal; not yet achieved
SuperintelligenceFaster, more accurate, and beyond human understandingSpeculative

Most AI in games today is Weak AI — purpose-built for a specific task (pathfinding, difficulty adjustment, procedural generation). Strong AI and AGI are research horizons, not production tools.

Brief history

PeriodMilestone
1950Turing Test proposed; Dartmouth Conference (John McCarthy coins artificial intelligence); information theory and symbolic reasoning
1964ELIZA — first chatbot (Joseph Weizenbaum, MIT)
1975–1982Expert systems (Edward Feigenbaum); neural networks conceptualised; optical character recognition; speech recognition
~1990AI Winter I — limited compute, storage, and network capacity; combinatorial explosion; disappointing results
1990s–2000sMachine learning resurgence; deep learning (pattern analysis, classification); big data; fast processors
1997IBM Deep Blue defeats Garry Kasparov (chess)
2011IBM Watson wins Jeopardy!; Apple integrates Siri into iPhone
2014AI learns to recognise cats from YouTube videos (Google); DeepMind defeats Lee Sedol at Go
~2000sAI Winter II — collapse of dedicated hardware vendors; disappointing scaling results
2015+Transformers, large language models, Stable Diffusion; modern deep learning era

(Prof Charles, CRE341 Wk 7, citing Lavenda/Marsden AI timeline)


Uncertainty and probability

Games and AI share a foundational relationship with uncertainty. ML uses probability to model likelihood and predict future events.

Basic probability rules

P(A) = outcomes where A happens / total outcomes
P(B | A) = P(A and B) / P(A)          // conditional probability
P(A or B) = P(A) + P(B) - P(A and B)  // addition rule
P(A and B) = P(A) * P(B|A)            // multiplication rule
P(not A) = 1 - P(A)

Bayes’ Theorem

Updates a belief (prior probability) in light of new evidence:

P(H | E) = ( P(E | H) * P(H) ) / P(E)
TermMeaning
P(H)Prior — initial belief before seeing evidence
`P(EH)`
P(E)Marginal — total probability of the evidence across all hypotheses
`P(HE)`

Game application: Estimating a player’s current ability level from observed performance, then updating the estimate as more data arrives (dynamic difficulty adjustment).

Shannon entropy

Entropy measures the average information content of a probability distribution:

H(X) = -Σ p(x) * log₂(p(x))

Individual information content of event x:

I(x) = -log₂(p(x))

Example: A fair coin flip. P(heads) = 0.5, so I(heads) = -log₂(0.5) = 1 bit.

“Learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.” — Goodfellow et al., Deep Learning

  • Low probability event → high information (surprising)
  • High probability event → low information (unsurprising)

Entropy underlies decision tree construction (information gain) and many ML scoring metrics.


ML algorithm taxonomy

TypeDescriptionExample algorithmsGame use
Supervised learningLearn from labelled (input, output) pairs; tackles regression and classificationDecision Tree, Linear/Logistic Regression, Neural NetworksPlayer behaviour classification, cheat detection
Unsupervised learningDiscover structure in unlabelled data; groups similar examplesClustering (k-means), PCAPlayer segmentation, behaviour clustering
Reinforcement learningLearn through action → reward signals; no labels neededQ-Learning, SARSA, PPOAdaptive NPCs, game-playing agents (AlphaGo, OpenAI Five)
Recommender systemsLearn to suggest items matching user preferencesCollaborative filteringIn-game recommendations, matchmaking

Search-based ML

Hill climbing: Iteratively moves toward higher-valued solutions in the search space. Fast but gets stuck in local optima.

Simulated annealing: Like hill climbing, but with a probability of accepting a worse solution (decreasing over time). Allows escape from local optima at the cost of slower convergence.


Reinforcement learning

RL is the most directly applicable ML paradigm for game AI. An agent learns which actions produce rewards by interacting with an environment.

Formal model

  • States S — all possible configurations the environment can be in
  • Actions A — all possible actions the agent can take
  • Rewards R — scalar feedback signal received after taking an action
  • Policy π — the agent’s mapping from states to actions (what we want to learn)

Temporal difference (TD) learning

The agent learns a Q-table: a mapping from (state, action) pairs to expected cumulative reward. Q-values are updated iteratively using the Bellman equation:

Q(s, a) ← Q(s, a) + α * (r + γ * max_a' Q(s', a') - Q(s, a))
SymbolMeaning
αLearning rate (0–1): how fast new information overrides old
γ (gamma)Discount factor (0–1): how much future rewards matter vs. immediate rewards
rReward received on transitioning to new state s'
max Q(s', a')Best expected future reward from the next state

Q-Learning is off-policy: the update always uses the best possible next action, regardless of what the agent actually does.

SARSA is on-policy: the update uses the action the agent actually took next (making it more conservative in risky situations).

Action selection

During learning, the agent must balance:

  • Exploitation (greedy): always take the action with the highest current Q-value
  • Exploration (probabilistic): sometimes choose randomly to discover better options

ε-greedy: Choose the best action with probability (1-ε); choose randomly with probability ε.

Probabilistic greedy (proportional): The probability of choosing action a equals Q(s,a) / Σ Q(s, a') — every action gets some chance, proportional to its Q-value.

Worked example (three-state world)

States S0, S1, S2. S1 has reward 3, Qmax 5. S2 has reward 7, Qmax 6. α=0.6, γ=0.5.

Time 0:  Q(S0→S1)=0,   Q(S0→S2)=0
Time 1:  Q(S0→S1) = 0 + 0.6*(3 + 0.5*5 - 0) = 3.3
Time 2:  Q(S0→S2) = 0 + 0.6*(7 + 0.5*6 - 0) = 6.0
Time 3:  Q(S0→S2) = 6.0 + 0.6*(7 + 0.5*6 - 6.0) = 8.4
Time 4:  Q(S0→S2) = 8.4 + 0.6*(7 + 0.5*6 - 8.4) = 9.12
Final:   Q(S0→S1) = 3.3 + 0.6*(3 + 0.5*5 - 3.3) = 4.62

The agent converges on preferring S2 (higher reward + higher future value). This generalises to any Markov decision process.

Game applications of RL

Use caseState spaceReward signal
Enemy combat AIPlayer position, health, distanceDamage dealt - damage received
Game-playing agentBoard/game stateWin/loss/score delta
Difficulty adjustmentPlayer performance historyError from target win rate
Resource management AIResource counts, time, territoryScore/survival time

Neural networks

Artificial Neural Networks (ANNs) model computation loosely after biological neurons. Each neuron sums weighted inputs and applies an activation function; layers of neurons compose complex non-linear functions.

Architecture

Input layer → Hidden layer(s) → Output layer
  • Input: game state features (positions, health, distances, etc.)
  • Hidden: learned feature representations
  • Output: action probabilities or Q-values

Backpropagation algorithm

Training a network with labelled data:

  1. Feed input forward through all weights to compute output
  2. Compute loss (error between output and target)
  3. Back-propagate error through the network (chain rule)
  4. Update weights to reduce loss (gradient descent)
  5. Repeat until training error is acceptable

Neural networks + RL = Deep RL

When the state space is too large for a Q-table (e.g. pixels), a neural network approximates the Q-function. The input is the game state; the output is Q-values for each action. This is the approach used in DeepMind’s Atari-playing agents and Unity ML-Agents.

Unity ML-Agents provides a ready-made RL training pipeline that connects to a Python backend (PyTorch), enabling training of neural network policies directly inside Unity. See Unity ML-Agents documentation for setup.

Neuroevolution

Instead of backpropagation, evolve network weights using a genetic algorithm (see genetic-algorithms). Suited to environments where:

  • No labelled training data exists
  • Performance can only be measured through interaction
  • The fitness landscape has many local minima that gradient descent would get stuck in

Convolutional neural networks (CNNs)

When the input is pixel data (an image or game screen), a standard fully-connected network is impractical — the parameter count explodes. CNNs apply learned filters spatially, reducing dimensionality before passing to fully-connected layers.

Canonical game AI architecture (DeepMind Atari agent):

Input:      84×84×4 frames (grayscale; 4 stacked for velocity)
Conv 1:     8×8 filters, 16 feature maps → 20×20×16
Conv 2:     4×4 filters, 32 feature maps → 9×9×32
FC layer:   256 units
Output:     Q-values for each action (4 outputs for Atari)

The stack of frames encodes motion implicitly — two identical frames would look static; four consecutive frames allow the network to detect velocity.

Generative Adversarial Networks (GANs)

A GAN trains two networks adversarially:

  • Generator (G) — produces fake samples from random noise
  • Discriminator (D) — tries to distinguish real samples from generated ones

Training loop:

  1. G produces a fake image
  2. D scores it real or fake; backpropagation updates D to be a better detector
  3. Backpropagation also updates G to fool D more effectively
  4. At equilibrium, G produces samples indistinguishable from real data

Game applications: texture generation, concept art, character sprite synthesis, upscaling low-resolution assets. GANs underlie many commercial generative art tools (see generative-ai-game-dev).