Summary
Machine learning (ML) gives game agents the ability to learn from experience rather than executing hand-authored rules. In games, ML is used for adaptive difficulty, procedural content generation, playtesting automation, and increasingly for NPC behaviour. The most relevant approaches for game AI are reinforcement learning (agents learn from reward signals) and genetic algorithms (populations evolve toward better solutions — see genetic-algorithms for detail). (Taulli, Artificial Intelligence Basics, see source-ai-basics; Yannakakis and Togelius, Artificial Intelligence and Games, see source-ai-and-games)
(Prof Charles, CRE341 Wks 7–9, see source-cre341-lectures)
AI, ML, and deep learning
These three terms are often conflated. They are nested concepts:
Artificial Intelligence (broadest)
└── Machine Learning
└── Deep Learning (narrowest)
- AI — concerned with intelligent behaviour in artefacts: perception, reasoning, learning, communicating, and acting in complex environments. (Nils Nilsson, Artificial Intelligence: A New Synthesis)
- Machine learning — AI systems that learn from data rather than executing hand-authored rules
- Deep learning — ML using multi-layer neural networks; particularly effective for pattern recognition, image classification, and sequential prediction
Types of AI
| Type | Description | Example |
|---|---|---|
| Weak AI | Focused on a single narrow task | Controlling flow in a pipe; a chess engine |
| Strong AI | Can adapt, problem-solve, or learn across domains | Not yet achieved reliably |
| AGI (Artificial General Intelligence) | More human-like; can generalise and perform a range of different tasks | Research goal; not yet achieved |
| Superintelligence | Faster, more accurate, and beyond human understanding | Speculative |
Most AI in games today is Weak AI — purpose-built for a specific task (pathfinding, difficulty adjustment, procedural generation). Strong AI and AGI are research horizons, not production tools.
Brief history
| Period | Milestone |
|---|---|
| 1950 | Turing Test proposed; Dartmouth Conference (John McCarthy coins artificial intelligence); information theory and symbolic reasoning |
| 1964 | ELIZA — first chatbot (Joseph Weizenbaum, MIT) |
| 1975–1982 | Expert systems (Edward Feigenbaum); neural networks conceptualised; optical character recognition; speech recognition |
| ~1990 | AI Winter I — limited compute, storage, and network capacity; combinatorial explosion; disappointing results |
| 1990s–2000s | Machine learning resurgence; deep learning (pattern analysis, classification); big data; fast processors |
| 1997 | IBM Deep Blue defeats Garry Kasparov (chess) |
| 2011 | IBM Watson wins Jeopardy!; Apple integrates Siri into iPhone |
| 2014 | AI learns to recognise cats from YouTube videos (Google); DeepMind defeats Lee Sedol at Go |
| ~2000s | AI Winter II — collapse of dedicated hardware vendors; disappointing scaling results |
| 2015+ | Transformers, large language models, Stable Diffusion; modern deep learning era |
(Prof Charles, CRE341 Wk 7, citing Lavenda/Marsden AI timeline)
Uncertainty and probability
Games and AI share a foundational relationship with uncertainty. ML uses probability to model likelihood and predict future events.
Basic probability rules
P(A) = outcomes where A happens / total outcomes
P(B | A) = P(A and B) / P(A) // conditional probability
P(A or B) = P(A) + P(B) - P(A and B) // addition rule
P(A and B) = P(A) * P(B|A) // multiplication rule
P(not A) = 1 - P(A)
Bayes’ Theorem
Updates a belief (prior probability) in light of new evidence:
P(H | E) = ( P(E | H) * P(H) ) / P(E)
| Term | Meaning |
|---|---|
P(H) | Prior — initial belief before seeing evidence |
| `P(E | H)` |
P(E) | Marginal — total probability of the evidence across all hypotheses |
| `P(H | E)` |
Game application: Estimating a player’s current ability level from observed performance, then updating the estimate as more data arrives (dynamic difficulty adjustment).
Shannon entropy
Entropy measures the average information content of a probability distribution:
H(X) = -Σ p(x) * log₂(p(x))
Individual information content of event x:
I(x) = -log₂(p(x))
Example: A fair coin flip. P(heads) = 0.5, so I(heads) = -log₂(0.5) = 1 bit.
“Learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.” — Goodfellow et al., Deep Learning
- Low probability event → high information (surprising)
- High probability event → low information (unsurprising)
Entropy underlies decision tree construction (information gain) and many ML scoring metrics.
ML algorithm taxonomy
| Type | Description | Example algorithms | Game use |
|---|---|---|---|
| Supervised learning | Learn from labelled (input, output) pairs; tackles regression and classification | Decision Tree, Linear/Logistic Regression, Neural Networks | Player behaviour classification, cheat detection |
| Unsupervised learning | Discover structure in unlabelled data; groups similar examples | Clustering (k-means), PCA | Player segmentation, behaviour clustering |
| Reinforcement learning | Learn through action → reward signals; no labels needed | Q-Learning, SARSA, PPO | Adaptive NPCs, game-playing agents (AlphaGo, OpenAI Five) |
| Recommender systems | Learn to suggest items matching user preferences | Collaborative filtering | In-game recommendations, matchmaking |
Search-based ML
Hill climbing: Iteratively moves toward higher-valued solutions in the search space. Fast but gets stuck in local optima.
Simulated annealing: Like hill climbing, but with a probability of accepting a worse solution (decreasing over time). Allows escape from local optima at the cost of slower convergence.
Reinforcement learning
RL is the most directly applicable ML paradigm for game AI. An agent learns which actions produce rewards by interacting with an environment.
Formal model
- States S — all possible configurations the environment can be in
- Actions A — all possible actions the agent can take
- Rewards R — scalar feedback signal received after taking an action
- Policy π — the agent’s mapping from states to actions (what we want to learn)
Temporal difference (TD) learning
The agent learns a Q-table: a mapping from (state, action) pairs to expected cumulative reward. Q-values are updated iteratively using the Bellman equation:
Q(s, a) ← Q(s, a) + α * (r + γ * max_a' Q(s', a') - Q(s, a))
| Symbol | Meaning |
|---|---|
α | Learning rate (0–1): how fast new information overrides old |
γ (gamma) | Discount factor (0–1): how much future rewards matter vs. immediate rewards |
r | Reward received on transitioning to new state s' |
max Q(s', a') | Best expected future reward from the next state |
Q-Learning is off-policy: the update always uses the best possible next action, regardless of what the agent actually does.
SARSA is on-policy: the update uses the action the agent actually took next (making it more conservative in risky situations).
Action selection
During learning, the agent must balance:
- Exploitation (greedy): always take the action with the highest current Q-value
- Exploration (probabilistic): sometimes choose randomly to discover better options
ε-greedy: Choose the best action with probability (1-ε); choose randomly with probability ε.
Probabilistic greedy (proportional): The probability of choosing action a equals Q(s,a) / Σ Q(s, a') — every action gets some chance, proportional to its Q-value.
Worked example (three-state world)
States S0, S1, S2. S1 has reward 3, Qmax 5. S2 has reward 7, Qmax 6. α=0.6, γ=0.5.
Time 0: Q(S0→S1)=0, Q(S0→S2)=0
Time 1: Q(S0→S1) = 0 + 0.6*(3 + 0.5*5 - 0) = 3.3
Time 2: Q(S0→S2) = 0 + 0.6*(7 + 0.5*6 - 0) = 6.0
Time 3: Q(S0→S2) = 6.0 + 0.6*(7 + 0.5*6 - 6.0) = 8.4
Time 4: Q(S0→S2) = 8.4 + 0.6*(7 + 0.5*6 - 8.4) = 9.12
Final: Q(S0→S1) = 3.3 + 0.6*(3 + 0.5*5 - 3.3) = 4.62
The agent converges on preferring S2 (higher reward + higher future value). This generalises to any Markov decision process.
Game applications of RL
| Use case | State space | Reward signal |
|---|---|---|
| Enemy combat AI | Player position, health, distance | Damage dealt - damage received |
| Game-playing agent | Board/game state | Win/loss/score delta |
| Difficulty adjustment | Player performance history | Error from target win rate |
| Resource management AI | Resource counts, time, territory | Score/survival time |
Neural networks
Artificial Neural Networks (ANNs) model computation loosely after biological neurons. Each neuron sums weighted inputs and applies an activation function; layers of neurons compose complex non-linear functions.
Architecture
Input layer → Hidden layer(s) → Output layer
- Input: game state features (positions, health, distances, etc.)
- Hidden: learned feature representations
- Output: action probabilities or Q-values
Backpropagation algorithm
Training a network with labelled data:
- Feed input forward through all weights to compute output
- Compute loss (error between output and target)
- Back-propagate error through the network (chain rule)
- Update weights to reduce loss (gradient descent)
- Repeat until training error is acceptable
Neural networks + RL = Deep RL
When the state space is too large for a Q-table (e.g. pixels), a neural network approximates the Q-function. The input is the game state; the output is Q-values for each action. This is the approach used in DeepMind’s Atari-playing agents and Unity ML-Agents.
Unity ML-Agents provides a ready-made RL training pipeline that connects to a Python backend (PyTorch), enabling training of neural network policies directly inside Unity. See Unity ML-Agents documentation for setup.
Neuroevolution
Instead of backpropagation, evolve network weights using a genetic algorithm (see genetic-algorithms). Suited to environments where:
- No labelled training data exists
- Performance can only be measured through interaction
- The fitness landscape has many local minima that gradient descent would get stuck in
Convolutional neural networks (CNNs)
When the input is pixel data (an image or game screen), a standard fully-connected network is impractical — the parameter count explodes. CNNs apply learned filters spatially, reducing dimensionality before passing to fully-connected layers.
Canonical game AI architecture (DeepMind Atari agent):
Input: 84×84×4 frames (grayscale; 4 stacked for velocity)
Conv 1: 8×8 filters, 16 feature maps → 20×20×16
Conv 2: 4×4 filters, 32 feature maps → 9×9×32
FC layer: 256 units
Output: Q-values for each action (4 outputs for Atari)
The stack of frames encodes motion implicitly — two identical frames would look static; four consecutive frames allow the network to detect velocity.
Generative Adversarial Networks (GANs)
A GAN trains two networks adversarially:
- Generator (G) — produces fake samples from random noise
- Discriminator (D) — tries to distinguish real samples from generated ones
Training loop:
- G produces a fake image
- D scores it real or fake; backpropagation updates D to be a better detector
- Backpropagation also updates G to fool D more effectively
- At equilibrium, G produces samples indistinguishable from real data
Game applications: texture generation, concept art, character sprite synthesis, upscaling low-resolution assets. GANs underlie many commercial generative art tools (see generative-ai-game-dev).
Related
- genetic-algorithms — evolutionary optimisation; detailed GA mechanics; Knapsack, TSP, Lunar Lander examples
- neuroevolution — evolving neural networks rather than training them only with gradients
- game-ai-agent-design — applying ML to autonomous agents; Behaviour Trees; GOAP
- ai-state-machine-pattern — simpler rule-based alternative to ML for NPC behaviour
- player-modelling — one of the major applied uses of ML in games
- procedural-generation — PCGML: applying ML to content generation
- generative-ai-game-dev — practical generative AI tools (Muse, Firefly, Scenario AI, SoundRaw, Copilot); AI in the development pipeline
- overview-artificial-intelligence-in-games — synthesis view across classical AI, learning, PCG, and player models
- reward-systems — game design reward schedules echo RL reward signal design
- source-cre341-lectures