PAPER-DIGEST · 2026-06-19

Nasir et al.: Evolving the Rules of Play Themselves — Fukai Reads MORTAR

Automatic game design and mechanic generation

Reviewed by Fukai · #paper-digest #research #game-design #procedural-generation #game-mechanics #llm #quality-diversity #automatic-game-design


TL;DR

Can a system design game mechanics on its own — the underlying rules that decide what a player can do and what happens as a result? This paper takes that question head on. Most prior work on automatic content generation has focused on levels and layouts; far less has tried to generate and evaluate mechanics themselves. The authors' system, MORTAR, treats each mechanic as a snippet of Python code and uses a large language model (LLM, an AI trained on huge amounts of text to generate continuations and replies) to rewrite it bit by bit, keeping and spreading the good ones.

The key idea is that a mechanic cannot be judged in isolation — it only gains meaning once you actually play with it. So MORTAR embeds each new mechanic into a complete game, lets five AI players of differing strength play it, and uses whether stronger players reliably win as its yardstick for quality. It runs on GPT-4o-mini at a reported cost of roughly $30–50 per run, and is explicitly a research prototype. I will walk through what problem it solves, how, and what it found, so you can grasp the gist without opening the paper.

Introduction

Today's paper is "MORTAR: Evolving Mechanics for Automatic Game Design," by Muhammad U. Nasir, Yuchen Li, Steven James and Julian Togelius, affiliated with the University of the Witwatersrand (South Africa) and New York University. Julian Togelius has long been a leading figure in procedural content generation (PCG, the automatic creation of content) and game AI; his name recurs whenever you try to map this field. The paper is a preprint posted to arXiv in January 2026 (arXiv:2601.00105); with no confirmable peer-review status, I treat it here as a preprint.

I chose it today because it is a concrete example of the generation conversation finally moving from 'levels' to 'the rules themselves,' which I think is rich with implications for people who make games. Level generation has plenty of prior work, but having a machine invent mechanics from scratch — say, 'pick up a key to open a door' or 'touching an enemy pushes you back' — makes evaluation dramatically harder. The authors confront exactly that. The paper also notes that the generated games can be played on a web page the authors released.

Background

PCG is a long-studied area with broad uses: runtime generation in roguelikes, ideation tools for designers, automating repetitive production. But most of that attention has gone to the 'structure' of terrain and levels, because a level is relatively easy to score: is it solvable, is it novel?

Mechanics, the rules of interaction in a game, have by contrast received comparatively little attention. The authors stress that a mechanic's value only emerges within the play dynamics it induces. A mechanic that looks novel and complex can still be boring if it does not produce play where skill differences show up in the outcome. So the hard, essential problem is not just generation but the design of evaluation.

This matters because mechanics shape the skeleton of player experience: not only what players can do, but what strategies and emergent behaviours become possible. Automating this could become a tool that widens a designer's imagination — though the authors are explicit that the system is not meant to produce whole games and aims to empower rather than replace designers.

Approach

MORTAR is built on a quality-diversity algorithm (Quality-Diversity: a search framework that, rather than finding a single best solution, collects good solutions spread across different characteristics), specifically the MAP-Elites variant. Each mechanic is represented as a Python function and filed along two 'shelf axes': mechanic type (nine kinds — movement, interaction, combat, progression, environment, puzzle, resource management, exploration, time manipulation) and code complexity (estimated by parsing the program's structure and counting things like function calls and assignments). Each cell of this grid holds one good mechanic.
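To make the two shelf axes concrete, here is a minimal sketch of my own, not the authors' code. It assumes complexity is a plain count of calls and assignments found with Python's standard `ast` module, and that each cell keeps only the fittest mechanic seen so far. The names `code_complexity`, `complexity_bin` and `try_insert`, the bin width and the fitness values are all hypothetical.

```python
import ast

# The nine mechanic types listed in the digest.
MECHANIC_TYPES = (
    "movement", "interaction", "combat", "progression", "environment",
    "puzzle", "resource management", "exploration", "time manipulation",
)


def code_complexity(source: str) -> int:
    """Count function calls and assignments in the parsed program (assumed proxy)."""
    tree = ast.parse(source)
    counted = (ast.Call, ast.Assign, ast.AugAssign, ast.AnnAssign)
    return sum(isinstance(node, counted) for node in ast.walk(tree))


def complexity_bin(score: int, bin_width: int = 5, n_bins: int = 10) -> int:
    """Map a raw complexity count to a grid column (bin sizes are assumptions)."""
    return min(score // bin_width, n_bins - 1)


def try_insert(archive: dict, mech_type: str, source: str, fitness: float) -> bool:
    """MAP-Elites style insertion: keep the newcomer only if its cell is empty or it is fitter."""
    if mech_type not in MECHANIC_TYPES:
        raise ValueError(f"unknown mechanic type: {mech_type}")
    cell = (mech_type, complexity_bin(code_complexity(source)))
    incumbent = archive.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        archive[cell] = (fitness, source)
        return True
    return False


# Hypothetical mechanic: touching an enemy pushes the player back.
example_source = '''
def push_back(player, enemy):
    dx = player.x - enemy.x
    player.x += sign(dx)
    return player
'''

archive = {}
try_insert(archive, "combat", example_source, fitness=0.4)
```

The source is only parsed, never executed, so the undefined `sign` helper is harmless here. A weaker candidate landing in the same cell would be rejected, which is what keeps one good mechanic per cell.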

New mechanics are produced by having the LLM rewrite existing ones. There are four operators: mutation (adding functionality to one mechanic), diversity mutation (showing the LLM three mechanics and asking for a distinctly different variant), crossover (merging two), and compatibility mutation (generating one that fits an existing game). These are evolutionary-computation operations (improving solutions by repeated variation and selection, like biological evolution), and all of them are carried out by the LLM as actual code edits. Generated code is first checked for syntax and runtime errors, then poked in a simple test environment by an MCTS agent (Monte Carlo Tree Search, a method that tries out many possible future moves to pick a good one) to confirm it works and is non-trivial.
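The first filtering step can be pictured with a small sketch. This is an illustration under my own assumptions, not the paper's pipeline: the function name `check_mechanic`, the dictionary-based game state, and the crude "did the state change?" test are all hypothetical, and the last one stands in for the MCTS probing, which is not reproduced here.

```python
import ast


def check_mechanic(source: str, func_name: str, sample_state: dict) -> tuple[bool, str]:
    """Reject generated code that fails to parse, crashes, or does nothing."""
    # Step 1: syntax check without running anything.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"

    # Step 2: runtime check on a copy of a sample state.
    # exec runs untrusted code; a real system would need a sandbox.
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        new_state = func(dict(sample_state))
    except Exception as exc:
        return False, f"runtime error: {type(exc).__name__}"

    # Step 3: crude non-triviality test (an assumed stand-in for agent probing).
    if new_state == sample_state:
        return False, "no effect on the sample state"
    return True, "ok"


good = "def step(state):\n    state['x'] += 1\n    return state\n"
broken = "def step(state) return state"
crashing = "def step(state):\n    return state['missing']\n"
noop = "def step(state):\n    return state\n"

results = [check_mechanic(src, "step", {"x": 0}) for src in (good, broken, crashing, noop)]
```

Only the first candidate survives. The other three fail at the syntax, runtime and non-triviality steps respectively.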

The crucial evaluation works as follows. Starting from one mechanic as the root, a tree search assembles a complete game by grafting on further mechanics. That game is played by five AI agents of differing strength — three MCTS agents with 100,000, 10,000 and 1,000 trial rollouts, a random agent, and a do-nothing agent — and the agreement between the 'expected strength order' and the 'observed win-rate order' is scored with a rank correlation (a measure of how well two orderings agree). The better that 'the strong are indeed strong' order holds, the more the game is deemed to reward skill and have depth. I omit the formula itself; in essence, 'do results match true ability?' becomes a proxy for quality.
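As a rough illustration of the scoring idea, the sketch below computes a rank correlation between expected strength and observed win rate. I use Kendall's tau here as an assumption, since this digest does not pin down which rank correlation the paper uses, and the win rates are made-up numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(expected_strength: list[float], observed_win_rate: list[float]) -> float:
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(expected_strength)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (expected_strength[i] - expected_strength[j]) * (
            observed_win_rate[i] - observed_win_rate[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties (product == 0) count for neither side.
    return (concordant - discordant) / (n * (n - 1) / 2)


# Agents in the order given above: three MCTS budgets, random, do-nothing.
# Higher number means expected to be stronger.
expected = [5, 4, 3, 2, 1]

# Hypothetical win rates where the two weaker MCTS agents swap places.
win_rates = [0.90, 0.70, 0.75, 0.20, 0.00]

score = kendall_tau(expected, win_rates)  # 9 concordant, 1 discordant pair -> 0.8
```

With five agents there are only ten pairs, so under this measure the score moves in steps of 0.2: one swapped pair already drops a perfect 1.0 to 0.8.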

The authors also introduce a measure called CITS (Constrained Importance Through Search). It estimates how much each mechanic contributed to the finished game's quality, inspired by the 'Shapley value' from cooperative game theory (a way to fairly distribute credit among contributors). That computation would normally explode combinatorially, but MORTAR approximates it only within the search tree it built during generation, keeping it tractable. It is a device for pointing, after the fact, at which rule is generating the fun.
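For readers who have not met the Shapley value, here is a small sketch of the classical, exact version, which is what CITS is inspired by and not CITS itself. It averages each mechanic's marginal contribution over every possible ordering, which is exactly the combinatorial blow-up described above. The three mechanic names and the toy quality function are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


def toy_quality(mechanics: frozenset) -> float:
    """Hypothetical game quality: key and door only matter together, dash helps alone."""
    quality = 0.0
    if "key" in mechanics and "door" in mechanics:
        quality += 0.6
    if "dash" in mechanics:
        quality += 0.2
    return quality


credit = shapley_values(["key", "door", "dash"], toy_quality)
```

In this toy case the synergy between key and door is split evenly between them, dash keeps its own 0.2, and the credits add up to the quality of the full game. Three mechanics need 6 orderings and ten would need over 3.6 million, which is why restricting the computation to an already-built search tree is attractive.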

Findings

The authors compared their core contribution, tree-search-based composition for evaluation, against other selection strategies (random, LLM-prompted, and greedy by best fitness). Per Table 1, MORTAR was best on the quality-diversity score (QD score 31.18), on the maximum and mean mechanic-contribution scores (Max CITS 0.59, Mean CITS 0.20), and on the number of filled archive cells (155). Only on the 'share that became a playable game' did greedy selection edge ahead (18.24 vs MORTAR's 16.97), which the authors read as a hint that higher-fitness mechanics tend to yield playable games.

Individual games are telling. AllyCraft scored a rank correlation of 0.8: with summonable allies and multi-unit control it offered branching strategy and retained depth. TreasureHunt, by contrast, sat at 0.4, which the authors attribute to the game losing value once the optimal path is found. The contrast is that higher-correlation games sustain several viable strategies and resist getting stale.

A human evaluation was also run. Ten participants played six games in three pairs and compared them on interestingness, novelty, fun, ease of understanding, and frustration. Total scores and rank correlation pointed largely the same way, but the third pair went the opposite direction, and the authors themselves acknowledge the difficulty of aligning automated metrics with human taste. The second pair drew many 'Neither' votes (7), which they read as a sign of 'too complex to feel meaningful,' concluding that mini-games benefit from appropriate rather than maximal complexity.

Use cases

So how can a game or puzzle maker use this? Some concrete examples. First, ideation for mechanics. If I were stuck on a small puzzle title, I could hand a system like MORTAR my existing rules as seeds and use compatibility mutation to spit out many 'candidate rules that fit what I already have' — not as finished products, but as rough drafts for a human to select from.

Second, an automated check on whether skill is rewarded. The paper's evaluation method — let agents of differing strength play and see whether the ranking holds — can be reused independently of mechanic generation. If I were building a Sokoban-like (a box-pushing puzzle), I could prepare several solvers of varying strength and run a lightweight 'depth measurement': do weaker solvers drop out on harder levels? A level where the ranking collapses may be one solvable by luck or brute force.

Third, isolating which rule is doing the work. The CITS idea — decomposing a finished game's fun into the contributions of individual rules — helps when tuning games where several systems intertwine. If I were mixing multiple gimmicks in a hyper-casual PCG pipeline, I could remove gimmicks one at a time and watch the metric, an approach close to an ablation study (testing which part of a design matters by removing elements one by one), to decide which low-contribution gimmicks to cut.
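The remove-one-at-a-time idea is simple enough to sketch. This is a plain leave-one-out ablation of my own, not the paper's method: the gimmick names and the `toy_score` function are hypothetical stand-ins for whatever metric your own pipeline produces.

```python
def leave_one_out(gimmicks, score):
    """For each gimmick, report how much the score drops when only it is removed."""
    full_set = frozenset(gimmicks)
    full_score = score(full_set)
    return {g: full_score - score(full_set - {g}) for g in gimmicks}


def toy_score(active: frozenset) -> float:
    """Hypothetical metric: portal helps alone, magnet and ice only help together."""
    total = 0.0
    if "portal" in active:
        total += 0.5
    if "magnet" in active and "ice" in active:
        total += 0.3
    # "confetti" is decorative and never changes the score.
    return total


drops = leave_one_out(["portal", "magnet", "ice", "confetti"], toy_score)
cut_candidate = min(drops, key=drops.get)
```

Here confetti shows a drop of zero and is the obvious thing to cut. Note the catch with paired gimmicks: magnet and ice each show the full 0.3 drop because removing either one breaks the pair, so leave-one-out overstates their individual credit.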

Fourth, for reinforcement-learning researchers (reinforcement learning: a framework for learning behaviours that earn higher reward through trial and error), the authors envision the diverse set of skill-discriminating games as a testbed for an agent's generalisation (its ability to cope with unseen situations). The intriguing duality here is that the work serves not only those who make games but also those who train AI.

Limitations

The limitations are stated candidly. The authors first admit visual poverty: rendering is minimal, there are no animations, and sprites (the character art) are limited; the paper notes that user-study participants repeatedly flagged the visual shortcomings. Second, the LLM used was the relatively small GPT-4o-mini; they say a stronger model could yield more sophisticated mechanics and code. They further note that the 2D top-down setting constrains the search space, that archive initialisation and the number of tree-search iterations trade quality against compute, and — most importantly — that there is currently no mechanism for a designer to steer the search.

What I would point out here is a bias in the metric itself. Using 'do strong AIs reliably win?' as a proxy for quality is clear, but it is also a yardstick that favours competitive, skill-discriminating games. Narrative or atmospheric games, or experiences where 'the same ending for everyone is fine,' could score poorly under it. Indeed, I read the third pair's disagreement between metric and taste in the human study as possibly marking exactly that boundary.

One more thing that struck me is the small scale. The human study covered ten people and six games, and the authors themselves label it 'small.' Even averaged over five runs, the results are at a stage where generalising the conclusions needs further validation. This is also an arXiv preprint with no confirmable sign of broad discussion as of writing, so it is best received as a fresh proposal whose evaluation is not yet settled.

Fukai's reading

Here I flag that this is my own interpretation. I want to place this work within a shift in automatic generation — from 'visible artifacts' (levels, terrain) toward 'invisible structure' (the relationships between rules). In the vocabulary of design criticism, what MORTAR does is close to 'making fun operable as the degree to which skill is reflected in outcomes.' It translates the vague notion of fun first into 'does the ranking come out as ability would predict,' then dissects that into per-rule contributions — and I read this two-step translation as the heart of the paper. It is not a perfect yardstick, but I would say its value lies in having made mechanics something we can talk about.

Closing

For those who want to go deeper, here is a route that can serve as a map. From the same Togelius orbit comes ScriptDoctor (Earle et al., 2025), which generates PuzzleScript puzzles with an LLM and tree search — we have covered it before on this site. GAVEL (Todd et al., 2024), which evolves board games with quality-diversity, and Pixie (Cook, 2025), which evolves mechanics at the code level, sit well beside MORTAR and reveal differences in 'what to generate and how to evaluate it.' Read together, they bring the outline of automatic game design into focus.

Personally, the next step I hope for is the 'designer-steerable' version the authors list among the limitations. An ideation tool becomes a real collaborator only when it generates not just freely but 'in line with your intent.' Perhaps the best way to read this paper is to actually go play the generated games, with a hot, strong drip coffee in hand.

References

Papers and related material referenced in this article:

・MORTAR: Evolving Mechanics for Automatic Game Design (Nasir, Li, James, Togelius, 2026, arXiv preprint)

・Related: GAVEL: Generating Games via Evolution and Language Models (Todd et al., 2024)

・Related: Pixie: Code-level Mechanic Generation for Game Designers (Cook, 2025, AIIDE)
