PAPER-REVIEW · 2026-06-10

Can AI Build a Whole Puzzle Game? ScriptDoctor and Its Generate-Playtest-Repair Loop

Fukai's paper digest — Earle et al. on automatic PuzzleScript generation (arXiv:2506.06524)

Reviewed by Fukai · #paper-review #research #puzzlescript #llm #automatic-game-design #procedural-generation #increpare #sokoban

Introduction

Nice to meet you. This is Fukai's paper digest. I read academic papers on puzzles and game design, translate the jargon into everyday words, and walk through each one in five parts: problem, method, findings, where you can use it, and limitations. For the first installment, I picked a study that asks an AI to build an entire puzzle game.

The paper is ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search, posted to arXiv on June 6, 2025. It is co-authored by seven researchers — led by Sam Earle of New York University, with colleagues from the University of Malta, the University of the Witwatersrand, and Microsoft — and was submitted as a short paper to the IEEE Conference on Games (CoG). One of the authors, Julian Togelius, is among the best-known figures in procedural content generation research.

Why this paper first? Because the subject is PuzzleScript. Released by Stephen Lavelle (increpare) in 2013, this tiny domain-specific language describes a whole Sokoban-style game (rules, sprites, and levels) in a few dozen lines of text, and indie developers worldwide, Hempuli of Baba Is You among them, have used it for prototyping. Here, a world familiar to readers of this site serves as the laboratory.

The Problem: How Do You Verify 'an AI Made a Game'?

Scroll the right social feeds and you'll find no end of 'I had an AI make a game' posts. The authors start from a plain question about that spectacle: is it a good game? Is it more than a slight rehash of something in the training data? And is there any way to check other than a human sitting down to play each one?

For ordinary games written in JavaScript or Python, there is essentially no way for a machine to play and assess quality automatically. As long as evaluation depends on humans, experiments can't scale, and we can't systematically study how to prompt for better games. Before asking whether AI can make games, we lack a ruler for whether it did. That is the problem this paper takes on.

So the authors chose PuzzleScript as a small experimental world — the way biologists study E. coli or fruit flies as 'model organisms.' Three reasons. First, a whole game fits in a short text, which suits a language model's output. Second, the engine is lightweight with standardized input and output, so machines can play thousands of rollouts. Third, the body of published PuzzleScript games is small enough to roughly know in full — a rare setting where 'is this actually new?' can someday be asked seriously.

The Method: A Loop of Generation, Inspection, and Repair

ScriptDoctor is, in a phrase, a generate-inspect-repair loop. An LLM (GPT-4o, o1, and o3-mini in the experiments) writes a complete PuzzleScript program — object definitions, rules, win conditions, levels. Its prompt is padded with PuzzleScript documentation, examples drawn from a database of 610 human-authored games, and design ideas produced by a separate 'brainstorming' agent.

Inspection happens in two stages. First, compilation: the generated code goes through the PuzzleScript engine, and any errors or warnings are bounced straight back to the LLM. A parser built from a context-free grammar of PuzzleScript (written with the Python library Lark) flags syntax errors as well.

If the code compiles, stage two begins: the machine actually plays. Breadth-first search (BFS), a methodical, near-exhaustive search that checks shallow moves first, explores each level up to a cap of one million unique states and reports three numbers: whether the level is solvable, how long the solution is, and how many states were examined to find it. Those numbers are fed back to the LLM via a server.
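To make the three numbers concrete, here is a minimal sketch of a capped breadth-first search on a toy Sokoban level. It is not the paper's solver and does not run PuzzleScript: the level format ('#' wall, '@' player, '$' crate, '.' target), the function names, and the win test (every crate on a target) are my own assumptions for illustration.

```python
from collections import deque

# Hypothetical toy level format (not PuzzleScript):
# '#' wall, '@' player, '$' crate, '.' target, ' ' floor.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def parse(level):
    """Split a list of strings into walls, crates, targets, and the player position."""
    walls, crates, targets, player = set(), set(), set(), None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "$":
                crates.add((r, c))
            elif ch == ".":
                targets.add((r, c))
            elif ch == "@":
                player = (r, c)
    return walls, frozenset(crates), frozenset(targets), player


def bfs_solve(level, max_states=1_000_000):
    """Return (solvable, solution, states_examined) for a toy Sokoban level.

    Hitting max_states is reported as unsolvable, so False really means
    "no solution found within the budget".
    """
    walls, crates, targets, player = parse(level)
    start = (player, crates)
    seen = {start}
    queue = deque([(start, "")])
    examined = 0
    while queue:
        (pos, boxes), path = queue.popleft()
        examined += 1
        if boxes == targets:  # assumed win test: every crate sits on a target
            return True, path, examined
        for name, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt in walls:
                continue
            new_boxes = boxes
            if nxt in boxes:  # walking into a crate tries to push it
                beyond = (nxt[0] + dr, nxt[1] + dc)
                if beyond in walls or beyond in boxes:
                    continue
                new_boxes = (boxes - {nxt}) | {beyond}
            state = (nxt, new_boxes)
            if state not in seen and len(seen) < max_states:
                seen.add(state)
                queue.append((state, path + name))
    return False, None, examined
```

Because BFS expands states in order of depth, the first solution it returns is a shortest one, which is why solution length works as a measurement at all. The count of examined states is a rough stand-in for how much of the level the search had to explore before finding it.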

The LLM gets ten chances. A trial succeeds the moment every level admits a solution longer than ten moves; otherwise the model is shown its previous code and the inspection results and told to revise. The design philosophy: take the human designer's build-play-fix cycle and run it, as far as possible, on machines alone.
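The control flow described here can be sketched in a few lines. This is my own reading of the loop, not the authors' code: `generate` and `inspect` are hypothetical callables standing in for the LLM call and the compile-plus-search stage, and the data classes are invented for the example. I also assume a game with no levels at all should not count as a success.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LevelReport:
    solvable: bool
    solution_length: int  # 0 when no solution was found


@dataclass
class Inspection:
    compile_errors: List[str]
    levels: List[LevelReport]


def passes(inspection: Inspection, min_moves: int = 10) -> bool:
    """Success test from the text: compiles, and every level has a solution longer than min_moves."""
    return (
        not inspection.compile_errors
        and bool(inspection.levels)  # assumption: zero levels is a failure
        and all(
            lv.solvable and lv.solution_length > min_moves
            for lv in inspection.levels
        )
    )


def repair_loop(
    generate: Callable[[Optional[str], Optional[Inspection]], str],
    inspect: Callable[[str], Inspection],
    max_attempts: int = 10,
) -> Tuple[Optional[str], int]:
    """Generate, inspect, and revise until the game passes or attempts run out.

    Returns (code, attempts_used); code is None if no attempt passed.
    """
    code: Optional[str] = None
    feedback: Optional[Inspection] = None
    for attempt in range(1, max_attempts + 1):
        # The model sees its previous code and the inspection results.
        code = generate(code, feedback)
        feedback = inspect(code)
        if passes(feedback):
            return code, attempt
    return None, max_attempts
```

Keeping the success test in its own function makes the threshold easy to change, which matters because "longer than ten moves" is a proxy chosen by the authors, not a law of puzzle design.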

Findings: Examples Help, Reasoning Models Win, and 'Broken but Interesting'

The clearest result first. Without human-made examples (zero-shot), only 30% of GPT-4o's games compiled, and 0% had a level the search agent could solve. Put examples in the prompt (few-shot), and compilation jumps to 70–80%, with more than half the games gaining solvable levels. PuzzleScript isn't a major language like Python, so worked examples matter decisively — the authors note this was unsurprising even to them.

The model comparison is just as interesting. Under the same few-shot, 10k-token conditions, ten games each: GPT-4o compiled 67% of the time, reasoning model o1 87%, o3-mini 93%. On 'all levels solvable,' o3-mini's 20% was the best. In this small sample, models that think before they write look better at assembling coherent rules and levels. Meanwhile, stuffing more examples into the prompt hit diminishing returns around 30,000 tokens.

The paper's showpiece is 'Unconventional PushPull,' generated by o1. On top of the familiar push-and-pull, it adds a twist — when the player's path is blocked, a crate slides one extra tile — plus a switch and a gate. The shortest solution is 34 moves, and even the methodical BFS had to examine 1,006 states to find it. The authors' verdict: tricky but sensible to a human player.

It isn't all good news, though, and the authors are candid: the most complex generated games tended to be complex 'in spite of, or even as a result of' broken mechanics. In one generated game where a wizard can teleport through or destroy walls, presumably unintended interactions ended up producing the most interesting solutions. And the generated sprites were abstract — sometimes literally invisible — to the point that the paper's figures swap in art from human-made games. Writing rules is one thing; aligning look and intent is still far off.

Where You Can Use It: Takeaways for Developers and Players

For indie developers, the most portable takeaway isn't the AI itself but the design habit of automated solver verification. Feed your levels to a search algorithm and look at solution length and search effort — a quality check you can copy today, no LLM required. A level with a three-move solution, or one that even brute force can't crack, will show up in the data before a playtester ever sees it.
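As a rough illustration of that habit, the sketch below sorts solver reports into three buckets. Everything in it is hypothetical: the `SolverReport` fields, the verdict strings, and the `min_moves` threshold are my own choices, and the reports would come from whatever solver you already run on your levels.

```python
from typing import List, NamedTuple, Optional, Tuple


class SolverReport(NamedTuple):
    level: str
    solution_length: Optional[int]  # None: nothing found within the search budget
    states_examined: int


def triage(reports: List[SolverReport], min_moves: int = 4) -> List[Tuple[str, str]]:
    """Label each level 'unsolved within budget', 'too short', or 'ok'."""
    verdicts = []
    for report in reports:
        if report.solution_length is None:
            verdict = "unsolved within budget"
        elif report.solution_length < min_moves:
            verdict = "too short"
        else:
            verdict = "ok"
        verdicts.append((report.level, verdict))
    return verdicts
```

An "unsolved within budget" verdict is a prompt to look closer, not proof that the level is broken: the search may simply have run out of states. A short solution is likewise only a flag, since a deliberately gentle tutorial level will trip the same threshold.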

If you do want to use it the ScriptDoctor way, think of it as a draft factory. The ten-round self-repair loop is a rough automation of build-play-fix. On day one of a game jam, have it pitch ten rule twists, throw away nine, polish one. What Unconventional PushPull suggests is that the one might be surprisingly decent.

For players, the paper reads as a ruler for the distance between solvable and fun. BFS node counts measure difficulty for a computer, which is not the same thing as the human pleasure of insight. Put the other way around: whatever great human-made puzzles have — communicated intent, engineered misdirection, the click of a solution that makes sense — stands out in outline precisely because this paper's instruments can't see it.

In the research context, the authors propose using the pipeline's output — games guaranteed to compile and be solvable — as a dataset for fine-tuning smaller models. A generator with a built-in verifier doubles as a factory for teaching material.

Limitations: What This Paper Does Not Measure

Housekeeping first. This is a five-page short paper, and at the time of its arXiv posting it was at the 'submitted to CoG' stage. Each experimental condition covers only 10–20 generated games, so the tables are best read as trends rather than rigorous comparisons.
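To see why samples this small support trends and not rankings, it helps to put an uncertainty band around a success rate. The sketch below uses the Wilson score interval, a standard choice for small binomial samples. It is my own illustration, not an analysis from the paper, and the counts mentioned afterwards are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

With a hypothetical 2 successes in 10 games, `wilson_interval(2, 10)` spans roughly 6% to 51%. With 4 successes in 20 it still spans roughly 8% to 42%. Bands that wide overlap heavily between conditions, so a gap of a few games between two models is weak evidence either way.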

Next, the limitation that matters most. The study measures whether games compile, whether they're solvable, and whether solutions are long — not whether they're fun. There is no human player evaluation, and the authors themselves note that many beloved human-made games have short solutions yet remain interesting. Solution length and search effort are only crude proxies for fun.

The technical weaknesses are stated plainly too. LLMs struggle with the spatial reasoning level design demands, and left alone tend to produce levels with overly short solutions. Asked to add challenge, they often just stretch the level by a few rows or scatter some obstacles. Nor can the system diagnose on its own that a mechanic is broken relative to intent; as future work, the authors suggest showing gameplay footage to a vision-language model to spot such flaws.

Finally, the opening question — is this just a rehash of the training data? — is not actually answered here either. Measuring similarity against the 610 existing games is left as future work, and the real payoff of choosing PuzzleScript, a world whose entire published corpus is roughly knowable, will come only when that check is built.
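For a sense of what the crudest version of that check might look like, here is a sketch that finds the nearest neighbor of a generated game by raw text similarity, using Python's standard `difflib`. This is purely my own illustration, not something the paper proposes: the function names and the in-memory `corpus` dictionary are assumptions, and text overlap says little about whether two games play alike.

```python
import difflib
from typing import Dict, Optional, Tuple


def normalize(source: str) -> str:
    """Lowercase, trim each line, and drop blank lines so layout noise does not dominate."""
    lines = (line.strip().lower() for line in source.splitlines())
    return "\n".join(line for line in lines if line)


def nearest_game(candidate: str, corpus: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return (name, ratio) of the most textually similar corpus entry.

    ratio is difflib's similarity score in [0, 1]; (None, -1.0) for an empty corpus.
    """
    target = normalize(candidate)
    best_name, best_ratio = None, -1.0
    for name, source in corpus.items():
        matcher = difflib.SequenceMatcher(
            None, target, normalize(source), autojunk=False
        )
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name, best_ratio
```

A score near 1.0 would flag a near-copy worth a human look. A low score proves much less, because renaming objects or reordering rules can disguise a familiar mechanic, which is part of why a serious novelty measure remains open work.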

References

・ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search (Sam Earle, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Graham Todd, Andrzej Banburski-Fahey, Julian Togelius; arXiv:2506.06524, posted June 6, 2025; CC BY 4.0)

・PuzzleScript (Stephen Lavelle's scripting language for puzzle games, released 2013; source on GitHub)

・PuzzleScript games database (source of the 610 human-authored games the paper uses as few-shot examples)

・Related work: GAVEL: Generating Games via Evolution and Language Models (Todd et al., 2024 — an earlier system combining evolutionary search and LLMs in the Ludii framework)

In Closing

Talk of AI making games tends to swing between hype and dread. What I like about this paper is that it stops short of both and builds the measuring device first. What the device can see, so far, is only 'it compiles, and brute force can solve it' — most of what we actually enjoy in a puzzle doesn't register. But the outline of what can't be measured only became clear because someone tried measuring.

That's it for the first installment of Fukai's paper digest. Next time, another study on the design of play, read through the same five lenses.
