DESIGN-ROUNDUP · 2026-07-03

Letting an LLM build a whole game, and an AI playtest it: ScriptDoctor and the state of automatic game design

Introduction

Tsumiki's design roundup — one piece today.

Today's piece is "ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search," from an international team led by Sam Earle and Julian Togelius at New York University, with co-authors from the University of Malta, the University of the Witwatersrand (Johannesburg, South Africa) and Microsoft. It was posted to arXiv (2506.06524) in June 2025 and submitted as a short paper to the IEEE Conference on Games (CoG). I read it in the original English.

It is a researchers' paper rather than a designer's, but it takes on one of the most immediate design questions of the moment: can an LLM build a whole game, and can that game be playtested automatically without a human in the loop? For someone like me who aspires to make games, it felt like a grounded, sober measurement of just how far generative AI can push into the design process.

A note: today I could not find a non-English primary source that I could verify, in its original language, to my credibility standard, so I kept the roundup to one item rather than force a second. I will note, though, that this paper is itself a multinational effort spanning the US, Malta and South Africa.

ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search

The authors begin by puncturing the social-media spectacle of "an LLM made a game." Even when it looks like it works, we lack the means to measure quality — for now, only humans can play and judge — and because a modern LLM's training data contains countless small games, there is a lingering suspicion that anything generated is at best a minor variation on something it has already seen. To fill this evaluation gap, they deliberately avoid free-form generation in JavaScript or Python and pick a highly constrained domain-specific language, PuzzleScript, as a "model organism." PuzzleScript, created by Stephen Lavelle (increpare) (see our designer page on Lavelle), lets you describe a complete turn-based 2D-grid puzzle game in a small amount of text, and hundreds of authors have published thousands of games in it. Its advantages line up neatly: games are compact in text; diverse, high-quality works are possible; the input/output format is standardized so AI playtesting is lightweight; and we have an approximate record of what has ever been published (so novelty can be gauged).

ScriptDoctor itself is an iterative loop. The LLM emits PuzzleScript code; the code is compiled by the PuzzleScript engine (a fork of increpare's); any errors or warnings are fed back to the LLM. The authors also define PuzzleScript's syntax as a context-free grammar (CFG) with the Lark library, which lets them catch syntax errors and repair code where possible. If the script compiles, each level is solved by breadth-first search (BFS), which reports solvability, the number of expanded nodes, and solution length (search runs until a solution is found or it reaches one million nodes). The LLM is given ten chances to output functional code; a script counts as a success when it compiles and every level admits a solution longer than ten moves, at which point the trial ends.
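To make the loop concrete for myself, here is a minimal sketch of the playtest half and the retry logic around it. It is my own illustration, not the authors' code: the corridor puzzle is a toy stand-in for a compiled PuzzleScript level, and `generate` and `compile_script` are hypothetical hooks for the LLM call and the compiler. Only the three constants come from the description above.

```python
from collections import deque

MAX_NODES = 1_000_000   # search cap described in the text
MIN_MOVES = 10          # every level needs a solution longer than this
MAX_ATTEMPTS = 10       # chances the LLM gets per trial


def bfs_playtest(start, is_win, successors, max_nodes=MAX_NODES):
    """Breadth-first search; returns (solvable, expanded_nodes, solution_length)."""
    frontier = deque([(start, 0)])
    seen = {start}
    expanded = 0
    while frontier and expanded < max_nodes:
        state, depth = frontier.popleft()
        if is_win(state):
            return True, expanded, depth
        expanded += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False, expanded, None


def corridor_level(width, player, crate, goal):
    """Toy level (an assumption): push a crate along a 1-D corridor to the goal."""
    def successors(state):
        p, c = state
        for step in (-1, 1):
            new_p, new_c = p + step, c
            if new_p == c:              # walking into the crate pushes it
                new_c = c + step
            if 0 <= new_p < width and 0 <= new_c < width:
                yield (new_p, new_c)
    return (player, crate), (lambda s: s[1] == goal), successors


def run_trial(generate, compile_script, max_attempts=MAX_ATTEMPTS):
    """Generate, compile, playtest; feed messages back to the LLM on failure.

    `generate(feedback)` returns code and `compile_script(code)` returns
    (levels or None, messages). Both are hypothetical hooks, not a real API.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        levels, messages = compile_script(code)
        if levels is None:              # compile error: report it and retry
            feedback = messages
            continue
        reports = [bfs_playtest(*level) for level in levels]
        if all(ok and length > MIN_MOVES for ok, _, length in reports):
            return attempt, reports     # success ends the trial
        feedback = f"playtest reports: {reports}"
    return None, []
```

Calling `bfs_playtest(*corridor_level(16, 0, 2, 14))` finds a 13-move solution, which clears the ten-move bar; moving the goal to cell 1, behind the crate, makes the level unsolvable and the search simply runs out of states. Note that nothing in this check asks whether the level is any fun.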

The experiments compare zero-shot and few-shot prompting. For few-shot, the authors pack randomly sampled human-made games from a dataset of 610 PuzzleScript titles into the context, up to its limit. They also vary chain-of-thought prompting, the model (GPT-4o, o1, o3-mini) and the context length (10k–70k tokens). The findings are clear. Few-shot prompting with human examples plainly raises the compile rate, the share of solvable games, and solution complexity. The reasoning models o1 and o3-mini beat the non-reasoning GPT-4o on both functionality and complexity (consistent with the boost from chain-of-thought). Longer context raises the compile rate, but quality gains plateau beyond 30,000 tokens. One puzzle generated by o1, "Unconventional PushPull," featured pushing and pulling crates plus a one-tile slide when obstructed, along with a switch and gate; its most direct solution is 34 moves and takes BFS 1,006 iterations, a difficulty the authors describe as "tricky but sensible" to a human.
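The few-shot packing step described above can be sketched roughly as follows. This is my own guess at the shape of it, not the paper's implementation: the four-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and skipping games that do not fit (rather than stopping at the first one) is my choice, since the paper summary here does not say which the authors do.

```python
import random


def rough_token_count(text):
    """Crude assumption: about 4 characters per token. A real tokenizer differs."""
    return max(1, len(text) // 4)


def pack_few_shot(games, token_budget, count_tokens=rough_token_count, seed=None):
    """Shuffle example games and add them to the prompt until the budget is full.

    Returns (chosen_games, tokens_used). Games that would overflow the budget
    are skipped so that smaller ones later in the shuffle can still be used.
    """
    rng = random.Random(seed)
    order = list(games)
    rng.shuffle(order)
    chosen, used = [], 0
    for game in order:
        cost = count_tokens(game)
        if used + cost > token_budget:
            continue
        chosen.append(game)
        used += cost
    return chosen, used
```

With a budget of 30,000 tokens, say, `pack_few_shot(dataset, 30_000, seed=0)` would return a random subset of the example games that fits the window, and a different seed gives a different subset.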

Why it matters

Automatic Game Design (AGD) is a long-running thread within puzzle-design theory. Research on algorithmically generating rules and levels goes back decades, but the perennial obstacle has always been how to measure the "goodness" of what is generated. This paper's contribution is clear: it shifts weight from generation to verification (automatic playtesting), and by wiring a search-based player and a grammar parser into the LLM it offers one concrete example of a longer-horizon, human-free pipeline. Choosing PuzzleScript as the testbed is well judged, too, grounded as it is in the culture the puzzle-design community has accumulated since increpare created it.

What struck me most, as an aspiring designer, is the failure analysis. The authors write candidly that the most complex-looking generated games were often solvable or complex in spite of, or even because of, broken mechanics. When BFS reports a level as "solvable," that does not mean the level plays as intended or is interesting: solvable is not the same as good. It is an obvious line, yet one that automation easily overlooks, and the paper presses it home with data. The more you want generative AI as a design partner, the more you must keep that line in view.

A line that stayed with me

From the original English:

"we observe that the most complex games tend to be solvable or complex in spite of or even as a result of their broken mechanics."

(Original English above; the Japanese edition adds a translation.)

If you make "solvable" your success metric, you end up counting the accidental difficulty produced by broken mechanics as "success." Whether a design is good lives outside the mere existence of a solution — a line I note to myself as much as to anyone.

References

Today's piece:

· ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search (Sam Earle, Ahmed Khalifa, Muhammad Umair Nasir, Zehua Jiang, Graham Todd, Andrzej Banburski-Fahey, Julian Togelius / arXiv:2506.06524, short paper submitted to the IEEE Conference on Games, June 2025)

· Related (on this site): Stephen Lavelle (increpare) — creator of PuzzleScript

In closing

I'm poor at solving puzzles myself, but reading with an eye for how things are designed, I liked that this paper neither hypes nor dismisses generative AI: it calmly separates what it can and cannot do. How to build tools that measure not just "can it be made" but "is it good" — that, I suspect, is where the designer's work still lives.

Tomorrow, again, I hope to read one trustworthy discussion from somewhere in the world, in the original, and pass it along.

