PAPER-DIGEST · 2026-06-20
Li et al.: AutoBG, an AI that supports board game design end-to-end from ideation to finish — Fukai Reads
LLMs, game design assistance, and verifier-gated iterative refinement
TL;DR
Designing a board game is a cognitively heavy creative task: you must think as a designer and experience as a player at the same time, looping through prototyping and playtesting again and again. This paper proposes AutoBG, a system that stitches that whole workflow into one—from a vague initial idea, through iterative rulebook revision, to testing with imagined audiences. The authors are a team from Alaya Lab, the Shanghai Innovation Institute, and Nankai University in China; the work is an arXiv preprint posted on 1 June 2026 (a freshly submitted manuscript that may not yet have passed peer review).
AutoBG is built from four specialized modules: BG-Ideator, which structures ideas through dialogue; BG-Realizer, which writes a complete rulebook from a draft; BG-Critic, which diagnoses flaws; and BG-Persona, which role-plays 150 real player profiles to return individualized impressions. The key design choice is to separate the generator from the critic and to accept a rewrite only when the critic confirms an improvement; the authors call this Verifier-Gated Iteration. So that this article can stand on its own, I will walk through what the system is, how it works, and what the authors found, in that order.
Introduction
I chose this paper today because its ambition is clearly stated: to help game creators end-to-end, from the first idea to a finished product. The title is "AutoBG: A Board Game Design Assistant with Interactive Ideation, Iterative Rulebook Generation, and Individualized Feedback." The lead author is Zizhen Li and the corresponding author is Kaipeng Zhang; the affiliations are Alaya Lab, the Shanghai Innovation Institute, and Nankai University. The arXiv identifier is 2606.01976, posted to cs.HC (the field of human–computer interaction) on 1 June 2026. The code is released on GitHub.
One important caveat. This is an arXiv preprint—a freshly submitted manuscript—and there is no confirmation that it has passed the peer review (prior expert vetting) of a conference or journal. So in this article I present results as "the authors report," and I confine my own interpretation to the final "Fukai Reads" section. The work currently has almost no citations and has not yet been widely discussed; I want to state that up front.
Background
Most prior work on automatic game generation—PCG (Procedural Content Generation)—has focused on producing "content" such as levels, terrain, and maps. Recently, there have been growing attempts to use large language models (LLMs, AIs trained on huge amounts of text to generate continuations and replies) to write the game rules themselves. The authors cite, among related work, studies that generate rules in game description languages and studies that implement board games from natural language and measure accuracy.
But the authors argue that existing systems address only a single stage of the design process, with no mechanism to support the path from initial conception through revision to audience testing. They name three unresolved challenges. First, how to elicit and structure a vague idea. Second, how to improve a rulebook in a closed loop (which requires a critic that can correctly locate flaws and knows when to stop once quality is sufficient). Third, how to predict, individually, the differing reactions of different players.
Another piece of context: recent findings note that letting an LLM correct its own output ("self-correction") can actually degrade quality when no external signal is available. Building on this, the authors adopt a "verifier-equipped" approach that clearly separates the generator from the critic. As for why the domain itself matters, the authors frame board games as one used in education, psychotherapy, and cooperation research, where the quality of the design strongly shapes the experience.
Approach
AutoBG rests on data. The authors extend a predecessor dataset to 2.2K structured rulebooks (spanning 192 mechanics and 190 themes) and 180K quality-filtered real player reviews. All four modules are built on a public model, Qwen3.5-27B, and trained with LoRA (a technique that learns only a small add-on rather than the whole model, lightly specializing it). Model version names are quoted exactly as the paper writes them.
First, BG-Ideator draws out the designer's fuzzy thoughts through multi-turn dialogue and organizes them into a "structured draft" with five fields: concept, classification, mechanics, design intent, and parameters. Next, BG-Realizer converts that draft into a complete rulebook in seven sections (Lore & Objective, Components, Setup, Gameplay Flow, Core Mechanics, Scoring & End Game, FAQ).
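To make the two shapes concrete, here is a minimal sketch of the five-field draft and the seven-section rulebook as plain Python data. The field and section names come from the description above; the class name, the field types, and both helper functions are my own assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field

# Section names as listed above; everything else in this block is hypothetical.
RULEBOOK_SECTIONS = (
    "Lore & Objective", "Components", "Setup", "Gameplay Flow",
    "Core Mechanics", "Scoring & End Game", "FAQ",
)


@dataclass
class StructuredDraft:
    """Five-field draft; the types are assumed, not taken from the paper."""
    concept: str = ""
    classification: str = ""
    mechanics: list[str] = field(default_factory=list)
    design_intent: str = ""
    parameters: dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Fields still empty, i.e. what a dialogue turn could ask about next."""
        return [name for name, value in vars(self).items() if not value]


def missing_sections(rulebook: dict[str, str]) -> list[str]:
    """Sections that are absent or blank in a rulebook keyed by section name."""
    return [s for s in RULEBOOK_SECTIONS if not rulebook.get(s, "").strip()]


draft = StructuredDraft(concept="campus scheduling game",
                        mechanics=["worker placement"])
print(draft.missing_fields())
print(missing_sections({"Setup": "Deal five cards to each player."}))
```

Even without any model behind it, a completeness check like this is a cheap way to see which questions are still open before a draft is handed on.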
The critic, BG-Critic, diagnoses flaws along the MDA framework (Mechanics–Dynamics–Aesthetics, a well-known frame that views a game in three layers—rules, the dynamics they produce, and the fun the player feels; Hunicke et al., 2004). It has three jobs: rating (a score out of 10), diagnostic (outputting the flaw type, severity, affected component, and a fix), and comparison (choosing which of two versions is better). BG-Persona then role-plays the personas of 150 real players and returns impressions and scores aligned to each one's tastes.
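The diagnostic output described above (flaw type, severity, affected component, fix) maps naturally onto a small record grouped by MDA layer. The sketch below is one possible shape under my own assumptions: the severity scale of 1 to 3, the free-text flaw labels, and the sample entries are illustrative and are not the paper's taxonomy or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    MECHANICS = "mechanics"
    DYNAMICS = "dynamics"
    AESTHETICS = "aesthetics"


@dataclass(frozen=True)
class Diagnosis:
    layer: Layer
    flaw_type: str   # free-text label; the paper's flaw taxonomy is not reproduced
    severity: int    # assumed scale: 1 (minor) to 3 (blocking)
    component: str   # which part of the rulebook is affected
    fix: str         # suggested repair


def by_layer(diagnoses):
    """Group diagnoses by MDA layer, most severe first within each layer."""
    grouped = defaultdict(list)
    for d in diagnoses:
        grouped[d.layer].append(d)
    for items in grouped.values():
        items.sort(key=lambda d: -d.severity)
    return dict(grouped)


# Made-up entries purely to show the grouping.
report = by_layer([
    Diagnosis(Layer.MECHANICS, "ambiguous wording", 1, "Setup", "state the hand size"),
    Diagnosis(Layer.DYNAMICS, "phase order", 2, "Gameplay Flow", "reorder the phases"),
    Diagnosis(Layer.MECHANICS, "missing constraint", 3, "Core Mechanics", "restore the limit"),
])
for layer, items in report.items():
    print(layer.value, [(d.severity, d.flaw_type) for d in items])
```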
The heart of it is Verifier-Gated Iteration. BG-Realizer proposes several rewrite candidates, and BG-Critic judges by comparison whether each is genuinely better than the current version. The loop advances to the next version only when the critic confirms an improvement, and it stops once no flaws remain (No_Flaw) or no candidate improves on the current version. In short, the critic acts as a pass/fail gatekeeper, so a rewrite it judges to be a regression is not accepted. Put without equations, the loop is designed so that each accepted revision is at least judged better than the one before it.
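The control flow of this loop can be sketched in a few lines. This is my own minimal reading of the description above, not the authors' implementation: the three callables stand in for BG-Realizer and BG-Critic, the `max_rounds` cap is an assumed safety limit, and accepting the first improving candidate (rather than, say, the best one) is also an assumption.

```python
from typing import Callable, Sequence


def verifier_gated_iteration(
    rulebook: str,
    propose: Callable[[str], Sequence[str]],  # generator: current -> rewrite candidates
    is_better: Callable[[str, str], bool],    # critic comparison: (candidate, current)
    has_flaws: Callable[[str], bool],         # critic diagnosis; False stands in for No_Flaw
    max_rounds: int = 5,                      # assumed safety cap, not from the paper
) -> tuple[str, list[str]]:
    """Return the final rulebook and the list of accepted versions."""
    current = rulebook
    history = [current]
    for _ in range(max_rounds):
        if not has_flaws(current):
            break  # stop condition 1: nothing left to fix
        accepted = next(
            (c for c in propose(current) if is_better(c, current)), None
        )
        if accepted is None:
            break  # stop condition 2: no candidate beats the current version
        current = accepted
        history.append(current)
    return current, history


# Toy stand-ins: a "flaw" is a literal TODO marker in the text.
def propose(text):
    return [text + " TODO", text.replace(" TODO", "", 1)]  # a regression, then a fix

final, history = verifier_gated_iteration(
    "Setup TODO Scoring TODO",
    propose,
    is_better=lambda cand, cur: cand.count("TODO") < cur.count("TODO"),
    has_flaws=lambda text: "TODO" in text,
)
print(final)         # Setup Scoring
print(len(history))  # 3
```

The point of the toy run is the rejected first candidate: the generator is allowed to propose a regression, and the gate simply never lets it through.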
Findings
I present the authors' reported numbers as written. First, on diagnostic quality, the critic BG-Critic scored 6.07 out of 10, against 3.92 for the strongest general-purpose baseline, GPT-5.4. That is, its quality at pinpointing what is wrong and how was reported to be far higher than the general model's. A separate large model is used as the judge for this evaluation (see the limitations below).
On rulebook revision, the authors report that BG-Realizer with Verifier-Gated Iteration reached a 36.7% flaw-free rate, whereas GPT-5.4 self-refining on its own reached only 14.8%. For individualized feedback, BG-Persona achieved 84.3% within-player ordering accuracy—the highest among all compared baselines.
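For readers unfamiliar with the idea of ordering accuracy within a player, here is one plausible way to compute such a number. I have not reproduced the paper's exact definition; the sketch assumes a pairwise version in which, for each player, every pair of games with different true scores counts as correct when the predicted scores fall in the same order. The function name and the sample data are hypothetical.

```python
from itertools import combinations


def within_player_ordering_accuracy(ratings):
    """ratings: {player: [(true_score, predicted_score), ...]}.

    Assumed definition: the share of same-player game pairs, pooled over all
    players, whose predicted order matches the true order. Pairs tied in the
    true scores are skipped; a tie in the predictions counts as a miss.
    """
    correct = total = 0
    for pairs in ratings.values():
        for (t1, p1), (t2, p2) in combinations(pairs, 2):
            if t1 == t2:
                continue  # no ground-truth ordering to recover
            total += 1
            if (t1 - t2) * (p1 - p2) > 0:
                correct += 1
    return correct / total if total else float("nan")


# Made-up scores for two players, purely to exercise the function.
sample = {
    "player_a": [(9, 8.0), (5, 6.0), (2, 7.0)],
    "player_b": [(7, 7.5), (7, 3.0), (4, 5.0)],
}
print(within_player_ordering_accuracy(sample))  # 0.6
```

Comparisons stay inside one player's ratings, so the metric rewards getting each person's relative tastes right rather than matching a global average.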
A human evaluation was also conducted. Among the 22 of 30 participants who had prior experience using general LLMs for design, AutoBG was preferred over general LLMs on feedback helpfulness (6.0 vs. 4.3) and iteration improvement (6.0 vs. 3.7), both on a 7-point scale. The authors also report that 19 of 30 said it reduced "blank-page anxiety" (the initial stall of not knowing where to start) and 24 of 30 said BG-Critic surfaced flaws they would otherwise have missed. A worked example in the paper shows a campus-themed game whose rating rises across revisions: 6.17 → 6.27 → 6.40.
Use Cases
How can people who make games or puzzles take this work home? Some concrete examples. First, if you are building a Sokoban-like (a box-pushing puzzle in the vein of Sokoban), the idea of separating the "generator" from the "critic" applies directly. After auto-generating a level, run it past a separate role that judges "is it solvable, is it too hard," and keep only what passes. The lesson of this paper is that splitting off a gatekeeper prevents regressions better than having a single model self-edit.
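As a sketch of that first idea, the block below plays the "separate role" for a Sokoban-like: a breadth-first search that checks whether a level is solvable and how many moves it takes, plus a gate that keeps only levels passing both checks. This is my own illustration, not something from the paper. It assumes levels are fully enclosed by walls, uses only the basic symbols (`#` wall, `@` player, `$` box, `.` goal), and the `max_moves` and `max_states` thresholds are arbitrary.

```python
from collections import deque


def parse(level):
    walls, boxes, goals, player = set(), set(), set(), None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "$":
                boxes.add((r, c))
            elif ch == ".":
                goals.add((r, c))
            elif ch == "@":
                player = (r, c)
    return walls, frozenset(boxes), frozenset(goals), player


def min_moves(level, max_states=50_000):
    """Fewest player moves to put every box on a goal, or None if not found."""
    walls, boxes, goals, player = parse(level)
    start = (player, boxes)
    seen, queue = {start}, deque([(start, 0)])
    while queue and len(seen) <= max_states:
        (pos, bxs), depth = queue.popleft()
        if bxs == goals:
            return depth
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt in walls:
                continue
            new_boxes = bxs
            if nxt in bxs:  # stepping into a box means pushing it
                beyond = (nxt[0] + dr, nxt[1] + dc)
                if beyond in walls or beyond in bxs:
                    continue
                new_boxes = (bxs - {nxt}) | {beyond}
            state = (nxt, new_boxes)
            if state not in seen:
                seen.add(state)
                queue.append((state, depth + 1))
    return None


def gate(candidates, max_moves=30):
    """Keep only levels the checker can solve within the move budget."""
    kept = []
    for level in candidates:
        moves = min_moves(level)
        if moves is not None and moves <= max_moves:
            kept.append(level)
    return kept


solvable = ["#####", "#@$.#", "#####"]
stuck = ["#####", "#$@.#", "#####"]  # the box is wedged against the left wall
print(min_moves(solvable), min_moves(stuck))  # 1 None
print(len(gate([solvable, stuck])))           # 1
```

The generator never has to know why a level was rejected; it only has to keep producing candidates for the gate to filter.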
Second, if you run a hyper-casual PCG pipeline, you can adopt BG-Persona's approach of preparing several imagined players with differing tastes and comparing their scored reactions to the same stage. Rather than an average score for everyone, viewing it as a distribution—as in the paper's example, "9 for the optimization lover, 2 for the elegance purist"—lets you see early who it lands with and who it doesn't.
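Reading persona scores as a distribution rather than an average takes very little code. The sketch below is my own illustration of the idea: the two sample scores are the ones quoted from the paper's example, while the function name and the `like` and `dislike` thresholds are assumptions.

```python
from statistics import mean, pstdev


def reaction_profile(scores, like=7, dislike=4):
    """Summarize per-persona scores (0-10) without collapsing them to one number.

    The like/dislike cut-offs are arbitrary illustration values.
    """
    values = list(scores.values())
    return {
        "mean": mean(values),
        "spread": pstdev(values),  # population standard deviation
        "lands_with": sorted(p for p, s in scores.items() if s >= like),
        "misses": sorted(p for p, s in scores.items() if s <= dislike),
    }


profile = reaction_profile({"optimization lover": 9, "elegance purist": 2})
print(profile["mean"], profile["spread"])      # 5.5 3.5
print(profile["lands_with"], profile["misses"])
```

A mean of 5.5 on its own would suggest a lukewarm design; the spread and the two lists show it is a polarizing one.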
Third, if you write rulebooks for board games or tabletop RPGs, you can borrow the spirit of BG-Critic's MDA diagnosis as a review checklist: split the game into three layers (rules, dynamics, fun) and isolate which layer is broken and where. In the authors' example, concrete flaws such as "the time-slot constraint has vanished" and "trading comes after planning, so you can't adjust your hand" were flagged layer by layer.
Fourth, BG-Ideator's format of distilling an idea into five fields (concept, classification, mechanics, intent, parameters) can be reused as a design-pitch template, a "shape of questions" for filling that first blank page.
Limitations
Let me organize the limits from both the authors' stance and my own read. First, what I (Fukai) flag here is that this study's yardstick for "quality" is largely model-based scoring. The judge that scores BG-Critic's diagnostic quality is itself another large model (Gemini-3.1-Pro), and the assessment that "rulebooks approach the quality of published games" rests mainly on automated judgments of static rule text, not on the results of people actually sitting down to play. Validation through real play remains, in my reading, future work.
Second, the target is paper board games (analog games), and it does not necessarily transfer as-is to digital games or action puzzles where real-time control and visual presentation matter. The human evaluation is also small at 30 people, with scores on a 7-point subjective scale. The authors themselves, in their related-work discussion, cite prior warnings that LLM-generated "personas" do not adequately represent real human diversity and that persona conditioning has limited effect. BG-Persona tries to compensate with real profiles, but I take the representativeness of a 150-person pool as something to watch. None of this is to say the authors exaggerate their results; it is about where the line of generalization should be drawn.
Fukai Reads
From here, let me be explicit that this is my (Fukai's) interpretation. I want to place this work within a trend where the protagonist of creative-support AI is shifting from the generator that writes to the critic that discerns good from bad. In the vocabulary of design criticism, what AutoBG does is close to "automating part of playtesting." What is interesting, though, is that to strengthen the critic it uses a human design theory, MDA, as its scaffold—trying to connect a machine's yardstick to human vocabulary. More than the flashiness of generation, it is in designing how to stop—the judgment to decide "no further revision"—that I read this paper's caution and its promise.
Closing
For those who want to go deeper. AutoBG's critic and personas extend MeepleLM, a study from the same author group that simulates diverse subjective experiences as a virtual playtester. Reading the original MDA framework (Hunicke et al., 2004), which views a game as rules, dynamics, and fun, also gives you a map of what BG-Critic uses as its scaffold. And if you are curious why LLM self-correction is risky on its own, reading the body of work on the limits of self-correction cited in the related work will make the point of separating out a critic click into place.
References
Papers and related materials referenced in this article:
・HTML version of the paper (quotations in this article were verified against this text)
・AutoBG released code (GitHub)
・Related work: MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences (Li et al., 2026) — the basis for AutoBG's data and evaluation
・Related work: MDA: A Formal Approach to Game Design and Game Research (Hunicke, LeBlanc, Zubek, 2004) — the origin of the framework BG-Critic uses