DESIGN-ROUNDUP · 2026-07-01

The strongest player is not the best tester: a paradox from a framework for measuring game difficulty with LLMs


Reviewed by Tsumiki · #design-roundup #news #puzzle-design #difficulty #playtesting #llm #arxiv

Introduction

Tsumiki's design roundup — one piece today.

From the English-speaking world (US research): I read, in the original English, "LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents" by Chang Xiao (Adobe Research) and Brenda Z. Yang (Columbia University). It was posted to arXiv (2410.02829) in October 2024, so it is not breaking news, but it takes a question that sits at the practical center of design, namely how to measure difficulty in puzzle and strategy games, and tests it against real human play data. I judged it worth reading now.

A note: today I could not verify a non-English source to my credibility standard in the original language, so rather than force a second item I kept it to one. I try to apply the same rule — only introduce what I actually read and could verify — to English-language papers too.

LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents

The question is simple: can off-the-shelf LLMs, without fine-tuning, be used to measure game difficulty? The authors propose a general framework. Game state is converted to text and handed to the LLM along with rules, strategies and Chain-of-Thought prompting; the model outputs the next move, and its performance is used as a difficulty proxy. They test it on Wordle (the NYT word puzzle, 529 puzzles) and Slay the Spire (a deck-building roguelike).
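To make the loop concrete, here is a minimal sketch for the Wordle case, under stated assumptions. The function names, the text layout of the state and the failure score are my own illustrations, not the authors' code or prompts. The `choose_move` argument stands in for the LLM call (rules, strategy and CoT prompt included), so any function that maps a text state to a guess can be plugged in.

```python
from collections import Counter
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (guess, feedback) pairs so far


def wordle_feedback(guess: str, answer: str) -> str:
    """Score a guess: G = right spot, Y = in the word but wrong spot, - = absent."""
    guess, answer = guess.upper(), answer.upper()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")
    marks = ["G" if g == a else "-" for g, a in zip(guess, answer)]
    # Answer letters that were not matched exactly remain available for Y marks.
    spare = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, g in enumerate(guess):
        if marks[i] == "-" and spare[g] > 0:
            marks[i] = "Y"
            spare[g] -= 1
    return "".join(marks)


def state_to_text(history: History) -> str:
    """Serialize the game state as plain text (hypothetical layout)."""
    if not history:
        return "No guesses yet."
    return "\n".join(
        f"Guess {turn}: {guess.upper()} -> {fb}"
        for turn, (guess, fb) in enumerate(history, start=1)
    )


def play_puzzle(answer: str, choose_move: Callable[[str], str],
                max_guesses: int = 6) -> int:
    """Play one puzzle and return the number of guesses used.

    A failure is scored as max_guesses + 1, which is an assumption of this sketch.
    """
    history: History = []
    for turn in range(1, max_guesses + 1):
        guess = choose_move(state_to_text(history))
        fb = wordle_feedback(guess, answer)
        history.append((guess, fb))
        if fb == "G" * len(answer):
            return turn
    return max_guesses + 1
```

Averaging `play_puzzle` over several runs per puzzle gives a per-puzzle score that can serve as the difficulty proxy described above.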

The central finding: LLMs play worse than the average human (in Wordle, GPT-4 with CoT and strategy averages 5.12 guesses versus a human average of 3.97). Yet the relative difficulty of challenges correlates strongly and significantly with human play data. In Wordle, GPT-4 (CoT+strategy) reached r=.624 against human average guesses; against Slay the Spire's Act 1 bosses, GPT-4 (CoT) reached r=.871 versus difficulty derived from human win rates. Puzzles humans find hard were hard for the LLM too.
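The r values here are correlation coefficients between per-puzzle LLM performance and per-puzzle human performance. As a rough sketch of that kind of calculation, assuming a plain Pearson coefficient, it can be computed as below. The numbers in the usage lines are made-up toy values, not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy illustration only: average guesses per puzzle (hypothetical values).
llm_avg_guesses = [4.0, 5.0, 6.0, 5.5]
human_avg_guesses = [3.5, 4.0, 4.6, 4.1]
r = pearson_r(llm_avg_guesses, human_avg_guesses)
```

Note that the LLM series can sit well above the human series in absolute terms and still correlate strongly, because the coefficient only looks at how the two series move together.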

Here is the paradox. A near-optimal, information-theoretic Wordle solver (3.55 guesses on average, better than humans) showed almost no correlation with human difficulty (r=.075, not significant), because entropy-minimizing play is nothing like how a person solves. In Slay the Spire, a rule-based expert AI played about as well as GPT-4 (CoT) yet correlated distinctly worse with human difficulty. The authors' reading is that LLMs pick moves through human-like reasoning, which makes them better difficulty proxies. In short: the entity that solves best is not necessarily the best difficulty tester. The player who gets stuck where humans get stuck is the better tester.
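For intuition about why such a solver plays so unlike a person, here is a sketch of one common information-theoretic formulation. It is an assumption on my part, not necessarily the solver used in the paper: score each guess by the entropy of the feedback patterns it would produce over the remaining candidate answers, and pick the guess that splits the candidates most evenly. With equally likely candidates, that is the same as minimizing the expected uncertainty left after the guess.

```python
import math
from collections import Counter
from typing import Iterable, List, Optional


def feedback(guess: str, answer: str) -> str:
    """Wordle feedback for equal-length words: G = right spot, Y = wrong spot, - = absent."""
    marks = ["G" if g == a else "-" for g, a in zip(guess, answer)]
    spare = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, g in enumerate(guess):
        if marks[i] == "-" and spare[g] > 0:
            marks[i] = "Y"
            spare[g] -= 1
    return "".join(marks)


def guess_entropy(guess: str, candidates: List[str]) -> float:
    """Entropy in bits of the feedback distribution, with candidates equally likely."""
    counts = Counter(feedback(guess, answer) for answer in candidates)
    total = len(candidates)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_guess(candidates: List[str],
               allowed: Optional[Iterable[str]] = None) -> str:
    """Pick the guess whose feedback is expected to be most informative."""
    pool = list(allowed) if allowed is not None else candidates
    return max(pool, key=lambda g: guess_entropy(g, candidates))
```

Nothing in this scoring cares whether a word is familiar, which letters a person tends to try first, or how an answer "feels". That fits the article's point that solving well and stumbling like a human are different properties.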

For practitioners, the authors offer five guidelines: text representation matters (formatting Wordle words as a list like "[A, P, P, L, E]" dodges tokenization quirks and improves play); compensate for the LLM's weakness without breaking the game (raise the guess cap, or hand it a somewhat stronger deck); design the difficulty curve from relative, not absolute, difficulty; use a more capable model with CoT, and let strategies reflect normal play rather than exploitative "hacks"; and calibrate metrics with a small pilot of human data.
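The first guideline is easy to show in code. The sketch below formats a word in the list style quoted above, so that each letter stands apart in the prompt. The helper names are hypothetical, and the parsing function is my own addition for reading a reply back in the same style.

```python
def format_word(word: str) -> str:
    """Render a word letter by letter, e.g. apple -> [A, P, P, L, E]."""
    return "[" + ", ".join(word.upper()) + "]"


def parse_word(text: str) -> str:
    """Read a letter list such as [A, P, P, L, E] back into a plain word."""
    inner = text.strip().strip("[]")
    return "".join(part.strip().upper() for part in inner.split(","))
```

For example, `format_word("apple")` gives `[A, P, P, L, E]`, and `parse_word` turns that string back into `APPLE`.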

Why it matters

Playtesting to tune difficulty so that a game is neither too hard nor too easy (the familiar idea of flow) is expensive in time and people. Classic automation (heuristic AI, deep reinforcement learning) has to be built per game and is computationally costly. What makes this paper interesting is its counterintuitive but convincing point: a strong AI is not automatically a good difficulty tester. It reframes the goal of a testing agent from "win" to "fail where humans fail", a mental model worth keeping for any designer who wants to validate a difficulty curve.

It is US academic work (Adobe Research / Columbia, in English), not the kind of topic going viral worldwide, but the method is clean and directly relevant to design practice. The limits are stated plainly: it only covers games expressible as text; it treats each challenge in isolation, so cumulative effects and player learning are not modelled; and it validates on just two games. The sensible stance, then, is to treat it as a tool for comparing relative difficulty rather than to over-generalize its conclusion.

A line that stayed with me

From the original (English):

“although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players.”

The paradox of the whole paper is compressed into this sentence: playing poorly and measuring difficulty well can coexist.

References

Covered today:

・LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents (Chang Xiao, Brenda Z. Yang / arXiv, in English, October 2024)

Closing

As someone who is bad at solving puzzles myself, I find the conclusion that "the strongest solver is not the best tester" oddly encouraging. What I want to remember, as someone who aspires to design, is this: difficulty is measured not by the shortest optimal solution but by how humans stumble. Tomorrow, too, I hope to bring you someone's design talk from somewhere, after checking it in the original.
