PAPER-DIGEST · 2026-06-16

Li et al.: Can LLMs Play and Beat 2D Games? - Fukai Reads GVGAI-LLM

Game-AI benchmark / spatial reasoning of language models

Reviewed by Fukai · #paper-digest #research #game-ai #llm #benchmark #gvgai #procedural-generation #spatial-reasoning

TL;DR

Large language models (LLMs - AI trained on huge amounts of text to generate continuations and replies) are good at writing, but whether they can actually play and beat a 2D game is a different matter entirely. This article introduces GVGAI-LLM, a benchmark (a shared set of test problems for comparing performance) that has language models play 118 arcade-style games to measure their reasoning and problem-solving. Each board is turned into an ASCII map (a grid of typeable characters) and handed to the model, and behavior is scored by win rate and the ratio of "meaningful moves."

The headline result: today's models clear almost none of the games. According to the paper, GPT-4o-mini scored a 0% win rate on 477 of 540 levels, and its overall win rate was just 10.27%. There remain deep weaknesses in grasping space and in planning several moves ahead. I will unpack what is inside, so the key points come across without opening the paper.

Introduction

What I printed out and marked up today is "GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games" by Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip John Bontrager, Jialin Liu, and Julian Togelius. The affiliations span New York University (NYU), the University of the Witwatersrand, Meta, and Lingnan University. The source is an arXiv preprint (arXiv:2508.08501); a note in the paper says it is under review at AAAI 2026. In other words, it is a manuscript that has not necessarily passed peer review yet, and I treat it that way here.

Why I picked it today: one of the authors, Julian Togelius, has led research in PCG (Procedural Content Generation, technology that automatically creates game content) and game AI for years, and I was drawn to his group taking on, head-on, the question of what happens when you make an LLM play games. Putting generative AI on the "playing and solving" side rather than the "building" side reveals where it stumbles. I judged this to be foundational for anyone making games who wants to use AI for playtesting or difficulty tuning.

Background

The foundation of this work is a framework called GVGAI (General Video Game AI). It is a research environment gathering more than a hundred 2D games, built to measure not "AI that is good at one game" but "AI that can reasonably play even a game it has never seen." Game rules and levels are written in VGDL (Video Game Description Language, a language that expresses a game's rules and board in a compact notation), so new games and stages can be produced endlessly. The title's "Infinite Games" comes from this, with the benefit of preventing AI from simply memorizing answers.

Previous LLM benchmarks centered on static tasks with fixed answers - like MMLU for knowledge or HumanEval for code generation. But actually playing a game requires reading a constantly changing board, grasping spatial relationships, and acting with several moves of foresight. The authors point to a gap - that there was no benchmark measuring decision-making with game-style rules and spatial reasoning in a structured symbolic world - and rebuilt GVGAI for language models to fill it. That is the starting point of this study.

Approach

The heart of the method is translating the game world into text the language model can read. At each step the board is rendered as a two-dimensional ASCII map, and the rules are also translated into natural language. The paper's Translator module rewrites the game's internal rules into plain sentences - for example, "If the avatar touches a key, the key disappears and the avatar obtains it." Then the Player module picks a concrete move like "move right" from the current board and the goal. Crucially, the model is given neither code execution nor look-ahead simulation; it must reason with words alone.
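To make the translation step concrete, here is a minimal Python sketch of my own, not the authors' code. The symbol table, the function names render_ascii and build_prompt, and the prompt wording are all assumptions for illustration; only the example rule sentence comes from the paper as quoted above.

```python
# Hypothetical symbol table; the paper's actual character mapping may differ.
SYMBOLS = {"wall": "#", "floor": ".", "avatar": "A", "key": "K", "door": "D"}


def render_ascii(grid):
    """Turn a 2D list of tile names into one text row per board row."""
    return "\n".join("".join(SYMBOLS[tile] for tile in row) for row in grid)


def build_prompt(grid, rules, actions):
    """Assemble a zero-shot prompt: legend, rules, current board, action list.

    No history of earlier moves is included, mirroring the memoryless setup.
    """
    legend = ", ".join(f"{char} = {name}" for name, char in SYMBOLS.items())
    lines = ["Legend: " + legend, "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines += ["Board:", render_ascii(grid)]
    lines.append("Choose exactly one action from: " + ", ".join(actions))
    return "\n".join(lines)


# A made-up 5x4 board, only for demonstration.
board = [
    ["wall", "wall", "wall", "wall", "wall"],
    ["wall", "avatar", "floor", "key", "wall"],
    ["wall", "floor", "floor", "door", "wall"],
    ["wall", "wall", "wall", "wall", "wall"],
]
rules = ["If the avatar touches a key, the key disappears and the avatar obtains it."]
actions = ["move up", "move down", "move left", "move right"]

prompt = build_prompt(board, rules, actions)
print(prompt)
```

The string in prompt is what would be sent to the model at each step; the reply would then be matched against the action list.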

The default setup is zero-shot (answering from only the current board, with no worked examples), and it passes no memory of past moves or states at all. Treating each step independently is meant to measure on-the-spot reasoning rather than memorization. The authors also tried contextual prompting (instructions that include a history of prior exchanges), but say they did not adopt it for the main evaluation because reasoning errors compounded and it only raised token (the smallest unit a model handles text in) usage and cost without improving win rate.

The yardsticks are chosen with care as well. One is the "meaningful step ratio," the share of moves that actually changed the board, as opposed to wasted moves like walking into a wall. Another is "step efficiency," which expresses on a 0-to-1 scale how few moves a win took. The third is win rate. An overall score averaging these aims to capture behavior from several angles. The equations appear in the paper, but in essence you can read them as measuring whether the agent wastes few moves, finishes quickly, and actually wins.
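As a rough illustration of how such yardsticks could be computed, here is a Python sketch. The formulas are my assumptions, not the paper's equations: in particular, the step_efficiency definition (one minus the fraction of a step budget used, and zero on a loss) and the plain three-way average are guesses at the spirit of the metrics. The Episode fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    won: bool
    steps_taken: int
    steps_changing_board: int  # moves after which the board differed
    step_budget: int           # maximum moves allowed in the level


def meaningful_step_ratio(ep):
    """Share of moves that changed the board (0.0 if no moves were made)."""
    if ep.steps_taken == 0:
        return 0.0
    return ep.steps_changing_board / ep.steps_taken


def step_efficiency(ep):
    """ASSUMED form: 0 for a loss, otherwise closer to 1 the fewer moves used."""
    if not ep.won:
        return 0.0
    return 1.0 - ep.steps_taken / ep.step_budget


def summarize(episodes):
    """Average each metric over episodes, then average the three (ASSUMED)."""
    n = len(episodes)
    win_rate = sum(ep.won for ep in episodes) / n
    msr = sum(meaningful_step_ratio(ep) for ep in episodes) / n
    eff = sum(step_efficiency(ep) for ep in episodes) / n
    return {
        "win_rate": win_rate,
        "meaningful_step_ratio": msr,
        "step_efficiency": eff,
        "overall": (win_rate + msr + eff) / 3,
    }


# Two made-up episodes: one quick win, one loss that used the whole budget.
episodes = [
    Episode(won=True, steps_taken=40, steps_changing_board=30, step_budget=100),
    Episode(won=False, steps_taken=100, steps_changing_board=45, step_budget=100),
]
summary = summarize(episodes)
print(summary)
```

For exact definitions, the equations in the paper take precedence over this sketch.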

Findings

In the first experiment, GPT-4o-mini was evaluated on all 118 games. The results are harsh. According to Table 2, of 540 tested levels the win rate was 0% on 477, the overall win rate was 10.27%, the overall score was 0.2764, and step efficiency averaged 0.3293. The meaningful step ratio averaged 49.71%, meaning roughly half of all moves changed nothing on the board. The authors state that the model fails even on small, simple stages a human would solve by intuition.

In the second experiment, six models (gpt-4o-mini, o3-mini, gemini-2.0-flash-exp, gemini-2.5-pro, deepseek-chat, Deepseek-r1) were compared on six games of differing character, from real-time action to spatial puzzles (zelda, aliens, boulderdash, realsokoban, escape, sokoban). In Table 3, the reasoning-focused o3-mini stands out among the LLMs, with win rates of 80.0% on Aliens, 72.0% on Zelda, 52.0% on Sokoban, and 44.0% on Escape. The reasoning model Deepseek-r1 also held up in planning-heavy games, at 50.0% on Sokoban and 54.5% on Escape. Meanwhile realsokoban was near 0% for almost every model, with only gemini-2.5-pro reaching 4.0%.

The classic search algorithms included as baselines remain strong. The tree-search agent olets recorded 100.0% on Aliens, 76.0% on Zelda, and 68.0% on Escape. While summarizing that "LLMs generally fall short of search-based methods," the authors note that some LLMs held up surprisingly well in planning-heavy environments like Sokoban and Escape, and cautiously suggest they may have useful structured-reasoning priors for situations where search alone is insufficient. They also tried coordinate tagging to aid spatial grounding, but Table 6 reports no statistically significant improvement under Fisher's exact test (a significance test for count data that remains valid even with few trials).
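For readers who have not met Fisher's exact test, here is a small self-contained Python sketch of the two-sided version for a 2x2 table of wins and losses. The win counts below are invented for illustration and are not taken from Table 6; the function name is my own.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are conditions (e.g. with / without coordinate tags),
    columns are outcomes (wins, losses). Returns the p-value.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of seeing x wins in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as likely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# HYPOTHETICAL counts: 15 wins in 25 runs with tags, 13 wins in 25 without.
p_value = fisher_exact_two_sided(15, 10, 13, 12)
print(f"p = {p_value:.3f}")
```

With counts this close, the p-value stays far above the usual 0.05 threshold, which is the kind of outcome "no statistically significant improvement" describes.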

Where you can use it

So how can people making games and puzzles use this? Let me give concrete examples. First, if you are building a Sokoban-like pushing puzzle and thinking of using AI as a playtester, this paper is a realistic map. The fact that realsokoban was near 0% for almost every model shows you cannot fully hand off multi-move planning of the "push boxes to clear a path" kind to a language model alone today. To have AI solve levels and gauge difficulty, you need to pick a reasoning-focused model or pair it with an external search algorithm.

Second, if you auto-generate levels for hyper-casual games or other PCG-driven projects, the idea of preparing a language that writes rules and levels compactly (like VGDL) and running generated stages through an algorithmic auto-evaluation loop transfers directly. Building a mechanism to mechanically check solvability into your generation pipeline keeps you from mass-producing broken levels. Third, if you build tutorials or hint systems, the paper's failure analysis is a treasure map. Knowing the model's habits - mistaking "itself after picking up a key" for a different entity, or choosing to do nothing when it should act - lets you design around the weaknesses of a player-assist AI in advance.
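The mechanical solvability check mentioned above can be sketched with a plain breadth-first search. This is my own minimal Python example for a Sokoban-like level, not the paper's tooling and not VGDL; the level characters (# wall, A avatar, $ box, . target) and the function names are assumptions, and levels are assumed to be fully enclosed by walls.

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def parse_level(text):
    """Read walls, boxes, targets and the avatar position from ASCII rows."""
    walls, boxes, targets, avatar = set(), set(), set(), None
    for r, row in enumerate(text.strip().splitlines()):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "$":
                boxes.add((r, c))
            elif ch == ".":
                targets.add((r, c))
            elif ch == "A":
                avatar = (r, c)
    return walls, frozenset(boxes), frozenset(targets), avatar


def solve(text, max_states=100_000):
    """Return a shortest move list, or None if no solution was found.

    None can also mean the state budget ran out, so treat it as
    "not shown solvable" rather than proof of unsolvability on big levels.
    """
    walls, boxes, targets, avatar = parse_level(text)
    start = (avatar, boxes)
    queue = deque([(start, [])])
    seen = {start}
    while queue and len(seen) <= max_states:
        (pos, cur_boxes), path = queue.popleft()
        if cur_boxes == targets:
            return path
        for name, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt in walls:
                continue
            new_boxes = cur_boxes
            if nxt in cur_boxes:
                # Pushing: the cell beyond the box must be free.
                beyond = (nxt[0] + dr, nxt[1] + dc)
                if beyond in walls or beyond in cur_boxes:
                    continue
                new_boxes = (cur_boxes - {nxt}) | {beyond}
            state = (nxt, new_boxes)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [name]))
    return None


SOLVABLE = "\n".join(["#####", "#A$.#", "#####"])
STUCK = "\n".join(["#####", "#A.$#", "#####"])  # box already against the wall

print(solve(SOLVABLE))
print(solve(STUCK))
```

In a generation pipeline, a level for which solve returns None would be discarded or regenerated, and the length of the returned move list gives a crude first signal of difficulty.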

It is also useful for research and education. The paper's prompt design (how the instructions to the model are assembled) - turning the board into ASCII and stating coordinates, translating rules into natural language - makes a practical recipe for testing your own game AI. Building on the authors' released code (their GitHub repository), you can start experiments that let a language model play your own game without building from scratch.

Limitations

I will note both the limitations the authors acknowledge and the ones I noticed reading it. The authors clearly state that the benchmark is "very far from solved," that coordinate-tagging spatial aid does not fully resolve the core weakness, that language models lack algorithmic path planning in the sense of A* (a classic shortest-path search method), and that contextual prompting did not help. They organize the failures - not explainable by random noise - into three roots: spatial grounding, symbolic identity, and behavioral alignment.

What I would point out here is a bias in the evaluation design. Only GPT-4o-mini was run across all 118 games; the multi-model comparison is limited to six games. So from this paper alone, you cannot conclude at 118-game scale which model is strongest overall. Also, the zero-shot design that passes no memory is coherent for measuring on-the-spot reasoning, but agents people actually use normally combine memory and tools. I want to be careful not to over-generalize these numbers into "LLMs cannot solve games."

One more point: the models evaluated (gpt-4o-mini, o3-mini, and the gemini and deepseek families) are the lineup as of mid-2025, and this field updates models quickly. This is also a preprint under review at AAAI 2026 with few accumulated citations, so it has not yet been widely debated. I think it is safest to read it on the premise that the conclusions could shift in future versions.

Fukai's reading

From here, with the caveat that this is my interpretation: I want to position this study as a new chapter, in the language-model era, of a question game-AI research has built up for years - how to measure generality. In the vocabulary of design criticism, it matters greatly that GVGAI, through the small description language VGDL, provides an infinitely extensible testing ground for games. I read this as an attempt to detach the very yardstick for evaluating difficulty and fun from any specific title and make it reusable. The result that search algorithms still beat language models seems, to me, to quietly insist that "thinking in words" and "planning the board as space" are different abilities.

Closing

For those who want to go deeper, reading the related work by the same authors brings the map into view. GameTraversalBenchmark (Nasir, James, Togelius, 2024), which evaluates LLMs on 2D map navigation, is continuous with this paper's spatial-reasoning weakness. To grasp the generation-side theory, the textbook "Procedural Content Generation in Games" (Shaker, Togelius, Nelson, 2016) is the foundation. The authors foreshadow extending the work so language models not only play games but design them - generating rules and levels - and for makers, that may be where the real subject begins. I will brew a strong coffee and wait for the follow-up.

Sources

Papers and related materials referenced in this article:

・GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games (Li, Lin, Nasir, Bontrager, Liu, Togelius, 2025, arXiv preprint; under review at AAAI 2026)

・Authors' released code (GitHub: doveliyuchen/GVGAI_GYM)

・Related work: GameTraversalBenchmark (Nasir, James, Togelius, 2024) (evaluating LLM 2D map navigation and planning)

・Related work: Shaker, Togelius, Nelson, "Procedural Content Generation in Games" (Springer, 2016) (a PCG textbook)
