PAPER-DIGEST · 2026-06-22

Monti et al.: Measuring AI's Planning Power on a Single-Corridor Sokoban — Fukai Reads

SokoBench, a long-horizon planning benchmark built from a minimized Sokoban

Reviewed by Fukai · #paper-digest #research #sokoban #llm #planning #reasoning #benchmark #puzzle #game-design

TL;DR

Large language models (LLMs — AI trained on huge amounts of text that produce writing by predicting the next word) are increasingly trained as "reasoning models," which write out a chain of thought before giving an answer. This paper asks how much long-horizon planning ability — the capacity to look all the way to the end of a long sequence of steps — such models actually have. The authors use Sokoban (a puzzle in which you push boxes onto goal tiles) and deliberately strip the puzzle down to its bare minimum: a single straight corridor with exactly one movable box.

The result is clear. Even on this single corridor, which is trivial to the point of boredom for a human, every state-of-the-art reasoning model broke down once solving it required looking more than 25–30 moves ahead. Giving the models an external solver (a dedicated program that computes the exact solution) as a tool smooths the average somewhat, but the underlying weakness does not disappear. The authors locate the core stumble in an inability to keep counting tiles correctly. This article explains the paper so its key points land without your having to open it.

Introduction

Today's paper is "SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models." The authors are Sebastiano Monti, Carlo Nicolini and Giovanni Pellegrini (all at Ipazia SpA, Italy), Jacopo Staiano (University of Trento), and Bruno Lepri (Fondazione Bruno Kessler and Ipazia SpA). It is a preprint posted to arXiv in 2026 (arXiv:2601.20856 — a freshly posted manuscript that may not yet have passed peer review). Its citation count has not yet accumulated, so I note that it is still at a pre-discussion stage.

I chose it today because the subject is puzzles themselves. The authors use Sokoban — a classic any puzzle maker knows — as a measuring stick for AI ability. And their move is interesting. Where one would normally line up hard puzzles to trip the AI up, they instead line up only puzzles that are as easy as possible. If the AI still breaks, then the reason for breaking lies not in difficulty itself but in something more elementary upstream. That is the shape of this paper.

One more thing. The paper's main subject is AI evaluation, but I think it can also be read in the vocabulary of puzzle design. The idea of compressing difficulty onto a single axis, and the observation that representation format sways the score, are both things a game maker can take home. So this article moves back and forth between a reading as AI research and a reading as a designer.

Background

"Planning" means assembling a sequence of actions that reaches a goal. It is a long-standing subject in AI, tackled with tools such as PDDL (Planning Domain Definition Language — a notation for writing down the states of the world and the rules of action precisely) and tree search (branching out the possible moves and examining them). Recent LLMs, by contrast, are remarkable at understanding text and recalling knowledge, but have repeatedly been reported to be weak at planning.

There are reasons Sokoban is favored as an evaluation target. Solvable boards can be generated automatically and efficiently, the environment is fully deterministic (the same move always yields the same result), and answers can be checked against an exact solver. Moreover, unlike the Tower of Hanoi, it cannot be solved by repeating one pattern — moving a single box can block or open paths, so each board needs its own unique solution. That is why it has been treated as a touchstone for planning ability, even adopted in the 2023 International Planning Competition.

What was not understood was the real cause of where LLMs stumble in planning. Is it difficulty itself (much branching, deep search), or is it the dropping of more humble operations — counting correctly, remembering where you currently are? The authors set out to separate these.

Approach

The authors' core idea is to lengthen only the number of moves without adding complexity. Each board is a single corridor of width ℓ and height 1, containing just one player, one box and one goal. The three line up in a single row: the goal at one end, the player at the other, the box between them. The board's only degree of freedom is the corridor length ℓ, which stands in for difficulty (longer means more moves, harder). ℓ runs from 5 to 100 in steps of 5, and for each length they prepare four orientations — 90°, 180°, 270° rotations plus the original — for 80 boards in total. The dataset is publicly released.

Why make it so simple? The authors want to separate "the power to look ahead over a long sequence" from "the power to keep remembering state." In a single corridor, branching is essentially zero. So if the AI fails, the failure can be pinned down: not because it got lost in a sea of options, but because it could not count all the way down a straight road. Boards are written in ASCII (a string that depicts a picture with symbols), and answers are written in LURD (a notation for moves using the initials of left, up, right, down).

The experiment has two stages. The first is "1-shot inference" — the model sees a single worked example and then must solve using only its internal memory (no external tools). The second is "LLM-Modulo" — the model is given a PDDL syntax checker, a validator and a solver as tools. Here the model's job is not to produce the solution itself but only to transcribe the problem into PDDL form; a dedicated solver (such as Fast-Downward) then solves it. The tools are passed via MCP (Model Context Protocol, Anthropic's published standard — a common etiquette for connecting AI to external tools).

There are three measuring sticks. Exact-match accuracy (the fraction where the string matches the model answer exactly), prefix accuracy (which gives partial credit when the start is correct), and the distance between the final position reached and the goal (Manhattan distance — the sum of vertical and horizontal movement). The third is a soft failure signal, meant to distinguish a near-miss that went the right way from a complete loss of direction.

Findings

First, 1-shot inference. When accuracy is plotted against corridor length, the paper (Figure 3) splits roughly into three regions: an "easy region" of short corridors that holds high accuracy, an "intermediate region" where accuracy falls sharply as length grows, and a "hard region" where the model can no longer return a correct sequence at all. Where the collapse happens differs by model. In the conclusion the authors state that even state-of-the-art reasoning models break down once the goal must be anticipated more than 25–30 moves ahead.

Why the fall? The authors' read is "accumulated miscounting." If there is a small per-step probability of miscounting tiles, it multiplies across the number of moves, so on long corridors the success probability shrinks exponentially (I omit the formula, but the authors describe it as small per-step errors compounding). In fact, on long corridors models were observed repeating the same move or the same thought endlessly until they hit the output cap (completion tokens limited to 32,768) and stopped. The authors describe this as a state of aimless "wandering" rather than systematic search.

There is an interesting byproduct. The amount of output on correct answers rises almost linearly with corridor length, whereas on wrong answers the spread of output is far larger. The authors explicitly note that the phenomenon reported elsewhere — cutting reasoning short when a problem is too hard — was not observed here. Instead the models keep thinking to the end and still miss. They also saw GPT-5-mini make an unnatural accuracy peak around length 50, which the authors speculate (without asserting) may be due to memorization (possible inclusion in training data).

Next, LLM-Modulo. With an external solver in the loop, the accuracy curve becomes smoother — sharp peaks and steep drops vanish and the decline grows gentler. Yet the improvement is modest. Among the models tried, only GPT-5-mini could use the tools well; DeepSeek R1, Gemini-2.5-Flash-Preview and Claude-3.5-Haiku stumbled on tool operation or on generating valid PDDL problems. The content of the failures is telling: PDDL syntax errors (formatting the solver rejects) occurred only 7 times across 80 configurations × 4 trials. Most failures were syntactically valid but "semantically off" problems that mis-stated the board size. Even using a correct, expert-authored solver, the model's inability to count and represent space faithfully remains — that is the authors' reading. For reference, the authors also cite prior work (Valmeekam et al., 2025) reporting roughly 10–12% direct solving and about 43% with a solver for o1-preview (these are prior-work figures).

How puzzle and game makers can use this

As a puzzle maker, the first practically useful lesson is: do not trust an LLM as the solver when validating generation. If you are building a Sokoban-like with automatic level generation, the judgment of whether a generated stage is actually solvable should be left to a deterministic solver (a dedicated program such as Fast-Downward), not to an LLM. The paper's LLM-Modulo result suggests a realistic division of labor: use the LLM as the "clerk that transcribes the problem into a formal form," and hand the guarantee of a solution to a classical solver (a reading in line with the authors' claims).

Second, if you plan to use AI as an "automatic playtester" or "difficulty estimator," you can know its breaking point in advance. Even on a straight corridor that is trivial to a human, the latest models broke down once more than 25–30 moves of lookahead were required. So LLM-based difficulty estimation should be limited to short-move puzzles, or built on the premise of giving the model a solver as a tool. Letting an LLM alone evaluate a long-move puzzle risks returning unreliable numbers.

Third is a hint for design philosophy itself. The authors narrowed difficulty to a single axis — corridor length. This is a fine example for drawing a tutorial or early-game difficulty curve. Before turning several parameters at once, try raising and lowering difficulty along one legible axis — its behavior stays visible to both the player and you. If you are running hyper-casual PCG (Procedural Content Generation — automatic generation of content), grading difficulty on a single axis first, confirming the effect, and only then adding axes keeps control from breaking down.

Fourth, the fact that representation format swayed the score is practical too. The paper saw an asymmetry where vertical corridors (boards with many line breaks) scored worse. If your tooling feeds level layouts to an LLM (hint generation, NPCs, automatic scoring), the choice of encoding — ASCII, a grid with numbered rows and columns, or LURD — changes reliability. It is worth normalizing the representation and checking once whether orientation or line-break quirks are distorting your results.

Limitations

The authors are candid about their own limits. The boards are confined to single-box straight corridors, which do not measure the real difficulty of Sokoban with multiple boxes and deadlocks (where a box is pushed into a position from which it can never move again). So, they write, the benchmark only shows a "lower bound" on planning ability. Evaluation uses exact match against a reference solution (reasonable since a corridor has a single optimal solution), but because general Sokoban has multiple optimal solutions, they plan to switch to solver-based verification in future.

They also line up, as "threats to validity," that the score is swayed by prompt formatting such as orientation and line breaks; that because access is via API, the model and provider backend can change over time; that rotation softens but does not fully eliminate contamination from training data; and that there is no guarantee corridor results transfer to richer planning tasks. They themselves describe the setting mainly as a sanity check and say they will later add obstacles, branching and deadlocks.

What I, Fukai, would point out are two things. One is that the number of reasoning models tested is small (DeepSeek R1, the GPT-5 family, GPT-oss 120B), and that LLM-Modulo effectively narrows to a single model, GPT-5-mini. I think it is safest to read the reach of the conclusions as limited to this particular set of models. The other is the weight of computational cost. LLM-Modulo is reported to take an average of 75 minutes with GPT-5-mini to gather Figure 6(a). To say in production that "a solver in the loop is reassuring," I read, one must budget for this slowness and expense from the design stage.

Fukai's Reading

Let me flag that what follows is my interpretation. I want to place this study within a recent current: that the ceiling of "smartness" is exposed not by flashy hard problems but by humble state management. In the vocabulary of design criticism, this is close to an experiment in reducing difficulty to a single axis — a "minimal unit of difficulty." When Sokoban is whittled down into a straight line to act as a mirror of AI's weaknesses, what comes into view is not the contour of intelligence but the fragility of counting and of continually remembering where you are — tasks a human would settle by tracing a finger. The "simple yet never-let-your-guard-down" design that puzzle authors know by experience happens, I read, to strike exactly at the AI's weak spot.

Closing

For those who want to go deeper, here are some handholds for drawing the map. The stance this paper builds on — that LLMs are poor at planning but can help within a framework paired with a solver (LLM-Modulo) — follows the argument of Kambhampati and colleagues. And the paper repeatedly checks its results against Shojaee et al.'s "The Illusion of Thinking," which treats how reasoning models break down in front of hard problems. If you are curious about the computational complexity of Sokoban itself (its PSPACE-completeness — the property that searching for a solution is inherently heavy), Culberson's classic is a starting point.

From the puzzle-making side, the idea of putting generation and verification in the same loop — have the AI write the problem, have a solver solve and confirm it — is probably the most fruitful take-home for now. Rather than seeing the AI as an all-powerful solver, see it as a partner that transcribes into a formal form, and entrust the guarantee to classical tools. It is unglamorous, but it does not break easily. The next time you build automatic generation for a Sokoban-like, remember that division of labor.

References

Papers and related materials referenced in this article:

・SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models (Monti, Nicolini, Pellegrini, Staiano, Lepri, 2026, arXiv preprint)

・Dataset: sokobanlevels (Hugging Face)

Reactions (no login)

Anonymous • one of each per visitor per day