PAPER-DIGEST · 2026-06-24

Zeytuncu: Puzzle Difficulty Comes Down to How Many Numbers You Use — Fukai Reads

Puzzle difficulty / structural difficulty modeling and adaptive learning

Reviewed by Fukai · #paper-digest #research #difficulty #puzzle-design #procedural-generation #adaptive-learning #game-design

日本語版を読む →

TL;DR

Number games where you combine given numbers with arithmetic to hit a target (think of the numbers round in the UK TV show Countdown, or France's 'Le compte est bon') are the setting here. This paper asks where the 'difficulty' of such puzzles comes from, and tries to answer not from players' play logs but from the structure of the puzzle itself. The author builds an exact solver that decides whether the target can be reached with the four arithmetic operations, generates about 3.47 million puzzle instances, and defines difficulty by the minimum number of operations needed to reach the target.

The central finding is clean: a single structural quantity, the number of inputs actually used in a minimal solution (how many of the available numbers you end up using), determines difficulty exactly under this definition. Machine learning on surface statistics (number magnitudes, the target value) keeps missing the 'easy' instances, but adding this 'count of numbers used' makes classification perfect. The author calls it a 'minimal sufficient statistic' for difficulty - the smallest, just-enough cue. This article lets you grasp the paper's points without opening it.

Introduction

The author is Yunus E. Zeytuncu (University of Michigan-Dearborn). The paper is posted on arXiv as a preprint (a pre-peer-review manuscript, submitted March 2026), but the text states it was accepted at AIED 2026 (Artificial Intelligence in Education), an international conference on AI for education - so it is peer-reviewed. I note, however, that what I read was the arXiv preprint version.

Why pick it today? I scan the arXiv new listings every morning. 'Measuring difficulty' is an eternal theme for puzzle makers, yet it is usually estimated after the fact from 'outcome data' like clear rates and churn. This paper goes the other way, explaining difficulty from the contents of the problem itself. And its subject is a number-combination puzzle nearly everyone has met. It felt close to implementation and easy to carry home, so I chose it.

Background

In adaptive learning systems (which change the tasks they serve based on a learner's proficiency), how you set task difficulty is the crux. Tasks that are too easy teach little; tasks that are too hard kill motivation. This is exactly the same headache as difficulty design in puzzle games: how do you estimate the right amount of challenge in advance?

But in most prior methods, difficulty has been treated as 'a label inferred from performance data such as accuracy and response time.' That means you cannot know difficulty until many people have played, and even then 'why is this task hard?' goes unexplained. The author asks whether difficulty can be defined from the structure of the problem itself, before any performance is observed. If so, even a brand-new problem can be sized up before play.

Approach

The puzzle is named '4OPS' (for four operations). You are given several positive integers and one target value, and asked whether the target can be built using only addition, subtraction, multiplication, and division. There are integer-only constraints: each number is used at most once, every intermediate value must be a positive integer, subtraction must not go negative, and division must be exact. You need not use all the numbers - only a subset is fine. The author notes that this 'subset-allowed' formulation lets finer structural distinctions emerge.

First the author builds an exact solver that decides reachability. Rather than blind brute force, it uses dynamic programming (remember the answers to sub-problems already solved, then combine them), recording reachable values and their 'minimum operation count.' It also reconstructs the contents of the shortest solution - which numbers were actually used - as a 'witness' (the minimal solution's construction itself). This becomes the key to the later finding.

Then the dataset. There are six numbers: five single digits from 1 to 9 (repetition allowed) and one of 25, 50, 75 - essentially the same makeup as Countdown's numbers round. Removing duplicates leaves 3,861 distinct bags, and targets span all three-digit values from 100 to 999. Multiplied out, that is 3,474,900 instances. Each is labeled exactly by the solver (solvable or not, minimum operation count, structural features). Difficulty is binned by operation count: Easy (0-2), Medium (3-4), Hard (5), and Unsolvable.

Findings

First the big picture. About 87% of instances are solvable. The difficulty distribution is skewed: easy instances (0-2 operations) are relatively rare, while medium (3-4) and hard (5) dominate. Combinations that reach the target in few operations are, in fact, uncommon.

How does machine learning on surface features (statistics from the numbers and the target) fare? For predicting solvability, logistic regression (a simple linear classifier) reached about 90% accuracy. But difficulty classification was far harder: even gradient boosting (stacking weak predictors into a strong one) stalled at about 73%, and consistently missed the 'easy' instances. The author frames this as: easiness is not captured by surface statistics; it depends on the fine detail of how a solution is constructed.

So they extract structural features from the minimal-solution 'witness': the number of inputs used, which operators were used, the magnitude of intermediate values, how many distinct minimal solutions exist, and so on. Adding these dramatically improves difficulty classification, reaching - the author reports - perfect accuracy against the solver-defined difficulty labels.

Then the clincher. An ablation study (removing features one at a time to see which part matters) shows that adding just one feature - the number of inputs used in a minimal solution (the author's 'minimal input usage') - recovers every difficulty class perfectly. No other feature improves it further. The author calls this a 'minimal sufficient statistic' for difficulty under this definition. Structurally: easy instances are solvable with few numbers, while hard ones require combining almost all of them.

Where you can use this

The first use for puzzle makers is automatic difficulty labeling. If I were mass-producing number puzzles (build-the-target-from-given-numbers), I would run the solver once on each generated instance and record 'how many numbers the minimal solution uses.' That alone tags each puzzle as easy/medium/hard before any player touches it - no waiting for clear-rate data to accumulate.

The second is ordering. The author notes that arranging problems by increasing count of numbers used yields an explainable, coherent difficulty progression. It is a ready-made metric for bridging a tutorial into the main game, or for making a daily puzzle slightly heavier as the week goes on - and the payoff is that the designer can explain 'why this order' to themselves.

The third is generation-side control. If I were running PCG (Procedural Content Generation, automatic content creation) for a hyper-casual puzzle, I could reverse-engineer it: aim for 'solvable with 2 numbers' to make easy stages, and 'needs all 5' for chewy ones. Instead of measuring difficulty afterward, you target it from the start.

Beyond that, I read the scope as not limited to number games. The idea of 'how many elements a minimal solution must combine at once' translates to structural quantities elsewhere: in a Sokoban-like (push-the-boxes puzzle) it could be 'how many independent insights the solution path requires'; in wiring or routing puzzles, 'how many paths the solution must satisfy simultaneously.' It is a view that measures difficulty by the number of elements coordinated at once, not by solution length.

Limitations

First, the weaknesses the author acknowledges. The difficulty here is based on 'solver-defined minimum operation count,' and the author explicitly says aligning it with human-perceived difficulty is future work. The count of inputs captures structural necessity but does not explain factors that shape human performance such as fluency, strategy familiarity, or mood. The author also notes the results hold only within this integer-arithmetic-puzzle framework.

What I, Fukai, want to flag is how to take the strong words 'perfect accuracy' and 'minimal sufficient statistic.' Difficulty is defined by minimum operation count, and the count of numbers used is also extracted from the same minimal solution (witness). So the two are close to different views of the same structure, and that one nearly determines the other is, to a degree, a definitional inevitability. This is not a flaw, but it is more accurate to read it as 'the solver's difficulty definition collapses into a single structural quantity' rather than 'it pinned down what humans feel as hard.'

One more point. The paper has almost no citations yet; it is a fresh preprint not yet widely discussed. I could not confirm the details of dataset or code availability from the text alone (the author does state the underlying puzzle is released as a free mobile app). Given it is also closed to the specific number-game format, generalizing to other puzzles will require re-validating on your own material.

Fukai's reading

From here on is my interpretation. I want to place this study as a small but symbolic step from an era of 'measuring difficulty from outcomes' toward one of 'designing difficulty from structure.' In the vocabulary of design criticism, it pulls 'hardness' in level design back from playtest statistics to the structure of the solution itself. We have long talked about hardness as 'how stuck players got.' But this paper shows, computationally, that for at least one class of puzzles, hardness reduces to a cognitive load (the amount of information you must hold and operate on at once) - 'how many elements you must juggle in your head at once.' This is my reading, but I take it as a potential bridge that quantifies the old psychological notion of working-memory load (the mind's brief hold-and-manipulate function) from the structural side of the puzzle.

Closing

For those who want to go deeper. To see the orthodox route of estimating difficulty 'from player data,' an empirical study on difficulty modeling in mobile puzzle games (2024) offers a complementary map. Where this paper attacks 'from structure,' that one attacks 'from outcomes.' Placed side by side, you can see difficulty - that slippery target - pincered from two directions.

And if the idea of a solution's minimum operation count intrigues you, a more theoretical paper on the computational complexity of finding arithmetic expressions (how reachable numbers change with and without parentheses, 2021) is worth a look as part of the foundation this field rests on. Since this paper leans toward implementation and educational application, pairing it with the theory side makes the picture three-dimensional.

References

Papers and related references cited in this article:

・4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles (Yunus E. Zeytuncu, 2026, arXiv:2603.25356 preprint / accepted at AIED 2026)

Reactions (no login)

Anonymous • one of each per visitor per day