PAPER-DIGEST · 2026-06-17

McConnell & Zhao: Generating Just-Right Puzzles in Real Time with a Genetic Algorithm — Fukai Reads

Adaptive puzzle generation and player modeling with a genetic algorithm

Reviewed by Fukai · #paper-digest #research #procedural-generation #genetic-algorithm #player-modeling #dynamic-difficulty #puzzle-generation #serious-games

日本語版を読む →

TL;DR

Today I picked a paper that reports an attempt to build puzzles on the fly with a genetic algorithm and fit their difficulty to each individual player. The material is a path-drawing cargo puzzle closely modeled on Cosmic Express. The authors record how a player solves (time taken, retries, how often they nearly solved it, and so on), decide the next difficulty from that, and generate a puzzle matching that difficulty with a genetic algorithm — an optimization method that imitates biological evolution to gradually search for good solutions — in roughly 7 seconds per puzzle.

In a small study with 18 participants, the version that relied on time alone clearly lagged, while versions using multiple metrics scored higher on "the difficulty felt right" and "I felt a sense of progression." It is an implementation-minded paper that simply connects real-time generation with adaptive difficulty. I will write this so that this article alone tells you what they built and what they found.

Introduction

The paper is "From Frustration to Fun: An Adaptive Problem-Solving Puzzle Game Powered by Genetic Algorithm" by Matthew McConnell and Richard Zhao (Department of Computer Science, University of Calgary, Canada). My source is the preprint posted to arXiv in September 2025 (arXiv:2509.23796). The manuscript carries an AAAI (the AI society) copyright notice and thanks anonymous reviewers, which reads as if it were prepared for and accepted at an AAAI-affiliated venue, but within what I can confirm I treat the arXiv version as the source.

I chose it because it sits squarely on Puzzlebyrinth's home turf: automatic generation of path puzzles, numerical difficulty, player modeling (estimating "how skilled is this person right now" from play logs), and dynamic difficulty adjustment (raising or lowering difficulty mid-play; DDA). All of these fit into one playable system, and the paper even includes pseudocode for the algorithms. I appreciate that it is at a grain a game maker can directly borrow from.

One caveat up front. This is a fairly new preprint with almost no citations yet, and the study itself is a pilot (preliminary) study with 18 participants. So there are few results one can call "proven." I will quote the numbers as written and, at the end, carefully separate what can be said from what cannot.

Background

The idea of fitting difficulty to each person is not new. Both games and education have long used the notion that "too easy is boring, too hard is frustrating, and immersion (flow) lives in between." The authors start from the classic finding that one-on-one tutoring is extremely effective for learning — Bloom's 1984 result that the average tutored student lands about two standard deviations above the average of conventionally taught students.

So what was missing? In the authors' account, most existing adaptive learning systems run "offline": they analyze data afterward to improve the next session, and few rebuild content on the spot during play. Moreover, the most widely used cue for adaptation has been time-on-task (how long the task took), yet how effective it is on its own has rarely been tested.

This study therefore pushes on two points: (1) building a system that adjusts difficulty online in real time, in a form comparable against a non-adaptive baseline, and (2) isolating the single metric of time to question its very usefulness. At the heart of generation sits the genetic algorithm, a flagship method of PCG (Procedural Content Generation, the automatic creation of content).

Approach

First, the play itself. The authors built APSG (Adaptive Problem-Solving Game) in Unity. The material is a path puzzle modeled on Cosmic Express (Hazelden, Davis, Tyu, 2017). On an n-by-n grid you draw a single, non-branching path from start to goal. A "container" advances one tile at a time along the path and can carry exactly one cargo at once. It picks cargo up at designated pickup points and drops it at designated dropoff points. Paths cannot cross, and you must plan the order or the puzzle is unsolvable. Difficulty is set mainly by grid size, the number of pickup/dropoff points, and the placement of special tiles.

What builds each puzzle on the spot is the genetic algorithm. In words: prepare a population of candidate puzzles and score each by "how close it is to the target difficulty" (the fitness). Pick high-scoring ones as parents, combine two of them (crossover), apply a little random change (mutation) to make children, and repeat over generations so the population drifts toward puzzles at the intended difficulty. The authors base this skeleton on Scirea's (2020) NSFI-2POP structure.

The crossover has a twist. A puzzle is represented as a 2D map of characters (# empty, X path, P pickup, D dropoff, S start, E end, and so on). Cutting two parents left/right at a random column and swapping usually breaks the path. They re-fill the gap with BFS (breadth-first search, which explores from nearby tiles outward to find the shortest connection) and re-place pickup/dropoff points by a distance-based rule across path segments, then always check solvability. In other words, repair is built in so generation does not emit broken solutions.

Fitness moves with "which difficulty are we aiming at." The authors set target values per factor — path length (8-50), corners (0-20), empty space, pickups (1-12), orthogonal pickups, and so on — and interpolate targets according to difficulty 1-10. The score adds up, with weights, how close each factor is to its target (no formula here, but in essence it deducts the further you stray).

So who decides the target difficulty? The player model. The system looks at the player's state and proposes "raise / lower / keep" for the next puzzle. Its inputs are time to solve, number of attempts, backtracks (erasing part of the drawn path to retry), resets, and near-misses (fewer than 25% of special points missed). What is interesting is that the number of attempts is a "hard constraint" — a safety valve that does not raise difficulty while the player is failing a lot — while the rest are "soft constraints," turned into a weighted score that sets the magnitude of the change smoothly.

The parameters used in the study were population size 300, crossover rate 80%, generation limit 10, and a maximum grid of 10x10. With this setting, a puzzle of any difficulty could be generated in roughly 7 seconds. Increasing population, generations, or grid size lets you build larger, more complex puzzles, but generation time grows substantially in return — the authors say this plainly.

Findings

The study had 18 participants, each playing three versions of APSG: Standard (uses all player-model metrics), Increasing (raises difficulty by one each time regardless of performance), and Time-based (uses only time as its metric). Presentation order was rotated across people to limit learning bias. Overall reception was favorable: 89% found it intellectually stimulating, 94.5% said it required problem-solving, and 72.2% felt it required critical thinking (all agree/strongly agree).

Here are the key comparisons in the original numbers. "Reduced frustration" (1-10, higher is better) was Standard 6.47, Increasing 6.94, Time-based 6.83, with no significant difference in ANOVA (analysis of variance, a test for differences among group means): F(2,51)=0.072, p=0.93. "The difficulty was right" was Standard 6.06, Increasing 5.67, Time-based 4.44, which did differ (p=0.039), and the Standard-vs-Time-based gap was clear (p=0.004, effect size Cohen's d=0.869, where effect size gauges how large the difference is).

"I noticed the difficulty change" was Standard 8.17, Increasing 7.56, Time-based 6.89 (no significant difference, p=0.149). "I felt a sense of progression" was Standard 8.56, Increasing 7.5, Time-based 6.0, where a clear difference appeared (p=0.005; Standard vs Time-based p<0.001, d=1.083). In the logs too, Standard and Increasing had nearly the same average difficulty (5.53 and 5.50), whereas Time-based was lower at 3.87 and was solved on average 25.67 seconds faster than the benchmark — read as failing to escalate difficulty enough.

In sum, versions using multiple metrics scored higher across felt measures, and the time-only version showed the weakness of "tending to stay easy." On this basis the authors point toward refining the Standard type as the future core. Yet they also state plainly that many Standard-vs-Increasing differences are not statistically significant.

Use Cases

Let me list concrete uses for people who make games and puzzles. First, if you are building a Cosmic Express- or Sokoban-like path puzzle, this paper's fitness function works directly as a difficulty knob. Set target values for "measurable quantities" — path length, corners, empty space, pickup count, orthogonal pickups — and generate to match. Because you grasp difficulty as numbers rather than words, mass-producing levels per difficulty band gets easier.

Second, if you are wiring up PCG for a hypercasual game, the lesson "do not rely on time alone" matters. This paper combines backtracks, resets, and near-miss counts, and even measures the weakness of the time-only version. The more churn-prone the casual audience, the more it pays to detect getting-stuck early from multiple behavioral signals and fine-tune the next puzzle's difficulty.

Third, if broken generated puzzles plague you, you can lift the design of repairing the path with BFS after crossover and always verifying solvability at the end. Beyond GA, the idea of fixing rather than discarding broken synthesized candidates is useful across constraint-heavy puzzles generally. Fourth, as the authors note for future applications, this can extend to difficulty adjustment for logic puzzles, arithmetic drills, and practice problems for learners with disabilities. Education, outside games, comes into range.

Limitations

First, weaknesses the authors acknowledge. The 18 participants are few, and the results are at a pilot stage. Crucially, there is no comparison against a fully non-adaptive (static) baseline, without which they say one cannot strictly state how much frustration was reduced. The time metric was only isolated and evaluated, while an ablation study (removing design elements one at a time to see which matters) is left to future work. Thresholds were set at the group level, abstracting away individual differences.

From here is what I, Fukai, noticed on reading. First, the roughly 7 seconds per generation could break tempo if it runs synchronously. The paper does not flag this, but it sits in tension with the snappy "on to the next" feel of commercial puzzles. Second, many Standard-vs-Increasing differences are not significant (18 participants lacks the power to detect them). So "Standard is best" reads to me as a suggestion, not a proof.

Third, what bothers me most is that the fitness measures "proxies for difficulty" like path length and corners, not the cognitive difficulty a person actually feels. Proxy and felt difficulty can diverge; indeed in this study the felt challenge and the objective difficulty do not line up perfectly. Finally, the source Cosmic Express is a commercial game, and a "modeled on" reproduction needs rights-side care — a note for implementers, separate from the paper's claims.

Fukai's Reading

Only here I write, with the caveat that this is my interpretation. I want to place this study within the drift of PCG's center of gravity from "just generate lots of variety" toward "generate to fit the one player in front of you." In the vocabulary of design criticism, this is close to automating the adjustment that level designers have implicitly done — reading the opponent and playing the next move. What is interesting is that the automation is built not from advanced machine learning but from a genetic algorithm and hand-written constraints — mature, readable tools. For reproducibility, I read this plainness as a virtue worth valuing.

Closing

For those who want to go deeper, let me draw one line that could serve as a map. This paper's genetic algorithm builds on Scirea's (2020) "adaptive puzzle generation for computational thinking," so start there if you want to dig into the generation side. For an overview of PCG in general, Shaker, Togelius, and Nelson's "Procedural Content Generation in Games" is the standard. If you care about player modeling and DDA, follow the branches from the DDA review in this paper's references (Lopes & Lopes, 2022) to see where this single work stands. Next, I want to find and read studies that directly measure the gap between proxy and felt difficulty.

Again: this rests on an 18-person pilot study and is a preprint not yet widely discussed. Precisely for that reason, I think it is a good piece for taking home "what can and cannot be said" without rushing to conclusions. This morning, too, I read it with a strong hot drip coffee in hand, drawing lines on the printed PDF with colored pens.

References

Papers and related materials referenced in this article:

・From Frustration to Fun: An Adaptive Problem-Solving Puzzle Game Powered by Genetic Algorithm (McConnell & Zhao, 2025, arXiv preprint 2509.23796)

・HTML version of the same paper (checked text, figures, algorithm pseudocode)

・Related: Scirea, M. (2020) "Adaptive puzzle generation for computational thinking" (basis for this paper's NSFI-2POP genetic-algorithm structure)

・Related: Shaker, Togelius & Nelson (2016) "Procedural Content Generation in Games" (a standard PCG reference)

・Related: Lopes & Lopes (2022) "A review of dynamic difficulty adjustment methods for serious games" (a DDA review)

・Source game: Cosmic Express (Hazelden, Davis & Tyu, 2017)

Reactions (no login)

Anonymous • one of each per visitor per day