PAPER-REVIEW · 2026-06-11
Feng et al.: Can AI Generate Counter-Intuitive Chess Puzzles? — Fukai Reads
Creative chess puzzle generation with generative AI and reinforcement learning
TL;DR
This paper takes on the challenge of getting an AI to compose counter-intuitive chess puzzles. A team centered at Google DeepMind first trains a generative model on the large public set of puzzles from Lichess (the world's largest free online chess site), then uses reinforcement learning (a framework that improves behavior by rewarding good outcomes from trial and error) to steer it toward positions whose best move looks terrible at a glance but is in fact brilliant.
The result: the rate at which such counter-intuitive puzzles are generated rose from 0.22% (supervised training alone) to 2.5%, roughly a tenfold increase, exceeding even the 2.1% rate found in the source data. Human chess experts rated the generated puzzles as creative and enjoyable, and three world-renowned experts acknowledged the creativity of the resulting booklet. It is one concrete answer to the hard question of whether AI can be creative.
Introduction
Good morning. With a strong, hot pour-over in hand, I was scanning the arXiv new listings this morning when I found a paper no puzzle lover can ignore. Today I introduce 'Generating Creative Chess Puzzles,' posted to arXiv in October 2025. It is co-authored by thirteen people, led by Xidong Feng of Google DeepMind, with contributors from the University of Oxford and Mila (the AI institute in Montreal); the senior authors include Tom Zahavy and Satinder Singh, both well known for reinforcement learning and chess AI.
To be clear, this is a preprint posted to arXiv (a manuscript made public before journal peer review), and as of October 2025 it may not yet have passed peer review. So in this article I will say 'showed' or 'found' only where the authors themselves do.
I chose it anyway because it confronts head-on the slippery question of whether AI can be creative, using chess puzzles as a measurable subject. Creativity is vague and hard to put into numbers. The way the authors try to pin that vague quality down in a form a machine can handle holds real lessons for people who design puzzles and games.
Background: Why Generating Creative Puzzles Is Hard
Generative AI has produced striking results in text, images, code, and more. But the authors start from the framing that creativity is often called AI's 'final frontier.' AI-written poetry, for instance, can look indistinguishable from human work yet be judged by experts as lacking structural depth. Counter-intuitive ideas, abstract reasoning, and complex composition are where AI is reckoned to still fall short of humans.
Chess puzzles have long been used in education, online entertainment, and research on computational creativity. But composing creative ones is hard. First, a puzzle's solution line is hidden until you see the answer, so its quality is not obvious from the surface. Second, there is no standard definition of what makes a good chess puzzle — no objective yardstick for creativity or beauty.
The authors lean on the puzzle collection published by Lichess, the world's largest online chess site. Roughly one million puzzles accumulate there each year, yet by the authors' criterion only about 2.1% count as counter-intuitive. Genuinely creative puzzles are rare to begin with, which makes getting an AI to mass-produce them a serious challenge.
Approach: Turning Creativity into Numbers, Then Targeting It with RL
The authors first define in words what makes a puzzle creative, then reduce that to numbers a machine can measure. They draw on three aspects long present in the chess literature: counter-intuitiveness, aesthetics, and novelty. They also prize uniqueness — a single, determined solution — because multiple correct answers dilute the joy of finding the sharpest move.
The clever part is how they measure counter-intuitiveness. They have a chess engine (a program such as Stockfish or AlphaZero that reads a position and computes the best move) evaluate a position twice: once with shallow thinking and once with deep thinking. The shallow read approximates human intuition; the deep read approximates an accurate verdict. The wider the gap between how bad a move looks to the shallow read and how good it proves under deep search, the higher its 'counter-intuitive' score. The dominant term is the 'critical depth,' the paper's central metric: how deep the engine had to search before it first found the solution.
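To make the critical-depth idea concrete, here is a minimal Python sketch. It is not the paper's formula. The function name `critical_depth`, the input format (one preferred move per search depth, which you would collect from an engine yourself), and the sample moves are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def critical_depth(best_move_by_depth: Sequence[str], solution: str) -> Optional[int]:
    """Return the 1-based search depth at which `solution` first becomes
    the engine's preferred move, or None if it never does."""
    for depth, move in enumerate(best_move_by_depth, start=1):
        if move == solution:
            return depth
    return None


# Hypothetical engine output: the preferred move at depths 1 through 6.
# The queen sacrifice only becomes the top choice at depth 5.
moves = ["Nf3", "Nf3", "Rd1", "Rd1", "Qxh7+", "Qxh7+"]
print(critical_depth(moves, "Qxh7+"))  # prints 5
```

A larger return value means the shallow read stayed fooled for longer, which is the sense in which the score tracks counter-intuitiveness.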
Generation works like this. A board is converted into FEN (Forsyth-Edwards Notation, a short string encoding the piece layout), and the model 'writes' a board the same way it would write text. The authors trained and compared several architectures — a Transformer (a standard model used in text generation) and diffusion models among them — on Lichess data only. They then run reinforcement learning that hands out the metric above as a reward: +1 for a unique and counter-intuitive board, 0 for a legal but ordinary one, −2 for an illegal position.
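As a rough illustration of the two ingredients in this paragraph, the Python sketch below expands the board field of a FEN string into an 8x8 grid and encodes the three reward levels quoted above. The helper names `expand_fen_board` and `reward` are hypothetical, and the three boolean flags are assumed to come from an engine-based check that is not shown here.

```python
def expand_fen_board(fen: str) -> list[str]:
    """Expand the piece-placement field of a FEN string into 8 rows of
    8 characters, using '.' for empty squares."""
    placement = fen.split()[0]
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit means that many consecutive empty squares.
            row += "." * int(ch) if ch.isdigit() else ch
        rows.append(row)
    return rows


def reward(is_legal: bool, is_unique: bool, is_counter_intuitive: bool) -> int:
    """Three-level reward following the values quoted in the text.
    How each flag is computed is an assumption left to the caller."""
    if not is_legal:
        return -2
    if is_unique and is_counter_intuitive:
        return 1
    return 0


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
for row in expand_fen_board(start):
    print(row)
print(reward(is_legal=True, is_unique=True, is_counter_intuitive=False))  # prints 0
```

Because a FEN string is plain text, a text generator can emit one character by character, and a checker like this can parse it back into a board.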
But naively maximizing reward makes the AI churn out the same high-reward puzzle over and over, losing all variety (the authors call this entropy collapse). The model also cheated by adding pieces, say a second white queen, to inflate its reward. So the authors stabilized training by combining three safeguards: a 'diversity filter' that only accepts boards sufficiently different from earlier ones, a constraint that rejects impossible piece counts, and a mechanism that keeps the model from straying too far from the source data.
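Two of these safeguards can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the piece-count rule (extra queens, rooks, bishops, or knights must be paid for by missing pawns), the use of Hamming distance between boards, and the threshold of 4 squares are all choices made here for the example.

```python
from collections import Counter

# Starting counts per side; anything beyond these must come from promotion.
BASE = {"q": 1, "r": 2, "b": 2, "n": 2}


def squares(fen: str) -> str:
    """Flatten the FEN piece-placement field into a 64-character string."""
    flat = ""
    for ch in fen.split()[0].replace("/", ""):
        flat += "." * int(ch) if ch.isdigit() else ch
    return flat


def plausible_piece_counts(fen: str) -> bool:
    """Reject boards whose material could not arise in a real game."""
    board = squares(fen)
    if len(board) != 64:
        return False
    for is_side in (str.isupper, str.islower):  # white pieces, then black
        counts = Counter(c.lower() for c in board if is_side(c))
        if counts["k"] != 1 or counts["p"] > 8:
            return False
        extras = sum(max(0, counts[p] - BASE[p]) for p in BASE)
        if extras > 8 - counts["p"]:  # each extra piece costs one pawn
            return False
    return True


class DiversityFilter:
    """Accept a board only if it differs from every previously accepted
    board on at least `min_distance` squares (an assumed criterion)."""

    def __init__(self, min_distance: int = 4):
        self.min_distance = min_distance
        self.accepted: list[str] = []

    def accept(self, fen: str) -> bool:
        board = squares(fen)
        for old in self.accepted:
            if sum(a != b for a, b in zip(board, old)) < self.min_distance:
                return False
        self.accepted.append(board)
        return True


two_queens = "4k3/8/8/8/8/8/PPPPPPPP/QQ2K3 w - - 0 1"
print(plausible_piece_counts(two_queens))  # prints False
```

The second white queen is rejected because all eight white pawns are still on the board, so no promotion could have produced it.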
Findings: A Tenfold Rise in Counter-Intuitive Puzzles
The headline result is that reinforcement learning sharply increased the generation of counter-intuitive puzzles. Per the paper, the generation probability rose from 0.22% (supervised training of a Transformer on Lichess data) to 2.5%. That beats the best Lichess-trained model's rate of 0.4% and exceeds even the 2.1% rate present in the source data. The authors describe this as roughly a tenfold increase.
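A quick arithmetic check of the 'roughly tenfold' description, using only the percentages quoted above. The batch size of 10,000 is a hypothetical figure chosen to make the rates tangible.

```python
supervised = 0.0022  # 0.22%, supervised Transformer
rl_tuned = 0.025     # 2.5%, after reinforcement learning
dataset = 0.021      # 2.1%, rate in the Lichess source data

ratio = rl_tuned / supervised
print(f"improvement over supervised: {ratio:.1f}x")
print(f"relative to the source data: {rl_tuned / dataset:.2f}x")

samples = 10_000  # hypothetical batch size
print(f"expected counter-intuitive puzzles per {samples} samples: "
      f"{supervised * samples:.0f} before, {rl_tuned * samples:.0f} after")
```

The ratio comes out a little above eleven, which is consistent with the rounded 'tenfold' wording.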
As for aesthetics, even though it was not built directly into the reward, the authors observe that aesthetic themes were well preserved in the generated puzzles. And in human evaluation, several of the generated puzzles were rated as more creative, enjoyable, and counter-intuitive than composed puzzles from chess books, with some approaching classic compositions. The final booklet's creativity was acknowledged by three world-renowned chess experts.
This result did not hold without the diversity filter. The authors state the filter prevented reward hacking and was essential to stable training. It reads as if the key to creativity was not 'just raise the reward' but 'keep making things that are new and varied.'
Where to Use This: For Puzzle and Game Makers
First, the idea of measuring 'interestingness' by machine. Scoring counter-intuitiveness as 'the gap between a shallow read and a deep read' works beyond chess. If you are making a Sokoban-style logic puzzle, have a weak solver (shallow search) and a strong solver (deep search) both attempt a level; levels where their verdicts disagree are the ones where the trap bites. You can estimate difficulty and trickiness numerically rather than relying solely on human playtesting.
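A minimal Python sketch of the weak-solver versus strong-solver idea follows. Everything here is an assumption for illustration: the 'solvers' are the same breadth-first search run with a small and a large depth budget, and the toy puzzle (reach 10 from 1 using +1 and x2) stands in for a real level.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Optional


def solve_depth(start: Hashable,
                is_goal: Callable[[Hashable], bool],
                neighbors: Callable[[Hashable], Iterable[Hashable]],
                max_depth: int) -> Optional[int]:
    """Breadth-first search. Return the fewest moves needed to reach a
    goal state, or None if none is reachable within max_depth moves."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if is_goal(state):
            return depth
        if depth == max_depth:
            continue  # budget exhausted on this branch
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None


def is_tricky(start, is_goal, neighbors, weak_budget=2, strong_budget=10) -> bool:
    """A level counts as tricky when the shallow search fails
    but the deep search succeeds."""
    weak = solve_depth(start, is_goal, neighbors, weak_budget)
    strong = solve_depth(start, is_goal, neighbors, strong_budget)
    return weak is None and strong is not None


def step(n):
    """Toy puzzle moves: add one or double."""
    return [n + 1, n * 2]


print(solve_depth(1, lambda n: n == 10, step, 10))  # prints 4
print(is_tricky(1, lambda n: n == 10, step))        # prints True
```

In a real project the weak solver might be a greedy heuristic rather than a depth-limited search; the point is only that the disagreement between the two is the signal.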
Second, the generate-then-verify pipeline. This work inspects every generated board with an engine for legality, uniqueness, and counter-intuitiveness, keeping only those that meet the bar. If you mass-produce hyper-casual game levels with PCG (Procedural Content Generation), the same structure applies directly: don't ship the generator's raw output — automatically sieve for solvable, single-solution, target-difficulty levels.
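The sieve described above can be sketched as a chain of named checks. This is a generic Python illustration, not the paper's pipeline: the level fields (`solutions`, `difficulty`), the target difficulty band, and the function name `sieve` are hypothetical, and the values are assumed to have been computed by a solver beforehand.

```python
from collections import Counter


def sieve(candidates, checks):
    """Keep only candidates that pass every check, in order.
    Also count which check rejected each discarded candidate."""
    kept = []
    rejected = Counter()
    for item in candidates:
        for name, check in checks:
            if not check(item):
                rejected[name] += 1
                break
        else:  # no check failed
            kept.append(item)
    return kept, rejected


# Hypothetical generator output, already annotated by a solver.
levels = [
    {"id": "a", "solutions": 1, "difficulty": 4},
    {"id": "b", "solutions": 0, "difficulty": 0},
    {"id": "c", "solutions": 3, "difficulty": 5},
    {"id": "d", "solutions": 1, "difficulty": 9},
]

checks = [
    ("solvable", lambda lv: lv["solutions"] >= 1),
    ("single_solution", lambda lv: lv["solutions"] == 1),
    ("target_difficulty", lambda lv: 3 <= lv["difficulty"] <= 6),
]

kept, rejected = sieve(levels, checks)
print([lv["id"] for lv in kept])  # prints ['a']
print(dict(rejected))
```

Tracking the rejection reasons is a deliberate choice: if one check discards nearly everything, that tells you where the generator needs work.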
Third, guarding against reward hacking and collapse of diversity. Left to pure optimization, the AI earns reward by cheating (adding pieces) and floods you with the same answer until players get bored. If you auto-generate levels with reinforcement learning or search, assume this failure is nearly certain and build in diversity filters and constraints that reject impossible states from the start.
Fourth, a practical point: your existing pile of user submissions or play logs can serve directly as 'exemplar data.' Just as this work built on Lichess's public puzzles, a two-stage approach — first train a generative model on your game's existing content, then nudge it toward a target with rewards — generalizes well.
Limitations: What the Authors Admit, and What I Noticed
Start with the weaknesses the authors themselves admit. First, aesthetics was not built directly into the reward; they only observed that it was preserved as a result — beauty was not deliberately optimized. Second, naive RL readily falls into entropy collapse and reward hacking (the piece-adding cheat), and was not stable without the training wheels of a diversity filter. Third, even after the improvements, the share of counter-intuitive boards is 2.5%, so most output still fails the bar. Human evaluation says the puzzles 'approach' classic compositions, but the authors do not write that they surpassed them.
What follows are points I, Fukai, noticed while reading. First, the method leans heavily on chess, an exceptionally favorable setting where a powerful engine hands you the 'right answer.' A brand-new original puzzle has no Stockfish-equivalent referee. Second, defining counter-intuitiveness as 'hard for a shallow search but solvable by a deep one' is a proxy for human intuition, not an exact match; where an engine stumbles and where a human stumbles need not be the same.
Third, I was not able to check the experiment's numerical tables or the details of the human evaluation (the number of participants, the procedure) against the original paper. So I take the conclusion that 'experts rated it highly' as the authors' statement, and I will not assert anything about its scale or rigor myself. Given that it is still a preprint, it seems prudent to read the results, for now, as a promising sign.
Fukai's Reading
What follows is explicitly my own interpretation. I want to read this study as the automation of surprise. At the core of a good puzzle is a move that flips your intuition the instant you solve it. This work captures that 'flip' as the gap between a shallow and a deep read, turning it into a quantity a machine can chase. In the vocabulary of design criticism, I'd frame it as translating a designer's tacit sense of a 'satisfying betrayal' into an external yardstick: the engine's search curve. Rather than creating creativity itself, I read it as work that peeled back one layer of the 'unmeasurability' that sits just in front of creativity.
Closing
For those who want to dig deeper: Tom Zahavy, one of this paper's senior authors, and colleagues previously published work on giving AlphaZero diverse playing styles to elicit creative moves. The present paper extends that line of inquiry, and reading the two together reveals a map of 'aiming a strong AI not at strength but at interestingness.'
And if you're interested in having AI build puzzle games themselves, ScriptDoctor — the study we covered earlier on this site, which auto-generates and auto-verifies PuzzleScript games with an LLM and tree search — makes a fine contrast, sharing the generate-then-machine-verify idea. From the closed world of chess to a world where you build the rules themselves: place them side by side as two different scales on the map of AI creativity.
References
Papers and related materials referenced in this article:
・Lichess tactics training / Lichess Puzzler (source of the training data in this study)
・Related work: Tom Zahavy et al., creative chess via diverse styles in AlphaZero (Google DeepMind, 2023) — a precursor sharing the same problem framing