PAPER-DIGEST · 2026-06-29
Bazzaz et al.: Believing It's AI Changes the Experience — Fukai Reads
HCI / perception bias of generated content and player experience
TL;DR
As generative AI rapidly enters games, how do players actually feel about that content? This paper takes two 2D titles -- the platformer Super Mario Bros. and the box-pushing grid puzzle Sokoban -- mixes human-made and machine-generated levels, has 142 people play them, and collects both a guess ("AI-generated or human-made?") and how the level felt.
Players could barely tell the creator apart (accuracy was essentially chance), yet levels they believed were AI-made were rated less fun, harder, and more frustrating. In other words, it was the belief about the creator, not the real creator, that shaped experience. This article lays out the paper so you can grasp it without opening the PDF.
Introduction
The authors are Mahsa Bazzaz and Seth Cooper, both at Northeastern University in the USA. The venue is ACM CHI '26 (the 2026 Conference on Human Factors in Computing Systems, one of the largest international venues in HCI), where it appears as a peer-reviewed conference paper. A preprint (the pre-review manuscript) was posted to arXiv on 15 February 2026, and that is the version I read.
I picked this paper today because it speaks directly to practitioners. Generating levels and assets with AI is no longer a lab topic; it is on Steam's shelves. The paper notes roughly 10,000 games on Steam tagged "AI content" and about 10,200 tagged "procedural content." Against the backdrop of Steam's January 2024 policy requiring developers to disclose how they use AI, the study asks an empirical question: does disclosure actually help players judge well?
Background
The recurring question in this area has been a "Turing test" one. Borrowing the Turing test (the classic frame of whether a machine's output is distinguishable from a human's), many studies across text, images, audio, and art have measured whether generated content can be told apart from human-made work. In games too, the Mario AI competitions of 2009-2012 had a "Turing Test track," and similar comparisons on Sokoban levels were attempted about a decade ago.
But the authors separate two things: whether you can tell, versus how thinking you can tell shapes your experience. Psychology offers a robust frame here -- placebo and nocebo effects (even when nothing changes, expecting good makes it feel better and expecting bad makes it feel worse). Indeed, prior work found that merely telling players a game had adaptive difficulty (when it did not) raised immersion (Denisova and Cairns, 2015), and people who believed they had AI help solved more word puzzles.
This paper brings that thread to level evaluation. What is new is the design: rather than priming (deliberately planting an expectation via labels), it observes players' spontaneous guesses. In a world where platforms do not disclose well, what happens when players draw their own conclusion that "this might be AI"? That is what the authors set out to watch.
Approach
The method is mixed-methods (analyzing both numeric ratings and free-text responses). The domains are two 2D tile-based titles -- Super Mario Bros. and Sokoban -- chosen because both are standard benchmarks in PCG (Procedural Content Generation, the automatic generation of content) research and are simple enough for an online study. Pairing two games that differ sharply in genre, goal, and cognitive load lets the authors check whether findings hold across forms of play.
There were 60 unique levels: per game, 15 human-made and 15 AI-generated. The human Mario levels were the 15 originals from the public VGLC dataset; the Sokoban ones were sampled at random from a public set of 1,150. The AI levels came from six generation methods chosen via a literature review (spanning constraint-based, machine-learning, and large-language-model approaches). The authors guard against two pitfalls: cherry-picking good outputs skews results, so selection was random; and obvious breakages (e.g., unreachable goals) make AI trivially detectable, so only "playable" and visually "acceptable" levels were kept.
Participants were 154 people recruited via Prolific (an online study-participant platform); after exclusions, 142 valid respondents each played 6 randomly drawn levels (852 trials in total, about 14 judgments per level). For each level they made a two-choice guess ("AI-generated or human-made?") plus a confidence rating, rated five experience dimensions -- fun, challenge, frustration, surprise, and design quality -- on a 5-point scale, and finally wrote why they decided as they did. Analysis used a statistical model that absorbs per-person and per-level variation (ordinal logistic regression with random effects -- in plain terms, a tool for seeing the trend after subtracting individual differences).
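The trial arithmetic above can be checked with a short simulation. This is only a sketch under my own assumptions, not the authors' procedure: it assumes each participant draws 6 distinct levels uniformly at random from all 60, whereas the real assignment may have been stratified by game or creator. The names `simulate_assignment`, `N_PARTICIPANTS`, and so on are hypothetical.

```python
import random
from collections import Counter

N_PARTICIPANTS = 142    # valid respondents reported in the digest
LEVELS_PER_PERSON = 6   # levels each respondent played
N_LEVELS = 60           # 2 games x (15 human-made + 15 AI-generated)


def simulate_assignment(seed=0):
    """Count judgments per level under uniform random draws.

    Assumption: every participant gets 6 distinct levels sampled
    uniformly from the pool of 60 (the study may have stratified).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(N_PARTICIPANTS):
        counts.update(rng.sample(range(N_LEVELS), LEVELS_PER_PERSON))
    # Index every level explicitly so a level drawn zero times still appears.
    return [counts[level] for level in range(N_LEVELS)]


per_level = simulate_assignment()
total_trials = sum(per_level)
mean_per_level = total_trials / N_LEVELS

print("total trials:", total_trials)            # 142 * 6
print("mean judgments per level:", mean_per_level)
print("min / max per level:", min(per_level), max(per_level))
```

The total and the mean are fixed by the design (142 x 6 trials spread over 60 levels); only the spread around the mean depends on the random draw, which is why "about 14 judgments per level" is an average rather than a guarantee.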
Findings
First, the guessing game. Accuracy was 53% (430 of 812); after balancing trial counts across human and AI levels, a test found no statistical difference from chance (50%) (two-sided binomial test p = .099, 95% CI [49.5%, 56.4%]; Section 4.1). The false-positive rate (calling a human level AI) was 26.6% and the false-negative rate 26.3% -- nearly symmetric. Players simply could not tell. Moreover, more frequent players were more confident (gaming frequency affected confidence) but no more accurate; if anything, infrequent players guessed slightly better.
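For readers who want to see where "not different from chance" comes from, here is a minimal sketch that recomputes the test from the quoted counts (430 correct out of 812). It uses only the standard library. The exact two-sided binomial test follows the description in the text; the Wald (normal-approximation) interval is my assumption, since the digest does not say which interval method the authors used. Function names are hypothetical.

```python
from math import comb, sqrt


def binom_two_sided_p(successes, n):
    """Exact two-sided binomial test against a null of p0 = 0.5.

    With p0 = 0.5 the null distribution is symmetric, so the two-sided
    p-value is twice the smaller tail probability, capped at 1.
    """
    upper = sum(comb(n, k) for k in range(successes, n + 1))
    lower = sum(comb(n, k) for k in range(0, successes + 1))
    tail = min(upper, lower)
    return min(1.0, 2 * tail / 2 ** n)


def wald_ci(successes, n, z=1.959964):
    """95% Wald interval for a proportion (assumed method, see lead-in)."""
    p_hat = successes / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


correct, trials = 430, 812            # counts as quoted in the digest
accuracy = correct / trials
p_value = binom_two_sided_p(correct, trials)
ci_low, ci_high = wald_ci(correct, trials)

print(f"accuracy = {accuracy:.3f}")
print(f"two-sided p = {p_value:.3f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Run as written, the p-value and interval should land close to the reported p = .099 and [49.5%, 56.4%]. The interval straddles 50%, which is the same statement as the non-significant test.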
The experience ratings were sharper. Split by the true creator the gaps were modest, but split by the believed creator they stood out. Levels believed human-made were rated more fun (belief effect beta = 1.54, z = 9.52, p < .001, while the true creator was not significant) and better designed (z = 10.480, p < .001). Conversely, levels believed AI-made were rated more frustrating (beta = -1.17, z = -7.445, p < .001) and more challenging (z = -2.41, p < .015; all from Section 4.4). In Table 1's means: fun was 2.92 for "perceived AI" vs 3.72 for "perceived human," frustration 3.60 vs 2.84, design 2.70 vs 3.57 (5-point). Only "surprise" was unrelated to belief; there the lone effect was Sokoban being rated less surprising than Mario.
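To make the size of the belief gap concrete, this small sketch subtracts the Table 1 means quoted above. These are raw differences of means on the 5-point scale, not the regression coefficients (beta) from the mixed model, and the dictionary layout and function name are my own.

```python
# Means from Table 1 as quoted in the digest (5-point scale).
table1 = {
    "fun":         {"perceived_ai": 2.92, "perceived_human": 3.72},
    "frustration": {"perceived_ai": 3.60, "perceived_human": 2.84},
    "design":      {"perceived_ai": 2.70, "perceived_human": 3.57},
}


def belief_gaps(means):
    """Return perceived-human minus perceived-AI mean per dimension."""
    return {
        dim: round(m["perceived_human"] - m["perceived_ai"], 2)
        for dim, m in means.items()
    }


gaps = belief_gaps(table1)
for dim, gap in gaps.items():
    print(f"{dim}: {gap:+.2f}")
```

A positive gap means the level was rated higher when believed human-made; frustration comes out negative because believed-AI levels were rated more frustrating.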
The free-text analysis of reasons (inter-rater agreement Cohen's kappa averaged 0.76; kappa measures agreement after subtracting chance) is also striking. Players leaned on cues such as the feel of play, layout coherence, reachability, apparent design intent, comparison to familiar games, and assumptions about "what AI would do." But the authors stress that the same cue supported opposite conclusions. An identical setup -- an enemy placed right beside the spawn point -- was read by one player as "evidence a human made it to troll others" and by another as "no human would build this, so it's AI." The authors frame this as a sign that human-likeness judgments are subjective and fallible.
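Since kappa is described only in passing, here is a minimal sketch of how Cohen's kappa subtracts chance agreement. The two coders' labels below are invented for illustration (they are not the paper's data), and the category names simply echo cues mentioned in the text.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (observed - expected) / (1 - expected), where `expected`
    is the agreement two raters would reach by chance given how often
    each of them uses each category.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        raise ValueError("kappa undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to ten free-text answers.
coder_1 = ["layout", "layout", "reach", "feel", "feel",
           "layout", "reach", "feel", "layout", "reach"]
coder_2 = ["layout", "layout", "reach", "feel", "layout",
           "layout", "reach", "feel", "layout", "feel"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

In this toy data the coders agree on 8 of 10 answers (raw agreement 0.80), but chance alone would produce 0.35, so kappa comes out near 0.69. That is the sense in which a reported average of 0.76 is stricter than a raw agreement percentage.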
Finally, attitudes toward PCG versus generative AI. Their rating distributions differed clearly (chi-square(16) = 473.71, p < .001), with generative AI viewed more negatively. People more positive toward generative AI gave higher fun (z = 3.247, p = .0011) and design (z = 2.391, p = .0168) scores. In the free text, trust in rule-based, controllable PCG contrasted with worries about generative AI's unpredictability, error-proneness, training-data ethics, environmental cost, and impact on jobs.
Where you can use this
So how can game and puzzle makers use this? A caveat first: this is an observational study, not a causal claim that "making people think it's AI always makes it worse." With that caveat, here are uses I (Fukai) drew out.
First, if you are auto-generating Sokoban-like puzzles: alongside improving the generator, treat "the score drops the moment it's perceived as AI-made" as a design variable. Here, levels believed to be AI were rated roughly 0.8 to 0.9 points lower on fun and design (2.92 vs 3.72; 2.70 vs 3.57) and about 0.8 points higher on frustration (3.60 vs 2.84) on the 5-point scale. The same board may score differently on experience depending on framing and the wording of any label. It is worth A/B testing not just the artifact but also how it is presented.
Second, if you market PCG in a hyper-casual or exploration game: in the free text, rule-based PCG was received relatively warmly as "controllable and reliable," while generative AI triggered guardedness under a blanket 'AI' label. The authors argue that nuanced disclosure (their term) -- telling players where and why AI was used (e.g., used for ideation but assets drawn by humans) -- avoids unfair penalties better than a binary "used AI / didn't."
Third, be careful interpreting difficulty calibration. When players report "hard" or "frustrating," this study cannot separate objective board difficulty from "I judged it harshly because I thought it was AI." If you are A/B tuning difficulty, add a step that controls for creator bias -- e.g., hide creator info, randomize presentation order -- to avoid misreading the data. The prior work this paper cites also includes a case where merely claiming "adaptive difficulty" raised immersion. The framing itself moves experience: keep that as a hidden variable in difficulty design.
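One way to act on the "hide creator info, randomize presentation order" advice is sketched below. This is a hypothetical helper of my own, not anything from the paper: it assumes each level is a dict with an opaque `level_id` and a `creator` field, and it separates what the player sees from the answer key kept for analysis.

```python
import random


def build_blinded_session(levels, n_trials, seed=None):
    """Pick n_trials levels in random order and strip creator info.

    Returns (shown, key):
      shown -- list of trial dicts safe to send to the player
      key   -- trial index -> creator, kept server-side for analysis
    Assumption: level_id is opaque (e.g. "L07") and does not itself
    reveal the creator; rename IDs first if yours do.
    """
    rng = random.Random(seed)
    chosen = rng.sample(levels, n_trials)  # random subset in random order
    shown = [{"trial": i, "level_id": lv["level_id"]}
             for i, lv in enumerate(chosen)]
    key = {i: lv["creator"] for i, lv in enumerate(chosen)}
    return shown, key


# Hypothetical pool: 12 levels, alternating creator.
pool = [{"level_id": f"L{i:02d}", "creator": "human" if i % 2 else "ai"}
        for i in range(1, 13)]

shown, key = build_blinded_session(pool, n_trials=6, seed=42)
for trial in shown:
    print(trial)
```

Keeping `key` separate means difficulty ratings can later be split by true creator and by the player's guess, the same two-way comparison the paper reports, without the label ever reaching the player.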
Limitations
I start with the weaknesses the authors acknowledge. First, the domains are just two short 2D tile games, so generalization to other genres or long sessions is unclear. Second, each experience dimension was a single-item 5-point rating, less precise than established multi-item scales like GEQ or GUESS. Third, participants were mostly a general online sample with few expert level designers, limiting statistical power to detect expertise effects. Fourth -- most important -- the design is observational, so no causality: the authors state plainly they cannot tell whether "thinking it's AI made players nitpick" or "frustration led players to infer AI."
Two points I (Fukai) would add. One concerns restricting AI levels to "playable and visually acceptable" ones. It is a sound move for fair comparison, but it deliberately excludes the reality that broken outputs are common in the wild; to describe the experience of shipped generated content, you sometimes want the distribution that includes breakages. The other is that participants were limited to US, English-speaking adults. Attitudes toward AI vary widely by culture and generation, so I would be cautious about carrying the "negative attitude toward generative AI" result directly to other regions.
Fukai's Reading
From here on, this is my (Fukai's) reading. I want to place this study in a shift of PCG research from "automating quality" toward "designing trust." Turing-test work has competed over producing outputs indistinguishable from human ones. But what this paper presses on us, as I read it, is that even after outputs become indistinguishable, the players' mental image of the creator keeps coloring the experience. In the vocabulary of design criticism, this is not a generator-performance problem but a problem of a meta design layer -- presentation and disclosure. Making a good board is not enough; the context in which it is received becomes part of the design object. That, at least, is how I read this one.
Closing
For those who want to dig deeper: to trace the lineage of Turing tests in games that this paper builds on, read the early Sokoban work that used indistinguishability itself as a yardstick, along with Camilleri et al. (2016) on player believability in Mario. On expectation shaping experience, Denisova and Cairns' "phantom adaptive AI" study (CHI PLAY 2015) is an easy entry point. From economics, the "market for lemons" argument (Akerlof), in which the inability to tell real from fake erodes trust in everything, underlies the paper's "lemons dynamic" framing. If you want to try a generator yourself, co-author Cooper's constraint-based Sturgeon line of papers is a good doorway.