PAPER-DIGEST · 2026-06-28

Liu et al.: AI Assistance Erodes Persistence — A Warning for Hint Design — Fukai Reads

AI assistance and cognitive debt / learning & hint design

Reviewed by Fukai · #paper-digest #research #human-ai-interaction #learning #hint-design #persistence #serious-games #cognitive-offloading #game-design


TL;DR

AI assistants (here, conversational generative AI like ChatGPT) will return a complete answer the instant you ask. Convenient. But what does that convenience leave behind in the user's own ability to solve problems and to persist? This paper gives causal evidence on that question through randomized controlled trials (RCTs, where participants are assigned to conditions by lottery so effects can be compared rigorously) with 1,222 participants in total. Across two task types, fraction arithmetic and reading comprehension, AI help did raise performance during the working session. But on a test given right after the AI was taken away, the people who had used AI solved fewer problems and were more likely to 'skip' (give up).

The effect appeared after just 10-15 minutes of use. The authors frame it as a drop in persistence (the capacity to keep working at a hard task), which research treats as one of the strongest predictors of long-term learning. What matters is how the AI was used: people who had the AI produce the answer itself declined most, while those who asked for hints or explanations showed almost none of this drop. This article explains the gist so you don't need to open the paper, and closes with concrete implications for puzzle and game hint design.

Introduction

The authors are Grace Liu (Carnegie Mellon University), Brian Christian (University of Oxford; known for the popular book 'The Alignment Problem'), Tsvetomira Dumbalska (also Oxford), Michiel A. Bakker (MIT), and Rachit Dubey (UCLA). The paper is an arXiv preprint (arXiv:2604.04721, version of 7 April 2026) and has not yet been peer-reviewed. Still, it reports three randomized controlled trials run under IRB-approved procedures, which places it among the sturdier preprints.

I picked this study today because its theme gives a fresh, experimental answer to an old worry at the heart of puzzle and game design: how much help should you hand out? Everyone who makes games wants to help a stuck player. But help too much and you take away the joy of solving it yourself, and the very mastery the game is supposed to build. This paper shows that tug-of-war in numbers.

Background

It was already known that 'cognitive offloading' (using tools like calculators, notes, or search engines to reduce the mental load of a task) improves performance while the aid is available, yet performance drops when the aid is removed. AI extends this dynamic across essentially every reasoning domain, raising concerns about overreliance and deskilling (skills eroding because the tool does the work and they go unpracticed).

But the authors note that prior evidence was largely correlational, from surveys or interviews (it can say 'heavy AI users tend to score lower' but not whether AI is the cause), or limited to very small samples. The novelty here is using RCTs, assigning participants to conditions by lottery, to show on a large scale a causal link: AI assistance lowers both unassisted performance and persistence.

Why it matters: in cognitive science and education, the ability to regulate effort and endure difficulty (persistence) is treated as one of the strongest predictors of long-term academic and workplace success. The authors argue AI can erode that very foundation. Short-term convenience and long-term competence can collide, and that collision is the paper's starting point.

Approach / Method

Every experiment shares the same skeleton. Participants (US-based, recruited on the online platform Prolific) are split by lottery into two groups. One (the AI group) can use a conversational AI (GPT-5 in the paper) from a sidebar during a learning phase. The AI was pre-prompted with each problem and its solution, so a participant could simply type 'answer?' and get the correct solution instantly. The other (control) group solves the same problems without AI.

The crux is this design choice: after the learning phase, the AI is removed without warning, and participants must solve 3 final test problems without any AI. That is the real measure of how much they can solve on their own. Every problem also has a 'skip' button, and participants are told there is no penalty for wrong answers. So the authors treat a skip not as a sign of missing ability but as a deliberate decision to stop engaging, and they use the skip rate as their measure of persistence.
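To make the two outcome measures concrete, here is a minimal sketch of how a solve rate and a skip rate could be computed for one participant's 3 AI-free test problems. It assumes both rates are per-participant proportions that are later averaged within each group; the digest does not spell out the paper's exact scoring, and the names `Outcome`, `FinalTestResult`, `solve_rate`, and `skip_rate` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    WRONG = "wrong"
    SKIPPED = "skipped"


@dataclass
class FinalTestResult:
    participant_id: str
    outcomes: list[Outcome]  # one entry per AI-free test problem


def solve_rate(result: FinalTestResult) -> float:
    """Share of test problems answered correctly (assumed scoring)."""
    solved = sum(o is Outcome.CORRECT for o in result.outcomes)
    return solved / len(result.outcomes)


def skip_rate(result: FinalTestResult) -> float:
    """Share of test problems the participant chose to skip."""
    skipped = sum(o is Outcome.SKIPPED for o in result.outcomes)
    return skipped / len(result.outcomes)


example = FinalTestResult(
    participant_id="p001",
    outcomes=[Outcome.CORRECT, Outcome.SKIPPED, Outcome.WRONG],
)
print(solve_rate(example), skip_rate(example))
```

Keeping wrong answers and skips as separate outcomes is what lets the skip rate stand in for persistence: a wrong answer still counts as an attempt.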

There are two task types. Experiments 1 and 2 use fraction arithmetic (getting harder from one-step to two-step to three-step); Experiment 3 uses reading comprehension (SAT practice items). In Experiment 2, to fix a weakness of Experiment 1, the authors added a pretest to gauge ability before deciding exclusions, and gave the control group its own sidebar (showing already-seen pretest solutions) so that both groups experienced the same interface change. Experiment 3, reading comprehension, draws on a different kind of cognition (meaning-making) and serves as a replication to see whether the effect holds.

Findings

In Experiment 1 (307 after exclusions; 185 AI, 122 control), the solve rate on the 3 AI-free test problems was 0.57 on average (s.d. 0.41) for the AI group versus 0.73 (s.d. 0.34) for control, favoring control (paper Section 2.2: t(305)=-3.64, P<0.001, effect size Cohen's d=-0.42). The skip rate was 0.20 for the AI group versus 0.11 for control (P=0.031, d=0.25). So the group that had used AI solved less on its own and gave up more.
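As a quick arithmetic check on the Experiment 1 figures, the effect size can be recomputed from the reported means, standard deviations, and group sizes. This is a minimal sketch assuming the pooled-standard-deviation form of Cohen's d and an equal-variance two-sample t statistic; the digest does not say which variant the authors used, and the function names are my own.

```python
import math


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> float:
    """Standardized mean difference (group 1 minus group 2)."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def t_from_d(d: float, n1: int, n2: int) -> float:
    """Equal-variance two-sample t statistic implied by d and group sizes."""
    return d / math.sqrt(1 / n1 + 1 / n2)


# Experiment 1 solve rates as quoted above: AI group first, control second.
d = cohens_d(0.57, 0.41, 185, 0.73, 0.34, 122)
t = t_from_d(d, 185, 122)
print(f"d = {d:.2f}, t({185 + 122 - 2}) = {t:.2f}")
```

With these rounded inputs, d comes out at about -0.42, matching the reported value. The t statistic comes out at about -3.57, slightly off the reported -3.64, a gap that is plausibly due to rounding in the summary statistics.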

Experiment 2 (585 after exclusions) replicated this. Test solve rate was 0.71 (AI) versus 0.77 (control) (P=0.020, d=-0.19). However, the skip-rate difference (0.10 vs 0.07) was not significant in the aggregate (P=0.239). The authors attributed this to variation in how people used the AI and looked at the breakdown.

That breakdown is the paper's most practically useful finding. 61% of the AI group (189 people) self-reported using the AI mainly to get answers directly; 27% (82) used it for hints or clarifications; 12% (37) reported not using it. Although the three groups did not differ at pretest, on the final test the 'direct-answer' group had the lowest solve rate (0.65 vs control 0.77) and skipped more, while the 'hints' group did not differ much from control and showed no decline. The authors explicitly note that this breakdown is cross-sectional (correlational, not causal), because participants chose for themselves how to use the AI.

Experiment 3 (reading comprehension, 168 after exclusions) reproduced the same direction: test solve rate 0.76 (AI) vs 0.89 (control) (P=0.007, d=-0.42), skip rate 0.08 vs 0.01 (P=0.008, d=0.42). The authors say this suggests the persistence drop is not an artifact of math tasks but a general consequence of AI-assisted problem solving.

Where you can use this

Here are concrete takeaways for puzzle and game makers. (1) Move hint design from 'reveal the answer' to 'tiered hints.' The most suggestive result is that the decline in independent ability and persistence was concentrated among participants who took answers directly, while those who asked only for hints barely declined (the comparison is cross-sectional, so not conclusive, but the direction is clear). If you're building an escape room or a nonogram (picross), a staged hint (first a direction, then one move, finally the solution) is more likely to protect both the player's growth and the feeling of having solved it themselves than a one-tap 'show solution' button.

(2) In educational or serious games, don't make 'in-session performance' your success metric. AI and strong aids raised the solve rate during work but lowered ability once removed. For a math or language learning game, the main yardstick should be a transfer test without the aid (can they solve it on their own elsewhere), not the assisted clear rate. In live ops, watch for the gap where engagement metrics rise while independent mastery falls.

(3) Don't condition players toward 'instant resolution' in tutorials and difficulty curves. The authors propose a self-reinforcing mechanism: AI shifts the reference point for how long a task 'should' take, so unaided work feels comparatively harder and dropout rises (they liken it to hedonic adaptation, where habituation shifts the baseline of how things feel). Even in hypercasual games, spoon-feeding solutions early can make players quit the moment a little thinking is required. Designs that preserve a measured 'productive struggle' work better.

(4) If you add an AI companion or buddy character, build in the judgment to *not* help. The good mentor the authors hold up promotes independence by sometimes withholding help. An in-game assistant can likewise gauge how stuck the player is and ration its help, e.g. varying how much it reveals based on the count and spacing of hint requests, rather than always giving the shortest answer.
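Takeaways (1) and (4) can be combined in a small sketch: a tiered hint ladder that reveals a more specific hint only when requests are spaced far enough apart. This is one possible design and is not taken from the paper. The class name `HintLadder`, the 60-second default gap, and the example hint texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HintLadder:
    """Tiered hints for one puzzle, ordered from least to most revealing."""
    tiers: list[str]
    min_gap_seconds: float = 60.0          # assumed spacing before escalating
    revealed: int = 0                      # number of tiers shown so far
    last_escalation: Optional[float] = None

    def __post_init__(self) -> None:
        if not self.tiers:
            raise ValueError("a hint ladder needs at least one tier")

    def request(self, now: float) -> str:
        """Return the hint for a request made at time `now` (in seconds)."""
        waited_enough = (
            self.last_escalation is None
            or now - self.last_escalation >= self.min_gap_seconds
        )
        if self.revealed < len(self.tiers) and waited_enough:
            # Enough time since the last new hint: reveal the next tier.
            self.revealed += 1
            self.last_escalation = now
        # Otherwise repeat the most specific hint already shown.
        return self.tiers[self.revealed - 1]


ladder = HintLadder(
    tiers=[
        "Look at the rows with the largest clues first.",
        "The third cell in the top row must be filled.",
        "Here is the full solution grid.",
    ],
)
print(ladder.request(now=0))    # direction only
print(ladder.request(now=10))   # too soon: same hint again
print(ladder.request(now=90))   # spaced out: one concrete move
```

Rapid repeat requests get the same hint back instead of the next tier, so the full solution is reachable only after the player has spent some time with each weaker hint. A real game would tune the gap per puzzle, or replace it with a richer signal of how stuck the player is.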

Limitations

I (Fukai) flag two kinds of limitation here: those the authors themselves admit, and those I noticed while reading. First, the authors' acknowledged weaknesses. Experiment 1 may have retained low-ability participants who produced correct answers only via AI, inflating the apparent gap (fixed in Experiment 2 via a pretest). The by-usage breakdown ('direct answer' vs 'hints') is explicitly cross-sectional, not causal. The effects were measured in single sessions of just 10-15 minutes, so cumulative effects over months or years remain a hypothesis ('this could happen'). The aggregate skip-rate difference in Experiment 2 was also not significant.

Second, what I noticed. Participants are paid US adults on Prolific, not children in a classroom and not players of an actual game. The 'no-penalty skip' is a clean proxy for persistence, but the motivation behind it differs from a player's choice to retry a level or pay for a hint. And the AI was set up as the most extreme case, instant and complete answers on demand; real game hints are often tiered, paid, or capped, and that gap is not small. Conversely, the paper's finding that the hints group did not decline reads as a tailwind for tiered hint design. I'll also restate that this is an arXiv preprint, not a peer-reviewed final version.

Fukai's Reading

Let me mark this clearly as my own reading. I want to read this study as the learning-side version of the old game-design maxim, 'don't take interesting decisions away from the player.' An AI that hands over the answer instantly takes over the whole chain (getting stuck, trying, having the insight) on the player's behalf. In the vocabulary of design criticism, this is 'removing too much friction.' What good puzzle games have protected is not the stress of being unable to solve, but the pleasant tension just before solving, and that tension is exactly what produces the reward of solving on your own and the persistence to try again. AI hints can be the most powerful friction-removal tool there is, but remove too much and you erase the very source of the reward. This paper, as I take it, shows that over-removal can happen fast enough to measure in ten minutes.

Closing

For those who want to go deeper: read this alongside the classic that organized cognitive offloading (Risko & Gilbert, 2016) and Bjork and colleagues' learning research on 'desirable difficulties,' and the map comes into view. On the game side, set it next to work on tiered hint design and player-model-based adaptive assistance, and the paper's question, 'when not to help,' moves closer to implementation. Next time you add an AI assistant or a hint feature to a game, it's worth asking once: what will the player become able to do *without* it?

References

Papers and related material referenced in this article:

・AI Assistance Reduces Persistence and Hurts Independent Performance (Grace Liu, Brian Christian, Tsvetomira Dumbalska, Michiel A. Bakker, Rachit Dubey, 2026, arXiv preprint arXiv:2604.04721)

・Project page by the authors

・Background: Desirable Difficulties in learning (Bjork et al.)
