PAPER-DIGEST · 2026-07-03
Mirowski et al.: From Writing a Story to Finding One — Fabula, a Writing AI Grown With the Writers' Community — Fukai Reads
interactive narrative / story generation / participatory AI design
TL;DR
A Google DeepMind team built Fabula, a writing app for authors of fiction, screenplays and plays, and then critically co-developed it with the writers' community. At its core is a 'Drama Manager' that plans and generates a story as a hierarchy of scenes and their smallest units, 'beats,' by calling a large language model (LLM, an AI that learned patterns of language from huge amounts of text) many times to assemble the plot and the script.
Through interviews and writing sessions with 42 experts, the authors interrogate each of their design choices. In short: Fabula is strong at structure but weak at style and at producing the unexpected, and it resonated most with novices and screenwriters. I chose it today because 'Drama Manager' is a term straight out of interactive-narrative research in games.
Introduction
The authors are Piotr Mirowski, Ben Wedin, Reinald Kim Amplayo, Richard Evans and colleagues (all at Google DeepMind; co-author Lion Schulz at Bertelsmann). It is an arXiv preprint (arXiv:2606.14411, cs.HC, submitted 12 June 2026, CC BY 4.0); the version I read does not state it has passed peer review, so I treat it as a pre-review preprint.
Lead author Mirowski also created Dramatron (2023), which writes scripts via hierarchical prompt chaining (from a one-line logline down to characters, plot and dialogue). Fabula reads as a continuation of that line of work.
I reached for it on a day when the AI feeds were thin on puzzle papers, and this 41-page work stood out. What interests me is less the story generation itself than the authors' use of the design process as a tool to surface conflicts between developers and writers.
Background
Recent LLMs write fluently, yet human writers have judged their output as leaning on clichés, thin on subtext and symbolism, and prone to moralistic, predictable endings (the authors cite Chakrabarty et al., 2024). Still, newer models keep improving at creative writing.
Writing tools have proliferated: AI Dungeon and Lore Machine lean toward game-like immersion but make it hard to see the story's arc or edit earlier parts; Sudowrite Muse and SAGA build structured blueprints up front; Dramatron pioneered hierarchical prompt chaining. Fabula layers 'participatory AI' — bringing writers into the design — on top.
The authors note that prior participatory work often treats user feedback merely as material to polish a tool — what others have called 'participation washing.' Instead they use the prototype itself as a 'cultural probe' (a device placed to elicit people's values) to surface value clashes between developers and writers. Why it matters: the same friction between generative AI and craft shows up in game narrative production.
Approach
The core is the Drama Manager, an LLM-based system that handles a story at three levels — story plan, beat plan, and script — with roles for creating plans, generating script lines, and evolving plans dynamically. A story is a sequence of scenes; each scene a sequence of beats. A beat is a short passage (up to ~10 lines) where something dramatically significant happens; beat changes are cued by entrances/exits, location shifts, changes of topic, emotion or power, or shifts in a character's objective.
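To make the hierarchy concrete, here is a minimal sketch of how a story, its scenes and its beats could be held in memory. It is not the paper's implementation: the class names `Story`, `Scene` and `Beat`, their fields, and the `MAX_BEAT_LINES` soft cap (my reading of "up to ~10 lines") are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Soft cap from the digest's "up to ~10 lines"; an assumption, not a rule from the paper.
MAX_BEAT_LINES = 10


@dataclass
class Beat:
    summary: str                                     # what dramatically significant thing happens
    lines: List[str] = field(default_factory=list)   # script lines for this beat


@dataclass
class Scene:
    title: str
    beats: List[Beat] = field(default_factory=list)


@dataclass
class Story:
    premise: str
    scenes: List[Scene] = field(default_factory=list)

    def script(self) -> List[str]:
        """Flatten the hierarchy into script lines, in story order."""
        return [line for scene in self.scenes
                for beat in scene.beats
                for line in beat.lines]

    def oversized_beats(self) -> List[Tuple[int, int]]:
        """Return (scene index, beat index) pairs whose beats exceed the soft cap."""
        return [(s, b)
                for s, scene in enumerate(self.scenes)
                for b, beat in enumerate(scene.beats)
                if len(beat.lines) > MAX_BEAT_LINES]


# Hypothetical example content, invented for this sketch.
story = Story(
    premise="A lighthouse keeper hides a stranger.",
    scenes=[Scene("Arrival", [
        Beat("The stranger knocks.", ["KEEPER: Who is there?", "STRANGER: Let me in."]),
        Beat("The keeper relents.", ["KEEPER: One night. No more."]),
    ])],
)
print(len(story.script()), "script lines")
```

Keeping the plan levels as nested objects means an edit to one beat can be located and regenerated without touching the rest of the script, which is the property the three-level split seems designed for.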
The authors stack modules to raise quality: 'narrative desiderata' planned in advance (thematic unity, suspense, surprise, escalation, closure, intelligibility, emotional range), and, borrowing from Stanislavsky's acting theory and interactive-fiction work (Versu), an Objective / Stakes / Obstacle definition for each character so scenes are driven by intent, not arbitrary plot moves.
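The Objective / Stakes / Obstacle definition can be pictured as a small record per character that is rendered into the text handed to the model. The sketch below is hypothetical: `CharacterDrive`, `brief` and `scene_brief` are names I made up, and the sentence template is my own wording, not a prompt from the paper. Only the seven desiderata are taken from the digest above.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence

# The seven narrative desiderata as listed in the digest.
NARRATIVE_DESIDERATA = (
    "thematic unity", "suspense", "surprise", "escalation",
    "closure", "intelligibility", "emotional range",
)


@dataclass(frozen=True)
class CharacterDrive:
    name: str
    objective: str   # what the character wants in this scene
    stakes: str      # what is lost if they fail
    obstacle: str    # what stands in the way

    def brief(self) -> str:
        """Render the drive as one line of plain text (hypothetical template)."""
        return (f"{self.name} wants {self.objective}. "
                f"At stake: {self.stakes}. In the way: {self.obstacle}.")


def scene_brief(drives: Iterable[CharacterDrive],
                desiderata: Sequence[str] = NARRATIVE_DESIDERATA) -> str:
    """Combine character drives and desiderata into a single planning brief."""
    lines = [drive.brief() for drive in drives]
    lines.append("Aim for: " + ", ".join(desiderata) + ".")
    return "\n".join(lines)


# Hypothetical character, invented for this sketch.
keeper = CharacterDrive(
    name="The keeper",
    objective="to keep the stranger hidden",
    stakes="his post and his pension",
    obstacle="the inspector arrives at dawn",
)
print(scene_brief([keeper]))
```

The same record doubles as an NPC sheet: three short fields are enough to tell a generator why a character acts, rather than only what happens next.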
Advanced Drama Managers run a generate-evaluate-select search loop (self-critique: sample several candidates, have an 'editor' LLM list pros and cons, pick the best) — higher quality but higher latency. They built 26 different Drama Managers to explore the quality-versus-responsiveness trade-off, and evaluate with an 'auto-rater' (63 guidelines distilled from craft textbooks, each scored 1-5) plus 'self-play' agents. The model at study time was Gemini 2.5 Flash (later the Gemini 3 family). Put without math: they translate story quality into a checklist and have the AI grade it.
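The generate-evaluate-select loop and the checklist-style grading can be sketched together. This is a simplified stand-in, not the authors' system: the real generator and editor are LLM calls, whereas `toy_generate` and `toy_rate` below are deterministic fakes, and picking the candidate with the highest mean rubric score is my assumption about how "pick the best" could work.

```python
from typing import Callable, List, Sequence, Tuple

Generate = Callable[[str, int], List[str]]   # (plan, n) -> n candidate passages
Rate = Callable[[str, str], int]             # (candidate, guideline) -> score in 1..5


def rubric_score(candidate: str, guidelines: Sequence[str], rate: Rate) -> float:
    """Average the 1-5 scores a rater gives one candidate across all guidelines."""
    scores = [rate(candidate, g) for g in guidelines]
    if not scores:
        raise ValueError("need at least one guideline")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("scores must be in 1..5")
    return sum(scores) / len(scores)


def generate_evaluate_select(plan: str, generate: Generate,
                             guidelines: Sequence[str], rate: Rate,
                             n_candidates: int = 3) -> Tuple[str, float]:
    """Sample candidates, score each against the rubric, return the best one."""
    candidates = generate(plan, n_candidates)
    scored = [(rubric_score(c, guidelines, rate), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins for the LLM calls (assumptions for this sketch only).
def toy_generate(plan: str, n: int) -> List[str]:
    drafts = [
        f"{plan} Nothing much happens.",
        f"{plan} A secret surfaces, and the danger grows.",
        f"{plan} A secret surfaces.",
    ]
    return drafts[:n]


def toy_rate(candidate: str, guideline: str) -> int:
    # Pretend a guideline is satisfied when its keyword appears in the text.
    return 5 if guideline in candidate else 2


GUIDELINES = ["secret", "danger"]   # placeholders for the paper's 63 guidelines
best, score = generate_evaluate_select(
    "The stranger knocks.", toy_generate, GUIDELINES, toy_rate)
print(score, best)
```

The latency cost is visible in the structure: every extra candidate adds one generation call plus one rating call per guideline, which is the trade-off the 26 Drama Manager variants explore.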
For the UI they deliberately avoid a chatbot, offering two modes — 'Gardener' (moment-by-moment co-creation) and 'Architect' (top-down structure) — framed by the idea of 'finding' rather than 'making' a story (Borges' Library of Babel), which they call convergent iteration.
Findings
With the paper's own numbers: the auto-rater's agreement with humans' overall preference was 0.72 for the naive guideline version and 0.83 for the improved one (the human-preference data had Fleiss' kappa = 0.46, p < 0.01, N = 150, three categories). The self-critique search 'yields significantly higher quality narratives according to our evaluations' but adds latency (the size of the gain is not given as a number in the parts I read).
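For readers unfamiliar with Fleiss' kappa, the statistic quoted above measures how much a fixed number of raters agree beyond chance when sorting items into categories. The sketch below implements the standard formula; the rating counts are invented for illustration and have nothing to do with the paper's N = 150 dataset, whose per-item ratings are not reproduced here.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who put item i into category j. Every item needs the same rater count."""
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs per item, averaged.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


# Invented toy data: 4 items, 3 raters, 3 categories (e.g. prefer A / tie / prefer B).
toy_counts = [
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
    [0, 0, 3],
]
print(round(fleiss_kappa(toy_counts), 3))
```

On this toy table the value works out to 17/47, about 0.36. A value such as the paper's 0.46 is conventionally read as moderate agreement, which is a useful reminder that the humans the auto-rater is compared against do not fully agree with each other either.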
Writers were split. Many praised the mechanical side — structure, scene-breaking, emotional arc ('terrific... how to break an idea into a perfectly workable scene,' W4) — but found dialogue and style weak, calling output 'generic,' 'corny,' 'creative 101,' and failing to imitate Brecht, Pinter or Beckett. The authors describe a 'stylistic ceiling' that converges to a safe mainstream average.
Quantitatively, the Creativity Support Index (a measure of how well a tool supports creativity, based on the NASA-TLX workload index) among 25 creatives was highest for self-declared novices and film-industry people, lowest for theatre makers. Architect mode was welcomed ('like creating your character sheet in Dungeons and Dragons,' W15) but felt 'rigid,' with edits cascading and cause-effect breaking across scenes; Gardener mode felt 'like a computer game' (W18), fun and more surprising. Cultural bias appeared too: characters defaulted to white, cisgender and male even when unspecified, and female characters skewed to 'timid, quiet' stereotypes.
Where it is useful
For game and puzzle makers, I see five concrete uses. First, if you build interactive narrative or an adventure game, you can repurpose the Drama Manager as a runtime story-control layer. Gardener mode's moment-by-moment co-creation is close to interactive narrative, and the authors themselves name 'interactive storytelling' as a future direction. One participant (P5) imagined a 'playground' where you let characters loose in an AI world and watch what creates conflict, an image that maps directly onto emergent-narrative game design.
Second, a procedural-narrative pipeline: borrow the three-level hierarchy (story plan to beat plan to script) as a data model for quest or dialogue generation. The beat-change triggers (entrances/exits, location shifts, changes in emotion or power, objective changes) correspond almost one-to-one to game state changes. If I were adding story to a Sokoban-like, I'd start from that 'state change = beat boundary' mapping.
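The 'state change = beat boundary' mapping can be sketched as a diff between consecutive game states. Everything here is a hypothetical illustration of my own suggestion, not something from the paper: the `GameState` fields, the trigger labels, and the choice to cover only entrances, exits, location and objective (topic, emotion and power shifts would need richer state than a grid puzzle usually tracks).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class GameState:
    present: FrozenSet[str]   # characters currently "on stage"
    location: str
    objective: str            # the player character's current objective


def beat_triggers(prev: GameState, curr: GameState) -> List[str]:
    """List which beat-change cues fired between two consecutive states."""
    triggers = []
    if curr.present - prev.present:
        triggers.append("entrance")
    if prev.present - curr.present:
        triggers.append("exit")
    if curr.location != prev.location:
        triggers.append("location shift")
    if curr.objective != prev.objective:
        triggers.append("objective change")
    return triggers


def beat_boundaries(states: Sequence[GameState]) -> List[int]:
    """Indices of states that open a new beat (state 0 is the implicit first beat)."""
    return [i for i in range(1, len(states))
            if beat_triggers(states[i - 1], states[i])]


# Invented trace for a small puzzle level.
trace = [
    GameState(frozenset({"hero"}), "cellar", "find the key"),
    GameState(frozenset({"hero"}), "cellar", "find the key"),            # a plain move
    GameState(frozenset({"hero", "guard"}), "cellar", "find the key"),   # guard enters
    GameState(frozenset({"hero", "guard"}), "yard", "escape"),           # new room, new goal
]
print(beat_boundaries(trace))
```

Ordinary moves that change none of the tracked fields stay inside the current beat, so a generator would only be asked for new text when something dramatically relevant happens.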
Third, automated playtesting of narrative: the auto-rater recipe — score against 63 guidelines and force arguments for each 1-5 rating to fight position bias (the habit of picking the first or last option) — is a template for triaging generated quests and lines before human review. Fourth, the Objective/Stakes/Obstacle schema is a ready NPC-authoring template for RPGs and visual novels. Fifth, the 'absurdity dial' that experienced writers asked for — don't over-optimize for consistency; expose a knob for novelty — applies directly to roguelike and PCG (Procedural Content Generation) narrative.
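The 'absurdity dial' could be as simple as a weight that trades rubric quality against novelty when choosing among candidates. The paper reports the request, not a design, so this sketch is entirely my assumption: the `Candidate` fields, the linear blend, and the idea that a separate novelty score in [0, 1] is available (for example from distance to previously generated text) are all hypothetical.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    text: str
    quality: float   # e.g. a normalised auto-rater score in [0, 1]
    novelty: float   # e.g. a dissimilarity score in [0, 1]; source is an assumption


def pick_with_absurdity(candidates: Sequence[Candidate], absurdity: float) -> Candidate:
    """Choose a candidate by blending quality and novelty.

    absurdity = 0.0 selects purely on quality; 1.0 selects purely on novelty.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0.0 <= absurdity <= 1.0:
        raise ValueError("absurdity must be in [0, 1]")
    return max(candidates,
               key=lambda c: (1.0 - absurdity) * c.quality + absurdity * c.novelty)


# Invented candidates for a quest hook.
pool = [
    Candidate("The mayor asks you to clear rats from the cellar.", 0.9, 0.1),
    Candidate("The rats hire you to evict the mayor.", 0.5, 0.9),
]
print(pick_with_absurdity(pool, 0.0).text)
print(pick_with_absurdity(pool, 0.5).text)
```

Exposing the weight to the designer, or to the player as a difficulty-style setting, keeps the consistency-optimising rater from being the only voice in selection.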
Limitations
The authors' own admissions: iterating on a single Drama Manager can't capture every culture; the scene-beat structure is Western-biased and the style skews to screenwriting; Architect mode is rigid, with local edits cascading and cause-effect breaking between scenes; and an auto-rater that optimizes 'consistency' may strip out the productive 'glitches' and surprises experienced writers use as springboards.
Fukai adds the following caveats. This is a pre-review preprint, and the qualitative core rests on a small sample of 25 writers. The CSI subgroups by domain are smaller still, so 'resonated most with novices' should be read as a tendency, not a firm claim. The 0.72-to-0.83 figure is agreement with a single external dataset on a task the authors themselves call subjective, and the 'significantly higher quality' of the search loop is not quantified in the parts I read.
Fukai's reading
From here on, this is Fukai's reading (I flag this as my opinion). I'd place this study as a careful field record of a generative-AI paradox: the more a system self-improves, the more it drifts toward the average. The evaluation loop that raises 'quality' pulls stories toward the safe and readable, yet what the writers valued was precisely the unexpected that breaks that safety. In the language of design criticism, the metric that raises quality is also the metric that flattens voice. Anyone using an auto-rater as a reward signal for game narrative should keep that trade-off in mind. That, at least, is how I read it.
Closing
For readers who want to go deeper: the 'Drama Manager' idea is rooted in interactive-drama research since Mateas and Stern's Façade. Start with the same team's Dramatron (2023) to trace the path from hierarchical prompt chaining to this work; Evans' Versu, which drives stories through social simulation, is a fascinating companion for the Objective/Obstacle lineage. I've tried to make this article stand on its own, but the original still holds many more of the writers' raw voices.
References
Papers and materials referenced in this article:
・Related work: Co-Writing Screenplays and Theatre Scripts with Language Models (Dramatron; Mirowski, Mathewson, Evans, CHI 2023)