PAPER-DIGEST · 2026-06-25
Munk et al.: Generating Dynamic Game Text with Small Language Models — Fukai Reads
Small language models (SLMs) and game content generation
TL;DR
Generating in-game text dynamically with large language models (LLMs — AI trained on huge amounts of text that produce writing by predicting the next word) is a busy research area, but it hits practical walls. Relying on a giant cloud model makes single-player games online-only, makes running costs unpredictable, and risks the game breaking when servers shut down. This paper proposes the opposite direction: instead of one giant general model, use very small models (SLMs, Small Language Models), each aggressively fine-tuned (adapting an existing model to a specific use through extra training) for a single narrow job, then compose them.
As a proof of concept the authors build DefameLM, one SLM that runs an entire game loop for a "reputational conflict" in a medieval RPG: generating smear-campaign posters from intelligence the player has collected. The bottom line: even a tiny one-billion-parameter base model, once narrowly trained and paired with a "retry until it passes" strategy, reaches adequate quality with a median time-to-success of roughly 2 to 5 seconds on a consumer PC. This article distills the paper so you can grasp it without opening the PDF.
Introduction
The paper is "High-quality generation of dynamic game content via small language models: A proof of concept" by Morten I. K. Munk, Arturo Valdivia and Paolo Burelli of the IT University of Copenhagen; Munk is also affiliated with the game studio Raw Power Labs. It appears as an arXiv preprint (arXiv:2601.23206, posted January 2026), so I cannot confirm it has passed peer review (the process where experts vet a paper before publication). I flag that up front.
I chose it today because its question sits exactly where researcher interest overlaps with the worries of people actually shipping games. Many developers who want generative AI in their game stall not on quality but on "will it run offline?", "what will it cost to run?", and "will a cloud vendor hold my game's lifespan hostage?". This paper takes those practical anxieties head-on.
Background
Some context. There is plenty of work on generating game dialogue and quests with generative AI, but using a giant LLM directly keeps hitting a wall: it cannot maintain a coherent grasp of the game world. The authors cite a study where even a top-tier LLM, asked to play the text adventure Zork I, failed to infer the world state and form goals. Even throwaway NPC chatter ("barks" in developer slang) can drift out of context.
A popular remedy lately is an "agentic" design: split a complex task into smaller subtasks, each handled by a separate model call — e.g. break an NPC decision into "summarize recent events", "assess mood", "write the final line". But the authors note that if each node is still a giant cloud LLM, the core problems — online-only, unpredictable cost, server-shutdown risk — cannot be removed in principle.
So the paper accepts that decomposition is the right direction, but replaces the contents of each node, from "a giant LLM call" to "a small model specialized for a narrow job". SLMs have been shown to rival LLMs and human writers on short, clearly contextualized creative writing, and their ease of fine-tuning, on-device operation and low cost make them, in the authors' reading, a good fit for embedding in games.
Approach
The core proposal is "a network of specialized small models, wired like a directed acyclic graph (DAG — a flow of arrows that never loops back)". Each node does exactly one narrow job: a particular kind of generation, a retrieval, or a game-state change. The key is aggressive specialization: don't make one model do everything; the harder the task, the narrower the scope and the tighter the adherence to the training data. The authors call this the "creativity–consistency trade-off", controllable via two knobs: data variety and degree of overfitting.
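To make the wiring idea concrete, here is a minimal sketch of my own (not the authors' implementation) that runs single-purpose nodes in dependency order using Python's standard graphlib. The node names and stub functions are hypothetical; in a real system each stub would be a call into a small fine-tuned model, a retrieval step, or a game-state change.

```python
from graphlib import TopologicalSorter

# Hypothetical single-purpose nodes. Each receives the results produced so far
# and returns only its own output. Real nodes would call a small model.
def summarize_events(ctx):
    return "The baker was seen bribing a guard."

def assess_mood(ctx):
    return "smug" if "bribing" in ctx["summarize_events"] else "neutral"

def write_line(ctx):
    return f"[{ctx['assess_mood']}] {ctx['summarize_events']}"

NODES = {
    "summarize_events": summarize_events,
    "assess_mood": assess_mood,
    "write_line": write_line,
}

# Maps each node to the set of nodes it depends on (its predecessors).
DEPENDENCIES = {
    "summarize_events": set(),
    "assess_mood": {"summarize_events"},
    "write_line": {"summarize_events", "assess_mood"},
}

def run_dag(nodes, dependencies):
    """Run every node once, in an order that respects the dependencies."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = nodes[name](results)
    return results

if __name__ == "__main__":
    print(run_dag(NODES, DEPENDENCIES)["write_line"])
```

Because the graph is acyclic, every node runs exactly once and only after its inputs exist; TopologicalSorter raises a CycleError if an arrow ever loops back.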
The proof of concept is DefameLM. The setting is a medieval RPG of reputation and power. The player gathers compromising intelligence through infiltration, bribery or eavesdropping, hands it to a morally dubious scribe, and out comes a smear poster for the town. The model receives the sender's and target's attributes, one or two pieces of intelligence, the intended audience (peasants, nobles or guards), and a mocking "angle" (e.g. illiteracy, bad fashion sense). The output is poster text of at most 500 characters that aggrandizes the sender and belittles the target. The reputation scores themselves are updated deterministically, without AI.
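As a rough picture of that input and output contract, here is a sketch of what one request might look like as a data structure. The field names, the validation details and the reputation delta are my own assumptions for illustration; the paper only fixes the kinds of information passed in, the 500-character cap, and the fact that reputation is updated without AI.

```python
from dataclasses import dataclass

AUDIENCES = {"peasants", "nobles", "guards"}
MAX_POSTER_CHARS = 500  # cap described in the paper

@dataclass(frozen=True)
class PosterRequest:
    sender: dict         # hypothetical attribute dict, e.g. {"name": ..., "faction": ...}
    target: dict
    intelligence: tuple  # one or two pieces of collected intelligence
    audience: str
    angle: str           # the mocking angle, e.g. "illiteracy"

    def __post_init__(self):
        if not 1 <= len(self.intelligence) <= 2:
            raise ValueError("expected one or two pieces of intelligence")
        if self.audience not in AUDIENCES:
            raise ValueError(f"unknown audience: {self.audience}")

def within_length_cap(poster_text):
    """True if the generated poster respects the character cap."""
    return len(poster_text) <= MAX_POSTER_CHARS

def apply_reputation(scores, request, delta=5):
    """Deterministic update, no model involved. The delta of 5 is made up."""
    updated = dict(scores)
    sender, target = request.sender["name"], request.target["name"]
    updated[sender] = updated.get(sender, 0) + delta
    updated[target] = updated.get(target, 0) - delta
    return updated
```

Keeping the score update outside the model means the game's bookkeeping stays predictable even when the generated text varies.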
Training data is generated synthetically via the DAG approach: elements like country of origin, personality and faction are chosen from conditional lists or written by a large "teacher" model (here ChatGPT-4o) and stitched into world-grounded inputs. This produced 1800 input–output pairs, 1440 for training and 360 for evaluation. The base model is Llama 3.2-1B (a one-billion-parameter class model), fine-tuned with LoRA (Low-Rank Adaptation — training only small add-on parts rather than the whole model, a cheap way to specialize).
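The list-selection half of that recipe can be sketched in a few lines. Everything named here (countries, factions, personalities) is invented for illustration, and the step where a teacher model writes free-text elements and the target poster is left out; only the 1800 / 1440 / 360 counts come from the paper.

```python
import random

# Hypothetical conditional lists: the factions on offer depend on the country.
FACTIONS_BY_COUNTRY = {
    "Northmark": ["Iron Guild", "River Wardens"],
    "Sunreach": ["Salt Merchants", "Temple Scribes"],
}
PERSONALITIES = ["vain", "pious", "greedy", "timid"]
AUDIENCES = ["peasants", "nobles", "guards"]

def sample_character(rng):
    country = rng.choice(sorted(FACTIONS_BY_COUNTRY))
    return {
        "country": country,
        "faction": rng.choice(FACTIONS_BY_COUNTRY[country]),  # conditional on country
        "personality": rng.choice(PERSONALITIES),
    }

def sample_input(rng):
    return {
        "sender": sample_character(rng),
        "target": sample_character(rng),
        "audience": rng.choice(AUDIENCES),
    }

def build_split(n_total=1800, n_eval=360, seed=0):
    """Sample inputs, shuffle them, and hold out n_eval for evaluation."""
    rng = random.Random(seed)
    inputs = [sample_input(rng) for _ in range(n_total)]
    rng.shuffle(inputs)
    return inputs[n_eval:], inputs[:n_eval]

train, evaluation = build_split()
print(len(train), len(evaluation))  # 1440 360
```

Sampling the faction conditionally on the country is what keeps each example consistent with the game world rather than a random mix of attributes.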
At run time they use a plain "retry-until-success" strategy. The sampling temperature (a value setting how much the output varies) is 0.75, so each retry produces a different candidate and repeated generation eventually yields one that passes. Pass/fail is decided by an "LLM-as-a-judge" scheme, with a strict bar: an output passes only if it clears all seven criteria.
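The control flow of this strategy is simple enough to sketch. In the version below, generate stands in for sampling the small model and judge for the pass/fail check; both are placeholders of mine, as are the two example criteria (the article does not list the paper's seven) and the max_attempts cap, which I add as a safety assumption.

```python
def retry_until_success(generate, judge, max_attempts=10):
    """Call generate() until judge() accepts the output or attempts run out.

    Returns (text, attempts_used); text is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if judge(candidate):
            return candidate, attempt
    return None, max_attempts

def passes_all(criteria, text):
    """Strict bar: the text passes only if every single criterion holds."""
    return all(check(text) for check in criteria)

if __name__ == "__main__":
    # Stand-in generator: yields two bad candidates, then a usable one.
    canned = iter(["", "x" * 600, "Bertram cannot even read his own name!"])
    # Hypothetical criteria, not the paper's: non-empty and within 500 characters.
    criteria = [lambda t: len(t) > 0, lambda t: len(t) <= 500]
    text, attempts = retry_until_success(
        lambda: next(canned), lambda t: passes_all(criteria, t)
    )
    print(attempts, text)
```

The all-criteria rule is what makes the bar strict: one failed check rejects the whole candidate and triggers another sample.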
Findings
The main results cover quality, size and speed. Quality first: the teacher ChatGPT-4o passes at 98%, near perfect. The fine-tuned small models come close: the 16-bit version at 92.5%±1.2% and the 8-bit at 94.2%±1.2% (numbers from Section 4.1). The difference between the 16-bit and 8-bit versions is not statistically significant (McNemar test, p=0.41), and the authors conclude the 8-bit is a "drop-in replacement". The most aggressively compressed 4-bit version drops clearly, to 78%±2.2%.
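For readers who have not met the McNemar test: it compares two models on the same inputs and looks only at the inputs where they disagree. Below is a sketch of the exact two-sided version. I do not know which variant the authors used, and the counts in the usage line are invented, not taken from the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: inputs where model A passed and model B failed
    c: inputs where model A failed and model B passed
    Inputs where both passed or both failed carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided if each disagreement
    # were a fair coin flip, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

if __name__ == "__main__":
    # Hypothetical discordant counts for two quantization levels.
    print(mcnemar_exact(14, 20))
```

A large p-value here means the disagreements split roughly evenly, which is the sense in which two versions count as statistically indistinguishable.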
Size and speed next. Memory footprint is 2.48GB for 16-bit, 1.32GB for 8-bit, and 808MB for 4-bit. The authors argue that to coexist with a demanding game on a consumer GPU (8GB VRAM is typical), under 2GB is desirable; the 8-bit and 4-bit versions fall in that range, while the 16-bit does not. Per-token generation time is 3.6 ms for 4-bit versus 16 ms for 16-bit, a gap of about 4.5x (Table 1).
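A back-of-the-envelope check helps explain where those footprints come from. The sketch below counts raw weight storage only, and the parameter count of about 1.24 billion is my own assumption for a "1B-class" model, not a figure from the article.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the raw weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 1.24e9      # assumed parameter count, not taken from the article
VRAM_BUDGET_GB = 2.0   # the "under 2GB" target the authors argue for

for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if gb < VRAM_BUDGET_GB else "too large"
    print(f"{bits}-bit: {gb:.2f} GB of weights ({verdict})")
```

Under that assumption the 16-bit estimate lands on the reported figure, while the reported 8-bit and 4-bit footprints sit somewhat above the naive estimate, presumably because quantized files also carry scaling data and may keep some layers at higher precision.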
And the practical crux, time-to-success. Including retries, the median is 2.1 s for 4-bit, 2.5 s for 8-bit, and 4.8 s for 16-bit (Table 2). The 16-bit and 8-bit models succeed within two attempts for most inputs. Strikingly, although the 4-bit has a lower per-attempt pass rate, its faster generation makes it the fastest overall even with retries. This is measured on an ordinary gaming PC, an RTX 3070 with 8GB VRAM.
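Why a less accurate model can still win is easy to see with a little arithmetic. If each attempt passes independently with probability p, the number of attempts follows a geometric distribution with mean 1/p, so the expected time is the per-attempt time divided by p. The sketch below uses the pass rates and per-token times quoted above; the 120 tokens per poster is a guess of mine, and judge latency is ignored, so it gives a mean under simplified assumptions and will not reproduce the medians in Table 2.

```python
def expected_time_to_success(pass_rate, ms_per_token, tokens_per_attempt):
    """Mean seconds until the first passing output, assuming independent attempts."""
    attempt_ms = ms_per_token * tokens_per_attempt
    expected_attempts = 1.0 / pass_rate  # mean of a geometric distribution
    return attempt_ms * expected_attempts / 1000.0

TOKENS_PER_POSTER = 120  # assumed output length, not from the paper

four_bit = expected_time_to_success(0.78, 3.6, TOKENS_PER_POSTER)
sixteen_bit = expected_time_to_success(0.925, 16.0, TOKENS_PER_POSTER)

if __name__ == "__main__":
    print(f"4-bit: {four_bit:.2f} s, 16-bit: {sixteen_bit:.2f} s")
```

The 4-bit model needs more attempts on average, but each attempt is so much cheaper that the product still comes out smaller. The independence assumption is the weak point: inputs that fail once may well be inputs that fail repeatedly.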
The authors add a caveat about 4-bit. Looking only at the hard inputs (21 of 50 prompts), the 16-bit and 8-bit models tend to "struggle on the same problems" (rank correlation ρ=0.84), whereas the 4-bit correlates weakly with both (ρ=0.40 and 0.30, not statistically significant) — it appears to break in a different way. This suggests that the more you compress, the more you risk occasional deep failures on unexpected inputs.
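For reference, a rank correlation of this kind can be computed as below. This is a plain-Python sketch of Spearman's ρ; that the paper ranks prompts by something like attempts needed per prompt is my assumption, and the numbers in the usage lines are invented.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

if __name__ == "__main__":
    # Hypothetical attempts needed per prompt for two model variants.
    attempts_a = [1, 3, 2, 5, 1, 4]
    attempts_b = [1, 2, 2, 6, 1, 3]
    print(spearman_rho(attempts_a, attempts_b))
```

A value near 1 means the two variants find the same prompts hard; a value near 0 means their difficulties are largely unrelated.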
Use cases
Concrete takeaways for game makers. First, "dynamic generation anchored to a game loop". If your game centers on a fixed kind of exchange or event (negotiation, tavern rumors, wanted posters, reputation manipulation), it is realistic to embed one small model specialized for just that scene. The narrower the scope, the cheaper to train and the steadier the quality — the paper's most practical lesson.
Second, "embedding in offline single-player". Cloud LLMs force games online and create cost and server-lifespan anxieties. If you aim for a buy-once, long-lived single-player title, an on-device model of around 1GB removes those worries for generation itself; the pass/fail check still has to move on-device too (see Limitations). The authors say they integrated a llama.cpp-based SDK directly into Unreal 5 and Unity, so connecting to existing engines is in scope.
Third, the operational pattern of "aggressive compression plus retries". If you are making a hyper-casual or graphics-heavy title with little memory to spare for AI, you can compress to 4-bit, sacrifice per-attempt accuracy, and win it back through generation speed plus retry-until-success. But as the previous section noted, 4-bit occasionally fails deeply, so the pass/fail check needs to be built carefully.
Fourth, beyond generation itself, the "blueprint for making data" is reusable. The DAG method — combining list selection and automatic generation to mass-produce world-grounded training examples — is a cheap recipe for growing your own game dataset, whether or not you use an SLM. Structuring the output as "intelligence → appeal to the audience → self-praise of the sender", so that player-unlocked elements always surface in the text, is also worth copying when you want generated text to feel rewarding.
Limitations
Start with the weaknesses the authors themselves acknowledge. The biggest is "runtime quality assessment". In this paper, pass/fail is decided by a cloud ChatGPT-4o judge. So to run "retry-until-success" entirely on-device, you need a lightweight checker on the device itself, which remains future work. The authors also note a practical obstacle: failures are rare, so training data for such a checker would be heavily skewed toward passes.
The authors also qualify the evaluation method. Using the same ChatGPT-4o for both training and judging can introduce a self-consistency bias, and a coarse pass/fail metric misses fine nuance. Indeed, on manual comparison, the authors candidly write that even when pass/fail tied, the 16-bit felt more "refined". Their position is that final creative judgment needs human-in-the-loop. They also concede that structuring the output made the text somewhat monotonous and repetitive.
What I (Fukai) would point out here is that this is a proof of concept with a single model on a single kind of game loop, "medieval smear posters". The paper's claim is carefully limited to "one SLM can run one loop"; the original vision of many SLMs wired in a DAG is not yet demonstrated. On top of that, the subject matter is the ethically heavy domain of defamation and propaganda (the authors do flag the need for ethical consideration), and there is no human-subject study of whether players actually find it fun. Both are worth keeping in mind before adopting it in production. And since the training teacher is ChatGPT-4o, the development stage still depends on the cloud, which is a separate matter from full offline operation.
Fukai's reading
From here, I flag this as my own reading. I would place this study as a quiet rebuttal to "bigger is better" in game AI. In the vocabulary of design criticism, it recasts generative AI not as an all-purpose storyteller but as a "dedicated part" that meshes with a specific game loop. By stripping away the general LLM's freedom and binding it with scope, structure and teacher data, it ironically recovers consistency and operability. Giving up freedom is what makes it usable — a paper that, to me, resonates neatly with the old game-design lesson that constraint is design.
Closing
For those who want to go deeper, here are related works that can serve as a map. If you care about the limits of LLM planning and reasoning themselves, SokoBench (measuring long-horizon lookahead via Sokoban), which we have also covered, is a companion piece. On narrative generation, DAGs and emotional arcs, the authors' cited Wen et al., 'All Stories Are One Story', is useful. As background to the move to put SLMs at the base of agents, NVIDIA's 'Small Language Models are the Future of Agentic AI' (also cited by the authors) lays out why one would deliberately build small now.
One last note. Every number cited here comes from the original (preprint) text; I have not rounded anything by guesswork. Please read it without over-generalizing, mindful that it is a pre-peer-review paper at the proof-of-concept stage. I read it with a hot, strong cup of drip coffee in hand, drawing many lines on a printed copy.
References
Papers and related material referenced in this article:
・Related: Small Language Models are the Future of Agentic AI (Belcak et al., 2025)