PAPER-DIGEST · 2026-07-02

Özkan: Co-Training the Level-Generating AI and the Level-Solving AI — Fukai Reads

Co-adaptive procedural content generation with reinforcement learning (Unity ML-Agents, arXiv preprint, Oct 2025)

Reviewed by Fukai · #paper-digest #research #pcg #reinforcement-learning #level-generation #game-ai #unity #difficulty #player-modeling

TL;DR

What happens if you stop building your level generator and your level-playing AI separately, and instead train them together? This paper by Miraç Buğra Özkan (Istanbul Technical University), posted to arXiv in October 2025 as a preprint (the text names no peer-reviewed venue), tries exactly that inside a 3D Unity world. A hummingbird agent (the player, collecting nectar) and a floating-island agent (the level generator, deciding where flowers go) are trained at the same time, each watching the other's results.

The final trained hummingbird collected all flowers in 90.2% of 100 unseen layouts (paper, Table III). The island, which at first produced terrible placements (steep slopes, overlaps), gradually converged toward layouts that were pleasant to play. Put the generator and the solver in one loop, and the difficulty tends to calibrate itself: that is the paper's single most interesting point.

Introduction — who wrote it, and where

The author is Miraç Buğra Özkan, in Artificial Intelligence and Data Engineering at Istanbul Technical University; it is a single-author paper. The manuscript is laid out like an IEEE conference template, but the text names no specific venue and does not state that it was peer-reviewed. So I treat it as an arXiv preprint (arXiv:2510.15120, posted October 2025, possibly not yet peer-reviewed).

Honestly, I did not pick this today for its freshness. It was posted about nine months ago, outside the 'last 60 days' window I usually aim for. I chose it because the full text is readable end to end, and because the author states the system is a modification of Unity's official 'ML-Agents: Hummingbirds' learning course — meaning a capable hobbyist can reasonably reproduce it. That fits my habit of favoring implementation-near papers.

One caveat up front: this is not, strictly, a puzzle paper. It is a foraging/navigation task — a hummingbird flying over 3D terrain to collect nectar. Even so, the idea of putting the level generator and the level solver in the same learning loop is well worth carrying home for anyone who designs puzzles or stages. That is what I came to read.

Background — the limits of PCG and of RL

First, some plain definitions. PCG (Procedural Content Generation) means algorithmically creating levels, terrain, item placements, and so on. Classically this used rule-based systems or noise functions (which turn randomness into natural-looking patterns), but the author frames these as lacking adaptability and struggling to guarantee playability or balance.

Next, reinforcement learning (a framework where an agent acts in an environment and learns by trial and error to earn higher reward). Recently many methods use it to learn a policy (what to do in a given situation) from interaction. But, the author argues, most machine-learning PCG (evolutionary or search-based) decouples the generation step from the agent actually playing, making real-time coupling hard.

That decoupling is the gap this paper targets. Rather than making 'create' and 'play' separate stages, it locks them into one feedback loop: the island (generator) watches the hummingbird's (solver's) results and adjusts placements, while the hummingbird relearns how to fly in response. Both change in tandem — which the author situates within the recent line of work where the environment and the policy co-develop.

Approach — two agents and the trick of good observations

The system consists of two Unity agents: the hummingbird (solver) and the floating island (generator). Both learn via PPO (Proximal Policy Optimization), an RL method that 'clips' the update objective so the policy gains nothing from swinging too far in a single update, which keeps training stable. I skip the equations; think of it as 'step toward what looks better, but never in too large a stride.'
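
For readers who want one look at what 'never in too large a stride' means in practice, here is a minimal sketch of PPO's clipped surrogate objective for a single sample, in plain Python. The function name and the clip range eps = 0.2 are my assumptions (0.2 is a commonly used value, not a setting reported in the paper), and real trainers apply this over batches with automatic differentiation.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized).

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimate of how much better the action was than average
    eps       = clip range (assumed value, not taken from the paper)
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any gain from pushing the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 earns no more than a ratio of 1.2 would, so the optimizer has no reason to push further in that update.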

The hummingbird's observations are carefully chosen. They include raycasts forward, up and down (light rays that detect flowers, obstacles and terrain), a relative vector to the nearest flower, its own velocity and orientation, the surface normal directly below (the slope direction), and even the island's chosen flower spread radius r and congestion c. The author says such auxiliary inputs stabilize learning. The reward is straightforward: plus for collecting nectar, minus for collisions, overly sparse layouts, or taking too long.
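
The shape of that reward can be sketched in a few lines. This is an illustration only: the class name, the field names, and every weight below are hypothetical values I picked for readability, not numbers from the paper.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """What happened to the hummingbird during one simulation step."""
    nectar_collected: float   # nectar gathered this step (0.0 if none)
    collided: bool            # hit terrain or an obstacle
    layout_too_sparse: bool   # flowers spread too thin to forage sensibly


def hummingbird_reward(
    ev: StepEvents,
    w_nectar: float = 1.0,      # all weights are hypothetical
    w_collision: float = 0.5,
    w_sparse: float = 0.1,
    w_time: float = 0.001,
) -> float:
    reward = w_nectar * ev.nectar_collected   # plus: collecting nectar
    if ev.collided:
        reward -= w_collision                 # minus: collisions
    if ev.layout_too_sparse:
        reward -= w_sparse                    # minus: overly sparse layout
    reward -= w_time                          # minus: small cost for every step taken
    return reward
```

The per-step time cost is what discourages dawdling: an agent that takes longer pays it more often.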

The island (generator) observes obstacle placements, the hummingbird's starting position, and last episode's metrics (average reward, nectar collected, steps to the first flower, collision count), and outputs two continuous values: the flower spread radius r and congestion c. Bad placements — overlapping, on steep slopes, unevenly spaced — incur penalties. The idea is that this steers the island toward layouts the hummingbird finds playable.
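
To make 'bad placements incur penalties' concrete, here is a minimal sketch that counts two of the three faults named above: overlapping flowers and flowers on steep slopes. The uneven-spacing term is left out for brevity. The function name, the tuple layout, and both thresholds are my assumptions, not values from the paper.

```python
import math
from typing import List, Tuple

Flower = Tuple[float, float, float]  # (x, z, slope in degrees); hypothetical layout


def placement_penalty(
    flowers: List[Flower],
    min_gap: float = 0.5,         # assumed minimum distance between flowers
    max_slope_deg: float = 30.0,  # assumed steepest acceptable slope
) -> float:
    penalty = 0.0
    for i, (x1, z1, slope) in enumerate(flowers):
        if slope > max_slope_deg:
            penalty += 1.0                      # flower sits on a steep slope
        for x2, z2, _ in flowers[i + 1:]:
            if math.hypot(x1 - x2, z1 - z2) < min_gap:
                penalty += 1.0                  # this pair overlaps
    return penalty
```

For example, three flowers where the first two are 0.1 apart and the third sits on a 45-degree slope would score one overlap plus one steep placement.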

Findings — reading the numbers

The author reports not just final metrics but a chronological trail of trials. The initial, observation-poor setup trained unstably: even after 5M steps it collected only 6.2 flowers per episode on average, taking over 150 steps to reach the first flower. Adding terrain normals and orientation (Trial 2) raised average collection to 10.9 by 8M steps and cut collisions by 23% — showing, relative to the bare setup, that the auxiliary cues help.

In Trial 3, with a feedback loop added to the island, steps-to-first-flower dropped to 44, and behaviors emerged on their own: flying high to survey sparse layouts, hugging the ground in dense ones. But around 12M steps, large swings in the layout parameters (r, c) caused instability. Trial 4 penalized bad placements to gate parameter updates; the reward curve smoothed, converging to an average reward of 1.35 per step over the final 5M steps, with all flowers collected in 92% of episodes.
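
One way to picture 'gating' the parameter updates is a small hill-climbing step that only accepts a new (r, c) when it both passes a placement check and improves a score. The sketch below is my own toy reconstruction, not the paper's code: the step sizes, the gate threshold, and the score function (which simply peaks at radius 7 and congestion 0.5, the region this digest reports as best) are all assumptions.

```python
import random
from typing import Callable, Tuple

Params = Tuple[float, float]  # (radius r, congestion c)


def gated_step(
    params: Params,
    score_fn: Callable[[Params], float],
    penalty_fn: Callable[[Params], float],
    rng: random.Random,
    r_step: float = 0.5,      # assumed maximum change in r per update
    c_step: float = 0.1,      # assumed maximum change in c per update
    max_penalty: float = 0.0,
) -> Params:
    r, c = params
    candidate = (
        max(0.0, r + rng.uniform(-r_step, r_step)),
        min(1.0, max(0.0, c + rng.uniform(-c_step, c_step))),
    )
    if penalty_fn(candidate) > max_penalty:
        return params          # gate: reject layouts flagged as bad
    if score_fn(candidate) <= score_fn(params):
        return params          # hill climbing: keep only improvements
    return candidate


def toy_score(p: Params) -> float:
    # Stand-in for the solver's measured performance on a layout.
    r, c = p
    return -((r - 7.0) / 7.0) ** 2 - (c - 0.5) ** 2


def toy_penalty(p: Params) -> float:
    # Stand-in placement check: treat very large radii as too sparse.
    return 1.0 if p[0] > 12.0 else 0.0


def run(start: Params = (2.0, 0.1), steps: int = 2000, seed: int = 0) -> Params:
    rng = random.Random(seed)
    params = start
    for _ in range(steps):
        params = gated_step(params, toy_score, toy_penalty, rng)
    return params
```

Because each step is bounded and rejected candidates leave the parameters untouched, (r, c) can drift but cannot jump, which is the smoothing effect Trial 4 is described as achieving.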

The author also runs an ablation study (removing components one at a time to see which parts matter). Without terrain normals, collisions rose 40%; without raycasts, reaching the first flower took twice as long; without the island's layout parameters, the agent stopped adapting and reverted to zigzagging. In Figure 8 the success rate is 92% with everything, 84% without terrain normals, 61% without layout parameters, and just 38% without ray sensors. On 100 unseen layouts the final results were 90.2% success, 12.4 nectar on average, 1.4 collisions, with best performance around congestion ~0.5 and radius ~7 (Figure 9).

Use cases — how a maker can apply this

A few concrete examples. First, if you are building a Sokoban-like level generator — where you would normally do 'generate, then vet with a separate solver' in two stages — you could instead train generator and solver in one loop. Replace the hummingbird's 'steps to first flower' with your board's 'time to first move,' 'move count,' or 'dead-end rate,' and the generator gains a handle for tuning difficulty itself.
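
As a sketch of what that feedback might look like on the puzzle side, the function below summarizes a batch of solver attempts into the kind of metrics a generator could observe. Everything here is hypothetical: the `SolveAttempt` record, its fields, and the choice of metrics are mine, not the paper's.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class SolveAttempt:
    solved: bool         # did the solver finish the level?
    moves: int           # moves used in this attempt
    hit_dead_end: bool   # did it reach an unrecoverable state?


def generator_feedback(attempts: List[SolveAttempt]) -> Dict[str, Optional[float]]:
    if not attempts:
        raise ValueError("need at least one attempt")
    solved = [a for a in attempts if a.solved]
    return {
        "success_rate": len(solved) / len(attempts),
        # Move count is only meaningful for attempts that finished.
        "mean_moves_when_solved": mean(a.moves for a in solved) if solved else None,
        "dead_end_rate": sum(a.hit_dead_end for a in attempts) / len(attempts),
    }
```

These three numbers would play the role that average reward, steps to the first flower, and collision count play for the island.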

Second, for hyper-casual PCG, the island's (radius r, congestion c) map directly onto generation parameters like enemy spawn density, item spacing, or safe-zone size. It is telling that the best performance here was at a middle value (congestion ~0.5), not an extreme. Penalizing deviation from a target value can be reused as a mechanism that automatically avoids both 'too easy' and 'too hard.'
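
A deviation penalty of that kind can be very small. The sketch below penalizes the generator when the solver's success rate leaves a tolerance band around a target; the function name, the 0.7 target, the 0.1 tolerance, and the quadratic shape are all my assumptions for illustration.

```python
def difficulty_penalty(
    success_rate: float,
    target: float = 0.7,      # assumed desired success rate
    tolerance: float = 0.1,   # assumed band with no penalty
    weight: float = 1.0,
) -> float:
    deviation = abs(success_rate - target)
    # Inside the band the penalty is zero; outside it grows quadratically,
    # so both "too easy" and "too hard" are pushed back toward the target.
    return weight * max(0.0, deviation - tolerance) ** 2
```

Subtracting this from the generator's reward treats a level everyone clears and a level nobody clears as the same kind of failure.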

Third, automated playtesting at scale: run a PPO-trained solver agent as a cheap proxy player to estimate difficulty from success rate and collisions before shipping. Also, the implicit curriculum the feedback loop produces (easy layouts gradually giving way to hard ones) is a template for designing tutorials and early difficulty curves. And the plainest but most useful lesson: simply giving your bots the right observations (relative position to the goal, the slope underfoot) goes a long way toward stabilizing their behavior.
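
When a proxy player is used this way, the success rate is only an estimate, so it helps to attach an interval to it. Here is a minimal sketch using the standard Wilson score interval. The function name is mine, and the example figure of 90 successes in 100 layouts is a hypothetical rounding used for illustration, not a count reported in the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 for roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return centre - half, centre + half


low, high = wilson_interval(90, 100)
```

With only 100 test layouts the interval spans roughly 0.83 to 0.94, a reminder that a single headline percentage from a small playtest batch should be read loosely.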

Limitations — what the author admits, and what I noticed

Start with the author's own admissions. Early in training, 'reward hacking' appeared: hovering near dense clusters to farm collision-avoidance without actually collecting nectar. Overly large radii make flowers extremely sparse, hurting search efficiency, and the island's heuristic tuning can make (r, c) oscillate and destabilize training. Also, there is only one hummingbird; extending to cooperative or competitive multi-agent settings is left as future work.

From here on, these are things I noticed as a reader. The abstract and methodology say the island (generator) is trained via PPO, yet Trial 3 in the results describes adjusting (r, c) with 'a simple hill-climbing heuristic,' and the limitations section itself lists 'learning the island's generation policy via RL rather than heuristics' as future work. As I read it, the paper's account of what actually drives the generator is internally inconsistent. I cannot assert which description is true, so readers should treat claims about a PPO-trained generator with care.

One more: the implementation section states a '53-dimensional observation space,' but summing Table I gives only 24. The numbers disagree. On top of this, the paper is a single-author preprint with no named venue and no cross-game benchmark comparison. Its results are the record of one system's trials, not a general law. And to repeat, this is a foraging/navigation task; whether it generalizes to true puzzles is not tested here.

Fukai's reading

With the caveat that this is my interpretation: I want to place this work as an attempt to bring 'unsupervised environment design' — the lineage of PAIRED and POET, which make the environment itself a learning target — down from large-scale compute to the reach of a hobbyist Unity tutorial. Its value lies less in the 90.2% figure than in showing that an ordinary ML-Agents teaching example can turn into a loop where generator and solver co-evolve. In the vocabulary of design criticism, I read it as a tactile demonstration of half-automating the 'difficulty knob' by letting the level itself talk back to the player.

Closing

For those who want to go deeper, here are papers that can serve as a map. For the theoretical backbone of environment and policy co-developing, see Dennis et al., 'Emergent complexity and zero-shot transfer via unsupervised environment design' (NeurIPS 2020, PAIRED); for the exploration side, see Ecoffet et al., 'First return, then explore' (Nature, 2021). For the big picture of PCG, Togelius et al.'s survey of search-based PCG is a classic. To follow the thread of tying generation to player experience, the Super Mario work on experience-driven PCG via reinforcement learning (EDRL) is a good entry point. Read this paper as the 'hands-on plot a solo developer can work' within that larger map, and its place becomes clear.

References

Papers and related resources referenced in this article:

・Procedural Game Level Design with Deep Reinforcement Learning (Miraç Buğra Özkan, 2025, arXiv preprint arXiv:2510.15120)

・Tool: Unity ML-Agents Toolkit (the implementation base of this paper)
