PAPER-DIGEST · 2026-06-30

Feng et al.: Can LLM Agents Bargain Well in a Trading Game? — Fukai Reads

SidConArena, a mixed-motive negotiation benchmark based on Sidereal Confluence

TL;DR

What happens when you drop large language models (LLMs — AI systems trained on huge amounts of text to read and write) into an economic game where they must haggle with rivals? This paper, from a Tsinghua University team, introduces SidConArena, an evaluation environment built on the trading board game Sidereal Confluence. Players pool resources, refine them through converters, and finally fight over permanent assets in an auction.

The short version: strong models like GPT-5 and Gemini-3-Flash-Preview do post higher economic scores. But when the authors comb through the game logs, three weaknesses recur — agents misprice resources even while following the rules, they bargain passively and too politely, and they plan badly over long horizons. For anyone building games who wants to use AI as a negotiation opponent or playtester, I find these observations genuinely instructive.

Introduction

The authors are Yeqi Feng, Yuxin Chen, and Tianxing He (Feng and Chen are co-first authors with equal contribution; He is corresponding author), all at the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University. The work is posted on arXiv (a server for free preprints, meaning papers shared before peer review) as arXiv:2606.27397, classified mainly under cs.MA (multi-agent systems) and cross-listed in cs.AI (artificial intelligence) and cs.GT (computer science and game theory). It appeared as a new listing in late June 2026 and has most likely not yet passed peer review, which is worth stating up front.

Why pick it today? I translate papers for people who make puzzles and games, and recent AI benchmarks have leaned toward 'can it solve a problem with one right answer' or 'can it win a zero-sum game.' This paper takes on a cooperative-yet-competitive economy where winning is not simple — and because its subject is a real trading board game, it felt readable through the vocabulary of design. So I printed it out and went at it with a red pen and strong hot coffee.

Background

Start with the motivation. Many past attempts to evaluate LLM agents have been static — answer a fixed question set — which also risks 'contamination' (when evaluation problems quietly leak into training, making a model look better than it is). The authors argue the real test of an agent shows up in settings that are dynamic, only partially visible, and socially interactive.

Sidereal Confluence is a board game in which alien species trade resources to grow their civilizations. Each player wants to run their 'converters' (devices that turn one resource into another) but lacks enough raw material alone — the authors call this an 'Inherent Deficiency.' So players negotiate to swap resources. Crucially this is not zero-sum: good trades make everyone richer, a 'positive-sum' structure.

Much real economic activity has this mixed-motive shape — cooperate to create value while competing over scarce resources. A strong agent needs more than per-trade arithmetic: it must read bargaining power, allocate resources, and plan investments several moves ahead. The authors note there has been no good arena to test all of this at once.

Approach

The authors formalize the game as a 'finite-horizon partially observable stochastic game (POSG).' In plain terms: the game ends after a set number of rounds (finite horizon), each player sees only their own hand rather than the whole board (partially observable), and there is some chance involved (stochastic). Each turn splits into three phases.
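For readers who like to see the notation, a POSG is commonly written as a tuple along the following lines. This is the generic textbook form, not necessarily the symbols the paper itself uses; every symbol below is my own choice for illustration.

```latex
\[
\mathcal{G} = \big\langle\, \mathcal{N},\ \mathcal{S},\ \{\mathcal{A}_i\}_{i \in \mathcal{N}},\ \{\mathcal{O}_i\}_{i \in \mathcal{N}},\ P,\ \{R_i\}_{i \in \mathcal{N}},\ T \,\big\rangle
\]
\[
P(s', o_1, \dots, o_n \mid s, a_1, \dots, a_n), \qquad
J_i = \mathbb{E}\!\left[\sum_{t=1}^{T} R_i(s_t, a_{1,t}, \dots, a_{n,t})\right]
\]
```

Here N is the set of players, S the hidden game states, A_i and O_i the actions and private observations of player i, P the joint transition-and-observation probability (the stochastic part), R_i the reward of player i, and T the fixed number of rounds (the finite horizon). Each player tries to maximize their own J_i while seeing only their own observations.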

First, Negotiation: players exchange natural-language messages plus binding trade offers (I give you this, you give me that), and a trade goes through only if both sides expect to gain. Second, Production: players run converters on their resources — essentially a knapsack-style combinatorial optimization of what to make. Third, the Confluence: permanent assets (colonies and technologies) are won through sealed-bid auctions (everyone bids at once without seeing others' bids), using a token called 'Ships' as currency.
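To make the three phases concrete, here is a minimal sketch of one rule from each. Everything in it is an assumption for illustration: the function names, the resource values, the brute-force search, and especially the first-price rule and tie-break in the auction, none of which I can confirm from the paper.

```python
from itertools import combinations

def bundle_value(bundle, values):
    """Total worth of a resource bundle under one player's valuation."""
    return sum(values.get(res, 0) * qty for res, qty in bundle.items())

def trade_goes_through(a_gives, b_gives, a_values, b_values):
    """Negotiation: the swap executes only if each side values what it
    receives more than what it hands over."""
    a_gain = bundle_value(b_gives, a_values) - bundle_value(a_gives, a_values)
    b_gain = bundle_value(a_gives, b_values) - bundle_value(b_gives, b_values)
    return a_gain > 0 and b_gain > 0

def best_production(stock, converters, values):
    """Production: brute-force the subset of converters (each run at most
    once) whose combined inputs fit in stock and whose net gain is largest."""
    best_subset, best_net = (), 0
    for k in range(1, len(converters) + 1):
        for subset in combinations(converters, k):
            need = {}
            for conv in subset:
                for res, qty in conv["inputs"].items():
                    need[res] = need.get(res, 0) + qty
            if any(stock.get(res, 0) < qty for res, qty in need.items()):
                continue  # this combination needs more than we hold
            net = sum(
                bundle_value(c["outputs"], values) - bundle_value(c["inputs"], values)
                for c in subset
            )
            if net > best_net:
                best_subset, best_net = subset, net
    return [c["name"] for c in best_subset], best_net

def resolve_sealed_bids(bids):
    """Confluence: all bids are revealed at once; the highest bid wins and
    pays its own bid (assumed first-price rule). Ties go to the bidder
    listed first (assumed tie-break)."""
    winner = max(bids, key=bids.get)
    return winner, bids[winner]

# Hypothetical converters and values, not taken from the real game.
converters = [
    {"name": "farm", "inputs": {"Food": 2}, "outputs": {"Industry": 3}},
    {"name": "forge", "inputs": {"Food": 1, "Industry": 1}, "outputs": {"Ship": 1}},
]
values = {"Food": 1, "Industry": 1, "Ship": 4}
plan = best_production({"Food": 2, "Industry": 1}, converters, values)
```

The negotiation check is the interesting one for designers: a trade only happens when the two players value the same goods differently, which is exactly what the game's built-in shortages are meant to produce.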

The agent side is the clever part. A central dispatcher the authors call the 'Brain' routes each observation to a specialized reasoning module depending on the current phase. A 'neural-symbolic interface' (a bridge that turns the model's verbal reasoning into function calls the game engine can execute) lets free-form language coexist with strict, rule-grounded execution. There is even a web interface so humans can play in the same environment.
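The dispatcher-plus-interface idea can be sketched in a few lines. This is not the authors' code: the function names, the JSON reply shape, and the stand-in `fake_llm` are all hypothetical, and a real integration would call an actual model where the stub sits.

```python
import json

# Hypothetical engine functions. Each one re-checks the rules itself,
# so a confused model cannot push an illegal move through.
def propose_trade(state, give, get, qty):
    if state["stock"].get(give, 0) < qty:
        raise ValueError(f"not enough {give} to offer")
    return f"offer {qty} {give} for {get}"

def place_bid(state, asset, ships):
    if ships > state["stock"].get("Ship", 0):
        raise ValueError("bid exceeds Ships held")
    return f"bid {ships} Ships on {asset}"

# Phase-aware routing table: which calls are legal in which phase.
TOOLS_BY_PHASE = {
    "negotiation": {"propose_trade": propose_trade},
    "confluence": {"place_bid": place_bid},
}

def brain(phase, state, llm):
    """Ask the model for a decision in this phase, then lower its JSON
    reply into exactly one checked engine call."""
    tools = TOOLS_BY_PHASE[phase]
    reply = json.loads(llm(phase, state, sorted(tools)))
    name = reply["call"]
    if name not in tools:
        raise ValueError(f"{name} is not legal during {phase}")
    result = tools[name](state, **reply["args"])
    return reply.get("say", ""), result

def fake_llm(phase, state, tool_names):
    """Stand-in for a real model call; always proposes the same trade."""
    return json.dumps({
        "say": "I can spare some Food if you have Industry.",
        "call": "propose_trade",
        "args": {"give": "Food", "get": "Industry", "qty": 2},
    })

state = {"stock": {"Food": 3, "Ship": 1}}
message, action = brain("negotiation", state, fake_llm)
print(message)
print(action)  # offer 2 Food for Industry
```

The point of the split is that the free-form "say" text goes to the other players untouched, while the "call" part is only ever executed through functions the engine controls.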

Findings

The model lineup is broad: compact models like GPT-4o-mini and o3-mini, general chat models like Qwen-Plus and DeepSeek-V3, and high-capability systems like Gemini-3-Flash-Preview, GPT-5, and Claude-Opus-4. Evaluation runs two ways: homogeneous self-play (every seat is the same model) and heterogeneous tournaments (different models mixed in one economy), ranked with Elo (the relative-strength rating used in chess). Note that the paper reports concrete scores in figures (Figures 3-4) rather than as numbers in the text, so I describe trends here.
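For anyone who has not met Elo before, the standard two-player update looks like this. How the paper adapts Elo to a multi-seat economy is not something I can state from the text, so treat this as the plain chess-style formula only; the K-factor of 32 is a conventional default I picked, not the paper's setting.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected result for A against B under Elo."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return both new ratings. score_a is 1 for an A win, 0.5 for a draw,
    0 for a loss; B's score is the complement."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Two equally rated agents; A wins one game.
print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Beating an equal opponent moves you 16 points with K = 32; beating a much weaker one moves you far less, which is why Elo suits mixed tournaments where the matchups are uneven.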

The trend is clear. High-capability models such as GPT-5 and Gemini-3-Flash-Preview reach the highest terminal scores (measured from the final value of holdings), while smaller or older models lag well behind (Figure 3). In mixed tournaments the strong models stay strong, showing this is not just an artifact of playing against similar agents (Figure 4). So far, as expected.

Here is where it gets interesting. First, agents take valid actions yet misprice resources. In the authors' audit (Table 1), GPT-4o-mini trades three units of Industry for one Ship and judges it a great deal simply because it gains a Ship, even though on the environment's listed values (3 and 1) the two sides of the trade are worth the same. The authors call this a 'Ship premium': the model associates Ships with scarce auction currency and loses track of local marginal value (what one more unit is actually worth right now).
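The arithmetic behind the Ship premium is easy to check by hand. In the sketch below I assume per-unit values of 1 for Industry and 3 for a Ship, which is one reading that makes the trade exactly value-neutral; the `premium` multiplier is a purely hypothetical way to model the agent's inflated perception and does not come from the paper.

```python
# Assumed per-unit values: one reading under which the trade is value-neutral.
VALUES = {"Industry": 1, "Ship": 3}

def worth(bundle, values, premium=None):
    """Value of a bundle; `premium` optionally inflates chosen resources
    to mimic how an agent perceives them (hypothetical model)."""
    premium = premium or {}
    return sum(values[res] * qty * premium.get(res, 1.0) for res, qty in bundle.items())

def trade_delta(give, get, values, premium=None):
    """Positive means the trader thinks it comes out ahead."""
    return worth(get, values, premium) - worth(give, values, premium)

give, get = {"Industry": 3}, {"Ship": 1}
actual = trade_delta(give, get, VALUES)
perceived = trade_delta(give, get, VALUES, premium={"Ship": 1.5})
print(actual)     # 0
print(perceived)  # 1.5
```

Under the listed values the trade gains nothing, yet an agent that mentally marks Ships up by half sees a clear profit. For playtesting, logging both numbers side by side is a cheap way to spot this kind of mispricing.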

Second, bargaining is passive and overly polite. Even holding a scarce token with several buyers, an agent accepts the first 'reasonable-looking' offer rather than anchoring, counter-offering, or playing buyers against each other (Figure 5a): cooperative but not strategic.

Third, long-horizon planning is a major weakness: each turn looks locally sensible (conserve resources, balance inventory), yet the trajectory fails to invest in productive capacity early, misses compounding, and ends up weak at converting holdings into score (Figure 5b). The authors frame it as a lack of multi-turn coordination, not invalid moves.
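The compounding point is easiest to feel with a toy economy. The simulation below is my own invention, not the paper's environment: the income, converter cost, and boost are made-up numbers, and "score" is simply the resources left in the bank at the end.

```python
def play(turns, invest_until, income=2, cost=4, boost=1):
    """Toy engine-building game. Each turn you bank your current income.
    While turn < invest_until, you spend savings on converters that cost
    `cost` and permanently raise income by `boost`. Returns the final bank."""
    bank, rate = 0, income
    for turn in range(turns):
        bank += rate
        if turn < invest_until:
            while bank >= cost:
                bank -= cost
                rate += boost
    return bank

print(play(10, invest_until=0))   # 20: hoard every turn, never build
print(play(10, invest_until=5))   # 27: build early, then cash in
print(play(10, invest_until=10))  # 3: keep building, never convert to score
```

Each individual turn of the hoarding player looks sensible, yet the player who invests for five turns and then stops ends ahead. The third line shows the opposite failure: an engine that is never cashed in scores almost nothing, which matches the "weak at converting to score" pattern.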

Use Cases for Game Makers

What can game makers take away? First, for trade- and negotiation-driven economy games (Catan, Sidereal Confluence itself, 4X and civilization-style trading), if you use an LLM as an automated opponent or playtester, this paper warns that a raw LLM will overvalue scarce 'currency-like' resources and bargain passively. Left alone it tends to become a 'nice, easily exploited AI.' For real challenge you must explicitly engineer anchoring (setting a strong opening reference price to pull the deal your way) and counter-offer behavior.

Second, designing NPC (non-player character) traders and diplomats. The 'accepts the first fair offer' failure is, flipped around, a tuning knob: a timid merchant can use the base LLM's passivity as-is, while a tough negotiator needs added aggression. Since base models drift toward politeness, I read it as safest to assume you must add toughness deliberately.
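One way to picture that tuning knob is a seller policy with a single `aggression` parameter. This is a hypothetical design of mine, not something from the paper: at zero it behaves like the polite base agent and takes any fair offer, while higher values open with an anchor above fair value and concede a little each round.

```python
def seller_reply(offers, fair_price, aggression, round_no, max_rounds=3):
    """Decide how a seller answers the current bids.

    offers:     dict of buyer name -> price offered
    aggression: 0.0 accepts anything at or above fair value; higher values
                anchor at fair_price * (1 + aggression) and concede linearly
                back to fair value by the final round.
    Returns (decision, buyer, price)."""
    best_buyer = max(offers, key=offers.get)  # play buyers against each other
    best_bid = offers[best_buyer]
    anchor = fair_price * (1 + aggression)
    progress = min(round_no / max_rounds, 1.0)
    ask = anchor - (anchor - fair_price) * progress
    if best_bid >= ask:
        return ("accept", best_buyer, best_bid)
    return ("counter", best_buyer, ask)

offers = {"A": 4, "B": 5}
print(seller_reply(offers, fair_price=4, aggression=0.0, round_no=0))
# ('accept', 'B', 5)
print(seller_reply(offers, fair_price=4, aggression=0.5, round_no=0))
# ('counter', 'B', 6.0)
```

A timid merchant is just `aggression=0.0`; a tough negotiator starts high and only softens as the deadline nears. In practice you would feed the returned decision back into the LLM's prompt or override its choice with it.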

Third, difficulty tuning and resource design. Weakness at long-horizon investment means AI bots will tend to lose to humans in games where compounding and engine-building (setting up resources that generate more resources) matter — deckbuilders, 4X, idle games — which you can also use as a difficulty signal. And the 'Ship premium' is a resource-design hint: merely framing a token as 'the bidding currency' makes it feel more valuable even when the math says otherwise — useful both for making a resource feel precious and for avoiding unintended hoarding.

Fourth, as an architecture reference. The combination of a phase-aware 'Brain' dispatcher and a neural-symbolic interface that lowers verbal decisions into engine-executable function calls is an easy-to-copy pattern for embedding an LLM safely into a rule-strict game, keeping free conversation and rule-bound execution at once.

Limitations

On limitations, start with what the authors concede. The one I (Fukai) would stress is that evaluation is confined to LLM-versus-LLM play; mixed human-LLM games are not yet run, so we do not know what happens when a human exploits an agent's quirks (like the Ship premium). The negotiation protocol is also simplified: committed trades are binding proposals, with no broken promises, conditional agreements, deferred payment, reputation, or betrayal, which are the messy heart of human bargaining. And models are not fine-tuned (retrained for a specific use), so results reflect 'out-of-the-box' ability rather than peak performance tuned for this economy. Finally, the authors say frankly that the number of rollouts is finite, which leaves some random variation in the results.

Two things I would add from reading. One: the text gives no concrete scores in prose, leaving trends to the figures, so the paper's words alone cannot tell you exactly how much stronger each model is — you must look at the figures. Two: whether the observed 'passivity' is a side effect of tuning chat models to be polite and cooperative (RLHF-style optimization toward human preference) or a genuine lack of strategic ability is not separated within this paper's scope. For using these as game AI, that distinction would matter a great deal.

Fukai's Reading

Now my own reading (this section is my opinion). I would place this study in the tradition of using a board game's rulebook as a mirror that reveals an AI's weaknesses. Sidereal Confluence is, at heart, humans around a table raising their voices, driving up prices, and going for a last-auction upset — that heat is the fun of it. What this paper quietly shows, as I read it, is that today's LLMs can be the 'nicest player' at that table but rarely the strongest. In the vocabulary of design criticism, it reads as one case where AI will not learn the unwritten rule of bargaining unless you teach it explicitly. If so, the job of building a fun negotiation AI shifts from adding politeness to designing how to grant a healthy greed — and that is exactly where I see the designer's role.

Closing

To close: if mixed-motive multi-agent evaluation now interests you, try the trading board game this paper is built on; playing it once makes the paper's concerns click at a gut level. Academically, reading it alongside negotiation-agent work set in the board game Diplomacy, and the broader lineage of evaluating agents in non-zero-sum cooperative games, will give you a map. You will get a feel for where the difficulty currently lies in making AI not just win, but bargain cleverly.

References

Papers and related material referenced in this article:

SidConArena: An Environment Evaluating Agents in Open-Ended, Positive-Sum Bargaining Game (Feng, Chen, He, 2026, arXiv preprint arXiv:2606.27397)

Same paper, HTML version (full text and figures)

Related game: the board game Sidereal Confluence (the economic structure this paper draws on)

