🤖 For Agents

Don’t Add Agents. Design the Coordination Layer.

Coordination is not a prompt pattern or a vibe. This arXiv paper treats multi-agent coordination as an architectural layer with measurable failure signatures, cost profiles, and design tradeoffs.

2026-05-12 · 8 min read
Multi-Agent Systems · Coordination · Agent Architecture · Evaluation · Prediction Markets


Most multi-agent systems do not fail like weak models. They fail like badly run organizations.

Too many voices. Unclear authority. Premature agreement. No disciplined way to combine evidence. A beautiful workflow diagram can still hide the same old failure mode: several agents talk, nobody owns the decision, and the final answer feels collaborative while being less reliable than a simpler baseline.

That is why the useful part of “Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems” is not “five agent patterns were benchmarked.” The stronger lesson is architectural: coordination should be treated as its own layer, separate from model capability, prompt style, tool access, and application logic.

Do not add agents and hope intelligence emerges. Specify the coordination layer.

The paper’s actual claim

The authors study five coordination configurations on binary prediction-market tasks:

  • independent ensemble
  • peer-critique debate
  • orchestrator-specialist
  • sequential pipeline
  • consensus alignment

They hold the model, tool stack, prompt template, and per-call output cap constant, then compare how the coordination pattern changes the system’s behavior. The testbed uses 100 Polymarket binary markets resolved after the model’s training cutoff, with web search disabled for the controlled experiment.

That caveat matters. This is not a claim that one architecture is universally best across models, domains, tools, or information environments. The authors explicitly frame the work as a methodology-validating first instantiation of an architectural-layer framework, not as a general law of multi-agent systems.

The result is sobering anyway: none of the configurations produced positive alpha against the market-consensus baseline in this fixture. The best configuration, the sequential pipeline, roughly matched market consensus; the others were worse. Adding agents did not magically create forecasting edge.

For builders, that is the point.

Coordination is a spec, not a vibe

A coordination layer is not “we have three agents and they discuss.” It is a concrete specification of how the system is allowed to move information and authority.

A useful spec should answer at least these questions:

  • Topology: Which agents can send information to which other agents?
  • Roles: Who researches, critiques, decomposes, verifies, aggregates, or decides?
  • Authority: Which agent or operator has final say for each class of decision?
  • Synchronization: Do agents act independently, in rounds, sequentially, or event-by-event?
  • Aggregation: Are outputs averaged, selected, ranked, weighted, debated, or passed through a hierarchy?
  • Termination: When does the system stop — fixed rounds, convergence, budget exhaustion, or external trigger?
  • Failure handling: What happens when an agent returns malformed output, times out, or contradicts the rest?
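
One way to keep that spec from drifting back into vibes is to write it down as data. The sketch below is illustrative, not the paper’s implementation: the names (CoordinationSpec, AggregationRule, the example pipeline roles) are assumptions, but it shows the kind of object a coordination layer can be reviewed and diffed as.

```python
# Illustrative sketch only: one way to write the coordination layer down as data.
# Names (CoordinationSpec, AggregationRule, the example roles) are assumptions,
# not the paper's implementation.
from dataclasses import dataclass
from enum import Enum


class AggregationRule(Enum):
    MEAN = "mean"
    MEDIAN = "median"
    WEIGHTED_MEAN = "weighted_mean"
    JUDGE_SELECT = "judge_select"
    PIPELINE_HANDOFF = "pipeline_handoff"


@dataclass
class CoordinationSpec:
    edges: dict[str, list[str]]       # topology: who may send information to whom
    roles: dict[str, str]             # roles: what each agent is responsible for
    authority: dict[str, str]         # authority: final say per decision class
    synchronization: str              # "independent", "rounds", "sequential", "event"
    aggregation: AggregationRule      # how individual outputs become one answer
    max_rounds: int                   # termination: hard round limit
    budget_usd: float                 # termination: hard budget cap
    fallback: str = "drop_and_reaggregate"  # failure handling for malformed or timed-out agents


# Example: a sequential pipeline where a verifier holds final authority.
pipeline_spec = CoordinationSpec(
    edges={"researcher": ["analyst"], "analyst": ["forecaster"], "forecaster": ["verifier"]},
    roles={"researcher": "gather evidence", "analyst": "analyze", "forecaster": "estimate",
           "verifier": "check and commit"},
    authority={"final_probability": "verifier"},
    synchronization="sequential",
    aggregation=AggregationRule.PIPELINE_HANDOFF,
    max_rounds=1,
    budget_usd=0.40,
)
```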

Without this layer, “multi-agent” is too vague to debug. Two systems may both be called agent teams while having completely different failure surfaces. One may fail because the orchestrator anchors too early. Another may fail because debate suppresses useful dissent. Another may fail because a pipeline carries a bad first-stage frame downstream and never recovers.

Once the layer is explicit, those are no longer mysterious model errors. They are architectural hypotheses you can test.

Murphy decomposition is the useful diagnostic move

The paper uses the Brier score because the task is probabilistic forecasting. More interestingly, it applies the Murphy decomposition to split the Brier score into components:

  • UNC: irreducible uncertainty of the question set
  • REL: reliability, or calibration error
  • RES: resolution, or discriminative power
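
As a rough sketch of how that split is computed (not the paper’s code, and binning conventions vary), the binned decomposition gives Brier ≈ REL - RES + UNC:

```python
import numpy as np

def murphy_decomposition(probs, outcomes, n_bins=10):
    """Binned Murphy decomposition of the Brier score.

    Returns REL (calibration error), RES (discriminative power) and UNC
    (irreducible uncertainty). With constant forecasts per bin,
    brier = REL - RES + UNC; otherwise the identity is approximate.
    Illustrative sketch; binning conventions vary across implementations.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()

    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        n_b = mask.sum()
        f_b = probs[mask].mean()                  # mean forecast in bin
        o_b = outcomes[mask].mean()               # observed frequency in bin
        rel += n_b * (f_b - o_b) ** 2             # penalizes miscalibration
        res += n_b * (o_b - base_rate) ** 2       # rewards discrimination
    unc = base_rate * (1.0 - base_rate)           # property of the question set
    return {
        "brier": float(np.mean((probs - outcomes) ** 2)),
        "REL": rel / n,
        "RES": res / n,
        "UNC": unc,
    }
```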

That split is important because two agent systems can have similar aggregate scores for different reasons.

One system may be well-calibrated but timid: it rarely makes sharp distinctions. Another may be highly discriminative but poorly calibrated: it spots signal, then overstates confidence. A third may sound impressively aligned because all agents converge, while actually collapsing diversity and losing resolution.

This is where the “badly run organization” analogy becomes useful. A team can fail by being chaotic, but it can also fail by being too agreeable. Consensus feels mature. In forecasting and decision systems, premature consensus can be an error amplifier.

The paper’s predicted signatures make that concrete:

  • Independent ensemble: preserves diversity; aggregation may cancel uncorrelated errors, but correlated errors can still produce confident wrongness.
  • Peer-critique debate: can improve calibration through cross-correction, but may suppress minority views and reduce discriminative variance.
  • Orchestrator-specialist: can help when decomposition is good, but the orchestrator becomes a single point of cascading error.
  • Sequential pipeline: can structure work cleanly, but early-stage framing errors propagate downstream.
  • Consensus alignment: can collapse disagreement into one voice, often losing diversity and resolution.

This is the builder-facing value: do not only measure final answer quality. Measure the shape of failure.

The cost-quality frontier is the deployment lesson

The benchmark’s headline ranking was:

  1. sequential pipeline — best Brier, highest cost
  2. independent ensemble — close enough to be a strong cheap baseline
  3. orchestrator-specialist
  4. peer-critique debate
  5. consensus alignment — worst in this fixture

The cost profile matters as much as the score. In the reported run, independent ensemble cost about $0.10 per market with Brier 0.159. Sequential pipeline cost about $0.36 per market with Brier 0.153. Orchestrator-specialist and peer-critique debate cost more than ensemble while performing worse on Brier. Consensus alignment cost about the same as ensemble but performed much worse.

That does not mean “always use independent ensemble” or “always use sequential pipeline.” It means the coordination layer has a Pareto frontier.
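
A minimal sketch of that frontier test, using only the two per-market figures quoted above (the paper reports the rest; plug in your own telemetry for real decisions):

```python
def pareto_frontier(configs):
    """Return the configurations not dominated on (cost, brier); lower is better on both.
    Sketch for choosing a coordination pattern from your own telemetry."""
    frontier = []
    for name, (cost, brier) in configs.items():
        dominated = any(
            oc <= cost and ob <= brier and (oc, ob) != (cost, brier)
            for other, (oc, ob) in configs.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Only the two per-market figures quoted above; the paper reports the rest.
reported = {
    "independent_ensemble": (0.10, 0.159),
    "sequential_pipeline": (0.36, 0.153),
}
print(pareto_frontier(reported))  # both survive: one is cheaper, one scores better
```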

If the task is cheap, low-stakes, and benefits from diversity, an independent ensemble may be the right baseline. If the task needs staged transformation — gather evidence, analyze, forecast, verify — a sequential pipeline may justify the extra cost. If the task requires decomposition by expertise, an orchestrator can be useful, but only if the orchestrator’s authority and verification paths are designed carefully.

The bad default is paying coordination overhead without buying reliability.

Why consensus is dangerous

Consensus alignment is seductive because it produces emotionally satisfying output. The agents converge. The final answer looks coherent. The system appears stable.

But convergence is not correctness. If agents share the same blind spot, forcing agreement can erase the only useful signal: disagreement.

In this experiment, consensus alignment was the weakest configuration. The authors caution that the sample is not large enough to establish a broad cross-domain law, but the failure mode is familiar to anyone who has run agent teams: a system can become more confident by becoming less diverse.

For production builders, disagreement should be treated as a resource, not merely a mess to clean up. The question is when to preserve it, when to resolve it, and who gets authority to do so.

A practical design checklist

Before adding another agent to a system, answer this checklist.

1. What failure mode are you trying to reduce?

If you cannot name the failure mode, more agents are theater. Are you reducing hallucination, missing evidence, bad decomposition, weak calibration, tool error, policy risk, or final-answer variance?

2. Which diversity should survive to the final decision?

Independent reasoning can help only if the aggregation rule preserves useful differences long enough to matter. If every agent sees the same context, uses the same prompt, and is pushed toward agreement, you may be buying redundant tokens.

3. Where does authority live?

Authority can live in an orchestrator, a vote, a score, a verifier, a human, or a deterministic rule. If authority is implicit, debugging will be miserable.

4. What is the aggregation operator?

Mean, median, weighted mean, rank-then-select, judge-select, debate-then-commit, and pipeline handoff are different architectures. They do not fail the same way.
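
A hedged sketch of what that choice looks like in code, with hypothetical rule names; the point is that swapping the rule swaps the failure surface:

```python
import statistics

def aggregate(probs, weights=None, rule="mean"):
    """Combine per-agent probabilities under different operators.
    Hypothetical rule names; each rule is a different architecture with
    a different failure surface."""
    if rule == "mean":
        return sum(probs) / len(probs)
    if rule == "median":
        return statistics.median(probs)
    if rule == "weighted_mean":
        return sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    if rule == "rank_then_select":
        # take the forecast of the agent with the best score (e.g. held-out calibration)
        best = max(range(len(probs)), key=lambda i: weights[i])
        return probs[best]
    raise ValueError(f"unknown rule: {rule}")


forecasts = [0.62, 0.55, 0.71]
print(aggregate(forecasts, rule="median"))                                  # 0.62
print(aggregate(forecasts, weights=[0.5, 0.3, 0.2], rule="weighted_mean"))  # 0.617
```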

5. What is the stopping rule?

Fixed rounds, convergence, confidence threshold, budget cap, and verifier approval create different incentives. “Let agents talk until it feels done” is not a stopping rule.
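
A minimal sketch of an explicit stopping rule; the thresholds are placeholders to tune per task, not recommendations:

```python
def should_stop(round_idx, history, spent_usd, *,
                max_rounds=3, budget_usd=0.50, convergence_eps=0.02):
    """Explicit termination check for a multi-round coordination loop.
    Thresholds are placeholders; the point is that every exit path is
    named and auditable."""
    if round_idx >= max_rounds:
        return True, "fixed_rounds"
    if spent_usd >= budget_usd:
        return True, "budget_exhausted"
    if len(history) >= 2 and abs(history[-1] - history[-2]) < convergence_eps:
        return True, "converged"
    return False, "continue"
```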

6. What telemetry proves the coordination layer helped?

Track cost, latency, disagreement, revisions, verifier interventions, fallback rate, calibration, and per-role contribution. If you only log the final answer, you cannot tell whether coordination improved the system or merely made it expensive.
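
One lightweight way to capture that telemetry is a per-decision trace record; the schema below is hypothetical, not from the paper:

```python
from dataclasses import dataclass, asdict
import json, time

@dataclass
class CoordinationTrace:
    """Per-decision telemetry for the coordination layer (hypothetical schema)."""
    task_id: str
    pattern: str                 # e.g. "sequential_pipeline"
    cost_usd: float
    latency_s: float
    disagreement: float          # e.g. std dev of per-agent probabilities
    revisions: int               # how often agents changed their answers
    verifier_interventions: int
    fallback_used: bool
    final_probability: float

trace = CoordinationTrace("mkt-001", "independent_ensemble", 0.10, 12.4,
                          0.08, 1, 0, False, 0.62)
print(json.dumps({"ts": time.time(), **asdict(trace)}))
```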

7. What baseline must it beat?

Always compare against simpler baselines: single strong model, retrieval-augmented single agent, independent ensemble, and human/crowd/market baseline where relevant. A multi-agent architecture that cannot beat a boring baseline is not production architecture; it is choreography.

The real takeaway

The paper does not prove that multi-agent systems beat markets. It does not prove that sequential pipelines are universally best. It does not prove that consensus is always bad.

It proves something more useful for builders: coordination choices are measurable architectural choices. They create distinct error pathways, cost profiles, and diagnostic signatures.

That is the right level of abstraction. “More agents” is not a strategy. “Debate” is not automatically epistemic rigor. “Consensus” is not truth. “Orchestration” is not reliability unless authority, aggregation, termination, and verification are specified.

Multi-agent systems should be designed less like a group chat and more like an organization with operating rules, audit trails, and measurable failure modes.

Don’t add agents. Design the coordination layer.