Weak Reasoning Models Need Selection, Not Just More Agents
arXiv:2605.14163 frames verifier-backed agent orchestration as inference-time boosting: repeated weak proposals expose latent capability, but critics, comparators, and local soundness signals determine how much of that capability the harness can actually recover.

“Agentic Systems as Boosting Weak Reasoning Models” is useful because it refuses the easy slogan. The paper is not saying “more agents are all you need.” It is saying something sharper: repeated weak proposals can expose latent correct solutions, but only a verifier-backed harness can turn that latent capability into realized performance.
That distinction matters for builders. A multi-agent system is not automatically an intelligence multiplier. It is a proposal-selection machine. If the proposal pool contains a correct path and the harness can identify it, test-time orchestration can look surprisingly strong. If the pool does not contain a correct path, no amount of selection can recover an absent solution.
The core decomposition
The paper frames verifier-backed committee search as inference-time boosting for reasoning language models. But reasoning is not ordinary supervised boosting. A reasoning system does not simply combine labels from weak learners. It must generate a useful local move, identify that move, preserve progress, and repeat until it reaches a terminal solution.
The authors separate four quantities:
- Proposal coverage: does at least one useful candidate appear in the sampled pool?
- Local identifiability: can the harness recognize or rank the useful candidate without hidden ground truth?
- Progress: do selected local moves compose into a valid trajectory toward a solution?
- Diversity: do additional calls escape different failure modes, or do they repeat the same blind spot?
This is the paper’s most reusable engineering lens. When an agent system fails, “the model is weak” is too coarse. The failure might be:
- no correct proposal was generated;
- a correct proposal was generated but the selector missed it;
- local moves did not preserve progress;
- all proposals shared the same blind spot;
- the verifier signal was too weak, expensive, or unavailable.
Those are different bugs. They need different fixes.
Coverage is not identifiability
One of the cleanest points in the paper is that repeated sampling can improve coverage, but coverage alone does not create a critic or comparator.
If a candidate pool contains a correct patch, that fact is only useful if the harness has some signal for identifying it. In verifier-backed domains, that signal may come from:
- execution;
- tests;
- type checking;
- proof checking;
- constraint solving;
- structured local evidence;
- learned critics or pairwise comparators.
This is the difference between “generate many possible answers” and “recover the good answer.” A best-of-k oracle can tell us whether the right answer existed in the pool, but the oracle is not deployable when it depends on hidden tests. It is a diagnostic upper bound, not a product architecture.
For agent builders, this means best-of-N is not a strategy by itself. Best-of-N without a reliable selector is just expensive sampling.
SWE-bench Verified as a diagnostic testbed
The empirical section uses SWE-bench Verified, with 500 software-engineering tasks. Each task provides a repository, an issue description, and visible tests; success is measured by held-out hidden tests.
The key setup detail is important: for each task, the authors generate a fixed pool of k = 8 candidate patches using independent GPT-5.4 nano proposer runs. Selector ablations reuse the same cached proposal pool, so differences in solve rate reflect selection quality rather than generation differences.
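The cached-pool design can be mirrored in an offline evaluation loop. The following is a minimal sketch, not the authors' code: `solve_rates`, the `Selector` type, and the `hidden_pass` label matrix are hypothetical names, and it assumes you have already recorded, per task, which cached candidates pass the hidden tests.

```python
from typing import Callable, Dict, List, Sequence

# A selector sees only the candidate patches and returns the index it picks.
Selector = Callable[[Sequence[str]], int]


def solve_rates(
    pools: List[List[str]],
    hidden_pass: List[List[bool]],
    selectors: Dict[str, Selector],
) -> Dict[str, float]:
    """Score every selector on the same cached proposal pools.

    hidden_pass[t][i] is True when candidate i of task t passes the hidden
    tests. These labels are used only for offline scoring, never by a selector.
    """
    n_tasks = len(pools)
    rates: Dict[str, float] = {}
    for name, select in selectors.items():
        solved = sum(hidden_pass[t][select(pool)] for t, pool in enumerate(pools))
        rates[name] = solved / n_tasks
    # Diagnostic upper bound: did any candidate in the pool pass?
    rates["oracle_best_of_k"] = sum(any(row) for row in hidden_pass) / n_tasks
    return rates
```

Because every selector reads the same `pools`, any difference between two entries in the returned dictionary comes from selection, not from a fresh round of generation.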
The headline result:
- single GPT-5.4 nano proposal: 67.0% solve rate;
- critic–comparator orchestration with k = 8: 76.4%;
- oracle best-of-8 upper bound: 79.0%;
- according to the paper, the 76.4% result matches the reported standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking.
The important reading is not “nano is secretly frontier.” The important reading is:
Many hidden-test-passing patches are already present in weak-model proposal pools; the main challenge is selecting them without hidden-test access.
That is a harness result, not merely a model result.
Critics and comparators play different roles
The system uses two local selection signals:
- Critics evaluate individual patches using local evidence such as the issue, patch, visible tests, and execution traces.
- Comparators rank candidate patches against each other through pairwise comparisons.
The ablations show that neither signal is enough on its own. Critics-only selection improves over the single-proposal baseline but plateaus below the full system. Binary critics are useful for rejecting clearly flawed patches, but they are coarse. Comparator-only selection is stronger, because pairwise ranking carries information about relative patch quality, but it still benefits from first removing implausible candidates.
The full system works because it composes both signals:
- critic gate: remove obviously bad candidates;
- comparator stage: choose among plausible survivors.
This is a practical pattern: use cheap/local checks to reduce the candidate set, then use more nuanced pairwise ranking where it matters.
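The gate-then-rank pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `critic` and `compare` are hypothetical callables standing in for whatever local checks and pairwise judge you have, and both the fallback when the gate rejects every candidate and the lowest-index tie-break are choices made here for the sketch.

```python
from typing import Callable, Optional, Sequence

Critic = Callable[[str], bool]          # True means the patch looks plausible
Comparator = Callable[[str, str], int]  # >0 prefers first, <0 prefers second, 0 is a tie


def select_patch(
    patches: Sequence[str], critic: Critic, compare: Comparator
) -> Optional[int]:
    """Return the index of the chosen patch, or None for an empty pool."""
    if not patches:
        return None
    # Stage 1: the critic gate removes obviously bad candidates.
    survivors = [i for i, p in enumerate(patches) if critic(p)]
    # Assumption of this sketch: if the gate rejects everything,
    # fall back to ranking the full pool rather than returning nothing.
    if not survivors:
        survivors = list(range(len(patches)))
    # Stage 2: round-robin pairwise comparison among the survivors.
    wins = {i: 0 for i in survivors}
    for pos, a in enumerate(survivors):
        for b in survivors[pos + 1:]:
            outcome = compare(patches[a], patches[b])
            if outcome > 0:
                wins[a] += 1
            elif outcome < 0:
                wins[b] += 1
    # Most wins is selected; ties go to the lowest index.
    return max(survivors, key=lambda i: (wins[i], -i))
```

The gate keeps the number of pairwise calls down, since the comparison count grows quadratically in the number of survivors.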
Treat judge outputs as weak evidence, not truth
The comparator aggregation rule is especially worth copying.
For each unordered pair of patches, the system queries both presentation orders: once with patch i shown first, once with patch j shown first. A pairwise win is counted only if both orders select the same patch after mapping back to the original indices. If the two orders disagree, or either order returns a tie, the comparison is treated as a tie.
That is a small but serious design choice. It acknowledges that LLM comparators are noisy and presentation-order-sensitive. The harness does not blindly majority-vote judge outputs. It asks whether the preference is stable under a superficial perturbation.
A good agent harness should have this kind of humility. LLM judges are useful, but they are not hidden tests.
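The order-swapped rule described above is simple to implement. Here is a minimal sketch, assuming a hypothetical `judge` callable that takes two patches in presentation order and returns `"first"`, `"second"`, or `"tie"`; the function names and return labels are illustrative, not taken from the paper.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence

# Hypothetical judge: given (shown_first, shown_second),
# returns "first", "second", or "tie".
Judge = Callable[[str, str], str]


def stable_preference(judge: Judge, patch_i: str, patch_j: str) -> str:
    """Return "i", "j", or "tie" after querying both presentation orders."""
    forward = judge(patch_i, patch_j)   # patch i shown first
    backward = judge(patch_j, patch_i)  # patch j shown first
    # Map each positional verdict back to the original indices.
    fwd = {"first": "i", "second": "j"}.get(forward, "tie")
    bwd = {"first": "j", "second": "i"}.get(backward, "tie")
    # A win counts only when both orders agree; anything else is a tie.
    return fwd if fwd == bwd else "tie"


def count_stable_wins(judge: Judge, patches: Sequence[str]) -> Dict[int, int]:
    """Count order-stable pairwise wins for every patch in the pool."""
    wins = {i: 0 for i in range(len(patches))}
    for i, j in combinations(range(len(patches)), 2):
        verdict = stable_preference(judge, patches[i], patches[j])
        if verdict == "i":
            wins[i] += 1
        elif verdict == "j":
            wins[j] += 1
    return wins
```

A judge that always prefers whichever patch is shown first earns no wins for anyone under this rule, which is the point: pure position bias is converted into ties instead of votes.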
Oracle gap recovery is the metric builders should steal
The paper defines oracle-gap recovery using:
- pass@1 / single-proposal success;
- oracle best-of-k success;
- implemented system success on the same proposal pool.
Conceptually:
- pass@1 measures one-shot proposal quality;
- oracle best-of-k measures latent capability exposed by sampling;
- system success measures how much of that latent capability the harness recovers;
- the remaining gap to oracle diagnoses imperfect identification;
- failures unreachable by oracle diagnose proposal coverage / blind spots.
This turns evaluation into a debugging tool. If oracle best-of-k is much higher than pass@1 but your system's solve rate stays close to pass@1, improve selection: critics, comparators, tests, aggregation. If oracle best-of-k itself is low, selection is not the main bottleneck. Improve proposers, decomposition, tools, retrieval, or diversity.
Without this decomposition, teams can waste time polishing the wrong part of the system.
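The decomposition can be turned into two small helpers. This is a sketch under an assumption: it normalizes recovery as the fraction of the pass@1-to-oracle gap that the system closes, which is a natural reading of the three quantities above but may not match the paper's exact definition. The names `oracle_gap_recovery` and `failure_bucket` are hypothetical.

```python
from typing import Sequence


def oracle_gap_recovery(pass_at_1: float, system: float, oracle: float) -> float:
    """Fraction of the pass@1-to-oracle gap recovered by the system.

    Assumed normalization: (system - pass@1) / (oracle - pass@1).
    """
    gap = oracle - pass_at_1
    if gap <= 0:
        raise ValueError("oracle best-of-k must exceed pass@1 to define recovery")
    return (system - pass_at_1) / gap


def failure_bucket(pool_passes: Sequence[bool], selected: int) -> str:
    """Classify one task from offline hidden-test labels on its cached pool."""
    if pool_passes[selected]:
        return "solved"
    if any(pool_passes):
        return "selector_miss"     # oracle-reachable but missed
    return "coverage_failure"      # oracle-unreachable: no passing candidate


# Headline numbers reported above: 67.0% single, 76.4% system, 79.0% oracle.
recovery = oracle_gap_recovery(67.0, 76.4, 79.0)
print(f"recovered {recovery:.0%} of the oracle gap")
```

Under this normalization the headline numbers give 9.4 of 12.0 points, roughly 78% of the gap. Tallying `failure_bucket` over a benchmark then shows whether the remaining misses are a selection problem or a generation problem.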
The ceiling: shared blind spots
The paper’s blind-spot argument is the anti-hype core.
More proposals reduce finite sampling error when the proposal distribution assigns nonzero probability to useful moves. But if the proposer assigns zero probability to a correct move on some latent slice of tasks, oracle best-of-k converges to a ceiling below 100%. No selector can recover a solution that never appears.
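A simplified model makes the ceiling concrete. It is an illustration, not the paper's formal statement, and it assumes that the k proposals for a task are independent draws, that task t has per-proposal success probability p_t, and that q denotes the fraction of tasks with p_t = 0 (symbols introduced here for the sketch):

```latex
\mathrm{oracle}_k
  \;=\; \mathbb{E}_t\!\left[\, 1 - (1 - p_t)^k \,\right]
  \;\xrightarrow[k \to \infty]{}\;
  \Pr_t\!\left[\, p_t > 0 \,\right]
  \;=\; 1 - q .
```

Tasks with p_t > 0 are eventually covered as k grows, while tasks with p_t = 0 contribute nothing at any k, so the limit stays at 1 - q regardless of how good the selector is.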
The experiments echo this: after the critic–comparator harness recovers most of the best-of-k gain, remaining failures are mostly proposal-coverage failures rather than selector misses.
Builder translation:
- if correct candidates appear but are missed, strengthen identifiability;
- if correct candidates do not appear, strengthen generation;
- if all proposers fail the same way, add real diversity, not duplicate calls;
- if the domain lacks local soundness signals, be careful about claiming “boosting.”
Where this applies — and where it does not
This framework is strongest for verifier-backed tasks: code repair, theorem proving, program synthesis, constraint solving, typed transformations, or workflows with executable checks. These domains can provide local evidence that a move is sound or at least plausible.
It is weaker for open-ended tasks without local verifiers. You can still use committees and comparators for writing, planning, or strategy, but the “boosting” claim becomes softer. A committee of models may converge on a persuasive answer, not a correct one. Without reliable local evidence, selection can become preference laundering.
The cost side also matters. The paper explicitly notes added model calls, verification cost, latency, and system complexity. A harness that improves a benchmark score but triples latency is worth shipping only if the accuracy gain justifies that cost for the workload.
Practical checklist for agent-system builders
When designing a multi-agent reasoning harness, ask:
- What is the proposal pool?
- Does increasing k actually increase diversity, or just repeat the same failure mode?
- Can we measure oracle best-of-k on an offline benchmark?
- What local soundness signals exist: tests, execution, type checks, proof checks, constraints?
- Which candidates should be filtered by critics before ranking?
- Are pairwise comparators robust to presentation order?
- Do we treat judge disagreement as uncertainty rather than forcing a winner?
- Are failures oracle-reachable-but-missed, or oracle-unreachable?
- Is the next improvement lever selection or generation?
- Is the latency/compute cost acceptable?
This is the difference between “multi-agent vibes” and an engineered inference-time boosting system.
Bottom line
“Agentic Systems as Boosting Weak Reasoning Models” gives a useful theory-and-benchmark frame for why weak-model committees can sometimes approach stronger standalone models.
But its best lesson is disciplined, not magical:
Test-time orchestration works when diverse proposals expose latent correct solutions and verifier-backed selection reliably recovers them.
More agents are not enough. More candidates are not enough. The harness needs evidence, selection, progress, and failure decomposition. Otherwise you are not boosting reasoning — you are just buying more samples and hoping the loudest one is right.
Citation
- Varun Sunkaraneni, Pierfrancesco Beneventano, Riccardo Neumarker, Tomaso Poggio, Tomer Galanti. “Agentic Systems as Boosting Weak Reasoning Models.” arXiv:2605.14163v1, 2026. https://arxiv.org/abs/2605.14163