AutoTTS Turns Test-Time Scaling Into Environment Design
AutoTTS reframes test-time scaling from hand-written reasoning heuristics into replayable environment design: define the states, actions, feedback, and cost objective, then let an explorer LLM discover better controllers.

Most test-time scaling work starts with a hand-built intuition:
- sample more reasoning branches;
- continue the promising ones;
- probe intermediate answers;
- prune bad branches;
- stop when the answer looks stable.
That is useful, but it is still mostly heuristic engineering. A researcher decides the rule, tunes thresholds, runs benchmarks, then repeats.
The interesting move in “LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling” is not just another branching rule. The paper proposes AutoTTS, a framework that changes what humans design.
Instead of directly hand-crafting a test-time scaling strategy, humans construct a discovery environment where strategies can be searched automatically.
That shift matters for agent builders.
The lesson is not “make the model think more.” The lesson is:
If you want agents to improve their own compute-allocation policy, first build a replayable environment where policy experiments are cheap, observable, and scored against the right tradeoff.
The old loop: hand-written TTS heuristics
Test-time scaling is the broad idea that an LLM can perform better if it spends more computation during inference. That can mean sampling many answers, extending reasoning chains, running tree search, using verifiers, or adaptively stopping once answers converge.
The paper frames many existing strategies inside a simple width-depth control space:
- width: how many reasoning branches to explore;
- depth: how far each branch is extended.
Different algorithms trace different paths through that space. Self-consistency expands width. Some methods extend depth on a single chain. Others branch early, probe, prune, or stop adaptively.
But in many cases, the strategy is still manually specified. Someone writes the branch/prune/stop rules.
AutoTTS asks: can an LLM discover those controller rules itself?
The new loop: controller discovery inside an environment
AutoTTS treats test-time scaling as controller synthesis.
A controller observes the current reasoning state and chooses the next action. In the paper’s width-depth instantiation, the state includes things like:
- the original question;
- which branches are active;
- how deep each branch has gone;
- which probe outputs have been revealed;
- how much budget has already been spent.
The action space is deliberately compact:
- BRANCH: create a new reasoning branch;
- CONTINUE(i): extend branch i by one interval;
- PROBE(i): reveal the current intermediate answer/signal for branch i;
- PRUNE(i): remove branch i from the active set;
- ANSWER: stop and aggregate a final answer.
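In code, that interface is small. A minimal sketch of the state and action space, with hypothetical field names (the paper's controllers are ordinary code over an interface like this, but the exact fields here are illustrative):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ActionType(Enum):
    BRANCH = auto()    # create a new reasoning branch
    CONTINUE = auto()  # extend branch i by one interval
    PROBE = auto()     # reveal branch i's current intermediate answer/signal
    PRUNE = auto()     # remove branch i from the active set
    ANSWER = auto()    # stop and aggregate a final answer


@dataclass
class Action:
    kind: ActionType
    branch: Optional[int] = None   # which branch, for CONTINUE/PROBE/PRUNE


@dataclass
class ControllerState:
    question: str
    active_branches: list          # ids of branches still alive
    depth: dict                    # branch id -> intervals generated so far
    revealed_probes: dict          # branch id -> probe signals seen so far
    tokens_spent: int              # budget already consumed
```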
An explorer LLM proposes a code-defined controller. The environment runs it, measures accuracy and token cost, records the execution trace, then feeds that history back into the next discovery round.
This is the core builder pattern:
Do not ask the agent to “be smarter” in the abstract. Give it a state space, action space, objective, feedback channel, and cheap replay loop.
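A hedged sketch of that outer loop, where `explorer.propose_controller` and `replay_env.evaluate` are invented stand-ins for the paper's actual components:

```python
def discover_controllers(explorer, replay_env, rounds=8):
    """Outer discovery loop: propose controller code, score it on replay, feed history back."""
    history = []  # everything tried so far: code, metrics, execution traces
    for _ in range(rounds):
        # The explorer LLM writes a new code-defined controller,
        # conditioned on previous candidates and their feedback.
        controller_code = explorer.propose_controller(history)

        # Replay evaluation: no fresh base-model generations are needed.
        accuracy, tokens_used, traces = replay_env.evaluate(controller_code)

        history.append({
            "controller": controller_code,
            "accuracy": accuracy,
            "tokens": tokens_used,
            "traces": traces,  # where it branched, probed, pruned, stopped
        })

    # Pick the candidate with the best accuracy-cost tradeoff; this scoring
    # rule is a placeholder, not the paper's objective.
    return max(history, key=lambda h: (h["accuracy"], -h["tokens"]))
```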
Why offline replay is the load-bearing trick
Naively evaluating every candidate controller would be expensive. Each candidate might require fresh LLM generations for many branches across many benchmark questions.
AutoTTS avoids that by using an offline replay environment.
Before discovery begins, the system pre-collects reasoning trajectories and probe signals for each problem. During controller search, candidate policies do not call the base LLM again. They replay decisions against the stored trajectories.
If a controller chooses PROBE(i) at depth k, the environment simply retrieves the pre-collected probe signal. If it chooses CONTINUE(i), it advances along an already stored branch prefix.
That makes controller evaluation cheap and deterministic enough for iterative agentic discovery.
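A minimal sketch of the replay mechanism, assuming trajectories are stored as per-branch lists of pre-generated intervals; the storage format and names are hypothetical:

```python
class ReplayEnvironment:
    """Replays controller actions against pre-collected trajectories for one problem."""

    def __init__(self, trajectories):
        # trajectories[branch_id] is a list of stored intervals, each holding
        # the tokens generated in that interval and the probe signal at that depth.
        self.trajectories = trajectories
        self.depth = {branch_id: 0 for branch_id in trajectories}
        self.tokens_spent = 0

    def continue_branch(self, branch_id):
        """CONTINUE(i): advance along the stored prefix and charge its token cost."""
        interval = self.trajectories[branch_id][self.depth[branch_id]]
        self.tokens_spent += interval["tokens"]
        self.depth[branch_id] += 1

    def probe_branch(self, branch_id):
        """PROBE(i): look up the pre-collected intermediate answer at the current depth."""
        index = max(self.depth[branch_id] - 1, 0)
        return self.trajectories[branch_id][index]["probe_signal"]
```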
The paper reports that the full discovery process costs about $39.9 and takes roughly 160 minutes. That number should not be misread as the full real-world cost of deploying a TTS system, since pre-collecting trajectories and maintaining the environment still matter, but it shows why replay changes the search economics.
Without replay, the explorer is trapped by expensive feedback. With replay, it can try controller designs, inspect failures, and improve.
Trace feedback beats final-score-only feedback
A subtle part of AutoTTS is that the explorer LLM does not only receive scalar outcomes like “accuracy went up” or “tokens went down.”
The history includes execution traces: how a controller allocated computation over time, which branches it explored, where it probed, what it pruned, and when it stopped.
That is important because final metrics are too blunt. If a controller performs poorly, the explorer needs to know why:
- did it branch too late?
- did it prune productive branches?
- did it waste tokens deepening noisy branches?
- did it stop before enough evidence accumulated?
- did it overfit to cheap answers on the search set?
The ablation result supports this design. In Table 3, removing execution traces leads to worse average performance and higher token usage than the full method. The exact numbers are benchmark-specific, but the qualitative lesson is portable: agentic improvement needs diagnostic feedback, not just a scoreboard.
For production agent systems, this is a strong hint. If you want an agent to improve scheduling, routing, tool-use, retrieval depth, or retry policies, logging only pass/fail is thin. You want traces rich enough for policy diagnosis.
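Concretely, that means logging something closer to a structured per-problem trace than a single score. A hypothetical example of what one such trace might look like:

```python
# Hypothetical execution trace for one problem: one entry per controller decision.
trace = [
    {"step": 0, "action": "BRANCH"},
    {"step": 1, "action": "CONTINUE", "branch": 0, "tokens_spent": 412},
    {"step": 2, "action": "PROBE",    "branch": 0, "probe_signal": "14"},
    {"step": 3, "action": "BRANCH"},
    {"step": 4, "action": "CONTINUE", "branch": 1, "tokens_spent": 858},
    {"step": 5, "action": "PROBE",    "branch": 1, "probe_signal": "27"},
    {"step": 6, "action": "PRUNE",    "branch": 0},
    {"step": 7, "action": "ANSWER",   "final_answer": "27", "correct": True},
]
```

A trace like this lets the explorer ask the diagnostic questions above, rather than guessing from an accuracy number alone.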
β is a product knob, not just a hyperparameter
The paper also introduces beta parameterization.
In early experiments, agents tended to propose controllers with many hyperparameters. With only a few discovery rounds, that makes search messy and brittle. Controllers can collapse into sharp thresholds that look good on the search set but do not generalize.
AutoTTS constrains each controller to expose one scalar tradeoff parameter, β, and to derive its internal hyperparameters from it. A larger β corresponds to a larger token budget.
This does two useful things:
- It makes the search space tractable.
- It gives a clean knob for sweeping the accuracy-cost frontier.
For builders, β is not just a mathematical convenience. It is the kind of control surface you would want in a real product: run cheap for easy tasks, spend more for high-value tasks, and expose a predictable budget-quality curve.
The tradeoff is also obvious: by compressing the controller family into one β-controlled curve, you make discovery easier but restrict the policy space. That is probably the right engineering move for this paper’s setting, but it is still a constraint, not magic.
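To make the idea concrete, here is a sketch of how a discovered controller might derive its internals from β; the specific mappings are invented for illustration and are not the paper's formulas:

```python
def hyperparams_from_beta(beta):
    """Derive every internal knob from one scalar tradeoff parameter.

    A larger beta buys more width, more depth, and a larger token budget.
    The exact mappings below are illustrative only.
    """
    return {
        "max_branches": max(2, round(4 * beta)),    # width grows with beta
        "max_depth": max(1, round(8 * beta)),       # depth grows with beta
        "token_budget": int(200_000 * beta),        # overall spend cap
        "votes_to_stop": max(2, round(3 * beta)),   # agreement needed before ANSWER
    }


# Sweeping beta traces out a predictable budget-quality curve.
for beta in (0.25, 0.5, 1.0, 2.0):
    print(beta, hyperparams_from_beta(beta))
```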
What the results show
The main experiments use offline replay environments built from Qwen3 models at 0.6B, 1.7B, 4B, and 8B. The controller is discovered on AIME24 and evaluated on held-out AIME25 and HMMT25 environments.
The reported pattern is that AutoTTS improves the accuracy-cost tradeoff over strong hand-crafted baselines in most settings.
One concrete example from Table 1:
- On the Qwen3-1.7B held-out average, SC@64 reaches 34.3% accuracy using about 1093.5k tokens, while AutoTTS (β=1.0) reaches 40.6% accuracy using about 646.1k tokens.
The paper also reports targeted transfer tests beyond the main setting:
- DeepSeek-R1-Distill-Llama-8B on HMMT25: AutoTTS (β=1) reaches 27.2% with 533.9K tokens, compared with SC@64 at 26.7% with 985.7K tokens.
- Qwen3-1.7B on GPQA-Diamond: AutoTTS (β=0.5) reaches 41.6% with 151.0K tokens, compared with SC@64 at 41.3% with 510.0K tokens.
These are encouraging numbers. They suggest the discovered controller is not merely memorizing the exact search benchmark.
But they should be read with restraint.
What not to claim
Do not claim AutoTTS makes the base model better for free.
It does not update the base model weights. It changes how inference-time computation is allocated.
Do not claim it universally reduces inference cost.
It discovers an accuracy-cost controller under a particular environment and budget parameterization. Depending on β and task difficulty, you may spend more or less compute.
Do not claim it solves all agent orchestration.
The paper’s concrete instantiation is width-depth reasoning over pre-collected trajectories and probe signals, with strongest evidence on math reasoning benchmarks. GPQA-Diamond and DeepSeek transfer results are useful signals, but they are not proof that the same controller-discovery setup works for arbitrary tool-use agents, coding agents, web agents, or multi-agent workflows.
Do not ignore the offline data cost.
Replay is cheap because trajectories and probe signals are already collected. That is the right design for discovery, but production systems still need to decide how to collect, refresh, store, and validate those trajectories.
The agent-builder takeaway
AutoTTS is interesting because it points toward a reusable recipe for agent improvement:
- Define the controllable computation space. What can the agent choose: branch, continue, probe, prune, retrieve, call a tool, retry, escalate, stop?
- Make evaluation replayable. Fresh online calls are too expensive for broad search. Store trajectories, observations, tool results, probes, and outcomes.
- Log diagnostic traces. Let the explorer see not just whether a policy won, but how it spent budget.
- Expose a budget-quality knob. A single clean tradeoff parameter can be more useful than a pile of brittle thresholds.
- Evaluate on held-out environments. A policy that only wins on the search set is just overfit automation.
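The same recipe ports beyond width-depth reasoning. A hypothetical minimal interface for a replayable discovery environment, not tied to the paper's code:

```python
from typing import Any, Protocol


class ReplayableEnvironment(Protocol):
    """The minimal contract a discovery loop needs from an environment."""

    def reset(self, problem_id: str) -> Any:
        """Return the initial controller-visible state for a stored problem."""

    def step(self, action: Any) -> tuple:
        """Apply one controller action against stored data; return (new state, cost)."""

    def score(self) -> float:
        """Task score for the finished episode (e.g. final-answer correctness)."""

    def trace(self) -> list:
        """Diagnostic trace of how the controller spent its budget."""
```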
That is why the paper’s title lands: LLMs improving LLMs. The improvement here is not mystical self-evolution. It is environment-mediated policy discovery.
Humans still design the game board. But once the board is structured well enough, an agent can search for better ways to play.
For test-time scaling, that means fewer hand-written heuristics and more replayable controller discovery.
For agents in general, the lesson is broader:
The next jump may come less from telling agents to think harder, and more from building environments where they can cheaply discover when thinking harder is actually worth it.