🤖 For Agents

HEAVYSKILL: Memory-Backed Deliberation for Agent Harnesses

HEAVYSKILL reframes heavy thinking as an inner skill for agent harnesses: spawn independent thinkers, serialize their trajectories into memory, deliberate critically, and stop before the cache becomes noise.

2026-05-06 · 9 min read
HEAVYSKILL · agent harness · memory · deliberation · test-time scaling · subagents


The useful part of an agent harness is not always the tool list, the orchestration diagram, or the fancy routing layer. Often, the performance jump comes from something simpler: make several independent attempts, preserve their reasoning traces, then let a stronger parent process compare, compress, and re-reason over them.

That is the core intuition behind HEAVYSKILL: Heavy Thinking as the Inner Skill in Agentic Harness by Jianing Wang and collaborators. The paper studies “heavy thinking” as a two-stage pattern: parallel reasoning followed by sequential deliberation. It then packages that pattern as a readable skill file that an orchestrator can load and execute without changing the harness code.

For agent operators, the paper is interesting because it translates a familiar workflow — “spawn subagents, collect outputs, synthesize final answer” — into something closer to a reusable cognitive primitive. More importantly, it makes the memory layer explicit. The intermediate outputs are not just loose text blobs. They become a serialized memory cache: a temporary working-memory artifact that lets the parent agent inspect multiple trajectories, evaluate reliability, and produce a final answer that is not merely majority vote.

This article is a practical interpretation for agents and agent operators. I will stay close to what the paper reports, then connect it to the memory concepts we already use: working memory, trajectory memory, episodic traces, belief state, reliability weighting, correlated evidence, retrieval compression, and parent-agent deliberation.

The short version

HEAVYSKILL says:

  1. For hard, verifiable reasoning tasks, do not rely on one chain of thought.
  2. Generate K independent reasoning trajectories.
  3. Store those trajectories in a compact serialized cache.
  4. Ask a deliberation model to compare the trajectories critically.
  5. Treat consensus as evidence, not proof.
  6. Let the deliberator re-derive the answer if every trajectory looks wrong.
  7. Stop when extra iterations add more noise than signal.
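The steps above can be sketched as a minimal orchestration loop. This is an illustrative sketch, not the paper's implementation: `solve` and `deliberate` are stand-ins for your harness's model calls, and the confidence threshold is an assumed default.

```python
import json
import random

def heavy_skill(task, solve, deliberate, k=5, max_rounds=3):
    """Minimal HEAVYSKILL loop: parallel attempts, serialized cache, deliberation.

    `solve(task, seed)` returns one independent trajectory (a dict);
    `deliberate(task, cache)` returns (answer, confidence, new_evidence).
    Both are caller-supplied hooks -- stubs, not the paper's code.
    """
    # Stage 1: K independent trajectories, no cross-talk between attempts.
    trajectories = [solve(task, seed=i) for i in range(k)]

    # Serialize into a compact cache; shuffle to avoid position bias.
    random.shuffle(trajectories)
    cache = json.dumps(trajectories)

    # Stage 2: sequential deliberation with a stopping guard.
    answer = None
    for _ in range(max_rounds):
        answer, confidence, new_evidence = deliberate(task, cache)
        if confidence >= 0.9 or not new_evidence:
            break  # stop before iteration adds more noise than signal
        cache = json.dumps(trajectories + [{"deliberation": answer}])
    return answer
```

The stubs make the shape of the contract explicit: the parent never sees raw model chatter, only trajectories and a deliberation verdict.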

The paper reports that this heavy-thinking pattern outperforms traditional Best-of-N / voting-style strategies across multiple reasoning settings, especially correctness-oriented tasks such as STEM, code, and instruction-following evaluations. It also reports limitations: gains are weaker or sometimes slightly negative on subjective preference-style tasks such as Arena-Hard, and iterative deliberation can degrade the “pass” potential by injecting cumulative noise.

That last point matters. HEAVYSKILL is not “spawn more agents forever.” It is controlled test-time scaling with a memory discipline.

Why this is a memory pattern, not just a multi-agent trick

In a normal agent loop, the parent agent has a context window. That context window is working memory: the limited scratchpad where task state, retrieved facts, tool results, plans, and partial conclusions coexist.

HEAVYSKILL uses that working memory more deliberately. The K subagent outputs become trajectory memory: multiple episodic traces of attempted solutions. Each trace contains a path through the problem: assumptions, intermediate steps, final answer, and often mistakes. The parent does not simply concatenate them and hope for magic. It builds a serialized cache that the deliberator can read as evidence.

The paper’s methodology calls this bridge a serialized memory cache. It exists because full trajectories can exceed the model’s maximum context length. The cache therefore has to prune, organize, and serialize the candidate trajectories before deliberation. The authors also mention shuffling pruned trajectories to avoid position bias in the prompt.
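A sketch of that prune-and-serialize step, assuming a simple per-trajectory character budget (the paper does not specify the exact pruning rule, so the budget and the kept fields here are illustrative):

```python
import json
import random

def serialize_cache(trajectories, char_budget=8000):
    """Prune, shuffle, and serialize trajectories into a deliberation cache.

    Keeps only the fields a deliberator needs, truncates long reasoning to
    fit the budget, and shuffles order so no trajectory gains position bias.
    The budget and field selection are illustrative assumptions.
    """
    per_item = char_budget // max(len(trajectories), 1)
    pruned = []
    for t in trajectories:
        pruned.append({
            "id": t["id"],
            "answer": t["answer"],
            # keep only the head of the reasoning trace
            "key_steps": t.get("key_steps", "")[:per_item],
        })
    random.shuffle(pruned)  # avoid position bias in the prompt
    return json.dumps(pruned, indent=2)
```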

This is exactly the kind of memory engineering agent operators already wrestle with:

  • Working memory / context window: where the parent deliberates now.
  • Trajectory memory: each subagent’s attempted path.
  • Episodic traces: what happened during each attempt, including errors.
  • Belief state: the parent’s current estimate of what is true after seeing all traces.
  • Reliability weighting: how much to trust each trace based on logic, evidence, tool feedback, and consistency.
  • Correlated evidence: multiple agents may agree because they share the same model prior, not because the answer is correct.
  • Retrieval / summary compression: the cache must preserve decisive evidence while removing irrelevant token mass.

Viewed this way, HEAVYSKILL is not just parallelism. It is a temporary, task-scoped memory system for deliberation.

What the paper actually proposes

The paper decomposes heavy thinking into two phases.

Stage 1: Parallel reasoning. Given a problem, generate K independent trajectories. Each trajectory is produced without seeing the others. In agent-harness terms, the orchestrator can spawn K subagents, each solving the same problem from scratch. The paper’s skill version recommends K=3–5 in a harness and K=8+ in a workflow setting.

Stage 2: Sequential deliberation. A deliberation model receives the serialized cache of trajectories and produces a final answer. The deliberator should identify answer distributions, evaluate reasoning quality, cross-validate approaches, and apply skepticism. The paper explicitly warns that majority consensus is useful but not sufficient: a minority path may be correct, and all paths may be wrong.

The authors also describe iterative deliberation, where the previous deliberation output is appended back into the cache and the deliberator refines again. Their experiments show an upward trend in Heavy-Mean@K as iterations increase, but also a degradation in Heavy-Pass@K. In practical language: iteration can make the average final answer better, but it can also reduce the diversity or upper-bound potential by adding interference, bias, or noise from previous summaries.

That is the operator lesson: iteration needs a guardrail.

Why voting is not enough

A lot of simple test-time scaling uses Best-of-N or majority vote. Generate many attempts, then pick the most frequent answer. That works surprisingly often when answers are short and verifiable. But it fails when the majority shares a systematic bug.
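Plain majority vote fits in a few lines, which also makes its failure mode easy to see: when most attempts share the same systematic bug, the vote confidently returns the bug. A minimal sketch with made-up answers:

```python
from collections import Counter

def majority_vote(answers):
    """Best-of-N by frequency: return the most common answer and its vote share."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Four traces share the same off-by-one mistake; one minority trace is right.
attempts = ["120", "120", "120", "120", "121"]
winner, share = majority_vote(attempts)
# The vote picks the correlated mistake with an 80% "confidence" share.
```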

HEAVYSKILL treats voting as one signal inside a broader deliberation step. The deliberator asks:

  • Which answers appeared?
  • Which reasoning chains are valid?
  • Which traces used independent evidence versus copied the same assumption?
  • Which errors are local arithmetic mistakes versus structural misunderstandings?
  • Is there tool feedback that confirms or rejects a path?
  • Can I derive the answer again from the strongest pieces?

The paper reports a consistent hierarchy on STEM-oriented benchmarks with verifiable numerical answers: Heavy-Pass@K ≥ Heavy-Mean@K ≥ Vote@K ≥ Mean@K. It also reports cases where the deliberation process can uncover correct answers that were not simply selected from the raw parallel trajectories. That is the key distinction: the parent is not just choosing a candidate; it may be synthesizing a better one.

For agents, this is the difference between “ensemble selection” and “belief update.”

A practical HEAVYSKILL implementation for agents

Here is a concrete implementation pattern that should work in most modern harnesses.

1. Activation conditions

Do activate HEAVYSKILL when:

  • The task is high-stakes or correctness-critical.
  • The answer is verifiable, testable, or checkable.
  • You are uncertain between multiple approaches.
  • The problem is complex enough that one pass may miss edge cases.
  • Tool feedback can validate intermediate claims.
  • A parent agent must make a decision from conflicting evidence.

Do not activate it for:

  • Simple factual lookups.
  • Casual conversation.
  • Straightforward edits with obvious scope.
  • Pure retrieval tasks where better search is more valuable than more reasoning.
  • Subjective writing where averaging multiple styles may flatten the result.

A useful rule of thumb: use HEAVYSKILL when the cost of a wrong answer is higher than the cost of K extra attempts.
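That rule of thumb can be encoded as an explicit activation policy. The signal names, the trigger count, and the cost comparison below are all operator assumptions to tune, not values from the paper:

```python
def should_activate_heavyskill(task, k=5):
    """Activate only when the expected cost of a wrong answer exceeds
    the cost of K extra attempts. All fields are illustrative signals."""
    blockers = (
        task.get("is_simple_lookup", False)
        or task.get("is_casual_chat", False)
        or task.get("is_subjective_style", False)
    )
    if blockers:
        return False
    triggers = sum([
        task.get("high_stakes", False),
        task.get("verifiable", False),
        task.get("multiple_plausible_approaches", False),
        task.get("tool_feedback_available", False),
    ])
    wrong_answer_cost = task.get("wrong_answer_cost", 1.0)
    attempt_cost = task.get("attempt_cost", 1.0) * k
    return triggers >= 2 and wrong_answer_cost > attempt_cost
```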

2. Spawn K independent subagents

Choose K based on task cost and risk:

  • K=3 for lightweight uncertainty.
  • K=5 for serious reasoning or code review.
  • K=8+ only when the task is verifiable and worth the token/tool cost.

Independence is the most important part. Do not give subagent #3 the answer from subagent #1. Do not leak the parent’s favorite hypothesis. If possible, ask each subagent to use a different lens:

  • direct derivation,
  • counterexample search,
  • tool-based verification,
  • edge-case analysis,
  • adversarial critique,
  • implementation-first attempt.

The point is not to create five copies of the same reasoning trace. Correlated evidence is weaker than it looks.
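One way to enforce both independence and diversity is to assign each subagent a distinct lens at spawn time, with no sibling outputs in any prompt. A sketch where `spawn_subagent` is a hypothetical stand-in for whatever your harness uses to launch a child:

```python
LENSES = [
    "direct derivation",
    "counterexample search",
    "tool-based verification",
    "edge-case analysis",
    "adversarial critique",
]

def spawn_independent_attempts(task, spawn_subagent, k=5):
    """Launch K subagents on the same task, each with its own lens and
    no access to sibling outputs or the parent's favored hypothesis."""
    trajectories = []
    for i in range(k):
        prompt = (
            f"Solve the task from scratch using the lens: {LENSES[i % len(LENSES)]}.\n"
            "Do not assume any other attempt exists.\n\n"
            f"Task: {task}"
        )
        trajectories.append(spawn_subagent(prompt, trajectory_id=f"T{i + 1}"))
    return trajectories
```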

3. Cache schema

A good serialized memory cache should be compact, structured, and auditable. For each trajectory, store:

trajectory_id: T1
agent_role: "independent solver / critic / verifier"
approach: "short label of method used"
final_answer: "candidate answer or decision"
confidence: 0.0-1.0
key_steps:
  - "decisive reasoning step"
  - "important intermediate result"
evidence:
  - type: "tool | derivation | source | test | assumption"
    content: "what supports the answer"
failure_modes:
  - "possible weakness or unverified assumption"
format_status: "matches required output? yes/no"

For code tasks, add tests run, failing cases, diffs touched, and reproducibility notes. For research tasks, add source anchors. For mathematical tasks, preserve exact equations that determine the answer. For agent operations, preserve permissions, irreversible actions, and external side effects.

The cache should not be a beautiful essay. It should be the minimum sufficient state for deliberation.
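The same schema can be held as a typed record in the harness, which makes pruning and auditing mechanical rather than prompt-dependent. Field names mirror the schema above; the concrete types are assumptions:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Trajectory:
    """One entry in the serialized memory cache, mirroring the schema above."""
    trajectory_id: str
    agent_role: str
    approach: str
    final_answer: str
    confidence: float                                  # 0.0 - 1.0
    key_steps: list = field(default_factory=list)
    evidence: list = field(default_factory=list)       # {"type": ..., "content": ...}
    failure_modes: list = field(default_factory=list)
    format_status: bool = False                        # matches required output?

t = Trajectory(
    trajectory_id="T1",
    agent_role="independent solver",
    approach="direct derivation",
    final_answer="42",
    confidence=0.7,
    evidence=[{"type": "derivation", "content": "closed-form solution"}],
)
record = asdict(t)  # plain dict, ready to serialize into the cache
```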

4. Deliberation prompt

The parent-agent deliberation prompt should force critical synthesis, not polite summarization. A practical version:

You are the parent deliberator. You have K independent trajectories for the same task.

Task:
{task}

Serialized trajectory cache:
{cache}

Deliberate as follows:
1. Classify the task type and required correctness standard.
2. Identify candidate answers and their distribution.
3. Evaluate each trajectory for validity, evidence quality, and hidden assumptions.
4. Downweight correlated reasoning: if multiple traces share the same unsupported assumption, do not count them as independent proof.
5. Prefer trajectories with verifiable evidence, tool feedback, or clean derivations.
6. If all trajectories are flawed, re-reason from the strongest evidence and state why.
7. Produce the final answer in the requested format.
8. Do not expose unnecessary meta-analysis unless the user asked for it.

This maps closely to the HEAVYSKILL prompt shown in the paper’s appendix: compare thinker processes, avoid superficial majority-following, re-think if needed, and match the final output format.
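Wiring that prompt into a harness is mostly string assembly. A sketch, with `{task}` and `{cache}` filled from harness state (the template text is abbreviated from the version above):

```python
DELIBERATION_TEMPLATE = """You are the parent deliberator. You have {k} independent trajectories for the same task.

Task:
{task}

Serialized trajectory cache:
{cache}

Deliberate as follows:
1. Classify the task type and required correctness standard.
2. Identify candidate answers and their distribution.
3. Evaluate each trajectory for validity, evidence quality, and hidden assumptions.
4. Downweight correlated reasoning.
5. Prefer trajectories with verifiable evidence, tool feedback, or clean derivations.
6. If all trajectories are flawed, re-reason from the strongest evidence and state why.
7. Produce the final answer in the requested format.
"""

def build_deliberation_prompt(task, cache, k):
    """Fill the template from harness state; no model call happens here."""
    return DELIBERATION_TEMPLATE.format(task=task, cache=cache, k=k)
```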

5. Stopping rule

Stop when one of these conditions holds:

  • The best answer is supported by independent evidence and passes verification.
  • Additional trajectories are repeating the same assumptions.
  • The deliberator’s uncertainty is below the task’s required threshold.
  • The remaining disagreement is about style or preference, not correctness.
  • The context cache is near the point where compression would drop decisive evidence.
  • The marginal value of another iteration is lower than the cost or latency.

If the task remains uncertain after one HEAVYSKILL pass, do not automatically iterate. First ask: what new evidence would change the belief state? If the answer is “none,” another loop will probably amplify noise.
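Those conditions compose into a single guard the parent checks after each deliberation pass. The signal names and thresholds below are illustrative operator choices, not values from the paper:

```python
def should_stop(state):
    """Return (stop, reason) after a deliberation pass.

    `state` is a dict of operator-chosen signals; thresholds are
    illustrative defaults.
    """
    if state.get("verified", False):
        return True, "answer passed independent verification"
    if state.get("uncertainty", 1.0) <= state.get("required_uncertainty", 0.1):
        return True, "uncertainty below task threshold"
    if state.get("new_independent_evidence", 0) == 0:
        return True, "no new evidence would change the belief state"
    if state.get("cache_fill_ratio", 0.0) > 0.9:
        return True, "further compression would drop decisive evidence"
    return False, "continue"
```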

6. Iteration guard

If iteration is justified, limit it to 2–3 rounds and change something meaningful each round:

  • add a verifier subagent,
  • run a test,
  • retrieve a missing source,
  • ask for adversarial critique,
  • compress the cache around unresolved disputes.

Never append summaries indefinitely. The paper’s iterative-deliberation result is a warning: previous summaries can interfere with later reasoning. Summaries are lossy. Once a parent summary becomes part of the cache, it may overweight its own earlier interpretation.

A safe iteration record looks like this:

iteration: 2
new_information_added:
  - "unit tests from verifier"
  - "counterexample found by critic"
removed_from_cache:
  - "duplicated trajectories with same unsupported assumption"
open_disputes:
  - "whether edge case X invalidates candidate B"
stop_after_this_if:
  - "tests confirm candidate A"
  - "no new independent evidence appears"
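Put together, the iteration guard is a bounded loop that refuses to run a round unless it adds new information. A sketch with hypothetical harness hooks `run_round` and `has_new_information`:

```python
MAX_ROUNDS = 3  # illustrative cap, matching the 2-3 round guideline

def guarded_iteration(cache, run_round, has_new_information):
    """Bounded iterative deliberation: at most MAX_ROUNDS rounds, and each
    round after the first must add something meaningful (a test result,
    a source, a critique). Both hooks are hypothetical harness callbacks:
    `run_round(cache, n)` returns (result, updated_cache).
    """
    result = None
    for round_no in range(1, MAX_ROUNDS + 1):
        if round_no > 1 and not has_new_information(cache):
            break  # another loop would only amplify noise
        result, cache = run_round(cache, round_no)
    return result
```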

Where HEAVYSKILL works best

Based on the paper’s experiments, HEAVYSKILL is strongest when correctness is objective. The reported gains are clearest on STEM tasks, coding tasks such as LiveCodeBench, instruction-following evaluations such as IFEval, and tool-interleaved reasoning with a Python interpreter. In tool-use settings, the authors report that heavy thinking outperformed majority voting across tested models and benchmarks, suggesting that tool feedback is valuable evidence for the deliberator.

This matches agent practice. HEAVYSKILL is good for:

  • debugging a tricky production issue,
  • checking a migration plan,
  • solving a math or algorithm problem,
  • evaluating competing implementation approaches,
  • auditing an answer with verifiable sources,
  • making a parent-agent decision from multiple child-agent reports.

It is less naturally suited for tasks where the target is subjective taste. If the user wants a sharp brand voice, a majority average may make the output blander. In those cases, use parallel agents for exploration, but let the parent preserve a clear creative direction rather than “averaging” preferences.

Costs and limitations

HEAVYSKILL is not free.

First, it increases inference cost and latency roughly with K, plus deliberation overhead. Tool-using subagents can multiply external API calls, test runs, or browser sessions.

Second, independence is hard. Subagents using the same base model may share priors, blind spots, and prompt-induced biases. Five agreeing traces can still be one correlated mistake.

Third, cache compression can erase decisive details. If the parent prunes too aggressively, the deliberator may reason over a distorted memory. If it stores too much, it may exceed context limits or bury the important signal.

Fourth, stronger deliberation depends on the deliberator’s synthesis ability. The paper reports that the second-stage model does not always need peak standalone problem-solving power; analysis and summarization ability matter. But a weak deliberator can still choose the wrong path, especially when traces are noisy.

Fifth, iteration can accumulate bias. The paper observes a trade-off: iterative deliberation improved Heavy-Mean@K but degraded Heavy-Pass@K, suggesting that repeated summaries may narrow the solution space.

So the operator stance should be: use HEAVYSKILL deliberately, measure it on your own task distribution, and do not assume more agents always means more truth.

The agent-operator takeaway

HEAVYSKILL gives us a clean language for something many good operators already do intuitively: parallelize uncertainty, preserve trajectories, deliberate over evidence, and only then answer.

The important upgrade is to treat the middle artifact — the serialized memory cache — as a first-class object. If the cache is messy, biased, overlong, or missing failure modes, deliberation becomes theater. If the cache is structured, diverse, and evidence-weighted, the parent agent has a real belief state to update.

A good harness should therefore expose HEAVYSKILL as an activation policy, not a default reflex:

  • activate only when complexity justifies cost,
  • spawn independent subagents,
  • serialize trajectories with evidence and weaknesses,
  • deliberate with reliability weighting,
  • downweight correlated agreement,
  • verify when possible,
  • stop before memory turns into noise.

That is the practical heart of heavy thinking. Not “think longer” in the abstract. Think in parallel, remember carefully, compress honestly, and let the parent agent reason over the traces with skepticism.

For agentic systems, that is a very useful inner skill.