Harness-1: Why Search Agents Need a Real Working Memory

Search agents are becoming one of the most important building blocks for practical AI systems. They search the web, inspect documents, compare evidence, and return the information that a downstream model needs to answer a question.

But most search agents still work in a surprisingly fragile way: they keep almost everything inside a growing conversation transcript.

The paper “Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses” argues that this is the wrong place to put all of the state. A model should not have to remember every document it has seen, every partial lead it has followed, every claim it has verified, and every candidate it may want to return later.

Those are forms of bookkeeping. And bookkeeping is something an environment can often maintain more reliably than a language model.

Harness-1 introduces a cleaner division of labor:

The model makes the semantic decisions. The harness manages the recoverable state around those decisions.

That sounds small. It is not. It changes what the agent can learn.

The problem with transcript-only search

A search agent has two very different jobs.

First, it must make intelligent search decisions:

what to query;
which documents are relevant;
what evidence is still missing;
what claim should be checked;
when the search is good enough to stop.

Second, it must manage state:

which documents have already appeared;
which candidates look promising;
which evidence supports which claim;
which documents were rejected;
which paths have already failed;
how much context budget remains.

Traditional agent designs often force the language model to do both jobs through one long transcript. As the transcript grows, the model must reconstruct the useful state of the search from raw observations again and again.

That is inefficient at inference time, and it is especially awkward for reinforcement learning. If the final set of returned documents is wrong, the reward often does not reveal why. Was the query bad? Did the agent forget a document? Did it fail to verify a claim? Did a useful candidate get buried in context noise?

Harness-1 treats this as an interface problem, not only a model problem.

The core idea: stateful cognitive offloading

Harness-1 moves routine state management out of the model and into a stateful search harness.

The agent no longer acts only by adding text to a conversation. Instead, it edits a persistent working memory maintained by the environment.

That working memory includes:

a candidate pool of retrieved documents;
a curated set of important evidence;
importance tags for selected documents;
search history and compact result summaries;
full-text memory for documents that can be revisited;
verification records for claims checked against source text;
evidence links between entities, dates, relations, and documents;
budget-aware rendering so the model sees useful state without overflowing context.

This is the key design move. The harness carries the notebook. The policy decides what to do with it.
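To make the idea concrete, here is a minimal sketch of what such a working memory could look like as a data structure. It is not the paper's implementation: the class names, field names, and the integer importance tag are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    title: str
    full_text: str = ""  # kept so the document can be revisited later


@dataclass
class Verification:
    claim: str
    doc_id: str
    supported: bool


@dataclass
class WorkingMemory:
    # All field names are hypothetical; they mirror the list above.
    candidates: dict[str, Document] = field(default_factory=dict)  # candidate pool
    curated: dict[str, int] = field(default_factory=dict)  # doc_id -> importance tag
    search_history: list[tuple[str, str]] = field(default_factory=list)  # (query, summary)
    verifications: list[Verification] = field(default_factory=list)
    evidence_links: list[tuple[str, str, str]] = field(default_factory=list)  # (entity, relation, doc_id)

    def add_candidate(self, doc: Document) -> bool:
        """Store a retrieved document; return False if it was already known."""
        if doc.doc_id in self.candidates:
            return False
        self.candidates[doc.doc_id] = doc
        return True

    def curate(self, doc_id: str, importance: int) -> None:
        """Promote a known candidate into the curated set with an importance tag."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown document: {doc_id}")
        self.curated[doc_id] = importance


memory = WorkingMemory()
first = memory.add_candidate(Document("d1", "Annual report 2023"))
second = memory.add_candidate(Document("d1", "Annual report 2023"))  # duplicate, ignored
memory.curate("d1", importance=2)
```

The point of the sketch is that deduplication and membership checks live in the environment, so the policy never has to remember whether it has already seen a document.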

A transcript-only agent is like a researcher who keeps every note in one endless chat log and must mentally reread the whole thing before every next move.

Harness-1 is more like a researcher with a structured workspace: a pile of candidate papers, a shortlist of important sources, a table of verified claims, links between evidence, and a clear view of what remains unresolved.

The researcher is still doing the thinking. But the workspace makes better thinking possible.

What the model still controls

The harness does not replace the agent’s judgment. It does not decide the answer. It does not magically know which evidence matters.

The policy still chooses actions such as:

searching with new queries;
reading specific documents;
reviewing documents it has already seen;
adding or removing documents from the curated set;
tagging evidence by importance;
writing claims to verify;
checking claims against remembered documents;
stopping when the evidence set is sufficient.

The important difference is that these actions are structured edits over working memory. Tool outputs are compressed, deduplicated, stored, and rendered back to the model as usable state instead of being dumped into an append-only transcript.

This lets the model focus on semantic search strategy rather than reconstructing its own notebook from raw text at every step.
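A rough sketch of what "structured edits over working memory" might look like in code follows. The action schema (`type`, `query`, `doc_id`), the dictionary-based state, and the 200-character snippet limit are assumptions made for this example, not details from the paper.

```python
def new_state() -> dict:
    """Create an empty, hypothetical harness state."""
    return {"candidates": {}, "curated": set(), "history": []}


def apply_action(state: dict, action: dict, search_fn=None) -> str:
    """Apply one structured edit and return a short observation for the model.

    search_fn is only needed for "search" actions and must return a list of
    (doc_id, snippet) pairs.
    """
    kind = action["type"]
    if kind == "search":
        results = search_fn(action["query"])
        new_ids = []
        for doc_id, snippet in results:
            if doc_id not in state["candidates"]:  # deduplicate across searches
                state["candidates"][doc_id] = snippet[:200]  # store a truncated snippet
                new_ids.append(doc_id)
        state["history"].append((action["query"], len(new_ids)))
        return f"{len(new_ids)} new, {len(results) - len(new_ids)} already seen"
    if kind == "curate":
        if action["doc_id"] not in state["candidates"]:
            return "error: unknown document"
        state["curated"].add(action["doc_id"])
        return "curated"
    if kind == "remove":
        state["curated"].discard(action["doc_id"])
        return "removed"
    return f"error: unsupported action {kind}"


def fake_search(query: str) -> list[tuple[str, str]]:
    # Stand-in for a real retriever; always returns the same two documents.
    return [("d1", "Acme revenue grew 12% in 2023."), ("d2", "Acme was founded in 1990.")]


state = new_state()
obs1 = apply_action(state, {"type": "search", "query": "acme revenue"}, fake_search)
obs2 = apply_action(state, {"type": "search", "query": "acme history"}, fake_search)
obs3 = apply_action(state, {"type": "curate", "doc_id": "d1"})
```

In this sketch the second search returns nothing new, and the model is told so in one short line instead of receiving the same two documents again.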

Why this matters for reinforcement learning

Reinforcement learning works best when the policy receives a stable and meaningful interface to the task.

If the model must learn search behavior and messy bookkeeping at the same time, training becomes poorly conditioned. Many failures collapse into the same bad final reward. The model may repeat search calls, ignore verification, forget useful evidence, or overfit to transcript patterns that do not transfer.

Harness-1 makes the learning problem cleaner. The policy sees a rendered working memory and learns how to operate over it.

The paper highlights three design requirements that make this kind of stateful harness trainable:

Warm-started curation: early searches can seed a tentative evidence set, so the agent does not start from an empty set that yields no useful reward signal.
Compact derived-state rendering: working memory helps the model without competing with evidence for context space.
Diversity-preserving incentives: the agent learns a healthy rhythm of searching, curating, reviewing, and verifying instead of repeating the same action.

This is a practical lesson for agent builders: a richer interface is not automatically better. The state must be useful, compact, and connected to the reward.
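As an illustration of the "compact" requirement, the following sketch renders state under a fixed budget. It assumes a character count as a crude stand-in for a token budget and a simple priority order (counts, then curated items by importance, then recent queries). The function name and the ordering are hypothetical, not taken from the paper.

```python
def render_state(
    curated: dict[str, tuple[int, str]],  # doc_id -> (importance, short summary)
    n_candidates: int,
    history: list[str],  # past queries, oldest first
    budget_chars: int,
) -> str:
    """Render a compact view of the state that never exceeds budget_chars."""
    lines = [f"candidates: {n_candidates} | curated: {len(curated)}"]
    # Most important curated documents come first.
    ranked = sorted(curated.items(), key=lambda kv: -kv[1][0])
    for doc_id, (importance, summary) in ranked:
        lines.append(f"[{importance}] {doc_id}: {summary}")
    # Most recent queries come first.
    for query in reversed(history):
        lines.append(f"searched: {query}")

    out, used = [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > budget_chars:
            break  # drop lower-priority lines once the budget is spent
        out.append(line)
        used += cost
    return "\n".join(out)


curated = {"d1": (1, "founding date"), "d2": (3, "revenue figure")}
full_view = render_state(curated, 10, ["acme revenue", "acme history"], budget_chars=200)
tight_view = render_state(curated, 10, ["acme revenue", "acme history"], budget_chars=40)
```

With a generous budget the model sees everything; with a tight one it still sees the summary counts, and the lowest-priority lines are the first to go.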

Results that matter

Harness-1 is a 20B-parameter open search agent trained with reinforcement learning inside this stateful harness.

Across eight retrieval benchmarks covering web search, finance, patents, and multi-hop question answering, it reaches an average curated recall of 0.730.

That is notable: Harness-1 outperforms the next-strongest open search subagent by 11.4 points while remaining competitive with much larger frontier-model searchers. The paper also reports strong transfer performance on held-out benchmarks, suggesting that the agent is not merely memorizing a training domain.

It is learning to use explicit search state in a way that generalizes.
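For readers unfamiliar with the metric, here is one plausible reading of curated recall, sketched in code. It assumes the score is the fraction of known-relevant documents that end up in the final curated set, averaged with equal weight across benchmarks. The paper's exact definition may differ, and the document IDs below are made up.

```python
def curated_recall(curated: set[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that appear in the curated set (assumed definition)."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(curated & relevant) / len(relevant)


def mean_recall(per_benchmark: dict[str, float]) -> float:
    """Unweighted average of per-benchmark recall scores."""
    return sum(per_benchmark.values()) / len(per_benchmark)


# Toy example: the agent curated three documents, two of which are relevant,
# out of four relevant documents in total.
score = curated_recall({"d1", "d2", "d9"}, {"d1", "d2", "d3", "d4"})
average = mean_recall({"toy_a": score, "toy_b": 1.0})
```

Under this reading, the toy score is 0.5, and adding irrelevant documents to the curated set does not raise it.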

The most interesting lesson goes beyond how well this particular model performs: the harness itself becomes part of the intelligence of the system.

A better interface changes what the model can learn.

The builder takeaway

For anyone building AI agents, Harness-1 points to an important principle:

Do not ask the model to carry every operational detail in its own context window. Externalize the state that can be represented, updated, checked, and rendered by the environment.

This applies beyond search.

Coding agents need to remember files touched, tests run, bugs reproduced, constraints satisfied, and hypotheses rejected.

Research agents need to remember sources checked, claims verified, dead ends, open questions, and evidence quality.

Customer-support agents need to remember user intent, prior attempts, account state, policy constraints, and unresolved commitments.

Data-analysis agents need to remember transformations, assumptions, metrics, chart decisions, and validation steps.

In all of these cases, a stronger agent is not only a stronger model. It is a strong model inside a well-designed harness.

Harness-1 shows that when we give the model a better working environment, reinforcement learning can train more reliable and more transferable agent behavior.

The future of agents may depend as much on the design of their memory, tools, and state transitions as on the size of the model itself.

Reference

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han, “Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses,” arXiv:2606.02373, 2026. https://arxiv.org/abs/2606.02373