Agentic Search Is a Harness Problem

The surprising headline from “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search” is easy to misread.

In the paper’s LongMemEval experiments, grep often beats vector retrieval. In the inline setting, lexical search outperforms dense retrieval for every harness-model pair the authors evaluate.

That sounds like a clean “grep wins” story.

It is not.

The more useful builder lesson is sharper and less meme-friendly:

Retrieval quality in agent systems is an emergent property of the whole harness: query formulation, result presentation, context budgeting, follow-up tools, and the agent’s ability to inspect evidence.

Grep is the character that gets everyone’s attention. The real protagonist is the harness.

What the paper actually tests

The paper evaluates agentic search on a 116-question subset of LongMemEval, a benchmark for answering questions over long multi-session conversations. These tasks often require recovering user facts, preferences, time expressions, counts, state changes, or details from earlier dialogue.

The authors compare two retrieval modes:

Grep / lexical search: exact or pattern-based matching over raw text.
Vector retrieval / semantic search: embedding-based retrieval over the same underlying corpus.

They test these retrieval modes across several agent harnesses:

Chronos, a custom harness.
Provider-native CLI harnesses: Claude Code, Codex CLI, and Gemini CLI.

They also vary how search results are delivered to the model:

Inline delivery: results are appended directly into the model context.
Programmatic / file-based delivery: results are written to files, and the agent must explicitly open, grep, or read those artifacts.

That last dimension matters more than many RAG discussions admit. A retriever does not simply “return knowledge” to an agent. The harness decides what becomes immediate context, what stays outside context, and what extra actions the agent must perform before it can use the evidence.
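To make the two delivery modes concrete, here is a minimal sketch. The function names, the character budget, and the file name are all assumptions for illustration; none of them come from the paper or from any real harness.

```python
import tempfile
from pathlib import Path


def deliver_inline(results: list[str], max_chars: int = 2000) -> str:
    """Inline delivery: results are pasted straight into the model context."""
    blob = "\n---\n".join(results)
    # Anything past the budget never reaches the model.
    return blob[:max_chars]


def deliver_as_file(results: list[str], out_dir: Path) -> str:
    """File-based delivery: write results to disk and return only a pointer."""
    path = out_dir / "search_results.txt"
    path.write_text("\n---\n".join(results), encoding="utf-8")
    return f"{len(results)} results written to {path}. Open or grep the file to read them."


results = [f"session {i}: user mentioned item {i}" for i in range(200)]
inline_context = deliver_inline(results)
with tempfile.TemporaryDirectory() as tmp:
    pointer_context = deliver_as_file(results, Path(tmp))
    stored = (Path(tmp) / "search_results.txt").read_text(encoding="utf-8")
```

In this toy run the inline version truncates long before the last result, while the file version keeps everything but hands the model only a short pointer. Whether the agent ever reads the file is a separate question, and that is the point.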

The headline result: grep is a very strong baseline

In Experiment 1, inline grep beats inline vector retrieval for every harness-model pair in the paper’s table.

A few reported examples:

Chronos + Claude Opus 4.6: inline grep reaches 93.1%, while inline vector reaches 83.6%.
Chronos + Gemini 3.1 Flash-Lite: inline grep reaches 86.2%, while inline vector reaches 62.9%.
Codex CLI + GPT-5.4: inline grep reaches 93.1%, while inline vector reaches 75.9%.

That is a strong result, especially because vector retrieval is often treated as the default answer to “we need memory search.”

But the correct conclusion is not “vector databases are obsolete.”

Long-memory conversational QA often contains literal anchors: names, dates, quoted phrases, counts, user preferences, and events that appear close to verbatim in the source. Grep is excellent at recovering those witnesses. If the question is “what did the user say about X?” and X appears as a stable phrase, exact search has very little semantic fog to cut through.

Dense retrieval can help when the evidence is paraphrased, concept-level, multilingual, or far from the user’s wording. But on this workload, a strong lexical baseline is not a toy. It is a serious contender.

The first practical takeaway is blunt:

Do not add a vector stack before you know whether grep, BM25, regex, and metadata filters already solve the dominant search cases.

This is not anti-RAG. It is anti-ritual.

The deeper result: the harness can move the ceiling

The paper’s more important result is that retrieval mode is not measured in isolation.

The same underlying model can behave very differently depending on the harness. The paper reports that Claude Opus 4.6 with inline grep reaches 93.1% under Chronos but 76.7% under Claude Code. That is not a small interface tax. It is a 16.4-point swing in end-to-end accuracy with the same backbone model and the same retrieval mode.

Why can this happen?

Because the harness shapes the entire search loop:

system prompt and tool descriptions;
how queries are encouraged or constrained;
how results are formatted;
whether snippets include enough surrounding context;
whether the model can issue follow-up searches;
whether evidence arrives as a readable answer, a noisy blob, or an artifact pointer;
when the agent decides it has enough information to stop.

For standalone retrieval, you can often talk about recall@k or ranking quality. For agentic retrieval, the final answer depends on whether the model can turn retrieved material into verified reasoning.

A top-k result that the agent ignores, misreads, truncates, or fails to open is not operational knowledge. It is just a file-shaped opportunity.

Inline vs file-based delivery is context engineering

One of the paper’s best design points is that tool delivery mode is really a context-engineering decision.

With inline delivery, search results flow directly into the model’s context. The model can use them immediately. This is simple and often effective. The cost is context pressure: long result sets compete with instructions, conversation history, prior tool outputs, and the model’s own scratch space. As results accumulate, the agent can suffer from context rot: relevant details are technically present but practically lost.

With file-based delivery, search results are written outside the context window. The model receives a pointer and must explicitly inspect the artifact. This enables progressive disclosure: the agent can read only the relevant spans, grep within result files, or open surrounding context on demand.

That sounds strictly better until you remember the extra requirement:

The agent must understand the file workflow.

If the model does not reliably open, inspect, filter, and integrate the file contents, file-based delivery can collapse even when retrieval itself is fine.

The paper shows this clearly. Programmatic delivery reshuffles the grep-vector comparison: vector beats grep in five of ten programmatic harness-model pairs. The sharpest reported regression is Codex CLI + GPT-5.4, where accuracy falls from 93.1% under inline grep to 55.2% under programmatic grep, while programmatic vector reaches 67.2%.

That is not a statement about grep alone. It is a statement about the interaction between retrieval mode and the interface through which the agent consumes evidence.

A file-based retrieval system is not automatically better because it saves context. It is better only if the harness teaches the agent to use the file as an inspectable workspace and rewards it for doing so.

Noise does not produce a universal winner

Experiment 2 adds increasing amounts of unrelated conversation history around the relevant evidence. The configurations range from smaller session limits such as s5 and s10 up to the full haystack, where each item contains roughly 39–66 sessions.

This is closer to production memory than clean demo search. Real agent memory is full of stale preferences, near-duplicates, old plans, abandoned tasks, and facts that were true last month but false today. If a search system only works on tidy examples, it is not a memory system yet.

The paper’s scaling results are not a simple monotonic story. Grep and vector both show relatively modest degradation on average, but the ordering depends on the stack. Claude Code tends to favor grep in the reported configurations. Gemini CLI Pro favors vector throughout. Chronos has crossings where one retriever leads at some noise levels and loses at others.

That matters because it weakens any one-line rule such as “grep is more robust to noise” or “vector scales better with semantic clutter.” The right conclusion is more operational:

Benchmark the whole search loop under realistic distractors, not only the retriever on a clean corpus.

Add stale sessions. Add near-matches. Add contradictory facts. Add old user preferences. Add temporal updates. Then inspect not only whether the right span was retrieved, but whether the agent used it correctly.

Design checklist for agentic search

If I were designing or reviewing an agent search harness after this paper, I would use this checklist.

1. Start with a strong lexical baseline

Before building an embedding pipeline, measure grep, BM25, regex, and metadata filters.

Lexical search is especially strong when queries contain literal anchors:

names;
dates;
IDs;
quoted text;
filenames;
user-stated preferences;
exact task labels;
structured event fields.

If those cases dominate, vector retrieval may be an expensive way to blur evidence that was already crisp.
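A lexical baseline can be very small. The sketch below assumes memory is stored as a flat list of turns with a session ID, an ISO date, and a speaker; the `Turn` type and `lexical_search` function are hypothetical names, not anything from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    session_id: str
    date: str      # ISO date string, e.g. "2024-03-01"
    speaker: str
    text: str


def lexical_search(turns, pattern, speaker=None, since=None):
    """Case-insensitive regex match plus optional metadata filters."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for t in turns:
        if speaker is not None and t.speaker != speaker:
            continue
        if since is not None and t.date < since:  # ISO dates sort lexicographically
            continue
        if rx.search(t.text):
            hits.append(t)
    return hits


turns = [
    Turn("s1", "2024-01-10", "user", "I adopted a cat named Miso."),
    Turn("s2", "2024-02-02", "assistant", "How is Miso doing?"),
    Turn("s3", "2024-03-15", "user", "Miso turned two last week."),
]
hits = lexical_search(turns, r"\bmiso\b", speaker="user", since="2024-02-01")
```

If something this simple already answers most of your real queries, that is worth knowing before the embedding pipeline exists.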

2. Use hybrid routing as a policy, not a slogan

“Hybrid search” should not mean dumping lexical and vector results into the same prompt and hoping the model sorts it out.

A useful hybrid policy should route by query type:

exact/entity/time-bound questions → lexical first;
paraphrased or concept-level questions → vector first;
high-stakes answers → both, then rerank or cross-check;
temporal memory questions → lexical plus structured event filters;
ambiguous questions → query expansion before retrieval.

The arbitration layer matters. Without one, hybrid search can become twice the noise for twice the cost.
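One way to read the routing policy above is as a small, testable function. The patterns, the temporal word list, and the backend labels below are illustrative assumptions; a real router would be tuned on your own query logs.

```python
import re

# Hypothetical heuristics for "literal anchors"; tune these on your own data.
ANCHOR_PATTERNS = [
    r'"[^"]+"',                 # quoted phrase
    r"\b\d{4}-\d{2}-\d{2}\b",   # ISO date
    r"\b[A-Z]{2,}-\d+\b",       # ticket-style ID such as ABC-123
    r"\b\w+\.\w{2,4}\b",        # filename-like token
]
TEMPORAL_WORDS = {"when", "before", "after", "latest", "last", "first", "recently"}


def route(query: str, high_stakes: bool = False) -> list[str]:
    """Return an ordered list of retrieval steps to try for this query."""
    if high_stakes:
        return ["lexical", "vector", "cross_check"]
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & TEMPORAL_WORDS:
        return ["lexical", "event_filter"]
    if any(re.search(p, query) for p in ANCHOR_PATTERNS):
        return ["lexical"]
    if len(words) < 3:
        # Too short to route with confidence: expand the query first.
        return ["query_expansion", "vector"]
    return ["vector"]
```

The value is less in these particular rules than in having an explicit arbitration step you can log, test, and change.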

3. Benchmark harness variants, not just retrievers

Do not evaluate “vector vs grep” in a vacuum.

Benchmark at least these interface variants:

inline snippets;
file-based artifacts;
summarized result pointers;
paginated result browsing;
source-span opening;
surrounding-context expansion;
result files that can be searched again.

The metric should be final-answer accuracy, paired with a check that the agent actually used the supporting evidence, not retrieval recall@k alone. A retriever can find the right span while the agent still fails because the harness made the evidence hard to consume.
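A benchmark of this kind is a grid over retriever and delivery mode, scored on the final answer. The sketch below assumes you can supply a `build_agent` factory for your own stack; that factory, the fake agent, and the substring scoring are all stand-ins, and substring matching in particular is a crude proxy for a real grader.

```python
from itertools import product
from typing import Callable

# A harness variant here is just a function from question to answer string.
Agent = Callable[[str], str]


def evaluate_grid(
    build_agent: Callable[[str, str], Agent],
    retrievers: list[str],
    deliveries: list[str],
    dataset: list[tuple[str, str]],
) -> dict[tuple[str, str], float]:
    """Final-answer accuracy for every retriever x delivery combination."""
    scores = {}
    for retriever, delivery in product(retrievers, deliveries):
        agent = build_agent(retriever, delivery)
        correct = sum(
            gold.lower() in agent(question).lower() for question, gold in dataset
        )
        scores[(retriever, delivery)] = correct / len(dataset)
    return scores


# Toy stand-in so the sketch runs: this fake agent only succeeds with inline grep.
def fake_build_agent(retriever: str, delivery: str) -> Agent:
    def agent(question: str) -> str:
        if retriever == "grep" and delivery == "inline":
            return "The answer is Miso."
        return "I could not find that."
    return agent


dataset = [("What is my cat's name?", "Miso")]
scores = evaluate_grid(fake_build_agent, ["grep", "vector"], ["inline", "file"], dataset)
```

Reading the result as a table of cells, rather than one number per retriever, is what lets an interaction like the paper's inline-versus-programmatic flip show up at all.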

4. Treat result presentation as part of retrieval quality

A search result should answer a practical question:

What should the agent do next with this evidence?

Good result presentation includes enough context to be useful, enough metadata to filter, and enough structure to support follow-up actions.

For long-memory agents, consider returning:

session IDs;
timestamps;
speaker names;
surrounding turns;
source file paths;
confidence or match score;
exact matched terms;
links to open larger spans.

The model should not have to reverse-engineer the retrieval system every time it searches.
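As a sketch, the fields listed above fit in one small record with a renderer. The field names and the output layout are assumptions chosen for illustration, not a format taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    session_id: str
    timestamp: str
    speaker: str
    text: str
    source_path: str
    score: float
    matched_terms: list[str] = field(default_factory=list)
    context_before: list[str] = field(default_factory=list)
    context_after: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format one hit so the agent can see what matched and where to look next."""
        lines = [
            f"[{self.session_id} | {self.timestamp} | {self.speaker} | score={self.score:.2f}]",
            f"matched: {', '.join(self.matched_terms) or '(none)'}",
        ]
        lines += [f"  (before) {t}" for t in self.context_before]
        lines.append(f"> {self.text}")
        lines += [f"  (after) {t}" for t in self.context_after]
        lines.append(f"open more: {self.source_path}")
        return "\n".join(lines)


result = SearchResult(
    session_id="s12",
    timestamp="2024-03-15T10:00",
    speaker="user",
    text="Miso turned two last week.",
    source_path="memory/s12.txt",
    score=0.9,
    matched_terms=["miso"],
    context_before=["assistant: How is your cat?"],
)
rendered = result.render()
```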

5. Give the agent follow-up affordances

Search should be iterative.

If the first query finds a candidate span, the agent should be able to:

open the source around that span;
grep within the candidate file;
filter by date, speaker, or session;
ask for adjacent turns;
compare multiple retrieved witnesses;
mark uncertainty and search again.

A one-shot top-k blob is often too brittle for agent memory.
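Two of these affordances are easy to sketch: searching inside one candidate session and opening the turns around a hit. The helper names and the toy session are hypothetical.

```python
def open_around(turns: list[str], index: int, radius: int = 2) -> list[tuple[int, str]]:
    """Return the turn at `index` plus up to `radius` neighbours on each side."""
    start = max(0, index - radius)
    end = min(len(turns), index + radius + 1)
    return [(i, turns[i]) for i in range(start, end)]


def grep_within(turns: list[str], needle: str) -> list[int]:
    """Case-insensitive substring search restricted to one candidate session."""
    needle = needle.lower()
    return [i for i, t in enumerate(turns) if needle in t.lower()]


session = [
    "user: I'm planning a trip.",
    "assistant: Where to?",
    "user: Lisbon, in May.",
    "assistant: Nice choice.",
    "user: Actually, make that June.",
]
hits = grep_within(session, "lisbon")               # first pass finds a candidate
window = open_around(session, hits[0], radius=2)    # second pass reads around it
```

The matched turn alone says May. The correction to June only appears once the agent opens the surrounding turns, which is exactly the kind of detail a one-shot snippet can miss.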

6. Log the search trajectory

When the agent fails, you need to know where the failure happened.

Log:

the user question;
generated search queries;
retrieval backend used;
returned results;
which files or spans the agent opened;
which snippets entered context;
the final answer;
whether the evidence supports it.

Without trajectory logs, “RAG failed” is too vague to fix. Was it bad query generation? Bad embedding retrieval? Missing lexical anchor? Context overload? Failure to open the file? Premature stopping? These are different bugs.
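A trajectory record can also do a first pass at triage. The schema and the stage labels below are assumptions for illustration, and the classification is deliberately coarse: it needs a known gold span ID, which you only have on a labeled benchmark.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SearchTrajectory:
    question: str
    queries: list[str] = field(default_factory=list)
    backend: str = ""
    returned_ids: list[str] = field(default_factory=list)
    opened_ids: list[str] = field(default_factory=list)
    final_answer: str = ""
    gold_id: str = ""          # ID of the span that actually contains the answer
    answer_correct: bool = False

    def failure_stage(self) -> str:
        """Coarse triage: where in the loop did the evidence get lost?"""
        if self.answer_correct:
            return "ok"
        if not self.queries:
            return "no_search_issued"
        if self.gold_id not in self.returned_ids:
            return "retrieval_miss"
        if self.gold_id not in self.opened_ids:
            return "evidence_not_opened"
        return "evidence_misused"

    def to_json(self) -> str:
        return json.dumps({**asdict(self), "failure_stage": self.failure_stage()})


traj = SearchTrajectory(
    question="What is my cat's name?",
    queries=["cat name"],
    backend="grep",
    returned_ids=["s1", "s2"],
    opened_ids=["s2"],
    final_answer="I don't know.",
    gold_id="s1",
)
record = traj.to_json()
```

Here retrieval succeeded and the agent still failed, because it never opened the span that held the answer. That is a harness bug, not a retriever bug.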

7. Test with distractors and stale memory

Clean retrieval demos are not enough.

A production memory benchmark should include:

irrelevant sessions;
near-duplicate facts;
old preferences that were later updated;
multiple people with similar names;
temporal questions;
conflicting evidence;
paraphrased evidence;
long result sets that force selective reading.

If your agent only works when the relevant document is alone in a quiet room, it does not have robust memory. It has stage lighting.
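A distractor benchmark can start as a function that pads the evidence with sampled noise at several haystack sizes, loosely in the spirit of the paper's session-limit settings. The function name, the example sessions, and the sizes are assumptions for illustration.

```python
import random


def build_haystack(evidence: list[str], distractors: list[str], n_sessions: int, seed: int = 0) -> list[str]:
    """Mix the evidence sessions with sampled distractors, up to n_sessions total."""
    if n_sessions < len(evidence):
        raise ValueError("n_sessions must be at least the number of evidence sessions")
    rng = random.Random(seed)
    n_noise = min(n_sessions - len(evidence), len(distractors))
    haystack = evidence + rng.sample(distractors, n_noise)
    rng.shuffle(haystack)  # evidence should not sit at a predictable position
    return haystack


evidence = ["2024-05-01 user: I switched from tea to coffee."]
distractors = [
    "2023-11-12 user: I only drink tea.",            # stale preference
    "2024-01-03 user: My sister drinks coffee.",     # near match, wrong person
    "2024-02-20 user: The weather is nice today.",   # irrelevant
    "2024-03-08 user: I might try coffee someday.",  # abandoned plan
]
levels = {n: build_haystack(evidence, distractors, n) for n in (1, 3, 5)}
```

Run the same questions at each level and compare final-answer accuracy, not only whether the evidence session was retrieved.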

8. Keep context budget explicit

Every retrieval design spends context somewhere.

Inline systems spend it immediately. File-based systems defer the spend, but they add action complexity. Summaries save tokens but risk deleting the key detail. Chunking improves manageability but can split evidence across boundaries.

Make the tradeoff explicit:

What enters the model context now?
What remains inspectable outside context?
How does the agent request more detail?
What gets truncated?
How do we know truncation did not delete the answer?

Context is not just storage. It is the agent’s working surface.
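One small way to make truncation visible is to have the packing step report what it dropped. This sketch uses character counts as a stand-in for tokens, which is an assumption; a real harness would count with its model's tokenizer.

```python
def pack_snippets(snippets: list[str], budget_chars: int) -> tuple[list[str], list[int]]:
    """Walk snippets in rank order and keep each one that still fits the budget.

    Returns the kept snippets and the indices of the dropped ones,
    so truncation is reported instead of silent.
    """
    kept, dropped, used = [], [], 0
    for i, s in enumerate(snippets):
        if used + len(s) <= budget_chars:
            kept.append(s)
            used += len(s)
        else:
            dropped.append(i)
    return kept, dropped


snippets = ["a" * 400, "b" * 400, "the answer: 42", "c" * 400]
kept, dropped = pack_snippets(snippets, budget_chars=850)
```

The dropped indices can go into the log, or back to the agent as a note that more results exist outside context.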

What not to conclude

The paper has an important limitation: its conclusions are tied to long-memory conversational QA. The authors explicitly do not claim that grep beats vector retrieval in general.

That boundary matters.

Dense retrieval and hybrid search may look better in domains where evidence is rarely literal:

scientific synthesis over paraphrased abstracts;
conceptual product search;
multilingual corpora;
code semantics;
visual-heavy documents;
broad topic discovery.

Also, “vector retrieval” is not one thing. Embedding model, chunking strategy, metadata filters, rerankers, query rewriting, hybrid fusion, and result formatting can all change outcomes.

So the lesson is not “throw away vector databases.”

The lesson is:

Treat vector search as one component in a measured harness, not as a magic replacement for search design.

The practical mental model

An agentic search system has at least four layers:

Retriever: what candidate evidence is found?
Presenter: how is that evidence shaped for the model?
Inspector: what actions can the agent take to open, filter, and verify evidence?
Controller: when does the agent stop searching and answer?

Most RAG discussions obsess over layer one. This paper is useful because it shows that layers two through four can dominate the outcome.
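The four layers can be written down as four pluggable functions around one loop. This is a mental-model sketch, not the architecture of Chronos or any other harness; every name and the toy layer implementations are assumptions.

```python
from typing import Callable


def search_loop(
    question: str,
    retrieve: Callable[[str], list[str]],     # Retriever: find candidates
    present: Callable[[list[str]], str],      # Presenter: shape them for the model
    inspect: Callable[[str], str],            # Inspector: open / filter / verify
    should_stop: Callable[[str, int], bool],  # Controller: answer or keep searching
    max_steps: int = 3,
) -> tuple[str, int]:
    """Run the four layers in order until the controller says stop."""
    evidence, query = "", question
    for step in range(1, max_steps + 1):
        shown = present(retrieve(query))
        evidence = inspect(shown)
        if should_stop(evidence, step):
            return evidence, step
        query = f"{question} {evidence}".strip()  # naive query reformulation
    return evidence, max_steps


# Toy layer implementations so the sketch runs end to end.
corpus = ["user: my cat is called Miso", "user: I like hiking"]


def toy_retrieve(query: str) -> list[str]:
    words = query.lower().split()
    return [doc for doc in corpus if any(w in doc.lower() for w in words)]


def toy_present(docs: list[str]) -> str:
    return "\n".join(docs[:5])


def toy_inspect(shown: str) -> str:
    return shown.splitlines()[0] if shown else ""


def toy_should_stop(evidence: str, step: int) -> bool:
    return bool(evidence)


evidence, steps = search_loop("cat name", toy_retrieve, toy_present, toy_inspect, toy_should_stop)
```

Swapping grep for vectors changes only the first argument. The other three are where a harness can quietly win or lose the answer.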

That is why the “grep vs vector” framing is only the hook. The durable lesson is about interface design.

Search quality is not only about nearest neighbors. It is about whether the agent can turn retrieved traces into grounded decisions under context limits.

Grep did not kill vector search. It exposed a more uncomfortable truth: many agent systems are not retrieval-limited. They are harness-limited.