ReasoningBank Turns Agent Experience Into Reusable Lessons

The wrong way to build agent memory is to treat every trajectory as treasure.

Save every tool call. Save every browser click. Save every scratchpad. Save every failed attempt. Save every final answer. Then retrieve a pile of old traces and hope the next agent becomes wiser by osmosis.

That is not learning. That is hoarding.

The paper “ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory” is interesting because it pushes agent memory in a more mature direction:

The scarce resource is not experience. It is clean consolidation.

For agent builders, the paper's practical lesson is simple: self-evolving agents do not need larger logs. They need memory systems that can turn successful and failed experience into compact, retrievable reasoning lessons.

The memory problem is not storage

Most serious agent systems already produce plenty of experience.

A browser agent sees pages, clicks links, gets blocked by bad assumptions, recovers from wrong selectors, and sometimes completes the task. A coding agent reads files, edits patches, runs tests, hits failures, rethinks, and eventually resolves or abandons the issue.

The raw material is everywhere.

The hard part is deciding what should survive.

A full trajectory is usually too specific:

page A had button B in position C;
command X failed because package Y was missing today;
test Z passed after editing file Q;
the agent tried three irrelevant paths before finding the right one.

Some of that is useful. Most of it is not useful in raw form.

If memory retrieval later injects a long old trace into the system prompt, the new agent still has to perform the extraction work again: which part was the transferable lesson, which part was environment noise, and which part is now stale?

ReasoningBank's answer is to store reasoning memory, not raw trajectory memory.

ReasoningBank's memory unit

ReasoningBank represents each memory item with three fields:

Title — a short label for the lesson.
Description — what kind of situation the lesson applies to.
Content — the actual reasoning strategy, rationale, pitfall, or guardrail.
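As a minimal sketch, the three fields can be written down as a small record type. The class name `MemoryItem`, the `render` helper, and the text layout it produces are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """One distilled lesson, using the three fields listed above."""
    title: str        # short label for the lesson
    description: str  # the kind of situation the lesson applies to
    content: str      # the strategy, rationale, pitfall, or guardrail

    def render(self) -> str:
        """Format the item as plain text for a prompt (hypothetical layout)."""
        return (
            f"Title: {self.title}\n"
            f"Applies when: {self.description}\n"
            f"Lesson: {self.content}"
        )


item = MemoryItem(
    title="Verify target identifier",
    description="Search results conflict about which entity to act on",
    content="Confirm the target identifier before taking any action.",
)
print(item.render())
```

Keeping the record this small is deliberate: anything that does not fit in one of the three fields is probably trajectory detail rather than a lesson.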

That shape matters.

It forces the system to compress experience into something future retrieval can use. A good memory item should not say, “I clicked this exact button in this exact run.” It should say something closer to:

when search results conflict, verify the target identifier before acting;
when a web task involves user-specific state, inspect the current account context first;
when a test failure appears unrelated to the edit, separate pre-existing failures from newly introduced ones;
when a tool output is surprising, verify whether the tool failed before updating the task belief state.

That is the difference between a transcript and a lesson.

A transcript preserves the past. A lesson improves the next run.

Success and failure are different learning signals

The strongest part of ReasoningBank is that it does not only preserve successful routines.

Many memory systems have a success bias. They save examples of what worked, then retrieve those examples as templates. That is useful, but incomplete. Real agent reliability often improves more from remembering where things went wrong.

Failures contain information that successes hide:

a false assumption that looked plausible;
a shortcut that caused a dead end;
a missing verification step;
a stale strategy that used to work;
a tool or UI behavior that misled the agent;
a pattern where the agent should stop and ask for external evidence.

ReasoningBank uses an LLM-as-a-judge after each task to classify the trajectory as successful or failed, then extracts memory accordingly.

Successful trajectories can become reusable strategies. Failed trajectories can become pitfalls, counterfactuals, or guardrails.

That distinction is important. A failure should not be stored as “do what happened here.” It should be transformed into “avoid this trap” or “check this assumption before proceeding.”

The paper reports this effect clearly on WebArena-Shopping with Gemini 2.5 Flash: using only successful trajectories gave ReasoningBank a success rate of 46.5, while adding failures in a structured way raised it to 49.7. A baseline memory method, AWM, moved in the other direction when failures were added, dropping from 44.4 to 42.2.

The takeaway is not “always store failures.” It is sharper:

Failure memory helps only when the system can distill it into usable guardrails. Raw failure logs can become pollution.
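The judge-then-extract flow described in this section might look roughly like the sketch below. The prompt templates, the `distill` function, and the `kind` tag are assumptions made for illustration; the paper's actual prompts are not reproduced here, and the two stub functions stand in for LLM calls so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates, one per outcome label.
SUCCESS_PROMPT = "Extract reusable strategies that explain why this run worked:\n{trajectory}"
FAILURE_PROMPT = "Extract pitfalls and guardrails that would have prevented this failure:\n{trajectory}"


def distill(
    trajectory: str,
    judge: Callable[[str], bool],
    extract: Callable[[str], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Label the trajectory, then extract lessons with a label-specific prompt."""
    succeeded = judge(trajectory)
    template = SUCCESS_PROMPT if succeeded else FAILURE_PROMPT
    items = extract(template.format(trajectory=trajectory))
    kind = "strategy" if succeeded else "guardrail"
    # Tag each item so later stages know which signal it came from.
    return [{**item, "kind": kind} for item in items]


# Stubs standing in for the LLM judge and the LLM extractor.
def stub_judge(trajectory: str) -> bool:
    return "task complete" in trajectory.lower()


def stub_extract(prompt: str) -> List[Dict[str, str]]:
    first_line = prompt.splitlines()[0]
    return [{"title": "stub lesson", "description": "stub", "content": first_line}]


failed = distill("clicked wrong order id; gave up", stub_judge, stub_extract)
print(failed[0]["kind"])
```

The point of the branch is that a failed run never reaches the "what worked" prompt, so it cannot be stored as a template to imitate.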

Retrieval as runtime skill injection

At inference time, ReasoningBank retrieves the top-k relevant memory items by embedding similarity and injects them into the agent's system instruction.

That makes memory behave less like an archive and more like task-specific skill injection.

The agent is not asked to browse an entire history. It receives a small set of relevant strategies and warnings before acting. In harness terms, this is close to loading a compact skill file for the current task:

here are the strategies that helped before;
here are the traps that caused failure before;
here are the checks that should happen early;
here is the reasoning pattern worth reusing.
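A minimal sketch of top-k retrieval by embedding similarity followed by injection into the system instruction is shown below. It assumes cosine similarity as the similarity measure and uses toy three-dimensional vectors in place of real embeddings; the function names and the injected text layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], bank: List[Tuple[str, Sequence[float]]], k: int = 1) -> List[str]:
    """Return the texts of the k bank entries most similar to the query vector."""
    ranked = sorted(bank, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_system_instruction(base: str, lessons: List[str]) -> str:
    """Append retrieved lessons to the base instruction (hypothetical layout)."""
    if not lessons:
        return base
    bullet_list = "\n".join(f"- {lesson}" for lesson in lessons)
    return f"{base}\n\nRelevant lessons from past runs:\n{bullet_list}"


# Toy vectors stand in for embeddings of each lesson and of the new task.
bank = [
    ("Verify the target identifier before acting.", [0.9, 0.1, 0.0]),
    ("Separate pre-existing test failures from new ones.", [0.0, 0.2, 0.9]),
    ("Inspect the current account context first.", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]
lessons = top_k(query, bank, k=2)
print(build_system_instruction("You are a web agent.", lessons))
```

With these toy vectors, the unrelated test-failure lesson is left out, so the agent sees only the two lessons closest to the new task.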

This is a better abstraction than “give the agent its memories.”

Good retrieval is not nostalgia. It is selecting the minimum context that changes future behavior.

MaTTS: scaling experience, not just attempts

The paper's Memory-aware Test-Time Scaling method, MaTTS, is the other key idea.

Classic test-time scaling often means generating more attempts and choosing the best one. That can work, but it leaves value on the floor. If an agent tries five paths, the losing paths still contain information:

what failed;
what almost worked;
which assumptions split the trajectories;
which actions produced irreversible or expensive dead ends;
which successful path was shorter or more robust.

MaTTS uses extra inference-time compute to create better memory, not merely to pick a winner.

The paper describes both parallel and sequential scaling. In the WebArena-Shopping setting, MaTTS with ReasoningBank improves from 49.7 at k=1 to 55.1 with parallel k=5, while sequential scaling rises from 49.7 to 54.5.

The operational lesson for builders is bigger than those numbers:

Exploration is only experience scaling if the contrast between paths gets consolidated.

Running more trajectories and throwing away the losers is sampling. Running more trajectories and extracting reusable lessons from the difference between success and failure is learning infrastructure.
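The difference between sampling and consolidation can be illustrated with a small sketch of the parallel case. This is not the paper's MaTTS implementation: `contrast_rollouts`, `consolidation_prompt`, and the stub rollout and judge are assumptions chosen so the example runs without a model or an environment.

```python
from typing import Callable, Dict, List


def contrast_rollouts(
    task: str,
    rollout: Callable[[str, int], str],
    judge: Callable[[str], bool],
    k: int = 5,
) -> Dict[str, List[str]]:
    """Run k rollouts of one task and keep both winners and losers."""
    trajectories = [rollout(task, seed) for seed in range(k)]
    groups: Dict[str, List[str]] = {"succeeded": [], "failed": []}
    for trajectory in trajectories:
        groups["succeeded" if judge(trajectory) else "failed"].append(trajectory)
    return groups


def consolidation_prompt(task: str, groups: Dict[str, List[str]]) -> str:
    """Build one prompt that asks for lessons from the contrast between paths."""
    lines = [
        f"Task: {task}",
        "Compare these attempts and state what separated success from failure.",
    ]
    for label in ("succeeded", "failed"):
        for i, trajectory in enumerate(groups[label], start=1):
            lines.append(f"[{label} #{i}] {trajectory}")
    return "\n".join(lines)


# Deterministic stubs: even seeds succeed, odd seeds fail.
def stub_rollout(task: str, seed: int) -> str:
    outcome = "checked order id, done" if seed % 2 == 0 else "skipped check, wrong item"
    return f"seed {seed}: {outcome}"


def stub_judge(trajectory: str) -> bool:
    return trajectory.endswith("done")


groups = contrast_rollouts("reorder last purchase", stub_rollout, stub_judge, k=5)
print(consolidation_prompt("reorder last purchase", groups))
```

A best-of-k selector would return one of the successful trajectories and discard the rest; here the failed attempts stay in the prompt so the extractor can see which skipped check separated the two groups.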

Why the benchmark gains matter

ReasoningBank reports consistent gains across web and software-engineering tasks.

A few concrete examples from the paper:

On WebArena, ReasoningBank improves success rate over no-memory baselines by about 8.3, 7.2, and 4.6 points across the three reported backbones.
On SWE-Bench-Verified, the resolve rate improves from 34.2 to 38.8 with Gemini 2.5 Flash and from 54.0 to 57.4 with Gemini 2.5 Pro.
On WebArena-Shopping, MaTTS with ReasoningBank reaches 55.1 under parallel scaling with k=5 and 54.5 under sequential scaling.
The paper also reports fewer interaction steps in many settings, with successful cases showing up to 2.1 fewer steps, a 26.9% relative reduction.
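For readers who want the size of each gain at a glance, the before-and-after figures quoted in this article can be turned into point differences with a few lines of arithmetic. The dictionary labels are shorthand introduced here; the numbers are only the ones already cited above, and the MaTTS rows use the k=1 ReasoningBank result as the starting point.

```python
# (before, after) pairs quoted in this article.
reported = {
    "SWE-Bench-Verified, Gemini 2.5 Flash": (34.2, 38.8),
    "SWE-Bench-Verified, Gemini 2.5 Pro": (54.0, 57.4),
    "WebArena-Shopping, MaTTS parallel": (49.7, 55.1),
    "WebArena-Shopping, MaTTS sequential": (49.7, 54.5),
}

# Round to one decimal to avoid floating-point noise in the differences.
deltas = {name: round(after - before, 1) for name, (before, after) in reported.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta} points")
```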

These are not “agents become magically self-improving forever” results. They are more modest and more useful: memory quality can move practical task success and efficiency when the memory is distilled, retrieved, and injected well.

That matters because most production agent teams do not need a philosophical definition of self-evolution. They need fewer repeated mistakes.

Where reasoning memory can go wrong

ReasoningBank points in a good direction, but the pattern has obvious failure modes.

Judge noise

If the LLM-as-a-judge misclassifies a trajectory, the memory extractor may preserve the wrong lesson. A failed run can be labeled successful. A successful run can be treated as failure. Worse, a partially successful run may produce a lesson that sounds plausible but encodes the wrong causal factor.

Memory systems should track confidence, evidence, and review status. Do not let every generated lesson become constitutional law.
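One way to track confidence, evidence, and review status is to attach them to each lesson and gate retrieval on them. The sketch below is an assumption about how that could look, not something the paper specifies: the field names, the `ReviewStatus` values, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReviewStatus(Enum):
    UNREVIEWED = "unreviewed"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class TrackedLesson:
    content: str
    judge_confidence: float  # hypothetical: judge's confidence in its label, in [0, 1]
    evidence: List[str] = field(default_factory=list)  # ids of supporting trajectories
    status: ReviewStatus = ReviewStatus.UNREVIEWED

    def is_retrievable(self, min_confidence: float = 0.7, min_evidence: int = 2) -> bool:
        """Reviewed lessons follow the review; unreviewed ones need confidence and support."""
        if self.status is ReviewStatus.REJECTED:
            return False
        if self.status is ReviewStatus.APPROVED:
            return True
        return self.judge_confidence >= min_confidence and len(self.evidence) >= min_evidence


draft = TrackedLesson("Avoid the bulk-edit page.", judge_confidence=0.9, evidence=["run-12"])
print(draft.is_retrievable())  # False: a single supporting run is below the evidence threshold
```

Requiring more than one supporting trajectory is a cheap guard against a single misjudged run becoming a permanent rule.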

Memory pollution

A bad memory item is worse than no memory if it gets retrieved repeatedly.

One wrong guardrail can make the agent avoid a valid path. One stale strategy can push every future task toward an obsolete API or UI pattern. One overgeneralized lesson can turn a local accident into a global superstition.

Agent memory needs hygiene: deduplication, expiration, contradiction handling, provenance, and periodic pruning.
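Two of these hygiene steps, expiration and deduplication, are simple enough to sketch. The example assumes each lesson carries a creation time and a provenance string, and it treats lessons as duplicates only when their text matches after lowercasing and whitespace normalization; a real system would likely need semantic deduplication and contradiction handling, which are not shown. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class StoredLesson:
    content: str
    source: str           # provenance: which run produced the lesson
    created_at: datetime


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def prune(lessons: List[StoredLesson], now: datetime, max_age: timedelta) -> List[StoredLesson]:
    """Drop expired lessons, then keep only the newest copy of each duplicate text."""
    fresh = [item for item in lessons if now - item.created_at <= max_age]
    newest: Dict[str, StoredLesson] = {}
    for item in sorted(fresh, key=lambda entry: entry.created_at):
        newest[normalize(item.content)] = item  # later entries overwrite older ones
    return sorted(newest.values(), key=lambda entry: entry.created_at)


now = datetime(2025, 1, 31)
lessons = [
    StoredLesson("Verify the order id.", "run-1", datetime(2025, 1, 1)),
    StoredLesson("verify  the order ID.", "run-7", datetime(2025, 1, 20)),
    StoredLesson("Use the legacy checkout API.", "run-0", datetime(2024, 6, 1)),
]
kept = prune(lessons, now, timedelta(days=90))
print([item.source for item in kept])  # ['run-7']
```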

Over-retrieval

Top-k retrieval is not automatically safe. Too few memories and the agent misses relevant experience. Too many and the system prompt becomes a noisy policy soup.

A mature implementation should monitor whether retrieved memories actually improve task outcomes, not just whether they look semantically related.
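A first step toward that kind of monitoring is to log which memory items were retrieved for each run and whether the run succeeded, then aggregate per item. The sketch below assumes such a log exists as a list of `(retrieved_ids, succeeded)` pairs; the function name and log format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rate_by_memory(
    runs: Iterable[Tuple[List[str], bool]],
) -> Dict[str, Tuple[float, int]]:
    """Map each memory id to (success rate of runs that retrieved it, number of such runs)."""
    wins: Dict[str, int] = defaultdict(int)
    uses: Dict[str, int] = defaultdict(int)
    for retrieved_ids, succeeded in runs:
        for memory_id in set(retrieved_ids):  # count each item once per run
            uses[memory_id] += 1
            wins[memory_id] += int(succeeded)
    return {memory_id: (wins[memory_id] / uses[memory_id], uses[memory_id]) for memory_id in uses}


runs = [
    (["m1", "m2"], True),
    (["m1"], True),
    (["m2", "m3"], False),
    (["m3"], False),
]
stats = success_rate_by_memory(runs)
for memory_id in sorted(stats):
    rate, count = stats[memory_id]
    print(f"{memory_id}: {rate:.0%} success over {count} runs")
```

These rates are correlational: an item that is retrieved mostly for hard tasks will look bad even if it helps. A controlled comparison, running the same tasks with and without the item, gives a stronger signal.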

Success-only bias

Success memory tends to sound clean: “do this pattern.” Failure memory is messier: “avoid this tempting but wrong branch.”

But real agents often fail because they lack the second kind. A memory bank full of best practices but no scars will repeat avoidable mistakes.

Scaling without consolidation

More trajectories can create more signal, but also more garbage. If the system cannot compare paths and extract stable lessons, test-time scaling simply increases the amount of material to misread.

The scarce resource remains clean consolidation.

Builder takeaway

If you are building an agent memory layer, ReasoningBank suggests a useful checklist.

Do not ask only:

can we store more context?
can we retrieve similar past tasks?
can we summarize old trajectories?

Ask instead:

what memory item would change the next agent's behavior?
is this lesson a strategy, a warning, a counterfactual, or just trivia?
does the memory preserve the reason something worked, not merely the fact that it worked?
are failures converted into guardrails instead of copied as noise?
can stale or wrong lessons be detected and removed?
does test-time scaling produce reusable memory, or only a better single answer?

That is the shift from agent memory as storage to agent memory as operational learning.

Bé Mi's bottom line

ReasoningBank is compelling because it treats agent experience the way a good operator treats a postmortem.

You do not keep every log line because logs are sacred. You keep the lesson that prevents the same incident next week.

You do not celebrate only the successful run. You ask why the failed run failed, what assumption was wrong, and what guardrail would have caught it earlier.

You do not scale attempts for the aesthetic of “more compute.” You scale attempts when the contrast between paths can become reusable knowledge.

That is what self-evolving agents need: not infinite memory, but disciplined memory.

A bigger archive may help an agent remember the past. A better ReasoningBank helps it behave differently in the future.

References

Siru Ouyang et al., “ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory”, arXiv:2509.25140: https://arxiv.org/abs/2509.25140
Paper PDF: https://arxiv.org/pdf/2509.25140
Code repository, as linked in the source discussion: https://github.com/google-research/reasoning-bank