Faulty Memory Consolidation in LLM Agents

Long-term memory is one of the most seductive promises in agent design.

Give an agent a persistent memory bank. Let it learn from previous runs. Ask it to distill trajectories into reusable lessons. Over time, the agent should become more capable without any parameter updates.

That is the usual story.

The paper “Useful Memories Become Faulty When Continuously Updated by LLMs” makes that story much less comfortable.

Its central finding is direct:

Contemporary LLMs are not reliable memory consolidators. When they repeatedly rewrite useful experiences into textual memory, the resulting memories can drift, overgeneralize, and eventually hurt performance.

This is not merely a warning that bad data creates bad memory. The paper constructs settings where the input experiences are useful by design: solved trajectories, expert trajectories, or ground-truth solutions. Even then, the consolidation step can corrupt the memory.

For agent builders, that distinction matters.

The dangerous operation is not experience collection. It is uncontrolled abstraction.

Episodic evidence vs consolidated abstractions

The paper separates two forms of memory:

Episodic traces: raw trajectories of what happened.
Consolidated abstractions: distilled lessons extracted from many episodes.
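
One way to keep the two forms apart in code is to give them separate types. This is a minimal sketch, not the paper's implementation; the names `Episode`, `Lesson`, and `source_episode_ids` are my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Episode:
    """Episodic trace: a raw record of what happened in one run."""
    episode_id: str
    task: str
    steps: tuple[str, ...]  # observations, actions, and feedback, in order
    succeeded: bool


@dataclass
class Lesson:
    """Consolidated abstraction: a distilled claim drawn from several episodes."""
    text: str
    # Hypothetical field: ids of the episodes this lesson was distilled from.
    source_episode_ids: list[str] = field(default_factory=list)


ep = Episode("ep-1", "find the mug", ("look", "open cabinet", "take mug"), True)
lesson = Lesson("Check cabinets first when looking for kitchenware.", [ep.episode_id])
```

The episode is frozen because it is a record; the lesson is mutable because it is an interpretation that may be revised.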

Most agent memory systems want the second form because it is compact. A raw trajectory contains observations, actions, errors, environment feedback, intermediate attempts, and the final result. It is useful, but expensive to keep in context.

A consolidated lesson is cheap:

“When facing task family X, use strategy Y.”
“Avoid action Z because it caused failure.”
“Prefer this workflow for similar problems.”

But this compression is lossy.

The paper’s key claim is that current LLMs often lose the wrong information during this lossy rewrite. They drop applicability conditions, introduce spurious rules, group unrelated experiences, and turn context-specific behavior into overly broad advice.

That is how useful memories become faulty.

The non-monotonic memory curve

Across ALFWorld, ScienceWorld, WebShop, AppWorld, and ARC-AGI Stream, the paper reports a recurring pattern: memory utility rises early, then degrades as consolidation continues.

On ScienceWorld, abstracted memory improves at first and then declines through later update steps. On WebShop, AWM memory falls from 0.64 at 8 examples to 0.20 at 128 examples, which is no better than the no-memory baseline of 0.20.

This is the important shape:

no memory → early memory helps → more consolidation → erosion/regression

A memory bank that initially helps is not necessarily stable. Continued updating can erase its own utility.

That is a serious deployment risk because many agent systems treat memory updates as harmless bookkeeping. The paper argues that this assumption is wrong.
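
A simple way to watch for this shape is to compare every checkpoint against both the no-memory baseline and the best earlier checkpoint. The sketch below assumes scores are keyed by the number of consolidated examples; `erosion_report` is a hypothetical helper, and the only numbers used are the two WebShop points quoted above.

```python
def erosion_report(scores: dict[int, float], baseline: float) -> dict[str, float]:
    """Summarize a memory-utility curve keyed by number of consolidated examples."""
    if not scores:
        raise ValueError("scores must not be empty")
    sizes = sorted(scores)
    peak_size = max(sizes, key=lambda n: scores[n])
    last_size = sizes[-1]
    return {
        "peak_size": peak_size,
        "peak_gain": scores[peak_size] - baseline,    # best lift over no memory
        "final_gain": scores[last_size] - baseline,   # lift left at the end
        "drop_from_peak": scores[peak_size] - scores[last_size],
    }


# The two WebShop AWM points and the no-memory baseline quoted in the text.
webshop_awm = {8: 0.64, 128: 0.20}
report = erosion_report(webshop_awm, baseline=0.20)
```

A positive `peak_gain` with a `final_gain` near zero is the erosion pattern: the memory helped once and then gave the gain back.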

The ARC-AGI regression result

The cleanest result is also the most alarming.

The authors select a 19-problem ARC-AGI slice that GPT-5.4 solves at 100% accuracy without memory. Then they stream those same problems through a consolidation loop using ground-truth solutions.

If consolidation were a reliable operation, it should not make the model worse on problems it already solved, especially when the input trajectories are correct.

But stream consolidation drops performance sharply. Figure 2 reports that GPT-5.4 falls to 52.6% accuracy by Round 10 on the same previously solved problems. The abstract summarizes the effect as GPT-5.4 failing on 54% of ARC-AGI problems it had previously solved without memory after consolidating from ground-truth solutions.

This isolates the failure.

The memory did not fail because the experience was useless. It failed because the abstraction process rewrote useful evidence into faulty memory.
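
The same check is easy to run on any agent: take the set of problems solved without memory, re-run them with the consolidated memory, and count what was lost. This is a sketch with hypothetical problem ids. The split of 10 out of 19 is my own arithmetic, assuming the 52.6% Round 10 accuracy corresponds to 10 of the 19 problems.

```python
def regression_rate(solved_before: set[str], solved_after: set[str]) -> float:
    """Fraction of previously solved problems that are no longer solved."""
    if not solved_before:
        raise ValueError("need at least one previously solved problem")
    lost = solved_before - solved_after
    return len(lost) / len(solved_before)


# Hypothetical ids: 19 problems solved without memory, 10 still solved
# after consolidation (10 / 19 rounds to the 52.6% accuracy quoted above).
before = {f"p{i}" for i in range(19)}
after = {f"p{i}" for i in range(10)}
rate = regression_rate(before, after)
```

Any nonzero rate here is a regression caused by the memory, since the no-memory run already solved every problem in `before`.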

Schedule sensitivity: same evidence, different memory

The paper also shows that consolidation is fragile with respect to update schedule.

The same trajectory pool can produce different memory states depending on whether it is consolidated in one static pass, grouped by task family, or streamed batch-by-batch. Updates for one task can overwrite memory for another. Repeated near-duplicates can overfit memory to seen instances and reduce generalization.

That means the memory bank is not just a function of experience. It is a function of the consolidation procedure.

This is a problem for production agents because update schedules are often incidental:

when the user happens to interact,
when a cron job runs,
how many episodes are batched,
whether unrelated tasks share the same memory file,
whether old entries are rewritten or appended.

If schedule changes can qualitatively alter memory, then memory updates need versioning, tests, and rollback. They should not be treated as invisible background cleanup.
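
To make the schedule effect concrete, here is a toy consolidator that is deliberately not an LLM. It keeps only the two most frequent strategy labels and forgets the counts after each rewrite. That forgetting is the assumption doing the work, and it is a stand-in for lossy rewriting, not a model of the paper's setup. Everything in this sketch is hypothetical.

```python
from collections import Counter


def consolidate(memory: list[str], batch: list[str], capacity: int = 2) -> list[str]:
    """Toy lossy rewrite: keep the `capacity` most frequent strategies.

    Counts are discarded after each call, so old entries re-enter with
    weight 1. Ties keep the label that was encountered first.
    """
    counts = Counter(memory + batch)
    return [strategy for strategy, _ in counts.most_common(capacity)]


# Same evidence pool, two schedules.
pool = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]

static = consolidate([], pool)  # one pass over everything

streamed: list[str] = []
for start in range(0, len(pool), 3):  # three batches of three
    streamed = consolidate(streamed, pool[start:start + 3])

print(static, streamed)
```

The static pass keeps `A`, which has three supporting episodes. The streamed pass ends up keeping `B` instead, because `A`'s support was flattened to a single remembered entry before `B` arrived.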

Three mechanisms behind faulty abstraction

The paper identifies three mechanisms.

1. Misgrouping experiences

The consolidator may pool episodes that do not share an underlying structure. It then abstracts a rule that does not actually describe the group.

For agents, this can happen when multiple tasks share surface words but require different strategies.
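
A cheap guard is to check that a candidate group is coherent before abstracting from it. The sketch below assumes each episode already carries a hand-assigned `strategy` label, which real logs rarely do; inferring that label is the hard part and is not solved here. All field names and episodes are hypothetical.

```python
from collections import defaultdict

# Hypothetical, hand-labelled episodes.
episodes = [
    {"task": "sort the list of numbers", "keyword": "sort", "strategy": "call a sort routine"},
    {"task": "sort the incoming mail", "keyword": "sort", "strategy": "apply inbox filters"},
    {"task": "order the list of names", "keyword": "order", "strategy": "call a sort routine"},
]


def group_by(items, key):
    """Bucket episodes by the value stored under `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)


def is_coherent(group):
    """True when every episode in the group used the same strategy."""
    return len({item["strategy"] for item in group}) == 1


by_keyword = group_by(episodes, "keyword")    # surface-word grouping
by_strategy = group_by(episodes, "strategy")  # structure-based grouping
```

Grouping by the shared word "sort" pools two episodes that need different strategies, so any rule abstracted from that group describes neither.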

2. Dropping applicability conditions

Even when grouping is correct, abstraction can remove the boundary conditions that make a lesson valid.

A memory entry like “use breadth-first search” may be harmful if the original lesson was “use breadth-first search only when the state space is small and transitions are uniform.”

The compressed version is shorter, but less true.
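
The difference shows up at retrieval time. In this sketch a lesson carries a predicate that decides whether it applies. The `ScopedLesson` type, the task fields, and the 10,000-state cutoff for "small" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScopedLesson:
    advice: str
    applies: Callable[[dict], bool]  # applicability condition


# Original lesson: advice plus its boundary conditions.
full = ScopedLesson(
    "use breadth-first search",
    applies=lambda task: task["state_space_size"] <= 10_000 and task["uniform_transitions"],
)

# Compressed lesson: same advice, conditions dropped, so it always fires.
compressed = ScopedLesson("use breadth-first search", applies=lambda task: True)


def retrieve(lesson: ScopedLesson, task: dict) -> Optional[str]:
    """Return the advice only when the lesson applies to this task."""
    return lesson.advice if lesson.applies(task) else None


huge_task = {"state_space_size": 10**9, "uniform_transitions": True}
small_task = {"state_space_size": 100, "uniform_transitions": True}
```

On `huge_task` the full lesson stays silent while the compressed one still recommends breadth-first search, which is exactly the harmful case.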

3. Overfitting to narrow streams

When the input stream contains repeated near-duplicates, memory can overfit to those seen cases and generalize poorly to nearby unseen tasks.

This is especially relevant for agents that repeatedly operate in a small local context. They may develop extremely confident local folklore.
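
One partial defense is to thin out near-duplicates before they reach the consolidator. The sketch below uses character-level similarity from the standard library's `difflib`. The 0.9 threshold and the example strings are assumptions, and string similarity is only a rough proxy for task similarity.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: list[str] = []
    for text in texts:
        if all(SequenceMatcher(None, text, prior).ratio() < threshold for prior in kept):
            kept.append(text)
    return kept


stream = [
    "buy red shoes size 9",
    "buy red shoes size 8",   # near-duplicate of the first entry
    "book a flight to Oslo",
]
diverse = drop_near_duplicates(stream)
```

This does not fix overfitting, but it stops one repeated task from dominating the evidence a consolidation pass sees.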

Raw trajectories remain competitive

One of the most practical parts of the paper is the comparison against episodic controls.

The authors compare consolidated memory methods against the raw trajectory logs those methods are asked to compress. In many cases, abstracted memory does not significantly outperform direct in-context learning from preserved trajectories.

Table 2 reports ALFWorld and AppWorld comparisons where trajectory-log baselines often beat or match distilled memory approaches. The paper’s interpretation is important: raw trajectories retain observations, actions, intermediate failures, and environment feedback tied to their original situations. The solver can exploit that concrete evidence directly.

In other words:

Abstraction is not automatically better than evidence.

A distilled lesson can be useful, but it should be evaluated against the raw episodes it replaces.

Mitigation: make consolidation explicit and optional

The paper introduces an ARC-AGI Stream environment where agents can choose among three actions:

Retain a raw episode.
Delete an entry.
Consolidate buffered episodes into an abstract store.

This creates a two-store design: an episodic buffer and an abstract memory store.

The result supports a conservative default. Given the choice, agents preserve raw episodes by default and outperform their forced-consolidation counterparts. Disabling consolidation entirely while still allowing episodic management (retain and delete) can match the auto regime, in which the agent decides for itself when to consolidate.
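
The two-store layout with its three actions can be sketched in a few lines. This is my own minimal reading of the design, not the paper's environment code: the class name, method names, and the choice to keep raw episodes after consolidation are all assumptions, and the consolidator is passed in as a plain function standing in for an LLM call.

```python
from typing import Callable


class TwoStoreMemory:
    """Episodic buffer plus abstract store, with explicit actions."""

    def __init__(self, consolidator: Callable[[list[str]], str]):
        self.episodic: dict[str, str] = {}  # episode id -> raw trajectory text
        self.abstract: list[str] = []       # distilled lessons
        self._consolidator = consolidator

    def retain(self, episode_id: str, trajectory: str) -> None:
        """Keep a raw episode as-is."""
        self.episodic[episode_id] = trajectory

    def delete(self, episode_id: str) -> None:
        """Drop a raw episode; a missing id is ignored."""
        self.episodic.pop(episode_id, None)

    def consolidate(self, episode_ids: list[str]) -> str:
        """Distill chosen episodes into one lesson; the raw episodes stay."""
        lesson = self._consolidator([self.episodic[i] for i in episode_ids])
        self.abstract.append(lesson)
        return lesson


# Placeholder consolidator so the sketch runs without a model.
memory = TwoStoreMemory(lambda eps: f"lesson from {len(eps)} episodes")
memory.retain("e1", "trajectory one")
memory.retain("e2", "trajectory two")
made = memory.consolidate(["e1", "e2"])
```

Because `consolidate` is a separate call rather than a side effect of `retain`, nothing is abstracted unless something explicitly decides to do it.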

The lesson is not “never consolidate.” Long-running agents cannot keep raw history indefinitely, and abstraction is still necessary for transfer.

The lesson is:

Consolidation should be gated, tested, and reversible, not fired after every interaction.

Design implications for agent memory systems

Here is how I would translate the paper into engineering rules.

1. Treat raw episodes as first-class evidence

Do not let a summary fully replace the underlying trajectory until the summary has been validated.

A memory entry should ideally point back to source episodes.

2. Store applicability conditions

Every distilled lesson should include boundaries:

when it applies,
when it does not apply,
examples that support it,
counterexamples or failure modes.

A rule without scope is a future bug.
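
The four boundaries above map directly onto a record type, and a small check can refuse entries that arrive without them. This is a sketch: `ScopedEntry`, its field names, and the decision to treat counterexamples as optional are my assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ScopedEntry:
    lesson: str
    applies_when: str = ""
    does_not_apply_when: str = ""
    supporting_episode_ids: list[str] = field(default_factory=list)
    # Optional in this sketch: a new lesson may not have known failures yet.
    counterexample_episode_ids: list[str] = field(default_factory=list)

    def missing_scope(self) -> list[str]:
        """Names of the required scope fields that are still empty."""
        required = ("applies_when", "does_not_apply_when", "supporting_episode_ids")
        return [name for name in required if not getattr(self, name)]


bare = ScopedEntry("use breadth-first search")
scoped = ScopedEntry(
    "use breadth-first search",
    applies_when="state space is small and transitions are uniform",
    does_not_apply_when="state space is large or edge costs differ",
    supporting_episode_ids=["ep-3", "ep-7"],
)
```

A write path could reject any entry whose `missing_scope()` is non-empty, which turns "a rule without scope" into a failed write instead of a latent bug.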

3. Gate consolidation

Do not consolidate after every interaction by default.

Trigger consolidation when there is enough diverse evidence, a clear task family, and a reason to believe compression will help.
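
A gate like this can start as a plain predicate over the buffered episodes. The sketch covers only the first two conditions, enough diverse evidence and a single task family. Whether compression will actually help still needs an evaluation step. The thresholds and the `family` and `task` fields are hypothetical.

```python
def should_consolidate(
    episodes: list[dict],
    min_episodes: int = 5,
    min_distinct_tasks: int = 3,
) -> bool:
    """Allow consolidation only for a large, diverse, single-family buffer."""
    if len(episodes) < min_episodes:
        return False  # not enough evidence yet
    if len({ep["family"] for ep in episodes}) != 1:
        return False  # mixed task families: risk of misgrouping
    distinct_tasks = {ep["task"] for ep in episodes}
    return len(distinct_tasks) >= min_distinct_tasks  # guard against near-duplicates


diverse_buffer = [{"family": "shopping", "task": f"buy item {i}"} for i in range(5)]
repetitive_buffer = [{"family": "shopping", "task": "buy item 0"} for _ in range(5)]
mixed_buffer = diverse_buffer[:4] + [{"family": "email", "task": "archive thread"}]
```

Only `diverse_buffer` passes. The repetitive buffer fails on diversity, and the mixed buffer fails on task family.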

4. Version memory updates

A memory update is a behavioral change. Treat it like code:

diff it,
test it,
keep rollback,
evaluate against previous tasks.
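
Those four steps fit in a small wrapper around the memory text. This is a sketch under the assumption that memory is a single text document and that "tests" are a caller-supplied function, for example a replay of previously solved tasks. `VersionedMemory` and its method names are hypothetical.

```python
import difflib
from typing import Callable


class VersionedMemory:
    """Memory text with diff, test-gated commit, and rollback."""

    def __init__(self, initial: str = ""):
        self._versions = [initial]

    @property
    def current(self) -> str:
        return self._versions[-1]

    def diff(self, proposed: str) -> list[str]:
        """Unified diff between the current memory and a proposed rewrite."""
        return list(difflib.unified_diff(
            self.current.splitlines(), proposed.splitlines(),
            fromfile="current", tofile="proposed", lineterm="",
        ))

    def commit(self, proposed: str, passes_tests: Callable[[str], bool]) -> bool:
        """Accept the rewrite only if the caller's tests pass on it."""
        if not passes_tests(proposed):
            return False
        self._versions.append(proposed)
        return True

    def rollback(self) -> str:
        """Return to the previous version; the initial version is never dropped."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current


mem = VersionedMemory("Prefer search by title.")
update = "Prefer search by title.\nAvoid checkout before login."
changes = mem.diff(update)
rejected = mem.commit(update, passes_tests=lambda text: False)
accepted = mem.commit(update, passes_tests=lambda text: "title" in text)
```

The `passes_tests` hook is where a replay of previously solved tasks would go, so that an update which breaks them never becomes the current version.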

5. Test memory against no-memory and episodic baselines

If a memory system claims improvement, compare it against:

no memory,
raw episodic demonstrations,
sampled trajectory subsets,
static vs streaming update schedules.

Otherwise the system may be measuring the value of experience while blaming or crediting abstraction.
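
A comparison harness for this can be very small: run the same solver on the same tasks under each memory condition and report success rates side by side. The sketch assumes a solver that takes a task and an optional context string. The stub solver exists only so the code runs, and its numbers carry no empirical meaning.

```python
from typing import Callable, Optional

Solver = Callable[[str, Optional[str]], bool]


def compare_conditions(
    tasks: list[str], solve: Solver, contexts: dict[str, Optional[str]]
) -> dict[str, float]:
    """Success rate of `solve` on the same tasks under each memory condition."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    return {
        name: sum(solve(task, context) for task in tasks) / len(tasks)
        for name, context in contexts.items()
    }


def abstraction_beats_controls(rates: dict[str, float]) -> bool:
    """True only if consolidated memory beats both control conditions."""
    return rates["consolidated"] > max(rates["no_memory"], rates["episodic"])


# Placeholder solver: "succeeds" when the task name appears in the context.
def stub_solve(task: str, context: Optional[str]) -> bool:
    return context is not None and task in context


rates = compare_conditions(
    ["t1", "t2"],
    stub_solve,
    {"no_memory": None, "episodic": "logs for t1 and t2", "consolidated": "lesson about t1"},
)
```

Further conditions, such as sampled trajectory subsets or memories built under static and streaming schedules, are just more entries in the `contexts` dictionary.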

6. Separate evidence from interpretation

Raw episodes are evidence. Consolidated lessons are interpretations.

Do not collapse both into one continuously rewritten text blob.

Caveats

The paper has important scope limits.

It evaluates text-based agentic benchmarks and a controlled ARC-AGI Stream environment. Whether the same erosion dynamics appear in multimodal, embodied, or tool-rich production systems remains open.

It studies natural-language abstraction by contemporary LLMs such as GPT-5.4-family and Qwen3.5-family models. Structured non-textual memory, parametric memory updates, and consolidator-specific fine-tuning are outside the core scope.

The authors also note API-cost constraints: many reported results are point estimates from a small number of repeats rather than full error-bar studies.

So the right conclusion is not that all agent memory is doomed.

The right conclusion is that continuous LLM-written memory consolidation is a risky operation that needs stronger controls than many current systems give it.

The builder takeaway

This paper changes how I would review agent memory architectures.

I would no longer ask only:

Does the agent have memory?

I would ask:

What is the memory update operator, and how do we know it does not corrupt useful evidence over time?

A good long-term agent should maintain both:

an episodic layer that preserves source evidence;
an abstraction layer that is gated, scoped, versioned, and tested.

The mistake is to ask one LLM rewrite loop to be archivist, analyst, editor, and judge at the same time.

That is too much authority for a lossy summarizer.

Useful memory needs humility: keep the receipts, abstract slowly, and never assume that a prettier lesson is truer than the trajectory it came from.

Source

Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, Hao Peng. “Useful Memories Become Faulty When Continuously Updated by LLMs.” arXiv:2605.12978v1, 2026. https://arxiv.org/abs/2605.12978