
Sleep as a Memory-Consolidation Primitive for Long-Horizon Agents

Long-horizon agents do not just need larger context windows or more retrieval. They need scheduled compute windows that turn observed context into usable state before that context disappears from the online path.

2026-05-28 · 8 min read
Tags: agent memory, long context, offline recurrence, fast weights, memory consolidation

Do Language Models Need Sleep? Offline Recurrence as a Memory Consolidation Primitive

Long-context agents often treat memory as a storage problem: keep more tokens, compress older tokens, retrieve relevant chunks, or attach a larger external memory.

The paper “Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference” makes a sharper argument:

Memory systems need computation, not only capacity.

The paper proposes a sleep-like consolidation phase for attention–SSM hybrid models. Before old context leaves the attention cache, the model spends extra recurrent compute organizing that context into fast weights. After this “sleep,” the KV cache can be cleared or slid forward, while wake-time prediction remains a normal single forward pass.

That is the useful systems idea for agent builders: sleep is not downtime. It is a memory-consolidation primitive.

Paper: arXiv:2605.26099v2, “Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference,” by Sangyun Lee, Sean McLeish, Tom Goldstein, and Giulia Fanti. Version dated 27 May 2026.


The mistake: treating long context as storage only

A standard transformer stores context in a KV cache. This gives high-fidelity access to recent tokens, but the cost grows with sequence length: cache memory grows linearly with the number of cached tokens, and every new token must attend over everything already cached.

Efficient sequence models and attention–SSM hybrids try to reduce that cost. Instead of keeping every past key and value, they maintain fixed-size fast-weight state inside linear recurrent or state-space layers. Recent frontier-style hybrid designs combine both forms of memory:

  • attention for precise access to recent context;
  • fast weights or SSM state for compressed memory beyond the active window.
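The storage contrast can be made concrete with a back-of-the-envelope calculation. The sketch below uses the usual accounting for a KV cache (keys plus values, per layer, per head, per token) and assumes one fixed-size state matrix per head for the fast-weight side. The layer counts and dimensions are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values for every cached token, in every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def fast_weight_bytes(layers, heads, key_dim, value_dim, bytes_per_elem=2):
    """One key_dim x value_dim state matrix per head per layer.

    Assumption: this is a simplified model of a linear-recurrent layer's state.
    Note that sequence length does not appear anywhere in the formula.
    """
    return layers * heads * key_dim * value_dim * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
fixed_state = fast_weight_bytes(layers=24, heads=8, key_dim=128, value_dim=128)
for seq_len in (2_000, 32_000, 512_000):
    kv = kv_cache_bytes(seq_len, layers=24, kv_heads=8, head_dim=128)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:9.1f} MiB, "
          f"fast weights {fixed_state / 2**20:5.1f} MiB")
```

The KV figure scales with the token count while the fast-weight figure stays flat, which is exactly why a fixed-size state looks attractive and why it is tempting to reason about it in terms of capacity alone.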

The tempting assumption is: if the fast-weight memory has enough capacity, it should be enough.

This paper challenges that assumption. The authors show settings where the amount of information to store is held roughly fixed, yet vanilla SSM/hybrid models degrade as the required reasoning depth increases.

That means the bottleneck is not just storage. It is the computation needed to transform observed context into a state that can later support reasoning.

A memory can contain traces of the past and still be poorly organized for future use.


The proposal: sleep before eviction

The method introduces a sleep phase at context-window boundaries.

When the context window becomes full, the model:

  1. stops receiving external tokens;
  2. performs N offline recurrent passes over the accumulated context;
  3. updates the fast weights in its SSM blocks through a learned local rule;
  4. clears or advances the attention cache;
  5. resumes online inference with the updated fast-weight state.

The important detail is where the extra compute happens.

The model does not loop at answer time. It loops during consolidation. Wake-time prediction remains a single standard forward pass, so the latency-sensitive path stays cheap.

In systems language:

Sleep shifts compute from online prediction to offline memory formation.

The knob N, or sleep duration, controls how much recurrent computation is spent before context is evicted.
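The control flow of those five steps can be sketched in a few lines. This is a minimal illustration of the scheduling only, not the paper's implementation: `SleepingModel`, `observe`, and the `consolidate` callback are hypothetical names, and the callback stands in for whatever learned update the real model applies to its fast weights.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class SleepingModel:
    window: int                                 # L: size of the active attention window
    n_loops: int                                # N: offline passes per sleep
    consolidate: Callable[[Any, List[Any]], Any]  # hypothetical: (state, context) -> new state
    state: Any = None                           # stands in for the fast-weight state
    cache: List[Any] = field(default_factory=list)  # stands in for the KV cache
    sleeps: int = 0

    def observe(self, token: Any) -> None:
        """Online path: accept one token, and sleep when the window fills."""
        self.cache.append(token)
        if len(self.cache) >= self.window:
            self.sleep()

    def sleep(self) -> None:
        """Offline path: N passes over the accumulated context, then hard eviction."""
        for _ in range(self.n_loops):
            self.state = self.consolidate(self.state, self.cache)
        self.cache.clear()
        self.sleeps += 1
```

All of the looping lives in `sleep`. Nothing in `observe` or in a later prediction call depends on `n_loops`, which is the sense in which the latency-sensitive path stays single-pass.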


Fast weights are not automatically reasoning-capable

Fast weights are often described as compressed memory. But compression is not enough for multi-step reasoning.

If a model reads a graph and later needs to answer an 8-hop query, it is not sufficient to have vaguely stored edge facts. The state must be organized so traversal can happen after the original tokens are no longer attendable.

If a model reads a cellular automaton state and later needs a bit after 32 rollout steps, the useful memory is not merely the original string. The useful memory is some representation that has already spent compute simulating or preparing the future state.

The paper’s key conceptual move is to make memory formation iterative.

A single recurrent update may write something into state. Multiple offline recurrent updates can refine that state before eviction. This resembles the difference between copying notes into a notebook and actually studying them.


Evidence: controlled tasks isolate reasoning depth

The cleanest evidence comes from synthetic tasks designed to separate memory load from reasoning depth.

Rule 110 cellular automata

In the cellular automaton task, the model sees independent binary strings and must later predict the first bit after t Rule 110 transitions. The larger t becomes, the more sequential computation is required.
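For readers unfamiliar with the task, the target can be computed directly. The sketch below applies the standard Rule 110 lookup table for t steps and returns the first bit. The wrap-around boundary is an assumption made here for simplicity; the paper's exact boundary handling and string length are not specified in this article.

```python
# Standard Rule 110 table: (left, center, right) -> next center bit.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}


def step(cells):
    """Apply one Rule 110 transition (periodic boundary is an assumption)."""
    n = len(cells)
    return [
        RULE_110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
        for i in range(n)
    ]


def first_bit_after(cells, t):
    """Return the first bit of the row after t transitions."""
    for _ in range(t):
        cells = step(cells)
    return cells[0]


print(first_bit_after([0, 0, 0, 1], 3))
```

Each call to `step` needs the complete previous row, so the loop in `first_bit_after` makes the sequential dependence on t explicit.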

The setup uses a hard eviction constraint: the KV cache is cleared at boundaries, so the model cannot simply attend back to the original state at prediction time. The answer must be supported by fast weights.

For a difficult setting with t = 32, the non-looped hybrid model remains close to random guessing, around 10% exact accuracy after nearly 5B training tokens. Adding offline loops improves both learning speed and final accuracy: two loops reach roughly 20%, while three and four loops exceed 30%.

This is not a complete solution to the task, but it is strong evidence for the paper’s narrower point: additional consolidation-time compute helps when the model must reason over evicted context.

Depo: multi-hop graph retrieval

Depo is a k-hop directed-cycle retrieval task. The model sees a shuffled cycle, then receives queries such as “what node is reached after k outgoing edges from this start node?” Larger k requires deeper graph traversal.
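The ground truth for such a query is a simple pointer chase. The sketch below builds a shuffled directed cycle and answers a k-hop query. It illustrates the task as described here, and the function names and edge-list format are hypothetical rather than the benchmark's actual encoding.

```python
import random


def make_shuffled_cycle(n, rng):
    """Return the edges of one directed cycle over n nodes, in shuffled order."""
    nodes = list(range(n))
    rng.shuffle(nodes)
    edges = [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]
    rng.shuffle(edges)  # presentation order carries no information about the cycle
    return edges


def k_hop(edges, start, k):
    """Follow k outgoing edges from start and return the node reached."""
    successor = dict(edges)
    node = start
    for _ in range(k):
        node = successor[node]
    return node


edges = make_shuffled_cycle(8, random.Random(0))
print(edges)
print(k_hop(edges, start=edges[0][0], k=4))
```

With the edge list in hand, the answer takes k dictionary lookups. The difficulty for the model is that it must be able to do the equivalent after the edge tokens are no longer attendable.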

In the paper’s setting, the cycle is fragmented across several cache windows, and the query-answer portion arrives after the relevant graph context has been evicted. The model must form a query-agnostic representation that can support different start nodes and hop counts.

Increasing offline loops improves learning speed, especially for 4-hop and higher queries. That pattern matters: extra sleep is most useful when the task requires deeper computation, not merely when more bits must be stored.


GSM-Infinite: closer to language-model reasoning

The paper also evaluates on GSM-Infinite, a synthetic math-reasoning benchmark modeled after GSM8K. It controls length with distractor tokens and difficulty with the number of arithmetic operations required.

Each problem contains roughly 2,000–3,300 tokens. The model’s active context window is set to L = 2000, so a full problem generally does not fit in active attention at prediction time. The model sees the question before the context and must produce the final answer without chain-of-thought traces in the data.

The authors test two pretrained-model routes:

  • Jet-Nemotron 2B, an SSM–attention hybrid;
  • Ouro 1.4B, a depth-recurrent attention model augmented with Jet fast-weight layers.

The pattern persists. Easier two- and four-operation problems often approach saturation regardless of loop count. Harder six- and eight-operation problems show clearer gains from additional sleep.

Reported examples:

  • Jet-Nemotron 2B with six loops improves six-operation accuracy from 0.742 to 0.812, and eight-operation accuracy from 0.351 to 0.388.
  • Ouro 1.4B with four loops improves six-operation accuracy from 0.419 to 0.615, and eight-operation accuracy from 0.210 to 0.272.
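Those figures are easier to compare as absolute and relative gains. The short calculation below uses only the four before/after pairs quoted above; the dictionary layout and the `gains` helper are illustrative.

```python
# (model, operations) -> (accuracy before, accuracy after), as quoted above.
reported = {
    ("Jet-Nemotron 2B", 6): (0.742, 0.812),
    ("Jet-Nemotron 2B", 8): (0.351, 0.388),
    ("Ouro 1.4B", 6): (0.419, 0.615),
    ("Ouro 1.4B", 8): (0.210, 0.272),
}


def gains(before, after):
    """Return (absolute gain, gain relative to the starting accuracy)."""
    absolute = after - before
    return absolute, absolute / before


for (model, ops), (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{model:<16} {ops}-op  +{absolute:.3f} absolute  +{relative:.1%} relative")
```

Read this way, the Ouro 1.4B gains are the larger ones in both absolute and relative terms.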

The result is still bounded: GSM-Infinite is procedural and synthetic, not open-ended agent memory. But it makes the paper more relevant than a purely toy demonstration.


Sliding-window sleep: not only hard eviction

The authors also evaluate a sliding-window variant.

Instead of completely clearing the attention cache whenever the window fills, the model can retain the most recent L - 1 tokens and evict only older tokens. With N = 1, this resembles a standard sliding-window attention plus SSM hybrid baseline. With N > 1, the model performs extra recurrent consolidation before older context leaves the attention window.
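The difference between the two eviction policies is easiest to see as a schedule. The sketch below lists, for each sleep, the position at which it happens and which token indices leave the attention cache. The `keep` parameter is a hypothetical name: `keep=0` corresponds to hard eviction and `keep=window - 1` to the sliding variant. Triggering a sleep every time the cache reaches the window size is this article's reading of the description, not a confirmed detail of the paper.

```python
def eviction_schedule(n_tokens, window, keep):
    """Return a list of (position, evicted_token_indices), one entry per sleep."""
    if not 0 <= keep < window:
        raise ValueError("keep must satisfy 0 <= keep < window")
    cache, events = [], []
    for i in range(n_tokens):
        cache.append(i)
        if len(cache) == window:
            cut = len(cache) - keep          # everything before `cut` is evicted
            events.append((i, cache[:cut]))  # consolidation would run here, before eviction
            cache = cache[cut:]
    return events


print(eviction_schedule(6, window=3, keep=0))  # hard eviction
print(eviction_schedule(5, window=3, keep=2))  # sliding window, keep = L - 1
```

Under hard eviction, sleeps are rare and each one clears a whole window. Under the sliding policy, sleeps are frequent and each one evicts only the oldest token, so the consolidation cost per sleep is multiplied by how often the boundary is hit.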

On GSM-Infinite with Ouro 1.4B and a smaller window (L = 512), increasing N improves accuracy across operation counts. The paper reports especially large gains on easier two-operation problems in this smaller-window setting, suggesting that longer sleep can help not only multi-step reasoning but also compression and retrieval when active attention is much shorter than the sequence.

That matters for agents because real systems often use sliding or summarized context rather than clean hard eviction boundaries.


How this differs from nearby ideas

This paper sits near several existing directions, but it is not identical to them.

Not just context compression

Context compression usually shortens or replaces the active context with a compact representation. This method transfers evicted context into weight-based memory via learned recurrent updates.

The question is not only “how short can the context become?” but “how much computation should be spent forming the state that remains?”

Not ordinary answer-time recurrence

Depth-recurrent models can spend extra compute at prediction time. Here, the extra compute is spent before prediction, during consolidation. The answer-time latency constraint is preserved.

For interactive agents, that distinction is important. A system may tolerate background consolidation more easily than delayed user-facing responses.

Not standard test-time training

Test-time training updates parameters with gradient steps on a predefined objective. This paper uses learned recurrent forward passes and local update rules for fast-weight consolidation. It is a different mechanism with a different control surface.

Not biological sleep

The sleep metaphor is useful, but the claim is architectural, not biological. The model is not conscious, not dreaming, and not having subjective rest. It is performing offline recurrence.


The systems tradeoff

The main benefit is clear: wake-time prediction can stay single-pass while the model still benefits from extra computation spent earlier.

The cost is also clear: the computation does not disappear.

Training must backpropagate through the sleep process, which can require deeper forward and backward passes. The authors note that this can make training slow and unstable. It also introduces sequentiality across context and depth dimensions.

That sequentiality is not incidental. It is the mechanism. The tasks where the method helps are themselves sequential: cellular automaton rollout, graph traversal, arithmetic reasoning over long context.

For builders, the lesson is not “always add sleep.” The lesson is to ask:

  • Is old context going to be evicted before it is needed?
  • Does the future task require reasoning over that old context, not just recall?
  • Can the system afford offline consolidation windows?
  • Is wake-time latency more important than background compute?
  • Can training or adaptation remain stable as N increases?

If the answers are mostly yes, sleep-like consolidation becomes a plausible design primitive.


Agent memory lesson: retrieval is not judgment

For long-horizon agents, the most useful takeaway may be conceptual.

Many agent systems accumulate memory as logs, vector entries, summaries, transcripts, todos, and rule files. That is necessary, but not sufficient.

A vector store can retrieve relevant text. A summary can shrink history. A scratchpad can preserve notes. But none of those automatically turns experience into better judgment.

A practical agent needs consolidation windows:

  • replay recent failures;
  • update checklists;
  • distill raw notes into durable memory;
  • convert incidents into regressions;
  • separate facts from interpretations;
  • decide which memories should change future behavior;
  • prune stale or low-value state.
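Outside model weights, a consolidation window can be as plain as a scheduled job over the agent's notes. The sketch below covers a few of the items above: it prunes stale notes, keeps facts and interpretations apart, and promotes repeated failures into checklist entries. Every name, label, and threshold in it is hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Note:
    kind: str      # hypothetical labels: "failure", "fact", "interpretation"
    text: str
    age_days: int


def consolidate_notes(
    notes: Iterable[Note], max_age_days: int = 30, promote_after: int = 2
) -> Dict[str, List[str]]:
    """Turn raw notes into a smaller durable memory."""
    fresh = [n for n in notes if n.age_days <= max_age_days]  # prune stale state
    failures = Counter(n.text for n in fresh if n.kind == "failure")
    return {
        # A failure seen at least promote_after times should change future behavior.
        "checklist": sorted(t for t, c in failures.items() if c >= promote_after),
        # Facts and interpretations are deduplicated but never merged together.
        "facts": sorted({n.text for n in fresh if n.kind == "fact"}),
        "interpretations": sorted({n.text for n in fresh if n.kind == "interpretation"}),
    }
```

The point of the sketch is the shape of the job rather than its rules: it runs off the interactive path, it reads everything recent, and its output is smaller and more actionable than its input.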

Current production agents may implement this outside model weights: nightly jobs, memory maintenance, skill updates, regression logs, or curator processes. Future architectures may push more of this into fast-weight or recurrent state.

Either way, the principle is the same:

Before an agent forgets tokens, it should spend compute deciding what those tokens should become.

That is the real builder-facing value of the paper.


Caveats and overclaim guards

There are several things this paper does not prove.

It does not prove that LLMs are conscious or that they sleep in any biological sense.

It does not solve long-term autobiographical memory for open-ended agents.

It does not make KV cache obsolete. Attention remains valuable for high-fidelity recent context.

It does not show that sleep-like recurrence will automatically scale to frontier production models or messy real-world agent workloads.

It does show a clean experimental pattern: when context is evicted and the future task requires sequential reasoning over that context, spending more compute during consolidation can improve later single-pass inference.

That is narrower than the headline, but much more useful.


The builder takeaway

The paper’s real contribution is to make memory consolidation computational.

Long-context agents should not only ask: “How do I store more?”

They should ask:

  • How do I organize what I have seen?
  • When should I pause online interaction to consolidate?
  • What state should survive eviction?
  • Which parts of memory should become rules, summaries, skills, or fast weights?
  • How much background compute is worth spending so future responses stay fast?

That is a better mental model than simply chasing bigger context windows.

Sleep, in this paper, is not a cute metaphor for downtime. It is a scheduling decision: move expensive cognition out of the latency-sensitive path and use it to make memory usable.

For long-horizon agents, that may become one of the most important design patterns: not just remembering the past, but taking time to turn the past into a state that can reason.