Memory Caching for Growing RNN Memory

The paper “Memory Caching: RNNs with Growing Memory” is interesting because it frames the RNN-vs-Transformer trade-off as a memory-capacity problem rather than a branding fight.

Transformers work well on recall-heavy tasks partly because attention gives them growing memory: as the context gets longer, the model can directly access more past token information. That is powerful, but it brings the familiar costs: attention compute that grows quadratically with sequence length, and a KV cache that grows with every token at inference time.

Recurrent models and linear-attention-like systems are attractive for the opposite reason. They compress history into a fixed-size state, so their per-token cost can stay much lower. But the fixed state becomes the bottleneck. Long sequences force the model to keep compressing more past information into the same container, and recall-heavy tasks expose the loss.

Memory Caching, or MC, proposes a simple middle position:

Cache checkpoints of the recurrent memory state, then let future tokens query both the current online memory and selected cached memories from earlier segments.

That makes recurrent memory grow with sequence length, but under a budget the builder can control.

The core mechanism

MC splits a sequence into segments. Each segment updates a memory module recurrently. At the end of a segment, the final memory state is cached.

For a later token, the model does not have to rely only on the current online state. Its output is built from three pieces:

the online memory for the current segment;
cached memory states from previous segments;
an aggregation rule that decides how those memories affect the output.

This is the useful abstraction: cached states are not raw token KV entries. They are compressed checkpoints of the model’s own memory-update process.
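
As a minimal sketch of the segment-and-cache loop described above, assume the simplest linear-attention-style memory: a matrix written with outer products and read with a matrix-vector product. The helper names (`write`, `read`, `run_with_cache`), the fixed segment length, and the choice to restart the online memory at each segment boundary are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def zeros(d: int) -> Matrix:
    """A d x d memory with nothing written to it yet."""
    return [[0.0] * d for _ in range(d)]


def write(memory: Matrix, key: Vector, value: Vector) -> Matrix:
    """Linear-attention-style update: M + value key^T, returned as a new matrix."""
    return [[m + v * k for m, k in zip(row, key)] for row, v in zip(memory, value)]


def read(memory: Matrix, query: Vector) -> Vector:
    """Query the memory: M @ query."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


def run_with_cache(
    keys: List[Vector], values: List[Vector], segment_len: int
) -> Tuple[Matrix, List[Matrix]]:
    """Update the memory token by token and cache the state at each segment end."""
    d = len(keys[0])
    online = zeros(d)
    cache: List[Matrix] = []
    for t, (k, v) in enumerate(zip(keys, values), start=1):
        online = write(online, k, v)
        if t % segment_len == 0:
            cache.append(online)   # compressed checkpoint, not raw token KV entries
            online = zeros(d)      # assumption: each segment starts a fresh memory
    return online, cache
```

Each cached entry has the same size regardless of how many tokens the segment held, which is what separates this from a KV cache.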

That matters because it gives MC a tunable complexity profile. In sequence length L, standard recurrence costs about O(L) and full attention about O(L²). MC can interpolate between them depending on how many cached memories are used and how they are selected.
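
A rough way to see the interpolation is to count how many memory reads each scheme performs over a sequence. The sketch below is a back-of-the-envelope model under stated assumptions: it counts reads only (a cached state is usually larger than one key/value pair), it assumes every completed segment is cached, and the function names and the optional `top_k` cap are hypothetical.

```python
from typing import Optional


def recurrent_reads(seq_len: int) -> int:
    """One read of the single online state per token."""
    return seq_len


def attention_reads(seq_len: int) -> int:
    """Token t reads t entries (itself and everything before it)."""
    return seq_len * (seq_len + 1) // 2


def mc_reads(seq_len: int, segment_len: int, top_k: Optional[int] = None) -> int:
    """Each token reads the online state plus the cached states visible to it."""
    total = 0
    for t in range(1, seq_len + 1):
        cached = (t - 1) // segment_len      # segments completed before token t
        if top_k is not None:
            cached = min(cached, top_k)      # budget on visible checkpoints
        total += 1 + cached
    return total
```

Under this counting, a segment length of 1 reproduces the attention count and a segment length of at least the sequence length reproduces the recurrent count; everything else falls in between.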

Four aggregation strategies

The paper studies four ways to use cached memory.

Residual Memory is the simplest version. It adds responses from cached memories to the online memory response. It is crude, but the experiments show that even this direct path can improve recurrent baselines.
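
A minimal sketch of the residual idea, assuming linear memories stored as matrices and an unweighted sum over all cached responses. The paper's exact formulation may differ; `read` and `residual_read` are hypothetical names.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def read(memory: Matrix, query: Vector) -> Vector:
    """Linear read: memory @ query."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


def residual_read(online: Matrix, cache: List[Matrix], query: Vector) -> Vector:
    """Online response plus the unweighted sum of every cached response."""
    out = read(online, query)
    for cached in cache:
        out = [o + c for o, c in zip(out, read(cached, query))]
    return out
```

With an empty cache this reduces to the base recurrent read, so the cached path is purely additive.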

Gated Residual Memory adds a context-aware gate. The current query can modulate how much cached memory should matter. This is closer to what a practical long-context system needs: not every old checkpoint deserves equal influence.
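
One way to picture the gate, as a sketch under assumptions: each cached memory gets a scalar gate in (0, 1) computed from the current query. Here the gate is a sigmoid of the dot product between the query and a per-checkpoint vector. The `gate_keys` argument and the sigmoid form are illustrative choices, not the paper's stated parameterization.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def read(memory: Matrix, query: Vector) -> Vector:
    """Linear read: memory @ query."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_read(
    online: Matrix, cache: List[Matrix], gate_keys: List[Vector], query: Vector
) -> Vector:
    """Online response plus cached responses, each scaled by a query-dependent gate."""
    out = read(online, query)
    for cached, gate_key in zip(cache, gate_keys):
        # Hypothetical gate: how well the query matches this checkpoint's gate vector.
        gate = sigmoid(sum(g * q for g, q in zip(gate_key, query)))
        out = [o + gate * c for o, c in zip(out, read(cached, query))]
    return out
```

A gate near 0 silences a checkpoint for this query and a gate near 1 recovers the plain residual sum.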

Memory Soup averages cached memory modules, in the spirit of model soups (weight averaging). For linear memories this can resemble averaging state matrices, but for deeper or nonlinear memory modules it becomes a different design point: the cached memory modules themselves are combined before being queried.
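
The linear versus nonlinear distinction can be checked on a toy example. The sketch below assumes matrix-valued memories and uses a `tanh` on the read as a stand-in for a nonlinear memory module. All names are hypothetical and the nonlinear case is only a one-layer illustration, not the paper's deep memory.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def read(memory: Matrix, query: Vector) -> Vector:
    """Linear read: memory @ query."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


def soup(memories: List[Matrix]) -> Matrix:
    """Element-wise average of the cached state matrices."""
    n = len(memories)
    rows, cols = len(memories[0]), len(memories[0][0])
    return [[sum(m[i][j] for m in memories) / n for j in range(cols)] for i in range(rows)]


def mean_of_reads(memories: List[Matrix], query: Vector) -> Vector:
    """Query every memory separately, then average the responses."""
    reads = [read(m, query) for m in memories]
    return [sum(col) / len(reads) for col in zip(*reads)]


def tanh_read(memory: Matrix, query: Vector) -> Vector:
    """Toy nonlinear memory: a tanh applied to the linear read."""
    return [math.tanh(x) for x in read(memory, query)]
```

For the linear read, souping first and then querying gives the same answer as averaging the individual responses. With `tanh_read` the two orders generally disagree, which is why souping deeper memories is its own design choice.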

Sparse Selective Caching (SSC) uses a router, Mixture-of-Experts style, to select only the most relevant cached memories. This is probably the most builder-relevant variant because it treats memory access as a routing problem. Long context is not just about storing more; it is about deciding what not to read.
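
As a sketch of routing under assumptions: score each cached memory against the current query, keep the top k, and read only those. The per-segment `summaries` (for example, a pooled representation of each segment), the dot-product score, and the unweighted sum over selected memories are illustrative choices rather than the paper's router.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def read(memory: Matrix, query: Vector) -> Vector:
    """Linear read: memory @ query."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


def route_top_k(summaries: List[Vector], query: Vector, k: int) -> List[int]:
    """Indices of the k cached segments whose summary best matches the query."""
    scores = [sum(s * q for s, q in zip(summary, query)) for summary in summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_read(
    online: Matrix, cache: List[Matrix], summaries: List[Vector], query: Vector, k: int
) -> Vector:
    """Online response plus responses from only the routed cached memories."""
    out = read(online, query)
    for i in route_top_k(summaries, query, k):
        out = [o + c for o, c in zip(out, read(cache[i], query))]
    return out
```

The read cost now depends on k rather than on how many checkpoints exist, which is the "deciding what not to read" part.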

What the experiments say

The experiments are positioned as proof of concept across several recurrent or memory-style architectures, including linear attention, Titans-style deep memory modules, sliding-window linear attention, and deep linear attention.

Across language modeling, long-context understanding, and in-context recall tasks, MC improves the recurrent baselines. The recall results are especially important: Transformers still achieve the best accuracy, but MC variants narrow the gap and outperform strong recurrent alternatives in several settings.

The efficiency results point the same way. MC adds overhead compared with a base recurrent model, but remains a middle ground rather than collapsing into full attention cost. The paper reports that SSC gives the best efficiency/performance balance among the proposed variants, especially as context length grows.

The honest reading is:

MC is not a proof that RNNs now dominate Transformers.
It is evidence that fixed-size recurrent memory is not the only efficient option.
The design space between “compress everything into one state” and “attend to everything” is still underexplored.

Why agent builders should care

For agent systems, this paper is useful even beyond neural architecture design.

A long-running agent faces a similar problem. If it keeps only the current state, it forgets. If it loads the whole transcript, it wastes context and invites distraction. The practical answer is usually structured memory: segment, summarize, checkpoint, retrieve selectively, and keep a budget.

MC gives a model-internal version of that pattern.

The most transferable builder lesson is that memory should be treated as a capacity dial:

how often do we checkpoint?
what form does each checkpoint take?
how many checkpoints are visible at query time?
should access be residual, gated, averaged, or routed?
what is the cost of reading more memory compared with recomputing or forgetting?

Those are exactly the questions agent infrastructure teams already ask at the system level. MC asks them inside the sequence model.
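
At the system level, those dials can be written down as explicit configuration. The sketch below is a toy agent memory, not anything from the paper: `MemoryDials`, `AgentMemory`, the truncated-text "summary", and the word-overlap retrieval are all hypothetical stand-ins chosen to keep the example self-contained. Access here is routed; the other access modes would replace the `context` method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryDials:
    checkpoint_every: int = 4      # how often we checkpoint, in messages
    summary_chars: int = 80        # form of a checkpoint: truncated joined text
    visible_checkpoints: int = 2   # how many checkpoints a query may read


@dataclass
class AgentMemory:
    dials: MemoryDials
    online: List[str] = field(default_factory=list)
    checkpoints: List[str] = field(default_factory=list)

    def add(self, message: str) -> None:
        """Buffer a message; compress the buffer into a checkpoint when it fills."""
        self.online.append(message)
        if len(self.online) == self.dials.checkpoint_every:
            summary = " | ".join(self.online)[: self.dials.summary_chars]
            self.checkpoints.append(summary)
            self.online = []

    def context(self, query: str) -> List[str]:
        """Routed access: best-matching checkpoints first, then the online buffer."""
        words = set(query.lower().split())

        def overlap(checkpoint: str) -> int:
            return len(words & set(checkpoint.lower().split()))

        ranked = sorted(self.checkpoints, key=overlap, reverse=True)
        return ranked[: self.dials.visible_checkpoints] + self.online
```

Raising `visible_checkpoints` or lowering `checkpoint_every` buys recall at the price of a larger context, which is the same trade MC makes inside the model.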

The caveat

Caching compressed states is not the same as preserving all details. A memory checkpoint can still lose information. Routing can select the wrong cache. Averaging memories can blur distinctions. And if a task needs exact token-level retrieval, a Transformer-style mechanism may still be superior.

But that caveat is also the point. MC does not need to replace attention to be valuable. It gives recurrent architectures a way to pay for more memory only when that memory is useful.

For builders, that is the interesting direction: not one universal memory architecture, but systems where memory size, selection, and fidelity are explicit engineering knobs.

Source: Memory Caching: RNNs with Growing Memory, arXiv:2602.24281 — https://arxiv.org/pdf/2602.24281