🤖 For Agents

Attention Residuals: Why Future Agents May Need Better Depth, Not Just Bigger Context

Kimi Team's Attention Residuals paper argues that some long-horizon reasoning failures may come from the model's residual architecture itself, not just prompting or memory design. For agent builders, this reframes depth-wise information routing as a first-class concern.

2026-04-01 · 8 min read
attention-residuals · transformer-architecture · reasoning · moonshot-ai · depth-routing · agent-reliability


Agent builders spend a lot of time fighting symptoms.

An agent forgets an earlier constraint, drifts during a long tool loop, or loses the thread halfway through a reasoning chain. The usual fixes are familiar: rewrite the system prompt, add better memory, summarize more often, shorten the loop, inject reminders near the end of the context, or wrap the model in more scaffolding.

Those fixes matter. But the new Attention Residuals technical report from Kimi Team / Moonshot AI makes a more uncomfortable suggestion: some of the brittleness we see in long-horizon reasoning may not be just a prompting problem or a memory-wrapper problem. Part of it may live deeper, in the model architecture itself.

That is what makes this paper worth reading for agent builders.

It is not another generic “we made the benchmark go up” paper. It is a direct attack on one of the most boring-looking but foundational parts of modern Transformers: the residual connection. And the core argument is surprisingly intuitive once you step back from the equations.

The real problem the paper is trying to solve

In standard PreNorm Transformers, each layer adds its output into a running hidden state. This is elegant, stable, and deeply baked into modern LLM design. But it also means every layer is effectively writing into a shared running sum.

The paper argues that this creates two linked problems.

First, hidden states keep growing with depth, which causes what the authors call PreNorm dilution. As more and more layers contribute to the running state, any single layer’s contribution becomes less distinct. Earlier signals do not disappear completely, but they get mixed into an increasingly large soup.

Second, later layers have no clean way to say, “I want to retrieve something specifically from layer 8, not just inherit the blended mixture of everything so far.” Standard residuals are simple addition. They do not support selective retrieval over depth.

That is the key intuition of the paper: modern Transformers are very good at attending over tokens, but residual connections still treat depth as a blunt accumulation process.
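A toy sketch makes the dilution concrete. This is not the paper's code, just an illustration of the arithmetic: if each layer adds a roughly unit-norm update into a shared running sum, any single layer's share of the final hidden state shrinks as depth grows.

```python
import numpy as np

# Toy illustration of PreNorm dilution (not the paper's implementation):
# each layer writes a unit-norm update into one shared running sum.
rng = np.random.default_rng(0)
d, n_layers = 64, 48

hidden = np.zeros(d)
updates = []
for _ in range(n_layers):
    update = rng.normal(size=d)
    update /= np.linalg.norm(update)   # unit-norm layer output
    hidden += update                   # standard residual addition
    updates.append(update)

# How much of the final state is still "layer 8"? Project the final
# hidden state onto layer 8's update and normalize.
share = abs(updates[8] @ hidden) / np.linalg.norm(hidden) ** 2
print(f"hidden-state norm after {n_layers} layers: {np.linalg.norm(hidden):.2f}")
print(f"layer 8's relative share of the final state: {share:.3f}")
```

With near-orthogonal updates, the hidden-state norm grows roughly with the square root of depth, while each individual layer's relative contribution decays toward 1/depth: the "soup" effect described above.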

Attention Residuals asks a simple question:

What if the model could attend over previous layers the way it already attends over previous tokens?

From residual addition to attention over depth

That is the core move.

Instead of forcing each layer to consume the uniform accumulated sum of all previous layers, Attention Residuals (AttnRes) lets a layer compute a learned weighting over earlier layer outputs. In other words, the model can decide which earlier representations matter more for the current computation.

The standard residual path says: take everything so far and add the next thing.

AttnRes says: look back across depth, score the previous layers, and selectively aggregate what matters.

This is a deeper shift than it first sounds.

Residual connections are usually treated like plumbing. They are important, but not expressive. This paper turns them into an active retrieval mechanism. A layer is no longer just inheriting history. It is querying history.
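The contrast between the two residual paths can be sketched in a few lines. This is a deliberately simplified picture, assuming a single hypothetical learned query per layer and dot-product scoring over stored earlier outputs; the paper's actual parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Outputs of earlier layers, kept separate instead of collapsed into one sum.
layer_outputs = [rng.normal(size=d) for _ in range(8)]

# Standard residual path: a uniform, unweighted sum of everything so far.
residual_input = np.sum(layer_outputs, axis=0)

# AttnRes-style path: score the earlier layers with a learned query
# (here just a random stand-in) and selectively aggregate them.
query = rng.normal(size=d)  # hypothetical learned parameter
scores = np.array([query @ h for h in layer_outputs])
weights = softmax(scores / np.sqrt(d))
attn_input = sum(w * h for w, h in zip(weights, layer_outputs))

print("uniform residual weights:", np.full(len(layer_outputs), 1.0))
print("learned depth weights:   ", np.round(weights, 3))
```

The difference is exactly the one described above: addition treats every earlier layer identically, while the attention path lets the current layer up-weight the depth positions that actually matter for its computation.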

For agent builders, that framing should ring a bell immediately. A lot of agent design today is really about retrieval under constraints:

  • retrieve the right memory, not all memory
  • call the right tool, not every tool
  • resurface the right instruction, not the entire prompt log
  • preserve the right state transition, not a blurry trace of all previous steps

AttnRes applies that same philosophy inside the model itself.

Why this matters more for agents than for casual chat

Casual chat can hide a lot of architectural weakness. An answer can still feel fluent even if the model is relying on shallow heuristics or partially diluted internal representations.

Agents are less forgiving.

The moment you ask a model to operate over longer trajectories — planning, revising code, managing tools, maintaining constraints, chaining intermediate results, or recovering from partial failures — architectural brittleness gets exposed much faster. The model has to keep the task coherent across depth and across steps.

That is why this paper feels especially relevant for agent systems, even though it is technically a model-architecture paper.

If the underlying network is better at selectively preserving and resurfacing useful intermediate representations, then the downstream benefits are likely to show up exactly where agents struggle most:

  • multi-step reasoning
  • code generation and repair
  • long-context planning
  • sustained constraint following
  • maintaining intent over extended workflows

The paper does not claim to solve all of those directly. But it points to a plausible architectural lever behind them.

Full AttnRes versus Block AttnRes

The cleanest version of the idea is what the paper calls Full AttnRes.

In that setup, each layer can attend to all previous layers. Conceptually, this is elegant. Practically, it becomes expensive very quickly. Large-scale training already strains memory and inter-device communication. If every layer has to preserve and communicate all prior layer outputs, the system cost becomes painful.

So the more deployable contribution is Block AttnRes.

Instead of attending over every single previous layer, the network is partitioned into a smaller number of blocks. Within each block, residual accumulation behaves more normally. Across blocks, the model performs attention over block-level summaries.

This is the part I appreciate most about the paper: it is not just architectural idealism. The authors clearly understand that a “better” mechanism is irrelevant if it falls apart under real training infrastructure.

Block AttnRes is the engineering compromise that makes the idea practical.

The paper also adds infrastructure tricks — including cache-based pipeline communication and a two-phase computation strategy — to keep the overhead low enough for large-scale training and inference.
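The block-level compromise can be sketched as follows. Shapes and details here are illustrative assumptions, not the paper's implementation: residuals accumulate normally inside a block, and only block-level summaries are kept for cross-block attention, so the number of depth states the system must store and communicate drops from layers to blocks.

```python
import numpy as np

rng = np.random.default_rng(2)
d, layers_per_block, n_blocks = 64, 6, 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

block_summaries = []
for b in range(n_blocks):
    # Inside a block: ordinary residual accumulation, starting from
    # the previous block's summary (or zeros for the first block).
    hidden = block_summaries[-1].copy() if block_summaries else np.zeros(d)
    for _ in range(layers_per_block):
        hidden += rng.normal(size=d) / np.sqrt(d)  # plain residual add

    # Across blocks: attend over the summaries produced so far,
    # not over every individual layer output.
    if block_summaries:
        q = rng.normal(size=d)  # hypothetical learned query
        w = softmax(np.array([q @ s for s in block_summaries]) / np.sqrt(d))
        hidden = hidden + sum(wi * s for wi, s in zip(w, block_summaries))

    block_summaries.append(hidden)

print("depth states kept:", len(block_summaries),
      "instead of", layers_per_block * n_blocks)
```

This is why the block variant survives real infrastructure: the memory and communication cost scales with the number of blocks rather than the full layer count.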

That detail matters because many research ideas die in the gap between “good on paper” and “survives contact with actual systems.” This one at least tries to cross that gap seriously.

The strongest results in the paper

The paper reports that the gains are consistent across scale and especially visible on reasoning-heavy and code-heavy tasks.

The headline quantitative claims include:

  • scaling-law experiments suggesting Block AttnRes reaches the same loss as a standard baseline trained with about 1.25x more compute
  • integration into a Kimi Linear 48B total / 3B activated MoE model trained on 1.4T tokens
  • downstream improvements across all evaluated tasks
  • especially strong gains on:
    • GPQA Diamond: +7.5
    • Minerva Math: +3.6
    • HumanEval: +3.1

Those are not tiny cosmetic bumps, especially given the kind of tasks involved.

The paper also claims that AttnRes mitigates PreNorm dilution by producing more uniform hidden-state magnitudes and more even gradient distribution across depth.

That part is easy to overlook, but it may be the deeper contribution. Better benchmark scores are nice. Better training dynamics often matter more in the long run.

Why the paper’s framing is more interesting than the benchmark table

The most valuable thing here is not the benchmark delta by itself. It is the reframing.

The paper basically says: residual connections in LLMs are doing a crude form of depth aggregation, and we can probably do better by turning depth into something attention can navigate selectively.

That is a big conceptual move.

Historically, attention replaced more rigid sequence-processing assumptions in token space. Attention Residuals extends that logic into layer space. The model should not just pass information forward. It should be able to choose what to retrieve from its own depth history.

If that framing holds up, then the implications go beyond this single paper.

It suggests that some future gains in agent capability may come not from ever-larger context windows or more wrapper complexity, but from base models whose internal depth mechanics are simply less lossy.

That is interesting because agent builders often respond to model brittleness by building ever more external scaffolding around the model. Sometimes that is the right move. But sometimes the base model is asking the wrapper to compensate for an internal architectural limitation.

What this could mean for future agent systems

I would not overclaim here. Attention Residuals does not suddenly mean agents will stop forgetting, stop hallucinating, or become robust planners overnight.

But it does matter in at least three ways.

1. Better long-horizon reasoning may need architectural help, not just prompt tricks

A lot of “agent reliability” discourse still assumes the main levers are prompting, tool design, memory stores, and evaluation loops. Those matter. But if deep models themselves dilute earlier signals as they go deeper, then some failure modes are upstream of all that.

That means future progress in agents may depend partly on architectural changes like this one, not just better wrappers.

2. Reasoning and code agents are the first obvious beneficiaries

The paper’s strongest improvements show up on tasks like GPQA, math, and code. That fits the intuition. These are the exact settings where the model needs to preserve structured intermediate information rather than just produce locally plausible next tokens.

If a model can better retrieve useful internal representations across depth, code agents and reasoning agents should benefit earlier and more clearly than generic chatbots.

3. External memory is not the whole story

Agent builders love memory systems — for good reason. But better external memory does not automatically fix weak internal routing. If the model cannot preserve or recover the right internal abstractions across depth, even a beautifully designed memory layer can become a band-aid rather than a cure.

Attention Residuals is a reminder that “memory” exists at multiple levels:

  • prompt memory
  • retrieval memory
  • tool state
  • and internal architectural memory over depth

Ignoring the last one may leave performance on the table.

What this paper does not solve

This is where we should stay sober.

First, Full AttnRes is still too expensive to be the default answer at scale. Block AttnRes is the practical version precisely because the full version creates memory and communication pain.

Second, the system-level engineering is nontrivial. The paper needs caching tricks and special computation strategies to keep overhead under control. That does not make the idea bad, but it does mean this is not a free architectural lunch.

Third, even the authors’ preferred shape appears to lean toward deeper, narrower networks, which can introduce latency trade-offs during inference. Agent builders care about reasoning quality, but they also care about wall-clock responsiveness. Those two goals do not always align.

And finally, not every agent failure should be reinterpreted as a residual-architecture problem. Plenty of failures still come from weak tool interfaces, poor environment design, low-quality evals, bad reward signals, and fragile prompting.

So no, this is not a universal explanation for all agent brittleness.

But it is one of the more credible arguments I have seen that some of the pain might start lower in the stack than we usually admit.

My take

I think Attention Residuals is important for one reason above all: it points at a class of improvements that feels structurally aligned with the future of agents.

Agents are not just bigger chatbots. They are systems that need to sustain coherence across steps, tools, constraints, and revisions. Any base-model improvement that helps preserve useful information across longer reasoning chains is worth watching closely.

This paper does not prove that Attention Residuals is the answer. But it does make a compelling case that residual design deserves more attention than it usually gets.

If I were building agent systems today, I would take three lessons from this paper:

  • do not assume every long-horizon failure is fixable at the prompt layer
  • pay attention to architecture work that improves depth-wise information routing
  • expect future agent gains to come partly from cleaner base-model internals, not only bigger context windows and more wrapper logic

There is a broader pattern here.

The first wave of agent building was about giving models tools. The next wave was about giving them memory. A future wave may be about giving them better internal depth mechanics, so they can actually hold onto the thread of their own thinking while those tools and memories are in play.

That is why this paper matters.

Not because it makes for a flashy headline.

But because it quietly suggests that some of the most important agent improvements ahead may come from fixing how models think through depth — not just how long they can talk.
