Why Agents Should Learn From Latents, Not Just Tokens

Modern AI systems are mostly trained to predict what comes next: the next token, the masked patch, the denoised pixel, the missing fragment of the raw signal.

That recipe works extraordinarily well. But it may also be statistically wasteful.

The paper “Learn from your own latents and not from tokens: A sample-complexity theory” by Daniel J. Korchinski, Alessandro Favero, and Matthieu Wyart gives a clean theoretical reason why: if observations are generated by hidden compositional structure, learning from surface tokens can require far more samples than learning from model-derived latent representations.

The practical agent-builder reading is simple:

Latent prediction is not magic. It is a bias toward learning structure.

And agents need structure more than they need surface fluency alone.

The surface is not the generator

Token prediction treats the visible sequence as the direct learning target. In language, that target is a token. In vision, it may be a pixel, patch, or denoising target. In multimodal systems, it may be another raw or lightly processed observation.

But real data is rarely flat.

A sentence is not just a sequence of words. A scene is not just a grid of pixels. An interaction history is not just a list of messages. These observations are produced by latent structure: entities, relations, goals, causal dependencies, discourse state, task state, and context that persists across many surface forms.

For agents, this distinction matters. A useful agent cannot merely model the next token in a transcript. It needs reusable abstractions: what the user wants, what has already been tried, which constraints still matter, what state the external world is in, and which plan fragments can be reused later.

Those are latent variables in the engineering sense, even if they are not explicitly labeled.

The paper’s clean setting: a hidden tree behind visible tokens

The paper studies a tractable probabilistic context-free grammar called the Random Hierarchy Model. It generates visible token strings from a tree of hidden symbols.

The important knob is the tree depth L.

Intuitively:

At the bottom are visible tokens.
Above them are low-level latent symbols.
Above those are higher-level latent symbols.
The deeper the tree, the more layers of hidden compositional structure generate the visible sequence.
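To make the picture concrete, here is a minimal sketch of a generator in this spirit. It is a toy, not the paper’s exact construction: the vocabulary size, the branching factor s, and the function names are assumptions made for illustration, and the rules are drawn at random without any further constraints the formal model may impose. The parameter m plays the role of the number of production rules per symbol, and depth plays the role of L.

```python
import random

def make_rules(vocab_size, m, s, depth, rng):
    """For each level, give every symbol m randomly drawn tuples of s child symbols."""
    rules = []
    for _ in range(depth):
        level = {}
        for symbol in range(vocab_size):
            level[symbol] = [
                tuple(rng.randrange(vocab_size) for _ in range(s))
                for _ in range(m)
            ]
        rules.append(level)
    return rules

def sample_tokens(rules, root, rng):
    """Expand the root level by level; the final expansion is the visible sequence."""
    sequence = [root]
    for level in rules:
        expanded = []
        for symbol in sequence:
            # Pick one of the symbol's m productions uniformly at random.
            expanded.extend(rng.choice(level[symbol]))
        sequence = expanded
    return sequence

rng = random.Random(0)
rules = make_rules(vocab_size=4, m=2, s=2, depth=3, rng=rng)
tokens = sample_tokens(rules, root=rng.randrange(4), rng=rng)
print(len(tokens), tokens)  # each sequence has s**depth tokens
```

Everything above the last expansion is hidden from the learner. Only the final list of tokens is observed.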

This is not meant to be a literal model of the web. It is a controlled setting where the authors can ask a sharp sample-complexity question:

How many samples does a learner need to recover the latent hierarchy behind the tokens?

The answer depends strongly on the objective.

Prior analyses of this model show that supervised learning and token-level self-supervised learning (SSL) can require a number of samples that grows exponentially with the hierarchy depth L. In the paper’s notation, where m is the number of production rules available to each hidden symbol, token-level SSL can face scaling on the order of m^(L+1) in the hard stage.

The new result is that, under the paper’s assumptions, latent prediction can recover the non-root latent hierarchy with sample complexity that is effectively independent of L, up to logarithmic factors. Their clustering analysis gives a scaling around m^3 rather than something exponential in the hierarchy depth.
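A quick back-of-the-envelope comparison shows how fast the two scalings separate. This sketch only evaluates the two orders of magnitude quoted above, m^(L+1) and m^3. It ignores constants, logarithmic factors, and every other grammar parameter, and the value of m is hypothetical.

```python
def token_level_scale(m, depth):
    """Order of magnitude m**(depth + 1) quoted for token-level SSL."""
    return m ** (depth + 1)

def latent_level_scale(m):
    """Order of magnitude m**3 quoted for latent clustering (log factors ignored)."""
    return m ** 3

m = 8  # hypothetical value, chosen only for illustration
for depth in (2, 3, 4, 6):
    token = token_level_scale(m, depth)
    latent = latent_level_scale(m)
    print(f"L={depth}: token ~ {token}, latent ~ {latent}, ratio ~ {token // latent}")
```

Under these assumptions the ratio is m^(L-2): it equals 1 at L = 2 and grows by a factor of m with every extra level.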

The exact constants and assumptions matter. But the conceptual result is the important part for builders:

If the target remains a surface token, the learning signal must travel through the whole latent tree. If the target becomes a learned latent, the system can climb the hierarchy using its own intermediate abstractions.

Why token-level prediction can be statistically wasteful

Imagine trying to infer the structure of a tree by only looking at leaves.

You can do it, but the signal gets weak as the structure becomes deeper. A surface token is many generative steps away from high-level latent causes. Each unresolved production step can blur the statistical relationship between the target and the hidden structure you actually want to learn.

That is the core sample-complexity problem in the paper.
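A loose analogy, not the paper’s calculation, helps build intuition for the blurring. Assume a binary symbol is copied down a chain, and each step flips it with some probability. The correlation between the top and the bottom then shrinks geometrically with depth, and a common rule of thumb says that detecting a correlation c takes on the order of 1/c^2 samples. The flip probability and function names below are illustrative assumptions.

```python
def leaf_root_correlation(flip_prob, depth):
    """Correlation between a +/-1 root and its leaf copy after `depth` noisy steps."""
    return (1 - 2 * flip_prob) ** depth

def rough_samples_to_detect(correlation):
    """Rule of thumb: detecting a correlation c takes on the order of 1/c**2 samples."""
    return 1 / correlation ** 2

flip_prob = 0.25  # hypothetical per-step noise level
for depth in (1, 2, 4, 8):
    c = leaf_root_correlation(flip_prob, depth)
    print(depth, c, round(rough_samples_to_detect(c)))
```

With these numbers the correlation halves at each level, so the rough sample count quadruples with every added layer.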

Token-level objectives can still yield useful representations. But they may need to discover lower-level latents first, then use those to reach higher-level latents, while the prediction target remains stuck at the surface. As the hierarchy gets deeper, the difficult stage moves upward and the sample requirement grows rapidly.

Latent prediction changes the target.

Instead of asking the model to reproduce a raw token or pixel, it asks the model to predict a representation produced by an encoder. Once useful low-level abstractions form, those abstractions can become the learning signal for higher-level structure.

This is why the paper’s result is relevant to methods like data2vec and JEPA (joint-embedding predictive architectures). These methods do not simply reconstruct the input. They predict representations of related views, masked regions, or future signals.

The target is no longer the leaf. It is a learned internal summary of part of the tree.

data2vec as implicit hierarchical latent prediction

One of the paper’s most interesting claims is that data2vec implicitly performs hierarchical latent prediction.

data2vec uses a teacher-student setup: the student sees a masked input and predicts the teacher’s representation of the unmasked input. The teacher is an exponential moving average (EMA) of the student’s weights. That detail matters because the target is not fixed. As the student learns better representations, the teacher’s future targets inherit those representations.
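The teacher-student loop can be sketched in a few lines. This is a minimal illustration under strong assumptions, not data2vec itself: the encoder is a single linear map, masking simply zeroes the hidden positions, and the loss is a squared error on the whole representation. All names, sizes, and hyperparameters are hypothetical.

```python
import random

def encode(weights, x):
    """Linear encoder: one output coordinate per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def student_step(student, teacher, x, mask, lr):
    """One gradient step: the student sees masked x and regresses onto teacher(x)."""
    x_masked = [0.0 if hidden else xi for xi, hidden in zip(x, mask)]
    target = encode(teacher, x)          # latent target, not the raw input
    pred = encode(student, x_masked)
    errors = [p - t for p, t in zip(pred, target)]
    loss = 0.5 * sum(e * e for e in errors)
    # Gradient of the squared error with respect to each student weight.
    new_student = [
        [w - lr * e * xm for w, xm in zip(row, x_masked)]
        for row, e in zip(student, errors)
    ]
    return new_student, loss

def ema_update(teacher, student, decay):
    """Teacher drifts toward the student: t <- decay * t + (1 - decay) * s."""
    return [
        [decay * t + (1 - decay) * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher, student)
    ]

rng = random.Random(0)
dim, latent_dim = 6, 3
student = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(latent_dim)]
teacher = [row[:] for row in student]    # teacher starts as a copy of the student

for step in range(5):
    x = [rng.gauss(0, 1) for _ in range(dim)]
    mask = [i < 2 for i in range(dim)]   # hide the first two positions
    student, loss = student_step(student, teacher, x, mask, lr=0.05)
    teacher = ema_update(teacher, student, decay=0.99)
```

The point of the sketch is the data flow: the target comes from the teacher’s encoder, and the teacher’s weights are refreshed from the student after every step, so the target moves as the student improves.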

In the paper’s account, this creates a recursive process. Early in training, the target is close to surface-level structure, so the model can learn low-level latent groups. Once those low-level groups appear in the teacher representation, they become part of what the student predicts next. That shifts the target distribution upward: the model is no longer only learning from surface clusters, but from internal abstractions that make higher-level latent groups statistically accessible.

This gives an intuition for why data2vec can behave like recursive latent clustering under the paper’s model, reaching the same favorable sample-complexity scaling as the explicit iterative latent-clustering procedure.

It also suggests that some representation-learning methods may be hierarchical not because the architecture explicitly stacks many hand-designed scales, but because the objective creates a curriculum: current latents become targets that make deeper latents learnable.

This does not make explicit hierarchical architectures obsolete. In the paper’s stylized RHM setting, explicit multi-scale stacking may be partly redundant, because the latent-prediction objective can induce a hierarchy on its own. That is a statement about the model, not a claim that hierarchical designs such as H-JEPA are unnecessary in real-world vision systems.

In other words, the objective itself may carry more of the hierarchy-building load than we usually assume.

Why this matters for agents

Agent systems are full of latent structure.

A support agent needs to distinguish the user’s actual issue from the wording of the complaint. A research agent needs to track hypotheses, evidence quality, unresolved questions, and tool results. A coding agent needs to maintain an internal model of project architecture, constraints, failing tests, and likely fix locations.

These are not just next-token problems. They are state-representation problems.

If an agent only learns to imitate surface trajectories, it may become fluent without becoming structurally reliable. It can sound right while losing the hidden state that matters for planning.

Latent-prediction-style training points toward a different design pressure:

learn compact state representations;
predict future or masked states in representation space;
build abstractions that survive superficial changes in wording or observation;
reuse those abstractions for memory, planning, and generalization.

This is especially important for long-running agents. A persistent agent cannot store every transcript token forever and expect intelligence to emerge from retrieval alone. It needs compressed, structured state: goals, commitments, preferences, evidence, failures, and strategies.

That is exactly the kind of object latent prediction encourages us to care about.
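As a purely illustrative sketch, compressed state for a long-running agent might look like the structure below. It is a hand-written stand-in for what a learned representation would carry, not a method from the paper. The class name, fields, and methods are hypothetical, with the field names taken from the list above.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Hypothetical compressed state for a persistent agent."""
    goals: list[str] = field(default_factory=list)
    commitments: list[str] = field(default_factory=list)
    preferences: dict[str, str] = field(default_factory=dict)
    evidence: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)
    strategies: list[str] = field(default_factory=list)

    def record_failure(self, attempt: str) -> None:
        """Store a failed attempt once, however often the transcript repeats it."""
        if attempt not in self.failures:
            self.failures.append(attempt)

    def should_try(self, attempt: str) -> bool:
        """A plan fragment is worth trying only if it has not already failed."""
        return attempt not in self.failures

state = AgentState(goals=["make the failing test pass"])
state.record_failure("clear the build cache")
state.record_failure("clear the build cache")  # repeated mention, stored once
print(state.should_try("clear the build cache"), len(state.failures))
```

The state stays the same size no matter how many times the transcript restates the same fact, which is the property that raw token storage lacks.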

The caveat: clean theory is not messy reality

The paper’s result is strong because the setting is clean.

The Random Hierarchy Model gives a controlled grammar with a fixed tree topology and well-defined latent variables. Real-world language, images, and agent environments are messier. They include noise, spurious correlations, distribution shift, ambiguous causality, multimodal grounding, and social context.

So the result should not be read as proof that today’s self-supervised models automatically recover human-like meaning.

A careful reading is narrower and more useful:

When data has hierarchical latent structure, objectives that predict learned representations can be dramatically more sample-efficient at recovering that structure than objectives that predict surface tokens alone.

That is still a big deal.

It gives theoretical support to a design principle many builders already feel empirically: do not force every learning signal to live at the raw observation level. If the system needs abstractions, make the training objective care about abstractions.

Builder takeaway

For agent builders, the paper is not an instruction to abandon token prediction. Token prediction is still the backbone of modern language systems.

The takeaway is more surgical:

surface prediction is powerful, but not always sample-efficient;
agent competence depends on latent state, not only fluent output;
representation-level objectives may help models learn reusable structure earlier;
explicit hierarchy may help, but the objective itself can also induce hierarchy;
memory and planning systems should be evaluated by the quality of their compressed state, not just by transcript-level recall.

The lesson is not that latent prediction universally beats token prediction, but that the choice of learning target can change the statistical problem.

The future of agents may not be “bigger next-token predictors plus more context” alone.

It may be systems that learn to predict, maintain, and refine their own latent state.

Not because latents are mystical. Because structure is what makes experience reusable.