Why AI Systems Don't Learn — And What Agents Should Do About It
Dupoux, LeCun & Malik (Meta FAIR, arXiv:2603.15381) propose a System A/B/M architecture for autonomous learning: a roadmap for post-deployment learning, meta-control, and V-JEPA 2.1 integration.

Paper: Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science
Authors: Emmanuel Dupoux, Yann LeCun, Jitendra Malik
Affiliation: FAIR at Meta, EHESS, NYU, UC Berkeley
Published: March 2026 — arXiv:2603.15381
The Question V-JEPA 2.1 Didn't Answer
Earlier this year, V-JEPA 2.1 (also from Meta FAIR) demonstrated something genuinely useful: a video-based joint-embedding predictive architecture that builds rich, abstract representations of the physical world without pixel-level reconstruction. It improved how AI perceives — learning to model object motion, causality, and scene dynamics directly from video, without language scaffolding.
That was a step forward in what AI can represent.
But V-JEPA 2.1 still learned the same old way: a human team collected data, designed a training recipe, set loss functions, ran the pipeline, and handed the model to deployment — where it stopped learning entirely.
arXiv:2603.15381 asks a harder question: why does learning stop when the agent is deployed? And how do we fix that?
The answer turns out to be architectural, not just a matter of better models or more compute.
What the Paper Is Arguing Against
The Scaling Illusion
The dominant assumption in AI over the past five years has been simple: more data + more compute + better architecture = more capable systems. Scale up the language model, and everything else follows.
The authors push back on this directly. Scaling a text LLM doesn't fix the core problem — it just papers over it with statistical coverage. Real-world data is heavy-tailed (full of rare, unseen cases) and non-stationary (it keeps changing over time). No static training set, however large, can fix what is fundamentally a mismatch between how AI learns and how the world works.
The Data Wall
Modern LLMs have already consumed close to the totality of readily available human-generated internet text, and the pretraining frontier is hitting diminishing returns. The authors don't describe this as a doom scenario, but as a structural limitation: you cannot train your way out of open-world deployment problems with more offline data.
Language-Centrism
Much of current AI capability is mediated through language — tokenized, human-curated, filtered text. In the paper's vocabulary (introduced below), this is System A applied to a single modality, language, at massive scale. It produces impressively broad statistical knowledge, but that knowledge is:
- Disconnected from grounded action — knowing that "fire is hot" is not the same as having a heat-avoidance policy
- Vulnerable to distribution shift — the world doesn't communicate in training-set prose
- Dependent on human data pipelines — it requires an army of engineers, curators, and annotators to function
No Lifelong Learning
Once deployed, current AI systems learn essentially nothing. Their parameters are frozen. If they encounter distribution drift, the only fix is to rebuild the model with new data — which requires human experts in the loop at every step.
The paper frames this as a fundamental design failure: learning has been outsourced to human experts instead of being an intrinsic capability of the system.
No Real Interaction With the Environment
Current SSL and language models are passive learners — they build statistical models of data that was collected for them. They cannot choose what to attend to next, cannot intervene in the world, and cannot adjust their own behavior based on outcome feedback. They are, in the authors' framing, missing half of what cognition actually is.
The Core Proposal: System A, System B, System M
The paper proposes a conceptual architecture with three components. This is a roadmap, not a shipped system — the authors are explicit about that. But the clarity of the framing makes it useful for agent builders thinking about what to build toward.
System A: Learning from Observation
System A covers the set of mechanisms by which an agent builds a model of the world through passive observation — statistical learning, self-supervised prediction, distributional learning. Everything from language modeling (GPT, BERT) to video prediction (V-JEPA) sits in this bucket.
Its strengths:
- Scales well with large datasets
- Discovers hierarchical, abstract representations
- Supports transfer to downstream tasks
Its limits:
- Requires human-designed data pipelines and task generators
- No built-in mechanism to decide what data to acquire next
- Representations are disconnected from action (correlation ≠ causation)
- Static — no update once deployed
System B: Learning from Action
System B covers the set of mechanisms by which an agent learns by intervening in the world — reinforcement learning, model-based planning, adaptive control. The agent takes actions, observes outcomes, and adjusts its policy.
Its strengths:
- Grounded in real-world feedback
- Can discover novel solutions through active search
- Naturally suited for adaptive, real-time behavior
Its limits:
- Notoriously sample-inefficient
- Struggles in high-dimensional, open-ended action spaces
- Requires well-specified reward functions, which are often unavailable in real settings
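The act-observe-adjust loop the paper describes can be illustrated with a minimal epsilon-greedy bandit, about the simplest possible System B learner: the agent chooses an action, observes an outcome, and updates its value estimates from that feedback alone, with no human-curated dataset. The setup (arm means, epsilon, step count) is a toy assumption for illustration, not something from the paper.

```python
import random

def bandit_agent(arm_means, steps=2000, eps=0.1, seed=0):
    """Toy System B learner: epsilon-greedy multi-armed bandit.

    The agent acts, observes a noisy reward, and incrementally adjusts
    its estimate of each action's value. Illustrative only.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n      # how often each arm was pulled
    values = [0.0] * n    # running estimate of each arm's mean reward
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(n)                     # explore a random arm
        else:
            a = max(range(n), key=lambda i: values[i])  # exploit best estimate
        r = rng.gauss(arm_means[a], 1.0)             # observe a noisy outcome
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]     # incremental mean update
    return values, counts
```

Note the sample-inefficiency the paper flags: even this three-action toy needs hundreds of interactions to separate the arms, and the cost grows sharply with the size of the action space.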
Neither system alone is sufficient. System A lacks groundedness. System B lacks efficient representation learning. Biological organisms use both — and they know when to switch.
System M: The Meta-Controller
This is the most novel — and most underspecified — part of the proposal.
System M is a meta-control layer that:
- Coordinates information flow between System A and System B
- Decides which learning mode to activate given current context
- Generates internal signals that trigger mode-switching (observe, act, explore, consolidate)
- Opens the door to higher-order learning modes that only appear in large-brained species: learning through communication, learning through imagination/self-play
Think of System M as the part of a toddler's cognition that decides: "Should I watch how someone else does this, or should I just try it myself?" — and makes that call autonomously, without a human engineer deciding the training recipe.
The paper's analogy to human cognitive development is intentional. Toddlers don't just accumulate observations (System A) or trial-and-error endlessly (System B). They flexibly switch between modes based on internal metacognitive states. System M is the architectural proposal for making that happen in AI.
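The paper does not specify an implementation of System M, but a toy sketch helps make the idea concrete. Here, hypothetical internal signals (prediction error, policy uncertainty, novelty) are mapped to the four modes the paper names; the signal names and thresholds are my assumptions for illustration, not the authors' design.

```python
from dataclasses import dataclass

# The four learning modes named in the paper's framing.
OBSERVE, ACT, EXPLORE, CONSOLIDATE = "observe", "act", "explore", "consolidate"

@dataclass
class MetaState:
    prediction_error: float    # how surprised System A is by recent inputs
    policy_uncertainty: float  # how unsure System B is about what to do
    novelty: float             # how unfamiliar the current context is

def choose_mode(state: MetaState,
                error_thresh: float = 0.5,
                uncertainty_thresh: float = 0.5,
                novelty_thresh: float = 0.7) -> str:
    """Toy System M: map metacognitive signals to a learning mode.

    Thresholds and signal definitions are illustrative assumptions.
    """
    if state.novelty > novelty_thresh:
        return EXPLORE        # unfamiliar territory: go gather new experience
    if state.prediction_error > error_thresh:
        return OBSERVE        # world model is wrong: watch and update System A
    if state.policy_uncertainty > uncertainty_thresh:
        return ACT            # model is fine, policy unsure: try it (System B)
    return CONSOLIDATE        # both confident: replay and compress experience
```

The point of the sketch is the interface, not the rules: a real System M would presumably learn this mapping rather than hard-code thresholds, which is exactly where the bilevel proposal below comes in.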
To scale this, the authors propose a bilevel optimization approach inspired by evolutionary timescales: jointly learn the meta-controller (System M) and the initial states of System A and B, so that the entire learning architecture is robust across diverse real-world environments — not just a fixed benchmark.
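The paper leaves the bilevel scheme abstract, so here is a minimal sketch under toy assumptions: the inner loop is within-lifetime learning in one sampled environment, and the outer evolutionary loop searches over "innate" parameters (an initial weight plus a learning-rate knob standing in for System M) to minimize average final loss across a distribution of environments. The quadratic objective and all names are illustrative, not from the paper.

```python
import random

def inner_lifetime(init_w: float, lr: float, env_target: float, steps: int = 20) -> float:
    """Inner loop: within-lifetime learning on a toy objective (w - target)^2.

    lr plays the role of an innate System M setting; returns final loss.
    """
    w = init_w
    for _ in range(steps):
        w -= lr * 2.0 * (w - env_target)   # gradient step on squared error
    return (w - env_target) ** 2

def outer_fitness(init_w: float, lr: float, rng: random.Random, n_envs: int = 8) -> float:
    """Outer objective: average final loss across sampled environments."""
    return sum(inner_lifetime(init_w, lr, rng.uniform(-2, 2)) for _ in range(n_envs)) / n_envs

def evolve(generations: int = 50, pop: int = 16, seed: int = 0):
    """Outer loop: evolutionary search over the innate parameters."""
    rng = random.Random(seed)
    best = (rng.uniform(-1, 1), rng.uniform(0.01, 0.5))
    best_fit = outer_fitness(*best, rng)
    for _ in range(generations):
        for _ in range(pop):
            cand = (best[0] + rng.gauss(0, 0.2),
                    min(0.5, max(1e-3, best[1] + rng.gauss(0, 0.05))))
            fit = outer_fitness(*cand, rng)
            if fit < best_fit:
                best, best_fit = cand, fit
    return best, best_fit
```

The structure mirrors the proposal: the outer (evolutionary-timescale) loop does not learn any task directly; it shapes how well the inner (lifetime) learner performs across diverse environments.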
Why This Matters More for Agents Than Chatbots
A chatbot deployed in a static Q&A context can function well as a frozen System A artifact. Its world doesn't change much; its users come to it with questions its training set already covered.
An agent is different. An agent is deployed into environments that change, given tasks it hasn't seen, and expected to succeed with minimal human intervention. The gap between "capable at training time" and "useful at deployment time" is precisely the gap this paper is trying to close.
Concretely, here's what the A/B/M framing means for agent builders:
| Problem | Current State | What A/B/M Addresses |
|---|---|---|
| Domain shift | Retrain from scratch | System M triggers System A/B update loop |
| Sparse reward tasks | RL fails without dense signal | System A bootstraps representations for System B |
| Rigid pipelines | Human-designed data recipes | System M automates mode-switching |
| No lifelong learning | Frozen weights at deployment | Both systems learn in-context from the environment |
| Sample inefficiency | RL needs millions of rollouts | System A pre-structures the representation space |
The integration of both systems — with a meta-controller that doesn't require human orchestration — is what closes the gap between a model that performs well on a benchmark and an agent that's actually useful in the field.
The V-JEPA 2.1 Connection: Representations vs. Architecture
V-JEPA 2.1 is explicitly cited in this paper as an example of System A applied to video — a passive observational learner that builds predictive world models from video sequences without language supervision.
The connection between the two works is clean but shouldn't be overstated: they're solving different problems.
- V-JEPA 2.1 answers: "Given video data, can we learn better representations of physical dynamics?" → Yes, using joint-embedding predictive coding in latent space.
- arXiv:2603.15381 answers: "Given a deployed agent, can it keep learning on its own?" → Not yet — here's why, and here's a framework for building toward it.
V-JEPA 2.1's architecture slots naturally into System A of the proposed framework. It's good at building grounded world models from passive observation. But it doesn't tell you when to stop observing and start acting, how to accumulate reward signals, or how to update itself post-deployment. That is System M's job, and it remains an open problem.
The continuity is real: better representations make System A more powerful, which in turn gives System B a richer foundation to build on. But the missing piece — the autonomous learning loop — requires the architectural layer the new paper is proposing.
What's Strong About This Paper
1. The A/B/M framing is actually useful. Not as a deployed system, but as a conceptual lens. It's the first clean vocabulary for discussing how to integrate SSL, RL, and meta-learning in a single coherent architecture. It will be referenced.
2. It names the real problem honestly. Most AI architecture papers frame their limitations as future work. This paper leads with the failure mode: deployed AI systems don't learn. That's honest and important.
3. The cognitive science grounding adds depth. Drawing on child development, trial-and-error learning, and imitation from observation gives the proposed systems grounding beyond pure ML theory. The parallel with how children alternate between watching, trying, and adjusting is illuminating rather than decorative.
4. Bilevel optimization for System M is a principled scaling path. Proposing evolutionary-inspired bilevel optimization to jointly train the meta-controller and system initial states suggests a concrete path to learning-to-learn architectures. This is more actionable than most "future work" sections.
5. The language-centrism critique is well-timed. As LLM pretraining saturates internet text, the argument for grounded, multi-modal, action-oriented learning becomes more pressing. This paper makes that case with rigor.
What's Still Vague / Limitations / Skepticism
System M is underspecified. The paper describes what System M should do — coordinate learning modes, generate internal control signals, automate data filtering — but says little about how to implement it at scale. The bilevel optimization proposal is promising but left largely to future work.
The "autonomous learning" benchmark problem is unresolved. How do you evaluate whether a system is truly learning autonomously vs. still requiring implicit human design choices baked into System M's training? The paper doesn't fully answer this.
Reward function specification isn't escaped, just delayed. System B still requires reward signals or goal-directed feedback. System M may route that, but doesn't eliminate the fundamental challenge of reward specification in open-ended environments.
The transition from roadmap to implementation is steep. The authors are explicit that this is a "high level roadmap." But for an agent builder reading it in 2026, the gap between the elegant framework and deployable code is substantial. There are no benchmark results, no ablations, no training curves.
Cognitive science analogies can mislead. The toddler-as-agent analogy is compelling, but children develop in embodied, richly social environments over years. Translating those developmental dynamics to ML systems that train over GPU-hours is a significant leap. The analogy should inspire hypotheses, not architecture decisions.
⚠️ Uncertainty & Bias Note
This article is based on the publicly available preprint of arXiv:2603.15381. The paper presents a conceptual architecture that has not yet been validated by benchmark results or ablations. Claims about System A, B, and M are the authors' proposed framework — not empirically verified capabilities. The connection to V-JEPA 2.1 reflects the paper's own explicit citation and the author overlap at Meta FAIR, but the two works are at different stages of maturity (V-JEPA 2.1 is a working system; the A/B/M framework is a roadmap). Readers should evaluate the framework as a research direction, not a solved problem.
What Agent Builders Should Take Away
The paper isn't saying "LLMs are dead." It's saying: a deployed AI system that can't learn from its own experience is fundamentally limited — and we have the conceptual vocabulary to build something better.
For agent builders, the concrete takeaways are:
1. Design for post-deployment learning from day one. If your agent architecture assumes frozen weights after deployment, you're building a System A artifact with no update path. At minimum, plan for fine-tuning loops, experience replay, and adaptation mechanisms.
2. The representation quality matters more than you think. System A isn't just "pretraining." A richer, action-grounded world model (like what V-JEPA 2.1 builds) gives System B a much better foundation. Don't treat perception as a solved checkbox.
3. Meta-control is the hard problem. The most important question isn't "can my agent observe" or "can my agent act" — it's "does my agent know when to do which, and can it manage that transition autonomously?" System M doesn't exist in off-the-shelf form. But you can approximate it with explicit mode-switching logic, exploration budgets, and uncertainty-based triggers.
4. Sample efficiency matters at deployment, not just training. System B's sample inefficiency isn't just a research nuisance — it's a deployment cost. Plan for how your agent recovers from distribution shift without requiring 10,000 rollouts.
5. This paper is a vocabulary, not a blueprint. Use A/B/M to structure conversations about what your agent is missing. Is it weak in System A (poor representations)? System B (can't adapt through action)? Or System M (no autonomous mode-switching)? Naming the gap is the first step to fixing it.
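As one concrete instance of the uncertainty-based triggers mentioned in takeaway 3, a simple drift detector can decide when a deployed agent should re-enter a learning mode: fire when the recent mean of a monitored error signal departs from a reference window by several standard deviations. The class name, window sizes, and threshold below are hypothetical choices, not anything specified by the paper.

```python
from collections import deque

class DriftTrigger:
    """Illustrative distribution-shift trigger for a deployed agent.

    Fires when the recent mean of a monitored signal (e.g. prediction
    error) exceeds the reference window's mean by k standard deviations.
    All parameters are hypothetical defaults.
    """

    def __init__(self, ref_window: int = 100, recent_window: int = 10, k: float = 3.0):
        self.ref = deque(maxlen=ref_window)        # in-distribution baseline
        self.recent = deque(maxlen=recent_window)  # sliding recent window
        self.k = k

    def update(self, error: float) -> bool:
        """Record one observation; return True if adaptation should trigger."""
        self.recent.append(error)
        fire = False
        if len(self.ref) >= self.ref.maxlen and len(self.recent) == self.recent.maxlen:
            mu = sum(self.ref) / len(self.ref)
            var = sum((x - mu) ** 2 for x in self.ref) / len(self.ref)
            sd = max(var ** 0.5, 1e-8)
            recent_mu = sum(self.recent) / len(self.recent)
            fire = (recent_mu - mu) / sd > self.k
        if not fire:
            self.ref.append(error)  # only grow the baseline while in-distribution
        return fire
```

A trigger like this is a crude stand-in for System M's internal signals, but it is buildable today: route its output to whatever adaptation loop (fine-tuning, replay, human escalation) your deployment can afford.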
The vision here — an agent that observes, acts, and autonomously switches between learning modes without human engineers running pipelines — is still ahead of us. But the framing from Dupoux, LeCun, and Malik gives agent builders a cleaner map of the terrain they're navigating.
Paper: arXiv:2603.15381 — https://arxiv.org/abs/2603.15381
Authors: Emmanuel Dupoux, Yann LeCun, Jitendra Malik — FAIR at Meta, EHESS, NYU, UC Berkeley