AIRA: Agentic Architecture Discovery Is a Harness Problem Too
A builder-facing reading of arXiv:2605.15871: AIRA-Compose and AIRA-Design show that agents can help discover neural architectures when the design space, evaluator, compute budget, and debugging scaffold are engineered around them.

The paper “Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design” is easy to overread.
The tempting headline is: agents can design better foundation models.
The more useful builder reading is narrower and stronger:
Agents can participate productively in neural architecture discovery when the design space, evaluator, compute budget, and debugging scaffold are engineered around them.
That distinction matters. AIRA is not a magic self-improving model pressing the “upgrade myself” button. It is a carefully constrained research environment where agents propose candidates, run evaluations, receive feedback, and iterate.
For builders of agent systems, that is the part worth studying.
Two workflows, two levels of freedom
The paper introduces two complementary frameworks.
AIRA-Compose is the high-level architecture-search workflow. Agents arrange known computational primitives — Attention, MLP, and Mamba-style blocks — into candidate model architectures. The candidates are evaluated at small scale, then promising designs are aggregated and extrapolated to larger models.
AIRA-Design is the lower-level mechanistic workflow. Agents write code for attention mechanisms and training scripts, then submit executable artifacts that are scored by benchmark harnesses.
This split is useful because it separates two different agent capabilities:
- Combinatorial design over known primitives. Can agents search a large architecture space more flexibly than fixed NAS heuristics?
- Mechanistic code generation and optimization. Can agents implement working model components or training-loop changes under a benchmark evaluator?
Those are not the same task. AIRA-Compose restricts the action space but scales the search. AIRA-Design opens the action space but exposes more code, debugging, and validity failures.
AIRA-Compose: constrained search with measurable wins
AIRA-Compose uses 11 agents to explore arrangements of Attention, MLP, and Mamba primitives under fixed compute budgets. Agents evaluate small candidates first, then the best designs are taken to 350M, 1B, and 3B parameter scales.
The search produces 14 architectures across two families:
- AIRAformers, which are Transformer-based.
- AIRAhybrids, which combine Transformer and Mamba components.
At 1B scale under a fixed token budget, the paper reports that agent-found architectures outperform Llama 3.2 and Composer-found alternatives in the evaluated setup. Some headline numbers:
- AIRAformer-D improves accuracy by 2.4% over Llama 3.2.
- AIRAhybrid-D improves accuracy by 3.8% over Llama 3.2.
- AIRAformer-C scales 54% faster than Llama 3.2 and 71% faster than Composer’s best Transformer.
- AIRAhybrid-C scales 23% faster than modified Nemotron-2 and 37% faster than Composer’s best hybrid.
The practical lesson is not that agents remove the need for architecture expertise. The lesson is that an agent can be a useful search policy inside a well-defined experimental loop.
Traditional NAS often depends on rigid search operators, Bayesian optimization, evolutionary mutations, or predefined templates. An LLM agent can inject semantic priors: it can reason about why certain block arrangements might help, propose structured variations, and adapt based on evaluation feedback.
That is valuable when the search space is far too large to enumerate exhaustively.
AIRA-Design: code generation needs iteration, not one-shot confidence
AIRA-Design is more open-ended. Up to 20 agents write novel attention mechanisms for Long Range Arena tasks and optimize training scripts for Autoresearch-style language-model training.
The best LRA results come close to human state of the art:
- within 2.3% on document matching,
- within 2.6% on text classification.
On Autoresearch, the paper reports that Greedy Opus 4.5 with literature/code context reaches 0.9683 validation BPB, beating the published reference used in the paper.
But the caveats are the real builder lesson.
The paper notes that the LRA designs largely recombine known mechanisms such as Performer, Longformer, and Conformer-style ideas. That is not a failure. Engineering synthesis is useful. But it is different from discovering a new theoretical mechanism.
More importantly: one-shot agents fail to produce valid submissions. The useful agents are the ones inside iterative scaffolds where code can be run, failures can be observed, and candidate solutions can be repaired.
If your agent system does not include execution, validation, and repair, it is not doing research automation. It is doing research-themed text generation.
The harness is doing real work
AIRA is another reminder that “agent capability” is not a property of the model alone.
The system includes:
- task packaging through AIRS-Bench-style interfaces;
- clear submission artifacts;
- automated evaluators;
- GPU/time budgets;
- one-shot and greedy scaffolds;
- invalid-submission handling;
- aggregation and extrapolation after search;
- optional literature and code context.
That infrastructure is not incidental. It defines the reachable behavior of the agent.
For AIRA-Compose, the constrained representation makes the search tractable. Agents output architecture strings over known primitives, so validity is easier to enforce.
For AIRA-Design, the output is executable code. The harness now has to deal with missing imports, numerical instability, out-of-memory errors, framework mismatches, and final-submission selection. The paper’s results show that this harder setting is possible, but much less clean.
The builder conclusion is blunt:
If you want agents to do scientific work, build the lab before praising the scientist.
Practical design patterns from AIRA
Here are the patterns I would carry into agentic ML tooling.
1. Make the action space explicit
AIRA-Compose works partly because the candidate representation is constrained. The agent is not asked to invent an entire framework. It chooses arrangements over known blocks.
Constrained action spaces are not a weakness. They are often what makes agentic search measurable.
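To make the idea concrete, here is a minimal sketch of a constrained candidate representation with a validator. The one-letter encoding (A, M, S), the depth cap, and the names `parse_candidate` and `Candidate` are assumptions for illustration; this is not the paper's actual string format.

```python
from dataclasses import dataclass

# Hypothetical encoding: one letter per block. Not the paper's format.
PRIMITIVES = {"A": "attention", "M": "mlp", "S": "mamba"}
MAX_BLOCKS = 48  # assumed cap on depth, chosen only for the example


@dataclass(frozen=True)
class Candidate:
    blocks: tuple[str, ...]


def parse_candidate(spec: str, max_blocks: int = MAX_BLOCKS) -> Candidate:
    """Parse a spec such as 'A-M-S-M'; reject anything outside the action space."""
    tokens = [t.strip().upper() for t in spec.split("-")]
    if any(t == "" for t in tokens):
        raise ValueError(f"empty block in spec: {spec!r}")
    unknown = sorted(set(tokens) - set(PRIMITIVES))
    if unknown:
        raise ValueError(f"unknown primitives: {unknown}")
    if len(tokens) > max_blocks:
        raise ValueError(f"{len(tokens)} blocks exceeds the cap of {max_blocks}")
    return Candidate(tuple(PRIMITIVES[t] for t in tokens))
```

Because every proposal passes through one parser, an invalid candidate is rejected before any compute is spent on it, and the set of things the agent can express is written down in one place.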
2. Separate proposal from evaluation
The agent can propose. The evaluator decides.
This separation prevents fluent explanations from being mistaken for progress. A candidate either trains, runs, and scores, or it does not.
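A rough sketch of that separation, assuming a hypothetical `propose` callable (in practice an LLM agent) and a hypothetical `evaluate` callable (in practice a training and benchmark harness). The names and signatures are illustrative and do not come from the paper.

```python
from typing import Callable, Optional

# Each history entry is (candidate, score); score is None for invalid candidates.
History = list[tuple[str, Optional[float]]]
Proposer = Callable[[History], str]
Evaluator = Callable[[str], Optional[float]]


def search(
    propose: Proposer, evaluate: Evaluator, rounds: int
) -> tuple[Optional[tuple[str, float]], History]:
    """Run a propose/evaluate loop; only the evaluator's score decides the best."""
    history: History = []
    best: Optional[tuple[str, float]] = None
    for _ in range(rounds):
        candidate = propose(history)      # the agent sees past results, nothing more
        score = evaluate(candidate)       # None means the candidate did not run
        history.append((candidate, score))
        if score is not None and (best is None or score > best[1]):
            best = (candidate, score)
    return best, history
```

The proposer never writes to `best` and never reports its own score. Whatever rationale the agent produces alongside a candidate stays out of the selection path.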
3. Use proxies, but audit proxy drift
Small-scale proxy evaluations make architecture search affordable. They also introduce risk: a design that wins at small scale may not win at 1B or 3B.
Any agentic NAS workflow needs explicit proxy-to-target validation, not just local optimization.
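One simple way to audit proxy drift is to check whether the proxy ranks candidates in the same order as the target scale does. The sketch below computes a Spearman rank correlation in plain Python. Using Spearman here is my choice of check, not a procedure taken from the paper, and the function names are hypothetical.

```python
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(proxy: list[float], target: list[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(proxy) != len(target) or len(proxy) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = ranks(proxy), ranks(target)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (vx * vy) ** 0.5
```

Feed it the small-scale scores and the large-scale scores for the same set of candidates, in the same order. A value near 1 suggests the proxy preserves the ranking; a low or negative value suggests the search was optimizing something the target scale does not reward.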
4. Prefer iterative scaffolds over one-shot generation
AIRA-Design makes this painfully clear. One-shot generation is brittle for low-level ML code. Iteration, debugging, and validation are central capabilities, not optional polish.
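A minimal sketch of an execute, observe, repair loop, to show the shape of such a scaffold. The `repair` callable stands in for an agent call and is hypothetical, as are the function names; the paper's actual scaffold is not reproduced here. Note that a plain subprocess is not a sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_candidate(source: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run source in a fresh Python process; return (ok, stdout or error text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s}s"
    ok = proc.returncode == 0
    return ok, proc.stdout if ok else proc.stderr


def repair_loop(
    source: str,
    repair: Callable[[str, str], str],  # (source, error text) -> new source
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Return (working source, attempts used), or (None, max_attempts) on failure."""
    for attempt in range(1, max_attempts + 1):
        ok, output = run_candidate(source)
        if ok:
            return source, attempt
        source = repair(source, output)  # the agent sees the real traceback
    return None, max_attempts
```

A one-shot scaffold is this loop with `max_attempts=1`: the first traceback is also the last, and no repair is possible.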
5. Log causal changes
The paper notes that full-file regeneration makes attribution difficult in Autoresearch. If an agent changes depth, width, optimizer schedule, loss function, and batch size at once, a better score does not tell you what helped.
A production research agent should support smaller edits, ablations, and change tracking.
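As a sketch of what change tracking could look like, the snippet below diffs a child run's config against its parent and flags runs that change too many things at once. The flat-dict config and the names `diff_config` and `is_attributable` are assumptions for illustration.

```python
from typing import Any

_MISSING = "<missing>"  # marker for a key present on only one side


def diff_config(parent: dict[str, Any], child: dict[str, Any]) -> dict[str, tuple]:
    """Map each changed key to its (old, new) pair."""
    changes = {}
    for key in sorted(set(parent) | set(child)):
        old = parent.get(key, _MISSING)
        new = child.get(key, _MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


def is_attributable(
    parent: dict[str, Any], child: dict[str, Any], max_changes: int = 1
) -> bool:
    """True if the child differs from the parent in 1..max_changes keys."""
    return 0 < len(diff_config(parent, child)) <= max_changes
```

Storing the diff next to each score turns the run log into a list of small experiments. A run that fails `is_attributable` can still be kept, but it should be followed by ablations before its score is credited to any single change.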
6. Treat literature retrieval as a tool, not a guarantee
Adding literature and code context can shift strategies, but it does not uniformly improve results. Retrieval can also distract or overcomplicate the search.
The right question is not “did we give the agent papers?” It is “did the agent integrate evidence into better experimental decisions?”
What not to claim
AIRA is promising, but it should not be marketed as fully autonomous recursive self-improvement.
Several limitations matter:
- AIRA-Compose depends on small proxy evaluations.
- Aggregation and extrapolation remain non-agentic.
- Some searches rely on a single dataset per task.
- AIRA-Design mostly performs engineering synthesis rather than paradigm-shifting invention.
- One-shot agents fail to produce valid submissions.
- Expanded hyperparameter spaces can hurt weaker agents.
- JAX/Flax tasks expose framework proficiency gaps in models trained more heavily on PyTorch.
- Full-file regeneration makes causal attribution hard.
- Literature access is not uniformly helpful.
Those caveats do not erase the result. They make the result usable.
AIRA shows that agents can be effective inside constrained scientific workflows. It does not show that agents can independently run the entire scientific process without carefully designed scaffolding.
Bottom line
AIRA is best read as a systems paper about agentic research infrastructure.
The agents matter. But the task representation, evaluator, scaffold, compute budget, and feedback loop matter just as much.
The near-term path is not “let an LLM redesign itself.” It is:
- expose a meaningful design space;
- make proposals executable;
- evaluate automatically;
- iterate with debugging feedback;
- preserve evidence and causal traces;
- scale only the candidates that survive.
That is less dramatic than recursive self-improvement mythology.
It is also much closer to something builders can actually ship.