Self-Revising Discovery Systems for Agentic AI

Most “AI scientist” demos still treat discovery as the production of a better answer: retrieve the right paper, search a larger hypothesis space, run more simulations, fit a stronger model, or produce a persuasive report. The paper “Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence” by Fiona Y. Wang and Markus J. Buehler argues for a sharper distinction.

Discovery, in their framing, is not merely answer generation. It is the verified revision of the representational regime in which answers, tools, evidence, models, and verifiers are typed.

That sounds abstract, but it addresses a practical agent-builder question: how do we tell the difference between an agent searching inside an existing scientific vocabulary and an agent changing the vocabulary itself?

Retrieval, search, and discovery are different operations

The paper separates three operations that are often blurred together.

Retrieval adds an already representable artifact. If the schema already knows what a paper, protein structure, benchmark result, or simulation output is, retrieval populates an existing type.

Search explores new objects or paths inside a fixed schema. A system may try more candidate models, tune parameters, or compose available operations in a new way, but it remains inside the same scientific vocabulary.

Discovery is stronger. It changes the regime: new types, morphisms, tools, operations, or verifiers become admissible. The agent is not only finding a better point in the old space. It is revising the space in which future scientific work can be represented.

For agent systems, that distinction is crucial. A larger search loop can look impressive while still being bounded by the old artifact grammar. A real discovery move should leave evidence that the old grammar was insufficient and that the new one is verified.
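One rough way to make the three operations concrete is to compare the schema before and after a step. The sketch below is my own simplification, not code from the paper: `Schema`, `Move`, and `classify_move` are hypothetical names, a schema is reduced to a set of type names plus a set of (name, source type, target type) operations, and verification of the new regime is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Move(Enum):
    RETRIEVAL = "retrieval"
    SEARCH = "search"
    DISCOVERY = "discovery"


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for a schema category: types plus allowed operations."""
    types: frozenset
    operations: frozenset  # each entry is (name, source type, target type)


def classify_move(before: Schema, after: Schema, produced_by: Optional[str]) -> Move:
    """Classify one step of an agent by what it did to the schema.

    Assumption: `produced_by` is the name of the operation that generated the
    new artifact, or None when the artifact was imported as-is.
    """
    schema_grew = bool(after.types - before.types) or bool(
        after.operations - before.operations
    )
    if schema_grew:
        # New types or operations became admissible. This is only a candidate
        # discovery until a verifier gate accepts it.
        return Move.DISCOVERY
    if produced_by is None:
        # An already representable artifact populated an existing type.
        return Move.RETRIEVAL
    # A new object or path was generated inside the fixed schema.
    return Move.SEARCH
```

The point of the sketch is only that a search step and a discovery step leave different traces: the first changes the artifact population, the second changes what the population is allowed to contain.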

The system state is a typed artifact population

The paper’s formal language is categorical. A scientific regime has a schema category: objects are artifact types, and morphisms are allowed operations between them. A system state assigns actual artifacts to those types. Provenance is the realized typed artifact graph.

In engineering terms, an agentic science system is not fundamentally a chat transcript. It is a growing population of artifacts with lineage:

datasets;
structures;
simulations;
code;
candidate models;
rejected alternatives;
verifier outputs;
figures;
reports;
public claims.

Each artifact should have a type, parents, an operation that produced it, and a status: proposed, accepted, rejected, superseded, or held for review.
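A minimal sketch of such a record might look like the following. The class and field names (`Artifact`, `ArtifactGraph`, `lineage`, `mark`) are illustrative assumptions on my part, not an API from the paper or from any existing system.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"
    HELD_FOR_REVIEW = "held_for_review"


@dataclass
class Artifact:
    artifact_id: str
    artifact_type: str
    operation: Optional[str] = None  # operation that produced it; None for imported inputs
    parents: tuple = ()              # ids of the artifacts it was derived from
    status: Status = Status.PROPOSED


class ArtifactGraph:
    """Append-only store: statuses change, artifacts are never deleted."""

    def __init__(self) -> None:
        self._artifacts = {}

    def add(self, artifact: Artifact) -> None:
        if artifact.artifact_id in self._artifacts:
            raise ValueError(f"duplicate artifact id: {artifact.artifact_id}")
        missing = [p for p in artifact.parents if p not in self._artifacts]
        if missing:
            raise KeyError(f"unknown parents: {missing}")
        self._artifacts[artifact.artifact_id] = artifact

    def mark(self, artifact_id: str, status: Status) -> None:
        self._artifacts[artifact_id].status = status

    def get(self, artifact_id: str) -> Artifact:
        return self._artifacts[artifact_id]

    def lineage(self, artifact_id: str) -> list:
        """Return the ids of all ancestors, visited breadth-first."""
        seen = []
        queue = deque(self._artifacts[artifact_id].parents)
        while queue:
            parent_id = queue.popleft()
            if parent_id not in seen:
                seen.append(parent_id)
                queue.extend(self._artifacts[parent_id].parents)
        return seen
```

Refusing to add an artifact whose parents are unknown is the design choice that matters here: lineage cannot be reconstructed later if it was never required at write time.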

That is the builder lesson I would bring back to Hermes-like systems: the durable state of an agent should not be “what the model said.” It should be “which typed artifacts were produced, by which admissible operations, under which gates, with what lineage.”

Regime transition needs transport and residual content

A discovery move is modeled as a regime transition from an old schema to a revised schema. The old artifacts are transported into the new regime, then compared with the verified post-transition state. What remains beyond transport is the residual content: the new typed material introduced by the discovery.

This is a good engineering constraint. A self-revising agent should not silently rewrite its world model. It should preserve old evidence, state how old artifacts map into the new schema, and record what was added beyond reinterpretation.
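As a set-level illustration of transport and residual content (a deliberate flattening of the categorical construction, with hypothetical function names and made-up artifact ids), a state can be written as a mapping from type name to artifact ids and the transition as a mapping from old types to new types:

```python
def transport(old_state: dict, type_map: dict) -> dict:
    """Carry every old artifact into the revised schema.

    Assumption: `type_map` sends each old type name to a new type name.
    An old type without an image raises KeyError, so nothing is dropped silently.
    """
    moved = {}
    for old_type, artifacts in old_state.items():
        new_type = type_map[old_type]
        moved.setdefault(new_type, set()).update(artifacts)
    return moved


def residual(transported: dict, verified_state: dict) -> dict:
    """Return what the verified new state contains beyond the transported old state."""
    extra = {}
    for new_type, artifacts in verified_state.items():
        added = set(artifacts) - transported.get(new_type, set())
        if added:
            extra[new_type] = added
    return extra


# Illustrative usage with invented ids.
old_state = {"BFactorProfile": {"b1", "b2"}, "Model": {"m0"}}
type_map = {"BFactorProfile": "BFactorProfile", "Model": "Model"}
verified_state = {
    "BFactorProfile": {"b1", "b2"},
    "Model": {"m0", "m1"},
    "ModeParticipation": {"p1"},
}
new_material = residual(transport(old_state, type_map), verified_state)
# new_material holds the revised model and the artifacts of the new type
```

In this toy version, an empty residual means the step was at most a reinterpretation of old evidence, which is a useful thing for an audit log to be able to say.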

That makes discovery auditable instead of magical.

Case study: Builder/Breaker

The first case study is Builder/Breaker, a protein-mechanics discovery system. The Breaker chooses proteins intended to expose failure modes of the current symbolic model. The Builder proposes edits. A Minimum Description Length gate accepts a candidate only if, after paired refitting on the same accumulated evidence, the revised model compresses the evidence better than the previous model.

That gate matters because it prevents complexity from being accepted for free. A new symbolic law has to pay for its additional bits by explaining counterexamples better.
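To show the shape of such a gate, here is a minimal two-part MDL comparison. It is a sketch under stated assumptions, not the paper's coding scheme: I assume the model cost is supplied in bits by the caller, and I score the data with a Gaussian code on the residuals using the maximum-likelihood variance. Both models must be scored on the same evidence, so the precision constant of the continuous code cancels in the comparison.

```python
import math


def gaussian_data_bits(residuals: list) -> float:
    """Code length in bits of residuals under a Gaussian with ML variance.

    Assumed coding scheme for illustration; values can be negative because
    the shared precision constant is omitted.
    """
    n = len(residuals)
    variance = max(sum(r * r for r in residuals) / n, 1e-12)  # guard against log(0)
    return 0.5 * n * math.log2(2 * math.pi * math.e * variance)


def mdl_gate(prev_model_bits: float, prev_residuals: list,
             new_model_bits: float, new_residuals: list) -> bool:
    """Accept the revised model only if its total description length is shorter."""
    if len(prev_residuals) != len(new_residuals):
        raise ValueError("both models must be refit on the same evidence")
    prev_total = prev_model_bits + gaussian_data_bits(prev_residuals)
    new_total = new_model_bits + gaussian_data_bits(new_residuals)
    return new_total < prev_total
```

With twenty observations, halving every residual saves 20 bits of data cost under this code, so an extra term that costs 8 bits passes the gate while one that costs 35 bits does not.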

The accepted law is called mode-conditioned compliance. In simplified terms, within-chain crystallographic B-factor patterns are best compressed by local elastic compliance conditioned by participation in the slowest collective mode.

The scientific move is not merely “use normal modes.” Normal modes were already available. The discovery is the interaction type: local softness matters most when expressed through a global collective deformation.

The paper is careful about caveats. B-factors include thermal motion, static disorder, refinement effects, and crystal-packing effects. The accepted model is a compact mechanics-based surrogate for within-chain flexibility patterns, not a complete molecular dynamics law.

Case study: CategoryScienceClaw

The second case study is CategoryScienceClaw, a categorical layer over ScienceClaw. Skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become parts of a proof-carrying knowledge–computation graph.

The fiber-network example compares an isotropic fiber-count descriptor with an orientation-tensor anisotropic stiffness surrogate. The Akaike Information Criterion (AIC) gate accepts the orientation-tensor model and rejects the isotropic descriptor. Importantly, the rejected alternative remains in the graph. Failure is not deleted; it becomes provenance.

That is exactly the kind of discipline agent systems need. A good research agent should keep its rejected hypotheses, failed gates, stress tests, and superseded models. Otherwise the final report becomes too clean and the system cannot be audited.
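A gate of this kind can be sketched in a few lines. The version below assumes least-squares fits with Gaussian errors, for which AIC reduces (up to an additive constant shared by all candidates) to n ln(RSS/n) + 2k. The `Candidate` class, the function names, and any numbers passed in are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    n_params: int          # k, number of fitted parameters
    rss: float             # residual sum of squares on the shared dataset
    status: str = "proposed"
    aic: float = float("nan")


def aic_least_squares(rss: float, n_obs: int, n_params: int) -> float:
    """AIC for a Gaussian least-squares fit, dropping the shared constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def aic_gate(candidates: list, n_obs: int) -> list:
    """Score every candidate, accept the lowest AIC, and keep the rest as rejected."""
    for c in candidates:
        c.aic = aic_least_squares(c.rss, n_obs, c.n_params)
    best = min(candidates, key=lambda c: c.aic)
    for c in candidates:
        c.status = "accepted" if c is best else "rejected"
    return candidates  # same list, same length: nothing is deleted
```

The return value is the detail worth copying: the gate annotates candidates instead of filtering them, so the losing model stays available as provenance.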

Practical design principles

The paper suggests several design principles for agent builders.

First, use typed artifacts, not only text memory. Research workflows should make artifact types and parent lineage explicit.

Second, make verifier gates first-class. MDL, AIC/BIC, replay checks, proof certificates, stress tests, human review, and public discourse are not afterthoughts. They decide which artifacts become commitments.

Third, support retraction and supersession. Scientific agents should be able to reject or retire a model without erasing its evidence trail.

Fourth, separate search from discovery in system metrics. More iterations inside a fixed schema are not the same as accepted regime transitions.

Fifth, make self-revision inspectable. If an agent adds a new tool, artifact type, verifier, or grammar production, the system should record why, what old evidence was preserved, and what residual content justified the transition.
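The fourth and fifth principles can share one small log. The following is a sketch with hypothetical names (`RegimeTransition`, `RevisionLog`): search steps are only counted, while a schema change has to arrive with its reason, preserved evidence, and residual content or it is refused.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegimeTransition:
    added: tuple               # new tools, artifact types, verifiers, or grammar productions
    reason: str                # why the old schema was insufficient
    preserved_evidence: tuple  # ids of old artifacts carried into the new schema
    residual_content: tuple    # ids of artifacts that exist only after the transition
    gate: str                  # name of the verifier gate that was applied
    gate_passed: bool


@dataclass
class RevisionLog:
    search_steps: int = 0
    transitions: list = field(default_factory=list)

    def record_search_step(self) -> None:
        self.search_steps += 1

    def record_transition(self, transition: RegimeTransition) -> None:
        # Refuse silent self-revision: every field that makes the change auditable is required.
        if not (transition.reason and transition.preserved_evidence
                and transition.residual_content):
            raise ValueError("transition lacks reason, preserved evidence, or residual content")
        self.transitions.append(transition)

    def metrics(self) -> dict:
        accepted = sum(1 for t in self.transitions if t.gate_passed)
        return {
            "search_steps": self.search_steps,
            "accepted_transitions": accepted,
            "failed_transitions": len(self.transitions) - accepted,
        }
```

Failed transitions are logged alongside accepted ones, so a run with thousands of search steps and zero accepted transitions is reported as exactly that.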

The caveat

This is not a plug-and-play framework for every AI scientist product. The category-theory machinery is heavy, and the case studies are rooted in mechanics and materials science. The paper is best read as a formal and engineering specification for what credible self-revising discovery systems should eventually track.

But that is exactly why it is useful. It raises the standard for “AI scientist” from fluent report generation to typed, auditable, verifier-gated regime revision.

The short version:

Discovery is not just finding a better answer. Discovery is changing the admissible scientific vocabulary, while preserving evidence and passing a gate.

For agent builders, that is the line worth remembering.

Source: Self-Revising Discovery Systems for Science, arXiv:2606.01444 — https://arxiv.org/abs/2606.01444