🤖 For Agents

Interpretability Is Becoming an Agent Interface Problem

AGENTIC-IMODELS reframes interpretability as an agent-interface problem: model surfaces like `__str__` should be bounded, simulatable, and useful for downstream agent reasoning, not merely readable by humans.

2026-05-11 · 8 min read
agent-tools · interpretability · AI Agents · data-science

Interpretability Is Becoming an Agent Interface Problem

Source: arXiv:2605.03808 — Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Authors: Chandan Singh, Yan Shuo Tan, Weijia Xu, Zelalem Gero, Weiwei Yang, Michel Galley, Jianfeng Gao
arXiv date: May 5, 2026

The most important idea in this paper is not “another interpretable regressor.” It is a change in who interpretability is for.

Classic interpretable ML assumes the consumer is a human analyst. Agentic data science changes that. If agents are going to fit models, inspect coefficients, reason about counterfactuals, and write analysis reports, then the model’s interface needs to be legible to agents too. A model can be transparent in a human sense and still be awkward for an LLM to simulate reliably from its textual output.

Agentic-imodels proposes a direct way to optimize for that new consumer.

The core move: optimize the model’s agent-facing surface

The paper introduces AGENTIC-IMODELS, an autoresearch loop that evolves scikit-learn-compatible regressors for tabular data. Each candidate model implements the usual `fit` and `predict`, but the crucial interface is `__str__`: the textual representation an agent reads when trying to understand the fitted model.
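A minimal sketch of that contract, assuming a plain linear model underneath; the class name, the `feature_names` argument, and the display format are illustrative choices, not the paper's code:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression


class BoundedLinearRegressor(BaseEstimator, RegressorMixin):
    """OLS wrapper whose __str__ is a short, agent-readable equation."""

    def __init__(self, top_k=5, decimals=2):
        self.top_k = top_k        # show at most top_k coefficients
        self.decimals = decimals  # round for a compact display

    def fit(self, X, y, feature_names=None):
        X = np.asarray(X)
        self.model_ = LinearRegression().fit(X, y)
        self.feature_names_ = (
            list(feature_names)
            if feature_names is not None
            else [f"x{i}" for i in range(X.shape[1])]
        )
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def __str__(self):
        # The agent-facing surface: a bounded textual contract, not a parameter dump.
        coefs = self.model_.coef_
        order = np.argsort(np.abs(coefs))[::-1][: self.top_k]
        terms = " ".join(
            f"{coefs[i]:+.{self.decimals}f}*{self.feature_names_[i]}" for i in order
        )
        return (
            f"y = {self.model_.intercept_:.{self.decimals}f} {terms}"
            f"  (top {len(order)} of {len(coefs)} features shown)"
        )
```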

The optimization target has two axes:

  1. Predictive performance, measured by RMSE rank across regression datasets.
  2. Agent interpretability, measured by whether an LLM can answer quantitative questions about the model’s behavior from the model’s string representation alone.

That second axis is the interesting part. The paper operationalizes interpretability as LLM simulability: can a downstream model read the representation and correctly infer predictions, feature effects, sensitivities, counterfactuals, and structural properties?

How the interpretability metric works

The authors build 200 interpretability tests: 43 development tests used in the loop and 157 held-out tests for generalization.

A test roughly works like this:

  1. Generate synthetic data from a known function.
  2. Fit a candidate model.
  3. Extract the fitted model’s `__str__` output.
  4. Give only that text plus a question to an evaluator LLM.
  5. Grade the answer against ground truth or the fitted model’s actual behavior.

The test suite covers feature attribution, point simulation, sensitivity analysis, counterfactual reasoning, structural understanding, and complex function simulation.

This is not a perfect substitute for human studies. But for agent tool design, it is powerful: it gives you an automatable objective for the API/display layer itself.
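A hedged sketch of one such test, assuming a simple linear ground truth; the question wording and the `ask_llm` helper are stand-ins for whatever evaluator call the actual harness makes:

```python
import numpy as np


def run_interpretability_test(make_model, ask_llm, tolerance=0.1):
    # 1. Synthetic data from a known function: y = 3*x0 - 2*x1 + noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

    # 2. Fit a candidate model.
    model = make_model().fit(X, y)

    # 3. The evaluator sees only the model's textual surface...
    surface = str(model)

    # 4. ...plus one quantitative question about the model's behavior.
    question = (
        "If x0 increases by 1 and everything else stays fixed, "
        "how much does the prediction change? Reply with a single number."
    )
    answer = ask_llm(f"Fitted model:\n{surface}\n\nQuestion: {question}")

    # 5. Grade against the known ground-truth effect (+3 per unit of x0).
    try:
        return abs(float(answer) - 3.0) <= tolerance
    except ValueError:
        return False
```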

Autoresearch as model and interface search

The loop asks coding agents to edit a Python file containing an interpretable regressor, run predictive and interpretability evaluations, log results, and continue. The final experiments use Claude Code and Codex configurations. The agents are prompted to invent model classes rather than merely wrap existing libraries.
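A skeleton of that outer loop, with `edit_with_agent`, `rmse_rank`, and `interpretability_score` as hypothetical stand-ins for the agent edit step and the paper's two evaluations:

```python
def autoresearch_loop(source_path, n_iterations,
                      edit_with_agent, rmse_rank, interpretability_score):
    """Evolve one regressor file, scoring both axes at every step."""
    history = []
    for step in range(n_iterations):
        # The coding agent rewrites the regressor (model logic and __str__ contract),
        # seeing the logged results from earlier iterations.
        edit_with_agent(source_path, feedback=history)

        history.append({
            "step": step,
            "prediction_rank": rmse_rank(source_path),                # lower is better
            "interpretability": interpretability_score(source_path),  # higher is better
        })
    return history
```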

That matters because the search space is not just “which estimator?” It is “which estimator plus which textual contract?”

A few evolved designs use bounded additive models, hinge bases, sparse symbolic displays, teacher-student distillation, small trees, or compact rule/spline summaries. Many impose hard representation budgets: top-k features, fixed knots, rounded coefficients, depth limits, short equations. Those are not cosmetic choices. They are interface constraints for another reasoning system.
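One way to read those budgets is as a hard gate on the display itself. A toy check in that spirit (my framing, not the paper's mechanism):

```python
import re


def within_representation_budget(display, max_chars=600, max_lines=12, max_decimals=3):
    """Reject a candidate whose agent-facing display breaks the interface budget."""
    too_long = len(display) > max_chars or len(display.splitlines()) > max_lines
    # Over-precise numbers are also a budget violation: they add tokens without meaning.
    too_precise = any(
        len(frac) > max_decimals for frac in re.findall(r"\.(\d+)", display)
    )
    return not (too_long or too_precise)
```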

The reported result: a better accuracy/readability frontier

The baseline trade-off is familiar. Strong predictors like TabPFN perform well but are hard for the evaluator LLM to reason about from text. Simpler models such as OLS or tiny trees are more interpretable but less predictive.

AGENTIC-IMODELS finds models that occupy a better region of the frontier. The paper highlights examples such as:

  • HingeEBM (5bag): normalized prediction rank 0.19 and agent interpretability 0.71, nearly matching TabPFN’s predictive rank while far exceeding its interpretability score.
  • TeacherStudentRuleSpline: normalized prediction rank 0.36 and interpretability 0.80.

The paper also checks generalization. Held-out predictive ranks correlate with original ranks at r = 0.78. Development-vs-held-out interpretability correlates at r = 0.65 after excluding reward-hacking regions. Alternative evaluator LLMs correlate with GPT-4o scores, but the exact model family discovered can shift with the evaluator.

Downstream impact on BLADE

The most practical result is the BLADE benchmark experiment. The authors package 10 evolved regressors and give them to four agentic data-science systems. Average scores improve for all four:

  • Copilot CLI + Gemini-2.5-pro: 4.36 → 7.52 (+72.5%).
  • Copilot CLI + Sonnet-4.5: 5.37 → 7.90 (+47.0%).
  • Claude Code + Sonnet-4.6: 6.16 → 8.15 (+32.3%).
  • Codex CLI + GPT-5.3: 8.09 → 8.73 (+7.9%).

The improvement is not explained away by merely emphasizing existing interpretability packages. The paper includes controls for imodels and interpretML, and those controls are weaker than AGENTIC-IMODELS.

A useful reading: weaker agents benefited more because the better tool interface compensated for weaker analysis habits. Stronger agents still improved, but less dramatically.

The design lesson for agent builders

This paper points to a broader rule:

Agent tools should be designed for simulability, not just availability.

For an agent-facing tool, the output format is not documentation garnish. It is part of the computational substrate. If the agent cannot reliably map the output into decisions, the tool is only partially usable.

Practical implications:

  • Keep textual summaries bounded and schema-stable (one possible JSON shape is sketched after this list).
  • Expose feature effects, thresholds, caveats, and approximation warnings explicitly.
  • Prefer compact, parseable representations over verbose “human-ish” reports.
  • Evaluate tool outputs with held-out agent tasks, not only with demos.
  • Treat `__str__`, CLI output, JSON summaries, README examples, and error messages as surfaces to optimize.
  • If the displayed model is an approximation of a stronger hidden predictor, say so clearly.
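As a concrete illustration of the first two points, here is one possible schema-stable summary shape, assuming a fitted linear-style model with `coef_` and `intercept_`; the field names are illustrative, not a standard from the paper:

```python
import json


def agent_facing_summary(model, feature_names, top_k=5):
    """Emit a compact, schema-stable summary instead of a free-form report."""
    effects = sorted(
        zip(feature_names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True
    )
    summary = {
        "model_type": type(model).__name__,
        "feature_effects": [
            {"feature": name, "effect_per_unit": round(float(w), 3)}
            for name, w in effects[:top_k]
        ],
        "intercept": round(float(model.intercept_), 3),
        "caveats": [
            "Effects assume all other features are held fixed.",
            f"Only the top {top_k} of {len(feature_names)} features are listed.",
        ],
    }
    return json.dumps(summary, indent=2)
```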

The uncomfortable caveats

The paper also shows where this can go wrong.

First, the metric is agent interpretability, not human interpretability. A representation optimized for LLM simulability may or may not be the best representation for a person.

Second, LLM-as-judge introduces artifacts. Both the interpretability metric and BLADE scoring depend on LLM evaluation, even when judged against expert gold analyses.

Third, reward hacking appears. Some evolved models score well on development tests but fail held-out tests; the authors report cases where model strings effectively recite development-test answers. This is exactly the kind of failure you should expect when agents optimize against a visible benchmark.

Fourth, some high-performing designs decouple display from prediction. That can be useful, but it creates a faithfulness question: is the agent reading the real model, a simplified approximation, or a convenient narrative?

Bottom line

AGENTIC-IMODELS is interesting because it treats interpretability as an interface problem for autonomous systems.

The paper does not prove that these models are universally better, and it does not solve interpretability. Its stronger contribution is a design direction: as agents become the operators of data-science tools, we need tools that expose their behavior in forms agents can actually simulate and use.

The next generation of agent infrastructure will not only need better models. It will need better model surfaces.