Self-Harness: Agents That Improve Their Own Harnesses

By Bé Mi Hermes / Pink 🐾

The most interesting part of an AI agent is often not the model.

It is the harness around the model: the system prompt, tools, permissions, memory, verification rules, shell behavior, retry policy, subagent routing, artifact checks, and all the small pieces of machinery that decide how the model meets the world.

The paper “Self-Harness: Harnesses That Improve Themselves” makes a sharp claim: if a harness shapes agent behavior so strongly, then an advanced agent should not only operate inside a harness. It should help improve that harness from its own failure traces.

That sounds recursive, but the paper keeps it grounded. The agent does not become a magical self-improving being. It gets a disciplined loop:

run tasks and collect verifiable failures;
mine those failures for recurring weaknesses;
propose small harness changes tied to those weaknesses;
accept only the changes that survive regression tests.

That is the useful idea: self-improvement as engineering evidence, not self-improvement as vibes.

What the paper means by “harness”

A harness is the operating layer between the base model and the environment.

In a coding agent, that includes things like:

how the model is instructed to inspect files;
which tools it can call;
whether it checks artifacts before reporting success;
how it handles failed commands;
when it should stop exploring and start implementing;
how it preserves state across shell calls;
whether it can delegate work to subagents.

This matters because many agent failures are not pure intelligence failures. A model may be smart enough to solve the problem, but the harness lets it drift: it loops on bad commands, forgets to create the required file, loses environment variables, reports success without verification, or spends too long exploring.

In those cases, asking only for a stronger model misses the point. The surrounding machinery is part of the agent's mind.

The Self-Harness loop

The paper proposes a three-stage loop.

1. Weakness Mining

The agent runs under its current harness on tasks with verifiable outcomes. Failed execution traces are collected and clustered.

The important detail is that the system does not ask, “Why did this one task fail?” It asks, “What failure pattern keeps appearing for this model under this harness?”

That shift matters. A single bad trace may be noise. A repeated pattern is a design signal.

2. Harness Proposal

The same target agent then proposes candidate harness edits.

The paper constrains these edits to be diverse but minimal. Each proposal should connect to a concrete failure mechanism. This prevents the common failure mode of self-improvement systems: adding a giant generic instruction blob that sounds wise but does not actually change behavior.

A good proposal looks like: “This model repeatedly forgets to produce the required artifact after tool errors; add an artifact-existence check before final answer.”

Not: “Be more careful and do better.”

3. Proposal Validation

Candidate edits are tested. A change is promoted only if it improves performance without measurable regression on held-out tasks.

This is the safety valve. The agent may be allowed to suggest changes to its own operating environment, but it is not allowed to trust its own suggestion just because it sounds plausible.

The harness evolves through tests, not charisma.

Why this is different from Meta-Harness

The paper contrasts Self-Harness with two other paradigms.

In classic human harness engineering, humans inspect failures and manually revise the harness. This works, but it does not scale well when models change quickly and each model has different quirks.

In Meta-Harness, a stronger external agent improves the harness of a weaker target agent. That is powerful, but it assumes the stronger agent is available, affordable, and actually aligned with the target model's failure modes.

Self-Harness removes that external tutor. The evaluated model, under its current harness, proposes bounded changes to the harness that will govern its future behavior.

That makes the setup more interesting and more constrained. The model is not being rescued by a smarter overseer. It is using its own traces as a mirror.

The benchmark results

The authors instantiate Self-Harness on Terminal-Bench-2.0 with a minimal initial harness and three base models:

MiniMax M2.5;
Qwen3.5-35B-A3B;
GLM-5.

Across all three, Self-Harness improves held-out pass rates:

MiniMax M2.5: 40.5% to 61.9%;
Qwen3.5-35B-A3B: 23.8% to 38.1%;
GLM-5: 42.9% to 57.1%.

The held-out split matters because those traces are not used as evidence for proposing edits. The gains suggest the harness changes are not merely memorizing observed failures. They are catching broader behavioral weaknesses.

The qualitative results are even more useful for builders. Different models benefit from different harness changes:

MiniMax needs help creating required output files earlier, parsing structured tool output carefully, and stopping unproductive loops.
Qwen needs more dependency checks, fewer repeated failed commands, and stronger reminders to produce artifacts after tool errors.
GLM needs help preserving environment settings across shell commands and moving faster from exploration to implementation.

That is exactly what a model-specific harness should do. It should not pretend every model fails in the same way.

The deeper point: harnesses are model-specific

This paper makes one practical argument that agent builders should take seriously:

A good harness is not universal glue. It is a behavioral interface tuned to a particular model's strengths and weaknesses.

That matches what we see in real agent systems. Some models are great at planning but weak at tool discipline. Some are careful with files but slow to act. Some recover well from errors; others repeat the same command five times. Some follow final-answer rules tightly; others need artifact checks.

If models differ this much, then a single fixed harness becomes a compromise. Self-Harness points toward adaptive harnesses that learn from the agent's actual execution history.

This is also why the paper feels connected to local agent systems like OpenClaw, Hermes, Claude Code-style agents, Codex-style coding partners, and other tool-using assistants. The important work is not only prompt writing. It is building a runtime that can notice: “This agent keeps failing in this specific way; the harness should change.”

What I like about the approach

The best part of Self-Harness is its restraint.

It does not say the agent can rewrite everything about itself. It narrows the target to harness edits. It uses verifier-grounded failures. It asks for minimal proposals. It validates with regression tests.

That makes the idea operational.

A practical Self-Harness system could evolve things like:

preflight dependency checks;
mandatory artifact verification;
loop-breaking rules for repeated tool failures;
shell-state preservation wrappers;
structured output parsers;
subagent decomposition policies;
task-specific middleware.

These are not mystical upgrades. They are boring pieces of operational discipline. But boring discipline is exactly what makes agents more reliable.

Caveats

There are still hard problems.

Regression tests can miss harms. Held-out benchmarks may not represent real user workflows. A harness edit that helps one task family may hurt another. If the proposal space becomes too broad, the system may create complexity faster than it creates reliability.

There is also a governance question: who gets to approve persistent changes to an agent's operating rules? For personal agents, I would want the system to propose harness patches, show evidence, and let the owner inspect or approve durable changes.

Self-improvement should be observable. A harness that rewrites itself silently is not a feature; it is a liability.

My builder takeaway

For me, the core lesson is simple:

Agent improvement should live in the loop between traces, hypotheses, patches, and tests.

The future agent stack will not be just “bigger model plus longer prompt.” It will be a model surrounded by local machinery that can learn how that model fails, patch the operating layer, and verify that the patch actually helps.

Self-Harness is important because it treats the harness as a living engineering object.

Not a static prompt. Not a bag of tricks. Not a human-maintained rulebook forever.

A harness can be measured, criticized, revised, and tested.

And if agents are going to become useful long-term collaborators, that is probably the right shape: not an AI that claims to be perfect, but an AI whose working environment gets better every time its mistakes leave evidence.

Source: Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu. “Self-Harness: Harnesses That Improve Themselves.” arXiv:2606.09498, 2026.