Code as Agent Harness for AI Agents

The paper “Code as Agent Harness” is not announcing one new model, one new benchmark, or one magic agent framework.

Its contribution is more structural:

Code is no longer just something an AI agent writes. It is becoming the operational substrate that lets agents reason, act, model environments, verify progress, preserve state, and coordinate with other agents.

That framing is useful because it moves the conversation away from a lazy question (how smart is the model?) and toward the system question builders actually face:

What kind of executable, inspectable, stateful world are we putting the model inside?

For agent builders, that is the right question.

From code output to code habitat

Most people first meet coding agents through the old frame: the user asks for code, the model generates code, and the result is judged as an output artifact.

That frame is now too small.

Modern agents do not merely emit code at the end of a task. They use code during the task:

scripts to inspect files;
tests to validate behavior;
tool schemas to call external systems;
configs to shape runtime behavior;
logs and traces to diagnose failures;
repositories to preserve evolving state;
simulators and sandboxes to model possible actions;
workflows and checklists to coordinate planning, review, and release.

The paper calls this broader layer an agent harness.

A harness turns a stateless language model into a functional agent by grounding its outputs in execution, persistent state, and verifiable feedback. The authors argue that code is the natural medium for that harness because code has three properties that natural language alone lacks.

It is executable: the system can run it and observe what happens.

It is inspectable: intermediate computations, tool calls, logs, and traces can be examined instead of guessed.

It is stateful: progress can live in files, repos, configs, tests, and artifacts across steps instead of disappearing after one model response.

That is the heart of the paper.

The agent is not just writing in a chat box. It is working inside a programmable workshop.

A useful boundary: not everything is “code”

The paper is broad, but it is careful about one boundary.

By “code,” the authors do not mean only source files. They include executable or machine-checkable artifacts such as programs, scripts, formal specs, proof scripts, API schemas, tool definitions, tests, repositories, simulators, configuration files, and execution traces or logs when those traces are produced or consumed by executable systems.

But they do not treat raw perception, physical state, human intent, or the model’s hidden internal reasoning as code by themselves.

That boundary matters.

A camera frame is not code. A person’s goal is not code. A model’s latent thought is not code.

But those things can be sensed, serialized, checked, transformed, or acted on through code.

This prevents the concept from turning into mush. “Code as harness” is not a metaphor for everything around an agent. It is a claim about the machine-checkable layer that makes selected parts of the agent loop executable, inspectable, and stateful.

The harness interface: reasoning, acting, and environment modeling

The survey organizes the harness interface into three roles.

1. Code for reasoning

Pure natural-language reasoning is fragile because the model must both describe a procedure and mentally execute it.

Code changes that split.

The model can propose a program, proof step, solver call, or executable procedure. The harness can run it, inspect outputs, catch errors, store intermediate states, and feed results back into the next step.

This is the deeper reason program-aided reasoning works. It is not only that Python is better at arithmetic than a language model. It is that execution gives the harness something external to check.
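
To make the propose-then-execute split concrete, here is a minimal sketch of a harness step that runs a model-proposed snippet and turns the outcome into feedback. The names run_candidate, feedback_for_model, and ExecutionResult are hypothetical and do not come from the paper. The sketch assumes the proposal is a short Python snippet, and a child process here is only a way to observe the result, not a security sandbox.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    ok: bool
    stdout: str
    stderr: str
    exit_code: int


def run_candidate(source: str, timeout_s: float = 5.0) -> ExecutionResult:
    """Run a model-proposed snippet in a separate interpreter process.

    The child process gives the harness something external to check.
    It is NOT a security sandbox by itself (illustrative assumption).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ExecutionResult(False, "", f"timed out after {timeout_s}s", -1)
    return ExecutionResult(
        proc.returncode == 0, proc.stdout, proc.stderr, proc.returncode
    )


def feedback_for_model(result: ExecutionResult) -> str:
    """Turn the observed outcome into text for the next model step."""
    if result.ok:
        return f"Execution succeeded. Output:\n{result.stdout.strip()}"
    # Keep only the last stderr line, usually the exception type and message.
    lines = result.stderr.strip().splitlines()
    last_line = lines[-1] if lines else "no error output"
    return f"Execution failed (exit code {result.exit_code}): {last_line}"


if __name__ == "__main__":
    print(feedback_for_model(run_candidate("print(17 * 23)")))
    print(feedback_for_model(run_candidate("print(1 / 0)")))
```

The model never has to "mentally execute" the arithmetic or guess whether the snippet crashed. The harness observes both and reports back.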

2. Code for acting

An agent also needs to do things: call APIs, edit files, click UI elements, run commands, control robots, or invoke tools.

Here, code is the action interface. It translates high-level intent into grounded operations.

That sounds simple until the operations have side effects. Then the harness must answer harder questions:

Which tools are exposed?
What arguments are allowed?
Which actions require approval?
Where does execution happen?
How are outputs sanitized?
What gets written into memory?
What verifier decides whether the action was acceptable?

This is where agent engineering stops being prompt decoration and becomes runtime design.
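
A few of those questions (which tools are exposed, what arguments are allowed, which actions require approval) can be sketched as a small gate in front of the tool layer. ToolSpec and ToolGate are hypothetical names for illustration, the tool names are made up, and the sketch assumes Python 3.9 or later. Real harnesses would also validate argument values, not just argument names.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    name: str
    allowed_args: frozenset
    requires_approval: bool = False


class ToolGate:
    """Decides whether a proposed tool call may proceed (illustrative only)."""

    def __init__(self, specs: list) -> None:
        self._specs = {spec.name: spec for spec in specs}

    def check(
        self, tool: str, args: dict, approved: bool = False
    ) -> tuple:
        spec = self._specs.get(tool)
        if spec is None:
            # Tools that were never registered are simply not exposed.
            return False, f"tool not exposed: {tool}"
        unexpected = set(args) - spec.allowed_args
        if unexpected:
            return False, f"unexpected arguments: {sorted(unexpected)}"
        if spec.requires_approval and not approved:
            return False, "approval required"
        return True, "allowed"


if __name__ == "__main__":
    gate = ToolGate(
        [
            ToolSpec("read_file", frozenset({"path"})),
            ToolSpec("delete_file", frozenset({"path"}), requires_approval=True),
        ]
    )
    print(gate.check("read_file", {"path": "notes.txt"}))
    print(gate.check("delete_file", {"path": "notes.txt"}))
    print(gate.check("shell", {"cmd": "ls"}))
```

The design choice worth noticing is that the default is denial: a tool the harness did not register cannot be called, no matter how the model phrases the request.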

3. Code for environment modeling

Agents need some representation of the world they are changing.

In software tasks, that world may be a repository, dependency graph, test suite, issue tracker, or build log. In GUI/OS automation, it may be a DOM tree, accessibility tree, screenshots, and action history. In scientific discovery, it may be a simulation, experiment pipeline, or analysis notebook.

The point is the same: the environment becomes more usable when it is represented through executable or inspectable artifacts.

A vague textual observation says, “something failed.”

A good harness says, “this command failed with this exit code, this stack trace, this changed file, this test coverage gap, and this rollback point.”

That difference is the difference between vibes and engineering.

The boring plumbing is the product

One of the strongest parts of the paper is its treatment of harness mechanisms.

Once code enters the agent loop, the task is not simply “generate correct code from a prompt.” The system has to coordinate:

planning;
memory and context engineering;
tool use;
feedback-driven control;
optimization;
permission tiers;
sandbox boundaries;
verification oracles;
audit logs;
human-review gates.

The paper’s framing is blunt in a useful way: dynamically authored code does not replace human-designed harness infrastructure.

An agent may write scripts, tools, tests, or workflow helpers. That can make the system adaptive. But the larger harness still decides what may be executed, trusted, persisted, reused, or promoted into future workflows.

That distinction is critical.

A self-modifying agent without governance is not “adaptive.” It is an unreviewed deployment pipeline with a language model inside.

Tool use shows this clearly. A reliable harness does not just let the model pick a tool and hope. It controls the tool lifecycle:

before execution: permission checks, policy rules, argument validation, and approval gates;
during execution: sandboxing, resource limits, isolation, and observability;
after execution: output sanitization, log compaction, memory updates, trace storage, and verification.

This is not glamorous. It is also where most real reliability comes from.
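
The before, during, and after phases can be sketched as one wrapper around every tool call. Everything here is a hypothetical illustration: governed_call, sanitize, and TraceEntry are made-up names, the policy is passed in as a plain callable, and the "during" phase only contains errors and measures time. Real sandboxing and resource limits need operating-system or container support that this sketch does not attempt.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TraceEntry:
    tool: str
    args: dict
    status: str        # "denied", "error", or "ok"
    output: str
    duration_s: float


def sanitize(text: str, limit: int = 200) -> str:
    """Drop non-printable characters and truncate long output."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    if len(cleaned) <= limit:
        return cleaned
    return cleaned[:limit] + "...[truncated]"


def governed_call(
    tool: str,
    fn: Callable[..., Any],
    args: dict,
    is_allowed: Callable[[str, dict], bool],
    trace: list,
) -> TraceEntry:
    # Before execution: policy check.
    if not is_allowed(tool, args):
        entry = TraceEntry(tool, args, "denied", "", 0.0)
        trace.append(entry)
        return entry

    # During execution: contain failures and measure duration.
    start = time.monotonic()
    try:
        raw = fn(**args)
        status = "ok"
    except Exception as exc:  # the harness records the failure instead of crashing
        raw = f"{type(exc).__name__}: {exc}"
        status = "error"
    duration = time.monotonic() - start

    # After execution: sanitize output and store the trace.
    entry = TraceEntry(tool, args, status, sanitize(str(raw)), duration)
    trace.append(entry)
    return entry


if __name__ == "__main__":
    trace = []
    allow_echo_only = lambda name, args: name == "echo"
    governed_call("echo", lambda text: text.upper(), {"text": "hi"}, allow_echo_only, trace)
    governed_call("rm", lambda path: None, {"path": "/"}, allow_echo_only, trace)
    for entry in trace:
        print(entry.tool, entry.status, repr(entry.output))
```

Denied and failed calls land in the same trace as successful ones, so later diagnosis does not depend on the model remembering what it tried.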

Memory is not “more context”

The paper’s memory section is especially relevant for long-running agents.

A memory system is not just a bigger context window or a vector database. It is a state-management layer that decides:

what belongs in active context;
what should be compacted;
what should be offloaded to durable storage;
what evidence should be retrieved;
what history is safe to reuse;
what knowledge should be promoted into future workflows.

That framing is healthy because many agent failures are memory failures disguised as reasoning failures.

The agent forgot a constraint. It retrieved stale evidence. It summarized a log incorrectly. It reused an old decision after the environment changed. It wrote unverified output into durable memory. It repeated a search because prior work was not preserved in a usable form.

A harness-level memory system should preserve provenance, support compaction, retrieve structurally relevant evidence, and keep active context focused on the next decision.

The scarce resource is not only tokens. It is trustworthy state.
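
One way to picture memory as state management rather than "more context" is a store that tags every record with provenance and keeps unverified output out of durable memory. This is a minimal sketch under stated assumptions: MemoryRecord and HarnessMemory are hypothetical names, "freshness" is reduced to a step-count window, and the sketch assumes Python 3.9 or later. The paper does not prescribe this design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    content: str
    source: str     # provenance: which tool run or artifact produced this
    step: int       # agent step at which it was recorded
    verified: bool  # whether some verifier confirmed it


class HarnessMemory:
    """Separates verified, durable state from unverified scratch notes."""

    def __init__(self) -> None:
        self.durable: list = []
        self.scratch: list = []

    def write(self, record: MemoryRecord) -> None:
        # Unverified output never reaches durable memory.
        target = self.durable if record.verified else self.scratch
        target.append(record)

    def active_context(
        self, current_step: int, max_age: int, limit: int = 5
    ) -> list:
        # Only recent, verified records are offered to the next decision.
        fresh = [r for r in self.durable if current_step - r.step <= max_age]
        fresh.sort(key=lambda r: r.step, reverse=True)
        return fresh[:limit]


if __name__ == "__main__":
    memory = HarnessMemory()
    memory.write(MemoryRecord("tests pass on main", "pytest run", 1, True))
    memory.write(MemoryRecord("bug is probably in the parser", "model guess", 8, False))
    memory.write(MemoryRecord("parser fails on empty input", "pytest run", 9, True))
    for record in memory.active_context(current_step=10, max_age=3):
        print(record.step, record.source, record.content)
```

The stale step-1 record and the unverified guess both stay out of active context, which addresses two of the failure modes listed above: reusing old decisions and promoting unverified output.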

Multi-agent systems need shared program state, not just group chat

The multi-agent section is where the paper becomes very practical.

A single agent has obvious limits:

one context window cannot hold the whole codebase, history, and trace;
one generalist is inefficient at planning, coding, testing, reviewing, and debugging at once;
one agent has weak independent verification of its own work.

Multi-agent systems respond by splitting roles: manager, planner, coder, tester, reviewer, executor, verifier.

But there is a trap.

If multi-agent collaboration is only a chat transcript, the system can become more theatrical than reliable. Agents may agree, debate, or critique each other while still losing the actual state of the task.

The paper’s stronger idea is a shared code-centric harness substrate: a persistent shared program environment where agents coordinate through artifacts that can be inspected and updated.

That substrate can include files, repos, tests, execution results, blackboards, databases, issue states, coverage reports, profilers, simulations, and validation signals.

In human terms: a good engineering team does not coordinate only by talking in a meeting. It coordinates through a repo, issue tracker, CI, logs, design docs, tests, review comments, and release checklists.

Agent teams need the same kind of shared ground.

Otherwise, “multi-agent” just means multiple hallucination surfaces wearing job titles.
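
As a rough illustration of coordinating through shared state instead of a transcript, here is a sketch of a blackboard with versioned writes. Blackboard and its methods are hypothetical names. The sketch is single-process with no locking, and it uses a simple compare-version rule as a stand-in for the much harder transactional and conflict-resolution problems the paper leaves open.

```python
from typing import Any


class Blackboard:
    """Shared key-value state where a write must name the version it read."""

    def __init__(self) -> None:
        self._values: dict = {}
        self._versions: dict = {}
        self.history: list = []  # (author, key, new version)

    def read(self, key: str) -> tuple:
        # Missing keys read as (None, 0).
        return self._values.get(key), self._versions.get(key, 0)

    def write(self, key: str, value: Any, expected_version: int, author: str) -> bool:
        # Reject writes based on a stale read so agents cannot silently
        # overwrite each other's updates.
        if self._versions.get(key, 0) != expected_version:
            return False
        new_version = expected_version + 1
        self._values[key] = value
        self._versions[key] = new_version
        self.history.append((author, key, new_version))
        return True


if __name__ == "__main__":
    board = Blackboard()
    _, coder_version = board.read("test_status")
    _, tester_version = board.read("test_status")

    print(board.write("test_status", "patch applied", coder_version, "coder"))
    # The tester's view is now stale, so this write is rejected.
    print(board.write("test_status", "3 failing", tester_version, "tester"))

    _, latest = board.read("test_status")
    print(board.write("test_status", "3 failing", latest, "tester"))
    print(board.history)
```

The rejected write is the useful part. A chat transcript would happily hold both claims side by side; the shared artifact forces the second agent to look at the current state before changing it.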

Evaluation has to move beyond final success

The open-problems section is a useful warning against shallow benchmarks.

Final task success is not enough.

An agent may pass a visible test suite while exploiting a weak oracle. A GUI agent may complete a scripted task while taking unsafe intermediate actions. A scientific agent may run a simulation successfully without producing scientifically valid conclusions. A coding agent may fix one bug while quietly damaging maintainability, security, or future state.

The paper argues for harness-level evaluation: metrics that inspect the operational substrate, not only the final answer.

Useful dimensions include:

trajectory efficiency: tool calls, edits, executions, latency, token cost;
verification strength: coverage, oracle diversity, false-acceptance rate;
recovery ability: diagnosing and repairing failures;
context sustainability: whether the agent keeps the right state over long tasks;
safety: permission use, risky actions, human-review points;
reproducibility: whether the same harness can replay or audit a result;
coordination quality: whether multiple agents converge on consistent shared state.

This is the right direction.

If we only score agents by “did it finish,” we miss the difference between a careful engineer and a lucky script kiddie.
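
A couple of these dimensions can be computed directly from a recorded trajectory. The sketch below is illustrative only: Step, trajectory_metrics, and false_acceptance_rate are hypothetical names, the step records are made-up data, and the false-acceptance definition used here (the share of incorrect results that the oracle still accepted) is one reasonable choice rather than the paper's official formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    kind: str               # e.g. "tool_call", "edit", "execution"
    tokens: int
    failed: bool = False
    recovered: bool = False  # a failed step that the agent later repaired


def trajectory_metrics(steps: list) -> dict:
    """Summarize efficiency and recovery over one recorded trajectory."""
    failures = [s for s in steps if s.failed]
    recovered = sum(1 for s in failures if s.recovered)
    return {
        "tool_calls": sum(1 for s in steps if s.kind == "tool_call"),
        "total_tokens": sum(s.tokens for s in steps),
        "recovery_rate": recovered / len(failures) if failures else None,
    }


def false_acceptance_rate(verdicts: list) -> Optional[float]:
    """verdicts holds (accepted_by_oracle, actually_correct) pairs."""
    accepted_flags = [accepted for accepted, correct in verdicts if not correct]
    if not accepted_flags:
        return None  # no incorrect results, so the rate is undefined
    return sum(accepted_flags) / len(accepted_flags)


if __name__ == "__main__":
    steps = [
        Step("tool_call", 120),
        Step("execution", 40, failed=True, recovered=True),
        Step("edit", 200),
        Step("execution", 40, failed=True),
        Step("tool_call", 60),
    ]
    print(trajectory_metrics(steps))
    print(false_acceptance_rate([(True, True), (True, False), (False, False)]))
```

None of these numbers appear in a pass/fail score, which is the point: two agents with the same final result can have very different trajectories.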

The dangerous phrase: self-evolving harness

The paper also discusses adaptive harness optimization: using telemetry to improve the harness itself.

That is powerful. It is also risky.

Deep telemetry can show where the loop fails: bad retrieval, brittle tool schemas, weak validators, bloated context, poor retry policy, slow commands, missing permissions, unclear handoffs, or noisy logs.

A system can then revise prompts, tools, memory policies, routing rules, evaluation suites, sandbox settings, or workflow topology.

This is where “agentic harness engineering” becomes interesting.

But self-evolution without regression control is dangerous. A harness update can improve one benchmark while weakening safety, increasing cost, corrupting memory, hiding failures, or breaking a rare but important workflow.

The paper’s open problems around regression-free improvement, transactional shared state, semantic conflict resolution, and human-in-the-loop accountability are not footnotes. They are the difference between a useful adaptive system and an agent that slowly turns its own guardrails into spaghetti.

The right pattern is not “let the agent rewrite itself.”

It is:

collect traces;
localize failure modes;
propose narrow harness changes;
test against regression suites;
preserve rollback points;
require human approval for high-impact changes;
record why the change was promoted.

Evolution needs a harness too.
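
The promotion steps in that list can be sketched as a single gate. This is a hypothetical illustration, not a mechanism from the paper: HarnessChange and decide_promotion are made-up names, the metric names and scores are invented, and the sketch assumes every regression metric is "higher is better". Rollback itself is out of scope here; the gate only decides whether a change may be promoted and records why.

```python
from dataclasses import dataclass


@dataclass
class HarnessChange:
    description: str
    rationale: str      # recorded so the promotion can be audited later
    high_impact: bool


def decide_promotion(
    change: HarnessChange,
    baseline: dict,
    candidate: dict,
    human_approved: bool = False,
    tolerance: float = 0.0,
) -> tuple:
    """Return (decision, reason) for a proposed harness change.

    A metric missing from the candidate results counts as a regression.
    """
    regressions = sorted(
        name
        for name, base in baseline.items()
        if candidate.get(name, float("-inf")) < base - tolerance
    )
    if regressions:
        return "rejected", "regressions in: " + ", ".join(regressions)
    if change.high_impact and not human_approved:
        return "pending_review", "high-impact change needs human approval"
    return "promoted", change.rationale


if __name__ == "__main__":
    baseline = {"task_success": 0.80, "safety_checks": 1.00}
    change = HarnessChange("shorter retry backoff", "fewer stalled runs in traces", False)

    print(decide_promotion(change, baseline, {"task_success": 0.85, "safety_checks": 1.00}))
    # Better on one benchmark, worse on safety: rejected.
    print(decide_promotion(change, baseline, {"task_success": 0.90, "safety_checks": 0.95}))
```

The second call is the case the section warns about: the change improves one number while weakening another, and the gate refuses it regardless of the headline gain.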

What this means for builders

The practical lesson is simple but not easy:

Stop treating the model as the whole agent.

A production agent is the model plus the harness around it.

That harness includes tools, policies, memory, context windows, retrieval, execution environments, tests, logs, sandboxes, permissions, review gates, shared state, and deployment discipline.

A stronger model can help. But a stronger model inside a sloppy harness will still behave sloppily.

For builders, the checklist after this paper is clear:

Make important actions executable and observable.
Keep state outside the model when it needs to survive.
Treat tool use as a governed lifecycle, not a raw function call.
Verify with multiple oracles where possible.
Separate proposal from execution from review.
Preserve traces so failures can be diagnosed.
Give multi-agent systems shared artifacts, not only chat.
Evaluate the harness, not only the final answer.
Make self-improvement narrow, testable, and reversible.

That is less flashy than “autonomous agents will do everything.”

It is also closer to how dependable systems actually get built.

What not to overclaim

This is a survey and positioning paper, not a proof that one architecture solves agent reliability.

“Code as agent harness” is a useful lens, but it does not magically make agents safe. Execution feedback can still be incomplete. Tests can be weak. Logs can be misleading. Shared state can diverge. Tool permissions can be too broad. A code-based interface can serialize the wrong thing very precisely.

The strongest reading is not “code solves agents.”

The strongest reading is:

Agent reliability is increasingly a harness-engineering problem, and code is the medium that makes that harness executable, inspectable, and stateful.

That is enough.

It gives builders a map.

And maps matter when everyone is trying to build agents inside fog.