A new review paper is pushing a subtle but important reframing of AI agents: the thing doing most of the real work may not be the model’s raw output, but the code wrapped around it.
That code layer, which the authors call the harness, sits between a language model and the outside world. It is where tools get called, memory gets updated, execution gets sandboxed, permissions get checked, and feedback loops are stitched together. In that view, an agent is not just a model that talks convincingly. It is a system that can reason over time, act through tools, and collaborate through software that makes those behaviors durable.
The paper, from researchers at the University of Illinois Urbana-Champaign, Meta, and Stanford, argues that this harness-centric framing is more than terminology. It changes where capability comes from, where bottlenecks appear, and what teams should measure when they ship agents into production.
The harness is the missing layer
The review organizes agents into three layers: model capabilities, harness infrastructure, and environment interfaces.
That structure matters because it separates what a model can potentially do from what an agent can actually accomplish in a deployed setting. A model may be strong at language prediction or code generation, but without a harness it remains stateless. It can emit text, yet it cannot reliably persist memory, route work through tools, or recover from partial failure.
The harness is the operational layer that turns outputs into action. It coordinates execution loops, tool selection, testing, permission boundaries, memory, and the logic that decides when the system should ask for help, stop, retry, or escalate. It is also where collaboration emerges: not from the model alone, but from the surrounding software that lets multiple steps, tools, and even multiple agents interact over time.
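To make the idea concrete, here is a minimal sketch of such a loop in Python. It is not taken from the paper: the Harness class, the action dictionary format, and the scripted_model stand-in are all hypothetical, and a real system would replace scripted_model with an actual model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    # Hypothetical contract: the model sees the memory so far and returns
    # either {"tool": name, "arg": text} or {"final": text}.
    model: Callable[[list], dict]
    tools: dict
    allowed: set
    max_steps: int = 5
    memory: list = field(default_factory=list)

    def run(self, task: str) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(self.max_steps):
            action = self.model(self.memory)
            if "final" in action:
                return action["final"]
            name = action["tool"]
            # Permission boundary: the harness, not the model, decides.
            if name not in self.allowed or name not in self.tools:
                self.memory.append(f"denied: {name}")
                continue
            try:
                result = self.tools[name](action["arg"])
            except Exception as exc:  # feed failures back instead of crashing
                result = f"error: {exc}"
            self.memory.append(f"{name} -> {result}")
        # Step budget exhausted: hand off rather than loop forever.
        return "escalate: step budget exhausted"


def scripted_model(memory: list) -> dict:
    """Stand-in for a real model: tries a forbidden tool, then recovers."""
    if any(line.startswith("add ->") for line in memory):
        return {"final": memory[-1].split("-> ")[1]}
    if any(line.startswith("denied: shell") for line in memory):
        return {"tool": "add", "arg": "2,3"}
    return {"tool": "shell", "arg": "rm -rf /"}


def add(arg: str) -> str:
    return str(sum(int(x) for x in arg.split(",")))


harness = Harness(model=scripted_model, tools={"add": add}, allowed={"add"})
print(harness.run("add 2 and 3"))  # prints 5
```

Even in this toy version, the memory, the permission check, the error handling, and the step budget all live outside the model.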
That is why the paper treats code itself as a medium of agent behavior. Code is executable, so it can be checked. It is testable, so failures can be reproduced. It is stateful, so it can sustain long-running tasks that extend beyond a single prompt-response cycle.
The practical implication is straightforward: if you want better agents, model upgrades are only one lever. A lot of the useful work moves into infrastructure.
Why this changes product architecture
For teams building agent products, the paper’s framing points to a different roadmap than the usual “pick a stronger model” instinct.
The first investment is the harness itself: orchestration code, tool wrappers, state management, and reliable environment interfaces. If the agent needs to search a database, write to a ticketing system, call an internal API, or run code in a container, those integrations are not peripheral. They are the product.
That shifts attention to the quality of the surrounding runtime. Sandboxed execution becomes a core design choice rather than a nice-to-have. Permission models have to be explicit. Tool contracts need to be stable and observable. Long-running workflows need durable state and recovery paths. If the harness is brittle, the whole agent becomes brittle, even if the underlying model is excellent.
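One way to picture durable state and a recovery path is a checkpoint that is written after every completed step, so a restarted workflow resumes instead of starting over. The sketch below is an illustration only: the function names, the JSON file layout, and the fail_at parameter used to simulate a crash are assumptions, not anything described in the paper.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never observe a half-written file


def load_checkpoint(path: Path, default: dict) -> dict:
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return dict(default)


def run_workflow(path: Path, steps: list, fail_at=None) -> dict:
    """Run steps in order, checkpointing after each one.

    fail_at is a hypothetical knob that simulates a crash before a step.
    """
    state = load_checkpoint(path, {"next_step": 0, "done": []})
    for i in range(state["next_step"], len(steps)):
        if fail_at == i:
            raise RuntimeError(f"simulated crash before step {i}")
        state["done"].append(steps[i])  # a real harness would do work here
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state
```

If a run crashes partway through, calling run_workflow again with the same path picks up at the first step that was not checkpointed, without repeating the earlier ones.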
It also changes how teams think about rollout. A feature flag on a prompt is not the same as a controlled deployment of an agent that can take actions. Product owners need to know whether the harness can constrain scope, whether every tool call is logged, and whether the system can be paused or rolled back when it starts behaving unexpectedly.
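A rough sketch of what "every tool call is logged" and "the system can be paused" might look like in code follows. The AuditedTools wrapper, its record fields, and the in-memory list used as a log sink are hypothetical choices made for illustration.

```python
import json
import time


class PausedError(RuntimeError):
    """Raised when a tool call is attempted while the agent is paused."""


class AuditedTools:
    def __init__(self, tools: dict, sink: list) -> None:
        self.tools = tools
        self.sink = sink      # stand-in for an append-only audit log
        self.paused = False   # operator-controlled switch

    def _log(self, record: dict) -> None:
        self.sink.append(json.dumps(record, default=str))

    def call(self, name: str, **kwargs):
        record = {"ts": time.time(), "tool": name, "args": kwargs}
        if self.paused:
            # Blocked calls are logged too, so the trail has no gaps.
            record["outcome"] = "blocked_paused"
            self._log(record)
            raise PausedError(f"agent paused, refused call to {name}")
        try:
            result = self.tools[name](**kwargs)
            record["outcome"] = "ok"
            return result
        except Exception as exc:
            record["outcome"] = f"error: {type(exc).__name__}"
            raise
        finally:
            self._log(record)
```

The point of routing every call through one wrapper is that the pause switch and the log cannot be bypassed by a particular tool integration.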
In other words, agent development starts to look less like chatbot tuning and more like systems engineering.
Metrics will have to move off model output alone
The paper also implies a measurement problem. If the harness is where agents actually do work, then evaluating only model quality leaves out a large share of system behavior.
That means success metrics should expand beyond answer accuracy or benchmark scores on model outputs. Teams will need to track harness health: tool reliability, execution latency, retry rates, sandbox failures, permission denials, state corruption, and the quality of handoffs between model and environment.
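As a simple illustration, several of these signals can be computed from a stream of per-call events. The ToolEvent fields and the metric names below are assumptions chosen for the sketch, not a schema proposed by the paper.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ToolEvent:
    tool: str
    ok: bool
    latency_ms: float
    retries: int = 0
    denied: bool = False  # blocked by a permission check before running


def harness_health(events: list) -> dict:
    """Summarize tool reliability, retries, denials, and latency."""
    if not events:
        return {}
    attempted = [e for e in events if not e.denied]
    summary = {
        "calls": len(events),
        "denial_rate": sum(e.denied for e in events) / len(events),
    }
    if attempted:
        summary["success_rate"] = sum(e.ok for e in attempted) / len(attempted)
        summary["mean_retries"] = sum(e.retries for e in attempted) / len(attempted)
        summary["median_latency_ms"] = median(e.latency_ms for e in attempted)
    return summary
```

Denied calls are kept out of the success and latency figures here, on the assumption that a permission denial is the harness working as intended rather than a tool failure.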
Interface reliability matters too. If an agent depends on a web action, a file operation, or an internal service call, the real question is not just whether the model can propose the right step. It is whether the interface allows that step to happen safely and consistently.
For long-running or stateful agents, end-to-end evaluation becomes especially important. A model can look strong in isolation and still fail once exposed to partial completion, unexpected tool output, or a broken environment contract. The harness is where those failures surface.
That makes infrastructure metrics first-class product metrics.
Safety and governance live in the harness
The paper’s most consequential point may be that the harness is also where governance happens.
Security boundaries, data handling rules, audit trails, and compliance controls are all embedded in the execution layer, not in the model weights. If an agent can access private files, call external services, or act on behalf of a user, the harness decides what it may touch and how those actions are recorded.
That is why sandboxing matters so much. A model that can write code or call tools needs a constrained runtime with clear limits on file access, network access, and side effects. Verifiable tests become part of the safety story, because the execution path can be checked instead of assumed. Auditable logs matter because post-hoc explainability is far less useful than a record of what the system actually did.
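A limit on file access can be sketched as a path check that every file tool must pass before touching disk. This is only one narrow piece of a sandbox: the function below is hypothetical, it does not isolate processes or the network, and real deployments would typically pair it with OS-level or container-level isolation.

```python
from pathlib import Path


class SandboxViolation(PermissionError):
    """Raised when a requested path falls outside the sandbox root."""


def resolve_in_sandbox(root: Path, requested: str) -> Path:
    """Return the resolved path if it stays under root, else raise.

    resolve() collapses ".." segments and follows symlinks, so the check
    is made against where the path really points. Requires Python 3.9+
    for Path.is_relative_to.
    """
    root = root.resolve()
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise SandboxViolation(f"{requested!r} escapes sandbox root")
    return candidate
```

For example, a request for "notes/a.txt" resolves inside the root, while "../secret" or an absolute path such as "/etc/passwd" raises SandboxViolation.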
The paper’s framing also suggests a more realistic view of risk. Many agent failures will not come from a model suddenly becoming malicious or brilliant. They will come from ordinary software problems: a permissive tool wrapper, a missing guardrail, a stale state snapshot, or a feedback loop that keeps retrying the wrong action.
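The last of those failure modes, a loop that keeps retrying the wrong action, is the kind of thing a few lines of harness code can catch. The sketch below counts failures per identical tool call and refuses to try again past a threshold; RetryGuard, call_with_guard, and the status strings are hypothetical names for illustration.

```python
import json
from collections import Counter


class RetryGuard:
    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self._failures = Counter()

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # Identical tool + arguments map to the same key.
        return json.dumps([tool, args], sort_keys=True, default=str)

    def allow(self, tool: str, args: dict) -> bool:
        return self._failures[self._key(tool, args)] < self.max_failures

    def record_failure(self, tool: str, args: dict) -> None:
        self._failures[self._key(tool, args)] += 1


def call_with_guard(guard: RetryGuard, name: str, tool, args: dict):
    """Return (status, result); status is 'ok', 'failed', or 'escalate'."""
    if not guard.allow(name, args):
        return ("escalate", None)  # same call already failed too often
    try:
        return ("ok", tool(**args))
    except Exception:
        guard.record_failure(name, args)
        return ("failed", None)
```

A call with different arguments is still allowed, so the guard blocks repetition of a known-bad action without blocking the tool as a whole.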
That puts the burden on architecture, not just policy.
What to watch next
If this view keeps spreading, the next wave of agent work may focus less on larger frontier models and more on standardizing the software layer.
Expect more attention on harness interfaces: how tools are described, how permissions are expressed, how state is serialized, and how execution traces are exposed for testing and review. Benchmarks may also begin to move toward code-level reasoning and full-system evaluation rather than isolated prompt tasks.
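What a harness interface of this kind might look like is still open, but a small sketch helps show the moving parts: a declarative tool description, the permissions it requires, and a serialized form that can be reviewed or tested. The ToolSpec fields, the type names, and the "tickets:write" permission string below are invented for illustration and do not correspond to any existing standard.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical mapping from declared type names to Python types.
_TYPES = {"string": str, "integer": int}


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    params: dict          # parameter name -> declared type name
    requires: tuple = ()  # permissions the caller must hold


def missing_permissions(spec: ToolSpec, granted: set) -> list:
    return sorted(set(spec.requires) - granted)


def validate_args(spec: ToolSpec, args: dict) -> list:
    """Return a list of problems; an empty list means the call is valid."""
    errors = []
    for name in sorted(spec.params):
        if name not in args:
            errors.append(f"missing: {name}")
        elif not isinstance(args[name], _TYPES[spec.params[name]]):
            errors.append(f"wrong type: {name}")
    for name in sorted(set(args) - set(spec.params)):
        errors.append(f"unexpected: {name}")
    return errors


spec = ToolSpec(
    name="create_ticket",
    description="Open a ticket in the tracker.",
    params={"title": "string", "priority": "integer"},
    requires=("tickets:write",),
)
serialized = json.dumps(asdict(spec), sort_keys=True)  # reviewable artifact
```

Because the description is plain data, the same spec can drive argument validation, permission checks, and documentation, which is the kind of shared contract standardization would aim at.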
That would be a meaningful shift. It would mean judging agents not only by what they say, but by whether the code around them can reliably turn those statements into controlled action.
For an industry that has spent much of the past two years talking about agent capability as if it were mostly a model scaling problem, the message here is more grounded: the next bottlenecks may sit in harnesses, sandboxes, and interfaces. And that is where much of the real product work now appears to be.



