Self-Harness boosts tool-using agent performance up to 60% by rewriting its own rules

A Shanghai team replaces ad hoc harness tuning with a self-improving loop that uses traces and regression tests.

By Khalid Al-Harbi, Business Desk, The Executives Brief

1 day ago·5 min read

Executive summary

Researchers at the Shanghai Artificial Intelligence Laboratory introduced Self-Harness, a framework that lets LLM-based agents systematically update their own operating rules. For decision-makers, it means agent reliability can improve automatically, but it comes with extra compute and evaluation overhead.

Tool-using AI agents got a reality check: the performance bottleneck is often not the base model, but the “harness” wrapped around it. Self-Harness, introduced by researchers at the Shanghai Artificial Intelligence Laboratory, tackles that directly by letting an LLM-based agent examine its own execution traces and iteratively edit its operating rules. In their evaluations on Terminal-Bench-2.0, they report relative performance improvements ranging from 33% to 60% across different models on held-out tasks, driven by harness updates that only get accepted when regression testing shows no unacceptable regressions.

So what exactly is changing, and who cares? The harness is the surrounding system that provides context and enables an agent to interact with its environment, including system prompts, tools, memory, verification rules, runtime policies, orchestration logic, and failure-recovery procedures. The paper’s central claim is that many common agent failures stem from the harness rather than the model: agents “reporting success without checking” (for example, without running code to confirm that tests pass), or repeatedly retrying a failed action instead of switching strategies. Instead of relying on manual, ad hoc debugging and intuition, Self-Harness turns those failures into measurable feedback and uses an iterative loop to propose, validate, and merge harness edits.

This matters for enterprises because most organizations do not (and usually should not) build their own frontier language model from scratch. But they often can, and should, customize the agent harness for their specific tasks. In practice, that customization has been painful. Agent harnesses are still largely tuned through manual trial and error, which gets more expensive as models evolve rapidly. Hangfan Zhang, lead author of the Self-Harness paper, concedes that “in many cases, an experienced engineer with deep domain knowledge can still propose better changes than an LLM can today.” The deeper issue, he argues, is that the current harness-engineering paradigm lacks a systematic feedback loop. Edits often come from intuition, a few observed failures, or ad hoc debugging, not from a verifiable empirical process.

Self-Harness is designed to remove the need for constant human intervention, or for a stronger external model, to perform harness tuning. The framework uses a three-stage iterative loop that converts behavioral evidence into harness updates:

First is weakness mining. Starting from an initial harness, the agent runs a set of tasks and produces execution traces with verifiable outcomes. The agent categorizes failed traces and tries to detect model-specific failure patterns.

Next is harness proposal. Using a “proposer” role, the system generates diverse but minimal harness modifications, with each change tied to a specific failure mechanism to avoid overly general corrections.

Finally comes proposal validation. Candidate modifications are evaluated through regression tests. An edit is promoted only if it improves performance without causing measurable degradation on held-out tasks. If multiple candidates pass, they are merged into the next harness version for the next iteration.
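The three stages can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Harness`, `Task`, `mine_weaknesses`, `propose_edits`, `validate`, and `self_harness_step` are hypothetical, a harness is modeled as nothing more than a set of rule strings, and a task "passes" by a toy lookup rather than by running a real agent.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Harness:
    # Toy assumption: a harness is just a set of named rules.
    rules: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Task:
    name: str
    needed_rule: Optional[str] = None  # rule the task needs in order to pass
    broken_by: Optional[str] = None    # rule that makes the task fail


def run_task(harness: Harness, task: Task) -> bool:
    """Toy stand-in for executing the agent and checking a verifiable outcome."""
    has_needed = task.needed_rule is None or task.needed_rule in harness.rules
    is_broken = task.broken_by is not None and task.broken_by in harness.rules
    return has_needed and not is_broken


def pass_rate(harness: Harness, tasks: List[Task]) -> float:
    return sum(run_task(harness, t) for t in tasks) / len(tasks)


def mine_weaknesses(harness: Harness, tasks: List[Task]) -> List[str]:
    """Stage 1: group failed runs by failure mechanism (here, the missing rule)."""
    return sorted({t.needed_rule for t in tasks
                   if t.needed_rule is not None and not run_task(harness, t)})


def propose_edits(weaknesses: List[str]) -> List[str]:
    """Stage 2: one minimal candidate edit per failure mechanism."""
    return list(weaknesses)


def validate(harness: Harness, edit: str,
             failing: List[Task], held_out: List[Task]) -> bool:
    """Stage 3: accept only if the edit helps and causes no held-out regression."""
    candidate = Harness(harness.rules | {edit})
    improves = pass_rate(candidate, failing) > pass_rate(harness, failing)
    no_regression = pass_rate(candidate, held_out) >= pass_rate(harness, held_out)
    return improves and no_regression


def self_harness_step(harness: Harness,
                      train: List[Task], held_out: List[Task]) -> Harness:
    """One iteration: mine, propose, validate, then merge the accepted edits."""
    candidates = propose_edits(mine_weaknesses(harness, train))
    accepted = [e for e in candidates if validate(harness, e, train, held_out)]
    return Harness(harness.rules | set(accepted))
```

In this toy setup, an edit that fixes one training task but breaks a held-out task is rejected, while an edit that fixes a task without side effects is merged. That is the "no unacceptable regressions" gate in miniature.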

That loop is not just academic. The paper’s example describes an automated issue-fixing agent that reads internal documentation, writes patches, and opens pull requests. If the company changes documentation style, the agent might fail in ambiguous ways. Self-Harness aims to make those failures actionable by using failure traces to pinpoint where the agent is misusing the new documentation format, then producing a targeted harness edit, while an evaluator checks whether the edit improves failing cases without regressing others.

In the experiments, the researchers evaluated Self-Harness on Terminal-Bench-2.0, a benchmark that tests general tool-based execution, including artifact management, command use, verification behavior, and recovery from execution errors. They applied Self-Harness with MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Crucially, to isolate harness effects, they started from a minimal harness built on DeepAgent SDK with only the benchmark-facing system prompt and default filesystem and shell tools. They kept the model backend, tool set, benchmark environment, and evaluator unchanged while only the harness varied.

The performance results are where the headline promise lands. On held-out tasks, relative improvements ranged from 33% to 60%, depending on the model. The explicit acceptance rule matters because it is what prevents “fix one thing, break another” behavior from slipping into the harness. The reported examples show what that looks like in practice:
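Because the reported gains are relative rather than absolute, a short calculation helps keep the two apart. The pass rates below are made-up illustrative numbers, not figures from the paper, and `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Illustrative only: a pass rate moving from 25% to 40% is a gain of
# 15 percentage points in absolute terms, but 60% in relative terms.
baseline, improved = 0.25, 0.40
print(f"absolute: {(improved - baseline) * 100:.0f} points")
print(f"relative: {relative_improvement(baseline, improved):.0%}")
```

A 60% relative improvement therefore says nothing on its own about the absolute pass rate; the baseline matters.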

For MiniMax M2.5, the baseline harness could get stuck endlessly exploring dataset configurations until the execution environment timed out, failing to produce deliverables. Self-Harness identified that failure mode and wrote a “loop breaker” into the runtime policy, forcing the agent to stop and redirect after 50 tool calls. It also added a rule to create required artifacts as early as possible.

For Qwen3.5-35B-A3B, the baseline behavior included hitting a file overwrite error and then blindly retrying the same command repeatedly, eventually deleting necessary files out of confusion before stopping. Self-Harness introduced strict command-retry discipline by forbidding exact duplicate commands, plus a mechanism to immediately recreate missing artifacts if a file error occurred.

For GLM-5, the system struggled to preserve environment changes across different commands, often wasting time on massive downloads or finalizing tasks even when sanity checks were failing. The self-generated harness added rules to persist PATH variables across shell sessions, limit external compute, and repair failed sanity checks before concluding a run.
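The three kinds of self-generated rules described above can be pictured as small runtime guards. This is a sketch under assumptions, not the harness code from the paper: the class names `ToolCallBudget`, `RetryGuard`, and `PersistentEnv` are hypothetical, and the real rules may well live in prompts or policies rather than in Python objects. Only the 50-call limit comes from the reported MiniMax M2.5 example.

```python
import os
import subprocess
from typing import Dict, List, Set


class ToolCallBudget:
    """Loop breaker: signal a stop-and-redirect once the call limit is hit."""

    def __init__(self, limit: int = 50) -> None:
        self.limit = limit
        self.calls = 0

    def record_call(self) -> bool:
        """Count one tool call; return True when the agent must redirect."""
        self.calls += 1
        return self.calls >= self.limit


class RetryGuard:
    """Retry discipline: refuse to re-run a command string that already failed."""

    def __init__(self) -> None:
        self._failed: Set[str] = set()

    def record_failure(self, command: str) -> None:
        self._failed.add(command)

    def allows(self, command: str) -> bool:
        # Exact string match only; a changed flag or path counts as a new command.
        return command not in self._failed


class PersistentEnv:
    """Environment persistence: carry PATH edits into every later command."""

    def __init__(self) -> None:
        self.env: Dict[str, str] = dict(os.environ)

    def prepend_path(self, directory: str) -> None:
        current = self.env.get("PATH", "")
        self.env["PATH"] = directory + os.pathsep + current if current else directory

    def run(self, argv: List[str]) -> subprocess.CompletedProcess:
        # Each call starts a fresh process, so the stored env is passed explicitly.
        return subprocess.run(argv, env=self.env, capture_output=True, text=True)
```

The common design point is that each guard encodes one narrow failure mechanism, which matches the article's description of edits being tied to a specific failure rather than applied as broad corrections.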

Now for the part executives should read twice: automated harness evolution has hidden costs. Replacing human engineering with trial-and-error does not come for free. Zhang explained that Self-Harness replaces part of the human engineering burden with repeated proposal generation, parallel candidate evaluation, and regression testing. That can mean more API tokens, more latency during optimization, and more infrastructure for running evaluation tasks. In other words, Self-Harness may reduce the need for constantly paying expert hours to tweak brittle policies, but it shifts cost into compute and measurement.

For board members and operators, the second-order implication is that “agent reliability” becomes more like a software engineering process with test gates, not a vibe-check exercise. It also raises procurement questions: if internal agent harnesses can improve automatically, teams will demand consistent evaluation harnesses, standardized regression suites, and enough infrastructure to support iterative tuning without creating runaway expense. As agent use moves from demos to production workflows, the harness layer becomes the control plane. Self-Harness is a bet that the control plane can be self-maintaining, with measurable acceptance criteria. The strategic stakes are simple: in a world where model releases keep accelerating, the teams that can iterate harness logic quickly and safely will ship more dependable automation, and the teams that cannot will keep paying for manual debugging just to keep agents from face-planting in the newest edge cases.
