Harness-1 hits 73% recall, beating GPT-5.4 with a 20B open-source search agent
A new “state-externalizing” agent design lifts retrieval accuracy, and it’s already available under Apache 2.0.

Researchers at UIUC, UC Berkeley, and Chroma released Harness-1, a 20-billion parameter open-source search agent built on OpenAI’s gpt-oss-20B model. It posts a 73% average recall score, surpassing GPT-5.4 and showing that the environment around the model can matter as much as model size.
A new open-source search agent just posted a result that should worry anyone betting retrieval quality is mainly a “bigger model” problem. Harness-1, built by researchers at the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and open-source vector database platform Chroma, achieves 73% average recall on a curated dataset. That beats GPT-5.4 at 70.9% and outpaces the next most accurate open-source search agent, Tongyi DeepResearch 30B, by 11.4 percentage points.
The other punchline is how quickly teams can start playing with it. The model and its environment are available immediately under the highly permissive Apache 2.0 license, and the model code and weights are on Hugging Face. The researchers say Harness-1 is a redesign of how AI runs complex retrieval tasks, not just another model checkpoint. And because it’s open, the “wait and see” phase gets shorter for competitors and enterprise builders who have been waiting for a credible alternative to expensive proprietary systems.
So what is Harness-1 actually doing differently? The researchers frame the core issue as “search amnesia,” where typical search agents forget their original queries, loop over rejected documents, or lose track of the specific claims they are trying to verify. In many existing systems, the fix is brute force: engineers force models to constantly reread and append their own actions into an ever-growing transcript, leaning on massive context windows to keep enough state around.
Harness-1 takes a different approach. It introduces what the team calls a “state-externalizing harness,” an active surrounding environment that carries the bookkeeping for the agent instead of relying on the model’s own context to hold it. The environment maintains a recoverable working memory that includes a candidate pool of documents, an importance-tagged curated evidence set, compact evidence links, and verification records. Meanwhile, the policy still handles the “human-like decisions”: what to search, which documents to keep, and when to stop. In other words, the model is freed up to focus on selecting and verifying evidence, while the software structure handles persistent state.
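To make the division of labor concrete, here is a minimal Python sketch of the kind of bookkeeping described above. It is an illustration only, not the researchers’ implementation: the class names, fields, importance tags, and the rule that only verified evidence can be promoted are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One curated item. All fields are hypothetical, for illustration."""
    doc_id: str
    claim: str
    importance: str  # assumed tag values, e.g. "high" or "low"
    verified: bool


@dataclass
class HarnessState:
    """Environment-side state the model does not have to remember."""
    query: str
    candidates: dict = field(default_factory=dict)  # doc_id -> snippet
    rejected: set = field(default_factory=set)
    curated: dict = field(default_factory=dict)     # doc_id -> Evidence

    def add_candidates(self, docs):
        """Add search results, skipping documents already seen or rejected."""
        new_ids = [d for d in docs
                   if d not in self.candidates and d not in self.rejected]
        for doc_id in new_ids:
            self.candidates[doc_id] = docs[doc_id]
        return new_ids

    def reject(self, doc_id):
        """Drop a candidate and remember the rejection so it cannot loop back."""
        self.candidates.pop(doc_id, None)
        self.rejected.add(doc_id)

    def promote(self, doc_id, claim, importance, verified):
        """Move a candidate into the curated set, but only if it was verified."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown candidate: {doc_id}")
        if not verified:
            return False
        self.curated[doc_id] = Evidence(doc_id, claim, importance, verified)
        return True

    def summary(self):
        """Compact view handed to the model each turn instead of a transcript."""
        return (f"query={self.query!r} candidates={len(self.candidates)} "
                f"rejected={len(self.rejected)} curated={sorted(self.curated)}")
```

In this sketch the policy would see only the short `summary()` string plus fresh tool output each turn, so the original query and the list of rejected documents survive no matter how long the episode runs.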
This is not just theory. The researchers evaluated Harness-1 and competitors across eight complex search benchmarks that go beyond trivia. The tests require the AI to act like a real researcher sifting through dense sources, including open web searches, complex financial filings from the SEC, technical patent databases from the USPTO, and multi-hop question-answering tasks that require logically piecing together scattered clues from multiple documents. Harness-1 dominates the open-source competition in its ability to find and curate the right facts, and it also “goes toe-to-toe” with large proprietary systems.
The comparison set matters. The researchers report that Harness-1 outperforms GPT-5.4, Sonnet-4.6, and Kimi-K2.5, models believed to have hundreds of billions or even trillions of parameters. Only one frontier model, Opus-4.6, narrowly edges it out on overall average performance. The team notes one caveat: although GPT-5.5 has been out for more than a month, they did not test against it because it was not available while they were building their system.
Under the hood, the results come from how the model is trained to use the harness instead of learning to memorize the entire agent session. The team contrasts this with the typical approach of training search agents as policies over massive, ever-growing transcripts, where reinforcement learning (RL) must optimize both semantic reasoning and raw memorization of the search state. Harness-1 changes the division of labor: because the harness handles routine bookkeeping, training can focus on teaching the structured interface behaviors. The pipeline starts with supervised fine-tuning (SFT) on a narrow set of 899 filtered trajectories generated by a GPT-5.4 teacher agent operating in the exact same harness environment. This SFT stage is described as teaching mechanical rhythms, like how to format tool calls, tag documents by importance, and verify claims before promoting them into the curated set.
After SFT, Harness-1 undergoes reinforcement learning using an algorithm called CISPO over full search episodes capped at 40 turns. The researchers designed a terminal reward function that explicitly separates discovery from selection. The model is rewarded for finding relevant documents and for successfully promoting them into the final answer set, with penalties when it finds an answer but fails to curate it. They also add a “tool diversity” bonus because, without it, the policy tends to collapse into a lazy strategy: spamming queries but bypassing reading and verification.
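The shape of such a reward can be sketched in a few lines of Python. This is a hedged illustration of the idea described above, not the published reward function: the weights, the normalization by the number of relevant documents, the assumed tool count, and the function name are all invented for the example.

```python
def terminal_reward(relevant, found, curated, tools_used,
                    w_find=0.5, w_curate=1.0, w_miss=0.25,
                    w_diversity=0.1, n_tools=4):
    """Score one finished episode. All weights are illustrative assumptions."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    curated_rel = relevant & set(curated)
    # Anything curated must have been found at some point in the episode.
    found_rel = (relevant & set(found)) | curated_rel
    found_not_curated = found_rel - curated_rel

    discovery = len(found_rel) / len(relevant)       # did the agent find it?
    selection = len(curated_rel) / len(relevant)     # did it keep it?
    missed = len(found_not_curated) / len(relevant)  # found but never promoted
    diversity = min(len(set(tools_used)) / n_tools, 1.0)

    return (w_find * discovery + w_curate * selection
            - w_miss * missed + w_diversity * diversity)


# Two hypothetical episodes with identical retrieval but different tool use.
balanced = terminal_reward({"a", "b"}, {"a", "b", "c"}, {"a"},
                           ["search", "read", "verify"])
spammy = terminal_reward({"a", "b"}, {"a", "b", "c"}, {"a"},
                         ["search"] * 5)
print(balanced > spammy)
```

Because discovery and selection are scored separately, an agent that finds the right document but never promotes it earns less than one that curates it, and the diversity term gives a small edge to episodes that read and verify instead of only issuing queries.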
Finally, there’s the data-efficiency headline that will be interesting to anyone tracking cost and scaling constraints. The entire training run used roughly 4,400 unique items, including the 899 SFT trajectories. The point is less about raw parameter count and more about squeezing more retrieval competence out of less data by putting the model in a structured environment designed for state management.
There is also a second thread here: Harness-1 doubles as proof-of-efficacy for a separate infrastructure effort called Tinker, a distributed, web-based AI model training and fine-tuning API developed by Thinking Machines. The researchers used Tinker specifically to train and run inference for Harness-1. That matters strategically because it suggests the industry’s next leap might not be only about architecture. It can also be about operational tooling that makes interactive, stateful agent execution reliable at scale.
For enterprise decision-makers, this is the real reckoning: if retrieval quality can improve by redesigning the “harness” and state management around an agent, then vendors competing on model size alone may be overconfident. For boards, it reframes where investment risk sits. The winners may be the teams that build the environment, tooling, evaluation harnesses, and licensing and model access needed to turn agents from impressive demos into verifiable, repeatable research workflows. Harness-1 is available now, so the competitive pressure starts immediately.