Harness as Environment

How harness design determines whether agents actually adapt.

#AI #Agent #Harness #Memory #Adaptation #RAG #Skills #LLM

The previous post explored why agent memory systems are splitting into semantic, episodic, and procedural layers, but stacking memory layers — no matter how sophisticated — does not guarantee that a system actually adapts. A vector database packed with thousands of past experiences will not change an agent's behavior if the structural mechanism for converting records into adaptation is missing. That mechanism is the environment that defines what the agent can modify, how to judge whether a change succeeded, and how far to roll back when it fails. We call it a harness. A harness is not a test runner or a CI pipeline — it is the execution environment that each memory layer needs underneath it to function as a substrate for adaptation.

Karpathy's autoresearch shows the harness in its thinnest form. prepare.py is locked; only train.py can be modified. A five-minute wall-clock budget caps cost, and a single metric — val_bpb — decides whether a change survives. What can be changed, how to measure success, what to revert on failure — just three conditions, yet without them no amount of model capability produces system-level adaptation. The same structure drove the Arize RAG case. A Claude-based loop started at 39% Recall@5 and ran 17 experiments over eight hours — chunking strategies, BM25/RRF weights, HyDE, multi-query expansion, a reranker — reaching 75%. The termination condition was not "all stories completed" but "Recall@5 ≥ 80%," so the loop chased the target on its own inside an environment where the evaluation function was solid and the modification boundary was clear. The harness did not make the model smarter; it made changes trackable and failures recoverable.
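The three conditions above (what can change, how success is measured, what gets reverted) can be sketched as a keep-or-revert loop. This is a minimal illustration, not the autoresearch or Arize code: `run_harness`, `toy_recall`, `scripted_propose`, and the config keys are hypothetical names, the metric is a made-up higher-is-better score standing in for Recall@5, and a lower-is-better metric such as val_bpb would need the comparison flipped.

```python
import copy
from typing import Callable, Dict, List, Set, Tuple

Config = Dict[str, float]


def run_harness(
    config: Config,
    editable: Set[str],
    propose: Callable[[Config, int], Config],
    evaluate: Callable[[Config], float],
    target: float,
    max_experiments: int,
) -> Tuple[Config, float, List[Tuple[int, float, bool]]]:
    """Keep a change only if the metric improves; otherwise fall back to the best config."""
    best = copy.deepcopy(config)
    best_score = evaluate(best)
    history: List[Tuple[int, float, bool]] = []
    for step in range(max_experiments):
        if best_score >= target:  # terminate on the metric, not on a task list
            break
        candidate = propose(copy.deepcopy(best), step)
        # Modification boundary: any edit outside the editable keys is rejected unevaluated.
        changed = {k for k in set(best) | set(candidate) if best.get(k) != candidate.get(k)}
        if not changed <= editable:
            history.append((step, best_score, False))
            continue
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score  # promote the change
        history.append((step, score, kept))      # otherwise `best` is untouched (rollback)
    return best, best_score, history


def toy_recall(cfg: Config) -> float:
    """Hypothetical metric that peaks at chunk_size=256, top_k=8."""
    return 1.0 - abs(cfg["chunk_size"] - 256) / 1024 - abs(cfg["top_k"] - 8) / 16


# Scripted stand-in for the model's proposals, including one out-of-bounds edit.
EDITS = [("chunk_size", 256), ("eval_seed", 1), ("top_k", 2), ("top_k", 8)]


def scripted_propose(cfg: Config, step: int) -> Config:
    key, value = EDITS[step % len(EDITS)]
    cfg[key] = value
    return cfg


best, score, history = run_harness(
    {"chunk_size": 512, "top_k": 5, "eval_seed": 0},
    editable={"chunk_size", "top_k"},
    propose=scripted_propose,
    evaluate=toy_recall,
    target=0.95,
    max_experiments=10,
)
print(best, score, [kept for _, _, kept in history])
```

The loop never asks how the proposal was generated. It only checks the boundary, scores the candidate, and keeps or discards it, which is why the same skeleton would work with any proposer.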

A recent DeepMind result follows the same structure. Rather than building a dedicated proving system like AlphaProof, a simple loop of Gemini plus Lean compiler feedback resolved 9 long-open Erdős problems and 44 OEIS conjectures. The target artifact is a formal proof rather than Python code, and the feedback comes from a compiler rather than a numeric metric, but the harness skeleton is identical: the model produces an output, an immediate and precise verdict comes back, and the loop repeats. That a general model with a simple loop, rather than a purpose-built system, got this far on open problems is close to a bitter-lesson result. As models grow more capable, there is a growing consensus that micromanaging them with step-by-step instructions and elaborate prompt scaffolding hurts more than it helps, and recent prompting guides reflect this. The problem with a harness tuned to one model's behavioral quirks is that it has to be retuned every time the model changes. Feedback quality, by contrast, is model-agnostic. The Lean compiler does not care which model version generated the proof, and neither do val_bpb or Recall@5. The thickness of a harness is not the thickness of its constraints; it is the thickness of its feedback.
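The generate, verify, repeat skeleton can be sketched with a verifier that returns a textual verdict instead of a score. This is an illustration under loud assumptions: Python's built-in `compile()` plus one check stands in for the Lean compiler, `scripted_model` is a hypothetical stand-in for the model, and none of this reflects how the DeepMind system is actually built.

```python
from typing import Callable, List, Optional, Tuple

Verdict = Tuple[bool, str]


def verify(source: str) -> Verdict:
    """Toy verifier: the candidate must compile and define add(a, b) correctly."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return False, f"syntax error: {err.msg}"
    namespace: dict = {}
    exec(code, namespace)
    add = namespace.get("add")
    if not callable(add):
        return False, "missing function: add"
    result = add(2, 3)
    if result != 5:
        return False, f"wrong result: add(2, 3) returned {result!r}"
    return True, "ok"


def feedback_loop(
    generate: Callable[[List[str]], str],
    check: Callable[[str], Verdict],
    max_rounds: int,
) -> Tuple[Optional[str], List[str]]:
    """Ask for a candidate, return on the first accepted one, else feed the verdict back."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(feedback)
        ok, message = check(candidate)
        if ok:
            return candidate, feedback
        feedback.append(message)
    return None, feedback


# Hypothetical stand-in for a model: one scripted attempt per round of feedback.
ATTEMPTS = [
    "def add(a, b) return a + b",           # does not compile
    "def plus(a, b):\n    return a + b",    # compiles, wrong name
    "def add(a, b):\n    return a - b",     # right name, wrong behavior
    "def add(a, b):\n    return a + b",     # accepted
]


def scripted_model(feedback: List[str]) -> str:
    return ATTEMPTS[min(len(feedback), len(ATTEMPTS) - 1)]


accepted, verdicts = feedback_loop(scripted_model, verify, max_rounds=5)
print(accepted)
print(verdicts)
```

`feedback_loop` knows nothing about which generator it is driving. Swapping in a different one changes nothing on the verifier side, which is the sense in which feedback quality is model-agnostic.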

The Brain / Hands / Session separation emerging in long-running agents extends the same idea. Hands — the execution environment — are designed to be disposable, while Session preserves thoughts, tool calls, observations, and results as an append-only log that outlives any single container. When the execution environment goes down, a new one reconstructs state from the session log. The Ralph loop implements this at the filesystem level — prd.json for the plan, progress.txt for working notes, AGENTS.md for rules accumulated during execution. The model itself is stateless, but the harness holds state, making adaptation possible across sessions.
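A rough sketch of the Session idea: events go into an append-only JSON Lines file, and a fresh execution environment rebuilds its working state by replaying that file. The event kinds, the `session.jsonl` filename, and the shape of the reconstructed state are assumptions made for illustration; they are not the format used by the Ralph loop or any specific agent runtime.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def append_event(log_path: Path, kind: str, payload: Dict[str, Any]) -> None:
    """Append one event; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"kind": kind, "payload": payload}) + "\n")


def rebuild_state(log_path: Path) -> Dict[str, Any]:
    """Replay the log from the start to reconstruct what a new worker needs to know."""
    state: Dict[str, Any] = {"completed": [], "notes": [], "last_observation": None}
    if not log_path.exists():
        return state
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            event = json.loads(line)
            kind, payload = event["kind"], event["payload"]
            if kind == "result" and payload.get("ok"):
                state["completed"].append(payload["task"])
            elif kind == "thought":
                state["notes"].append(payload["text"])
            elif kind == "observation":
                state["last_observation"] = payload["text"]
    return state


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "session.jsonl"

    # First execution environment: does some work, then disappears.
    append_event(log, "thought", {"text": "start with the schema migration"})
    append_event(log, "tool_call", {"name": "run_tests", "args": []})
    append_event(log, "observation", {"text": "2 tests failing"})
    append_event(log, "result", {"task": "schema-migration", "ok": True})
    append_event(log, "result", {"task": "fix-tests", "ok": False})

    # Second execution environment: holds no memory, only the log.
    state = rebuild_state(log)
    print(state)
```

Nothing in the second environment depends on the first having survived, which is what makes the Hands disposable.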

Treating a skill as "a prompt that describes a capability" cannot explain why it remains reusable across sessions. Skills that actually generalize are closer to procedures that survived a harness's validation. In DSPy, prompts are not written but compiled: a Signature declares the input-output contract, and an optimizer iterates under an evaluation function to produce the best-performing artifact. MACLA compressed 2,851 ALFWorld trajectories into 187 hierarchical procedures via Bayesian confidence estimation. SkillOpt treats skill documents as text-space parameters, running rollouts, reflection, and held-out validation. The progression from DSPy to SkillOpt shows the harness automating an ever-wider scope: without touching a single model weight, an evaluation-edit-promotion loop over external state is enough to improve the system. AGENTS.md, a Claude Code skill, and SkillOpt's best_skill.md differ only in whether a human or a harness performed the validation. How skill and harness stay conceptually distinct yet overlap in implementation is taken up separately in "Skill and Harness."
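An evaluation-edit-promotion loop over a skill document might look roughly like the sketch below. It is not the DSPy or SkillOpt API: the rule vocabulary (`strip`, `upper`, `always:`), the `run_skill` executor, and the promotion rule are all hypothetical, chosen so that one edit memorizes the training tasks and gets caught by held-out validation.

```python
from typing import Callable, List, Sequence, Tuple

Task = Tuple[str, str]  # (input text, expected output)


def run_skill(skill: str, text: str) -> str:
    """Toy executor: apply each rule line of the skill document in order."""
    for rule in skill.splitlines():
        if rule == "strip":
            text = text.strip()
        elif rule == "upper":
            text = text.upper()
        elif rule.startswith("always:"):
            text = rule[len("always:"):]  # a memorizing rule that ignores the input
    return text


def accuracy(skill: str, tasks: Sequence[Task]) -> float:
    return sum(run_skill(skill, x) == y for x, y in tasks) / len(tasks)


def optimize_skill(
    skill: str,
    edits: List[Callable[[str], str]],
    train: Sequence[Task],
    held_out: Sequence[Task],
) -> Tuple[str, float]:
    """Promote an edit only if it improves on rollouts AND on held-out tasks."""
    best, best_train, best_held = skill, accuracy(skill, train), accuracy(skill, held_out)
    for edit in edits:
        candidate = edit(best)
        train_score = accuracy(candidate, train)
        if train_score <= best_train:
            continue  # the edit did not help on the rollouts
        held_score = accuracy(candidate, held_out)
        if held_score <= best_held:
            continue  # helped on train only: treated as overfitting, not promoted
        best, best_train, best_held = candidate, train_score, held_score
    return best, best_held


def add_rule(rule: str) -> Callable[[str], str]:
    return lambda skill: skill + "\n" + rule


train_tasks = [(" hello ", "HELLO"), ("hello", "HELLO")]
held_out_tasks = [(" world ", "WORLD"), ("ok", "OK")]

best_skill, held_score = optimize_skill(
    "strip",
    [add_rule("always:HELLO"), add_rule("upper")],
    train_tasks,
    held_out_tasks,
)
print(repr(best_skill), held_score)
```

Requiring strict improvement on the held-out set is one conservative choice among several. A real harness might instead tolerate ties or use a confidence estimate before promoting.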

Omni-SimpleMem applies this logic to the design of memory architecture itself. The AutoResearchClaw pipeline drove memory benchmark performance up by several multiples in roughly 50 experiments — not by tuning hyperparameters, but by restructuring the data pipeline and the architecture itself. It did not find the right answer inside a fixed pipeline; it discovered the pipeline design itself, under the same conditions: an immediate scalar metric, a modular architecture, fast experiment cycles, and version control. Yet a harness does not guarantee that adaptation goes well. Because the evaluation function determines the direction of convergence, a narrow score target leads to a system that gets cleverly worse along every dimension the score ignores. How carefully the evaluation set is crafted, how wide the rollback boundary is drawn — the thickness of the environment decides whether adaptation is medicine or poison.
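One way to keep a narrow score target from degrading everything it ignores is to pair the primary metric with guard metrics that are not allowed to regress beyond a tolerance. The sketch below assumes every metric is higher-is-better; the metric names and tolerances are invented for illustration and do not come from the Omni-SimpleMem work.

```python
from typing import Dict


def accept_change(
    before: Dict[str, float],
    after: Dict[str, float],
    primary: str,
    guard_tolerance: Dict[str, float],
) -> bool:
    """Accept only if the primary metric improves and no guard metric drops too far."""
    if after[primary] <= before[primary]:
        return False
    for name, max_drop in guard_tolerance.items():
        if before[name] - after[name] > max_drop:
            return False  # the score went up, but something it ignores got worse
    return True


baseline = {"recall_at_5": 0.60, "faithfulness": 0.90, "latency_score": 0.80}
guards = {"faithfulness": 0.02, "latency_score": 0.02}

# Large gain on the target, bought with a large loss the target does not see.
narrow_win = {"recall_at_5": 0.75, "faithfulness": 0.70, "latency_score": 0.80}
# Smaller gain on the target, guard metrics essentially intact.
balanced_win = {"recall_at_5": 0.70, "faithfulness": 0.89, "latency_score": 0.79}

print(accept_change(baseline, narrow_win, "recall_at_5", guards))
print(accept_change(baseline, balanced_win, "recall_at_5", guards))
```

Widening or narrowing the guard set is one concrete form of making the environment thicker or thinner: each guard is a dimension along which the loop is no longer free to get worse.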

The model of building a feature in code, polishing that code, and shipping it is over. Code and prompts themselves matter less now; the environment, the data, and the evaluation criteria matter more. Instead of the supplier finishing software in their own environment and handing it off, the picture is this: set up the customer's environment so an agent can adapt in place, let custom software (a skill that is also a harness) get built right there, and ship it through an audit. Even a remarkably thin, lightweight agent can comfortably meet both performance and cost targets if it auto-optimizes inside the customer's actual environment. Claude Code became lovable precisely because moving the agent runtime from server to local, permissions included, created an environment where this kind of adaptability could flourish.

Take the traditional RAG delivery model. You would build an excellent engine and release it as SaaS, or develop a systems-integration (SI) solution tuned to the customer's situation. Now, if you create a versatile, well-adapting seed, let it grow inside the customer's environment, and provide the right guidance, you can compress timelines dramatically while producing a better-fit product. AutoML and AutoRAG existed before, but the search process was cumbersome and the systems grew enormously bloated just to become adaptable. Agents can now be slim and easily adaptable while exploring the search space itself.

The question that matters now is not "how capable is the model?" but "how harness-ready is the environment?" Readiness comes down to four checks: whether observable state (logs, metrics, evaluation sets, diff history) is in place; whether comparable metrics can consistently judge improvement across experiments; whether rollback units are small enough to make failure cheap; and whether a promotion policy exists for graduating repeated successes into skills and pruning deprecated procedures. Without these, a system accumulates noise over time instead of learning.
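A promotion policy of the kind described here can be sketched as a small bookkeeping class: count outcomes per procedure, graduate the ones that keep succeeding, and prune the ones that keep failing. The class name, thresholds, and procedure names below are hypothetical defaults picked for the example, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcedureStats:
    successes: int = 0
    failures: int = 0
    promoted: bool = False


class PromotionPolicy:
    def __init__(
        self,
        promote_after: int = 3,
        min_success_rate: float = 0.8,
        prune_below: float = 0.5,
        min_trials: int = 4,
    ) -> None:
        self.promote_after = promote_after
        self.min_success_rate = min_success_rate
        self.prune_below = prune_below
        self.min_trials = min_trials
        self.stats: Dict[str, ProcedureStats] = {}

    def record(self, name: str, ok: bool) -> None:
        entry = self.stats.setdefault(name, ProcedureStats())
        if ok:
            entry.successes += 1
        else:
            entry.failures += 1

    def review(self) -> Dict[str, List[str]]:
        """Promote repeated successes to skills; drop procedures that keep failing."""
        promoted: List[str] = []
        pruned: List[str] = []
        for name, entry in list(self.stats.items()):
            trials = entry.successes + entry.failures
            rate = entry.successes / trials if trials else 0.0
            if trials >= self.min_trials and rate < self.prune_below:
                del self.stats[name]
                pruned.append(name)
            elif (
                not entry.promoted
                and entry.successes >= self.promote_after
                and rate >= self.min_success_rate
            ):
                entry.promoted = True
                promoted.append(name)
        return {"promoted": promoted, "pruned": pruned}


policy = PromotionPolicy()
for outcome in [True, True, True, True]:
    policy.record("retry_with_backoff", outcome)
for outcome in [True, False, False, False]:
    policy.record("regex_html_parse", outcome)
policy.record("new_idea", True)

print(policy.review())
```

A procedure with too few trials is neither promoted nor pruned, so a single lucky or unlucky run does not change what the system treats as a skill.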

And this is not a set-it-and-forget-it affair. After deployment, continuous performance evaluation and drift detection have to keep running, supported by rollback, version control, permission management, security, and observability. Paradoxically, this is exactly where code matters again. The code that implements features matters less, but the code that keeps all of this machinery robust matters enormously. On the infrastructure side, too, more weight falls on the tools and the safe workspace an agent operates in, and on the memory that accumulates from execution: semantic, working, episodic, procedural. A slim agent inside a well-designed harness outperforms a capable agent without one, on both performance and cost. The moat is not the model's capability; it is the thickness of the environment that lets the model keep adapting on site.
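Post-deployment drift detection can start as simply as comparing a rolling window of recent scores with the score recorded at promotion time. The sketch below is one minimal form of that idea; `DriftMonitor`, the window size, and the tolerated drop are assumptions for illustration, and a production system would likely want a proper statistical test and a link to its rollback mechanism.

```python
from collections import deque
from statistics import mean
from typing import Deque


class DriftMonitor:
    """Flag drift when the rolling mean falls too far below the promoted baseline."""

    def __init__(self, baseline: float, window: int, max_drop: float) -> None:
        self.baseline = baseline
        self.window = window
        self.max_drop = max_drop
        self.scores: Deque[float] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one evaluation result; return True if a rollback should be considered."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return False  # not enough evidence yet
        return self.baseline - mean(self.scores) > self.max_drop


monitor = DriftMonitor(baseline=0.75, window=3, max_drop=0.05)
alerts = [monitor.observe(s) for s in [0.76, 0.74, 0.75, 0.66, 0.64]]
print(alerts)
```

A single bad score does not trip the alert here; the window has to sink as a whole, which trades detection speed for fewer false alarms.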

References

Karpathy, Andrej. "autoresearch." GitHub, 2026.

"How Arize Skills Improved RAG Recall from 39% to 75% in 8 Hours." Arize Blog, 2026.

Osmani, Addy. "Building Reliable Long-Running AI Agents." Addyo Substack, 2026.

Khattab, Omar, et al. "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." arXiv 2310.03714, 2023.

"MACLA: Multi-Agent Collaboration via Layered Abstraction." arXiv 2512.18950.

"SkillOpt: Optimizing Natural Language Skill Descriptions for AI Agents." arXiv 2605.23904.

"Omni-SimpleMem: Autonomous Discovery of Multimodal Lifetime Memory Systems." arXiv 2604.01007.

DeepMind. "Advancing Mathematics Research with AI-Driven Formal Proof Search." arXiv 2605.22763, 2026.