AI May 22, 2026 10 min read

Weights, Prompts, and Code as Parameters

Weights, prompts, and code as parameters at different layers of a learnable policy space

#AI #Agent #LLM #POMDP #DSPy #Harness #Self-Improvement #RL

A useful entry point for understanding contemporary agent systems is that nearly every problem takes the form of a POMDP in some broad sense. An agent does not fully observe the true state of the world; instead, it constructs a belief about the current state through partial signals — limited context windows, retrieval results, execution logs, tool call outputs, user feedback, external memory, code execution results. The classical formulation of POMDPs places the core challenge exactly here: selecting actions based on observation history and belief state under conditions where the full state is never given (Kaelbling, Littman, and Cassandra 1998). The standard reinforcement learning framework similarly treats the agent as a system that improves its policy through interaction with an environment, emphasizing that action selection is adjusted through reward and accumulated experience (Sutton and Barto 2018). From this perspective, agent performance is not explained by the volume of knowledge inside the model alone — it depends on how incomplete observations are compressed into a state representation, what actions are selected over that representation, and how those results feed back into the next judgment.
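The belief construction described above can be made concrete with the standard discrete Bayes filter. What follows is a minimal sketch, assuming a tiny hand-written transition table `T` and observation table `O`; the state names, the `run_tests` action, and every probability are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

Belief = Dict[str, float]

def belief_update(belief: Belief, action: str, obs: str,
                  T: Dict[Tuple[str, str], Dict[str, float]],
                  O: Dict[Tuple[str, str], Dict[str, float]]) -> Belief:
    """b'(s') is proportional to O(obs | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(T[(s, action)][s_next] * belief[s] for s in states)
        unnormalized[s_next] = O[(s_next, action)][obs] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical toy problem: the agent cannot see whether a bug is fixed,
# it only sees a noisy test result.
T = {  # running tests does not change the hidden state
    ("bug_fixed", "run_tests"): {"bug_fixed": 1.0, "bug_present": 0.0},
    ("bug_present", "run_tests"): {"bug_fixed": 0.0, "bug_present": 1.0},
}
O = {  # tests are informative but flaky
    ("bug_fixed", "run_tests"): {"pass": 0.9, "fail": 0.1},
    ("bug_present", "run_tests"): {"pass": 0.2, "fail": 0.8},
}
prior = {"bug_fixed": 0.5, "bug_present": 0.5}
posterior = belief_update(prior, "run_tests", "pass", T, O)
```

Starting from a uniform prior, one passing run moves the belief in `bug_fixed` to about 0.82. The true state is still unobserved; the agent only holds a sharper estimate of it.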

In this framing, weights, prompts, and code are products of different layers, but all of them can be seen as parameters in a broad sense sitting on the policy space of a learnable system. Model weights are an implicit policy shaped by large-scale training. Prompts are natural-language parameters that condition that policy for a specific situation and task. Code is a procedural parameter that externalizes reasoning procedures, action procedures, state updates, and verification conditions in executable form. Code as Agent Harness describes how code moves beyond being a final output of an LLM to becoming the operational substrate that organizes an agent's reasoning, actions, environment modeling, and execution-based verification (Ning et al. 2026). The same paper characterizes the harness as a software layer combining tools, APIs, sandboxes, memory, validators, permission boundaries, execution loops, and feedback channels — and identifies code as the central medium of that harness.

The sequence in which these spaces opened (weights → prompts → code) followed from differences in adjustment cost, feedback speed, verifiability, and blast radius at each layer. The weight space was the first to be mathematically formalized through loss functions and backpropagation, then industrialized through the combination of large-scale data and GPU training infrastructure. DQN demonstrated that combining deep learning with reinforcement learning could be a powerful path to policy learning in complex control problems; PPO became a representative algorithm for more stable policy optimization (Mnih et al. 2015; Schulman et al. 2017). At this stage, what was learnable was primarily the weights inside the model. Prompts and execution procedures outside the model were treated, for the most part, as fixed conditions.

The prompt space opened after large models had acquired sufficiently general capabilities — once it became clear that the behavioral distribution could be shifted dramatically through input conditions alone, without touching weights. ReAct interleaved reasoning traces with environment actions, enabling the model to update its plans and interact with external knowledge sources; Tree of Thoughts proposed exploring multiple reasoning paths and selecting among them through self-evaluation, rather than single left-to-right generation (Yao et al. 2022; Yao et al. 2023). Reflexion introduced storing verbal feedback in episodic memory to improve decision-making in subsequent trials without any weight updates; Self-Refine showed a single LLM iterating through generation, feedback, and revision as a test-time improvement structure (Shinn et al. 2023; Madaan et al. 2023). This line of work reveals that prompts and linguistic memory can function not as simple input sentences but as policy conditions repeatedly adjusted by feedback.
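The generate, feedback, revise pattern shared by Reflexion and Self-Refine can be sketched as a short loop in which no weights change and only a list of verbal notes grows. This is an illustrative sketch, not either paper's implementation; `toy_generate` and `toy_critique` are hypothetical stand-ins for LLM calls.

```python
from typing import Callable, List, Optional, Tuple

def refine_loop(generate: Callable[[str, List[str]], str],
                critique: Callable[[str, str], Optional[str]],
                task: str, max_rounds: int = 3) -> Tuple[str, List[str]]:
    memory: List[str] = []            # verbal feedback carried across attempts
    draft = generate(task, memory)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:          # the critic has no remaining complaints
            break
        memory.append(feedback)
        draft = generate(task, memory)  # same "weights", new conditioning
    return draft, memory

def toy_generate(task: str, memory: List[str]) -> str:
    # Hypothetical stand-in for an LLM: it improves only when told what to fix.
    if any("empty" in note for note in memory):
        return "def mean(xs): return sum(xs) / len(xs) if xs else 0.0"
    return "def mean(xs): return sum(xs) / len(xs)"

def toy_critique(task: str, draft: str) -> Optional[str]:
    # Hypothetical stand-in for self-feedback or an environment signal.
    return None if "if xs" in draft else "handle the empty list case"

final_draft, notes = refine_loop(toy_generate, toy_critique, "write mean(xs)")
```

The only thing that changes between attempts is `memory`, which is the sense in which linguistic memory behaves like a policy condition adjusted by feedback.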

Prompts become unambiguously learnable policy variables in DSPy. DSPy replaces LM pipelines built on hard-coded prompt templates with declarative modules, together with a compiler that optimizes the resulting pipeline to maximize a given metric (Khattab et al. 2023). In this approach, a prompt is not a sentence to be hand-tuned but part of a program that combines modules, demonstrations, instructions, augmentations, and reasoning techniques. The critical shift is that a prompt, while still a natural-language surface, can be compiled repeatedly against data and metrics. The prompt space can therefore be understood as a fast-adaptation manifold operating outside the model weights.
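The idea of compiling a prompt against data and a metric can be illustrated without any particular library. The sketch below deliberately does not use DSPy's API; it is a plain exhaustive search over hypothetical instruction and demonstration candidates, scored by a hypothetical `toy_model` that stands in for a language model.

```python
from itertools import product

def compile_prompt(instructions, demo_sets, run_model, devset, metric):
    """Return (score, prompt) for the best instruction/demo combination."""
    best = None
    for instruction, demos in product(instructions, demo_sets):
        prompt = {"instruction": instruction, "demos": demos}
        score = sum(metric(run_model(prompt, x), y) for x, y in devset) / len(devset)
        if best is None or score > best[0]:
            best = (score, prompt)
    return best

def toy_model(prompt, text):
    # Hypothetical stand-in for an LM: it only "knows" the negative label
    # if some demonstration in the prompt shows it.
    knows_negative = any(label == "negative" for _, label in prompt["demos"])
    if knows_negative and "bad" in text:
        return "negative"
    return "positive"

def exact_match(prediction, gold):
    return float(prediction == gold)

devset = [("good movie", "positive"), ("bad plot", "negative")]
instructions = ["Classify the sentiment."]
demo_sets = [(), (("awful acting", "negative"),)]

best_score, best_prompt = compile_prompt(
    instructions, demo_sets, toy_model, devset, exact_match)
```

Here the search settles on the prompt that carries a negative demonstration, because that is the variant the metric rewards. Real optimizers search a much larger space and use smarter proposal strategies, but the prompt is treated as a variable in the same way.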

The code space opened in earnest when the model's code generation capabilities, safe execution environments, testing, logging, rollback, and permission boundaries all matured together. Where prompts adjust textual conditions, code adjusts an executable policy surface that interacts directly with environments. AutoHarness showed that a code harness around an LLM can be automatically synthesized through a small number of iterative code refinements and environment feedback; Meta-Harness proposed an outer loop with access to source code, scores, and execution traces that searches over harness code (Lou et al. 2026; Lee et al. 2026). Arize's RAG recall improvement case demonstrates how chunking, retrieval, reranking, and indexing strategies in a code-based pipeline can all operate as procedural variables explored by an evaluation loop — moving from 39% to 75% recall in eight hours (Arize AI 2026).
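To show what it means for chunking and retrieval settings to act as procedural variables under an evaluation loop, here is a deliberately tiny sketch. The corpus, the keyword-overlap retriever, and the two-value grids are all hypothetical; the sketch is not a reconstruction of the Arize pipeline or of AutoHarness or Meta-Harness.

```python
from itertools import product

def chunk(tokens, size):
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def retrieve(chunks, query, top_k):
    # Rank chunks by how many query tokens they contain (stable sort on ties).
    ranked = sorted(chunks, key=lambda c: -sum(tok in c for tok in query))
    return ranked[:top_k]

def recall(tokens, qa_pairs, size, top_k):
    chunks = chunk(tokens, size)
    hits = 0
    for query, answer in qa_pairs:
        retrieved = retrieve(chunks, query, top_k)
        hits += any(answer in c for c in retrieved)
    return hits / len(qa_pairs)

TOKENS = "alpha lives in paris beta lives in rome gamma lives in oslo".split()
QA = [(["alpha"], "paris"), (["beta"], "rome"), (["gamma"], "oslo")]

# The evaluation loop: every (chunk_size, top_k) pair is a candidate harness.
configs = list(product([2, 4], [1, 2]))
scores = {cfg: recall(TOKENS, QA, *cfg) for cfg in configs}
best_config = max(scores, key=scores.get)
```

With two-token chunks the answer is split away from the entity that the query mentions, so recall is poor; four-token chunks keep them together. The evaluation loop discovers this without anyone editing a prompt or a weight.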

Framed as a hierarchical manifold, this becomes clearer. At the lowest layer is the high-dimensional parameter manifold created by model weights. Above it is the conditioning manifold created by prompts. Further out is the execution-procedure manifold created by code and harness. Each layer changes the agent's behavioral distribution, but the cost, speed, and verifiability of movement differ. Fast problems are handled in the prompt space; recurring procedural failures are handled in the code space; broad generalization deficits remain as long-horizon updates in the weight space. Self-improvement is less about optimization within a single space and more about reading a failure signal and choosing which manifold to move in.
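The routing rule in the paragraph above (fast problems to prompts, recurring procedural failures to code, broad deficits to weights) can be written down as a small decision function. The field names and thresholds below are assumptions made for illustration; the essay does not specify them.

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    PROMPT = "prompt"
    CODE = "code"
    WEIGHTS = "weights"

@dataclass
class FailureSignal:
    occurrences: int      # how many times this failure has been observed
    procedural: bool      # True if it sits in a tool call, parser, retry path, etc.
    distinct_tasks: int   # how many different task families it affects

def choose_layer(signal: FailureSignal,
                 recurring_threshold: int = 3,
                 breadth_threshold: int = 10) -> Layer:
    """Pick the cheapest layer that plausibly addresses the failure."""
    if signal.distinct_tasks >= breadth_threshold:
        return Layer.WEIGHTS          # broad generalization deficit
    if signal.procedural and signal.occurrences >= recurring_threshold:
        return Layer.CODE             # recurring procedural failure
    return Layer.PROMPT               # fast, local fix

one_off = choose_layer(FailureSignal(occurrences=1, procedural=False, distinct_tasks=1))
recurring = choose_layer(FailureSignal(occurrences=5, procedural=True, distinct_tasks=2))
broad = choose_layer(FailureSignal(occurrences=5, procedural=True, distinct_tasks=20))
```

A real system would estimate these quantities from logs rather than receive them as clean fields, and the thresholds would themselves be candidates for tuning.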

The distinction between parameter and hyperparameter, policy and meta-policy, is not absolute in this view. What looks like a fixed condition at one layer appears as an adjustable variable in the learning loop one level up. Inside model training, learning rates or routing rules look like hyperparameters; from the perspective of meta-learning or AutoML, they become optimization targets again. Similarly, a task-execution policy looks like a given rule at one layer, but in a self-improvement loop it becomes input to the meta-policy deciding how that policy should change. The distinction is created by which abstraction boundary you happen to be standing at. The moment a condition that was outside the boundary enters the learning loop, it becomes a parameter in the broad sense.
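A small numerical example makes the boundary shift visible. In the sketch below, the learning rate is a fixed hyperparameter from the point of view of `inner_train`, and an optimization variable from the point of view of `outer_search`. The objective, step count, and candidate values are arbitrary choices made for illustration.

```python
def inner_train(lr: float, steps: int = 5, x0: float = 1.0) -> float:
    """Gradient descent on f(x) = x**2; returns the final loss."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x       # the gradient of x**2 is 2x
    return x * x

def outer_search(candidates):
    """One level up, the learning rate is the thing being optimized."""
    return min(candidates, key=inner_train)

best_lr = outer_search([0.01, 0.1, 0.5, 1.1])
```

Inside `inner_train`, only `x` moves. Inside `outer_search`, `lr` moves and the whole inner loop becomes the objective. Nothing about `lr` changed except which loop it sits in.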

Self-improvement is the process of search and adaptation occurring across this hierarchical policy space. The agent acts according to the policy expressed in its current weights, prompts, and code; from the imperfect feedback those actions produce, it estimates which representations need to change so that the next execution improves. Signals such as test pass rate, retrieval performance, execution failure rate, cost, latency, human preference, and the presence of regressions are not complete reward functions, but they can reveal a relative ordering among candidates. As this comparative signal accumulates, the agent moves beyond correcting individual answers and adjusts the way it approaches problems. The core of self-improvement is treating more representations as learnable variables and reconfiguring them according to feedback.
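One way to see how incomplete signals can still yield a relative ordering is a pairwise comparison in which no single number is treated as the reward. The candidate names and measurements below are invented for illustration, and majority-of-signals voting is just one simple choice among many.

```python
# Higher is better for every entry; latency is negated so that holds.
CANDIDATES = {
    "baseline":    {"test_pass_rate": 0.70, "recall": 0.40, "neg_latency_s": -2.0},
    "new_prompt":  {"test_pass_rate": 0.75, "recall": 0.40, "neg_latency_s": -2.5},
    "new_harness": {"test_pass_rate": 0.80, "recall": 0.60, "neg_latency_s": -2.2},
}

def beats(a: dict, b: dict) -> bool:
    """a beats b if it wins on more signals than it loses."""
    wins = sum(a[k] > b[k] for k in a)
    losses = sum(a[k] < b[k] for k in a)
    return wins > losses

def rank(candidates: dict) -> list:
    """Order candidates by how many rivals they beat head to head."""
    score = {
        name: sum(beats(sig, other)
                  for rival, other in candidates.items() if rival != name)
        for name, sig in candidates.items()
    }
    return sorted(score, key=score.get, reverse=True)

ordering = rank(CANDIDATES)
```

In this toy data `new_harness` beats both rivals even though it is not the fastest, while `baseline` and `new_prompt` are not separable: each wins on one signal and loses on another. That is the kind of partial ordering a comparative signal provides.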

The harness is the executable boundary where this self-improvement can occur. A harness is where weights, prompts, and code converge; it bundles observation, action, state storage, verification, retry, and rollback into a single closed loop. Code operating as a harness means more than an agent outputting code: it means that code specifies what the agent (or another agent) will observe, what it will execute, which failures it will detect, and under what conditions it will stop. When an agent modifies harness code, the edit is therefore both a functional change and a coordinate shift in policy space. Voyager's combination of an automatic curriculum, a reusable code skill library, and iterative prompting that incorporates environment feedback and execution errors illustrates how experience is externalized into reusable policy fragments (Wang et al. 2023).
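The closed loop described above (observe, act, store state, verify, retry, roll back, stop) fits in a few lines. This is a minimal sketch under stated assumptions: `propose` stands in for the model call, and `apply_action` and `verify` are hypothetical toy functions, not part of any cited system.

```python
import copy

def run_harness(state, propose, apply_action, verify, max_attempts=3):
    state = copy.deepcopy(state)                 # never mutate the caller's state
    log = []                                     # execution trace the policy can observe
    for attempt in range(1, max_attempts + 1):   # stop condition
        snapshot = copy.deepcopy(state)          # checkpoint for rollback
        action = propose(state, log)             # act, conditioned on past failures
        apply_action(state, action)              # mutate state in place
        ok, message = verify(state)              # execution-based verification
        log.append((attempt, action, ok, message))
        if ok:
            return state, log
        state = snapshot                         # roll back the failed change
    return state, log

def propose(state, log):
    # Hypothetical policy: double the timeout after each logged failure.
    return {"timeout_s": 2 ** (len(log) + 1)}

def apply_action(state, action):
    state.update(action)

def verify(state):
    ok = state["timeout_s"] >= 4
    return ok, ("ok" if ok else "timed out")

final_state, trace = run_harness({"timeout_s": 1}, propose, apply_action, verify)
```

Everything the agent can see, do, check, and undo is decided by this code, so editing the loop changes the policy even when the model behind `propose` stays the same.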

Multi-model orchestration extends this structure to the dimension of model selection. Sakana AI's Fugu is described as a multi-agent orchestration system that coordinates multiple foundation models; the emphasis is not on a single model solving every problem directly, but on a structure that learns which model to call, when to call it, and how to compose collaborative arrangements (Sakana AI 2026). Mixture-of-experts (MoE) routing can be read as tightly coupling this policy selection inside a single model; routing over an external pool of models is a looser, system-level version of the same problem. Evolutionary algorithms and meta-learning take a further step, making the procedure for searching policy space itself a learning target. OpenAI's work on Evolution Strategies serves as background evidence that evolutionary search can be an alternative optimization path to standard reinforcement learning methods (Salimans et al. 2017).
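Learning which model to call for which kind of task can be sketched as a very small explore-then-exploit router. This is not how Fugu or MoE gating is implemented; the model names, task types, and reward table are hypothetical, and rewards are made deterministic so the behaviour is easy to follow.

```python
from collections import defaultdict

class Router:
    def __init__(self, models):
        self.models = list(models)
        self.total = defaultdict(float)   # summed reward per (task_type, model)
        self.count = defaultdict(int)     # number of calls per (task_type, model)

    def choose(self, task_type):
        # Try every untested model once, then pick the best running average.
        for model in self.models:
            if self.count[(task_type, model)] == 0:
                return model
        return max(
            self.models,
            key=lambda m: self.total[(task_type, m)] / self.count[(task_type, m)],
        )

    def update(self, task_type, model, reward):
        self.total[(task_type, model)] += reward
        self.count[(task_type, model)] += 1

# Hypothetical quality of each model on each task type.
SKILL = {
    ("code", "coder"): 1.0, ("code", "writer"): 0.2,
    ("prose", "coder"): 0.3, ("prose", "writer"): 0.9,
}

router = Router(["coder", "writer"])
for task_type in ["code", "prose"] * 3:
    model = router.choose(task_type)
    router.update(task_type, model, SKILL[(task_type, model)])

code_choice = router.choose("code")
prose_choice = router.choose("prose")
```

After a short exploration phase the router sends code tasks to `coder` and prose tasks to `writer`. With noisy rewards one would need continued exploration, for example an epsilon-greedy or bandit-style rule, rather than a single trial per model.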

Adaptation and control are both necessary inside this structure simultaneously. Adaptation draws conditions that were outside the boundary into the set of learnable variables, extending the adjustment target to include prompts, code, harness, evaluation loops, and search procedures. Control provides mechanisms — sandboxes, tests, version management, rollback, permission boundaries, human approval — to prevent that movement from diverging. Adaptation without control risks overfitting to a weak oracle or producing regressions; control without adaptation prevents the structure from changing to meet environmental shifts. A good agent harness is a structure that enables controlled adaptation: an executable closed-loop structure over a policy space where weights, prompts, and code combine, and where search and adaptation can repeat.
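Controlled adaptation can be illustrated as a gate that a proposed change must pass before it replaces the current configuration. The sketch below assumes configurations are plain dictionaries; the `score` function, the `token_budget` regression test, and the candidate values are hypothetical.

```python
def gated_update(current, candidate, score, regression_tests, min_gain=0.0):
    """Accept the candidate only if it breaks nothing and actually improves."""
    failed = [name for name, test in regression_tests.items() if not test(candidate)]
    if failed:
        return current, f"rejected: regressions in {failed}"
    gain = score(candidate) - score(current)
    if gain <= min_gain:
        return current, f"rejected: gain {gain:.2f} not above {min_gain:.2f}"
    return candidate, f"accepted: gain {gain:.2f}"

def score(config):
    # Hypothetical, imperfect oracle: more retrieved passages help up to a point.
    return min(config["top_k"], 5) / 5

regression_tests = {
    # Control: a hard constraint that the oracle above knows nothing about.
    "token_budget": lambda config: config["max_tokens"] <= 1024,
}

current = {"top_k": 3, "max_tokens": 512}
greedy = {"top_k": 5, "max_tokens": 2048}   # scores well but violates the budget
safe = {"top_k": 5, "max_tokens": 512}      # same score, within the budget

after_greedy, reason_greedy = gated_update(current, greedy, score, regression_tests)
after_safe, reason_safe = gated_update(current, safe, score, regression_tests)
```

The `greedy` candidate is exactly what overfitting to a weak oracle looks like: it maximizes the score while violating a constraint the score cannot see. The regression test is the control that keeps adaptation from diverging.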

References

Arize AI. 2026. "How Arize Skills Improved RAG Recall from 39% to 75% in 8 Hours." Arize AI Blog.

Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. "Planning and Acting in Partially Observable Stochastic Domains." Artificial Intelligence 101 (1–2): 99–134.

Khattab, Omar, et al. 2023. "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." arXiv 2310.03714.

Lee, Yoonho, et al. 2026. "Meta-Harness: End-to-End Optimization of Model Harnesses." arXiv 2603.28052.

Lou, Xinghua, et al. 2026. "AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness." arXiv 2603.03329.

Madaan, Aman, et al. 2023. "Self-Refine: Iterative Refinement with Self-Feedback." arXiv 2303.17651.

Mnih, Volodymyr, et al. 2015. "Human-Level Control through Deep Reinforcement Learning." Nature 518: 529–533.

Ning, Xuying, et al. 2026. "Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems." arXiv.

Sakana AI. 2026. "Sakana Fugu: A Multi-Agent Orchestration System as a Foundation Model."

Salimans, Tim, et al. 2017. "Evolution Strategies as a Scalable Alternative to Reinforcement Learning." arXiv 1703.03864.

Schulman, John, et al. 2017. "Proximal Policy Optimization Algorithms." arXiv 1707.06347.

Shinn, Noah, et al. 2023. "Reflexion: Language Agents with Verbal Reinforcement Learning." arXiv 2303.11366.

Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press.

Wang, Guanzhi, et al. 2023. "Voyager: An Open-Ended Embodied Agent with Large Language Models." arXiv 2305.16291.

Yao, Shunyu, et al. 2022. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv 2210.03629.

Yao, Shunyu, et al. 2023. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv 2305.10601.