The Three-Layer Problem: Why AI Agents Learn Differently Than Models
When engineers talk about continual learning, they almost always mean one thing: updating model weights. Fine-tune on new data, ship a new checkpoint, done. But Harrison Chase, founder of LangChain, argues in a recent blog post that AI agents expose a much richer picture — one where learning happens simultaneously at three distinct layers, and the most actionable improvements often have nothing to do with the model itself.
The three layers Chase identifies: the model (weights and training), the harness (the code and instructions that drive every instance of the agent), and the context (instructions, skills, and memory that live outside the harness and configure it per-user or per-tenant). Each layer has different dynamics, different failure modes, and different practical levers for builders.
The harness is where the action is
The most underappreciated layer is the harness — and a new paper from researchers at Stanford, Google, and elsewhere puts hard numbers behind why. Meta-Harness (arXiv:2603.28052, March 2026) describes a system that optimizes harness code automatically by giving a coding agent full access to the execution traces, scores, and source code of every prior harness candidate. The key insight: prior optimization methods compress everything into a short summary and a scalar score. That works for small problems, but harness engineering produces failures that are hard to diagnose without seeing the raw trace.
The Meta-Harness system gives the proposer up to 10M tokens of diagnostic context per iteration, compared with at most 26K for the prior methods surveyed. The results span three domains. On text classification, Meta-Harness improved over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improved accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpassed the best hand-engineered baselines on TerminalBench-2, a benchmark of agentic coding tasks.
The paper's comparison table is revealing:
| Method | Context available |
| --- | --- |
| Self-Refine | Current output + self-critique |
| OPRO | ~20 prior instruction strings + scalar scores |
| TextGrad | Current candidate output + evaluation result |
| Meta-Harness | Full source, scores, and execution traces of all prior candidates |
The jump from scalar scores to full traces is the meaningful difference. Harness failures are often architectural — a wrong assumption about what state to persist between turns, a missing tool definition, a retry loop that never terminates. You can't diagnose those from a number.
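The loop the paper describes can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the function names (`propose`, `evaluate`), the 4-chars-per-token heuristic, and the newest-first truncation are all assumptions; only the shape of the idea — the proposer sees full source, scores, and traces of every prior candidate, up to a large token budget — comes from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str   # full harness source code
    score: float  # benchmark score
    trace: str    # raw execution trace

def optimize_harness(propose, evaluate, n_iters=10, budget_tokens=10_000_000):
    """Sketch of a Meta-Harness-style loop: the proposer conditions on the
    complete history of candidates, not a scalar summary."""
    history: list[Candidate] = []
    for _ in range(n_iters):
        # Build diagnostic context from all prior candidates, newest first,
        # truncated to the token budget (~10M in the paper vs ~26K before).
        context = ""
        for c in reversed(history):
            entry = f"SCORE={c.score}\nSOURCE:\n{c.source}\nTRACE:\n{c.trace}\n"
            if (len(context) + len(entry)) // 4 > budget_tokens:  # ~4 chars/token
                break
            context += entry
        source = propose(context)        # coding agent writes a new harness
        score, trace = evaluate(source)  # run the benchmark, keep the raw trace
        history.append(Candidate(source, score, trace))
    return max(history, key=lambda c: c.score)
```

The contrast with OPRO-style methods is the `context` string: replacing it with only `SCORE=` lines recovers the scalar-feedback setup the paper argues is insufficient for diagnosing architectural failures.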
Context-level learning: the per-tenant layer
Below the harness sits context — the layer most commonly associated with "memory." This is where per-user and per-tenant learning happens. Chase points to OpenClaw's SOUL.md as an example of agent-level context that evolves over time, and Hex's Context Studio, Decagon's Duet, and Sierra's Explorer as enterprise examples at the tenant level.
OpenClaw's implementation, called "dreaming," is worth looking at closely because it's one of the few production-grade context-learning systems with public documentation. Dreaming runs as a background consolidation pass that tracks short-term recall events — every memory search hit is recorded with recall count, scores, and query hash. On a configured cadence, candidates are scored across four signals: frequency (how often it was recalled), relevance (average recall scores), query diversity (how many distinct intents surfaced it), and recency (temporal decay with a 14-day half-life). Only candidates that pass all configured threshold gates get promoted into durable long-term memory.
The design is explicit about what it's optimizing for: avoiding the accumulation of one-off details into persistent context. Memory stays focused on durable, repeated patterns. This is a non-trivial engineering choice — most systems just append.
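The gating logic described above can be sketched as follows. The field names, threshold schema, and `promote` function are hypothetical, not OpenClaw's actual implementation; the four signals, the 14-day half-life, and the all-gates-must-pass rule come from the public documentation as summarized here.

```python
import time

HALF_LIFE_DAYS = 14  # recency decay half-life from the dreaming docs

def promote(candidate, thresholds, now=None):
    """Return True only if a memory candidate passes every configured gate.
    Candidate fields and threshold names are illustrative."""
    now = now or time.time()
    freq = candidate["recall_count"]                           # frequency
    rel = sum(candidate["scores"]) / len(candidate["scores"])  # avg relevance
    div = len(set(candidate["query_hashes"]))                  # query diversity
    age_days = (now - candidate["last_recalled"]) / 86_400
    rec = 0.5 ** (age_days / HALF_LIFE_DAYS)                   # recency decay
    signals = {"frequency": freq, "relevance": rel,
               "diversity": div, "recency": rec}
    # Promotion requires passing ALL gates: a one-off detail that spikes
    # on a single signal never reaches durable long-term memory.
    return all(signals[k] >= v for k, v in thresholds.items())
```

The `all(...)` gate is what encodes the design choice the docs emphasize: a fact recalled many times by one repeated query (high frequency, low diversity) still gets filtered out.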
Context learning can happen in two modes: offline (after the fact, processing recent traces in a batch job) or in the hot path (the agent decides mid-task to update its own memory). OpenClaw supports both.
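The two modes can share one consolidation routine and differ only in who calls it and when. A minimal sketch, with entirely hypothetical names; the point is that offline mode is a scheduler-driven batch pass while hot-path mode is the same operation exposed to the agent as a tool:

```python
def consolidate(events, memory):
    """Append durable facts from events into long-term memory (stub)."""
    memory.extend(e["fact"] for e in events if e.get("durable"))
    return memory

# Offline mode: a scheduled batch job replays recent traces after the fact.
def offline_pass(recent_traces, memory):
    return consolidate(recent_traces, memory)

# Hot-path mode: the agent updates its own memory mid-task via a tool call.
def remember_tool(fact, memory):
    return consolidate([{"durable": True, "fact": fact}], memory)
```

Offline mode trades immediacy for better judgment (the batch job sees a whole window of traces at once); the hot path trades the reverse, which is why systems like OpenClaw support both.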
The model layer: real but constrained
The model layer is where most research attention goes, and it's real — RL techniques like GRPO and supervised fine-tuning are the standard tools. But catastrophic forgetting remains an open problem: updating a model on new tasks degrades performance on things it previously knew. Chase notes that most teams training models for specific agentic use cases do it at the agent level, not the per-user level, largely for this reason.
LoRA at the per-user granularity is theoretically possible but rarely deployed in practice. The economics don't work yet.
What this means for builders
The taxonomy Chase lays out is useful precisely because it decomposes a problem that gets muddled in marketing. "Our agent is always learning" could mean updating model weights (expensive, slow, still unsolved), improving the harness code (tractable, often high-leverage), or accumulating per-user context (table stakes for enterprise at this point).
The Meta-Harness paper suggests the surface of the harness layer has barely been scratched. Automated optimization of harness code — treating the agent's scaffolding as a first-class engineering artifact rather than a one-time setup — is an underexplored frontier. The jump in TerminalBench-2 performance from discovered harnesses is a signal that the current generation of production agents may be running on suboptimal scaffolding.
For builders: audit what percentage of your agent's "learning" is happening at each layer. If you're only thinking about model updates, you're leaving the harness and context layers on the table.