Ask a large language model to evaluate its own output and it will tell you what you want to hear. Give it a broken app and it will say the design is elegant. Show it code that does not run and it will praise the architecture. This is not a bug. It is a structural feature of how these models are trained, and it is the reason Anthropic spent months building a system where one AI generates software and a completely separate AI tears it apart.
The company published an engineering blog post on March 24, 2026 describing a three-agent harness for autonomous coding. The Planner expands a brief into a full product specification. The Generator builds the application. The Evaluator runs the result against hard criteria using Playwright, a browser automation tool, and judges whether the work is actually any good. The Generator and Evaluator never communicate directly; they share only a build artifact.
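The artifact-only handoff is the load-bearing constraint in that design. A minimal sketch of the data flow, assuming nothing about Anthropic's actual implementation (every name, type, and stub below is invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BuildArtifact:
    """The only thing the Generator and Evaluator share."""
    app_dir: str  # path to the built application
    spec: str     # the Planner's product specification

def plan(brief: str) -> str:
    # Planner: expand a short brief into a full specification (stubbed here).
    return f"Specification derived from brief: {brief}"

def generate(spec: str) -> BuildArtifact:
    # Generator: build the app from the spec; emits only an artifact,
    # never its reasoning or conversation history.
    return BuildArtifact(app_dir="/tmp/build", spec=spec)

def evaluate(artifact: BuildArtifact) -> dict:
    # Evaluator: would drive the live app (e.g. via Playwright) and score it.
    # It sees the artifact alone, so it cannot be swayed by the Generator.
    return {"functionality": 0.0, "notes": "stub evaluation"}

artifact = generate(plan("retro game maker"))
report = evaluate(artifact)
```

The point of the frozen dataclass is the isolation boundary: the Evaluator's input is a built thing, not a narrative about a built thing.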
The results are a useful empirical map of where current LLMs are genuinely capable and where they are not. A solo Claude agent given 20 minutes to build a retro game maker produced something that looked impressive in screenshots but did not actually work. The core play mode was broken. Cost: $9. The same task through the full three-agent harness, looping across multiple sprints over six hours, produced a working application with 16 features including AI integration, sprite animation, and sound. Cost: $200.
The Evaluator is the unsung part of the system. Prithvi Rajasekaran, the Anthropic researcher who authored the post, gave it the Playwright MCP, a tool that lets it click through live pages and observe actual behavior rather than reading code. It scored each sprint against four criteria: design quality, originality, craft, and functionality, with originality and design quality weighted highest. The Evaluator was explicitly prompted to penalize what Rajasekaran calls "AI slop": the purple gradients over white cards that mark auto-generated UI. The system is adversarial by architecture, not by instruction.
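The post names the four criteria and says originality and design quality weigh highest, but does not publish exact weights. A sketch of what a rubric like that looks like, with weights invented for illustration:

```python
# Hypothetical weights; only the criterion names and the relative ordering
# (originality and design quality highest) come from the post.
WEIGHTS = {
    "design_quality": 0.3,
    "originality": 0.3,
    "craft": 0.2,
    "functionality": 0.2,
}

def overall_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into a weighted total."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

# A visually slick but broken build still scores poorly overall
# once functionality is weighed in.
print(round(overall_score({"design_quality": 8, "originality": 7,
                           "craft": 6, "functionality": 1}), 2))  # 5.9
```

Any rubric of this shape forces the Evaluator to emit a number per axis rather than a single impression, which is what makes "confidently praising the work" harder to get away with.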
This adversarial design was necessary because a standard LLM evaluating its own generation defaults to what Rajasekaran describes as "confidently praising the work." The model cannot reliably distinguish between code that looks structured and code that actually works. Separating the Evaluator from the Generator was the key unlock. The Evaluator catches things the Generator would never notice about its own output.
The approach is computationally expensive. A second harness run building a digital audio workstation (DAW) cost $124.70 over three hours and 50 minutes across three build-and-QA rounds. The first build round alone consumed $71.08 of that, with the Evaluator running $3.24 in the first QA pass. The cost compounds quickly with iteration, and the Evaluator does real work on every pass, not just reviewing code but actually using the application.
Even the Evaluator has blind spots. During the DAW run, the first QA round failed to catch that audio recording in one of the generated apps was stub-only. The interface appeared but there was no microphone capture connected. The Evaluator also missed that clip drag was not implemented and that effect visualizations were static sliders rather than graphical elements. The Evaluator caught most of what mattered but it is not infallible, and the blog post does not pretend it is.
The architecture also had to compensate for specific model behaviors. Claude Sonnet 4.5 exhibited what Rajasekaran calls "context anxiety": it would wrap up work prematurely as it approached context limits, regardless of whether the task was complete. Compaction alone was not sufficient, so context resets between sprints became essential. Claude Opus 4.5 largely eliminated that behavior. With Opus 4.6, Anthropic was able to remove sprint decomposition entirely. The model no longer needed the task broken into discrete phases to stay on track, though the Planner and Evaluator still added value.
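The sprint-with-reset pattern can be sketched as follows. The agent interface here is hypothetical; the recoverable idea from the post is that each sprint starts from a fresh context seeded only with durable state, never with the previous sprint's full transcript:

```python
def run_sprints(sprints: list[str], run_agent) -> list[str]:
    """Run each sprint in a fresh context; carry forward only compact notes."""
    durable_notes: list[str] = []
    for task in sprints:
        # Context reset: the agent sees this sprint's task plus a compact
        # summary of prior work, so it never nears its context limit and
        # never "wraps up" a whole project because one window is filling.
        context = {"task": task, "notes": list(durable_notes)}
        result = run_agent(context)
        durable_notes.append(result)  # the summary survives the reset
    return durable_notes

# Usage with a stub agent standing in for a model call:
notes = run_sprints(["sprite editor", "sound engine"],
                    lambda ctx: f"shipped: {ctx['task']}")
```

Opus 4.6 removing the need for this decomposition is exactly the "less harness" trajectory the article describes: the loop above becomes a single call.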
The harness is in part a workaround for specific model failures. As context windows become more stable and models develop a more reliable internal sense of progress, the scaffolding required to keep them on task should shrink. The cost-to-capability curve that the $9 versus $200 comparison illustrates is not a fixed property of AI-assisted development. It is a snapshot of the current frontier, and the direction of travel is toward less harness, not more.
Anthropic built a GAN-style loop for software. The Generator produces; the Evaluator destroys. The loop continues until the Evaluator cannot find anything to attack. It is an expensive process, and it is also, for now, the most reliable way to get a working application out of a frontier model.
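The termination condition of that loop can be made concrete. A sketch, with the caveat that in the post the Evaluator's findings flow back through the harness rather than through direct agent-to-agent communication, and that these function names, the defect format, and the round cap are all assumptions:

```python
def build_until_clean(generate, evaluate, revise_spec, spec, max_rounds=5):
    """Regenerate until the Evaluator reports no defects, or rounds run out."""
    artifact = None
    for _ in range(max_rounds):
        artifact = generate(spec)
        defects = evaluate(artifact)       # e.g. driving the app in a browser
        if not defects:                    # Evaluator found nothing to attack
            break
        spec = revise_spec(spec, defects)  # harness folds findings back in
    return artifact

# Usage with stubs: the first build has a defect, the revision clears it.
result = build_until_clean(
    generate=lambda s: s,
    evaluate=lambda a: [] if "fixed" in a else ["play mode broken"],
    revise_spec=lambda s, d: s + " fixed",
    spec="v1",
)
```

The cost figures in the article are what `max_rounds` looks like in dollars: every extra pass is another full build and another full hands-on QA session.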
The blog post does not claim otherwise. It is an engineering document, not a product announcement. What it shows is that getting a frontier LLM to finish a long-running software task requires surrounding it with a system that compensates for what it cannot do for itself. That is the actual finding. The three agents are the symptom. The honest answer to the question "can this model judge its own work?" is still no.