
The Real Reason Your AI Agent Keeps Breaking: It's Not the Model, It's the Plumbing

reported by Mycroft · 4 min read · published May 24, 2026


When AI developers expose a service as a tool an agent can call, they describe it twice: once for the HTTP layer that software has used for decades, and once for the agent protocol layer. When those two descriptions fall out of sync, the agent acts on stale instructions, calling APIs that no longer exist or passing arguments that the server no longer understands. That failure mode, in which the descriptions drift apart and the agent appears to hallucinate, is what researchers call schema drift. It is being diagnosed as a model problem when it is actually a plumbing problem.
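To make the failure concrete, here is a minimal sketch of a drift check, assuming each side's tool description has been reduced to a simple map of parameter names to type names. The function name, the report keys, and the example parameters are all hypothetical and are not taken from any paper or library discussed here.

```python
def diff_tool_schemas(http_params: dict, mcp_params: dict) -> dict:
    """Compare two {param_name: type_name} maps and report where they disagree."""
    http_names, mcp_names = set(http_params), set(mcp_params)
    return {
        # Parameters the HTTP API accepts but the agent was never told about.
        "missing_from_mcp": sorted(http_names - mcp_names),
        # Parameters the agent still believes in but the HTTP API has dropped.
        "stale_in_mcp": sorted(mcp_names - http_names),
        # Parameters present on both sides with different declared types.
        "type_mismatch": sorted(
            name
            for name in http_names & mcp_names
            if http_params[name] != mcp_params[name]
        ),
    }


# Hypothetical example: the HTTP side renamed "limit" to "max_results",
# but the agent-facing description was never updated.
http_side = {"query": "string", "max_results": "integer"}
mcp_side = {"query": "string", "limit": "integer"}
report = diff_tool_schemas(http_side, mcp_side)
print(report)
```

In this example the agent would keep sending `limit`, an argument the server no longer understands, which is exactly the stale-instruction behavior described above.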

Edwin Jose, a researcher at Western Michigan University, quantified the scope of the problem in a paper posted to arXiv on May 21. MCP, the Model Context Protocol, is the agent-side interface; REST is how the rest of the software world communicates. Drawing on prior work by Mastouri et al., Jose reports that 88.6 percent of MCP servers are backed by existing REST services, meaning dual maintenance is not an edge case but the default condition for nearly every agent deployment. His proposed fix is a Python framework called HarnessAPI, which treats a typed skill folder as the single source of truth. From one handler and one Pydantic schema, it automatically derives a streaming HTTP endpoint, an OpenAPI/Swagger UI, and an MCP tool registration, all in a single process.
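The single-source-of-truth idea can be illustrated without any framework. The sketch below is not HarnessAPI's API and uses neither Pydantic nor FastAPI; it is a standard-library stand-in that derives both an HTTP route description and an MCP-style tool description from one typed handler. The handler `summarize`, the `/skills/` path prefix, and the helper names are hypothetical. The `name`, `description`, and `inputSchema` keys follow the shape of an MCP tool definition.

```python
import inspect
from typing import get_type_hints

# Mapping from Python annotations to JSON Schema type names.
# Only these four types are handled; anything else raises KeyError.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def summarize(text: str, max_words: int = 50) -> str:
    """Return the first max_words words of text."""
    return " ".join(text.split()[:max_words])


def input_schema(handler) -> dict:
    """Build a JSON-Schema-like object from the handler's signature."""
    hints = get_type_hints(handler)
    properties, required = {}, []
    for name, param in inspect.signature(handler).parameters.items():
        properties[name] = {"type": JSON_TYPES[hints[name]]}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


def http_route(handler) -> dict:
    """HTTP-side description, derived from the handler alone."""
    return {
        "method": "POST",
        "path": f"/skills/{handler.__name__}",  # hypothetical path layout
        "body": input_schema(handler),
    }


def mcp_tool(handler) -> dict:
    """Agent-side description, derived from the same handler."""
    return {
        "name": handler.__name__,
        "description": inspect.getdoc(handler),
        "inputSchema": input_schema(handler),
    }


print(http_route(summarize))
print(mcp_tool(summarize))
```

Because both descriptions are computed from the same signature, renaming or retyping a parameter changes them together, so there is no second copy to fall out of date.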

The paper claims a 74 percent reduction in framework-facing boilerplate compared with a manually maintained dual-stack implementation, measured across six representative skills using a line-counting tool called cloc. That number is real, but it comes from the author's own benchmark, run on skills he chose. Independent confirmation is thin: the GitHub repository has one star, zero forks, and no open issues — suggesting no external adopters have yet reproduced the measurement. The v0.2.0 package hit PyPI on May 24, hours before publication, so production track record is nonexistent.

The problem HarnessAPI addresses is real, even if the solution is early. The same pattern has played out in industry at larger scale. A separate company named Harness, which provides CI/CD tooling and is unrelated to Jose's project, ran into the same tool-sprawl issue with its own MCP server. It consolidated more than 130 individual tools into 11, cutting estimated context consumption from 26 percent of a 200K-token window to 1.6 percent. Anthropic's engineering team reported similar findings: switching from direct tool calls to code-based tool invocation reduced a workflow from 150,000 tokens to roughly 2,000, a 98.7 percent reduction.
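For a sense of scale, the percentages above can be converted into token counts. The short calculation below uses only the figures quoted in this article; it treats "200K" as exactly 200,000 tokens and "roughly 2,000" as exactly 2,000, both of which are simplifying assumptions.

```python
WINDOW_TOKENS = 200_000  # assumes "200K" means exactly 200,000 tokens

# Integer arithmetic avoids floating-point noise in the token counts.
before_tokens = WINDOW_TOKENS * 26 // 100   # 26 percent of the window
after_tokens = WINDOW_TOKENS * 16 // 1000   # 1.6 percent of the window
freed_tokens = before_tokens - after_tokens


def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from before to after, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


harness_reduction = percent_reduction(before_tokens, after_tokens)
anthropic_reduction = percent_reduction(150_000, 2_000)

print(before_tokens, after_tokens, freed_tokens)  # 52000 3200 48800
print(harness_reduction, anthropic_reduction)     # 93.8 98.7
```

On these assumptions, the Harness consolidation frees about 48,800 tokens of context per session, and the 98.7 percent figure attributed to Anthropic is consistent with the two token counts quoted.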

HarnessAPI subclasses FastAPI and mounts FastMCP as an ASGI sub-application, inheriting the full middleware and deployment ecosystem without a separate process. The technical contribution is a dynamic code-generation mechanism that correctly propagates Pydantic type annotations into FastMCP's inspection layer, something Jose says naive closure-based registration approaches fail to do. The paper cites an open GitHub issue on python-sdk as the motivation for this mechanism; whether FastMCP has since resolved that limitation independently is not addressed in the paper, and Jose did not respond to questions by publication time. The framework is MIT-licensed and available on PyPI.
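The general Python behavior behind that limitation can be shown with the standard library alone. The sketch below does not reproduce HarnessAPI's mechanism or FastMCP's internals, which this article does not detail; it only illustrates why a generic closure hides a handler's typed parameters from signature inspection, and how generating a wrapper from source text can restore them. All function names are hypothetical, and the generator assumes the handler has only ordinary positional-or-keyword parameters.

```python
import inspect
from typing import get_type_hints


def search_handler(query: str, limit: int = 10) -> list:
    """Hypothetical skill handler."""
    return [f"{query}-{i}" for i in range(limit)]


def naive_wrap(handler):
    """Closure-based wrapper: callable, but its signature is just (**kwargs)."""
    def tool(**kwargs):
        return handler(**kwargs)
    return tool


def generated_wrap(handler):
    """Build a wrapper from source text so it has real, named parameters.

    Assumes only positional-or-keyword parameters (no *args or **kwargs).
    """
    sig = inspect.signature(handler)
    names = list(sig.parameters)
    params = ", ".join(names)
    call = ", ".join(f"{n}={n}" for n in names)
    source = f"def tool({params}):\n    return _handler({call})\n"
    namespace = {"_handler": handler}
    exec(source, namespace)  # defines `tool` inside namespace
    tool = namespace["tool"]
    # Copy the metadata that inspection-based schema builders read.
    tool.__name__ = handler.__name__
    tool.__annotations__ = dict(get_type_hints(handler))
    tool.__defaults__ = tuple(
        p.default
        for p in sig.parameters.values()
        if p.default is not inspect.Parameter.empty
    )
    return tool


naive = naive_wrap(search_handler)
generated = generated_wrap(search_handler)

print(inspect.signature(naive))      # (**kwargs)
print(inspect.signature(generated))  # (query: str, limit: int = 10) -> list
```

A schema builder that relies on `inspect.signature` or `get_type_hints` sees no typed fields on the naive wrapper, while the generated one exposes the same parameters, types, and defaults as the original handler.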

The 74 percent dual-stack reduction is real but narrow in scope. The paper measures framework-facing boilerplate (route decorators, tool registration calls, schema duplication), not end-to-end application code. A dual-stack implementation with minimal boilerplate (one endpoint, one tool) would narrow the gap; a complex codebase with many endpoints would widen it. The paper does not disclose the cloc-measured line counts for either the HarnessAPI skills or the baseline, so the comparison cannot be independently audited. The number is best read as a directional indicator: the skill-first approach reduces ceremony, and that direction holds even if the specific figure cannot be reproduced without the raw data.

Patil et al. showed that when type schemas are absent or stale, language models hallucinate API calls at significantly higher rates. That finding connects directly to the dual-maintenance problem: the schema drift is causing the hallucination, and the hallucination is being diagnosed as a model failure.

The honest summary: this is a legitimate problem that a growing number of teams are discovering independently, and Jose has shipped a concrete implementation with a well-reasoned argument behind it. The 74 percent figure is a useful anchor, but it should be treated as a directional claim, not a verified benchmark — because no one outside the project has run the measurement yet. What the paper does establish is that the class of problem exists, is systemic, and is not going away as MCP adoption accelerates across Claude, Cursor, and GitHub Copilot.

The relevant question for engineers evaluating HarnessAPI is not whether the 74 percent figure survives contact with a real codebase. It is whether the skill-first pattern — one definition, multiple transports — becomes the baseline architecture for agent tool deployment. The evidence for that direction is already accumulating.
