IBM Says 88% of AI Agent Pilots Fail at the Governance Layer, Not the Model
Most large enterprises will operate more than 1,600 AI agents by the end of this year. Seven in ten of those enterprises say the governance they have in place is not fit for purpose. And 88 percent of AI agent pilots never reach production — not because the model fails, but because the layer around the model does.
That layer is what IBM Haifa is trying to redesign.
The research center published a paper on May 20 describing what it calls CUGA: a policy system that enforces five runtime checkpoints on generalist AI agents. The checkpoints intercept an agent before it plans a task, inside its system prompt, at the moment it calls a tool, outside its reasoning loop, and at the point where it returns output. Each one is a gate where a governance rule can say no.
The paper's framing is architectural: governance should be built into the agent, not inspected after it ships. The analogy IBM uses is DevOps, the practice of weaving security and testing into software development rather than bolting them on at the end. "Governance by construction" is the phrase in the title. The argument is that AI agents are now numerous enough and consequential enough to require the same shift.
The numbers IBM brings to that argument are large. By year-end, large enterprises will operate more than 1,600 AI agents each, according to IBM's Think 2026 survey data. Seventy percent of those enterprises describe their existing governance as not fit for purpose. Eighty-eight percent of AI agent pilots never reach production, with failure clustering at the layer around the model rather than in the model itself. Those figures describe an operational crisis. The models work.
An independent survey tells a consistent story, though at a smaller scale. OutSystems found that 96 percent of enterprises have some level of agent adoption, with the average organization running 12 agents today and expecting to reach 20 by 2027. The governance problem is not hypothetical. It is arriving with the agents.
The failures cluster at the layer between the model and the tools it calls. An agent that can call internal tools, query a database, send a message, or approve a transaction needs rules for each of those actions. Those rules are not model problems. They are infrastructure problems, and they look more like access control lists than like model weights.
CUGA's five checkpoints are an attempt to make that infrastructure explicit and enforceable. The system intercepts an agent upstream of planning, inside the system prompt, at the tool-call boundary, outside the reasoning loop, and at output. Each checkpoint evaluates a policy — a stored rule that can permit, restrict, or redirect the agent's next step.
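To make the checkpoint idea concrete, here is a minimal Python sketch of a policy gate. It is not CUGA's code or API. The names (Checkpoint, Verdict, Policy, evaluate), the example tools, and the 1,000 limit are all hypothetical, and the reading of "restrict" as blocking a step and "redirect" as swapping in a different step is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Checkpoint(Enum):
    # The five interception points named in the article (labels are hypothetical).
    PRE_PLANNING = "pre_planning"
    SYSTEM_PROMPT = "system_prompt"
    TOOL_CALL = "tool_call"
    OUTSIDE_LOOP = "outside_loop"
    OUTPUT = "output"


class Verdict(Enum):
    PERMIT = "permit"
    RESTRICT = "restrict"   # assumption: the step is blocked
    REDIRECT = "redirect"   # assumption: the step is replaced with another


@dataclass
class Decision:
    verdict: Verdict
    payload: dict
    reason: str = ""


@dataclass
class Policy:
    name: str
    checkpoint: Checkpoint
    applies: Callable[[dict], bool]      # does this rule match the payload?
    decide: Callable[[dict], Decision]   # what happens when it matches?


def evaluate(checkpoint: Checkpoint, payload: dict, policies: list) -> Decision:
    """Return the decision of the first matching policy, or PERMIT if none match."""
    for policy in policies:
        if policy.checkpoint is checkpoint and policy.applies(payload):
            return policy.decide(payload)
    return Decision(Verdict.PERMIT, payload)


# Two hypothetical rules, both attached to the tool-call checkpoint.
POLICIES = [
    Policy(
        name="cap_transaction_amount",
        checkpoint=Checkpoint.TOOL_CALL,
        applies=lambda p: p.get("tool") == "approve_transaction"
        and p.get("amount", 0) > 1000,
        decide=lambda p: Decision(Verdict.RESTRICT, p, "amount exceeds limit"),
    ),
    Policy(
        name="route_messages_to_draft",
        checkpoint=Checkpoint.TOOL_CALL,
        applies=lambda p: p.get("tool") == "send_message",
        decide=lambda p: Decision(
            Verdict.REDIRECT, {**p, "tool": "draft_message"}, "needs human review"
        ),
    ),
]

if __name__ == "__main__":
    call = {"tool": "approve_transaction", "amount": 5000}
    decision = evaluate(Checkpoint.TOOL_CALL, call, POLICIES)
    print(decision.verdict.value, "-", decision.reason)
```

In this sketch the first matching rule wins and an unmatched step is permitted by default. A stricter deployment might prefer to deny by default; that is a design choice the sketch leaves open.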
The system is open source, built on LangGraph, a framework for structuring multi-step agentic workflows. It uses Milvus as a vector database for semantic policy retrieval, meaning it can match governance rules to situations the original authors did not explicitly anticipate. It supports open-weight models including GPT-OSS-120B for on-premises deployment, relevant for enterprises that cannot send data to external APIs.
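The retrieval step can be sketched in a few lines of Python. This is a toy under stated assumptions, not CUGA's implementation and not the Milvus API: the embed function is a bag-of-words stand-in for a real embedding model, the in-memory dictionary stands in for a vector database, and the policy names and texts are invented. It shows only the shape of the idea, which is to score stored rules by similarity to the current situation and return the closest ones.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical policy store: name -> natural-language rule text.
POLICY_STORE = {
    "no_external_pii": "never send customer personal data to an external recipient",
    "payment_limit": "require human approval before a payment or transaction above the limit",
    "read_only_db": "do not delete or modify records in the production database",
}


def retrieve(situation: str, store: dict, top_k: int = 1, min_score: float = 0.2) -> list:
    """Return up to top_k policy names whose text is most similar to the situation."""
    query = embed(situation)
    scored = [(cosine(query, embed(text)), name) for name, text in store.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score >= min_score]


if __name__ == "__main__":
    print(retrieve("agent wants to delete records in the database", POLICY_STORE))
```

Word overlap cannot match a paraphrase that shares no vocabulary with the rule, which is the gap a learned embedding model is meant to close. The min_score cutoff is likewise an assumed parameter that keeps unrelated situations from pulling in an arbitrary rule.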
On benchmark performance, CUGA placed first on AppWorld, a test suite of 750 real-world tasks across 457 APIs, and third on WebArena, which evaluates agents on web-based tasks. A pilot in the business process outsourcing domain for talent acquisition showed accuracy approaching that of specialized agents without the months of custom development typically required. Enterprise-grade compliance and auditability were reported, though not independently verified.
The most concrete example IBM points to is a Blue Pearl case study in which 127 deprecated Java APIs had to be updated. The original estimate called for 14 developers working for nine months. Using an orchestrated agent system with proper governance and tool access, the migration was completed in three days. The gap is large enough to be worth noting, though the figures come from IBM's own reporting and have not been independently audited.
The paper was submitted to arXiv on May 20 with 10 authors from IBM Haifa. It has not yet undergone peer review. CUGA is being presented at the Conference on AI Safety 2026.
The structural claim, that governance belongs in the agent rather than around it, resonates beyond IBM's own stack. The 88 percent pilot failure rate, if it generalizes, describes an industry that has been throwing models at production problems while ignoring the operational layer that determines whether those models actually run. The shift would move governance from an IT afterthought to a first-class engineering discipline, analogous to how DevOps changed security from a final checkpoint to a continuous part of the development cycle.
Whether CUGA itself is the answer is a separate question. It is one architecture, open source, demo-tested, not production-hardened at scale. The more durable point is the framing: when a technology graduates from capability to infrastructure, the interesting engineering questions stop being about what the system can do and start being about who is allowed to ask it to do things, and under what conditions. That is not a model problem. It is a construction problem. The question is whether the industry starts building it that way, or keeps discovering the gap only after the agents are already in production.