Three firms ran independent measurements of Claude Code output this week and reached the same conclusion by three different paths: the tool that hundreds of thousands of developers depend on had quietly degraded for weeks, and only external observers caught it.
Veracode, an application security firm, ran Opus 4.7 against 80 tasks it had tracked for more than a year: 52 percent of completions introduced a vulnerability, up from 50 percent for Sonnet 4.5. OpenAI's models introduced vulnerabilities in roughly 30 percent of the same tasks. TrustedSec, a cybersecurity firm using Claude to generate penetration-testing code, measured a 47.3 percent decline across code correctness, security, and task completion between early February and late April. Stella Laurenzo, a senior AI director at AMD, analyzed 6,852 Claude Code sessions and more than 234,000 tool calls, the individual actions the model takes inside a coding session, and found Claude repeatedly choosing the simplest fix over the correct one. None of these measurements came from Anthropic.
That is the structural finding. AI companies measure their own progress. They do not measure their own failures. The industry blind spot these three audits exposed is not the three engineering failures Anthropic eventually traced through its postmortem — it is that harm accumulated for weeks without any internal detection. External researchers had to quantify what the company would not measure about itself.
The postmortem, published April 23, confirmed three separate failures. On March 4, Anthropic changed Claude Code's default reasoning effort from high to medium, trading quality for speed; the company reverted to the original setting April 7. On March 26, a caching optimization shipped with a bug that cleared reasoning history on every message instead of once after the session went idle, making Claude forgetful and repetitive throughout a session. Anthropic fixed it April 10 in version 2.1.101. On April 16, a system prompt change capped responses at 25 words between tool calls and 100 words for final output; internal evaluations showed a measurable 3 percent drop in coding quality within four days, and the company reverted the cap April 20. All three changes passed through code review, automated tests, end-to-end tests, automated verification, and dogfooding, Anthropic's practice of having its own staff use the product internally. No automated system detected the accumulating harm.
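Anthropic has not published the offending code, but the caching failure it describes, a cache that resets on every message instead of once after an idle period, is easy to picture. The sketch below is a hypothetical illustration, not Anthropic's implementation; the names SessionCache, IDLE_TIMEOUT_SECONDS, and on_message are invented. It shows how a one-line condition error turns a routine optimization into session-wide amnesia of the kind the postmortem describes.

```python
import time

# Hypothetical idle window; the real threshold is not public.
IDLE_TIMEOUT_SECONDS = 30 * 60


class SessionCache:
    """Toy model of a per-session cache holding reasoning history."""

    def __init__(self):
        self.reasoning_history = []
        self.last_seen = time.monotonic()

    def on_message(self, message, buggy=False):
        now = time.monotonic()
        if buggy:
            # Failure mode described in the postmortem: history is wiped on
            # every message, so earlier reasoning is lost and work repeats.
            self.reasoning_history.clear()
        elif now - self.last_seen > IDLE_TIMEOUT_SECONDS:
            # Intended behavior: clear only after the session has gone idle.
            self.reasoning_history.clear()
        self.last_seen = now
        self.reasoning_history.append(message)
        return len(self.reasoning_history)


if __name__ == "__main__":
    correct, broken = SessionCache(), SessionCache()
    for turn in ["plan the fix", "edit the file", "run the tests"]:
        kept = correct.on_message(turn)
        lost = broken.on_message(turn, buggy=True)
        print(f"correct cache holds {kept} turns; buggy cache holds {lost}")
```

In the buggy variant the history never grows past the current message, which is why the symptom shows up as repetition inside long sessions rather than as an outright error that automated tests would flag.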
TrustedSec founder David Kennedy is building on-premises AI infrastructure so his team can run models they control rather than depend on an external API. "Who can we really trust here?" he asked. He is not alone in asking.
Anthropic has 300,000 enterprise customers, had reached 20 million monthly active users by February, and is valued at $380 billion, with annualized recurring revenue of $30 billion, more than triple last year's figure. Its response has been to acknowledge the failures publicly, reset usage limits for all subscribers on April 23, and announce process changes: stricter internal dogfooding, broader evaluation suites for prompt changes, and tighter controls on system-level modifications.
What to watch next: whether those safeguards actually prevent the next incident, and whether enterprise customers who paid for compute and got degraded output will seek compensation or alternatives.
Sources: Anthropic | Forbes / Thomas Brewster | VentureBeat | Fortune