The Rulers Are Broken: How AI Benchmarks Stopped Measuring What Matters
Anthropic's newest AI model just broke a benchmark record. The organization that published the score warned, in the same post, that it might be meaningless.

When a test becomes high stakes, the test stops working. Educational researchers documented this through the 1990s and 2000s, most visibly as No Child Left Behind turned standardized tests into the primary measure of school quality: schools taught to the test, scores rose, and nobody knew whether actual learning had improved. The measure had become a substitute for the thing it was supposed to measure. Economists call it Goodhart's Law. The AI industry is discovering it the hard way.

Epoch AI published the latest update to its Epoch Capabilities Index last week, a composite score designed to sum up where frontier AI models stand relative to each other. Anthropic's new Opus 4.7 scored 156, putting it behind GPT-5.4 and Gemini 3.1 Pro but ahead of everything else generally available. On SWE-bench Verified, a benchmark that tests whether a model can resolve real GitHub software issues end-to-end, Opus 4.7 scored 83 percent. A new record. Epoch AI said so in the same post where it warned that SWE-bench Verified may be, in its words, "increasingly contaminated." It is hard to know how to receive a new high score accompanied by a quality warning from the people who compiled it.

Contamination is the benchmark world's version of teaching to the test. As SWE-bench became consequential, with AI labs citing it in announcements, investors factoring it into decisions, and rankings starting to correlate with commercial outcomes, the incentive to optimize for the specific problems in the benchmark grew. Models trained on data that includes GitHub issues will have seen those issues before. The 83 percent number may be real. It may also be measuring familiarity rather than software engineering ability.

Opus 4.7 produced two other numbers in last week's release that got less attention and that illustrate the broader problem. On a set of chess puzzles, the model hit its output token limit before producing an answer on 31 percent of problems: not because it did not know the answer, but because the reasoning chain required to demonstrate it exceeded what the model could generate within its operational limits. It also refused, on safety grounds, to answer some questions on GPQA Diamond, a graduate-level science reasoning benchmark, even though the benchmark is not built around dangerous material. The refusals created a measurable gap between what Opus 4.7 can do and what the score reflects.

These are three independent failure modes in a single model release: contamination in the training data, compute limits on reasoning chains, and safety systems that interfere with measurement. Each is a known problem in the field. Nobody has a clean solution for any of them. The benchmark ecosystem is trying to measure a moving target with a ruler that stretches when you look at it too hard.

Independent evaluators add another layer of uncertainty. Marginlab, which has been running Opus 4.7 through SWE-bench Pro on a curated, contamination-resistant subset using the Claude Code CLI directly, recorded a 66 percent pass rate over the past week, well below the 83 to 87.6 percent range that Anthropic reported on SWE-bench Verified. The sample sizes are small and the evaluation methodologies differ, so the numbers are not directly comparable. But the gap is real, and it points in the direction you would expect if the benchmark is contaminated.
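One way to read numbers like these is to put error bars on them. The sketch below computes 95 percent Wilson score intervals for two pass rates; SWE-bench Verified has 500 tasks, while the size of Marginlab's curated subset has not been disclosed, so the 100-task figure is an assumption made purely for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# SWE-bench Verified has 500 tasks; the 100-task subset size below is an assumption,
# chosen only to illustrate how small samples widen the error bars.
for label, passed, total in [
    ("83% reported on SWE-bench Verified", 415, 500),
    ("66% reported on a hypothetical 100-task subset", 66, 100),
]:
    lo, hi = wilson_interval(passed, total)
    print(f"{label}: 95% CI ~{lo:.0%} to {hi:.0%}")
```

Under those assumed sizes, the two intervals (roughly 79 to 86 percent and 56 to 75 percent) do not overlap, which is consistent with a real gap even if its exact size is uncertain.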
This matters beyond the usual caveat about taking AI benchmarks with appropriate skepticism. The SWE-bench Verified result anchored Anthropic's claim that Opus 4.7 represents a meaningful leap in autonomous coding capability, the kind of capability that justifies building agentic workflows, multi-step automation pipelines, and production systems that hand off complex tasks to the model without human review. If that number is inflated by contamination, so is the case for the downstream investments built on it.

Anthropic released Opus 4.7 last week across Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, with pricing unchanged at $5 per million input tokens and $25 per million output tokens. The company's own launch post emphasizes improvements in self-verification (the model's tendency to check its own work before reporting back) and in handling long-running, multi-step tasks. Early testers, including Cursor, Replit, Hex, Harvey, and Cognition's Devin, described meaningful improvements in coding workflows and agentic reliability. The model appears genuinely better at the things that can actually be measured. The question is whether the things being measured are the things that matter.

The education analogy is not a loose metaphor. The College Board has repeatedly redesigned the SAT since its 1926 debut in response to prep coaching, score inflation, and the gap between test performance and actual college readiness. The GRE underwent a major overhaul in 2011 for similar reasons. NCLB's emphasis on reading and math proficiency drove a measurable narrowing of curriculum: schools dropped art, social studies, and science to spend more time on the tested subjects. The history of high-stakes measurement is a history of eventual failure followed by reluctant reform.

AI benchmarks are in an earlier phase of the same cycle. Epoch AI is not naive about this: it publishes its own contamination warnings, its ECI methodology page notes that scores "evolve" as new benchmarks are added and old ones are retired, and it drops any model with fewer than four benchmark scores from the index. These are not the actions of an organization that believes its numbers are perfect. They are the actions of an organization trying to hold a moving standard steady while the industry pulls in the opposite direction.

The deeper problem is that the benchmark ecosystem has no external arbiter. There is no College Board for AI capability, no independent body with the authority to retire contaminated benchmarks, impose quiet periods on newly released evaluations, or mandate disclosure standards for evaluation methodology. Epoch AI, SWE-bench's maintainers, and the various labs that run their own internal evaluations are all operating under the same incentive structure: the numbers that look good get cited, the numbers that look bad get noted and set aside. The market for AI capability measurement is a market with no quality certification, no consumer protection, and no effective recourse when the ruler turns out to be wrong.

This is not an argument that Opus 4.7 is a bad model, or that the improvements are illusory, or that the benchmark numbers should be ignored entirely. The self-verification behavior described in early testing reports suggests something genuine about how the model handles long-horizon reasoning. And the token efficiency gains are real and economically significant: Latent Space reported that overall token use is down by as much as 50 percent on equivalent tasks, even though a new tokenizer can increase raw token counts by up to 35 percent.
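To make the economics concrete, here is a minimal sketch of what that claim implies for per-task cost, using the published per-token prices and taking the Latent Space upper-bound figures at face value. The baseline token counts are invented for illustration.

```python
# Published Opus 4.7 prices: $5 per million input tokens, $25 per million output tokens.
PRICE_IN = 5 / 1_000_000    # dollars per input token
PRICE_OUT = 25 / 1_000_000  # dollars per output token

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the published per-token rates."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Hypothetical long-running agentic task; these counts are made up for illustration.
baseline_in, baseline_out = 200_000, 40_000

# "Overall token use down up to 50 percent": halve the billed tokens on the same task.
print(f"baseline cost:  ${task_cost(baseline_in, baseline_out):.2f}")            # $2.00
print(f"best-case cost: ${task_cost(baseline_in // 2, baseline_out // 2):.2f}")  # $1.00

# The new tokenizer reportedly inflates raw counts by up to 35 percent, so a 50
# percent drop in billed tokens implies an even larger drop in text actually
# generated: 0.5 / 1.35, or roughly 37 percent of baseline in old-tokenizer units.
print(f"implied share of baseline, old-tokenizer units: {0.5 / 1.35:.0%}")
```

At unchanged prices, in other words, the headline efficiency figure translates into roughly halved per-task spend in the best case, with the caveat that both percentages are reported upper bounds.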
The model is available, the pricing is unchanged, and developers who want to evaluate it against their specific workloads can do so today. The argument, instead, is that the leaderboard is broken in ways that are not fixable without institutional changes the industry has not yet made. As models cluster closer together at the top of every benchmark, the differences that remain matter more, and the measurement error that surrounds them matters more too. Opus 4.7 leads GPT-5.4 on only 7 of 11 directly comparable benchmarks, a margin that falls well within the combined error bars of benchmarks that each have their own contamination, compute-limit, and safety-refusal problems.

The rulers are broken. The labs know it. The researchers who build the benchmarks know it. What nobody has figured out is what to use instead.





