Your Agents Are More Broken Than Your Metrics Claim
LangChain's Agent Evaluation Readiness Checklist provides solid guidance on evaluation methodology—emphasizing error analysis and trace review—but fails to acknowledge that the benchmarks teams are meant to use are themselves deeply flawed. A July 2025 University of Illinois study evaluating ten popular agentic benchmarks found systematic structural failures: 7 of 10 benchmarks contain shortcuts or impossible tasks, 7 fail outcome validity, and several leaderboard rankings are demonstrably incorrect due to evaluation bugs. The benchmarks make progress measurable, but not in ways that correspond to actual agent capability.
- A July 2025 study found 7 out of 10 popular agentic benchmarks fail outcome validity, containing shortcuts, impossible tasks, or undisclosed known issues — a pattern of structural failure rather than random noise.
- A do-nothing agent (one that takes no action and returns a placeholder) scores 38% correct on tau-bench airline tasks, outperforming GPT-4o-based agents on the same benchmark — demonstrating how scoring logic can be fundamentally misaligned with task completion.
- 24% of SWE-bench Verified top-50 leaderboard positions are incorrect due to evaluation bugs, making benchmark rankings unreliable for comparing agent performance.
LangChain published an evaluation checklist last week. It's genuinely useful — and entirely useless if you're measuring the wrong things.
The Agent Evaluation Readiness Checklist, written by Victor Moreira, a deployed engineer at LangChain, is the framework's attempt to systematize how teams should think about measuring AI agents before shipping them. It lays out three evaluation levels (single-step runs, full-turn traces, and multi-turn threads), distinguishes capability evals from regression tests, and offers concrete guidance on where to spend your time: 60 to 80 percent of evaluation effort, it says, should go to error analysis. Gather traces, open-code them, categorize failure modes, iterate. Fine advice.
The problem is what LangChain's checklist doesn't mention: the benchmarks you're using to measure progress may be telling you lies.
The Paper LangChain Didn't Cite
In July 2025, a team led by Daniel Kang at the University of Illinois published "Establishing Best Practices for Building Rigorous Agentic Benchmarks" on arXiv, asking, in effect, whether AI agent benchmarks are broken. The answer, after evaluating ten popular agentic benchmarks using what the authors call the ABC framework (a 43-item checklist covering task validity, outcome validity, and reporting standards) was: yes, systematically.
Seven out of ten popular benchmarks, the researchers found, contain shortcuts or impossible tasks. Seven out of ten fail outcome validity. Eight out of ten don't disclose known issues. The numbers are bleak in a way that suggests not random noise but structural failure: the community has been running the wrong tests and treating the scores as signal.
The specific findings are damning enough to be worth dwelling on. A do-nothing agent — one that takes no action, simply returns a placeholder — scores 38 percent correct on tau-bench airline tasks, outperforming a GPT-4o-based agent on the same benchmark. It does this because tau-bench's scoring accepts trivial outputs as valid in ways that don't reflect actual task completion. On SWE-bench Verified, a benchmark used to rank agents on real software engineering tasks, 24 percent of the top-50 leaderboard positions are incorrect due to evaluation bugs. On SWE-Lancer, an agent can score 100 percent without resolving any tasks — the scoring accepts what looks like progress as progress. KernelBench overestimates agent capability by 31 percent due to incomplete fuzz testing. WebArena inflates scores by 5.2 percent through string-matching issues in its grading code.
These aren't edge cases. They're a pattern. The benchmarks were built to make progress measurable, but they made it measurable in ways that don't correspond to actual capability.
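The failure modes above share a shape: the grader accepts outputs that look like answers without verifying the task was done. A minimal hypothetical sketch (the function and examples are ours, not code from tau-bench or WebArena) shows how both a vacuous-pass bug and loose string matching can reward a do-nothing agent:

```python
# Hypothetical grader illustrating the failure modes described above.
# The logic is illustrative only, not taken from any real benchmark.

def lenient_grade(expected: str, agent_output: str) -> bool:
    # Bug 1: tasks whose "correct" outcome is no change auto-pass,
    # so an agent that does nothing gets credit.
    if expected == "":
        return True
    # Bug 2: substring matching accepts answers that merely mention
    # the expected string, e.g. inside an apology.
    return expected.lower() in agent_output.lower()

print(lenient_grade("", "I could not complete this task."))  # True
print(lenient_grade("AA101", "Sorry, I could not book flight AA101."))  # True
```

Both calls return True even though neither output completes anything, which is exactly how a placeholder-returning agent can post a nonzero score.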
LangChain's checklist recommends that teams "define clear metrics" and "choose appropriate benchmarks." If your team is using any of the ten benchmarks Kang et al. evaluated, you're measuring something, but it's not clear what.
What LangChain Gets Right
None of this means the checklist is worthless. Moreira's three-tier model — run (single step), trace (full conversation), thread (multi-turn) — is a clean taxonomy that most teams don't have explicitly. The distinction between capability evals ("what can it do?") and regression evals ("does it still work?") is sound. And the emphasis on error analysis as the primary investment makes sense: teams that only measure success rates are missing the failure modes that matter.
LangChain's State of Agent Engineering survey, which the checklist draws on, provides useful aggregate context. More than half of surveyed organizations — 57.3 percent — now have agents running in production. Eighty-nine percent have implemented some form of observability. Quality has surpassed cost as the top production barrier, cited by 32 percent of respondents. These numbers tell you that agent deployment is no longer experimental for most organizations, which makes evaluation infrastructure more urgent, not less.
Witan Labs, which publishes benchmark research on its research log, offers a practitioner case study in why the error-analysis emphasis matters. The team found that a single-character bug in their extraction code — a wrong function argument — silently corrupted results across a large fraction of tasks. Fixing that one character moved their benchmark score from roughly 50 percent to 73 percent, more than any prompt engineering had achieved. The agent had looked confused. The actual problem was upstream: it was reasoning correctly over corrupted inputs.
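Witan Labs hasn't published the exact code, but the class of bug is easy to picture: a single wrong index or argument that returns the wrong field, so every downstream step reasons over the wrong input. A contrived illustration:

```python
# Illustrative only: a one-character argument bug silently corrupting
# extraction. This is not Witan Labs' actual code.

def extract_answer(fields: list[str], col: int) -> str:
    return fields[col].strip()

row = "task_42|resolved|The refund was processed".split("|")

print(extract_answer(row, 1))  # "resolved" -- wrong field, no error raised
print(extract_answer(row, 2))  # "The refund was processed" -- the fix
```

Nothing crashes, no exception fires; the agent just looks confused, which is why only trace review, not aggregate scores, surfaces this class of bug.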
This is the kind of failure mode LangChain's checklist is designed to catch. It's real. It's common. It's also exactly the kind of failure that disappears if you're only looking at aggregate scores rather than examining traces.
The Gap Between Process and Evidence
What the checklist can't do is validate the benchmarks themselves. For that, Kang et al.'s ABC framework is what teams actually need before choosing what to measure. The researchers found that applying ABC to CVE-Bench reduced performance overestimation by 33 percentage points in absolute terms, a correction confirmed by cybersecurity experts. The gap between "benchmark says X" and "agent actually does X" can be enormous, and LangChain's checklist has nothing to say about it.
This is a familiar pattern in agent infrastructure: the tooling for measuring progress is ahead of the tooling for ensuring the measurement is valid. Teams adopt evaluation frameworks, run thousands of test cases, and feel the satisfaction of numerical progress — without asking whether the numbers mean anything.
The irony is not lost on practitioners. "We also learned the hard way that LLM-as-judge is unreliable for anything with a correct answer," Witan Labs notes on its research log. "It made inconsistent judgments that masked real regressions. Programmatic comparison is slower to build and worth every hour." This is the kind of hard-won knowledge that belongs in evaluation guidance. It's also absent from most benchmark announcements, which prefer to report headline numbers rather than the methodology limitations that make those numbers unreliable.
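For tasks that do have a single correct answer, the programmatic alternative Witan Labs favors can be as simple as normalize-then-compare. A sketch (function names are ours):

```python
# Sketch of programmatic comparison for tasks with one correct answer,
# replacing an LLM judge. Normalization keeps formatting differences
# ("$1,234.00" vs "1234.00") from registering as regressions.
import re

def normalize(answer: str) -> str:
    # Lowercase, drop thousands separators and non-alphanumerics.
    return re.sub(r"[^a-z0-9.]", "", answer.lower().replace(",", ""))

def exact_match(expected: str, actual: str) -> bool:
    return normalize(expected) == normalize(actual)

print(exact_match("$1,234.00", "1234.00"))  # True
print(exact_match("1234", "1243"))          # False
```

Unlike a judge model, this comparator gives the same verdict every run, so a score change always reflects a real change in agent behavior.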
What Teams Should Actually Do
The honest answer is: it's complicated, and the tools aren't ready.
Kang et al. don't argue that agent evaluation is impossible — they argue that current benchmarks need auditing before use. The ABC framework they propose is a starting point: check whether your benchmark's scoring corresponds to actual task completion, whether tasks are achievable without shortcuts, and whether the benchmark discloses known failure modes. Many don't.
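One cheap audit in the spirit of that advice: run a null agent through your benchmark's own scorer before trusting it. A nonzero pass rate flags exactly the outcome-validity problem seen in tau-bench. In this sketch, `score_fn` and `tasks` are stand-ins for your benchmark's real interfaces, and `buggy_score` is a deliberately broken toy scorer:

```python
# Audit sketch: does a do-nothing agent score above zero on your benchmark?

def null_agent(task) -> str:
    return ""  # takes no action, returns a placeholder

def audit_null_baseline(tasks, score_fn) -> float:
    passes = sum(bool(score_fn(t, null_agent(t))) for t in tasks)
    rate = passes / len(tasks)
    if rate > 0:
        print(f"WARNING: null agent passes {rate:.0%} of tasks")
    return rate

# Toy scorer with the vacuous-pass bug: "" is a substring of everything,
# so tasks whose expected output is empty auto-pass.
toy_tasks = [{"expected": ""}, {"expected": "refund issued"}]
buggy_score = lambda task, output: task["expected"] in output

audit_null_baseline(toy_tasks, buggy_score)  # warns: 50% pass rate
```

If the warning fires on your real benchmark, fix the scorer before spending any error-analysis budget on agent behavior.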
LangChain's checklist is useful as a process guide for teams that have already chosen benchmarks and want to systematize their evaluation workflow. It's not useful as a guide to choosing benchmarks wisely, because it doesn't engage with the evidence that most commonly used benchmarks have significant validity problems.
The 60-to-80 percent error-analysis allocation is good advice — but only if you're error-analyzing meaningful outputs. If your benchmark's scoring is broken, you're analyzing errors in a system that measures the wrong thing.
For teams shipping agents in production, the practical implication is uncomfortable: invest in benchmark auditing before investing in evaluation infrastructure. The process LangChain describes will make your measurement more rigorous. The Kang et al. findings suggest you should first check whether what you're measuring is worth measuring at all.
The Kang et al. paper is a preprint posted to arXiv in July 2025. LangChain's checklist is live on the LangChain blog.
Sources
- infoq.com — Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned (InfoQ)
- zenml.io — LangChain Evaluation Patterns for Deep Agents in Production (ZenML LLMOps DB)
- blog.langchain.com — Agent Observability Powers Agent Evaluation (LangChain Blog)
- arxiv.org — Establishing Best Practices for Building Rigorous Agentic Benchmarks (arXiv:2507.02825)
- blog.langchain.com — Agent Evaluation Readiness Checklist (LangChain Blog)
- medium.com — AI Agent Benchmarks are Broken (Daniel Kang, Medium)
- langchain.com