Peer review is supposed to be science's self-correction mechanism. A study posted to arXiv this week by economist Serafin Grundl, discussed on Marginal Revolution, suggests that correction may no longer be exclusively human. When AI agents and 146 economist teams were given the same dataset and tasked with the same causal inference problems, the AI submissions ranked first, second, and third; the best human team came in fourth. No peer reviewer caught what the machine found.
The finding adds context to a capability Ethan Mollick, a professor at the University of Pennsylvania's Wharton School, described on X this week: give an AI the methods and data of a published paper and it will reconstruct the findings from scratch. When it disagrees with the original, the error is usually in the human's work, not the machine's. One example Mollick cited was a demonstration last fall on his blog, in which a Claude model worked through a published MIT economics paper's data files, converted the replication code from one statistical language to another, and ran every calculation from scratch, producing a result that did not match one of the original findings. The error was in the published work, not the reproduction.
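The core of that workflow is simpler than it sounds: re-estimate the model from the posted data and check each coefficient against the published table. The sketch below is purely illustrative; the file name, variable names, and published values are placeholders invented for this example, not taken from the MIT paper or Mollick's demonstration.

```python
# Hypothetical sketch of a reproduction check: re-run a paper's headline
# regression from its posted data and flag estimates that drift from the
# published table. All names and numbers here are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Published coefficient and standard error for each term (placeholders).
PUBLISHED = {"treatment": (0.042, 0.015)}

df = pd.read_csv("replication_data.csv")  # the paper's posted data file
model = smf.ols("outcome ~ treatment + C(region) + year", data=df).fit(
    cov_type="HC1"  # robust standard errors, common in applied economics
)

for term, (coef_pub, se_pub) in PUBLISHED.items():
    coef_new = model.params[term]
    # Disagreement beyond rounding: more than ~2 published SEs apart.
    if abs(coef_new - coef_pub) > 2 * se_pub:
        print(f"MISMATCH {term}: published {coef_pub:.3f}, reproduced {coef_new:.3f}")
    else:
        print(f"OK {term}: reproduced {coef_new:.3f} matches published {coef_pub:.3f}")
```

The hard part, and the part the AI demonstrations automate, is everything the sketch assumes away: parsing the methods section, translating the original code, and deciding which of hundreds of reported numbers are worth checking.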
The consistency gap between AI and human results matters because the scientific literature has a documented reproducibility problem. The landmark Reproducibility Project in psychology found that only between a third and a half of published studies could be successfully replicated, and in experimental economics nearly half of celebrated results did not survive scrutiny when other researchers ran the same analysis. The issue is not fraud. It is that replication is painstaking and expensive. There are too many papers and not enough specialists willing to spend months checking someone else's work.
A $10 million pilot program proposed by the Institute for Progress would use AI agents to systematically reproduce published findings across computational fields (physics, economics, psychology, climate science) at the moment of publication. The goal is not to replace peer review but to add an automated verification layer operating at a scale that human replication efforts never reached.
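What such a layer might look like is easy to sketch, if not to build. The outline below is an assumption-laden illustration, not the Institute for Progress's design: it imagines a submission convention, a run script plus a machine-readable results file in each replication archive, that no journal currently mandates, and every path and file name in it is hypothetical.

```python
# Illustrative sketch of an automated verification layer: for each newly
# submitted paper's replication archive, run its analysis in a clean
# environment and diff the regenerated results against the published ones.
# The directory layout and file conventions are assumptions for this sketch.
import json
import subprocess
from pathlib import Path

def verify(archive: Path) -> dict:
    """Run a replication archive and compare outputs to published results."""
    # Assumed convention: each archive ships a run script and a
    # machine-readable table of its published estimates.
    subprocess.run(["python", "run_analysis.py"], cwd=archive, check=True)
    published = json.loads((archive / "published_results.json").read_text())
    reproduced = json.loads((archive / "output" / "results.json").read_text())
    report = {}
    for key, value in published.items():
        # Relative tolerance absorbs rounding in the published tables.
        report[key] = abs(reproduced[key] - value) <= 1e-3 * max(1.0, abs(value))
    return report

for archive in Path("new_submissions").iterdir():
    flags = verify(archive)
    status = "PASS" if all(flags.values()) else "CHECK"
    print(status, archive.name, flags)
```

Even this toy version makes the scaling argument legible: once the comparison is mechanical, the marginal cost of checking one more paper is compute, not months of a specialist's time.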
Whether AI verification scales to messier fields remains unclear. Clinical trials, field experiments with unpredictable human behavior, and research relying on proprietary data resist fully automated reproduction. The AI systems also inherit whatever biases exist in the data they are given. A verification engine built on flawed archives would reproduce the flaws. And the economics tasks AI agents handled well were well-specified and computationally contained, a different challenge from reproducing a clinical trial with recruitment and compliance complications.
The philosophical question underneath is harder than it looks. When an AI reproduces a paper from its data and methods alone and catches an error, did it understand what the researchers were trying to show, or did it just run the numbers more carefully than they did? The distinction matters for what AI verification actually proves. If the AI simply followed the methods more precisely, the error was a lapse in execution. If the AI second-guessed the analytical intent, the error was conceptual. Nobody has cleanly separated those two things yet.
The honest answer is that nobody knows how far this goes. What is clear is that the verification of scientific results, the painstaking check that was supposed to be science's self-correction mechanism, is no longer exclusively human work. The machines are catching errors that the people who made them never caught. What science does with that is the next question.