Aristarchus of Samos had heliocentrism right in the third century BC. He was wrong about the stars. Or rather, the ancient Greeks were wrong about Aristarchus: they rejected his model because it implied the stars should shift as the Earth moved, and nobody could measure any shift. The first successful measurement of stellar parallax came in 1838, when Friedrich Bessel detected the annual shift of the star 61 Cygni. That is a two-thousand-year verification loop, and in Michael Nielsen's view, AI is currently accelerating the wrong half of the problem: generating hypotheses, not testing them.
Nielsen, a Research Fellow at the Astera Institute and co-author of "A Vision of Metascience," has spent years thinking about how science actually progresses. His argument, laid out in a recent conversation on the Dwarkesh Patel podcast, is straightforward and uncomfortable: the bottleneck in scientific discovery is almost never generating hypotheses. It is verifying them. And the tools for verification are not keeping up.
The implication for AI is pointed. "If you are attempting to reduce science to a process, you are attempting to reduce it to something where there is just a method which you can apply, and you turn the crank and out pops insight," Nielsen said. "You can do a certain amount of that, but you are going to get bottlenecked at the places where your existing method does not apply." The places where methods stop working are exactly where science needs human judgment most, and where current AI tools are weakest.
History is littered with examples. When William Prout hypothesized in 1815 that all elements are built from hydrogen, he ran into a problem: chlorine's atomic weight measured 35.5, not a whole number. The answer required a concept that did not exist yet, the isotope (natural chlorine is a mixture of chlorine-35 and chlorine-37), and that vocabulary gap blocked verification for a century. Michelson conducted his first ether-wind experiment in 1881 and continued running variations through the 1920s, dying in 1931 still convinced the ether existed. The muon experiments confirming time dilation came in 1940 and 1941, more than thirty-five years after Einstein's 1905 prediction.
AlphaFold looks like an AI victory. Nielsen thinks it is actually an infrastructure victory.
"AlphaFold really is not about AI," Nielsen said. "A massive fraction of the success there is the Protein Data Bank. It is basically the story of how we spent many decades obtaining protein structure just by going out and looking very hard at the world experimentally, and then we fitted a nice model at the end of it, which was a tiny fraction of the entire investment."
The Protein Data Bank, founded in 1971 and today maintained by the Worldwide Protein Data Bank consortium (whose members include the RCSB in the United States and PDBe at the European Bioinformatics Institute), had accumulated more than 160,000 experimentally determined protein structures by 2020. Those structures came from X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryo-electron microscopy: decades of painstaking experimental work stretching back to the first protein structure, solved in 1958. DeepMind trained on that corpus. John Jumper, who led the AlphaFold project, has said publicly that the public data were essential.
The pattern repeats in materials science. Google DeepMind's GNoME project, which predicts the stability of inorganic crystals, follows the same playbook: the model proposes candidate structures, density functional theory calculations verify which are actually stable, and the verified results are fed back into the training pipeline, expanding the dataset of known stable materials. The bottleneck was never the model architecture. It was the measurement and curation infrastructure that gave the model something to learn from.
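The shape of that loop is worth making explicit. Here is a toy sketch of an active-learning cycle in the style described above; the functions and data are hypothetical stand-ins (random numbers in place of crystal structures, a parity check in place of density functional theory), not GNoME's actual pipeline:

```python
import random

random.seed(0)

def propose(n):
    """Cheap hypothesis generation: random candidate 'materials' (toy stand-in)."""
    return [random.randint(0, 1000) for _ in range(n)]

def verify(candidate):
    """Expensive verification stand-in (DFT calculations, in GNoME's case)."""
    return candidate % 2 == 0

def active_learning_loop(rounds=3, per_round=100):
    dataset = []                                    # verified, trusted training data
    for _ in range(rounds):
        candidates = propose(per_round)             # generation is fast and cheap
        verified = [c for c in candidates if verify(c)]
        dataset.extend(verified)                    # only verified results accumulate
    return dataset

data = active_learning_loop()
print(len(data))  # only the candidates that survive verification join the corpus
```

The asymmetry Nielsen points to lives in `verify`: in the real pipeline that single call is a physics computation orders of magnitude more expensive than the proposal step, and it is the only thing that grows the trusted dataset.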
What does this mean for labs pouring resources into AI-for-science? Nielsen's argument suggests they may be optimizing the wrong variable. Hypothesis generation is cheap. Verification is expensive and slow and often requires instrumentation that does not yet exist. The labs most likely to accelerate real scientific progress are the ones building measurement infrastructure, not the ones building larger models.
This is not an argument against AI in science. It is an argument for honesty about where the leverage actually sits. AlphaFold worked because someone had already spent sixty years documenting what proteins look like. The next AlphaFold will require the same upstream investment, probably in a domain where that groundwork has not happened yet.
The two-thousand-year gap between Aristarchus and stellar parallax was not a failure of imagination. It was a measurement problem. AI can generate hypotheses faster than any human. Whether those hypotheses get resolved in two years or two millennia depends entirely on what instruments exist to test them.
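The Aristarchus case can be made quantitative with a back-of-the-envelope check. The numbers below are standard astronomical values, not from the conversation: even the largest stellar parallax is nearly two orders of magnitude smaller than what the best pre-telescopic instruments could resolve.

```python
# Distance in parsecs is the reciprocal of the parallax angle in arcseconds:
# d = 1 / p. Proxima Centauri, the nearest star, has the largest parallax.
proxima_parallax_arcsec = 0.768
d_parsec = 1 / proxima_parallax_arcsec        # about 1.30 pc
d_lightyears = d_parsec * 3.2616              # about 4.25 light-years

# The best naked-eye instruments (e.g. Tycho Brahe's) resolved roughly
# one arcminute, i.e. 60 arcseconds.
best_naked_eye_arcsec = 60.0
shortfall = best_naked_eye_arcsec / proxima_parallax_arcsec

print(round(d_lightyears, 2), round(shortfall))  # prints: 4.25 78
```

No amount of cleverness closes a 78x gap in instrument precision; only better instruments do, which is why the loop took until Bessel's telescope-era measurement in 1838 to close.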