When Demis Hassabis walked into Google's London office in 2014, he was not sure he would get the money. DeepMind, the AI lab he had co-founded four years earlier, had assembled something unusual: a team fluent in both the neuroscience of the brain and the mathematics of reinforcement learning. But the vision was large and the bet was expensive. Hassabis wanted to build a system that could master any task, starting with Go, a game so complex that the best computer programs of the day played at the level of a competent amateur. It is a period DeepMind's own ten-year retrospective revisits in detail.
Ten years ago this month, AlphaGo defeated Lee Sedol, a world-champion Go player, in a five-game match in Seoul, South Korea. The final score was 4-1, and more than 200 million people followed the match online. The result arrived a decade before most experts expected. Game two turned on move 37, a play so unconventional that professional commentators assumed it was an error. It proved decisive. Hassabis himself has described the move, documented in DeepMind's retrospective, as the moment that changed how the field thought about what was possible.
But the most important thing about AlphaGo was not the victory. It was what the victory proved about method.
The dominant approach to AI in the mid-2010s was still largely built on handcrafted features and expert knowledge. Deep Blue had beaten Kasparov in 1997 by searching through some 200 million positions per second, guided by chess-specific programming, and the field had spent two decades building on that paradigm. AlphaGo did not win by refining search; it won by learning. The system used two neural networks: one, the policy network, learned to propose moves by training on millions of human games; the other, the value network, learned to evaluate board positions. Neither network knew anything about Go beyond the data it had seen. A search algorithm, Monte Carlo Tree Search, then allowed the system to simulate thousands of possible futures from any position and weigh them by probability of winning.
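In code, that division of labor reads roughly like this. The sketch below is illustrative, not DeepMind's implementation: the game is a trivial counting game rather than Go, and hand-coded heuristics stand in for the learned policy and value networks, so only the search structure matches.

```python
import math

# Illustrative sketch of AlphaGo's structure: a policy function proposes
# moves, a value function scores positions, and Monte Carlo Tree Search
# uses both to look ahead. The game here is a toy: players alternately
# add 1 or 2 to a running total, and whoever reaches 10 wins.

TARGET = 10  # state = (total, player); player is +1 or -1

def legal_moves(state):
    total, _player = state
    return [m for m in (1, 2) if total + m <= TARGET]

def apply_move(state, move):
    total, player = state
    return (total + move, -player)

def policy_net(state):
    """Stand-in for the learned policy: a prior over legal moves."""
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}

def value_net(state):
    """Stand-in for the learned value net: win probability (0 to 1)
    for the player to move. Real AlphaGo learned this from data."""
    total, _player = state
    if total == TARGET:
        return 0.0  # the opponent just reached the target; we have lost
    return 1.0 if (TARGET - total) % 3 != 0 else 0.0

class Node:
    def __init__(self, state, prior):
        self.state, self.prior = state, prior
        self.children = {}  # move -> Node
        self.visits, self.value_sum = 0, 0.0

    def q(self):  # mean value from the perspective of this node's mover
        return self.value_sum / self.visits if self.visits else 0.0

def select(node, c_puct=1.5):
    """PUCT rule: exploit high-value moves, but add an exploration
    bonus proportional to the policy prior, as in AlphaGo's search."""
    def score(child):
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return (1.0 - child.q()) + u  # child's q is the opponent's view
    return max(node.children, key=lambda m: score(node.children[m]))

def mcts(root_state, n_sims=200):
    root = Node(root_state, prior=1.0)
    for _ in range(n_sims):
        node, path = root, [root]
        while node.children:  # 1. select down to a leaf
            node = node.children[select(node)]
            path.append(node)
        priors = policy_net(node.state)  # 2. expand with policy priors
        for m, p in priors.items():
            node.children[m] = Node(apply_move(node.state, m), p)
        v = value_net(node.state)  # 3. evaluate the leaf with the value net
        for n in reversed(path):  # 4. back up, flipping perspective per ply
            n.visits += 1
            n.value_sum += v
            v = 1.0 - v
    # The recommended move is the most-visited child of the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

With the value function hand-coded, the search converges almost immediately; AlphaGo's point was that both functions could be learned from data, at which point the same loop scales to Go.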
The combination was not obviously correct. It required GPU clusters that were expensive to run, large quantities of training data, and a research organization willing to bet that learning would outperform expert engineering. The original AlphaGo ran on approximately 1,920 CPU cores and 280 GPUs, and training a competitive version took weeks; the infrastructure requirements are documented in the original AlphaGo Nature paper. Google backed the bet because it had the hardware. The field watched and drew its own conclusions about which institutions could afford to run the experiment again.
Pushmeet Kohli, vice president of science and strategic initiatives at Google DeepMind, told The Atlantic's Matteo Wong, in a piece marking the anniversary, that the experience of watching AlphaGo work transformed how the lab thought about what was possible. "Conceptual things emerged from the whole AlphaGo experience which essentially entered the AI vocabulary," he said. One of those concepts was the division between a proposal network and a judgment network: not a Go-specific trick, but a general architectural principle applicable to any domain where a system needs to generate options and evaluate them. This is the underappreciated insight, and it is what the Atlantic's retrospective mostly missed: not that AlphaGo existed, but that it established a structure the field is still building on.
The structure is the generator-verifier gap, and Noam Brown at OpenAI has traced the lineage directly. Brown posted on X that the recipe behind today's frontier reasoning models is "surprisingly similar" to AlphaGo's: first, imitate human data at scale; then scale inference compute to reason better, through Monte Carlo Tree Search in AlphaGo's case and through extended chain-of-thought reasoning in today's models; and finally, use reinforcement learning to go beyond imitation. Brown elaborated on this pipeline in a Latent Space interview, drawing a direct line from AlphaGo and AlphaZero to the reasoning models released by frontier labs in the past two years. The three-step pipeline did not appear out of nowhere. It was worked out in the most legible, highest-stakes proving ground the field had ever constructed.
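Brown's three steps can be made concrete with a deliberately tiny sketch. Everything here is invented for illustration: the "model" is a probability table over three candidate arithmetic rules rather than a neural network, a few demonstrations stand in for human data, and a programmatic verifier stands in for learned reward. Only the shape of the pipeline (imitate, search at inference time, then reinforce) tracks Brown's description.

```python
import random

# Toy pipeline: the "model" is a probability table over candidate rules
# (a stand-in for network weights); the hidden ground truth is that the
# answer to a problem (a, b) is a * b.
RULES = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}

def imitate(demos):
    """Step 1: fit the model to human demonstrations (a, b, answer)."""
    counts = {name: 1.0 for name in RULES}  # smoothed counts
    for a, b, answer in demos:
        for name, fn in RULES.items():
            if fn(a, b) == answer:
                counts[name] += 1.0
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def solve(model, a, b, verifier, n=32, seed=0):
    """Step 2: scale inference compute. Sample many candidates and keep
    the one the verifier likes best (best-of-n in place of tree search)."""
    rng = random.Random(seed)
    names = list(model)
    candidates = rng.choices(names, weights=[model[x] for x in names], k=n)
    return max(candidates, key=lambda name: verifier(name, a, b))

def reinforce(model, problems, verifier, lr=0.5):
    """Step 3: shift probability mass toward verified answers, letting
    the model move beyond what imitation alone taught it."""
    new = dict(model)
    for a, b in problems:
        choice = solve(model, a, b, verifier)
        new[choice] += lr * verifier(choice, a, b)
    total = sum(new.values())
    return {name: p / total for name, p in new.items()}
```

After imitating a handful of ambiguous demonstrations, the model already favors the right rule; reinforcement against the verifier then sharpens that preference well past what the demonstrations alone supported.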
AlphaGo also showed what the method cost to scale. Only an organization with Google's hardware could train and run the original system, and the lesson was not subtle: the approach worked, but finding out whether it worked was expensive. That shaped which organizations could pursue it and which had to wait.
The results downstream of that bet have been peculiar in the best sense. AlphaFold — the protein-folding system that earned DeepMind co-founder Demis Hassabis and research scientist John Jumper the 2024 Nobel Prize in Chemistry — is the clearest measure of what that bet produced at scale. AlphaProof, combining language models with AlphaZero-style self-play reinforcement learning, achieved silver-medal performance at the International Mathematical Olympiad. An advanced version of Gemini Deep Think reached gold-medal Olympiad performance using an approach inspired by AlphaGo. OpenAI's o1 model, trained with large-scale reinforcement learning to reason using chain of thought, is a direct descendant of the same pipeline. The structure that worked on Go turned out to be more general than anyone had guaranteed.
What AlphaGo could not teach, however, was why it worked. Move 37 illustrated this vividly. Professional players studied the game afterward and concluded that AlphaGo had seen a pattern invisible to human intuition: a global shape that became meaningful only many moves later. The system could not explain the move; its value network returned a number, not a theory. This gap between performance and understanding has persisted through every generation of frontier model. AlphaGo introduced the field to the problem. It did not solve it.
Ten years later, the architecture AlphaGo instantiated (imitate human play, scale inference, refine with self-play reinforcement learning) is the operating system of modern AI. Reasoning models released by OpenAI, Anthropic, and Google in the past eighteen months echo this lineage, as Brown has noted. The cost of the bet has not gotten smaller: compute bills for training frontier models run into hundreds of millions of dollars, and the organizations that can make that bet have not proliferated in proportion to the ambitions.
The generator-verifier structure — the thing that made AlphaGo AlphaGo — is now the thing that makes o1 work. That is the part the anniversary coverage keeps glancing past. The lineage is not just historical interest. It is a technical explanation of why reasoning models behave the way they do: they generate possibilities and then check them, scaled up to the point where the checking is itself a learned behavior. Whether that principle continues to generalize — whether there are many more Move 37s waiting in the dark, or whether test-time compute is running into diminishing returns — is the open question the field is working through right now.