DeepMind's Yao: The Scaling Wall Might Not Exist
A senior DeepMind researcher is arguing that the most consequential consensus in AI — that pre-training scaling has hit a ceiling — may be built on faulty experiments, not physics. On a podcast released 13 days ago, Yao Shunyu told host Zhang Xiaojun that progress in pre-training is far from stopping and will continue for at least the next four months. "Many claims of hitting a wall stem from bugs in experimental design," he said.
Six months earlier, Yao had left Anthropic after less than a year, having worked on Claude 3.7 Sonnet. His stated reason, documented on his personal blog: he disagreed with public anti-China statements by Anthropic's leadership, particularly a characterization of China as an adversarial nation. He estimates that disagreement motivated roughly 40 percent of his decision to leave. He joined Google DeepMind on September 29 as a senior research scientist on the Gemini team.
The claim is significant because the "scaling wall" thesis has become foundational to how venture capitalists, researchers, and competitors evaluate which labs are positioned to win the next phase of AI development. If Yao is right, the assumption underlying billions of dollars of investment decisions rests on methodological problems, not physical limits.
Yao is not a marginal figure making a contrarian bet for attention. He contributed to Claude 3.7 Sonnet at Anthropic, and he now works on the Gemini team at DeepMind, whose own results, Gemini 3 Deep Think among them, his argument partly serves to defend. His incentives and his argument point in the same direction. That is a reason for caution, not dismissal: the experimental design argument deserves scrutiny on its own terms.
Yao's specific claim is that researchers studying scaling laws have been drawing conclusions from experiments that were themselves methodologically flawed. He argues that the apparent plateau in pre-training performance reported by several labs reflects how the tests were structured, not any physical limit in the models or the data used to train them. He has not published a peer-reviewed paper making this case. The claim lives in a podcast interview and a personal blog post. Google DeepMind has not published the internal review Yao appears to be referencing. On the available public evidence, the argument can be neither confirmed nor refuted. Testing it requires either access to that internal review or replication by an independent team with comparable compute.
The public benchmark evidence for Gemini 3 Deep Think is genuine. According to 36Kr, the model achieved a 3455 Elo score on Codeforces, roughly equivalent to an 8th-place world ranking. It scored 84.6 percent on ARC-AGI-2 and 48.4 percent on HLE. Google and DeepMind both published the results. Third parties have not independently verified the full run.
Coding and the shape of capability
Yao's other substantive claim is about where AI capability is advancing fastest. "Coding has become the fastest-advancing area for model capabilities precisely because it perfectly satisfies the two core conditions for post-training: clear feedback signal and strong data foundation," he said.
The argument is that code is uniquely suited to AI improvement because GitHub provides an enormous, structured dataset where correct and incorrect outputs can be compared automatically, a feedback signal that doesn't require human labeling at scale. This is an empirical claim that can be checked against published benchmark trends over time, though doing so requires a longitudinal analysis beyond what this piece can carry out in a single news cycle.
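To make "clear feedback signal" concrete, here is a minimal sketch of automated grading for code, assuming a toy task (squaring an integer) and a handful of test cases. The function names and the task are hypothetical illustrations; this is not Yao's method or any lab's actual training pipeline.

```python
from typing import Callable, Iterable, Tuple


def pass_rate(candidate: Callable[[int], int],
              cases: Iterable[Tuple[int, int]]) -> float:
    """Return the fraction of (input, expected) cases the candidate gets right."""
    cases = list(cases)
    passed = 0
    for arg, expected in cases:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that case.
            pass
    return passed / len(cases) if cases else 0.0


# Hypothetical test cases for the toy task "square an integer".
CASES = [(0, 0), (3, 9), (-4, 16)]


def square_correct(x: int) -> int:
    return x * x


def square_buggy(x: int) -> int:
    return x * 2  # wrong for every input except 0


score_correct = pass_rate(square_correct, CASES)
score_buggy = pass_rate(square_buggy, CASES)
print(score_correct, score_buggy)
```

No human has to label either output. The test cases do the grading, and the resulting score is the kind of signal that can be computed at scale.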
Yao also had a sharper observation about what the AI era rewards: "AI, as a field, never really required much brainpower to begin with. The most important traits in this industry are simply being reliable, being meticulous, and taking responsibility for what you do."
The China gap and the distillation question
Yao's comments on US-China AI competition are more measured than those of his peers at some labs. He says the gap is narrowing and that China has been forced into distillation techniques, which extract capabilities from larger models into smaller, more deployable ones. He draws a distinction between soft distillation, which he calls technically interesting, and hard distillation, which he describes as commercially unethical.
He separately credits ByteDance's Doubao model with having "probably the best voice generation in the world" and says US consumer AI product teams are "far behind China" in that domain. These are observational claims, not benchmark-backed ones.
What this story is and isn't
The freshest fact in this piece is the 13-day-old podcast in which Yao made the scaling-wall argument. His departure from Anthropic and the reasons for it, documented on his blog, are several months old — context, not news. The individual claims have been reported elsewhere. What this piece attempts to do is connect them into a coherent argument and subject that argument to the scrutiny the claims themselves demand.
The kill-if-false anchors: if Gemini 3 Deep Think's benchmark numbers cannot be independently verified, or if Yao's specific claims about experimental design bugs in pre-training studies cannot be tied to real prior work, the story becomes unverifiable gossip. The skeptical framing throughout is not editorial pessimism; it is the appropriate epistemic posture toward claims that currently lack public peer-reviewed support.
What to watch: whether any independent research team publishes a replication study of the scaling-law experimental design question Yao raised. If the claim survives that kind of scrutiny, it reprices a significant portion of the current VC and lab investment thesis. If it doesn't, it becomes a case study in how motivated reasoning can travel under the cover of contrarian credibility.