The most striking number in the ARC-AGI-3 results is not the failure rate of frontier AI. It is 12.58 percent.
That is the score of a baseline system built from off-the-shelf reinforcement learning and graph search, with no transformer architecture and no vast training run. It surpasses every commercial large language model tested on the benchmark by a factor of more than 30, according to results published by ARC Prize. The finding cuts against a core assumption of the AI industry's scaling era: that raw model size and reinforcement learning from human feedback would eventually close the gap on human-like fluid reasoning.
ARC Prize was founded in 2023 by François Chollet, creator of the Keras deep-learning framework and a former Google engineer, and Mike Knoop, a co-founder of Zapier, to pose a specific question: why do current AI systems fail so badly on tasks that humans find trivial? The answer, the organization argues, is that frontier models have memorized patterns from training data rather than learned to acquire new skills. Chollet calls this the engineer-vs.-oracle problem. ARC-AGI-3, the third iteration of the benchmark launched March 25 at Y Combinator's headquarters in San Francisco, is the latest attempt to measure that gap quantitatively. Results are published on ARC Prize's blog.
Humans who attempt ARC-AGI-3 solve 100 percent of the environments. Every puzzle in the set was independently verified as solvable by at least two members of the general public, with a median solve time of 7.4 minutes per session. The frontier models did not come close. Google's Gemini 3.1 Pro Preview scored 0.37 percent. OpenAI's GPT-5.4 (High) scored 0.26 percent. Anthropic's Opus 4.6 (Max) scored 0.25 percent. xAI's Grok-4.20 (Beta, Reasoning) scored 0.00 percent. OfficeChai compiled the full model rankings.
The RL and graph-search baseline scored 12.58 percent in the preview phase, outperforming every one of those models. The preview environments are easier than the full private evaluation, so the gap will likely narrow. But the ordering is already informative: classical search beats transformer scaling on a benchmark designed to isolate fluid reasoning.
The technical distinction matters. Large language models generate outputs by predicting the most statistically likely next token given their training distribution. A graph-search system treats each ARC task as an explicit search problem, enumerating possible transformations within a defined state space and using learned heuristics to find a valid solution. The architectures share almost nothing. That the simpler approach comes out ahead suggests the scaling paradigm has not yet captured something the task requires.
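The search framing can be made concrete with a minimal sketch. The primitive operations below (rotate, flip, recolor) and the function names are invented placeholders, not the baseline's actual operator set or code; the point is only the structure: enumerate transformation sequences over a state space until one maps the input grid to the target.

```python
from collections import deque

# Illustrative primitive transformations over a small grid (list of lists).
# These are stand-ins, not ARC Prize's baseline operators.
def rotate90(g):
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g):
    return [row[::-1] for row in g]

def recolor(g, a, b):
    return [[b if c == a else c for c in row] for row in g]

OPS = [
    ("rotate90", rotate90),
    ("flip_h", flip_h),
    ("recolor_1_2", lambda g: recolor(g, 1, 2)),
]

def key(g):
    # Hashable form of a grid, for the visited set.
    return tuple(tuple(row) for row in g)

def search(start, goal, max_depth=4):
    """Breadth-first search over sequences of primitive transformations."""
    frontier = deque([(start, [])])
    seen = {key(start)}
    while frontier:
        grid, path = frontier.popleft()
        if grid == goal:
            return path  # sequence of operation names
        if len(path) >= max_depth:
            continue
        for name, op in OPS:
            nxt = op(grid)
            k = key(nxt)
            if k not in seen:
                seen.add(k)
                frontier.append((nxt, path + [name]))
    return None
```

A real system would replace blind breadth-first enumeration with learned heuristics to rank which transformations to try first, which is where the reinforcement learning component of the reported baseline would plausibly enter.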
The benchmark itself has defenses against gaming. Each ARC-AGI-3 environment uses only core perceptual priors: objectness, geometry, physics. The benchmark deliberately excludes language, cultural symbols, and domain-specific knowledge. Every puzzle is a 64-by-64 color grid. The challenge is not recognizing patterns in text; it is inferring the underlying generative rule from examples and applying it to a new configuration. The full technical description is in the arXiv paper.
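In miniature, "inferring the generative rule" looks like the toy below: check each candidate rule against every demonstration pair, keep the one that explains all of them, then apply it to the new input. The rules and grids here are invented for illustration and are far simpler than real ARC environments.

```python
# Two toy candidate rules over grids of color codes.
def mirror(g):
    return [row[::-1] for row in g]

def invert(g):
    # Swap colors 0 and 1 (only meaningful for binary grids).
    return [[1 - c for c in row] for row in g]

CANDIDATES = {"mirror": mirror, "invert": invert}

def induce(demos):
    """Return the name of a rule consistent with all (input, output) demos."""
    for name, rule in CANDIDATES.items():
        if all(rule(x) == y for x, y in demos):
            return name
    return None

# Two demonstration pairs; only mirroring explains both of them.
demos = [([[1, 0]], [[0, 1]]), ([[0, 0, 1]], [[1, 0, 0]])]
rule_name = induce(demos)
```

Note that the first demo alone is ambiguous (both mirroring and inverting produce `[[0, 1]]` from `[[1, 0]]`); it takes the second pair to pin down the rule, which is exactly why ARC tasks give multiple examples.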
Scores use RHAE (Relative Human Action Efficiency), which squares the efficiency ratio: a task solved in the same number of steps as a human scores 1.0, while a model requiring many more steps scores closer to zero. Frontier models scoring below 0.4 percent are generating invalid transformations before giving up. OfficeChai explains the RHAE scoring formula.
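Taking the article's description at face value, the metric can be sketched as follows. The exact formula, including how the ratio is clamped, is an assumption here; only "squares the efficiency ratio" comes from the source.

```python
def rhae(human_steps, agent_steps):
    """Relative Human Action Efficiency: the squared ratio of human steps
    to agent steps. Clamping at 1.0 is an assumption, so that matching
    human efficiency scores exactly 1.0."""
    ratio = human_steps / agent_steps
    return min(ratio, 1.0) ** 2

rhae(10, 10)   # same steps as a human -> 1.0
rhae(10, 500)  # 50x more steps -> 0.0004, i.e. 0.04 percent
```

The squaring explains how frontier-model scores land below 0.4 percent: an agent that needs tens of times more actions than a human is pushed quadratically toward zero.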
The benchmark's designers acknowledge that scores on individual public environments do not generalize. A well-engineered evaluation harness can push Opus 4.6 to 97.1 percent on one public environment and 0 percent on another. The arXiv paper documents this harness instability. ARC-AGI-3's 135 environments, 110 of them private, are designed to make such tuning impractical.
The competition structure reflects the difficulty. The total prize pool across ARC Prize 2026's two tracks is $2 million. The ARC-AGI-3 track alone carries $850,000, with a grand prize of $700,000 that unlocks only if a team achieves perfect 100 percent accuracy, split among the top five finishers if multiple teams clear that bar. The entry deadline is October 26, 2026; the final submission deadline is November 2, 2026. Winners are announced December 4. The Kaggle competition page has the full rules. As of 11 days after the launch event, 352 teams comprising 2,495 entrants had registered.
The simple baseline's preview-phase score does not mean ARC-AGI-3 is solved or that classical search will dominate the full evaluation. What it suggests is that the assumption driving several years of scaling investment (that sufficiently large models would eventually acquire fluid reasoning as an emergent property) has not yet paid off on this benchmark. The code is available. The prize money is real. Whether the frontier labs close the gap before November is the open question.