We are where self-driving was in 2015 — real potential, but years of groundwork ahead. That is Sergey Arkhangelskiy's assessment of the physical AI field, and he built the benchmark designed to prove him right or wrong.
PhAIL, the Physical AI Leaderboard, launched March 31, 2026, from Positronic Robotics, a Cyprus-based robotics startup founded by Arkhangelskiy, a former team lead on Google's Search Ranking team. Positronic's website describes PhAIL as the first real-hardware test for AI-powered robots that measures what actually matters in a warehouse or factory: how many units per hour the machine completes, and how often it fails. Not simulation scores. Not task-success rates in a controlled lab. Units per hour. Mean time between failures. The numbers a logistics manager actually cares about.
The early results are not flattering to the robots. Tests of several AI models, including systems from Nvidia and Hugging Face, suggest a gap remains between current AI-driven robots and human operators, particularly in throughput and reliability, according to Robotics and Automation News.
Arkhangelskiy is not new to getting machines to work outside the lab. Before Positronic, he founded Wannaby, a company that built AR try-on technology for shoes; Wannaby was acquired by Perfect Corp in 2025, according to that company's announcement. He studied at Lomonosov Moscow State University and is based in Cyprus.
PhAIL is structured as a consortium rather than a proprietary platform, with cloud provider Nebius and data company Toloka as initial partners. The benchmark runs on a standardized DROID-style setup: a Franka robot arm with a Robotiq gripper. The initial task is bin-to-bin picking — a common logistics operation that sounds simple until you try to automate it at speed.
Five models appear on the leaderboard, among them OpenPI 0.5 from Physical Intelligence, GR00T and DreamZero from Nvidia, and SmolVLA from Hugging Face, alongside human and teleoperated baselines. The three metrics are UPH@95 (units per hour at a 95 percent success threshold), overall success rate, and adjusted mean time between failures, abbreviated MTBF/A.
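To make those definitions concrete, here is a minimal sketch of how such metrics might be computed from per-trial logs. Everything in it is illustrative rather than official: the Trial fields, the reading of UPH@95 as throughput reported only when a run clears the 95 percent success threshold, and the use of plain mean time between failures in place of the leaderboard's adjusted MTBF/A, whose exact formula the launch coverage does not spell out.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    duration_s: float  # wall-clock seconds for one pick attempt (hypothetical field)
    success: bool      # did the unit land in the target bin? (hypothetical field)

def summarize(trials):
    total_s = sum(t.duration_s for t in trials)
    successes = sum(t.success for t in trials)
    failures = len(trials) - successes
    success_rate = successes / len(trials)

    # Assumed reading of UPH@95: units per hour, reported only when the
    # run clears the 95 percent success threshold.
    uph = successes / (total_s / 3600)
    uph_at_95 = uph if success_rate >= 0.95 else 0.0

    # Plain mean time between failures (operating time per failure);
    # the leaderboard's "adjusted" variant is not reproduced here.
    mtbf_s = total_s / failures if failures else float("inf")

    return {"UPH@95": round(uph_at_95, 1),
            "success_rate": success_rate,
            "MTBF_s": round(mtbf_s, 1)}

# Invented example run: 97 clean picks of ~9 s each, 3 failed attempts of ~12 s each.
print(summarize([Trial(9.0, True)] * 97 + [Trial(12.0, False)] * 3))
# -> {'UPH@95': 384.2, 'success_rate': 0.97, 'MTBF_s': 303.0}
```

The numbers in the example are made up; the point is the shape of the output, the kind of summary a single leaderboard entry boils down to.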
The hardware requirements alone tell you something about the state of the field. Running OpenPI 0.5 needs a GPU with roughly 78 gigabytes of VRAM for training and 62 gigabytes for inference, according to Positronic's GitHub repository. GR00T requires CUDA, Nvidia's parallel computing platform, on Linux. SmolVLA, by contrast, runs on a consumer GPU — which is part of why it has attracted attention as a more deployable option for smaller operations without enterprise-scale compute.
"Physical AI needs to prove itself there first, and PhAIL is how we measure whether it can," Arkhangelskiy said in a statement to Robotics and Automation News, referring to warehouse and factory floors.
The gap between AI-driven robots and human operators in these early results is not a surprising finding. Anyone who has walked a factory floor knows the difference between a robot demo and a robot deployment. But having a number attached to it changes the conversation. When you can compare the throughput metric directly against a human baseline, the gap becomes specific. It becomes a target.
What the benchmark does not yet measure is how a robot fails. When a robot makes a mistake in a bin-picking task, does it recover smoothly and continue? Or does it drop the object, misalign its grip, and require human intervention? Throughput and reliability are tracked. Graceful failure is not, and in a real warehouse that matters as much as raw speed.
One caveat: the actual PhAIL leaderboard data — the specific UPH@95, success rate, and MTBF/A numbers for each model — was not publicly accessible at time of publication. The PhAIL website displayed "Dataset is loading" when checked. The early results cited here are drawn from secondary reporting on the launch announcement, not the live leaderboard. The numbers will come.
What to watch: whether the consortium expands to include major logistics operators with real picking data and real incentives to benchmark AI vendors honestly. Amazon, DHL, FedEx — companies that run actual warehouse floors, not just demos. A benchmark is only as credible as the operators willing to run their own systems against it.