There is a moment in every robotics demo where the machine does something unexpected. Usually that something is failure — a gripper that slips, a sensor that hallucinates, a humanoid that faceplants on stairs. But last month, researchers at Physical Intelligence experienced a different kind of surprise. A robot they had trained to fold laundry and make espresso did something they had not explicitly taught it to do: it cooked a sweet potato in an air fryer.
The air fryer was a Cuisinart model bought off the shelf. It had never appeared in any commercial robotics dataset, any open-source robotics benchmark, or any academic paper. When the researchers asked the robot to load a sweet potato into it, the machine had only two relevant fragments in its entire training history — one of a different robot closing an air fryer basket, another from an open-source dataset where a different machine placed a plastic bottle inside one. From those fragments, and from the web-scale visual knowledge the model had accumulated during pretraining, π0.7 — Physical Intelligence's latest generalist robot brain — synthesized a functional understanding of how the appliance worked.
The researchers call what the robot demonstrated "compositional generalization" — the ability to take skills learned in one context and recombine them to solve problems the model has never encountered. Until recently, the standard approach to robot training was task-by-task memorization: collect data on a specific job, train a specialized model on that data, then start over for the next job. π0.7, per the company's blog post, breaks that pattern. It is a single model that matches the performance of specialist systems trained individually for each task — making coffee, folding laundry, assembling boxes — without being fine-tuned for any of them.
What makes this notable is not any single demo. It is the degree to which the results surprised the people who built the model.
"My experience has always been that when I deeply know what's in the data, I can kind of just guess what the model will be able to do," says Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. "I'm rarely surprised. But the last few months have been the first time where I'm genuinely surprised. I just bought a gear set randomly and asked the robot, 'Hey, can you rotate this gear?' And it just worked."
The air fryer required a caveat. Given a single high-level command — "load a sweet potato into the air fryer" — the model made a passable attempt but did not finish. With step-by-step verbal coaching from a human, walking it through each sub-task the way you might explain something to a new employee, it succeeded. Early attempts worked about 5 percent of the time. After roughly 30 minutes of refining how the researchers explained the task, adjusting prompts and breaking the job into smaller steps, the success rate jumped to 95 percent. The bottleneck, the researchers found, was not the robot's capability. It was the human's ability to translate intent into language the model could act on.
"The capabilities are going up more than linearly with the amount of data," says Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor whose research has helped define how foundation models are applied to robot learning. (Foundation models are large AI systems pretrained on broad data; in robotics, they take in a camera feed and decide how the robot should move.) "That much more favorable scaling property is something we've seen in other domains, like language and vision."
That dependency on human coaching is the most honest reframing of what π0.7 represents. It is not an autonomous breakthrough — a robot that can be pointed at a new task and left to figure it out. It is a new kind of collaboration layer between human expertise and machine capability. The bottleneck in robotics, which has always been framed as "can the robot learn to do this task," is quietly becoming something different: who is available to teach it.
The paper describes the behavior in careful terms — "early signs" of generalization, "initial demonstrations" of new capabilities. Physical Intelligence has been restrained about commercial timelines throughout its two-year existence, declining to speculate on when a system built on these findings might be deployed in a real environment. The company has raised over $1 billion to date, was most recently valued at $5.6 billion, and is currently in discussions for a new funding round that would value it at roughly $11 billion, according to Bloomberg, as reported by TechCrunch.
π0.7 also showed what researchers call cross-embodiment transfer: the ability to apply knowledge across different robot bodies. Tasked with folding laundry using a bimanual UR5e industrial robot, a platform for which it had no training data on that task, the model succeeded. The physical motion of folding a t-shirt on a large industrial arm differs significantly from the motion on the smaller robot the data was collected with. The model's success rate matched what expert human teleoperators — workers with an average of 375 hours of direct robot control experience — achieved when they tried the same task on the unfamiliar robot for the first time.
Levine draws a parallel to GPT-2, the language model that surprised the AI community in 2019 by generating a story about unicorns in the Andes — a combination no one had explicitly taught it. "Where the heck did it learn about unicorns in Peru?" he says. "That's such a weird combination. And I think that seeing that in robotics is really special."
Critics will note an asymmetry that Levine leaves unaddressed: language models had the entire internet to learn from. Robots do not. No amount of clever prompting fully closes that gap. Standardized benchmarks for robotics do not really exist, which makes external validation of these claims difficult. Physical Intelligence measured π0.7 against its own previous specialist models — not against any independent standard.
"The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring," Levine says. "The robot is not doing a backflip." He argues that the distinction between an impressive demo and a system that actually generalizes is precisely the point. Generalization, he suggests, will always look less dramatic than a choreographed stunt — but it is considerably more useful.
The useful question may no longer be whether a generalist robot brain can generalize. It may be who gets to teach it.