What if a robot could watch a video of someone doing a task, imagine how to do it itself, and then just do it? That is the idea behind Large Video Planner (LVP), a new robot foundation model from researchers at Harvard's Kempner Institute, MIT, and UC Berkeley. Instead of learning tasks from language instructions and robot-specific motion data — the standard vision-language-action approach — LVP learns from internet video and generates a visual plan before acting. The robot's hardware is the prop. The brain is the story.
The core argument is deceptively simple: language contains little direct information about how the physical world behaves. A sentence like "place the bottle on the paper" tells you almost nothing about the angles, forces, and timing involved. But a video of someone doing that task encodes all of it — the reach, the grip, the release, the way fabric shifts under weight. Video is dense with physical and semantic information that text simply cannot match.
LVP works in two stages. First, given a starting image and a text instruction like "pull a piece of tape," the model generates a short video showing the task completed — a visual imagination of the action. Second, that video is converted into robot actions using a hand motion reconstruction model and a retargeting system. The robot sees what it should do before it does it. According to Yilun Du, a Kempner Institute investigator and assistant professor of Computer Science at Harvard SEAS: "Language contains little direct information about how the physical world behaves. Our idea is to train a model on a large amount of internet video data, which contains rich physical and semantic information about tasks."
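The two-stage flow described above can be sketched in code. This is a minimal illustrative mock, not the paper's implementation: every function, class, and waypoint format here is a hypothetical stand-in (a real system would plug in a video generation model, a hand pose estimator, and a robot-specific retargeter where the stubs are).

```python
# Hypothetical sketch of a two-stage "imagine, then act" pipeline.
# All names and data formats below are illustrative assumptions,
# not drawn from the LVP paper's code.

from dataclasses import dataclass

@dataclass
class Frame:
    pixels: list  # placeholder for an image array

def generate_plan_video(start_frame: Frame, instruction: str, num_frames: int = 8):
    """Stage 1: imagine the completed task as a short video.
    A trained video generation model would go here; this stub
    just repeats the starting frame."""
    return [start_frame for _ in range(num_frames)]

def reconstruct_hand_motion(video):
    """Stage 2a: recover hand trajectories from the generated frames.
    Stub: one (x, y, z, gripper) waypoint per frame."""
    return [(0.0, 0.0, 0.1 * i, 1.0) for i, _ in enumerate(video)]

def retarget_to_robot(hand_trajectory):
    """Stage 2b: map human hand waypoints onto robot commands."""
    return [{"xyz": wp[:3], "gripper_open": wp[3] > 0.5}
            for wp in hand_trajectory]

start = Frame(pixels=[])
video = generate_plan_video(start, "pull a piece of tape")
actions = retarget_to_robot(reconstruct_hand_motion(video))
print(len(actions))  # one robot action per imagined frame
```

The point of the structure is that the robot never sees the text instruction at execution time: language only seeds the imagined video, and actions are read off the pixels.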
On novel tasks selected by third parties and evaluated in real-world settings, LVP achieved a 59.3 percent success rate, compared with 39.3 percent for the next-best prior approach — a 20-point jump that suggests the system is not simply memorizing demonstrated sequences but learning something more transferable.
The result lands in the context of a long argument about what robot foundation models should be built on. The dominant approach extends large multimodal language models with action outputs, betting that scale and language understanding will transfer to physical manipulation. LVP is a different bet: that the missing ingredient is not more language reasoning but better visual priors. "The video generation capabilities learned from internet data help transfer knowledge to the robot foundation model," Du said. "The idea is to synthesize videos showing how a robot should act in a new task."
The paper's author list reads like a rough draft of a modern AI hall of fame. Yilun Du at Harvard is corresponding author. Pieter Abbeel at UC Berkeley — whose name appears on some of the most cited robot learning papers of the past decade — is a co-author. So is Russ Tedrake, an MIT professor and vice president of robotics research at Toyota Research Institute, along with Vincent Sitzmann, an associate professor at MIT. William Freeman and Jitendra Malik, longtime pillars of computer vision at MIT and Berkeley, respectively, round out the author list.
Here is what the paper is not: a product. Those 59.3 and 39.3 percent numbers come from tasks selected by third parties in real environments — a rigorous test of generalization — but a curated benchmark is not the same as a deployed system that has to work on kitchen counters, warehouse floors, and hospital rooms where nothing goes according to plan. The authors are honest about the gap. The benchmark tells you the direction is real. It does not tell you the robot is ready.
The more enduring contribution may be conceptual. This is an argument that the path to generalizable robots runs through video, not language — that if you want a machine to understand how the world changes when someone acts on it, you should show it millions of examples of the world changing, not ask it to read about physics. Whether that argument holds as data scales and evaluation hardens is the question the rest of the field will spend the next few years answering. Du and his collaborators have made the case that the direction is worth taking seriously. Now the hard part: proving it holds.