The robot sees a cup, predicts a video of itself picking it up, and generates a beautifully plausible motion — that stops two centimeters short. The video looked correct. The task failed.
This is the core frustration driving a wave of research into video action models, the class of systems that learn visual dynamics from large-scale video data and try to transfer that knowledge to real robot control. And it's the problem a team from Zhejiang University, Westlake University, the Beijing Academy of Artificial Intelligence, and eight other institutions is trying to solve with a new post-training framework called VAMPO, according to a paper posted to arXiv this month.
The issue is a fundamental mismatch between how diffusion-based video predictors are trained and what manipulation actually requires. Current models are trained with likelihood-surrogate objectives — they get rewarded for making videos that look realistic, globally coherent, visually pleasing. But manipulation is a contact sport. Success depends on precise object pose, correct spatial relationships, exact timing of contact events. The things that make a video look good are not the same as the things that make a motion work.
"This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies," the authors write. The robot in the video does a perfect impression of picking up a cup. The actual robot misses.
VAMPO's answer is to add a policy optimization step after the standard video model training. The researchers reframe multi-step denoising — the process by which a diffusion model refines noise into a clean video prediction — as a sequential decision problem. Instead of just asking "does this video look real?" they reward the model for getting the visual dynamics right in the specific sense that matters for the task at hand.
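In pseudocode, that reframing looks roughly like the following sketch, where each denoising update is treated as an "action" and the final clean prediction is scored by a task-relevant reward rather than a likelihood surrogate. The helpers `denoise_step` and `reward_fn` are hypothetical stand-ins, not the paper's actual code:

```python
def denoise_as_mdp(denoise_step, reward_fn, x_T, num_steps):
    """Sketch: multi-step denoising viewed as a sequential decision problem.

    Each call to `denoise_step` is one 'action' refining the sample; only
    the terminal prediction is scored, by a task-relevant `reward_fn`.
    """
    trajectory = [x_T]
    x = x_T
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)  # action: one refinement step
        trajectory.append(x)
    # Reward attaches to the final clean prediction, not to how
    # "realistic" intermediate frames look.
    return trajectory, reward_fn(x)
```

The point of the framing is that a policy-gradient method can then push the whole denoising trajectory toward predictions that score well on the task reward.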
To make this tractable, they introduce an Euler Hybrid sampler. Standard diffusion denoising injects randomness at every step, which makes the process theoretically clean but creates variance problems when you're trying to optimize with policy gradients. VAMPO collapses all that stochasticity into the first step and runs the rest of the trajectory deterministically. The result is a low-variance signal that still captures what matters about the denoising process.
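A minimal sketch of that idea, with a hypothetical `velocity_fn` standing in for the learned denoiser, shows why it helps: once the only random draw is the initial noise, the rest of the trajectory is a deterministic function of it, so repeated rollouts from the same seed are identical and the policy-gradient estimate has far less variance:

```python
import numpy as np

def euler_hybrid_sample(velocity_fn, shape, num_steps=10, seed=0):
    """Sketch of a hybrid sampler: all stochasticity in the first step,
    deterministic Euler integration afterwards.

    `velocity_fn(x, t)` is a stand-in for the learned model's update
    direction (an assumption, not the paper's interface).
    """
    rng = np.random.default_rng(seed)
    # The one and only random draw: the initial noise sample.
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    t = 0.0
    # Remaining steps inject no noise, so the trajectory's randomness
    # collapses into the initial draw -- a low-variance signal for
    # policy-gradient optimization.
    for _ in range(num_steps):
        x = x + dt * velocity_fn(x, t)
        t += dt
    return x
```

Given the same seed, two rollouts produce the same output; changing the seed changes only the initial draw.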
The whole thing is combined with GRPO — a group relative policy gradient approach — and what the researchers call a "verifiable non-adversarial reward." That last part is important: they're not training against a discriminator or running adversarial games. They're checking whether the visual dynamics in the generated trajectories match what an expert would produce, measured in latent space rather than pixel space.
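The "group relative" part of GRPO can be sketched in a few lines: several rollouts for the same prompt are scored, and each one's advantage is its reward relative to the group's mean and spread, which removes the need for a learned critic. This is a generic GRPO-style sketch under that assumption, not the paper's implementation:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward against its
    own group, so no separate value network is needed (sketch)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four video rollouts for the same prompt, scored by a
# verifiable reward (e.g. how closely the generated dynamics match
# expert trajectories in latent space).
adv = group_relative_advantages([0.9, 0.2, 0.5, 0.4])
# Rollouts above the group mean get positive advantage; below, negative.
```

Because the reward here is a direct, checkable comparison against expert dynamics rather than a discriminator's opinion, there is no adversarial game to destabilize training.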
Tested across CALVIN, L-CALVIN, and real-world manipulation tasks, VAMPO improved task-relevant visual dynamics and downstream action generation compared to base models without architectural changes. The gains held in both simulation and on physical hardware.
The broader context here is the crowded VLA landscape. Google's RT series, NVIDIA's GR00T, Physical Intelligence's π0, Stanford's OpenVLA — all are trying to build general-purpose robot brains that can take in video and language and output motor commands. Video world models are increasingly central to these efforts because they let robots imagine what will happen before they act. But the field has been wrestling with the gap between "looks right in the simulation" and "works in the real world."
VAMPO is a post-training fix rather than a new architecture. That makes it potentially portable — the researchers didn't modify the underlying model structure, just added an optimization layer on top. Whether it scales to the full complexity of real-world manipulation tasks, with unstructured environments and long-horizon goals, remains an open question. The paper shows results on defined benchmarks; it doesn't yet demonstrate handling the messiness of a real warehouse or home.
But the core insight — that likelihood-trained video models optimize for the wrong thing, and that policy optimization can correct course — is likely to show up in more systems. The robots in the videos are getting prettier. The question is whether the arms attached to them will finally learn to reach.
The researchers maintain a project homepage for VAMPO with additional details.