Training a robot to do something new in the real world usually hits a wall: someone has to tell it, step by step, whether it is doing well. That process is called reward engineering, and in academic robotics labs it routinely consumes six months of graduate student labor per task. The work is tedious, expensive, and deeply human. A new framework from Carnegie Mellon University's Robotics Institute and Amazon Robotics proposes to eliminate it entirely, using large language models to decompose tasks and vision-language models to track progress automatically. The results are striking, but the paper contains a finding its authors bury in the ablation studies: the choice of vision model is load-bearing in ways nobody fully understands.
The framework is called VLLR, for Value-function Initialization with LLM-driven task decomposition and self-certainty-based Reward. Developed by a team led by Silong Yong, a PhD student at CMU's Robotics Institute, it combines two signals. The first is extrinsic: an LLM breaks a long-horizon task into subtasks, and a vision-language model watches the robot's progress on each one, feeding that signal into the value function that guides learning. The second is intrinsic: the robot's own policy generates a confidence score at each step, nudging it toward actions it executes reliably. No graduate student spends months hand-coding reward functions.
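A minimal sketch of how those two signals could combine. The function names, the negative-entropy measure of certainty, and the `beta` weighting are illustrative assumptions, not the paper's actual formulation:

```python
import math

def self_certainty(action_logits):
    """Intrinsic signal: how confident the policy is in its own action
    distribution, measured here as negative entropy (higher = more certain)."""
    m = max(action_logits)
    exps = [math.exp(x - m) for x in action_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * math.log(p) for p in probs)  # negative entropy

def shaped_reward(vlm_progress, prev_vlm_progress, action_logits, beta=0.1):
    """Extrinsic term: change in VLM-estimated task progress since last step.
    Intrinsic term: the policy's self-certainty, scaled by beta."""
    return (vlm_progress - prev_vlm_progress) + beta * self_certainty(action_logits)
```

The intrinsic term rewards decisive action distributions, which is one plausible reading of "nudging it toward actions it executes reliably."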
On the CHORES benchmark, which tests robots on mobile manipulation and navigation tasks including picking objects from tables, navigating to rooms, and relocating items relative to other objects, VLLR achieved up to 56 percent absolute success rate gains over the pretrained policy alone. Against state-of-the-art RL finetuning methods on tasks the robot was trained on, it gained up to 5 percent. On tasks drawn from distributions the robot had never seen, it gained up to 10 percent. All without manual reward engineering.
Those numbers are the news. Here is the part that should make every robotics engineer nervous.
The team tested three vision-language models for the progress estimation job. Amazon's Nova Pro VLM correctly tracked task completion, predicting it late and increasing its confidence gradually as the robot approached the goal. Qwen2VL, the open-source alternative, called task completion early and then plateaued. A robot trained on its signal would quit too soon. CLIP, the computer vision model that anchors half the multimodal AI ecosystem, showed no correlation with task progress whatsoever. Different vision model, completely different learning signal. The choice of VLM is not a swap-in upgrade. It is architecture.
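One way to quantify what the team observed is to correlate each model's progress predictions against ground-truth task progress. The sketch below uses synthetic traces that mimic the qualitative behaviors the paper reports; it is an illustration of the probe, not the paper's evaluation code:

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

truth = [i / 9 for i in range(10)]                          # ground truth: 0.0 -> 1.0
late_gradual = [max(0.0, (t - 0.4) / 0.6) for t in truth]   # Nova-Pro-like: late, gradual
early_plateau = [min(1.0, 3 * t) for t in truth]            # Qwen2VL-like: early, plateaus
flat = [0.5] * 10                                           # CLIP-like: no correlation
```

A late-but-gradual predictor still correlates strongly with ground truth, which is why its signal remains usable for value-function initialization; a flat predictor carries no usable signal at all.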
"We observe that CLIP tends to show no correlation with the task progress, Qwen tends to predict task completion early, whereas Nova Pro tends to predict task completion at a later time and its prediction is gradually increasing," the authors write. That sentence is the most important in the paper, and it is not the headline.
The PPO fine-tuning approach builds on FLaRe, a method introduced in 2024 and presented at ICRA 2025. The team initialized the value function using VLM-derived progress signals for 200,000 steps, fewer than half a percent of the 50 million total training steps, and then ran the rest of fine-tuning without querying the vision model again. That design choice keeps inference costs manageable. It also means the entire framework rests on a brief window of vision-language model judgment, which the ablation studies confirm is the mechanism doing most of the work.
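In schedule terms, the design looks roughly like this. The function names are hypothetical stand-ins; only the two step counts come from the paper:

```python
WARMUP_STEPS = 200_000       # VLM-initialized value-function window (per the paper)
TOTAL_STEPS = 50_000_000     # full fine-tuning budget (per the paper)

def vlm_progress_stub(step):
    """Stand-in for an expensive VLM query; returns task progress in [0, 1]."""
    return min(1.0, step / WARMUP_STEPS)

def value_target(step, env_return):
    """During warm-up, regress the value function toward VLM-estimated
    progress; afterwards, rely on environment returns with no VLM queries."""
    if step < WARMUP_STEPS:
        return vlm_progress_stub(step)
    return env_return
```

The point of the skeleton is the asymmetry: the expensive model is consulted during fewer than half a percent of training steps, yet the targets it supplies in that window shape everything that follows.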
Katia Sycara, a research professor at CMU's Robotics Institute and Yong's advisor, co-authored the paper with Silong Yong and Amazon Robotics researchers Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, and Yesh Dattatreya from the University of Texas at Austin. Yong was an intern at Amazon Robotics while working on the project. The paper was submitted to arXiv on March 31, 2026.
What VLLR actually demonstrates is that the robotics field has found a credible path around one of its oldest bottlenecks. Reward engineering has always been the tax on generalization: a robot trained to sort packages in one warehouse needs a new reward function to sort packages in a different warehouse, and that retraining has always required human expertise. Automating that step, if it holds at scale, shortens the iteration cycle for every warehouse automation company from months to weeks.
But the VLM sensitivity finding cuts the other way. The current generation of vision-language models was not designed for robotic progress tracking; they were designed to describe images and answer questions. Asking one to watch a robot move through a task and judge whether it is winning is a new job, and the paper shows that different models not only do that job differently, they fail in entirely different ways. CLIP says the robot is making no progress when it is. Qwen2VL says it is done when it is not. Nova Pro happens to be right. That is a fragile foundation for a general method.
The practical implication is that VLLR's gains are specific to the Nova Pro configuration the team chose, at least until the field develops VLMs explicitly designed for embodied progress estimation rather than adapted from general computer vision. The paper makes a convincing case that eliminating manual reward engineering is possible. It also reveals how far the field is from understanding why the alternative works.
The CHORES benchmark covers six task types: Fetch, Pick-Up, Object Navigation, Room Visit, Object Navigation Relative, and Object Navigation Affordance. The team tested on a physical robot platform as well as simulation, which matters because sim-to-real transfer has derailed more than a few promising results. The paper does not claim to have closed that gap, only that the RL fine-tuning step is now faster and requires less human intervention.
VLLR remains a preprint, and the authors have released a project page with visualization videos of the VLM progress estimation process. The results are specific to the tasks, robot platforms, and VLM configurations the team chose. But the core insight is that foundation models can supervise robot learning without humans in the loop, provided you pick the right vision model and understand why it works. That is the kind of finding that gets replicated fast when other labs read it.