When Microsoft Research tested GPT-4o inside its new embodied AI benchmark, the model confused a flame with its reflection — then continued as if the scene matched what it expected to see, not what it actually saw. The mug it was supposed to retrieve was already sitting in front of it. A toddler would not make this error. The $20-per-month chatbot did.
The episode appears in a paper Microsoft Research posted to arXiv on March 18, 2026, introducing AsgardBench — a new benchmark designed to isolate one specific capability: whether an AI agent can look at a scene, understand what it actually sees, and revise its plan mid-task. Built on AI2-THOR, an open-source home simulation environment from the Allen Institute for AI, the benchmark contains 108 task instances across 12 task types and three scene types. Agents are positioned at interaction-ready locations, removing navigation as a confound and forcing the evaluation directly onto visual understanding and planning.
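The setup the paper describes amounts to a perceive-plan-act loop with navigation removed: the agent starts at an interaction-ready position, reads the scene, and must sequence actions toward a goal. A minimal sketch of that loop in plain Python, with toy stand-ins for the simulator and the agent (these class and function names are illustrative, not AI2-THOR's or the paper's API):

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Toy stand-in for simulator state (hypothetical, not the AI2-THOR API)."""
    objects: dict = field(default_factory=lambda: {"mug": "on_counter"})
    goal: tuple = ("mug", "in_sink")

def agent_policy(observation: dict) -> str:
    """Hypothetical agent: chooses the next action from what it (thinks it) sees.
    Misreading the observation here is exactly the failure class the benchmark probes."""
    if observation["objects"].get("mug") == "on_counter":
        return "pickup mug"
    if observation["objects"].get("mug") == "held":
        return "put mug in_sink"
    return "done"

def run_episode(scene: Scene, max_steps: int = 10) -> bool:
    """Perceive-plan-act loop with the agent already at an interaction-ready spot,
    so success depends only on reading the scene and sequencing actions."""
    for _ in range(max_steps):
        observation = {"objects": dict(scene.objects)}  # what the agent perceives
        action = agent_policy(observation)
        # Apply the action only if it is still possible in the current state.
        if action == "pickup mug" and scene.objects["mug"] == "on_counter":
            scene.objects["mug"] = "held"
        elif action == "put mug in_sink" and scene.objects["mug"] == "held":
            scene.objects["mug"] = "in_sink"
        elif action == "done":
            break
    return scene.objects.get(scene.goal[0]) == scene.goal[1]

print(run_episode(Scene()))
```

The guard on each action mirrors the benchmark's premise: an agent that misjudges state keeps issuing actions the simulator rejects, which is one of the stuck-loop failure modes the paper reports.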
The results are a sharp reminder that vision and language remain unevenly paired in embodied AI systems. Most models more than doubled their success rates when given images rather than text-only scene descriptions, according to Microsoft Research. Text-only performance stayed low across every model tested — a meaningful contrast to older embodied benchmarks like ALFWorld, where text-only agents can already perform competitively.
The failure modes fell into three consistent buckets. Agents misinterpreted subtle visual cues, confusing on/off states and clean versus dirty surfaces. They lost track of where they were in a task sequence. And they got stuck attempting actions that were no longer possible, sometimes undoing steps they had already completed. The GPT-4o flame-and-reflection confusion is the most vivid example, but it is not an outlier. It is the kind of error the benchmark was built to surface.
The interesting wrinkle is that some models did not need images to compete. Qwen3-VL, Mistral-Large-3, and Maverick matched their image-based baseline performance when given text-only input with detailed feedback, suggesting that for certain architectures the bottleneck is not visual perception but the quality of the state representation in language. Kimi-K5.2 and GPT-4o, by contrast, remained substantially stronger with images than with text even when given detailed feedback. The gap between seeing and describing is real, and it varies by model.
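The two input conditions differ only in how scene state reaches the model. A sketch of what the contrast looks like as prompt construction, using the OpenAI-style multimodal message shape as one common format — the paper's actual harness, prompt wording, and feedback format are not specified here and these helpers are hypothetical:

```python
def image_prompt(image_b64: str, task: str) -> list:
    """Image-based condition: the model must read state from pixels.
    (OpenAI-style multimodal message shape; illustrative only.)"""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Task: {task}. What is your next action?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]

def text_prompt(state_desc: str, feedback: str, task: str) -> list:
    """Text-only condition: a language description of the scene stands in for
    the image, plus detailed feedback on the previous action's outcome."""
    return [{
        "role": "user",
        "content": (f"Task: {task}.\n"
                    f"Scene: {state_desc}\n"
                    f"Feedback on last action: {feedback}\n"
                    "What is your next action?"),
    }]
```

On this framing, the finding reads as: for some models, a sufficiently detailed `state_desc` plus `feedback` string carries as much usable information as the image; for others, it does not.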
AsgardBench is not peer-reviewed — it is a preprint posted to arXiv — and the evaluation lives inside a simulation, not the real world. AI2-THOR is a mature and widely used environment, but photorealistic simulators still impose their own visual biases. Real kitchens have fingerprints on the cabinets and shadows that confuse depth estimators. Whether these results generalize is an open question.
The deeper context is the broader benchmarking landscape. ALFWorld and EmbodiedBench evaluate embodied agents in text-forward settings where language descriptions carry enough information to act on. AsgardBench was built to test the harder problem: what happens when the scene itself is the primary information carrier and the agent must read it correctly to avoid compounding errors across a multi-step task. The two benchmarks answer different questions. Both are necessary.
The practical stakes are not abstract. Embodied AI — agents that perceive, navigate, and act in physical environments — is one of the areas where frontier labs have pledged significant capabilities in the next 12 to 24 months. If models still misread whether a mug is clean or whether a burner is on, the gap between demos and reliable deployments is wider than the roadmap implies. AsgardBench does not answer that question. It just makes it harder to ignore.
The benchmark and paper are open access under a Creative Commons license and available on GitHub. Microsoft Research did not immediately respond to a request for comment on whether additional model evaluations are planned.
Sources: Microsoft Research blog post | arXiv paper 2603.15888 | AsgardBench GitHub repository | AI2-THOR (Allen Institute for AI)