Google DeepMind built Vision Banana to generate images. Give it simple text instructions for spatial tasks, however, and it outperforms models purpose-built for those jobs, despite never having been trained on a real-world depth measurement.
The paper, posted to arXiv on April 22, shows Vision Banana estimating depth more accurately than Depth Anything V3, segmenting scenes more precisely than SAM 3, and doing both while retaining the image generation capabilities it was trained for. The key move, as reported by 36kr: all depth training data came from synthetic rendering engines; no actual depth measurements were required. What looks like a clever side effect may be something more fundamental: a model trained to represent the visual world learning spatial relationships without being taught them directly.
On metric depth estimation, averaging performance across four standard datasets, Vision Banana scored 0.929 on the δ1 accuracy metric against Depth Anything V3's 0.918, according to the paper. On semantic segmentation of urban scenes, it achieved 0.699 mean Intersection over Union against SAM 3's 0.652. On referring segmentation, in which natural language directs the model to a specific object in an image, it scored 0.738 against SAM 3 Agent's 0.734. Surface normal estimation, predicting the direction a surface faces in 3D space, came in at 18.928 degrees mean angle error against Lotus-2's 19.642, where lower is better. Each margin is narrow. Together they suggest a pattern: a general visual representation transfers across tasks that specialist models handle separately.
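For readers keeping score across metrics: δ1 accuracy, mean IoU, and mean angle error have standard textbook definitions, sketched below in Python. This is illustrative only; benchmark suites apply dataset-specific masking and variants that the paper's exact evaluation code would include.

```python
import numpy as np

def delta1_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels whose predicted depth falls within a factor
    of 1.25 of ground truth (the standard delta-1 threshold). Assumes
    positive, valid depths; real benchmarks mask invalid pixels."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < 1.25))

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union across classes, the segmentation
    score quoted above. Classes absent from both maps are skipped."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def mean_angle_error(pred_normals: np.ndarray, gt_normals: np.ndarray) -> float:
    """Mean angular error in degrees between predicted and ground-truth
    unit surface normals; lower is better."""
    cos = np.clip(np.sum(pred_normals * gt_normals, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```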
The paper does not claim universal applicability. Results may hold on academic benchmarks while failing under diverse real-world conditions, a limitation the authors note explicitly. But the mechanism behind the results is the more interesting question. Vision Banana was built by instruction-tuning a text-to-image base model called Nano Banana Pro on a mixture of its original training data and a small amount of vision task data. Instruction-tuning is the same basic technique that helps turn a raw language model into an assistant like Claude: show the model examples of the task you want, and it learns to follow instructions. Applied to a model already competent at visual representation, the same approach produces spatial understanding without depth supervision. No camera parameters needed. No real-world capture runs. The model learns to draw before it learns to see, and seeing emerges.
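The recipe, as the paper describes it, reduces to a data-mixture decision. The sketch below shows the shape of that mixture; the field names and the sampling ratio are assumptions for illustration, not values from the paper.

```python
import random

def sample_training_example(generation_data, vision_task_data,
                            vision_fraction=0.1):
    """Draw one example from the mixed instruction-tuning corpus.
    The 0.1 ratio is a placeholder; the paper says only that the
    vision-task share is small."""
    if random.random() < vision_fraction:
        # Vision-task example, e.g.:
        # {"instruction": "Estimate metric depth for this scene",
        #  "input": image,
        #  "target": depth_map}   # rendered synthetically, no sensor data
        return random.choice(vision_task_data)
    # Original text-to-image data keeps generation ability intact.
    return random.choice(generation_data)
```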
What changes if this generalizes? Three separate vision pipelines are now candidates for collapse into one instruction-following model: depth estimation, semantic segmentation, and image generation. For robotics companies building autonomous systems, that means one model instead of three to integrate, maintain, and run. For augmented reality developers, it means metric depth and object recognition available via a text prompt instead of separate API calls. For synthetic data generation, it means a model that both renders scenes and understands their spatial structure, useful for training other systems without real-world capture. The cost structure of visual perception changes when the same model handles what used to require a stack.
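Concretely, consolidation means one call signature where there used to be three integrations. The interface below is hypothetical; the paper describes no public API, and every name in it is invented for illustration.

```python
def run_task(model, image, instruction):
    """Single entry point: the text instruction selects the task.
    `model.predict` is a placeholder for whatever serving interface
    a deployment would actually expose."""
    return model.predict(image=image, prompt=instruction)

# Three calls that previously required three specialist models:
# depth = run_task(model, frame, "Estimate metric depth for every pixel")
# mask  = run_task(model, frame, "Segment the cyclist on the right")
# scene = run_task(model, frame, "Generate this street at night")
```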
The author list adds an odd angle. Kaiming He and Saining Xie, both researchers at Meta's AI division FAIR, appear as leadership sponsors on a Google DeepMind paper. He and Xie previously worked at Google; their current employer is Meta. The paper credits their guidance on a project published under a different institution's name. Meta did not respond to questions about whether its FAIR lab contributed compute, data, or personnel to the work.
Whether Vision Banana's approach survives contact with conditions outside academic benchmarks, including variable lighting, occluded objects, and the long tail of real environments, is the open question. The paper's own authors are careful not to claim it will. What they have shown is narrow enough to be credible and strange enough to be worth watching.