A team of researchers built the most ambitious AI taste model to date — and a dumb baseline that predicts the average rating for every image beat it.
That is not a misprint. A constant-mean predictor, the statistical equivalent of guessing the same answer for everything, achieves the lowest mean absolute error of any model tested in the FoodSense-VL benchmark, at 0.377. It outperforms a Gemma 3 27B-IT model fine-tuned on nearly 67,000 human sensory judgments. The best correlation the fine-tuned model achieves with human taste ratings is 0.297. Human raters, for comparison, disagree with one another at a leave-one-out mean absolute error of 0.793.
Taste was supposed to be different. A team at the University of Central Florida built FoodSense-VL, a vision-language model fine-tuned on nearly 67,000 sensory judgments from 8,382 human raters across 2,987 food images. The model was asked to predict how food tastes, smells, feels, and sounds based solely on photographs. The researchers called it a milestone. The benchmark, they said, represented a new frontier.
The best the model could manage was a Pearson correlation of 0.297 with human taste ratings. Human annotators disagree with each other at a leave-one-out MAE of 0.793, according to the project website: an individual rater lands, on average, almost 0.8 points from the consensus of the other raters on the same image. That noise floor makes the task intrinsically hard, but it does not excuse a correlation under 0.3, which means the model's predictions barely track the consensus at all. And a constant-mean baseline that simply predicts the average rating for every image, with no vision, no language, no fine-tuning, just the global average of the training labels, achieves the lowest mean absolute error of any model tested, at 0.377.
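Both of those numbers are mechanical to compute. A minimal sketch of the two calculations on toy data (the ratings below are simulated, not drawn from FoodSense-VL, so the exact values differ; the point is the relationship between the two quantities):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (invented, not FoodSense-VL data): 200 images rated by
# 5 raters each on roughly a 1-7 scale. Each image has a latent score;
# raters see it through independent noise, so individual ratings are
# noisy while the per-image averages cluster near the global mean.
true_scores = rng.normal(4.0, 0.4, size=200)
ratings = true_scores[:, None] + rng.normal(0.0, 1.0, size=(200, 5))

# Ground-truth label per image: the mean across its raters.
labels = ratings.mean(axis=1)

# Constant-mean baseline: predict the global average for every image.
constant_pred = labels.mean()
baseline_mae = np.abs(labels - constant_pred).mean()

# Leave-one-out human agreement: how far each rater lands from the
# mean of the *other* raters on the same image.
n_raters = ratings.shape[1]
loo_means = (ratings.sum(axis=1, keepdims=True) - ratings) / (n_raters - 1)
human_loo_mae = np.abs(ratings - loo_means).mean()

print(f"constant-mean baseline MAE: {baseline_mae:.3f}")
print(f"human leave-one-out MAE:    {human_loo_mae:.3f}")
```

With these toy parameters the baseline beats individual human raters, mirroring the paper's ordering: the baseline is scored against the averaged labels, which are far less dispersed than any single rater's judgments.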
That number should be embarrassing. A model that has learned nothing about food, that has never tasted salt or smelled rosemary or felt the texture of crispy skin, outperforms a Gemma 3 27B-IT model fine-tuned on 66,842 human judgments. What it means is that the metric being used to evaluate these systems — raw MAE, mean absolute error — rewards a model that hedges. That plays it safe. That predicts something close to the average for everything, rather than risk being confidently wrong about anything.
The FoodSense-VL team knew this. Their website states it plainly: "We show that a constant-mean predictor achieves the lowest MAE (0.377), proving raw MAE alone rewards conservative hedging over genuine sensory discrimination." They built a two-stage pipeline that deliberately trades MAE for higher correlation in the second stage, accepting the accuracy hit in exchange for predictions that track relative differences between foods more reliably. This is a legitimate research decision — correlation is genuinely useful for ranking foods by sensory properties, even if absolute predictions remain noisy. But it means the most impressive-sounding number in the paper, the 0.297 correlation, comes at the cost of the most practically useful number, raw error. The model is better at appearing to taste than at actually tasting.
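The trade-off the team describes is easy to demonstrate. In the sketch below (again on invented data, not theirs), a hedging predictor that outputs the global mean achieves the lowest MAE but carries no ranking information at all, since Pearson correlation is undefined for a constant output, while a miscalibrated predictor that preserves the ordering has worse MAE but high correlation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented consensus ratings on roughly a 1-7 scale; not FoodSense-VL data.
labels = rng.normal(4.0, 0.6, size=500)

# Predictor A hedges: the global mean for every image.
pred_hedge = np.full_like(labels, labels.mean())

# Predictor B tracks relative differences but is miscalibrated:
# roughly the right ordering, wrong scale and offset, plus noise.
pred_track = 1.8 * (labels - labels.mean()) + 5.0 + rng.normal(0.0, 0.3, size=500)

def mae(y, p):
    return np.abs(y - p).mean()

# Pearson correlation is undefined for predictor A (zero standard
# deviation), so we only compute it for predictor B.
r_track = np.corrcoef(labels, pred_track)[0, 1]

print(f"hedge: MAE={mae(labels, pred_hedge):.3f} (no correlation: constant output)")
print(f"track: MAE={mae(labels, pred_track):.3f}, r={r_track:.3f}")
```

Scored on raw MAE alone, the hedging predictor wins; scored on correlation, only the tracking predictor registers at all. Any single-number leaderboard has to pick a side.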
This is not a niche methodological problem. It is the same disease that has afflicted AI benchmarks from computer vision to natural language understanding: the moment you optimize for a metric, you start harvesting the easy gains, the predictable patterns that inflate the number without necessarily improving the underlying capability. The constant-mean baseline is what happens when you strip all the sophistication away and just do the obvious thing. When that baseline wins on a core metric, it suggests the metric has been gamed, not that the baseline is clever.
The gap between what AI can do in food science and what the press coverage implies is wide. Givaudan, the Swiss flavor giant, has deployed AI to reduce salt in cheese snacks by 33 percent while preserving taste — but their ATOM tool works on formulation chemistry, not food photography. Cargill has integrated AI into its food development process to predict how ingredients will perform together. These are real applications with demonstrated commercial value. They are not, however, systems that look at a photo of your dinner and tell you how it tastes.
China has a robot restaurant that scans customers' faces and tongues to assess health and recommend personalized dishes, a commercial deployment of taste-adjacent AI in a food service context. Electronic tongues, physical sensor arrays that chemically analyze food samples, have been in development at Penn State and elsewhere for years. These approaches, physical and chemical, represent a fundamentally different strategy from vision-language models that infer taste from images. They measure actual taste compounds. They are hard, expensive, and domain-specific. Vision-language taste prediction is easy, cheap, and generalizable in theory. In practice, as FoodSense-VL shows, the generalization is mostly to the mean.
What FoodSense-VL actually demonstrates is not that AI can taste, but that AI can sort foods roughly by their average sensory profile using visual cues. A burger photographs differently from a salad. A deep-fried dish looks different from a steamed one. The model has learned to associate photographic patterns with broad sensory categories: this looks like it might be salty, that looks like it might be sweet. This is not taste. It is the visual equivalent of reading a menu description and guessing what the dish might be like. The fact that this guessing is slightly better than random, and that the researchers built an entire benchmark and dataset to quantify the gap, is genuinely useful science. It is not a taste AI.
The question worth asking is what changes — not when AI can taste, but when the appearance of taste is good enough. If a model can rank foods by their likely sensory properties with 0.297 correlation to human raters, is that useful for anything? The answer is probably yes, in narrow domains, for rough screening tasks where false positives are not costly. A food manufacturer screening new product concepts might use a model like this to quickly sort promising photographs from unpromising ones before committing to prototype development. This is a real application. It is also a far cry from the "AI palate" framing that accompanied the paper's coverage.
And yet the trajectory matters. Benchmarks improve. Models get better. The constant-mean baseline winning on MAE today does not mean it will win on MAE in three years, when the training data is larger, the visual representations more sophisticated, and the fine-tuning more targeted. The question is not whether AI taste prediction will become useful. It probably will, in specific domains, for specific purposes. The question is what it means that the last sensory domain humans considered exclusively theirs — intimately tied to pleasure, memory, and cultural identity — is now a calibration problem with a leaderboard.
Every previous colonization felt abstract. AI took jobs in data entry and customer service. AI made art that looked almost human. AI wrote passable prose. Taste is different because it is not information work or creative work. It is the thing you do when you eat. It is pleasure, and pleasure is the thing that remains when you have optimized everything else away. The FoodSense-VL paper is, at bottom, a research document about a narrow computational problem. But it arrives at a moment when the pattern it belongs to — AI entering the last domains humans thought were safe — is becoming clear enough to name. The question of what happens to human joy when taste becomes a solved engineering problem is not answered by this paper. It is not answered by any current research. But it is the right question to start asking.