For years, the machine learning community settled on a deceptively simple answer to the question of which demonstration examples to show a multimodal LLM: pick the ones that look most like what you're asking about. k-Nearest Neighbor selection — find visually similar examples, feed them in — became the default for visual in-context learning, and nobody revisited it.
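Stripped to its essentials, that default is a few lines of code: embed the query, embed the candidate pool, rank by cosine similarity, keep the top k. A minimal numpy sketch (the function name and toy embeddings are illustrative, not from the paper):

```python
import numpy as np

def knn_select(query_emb, candidate_embs, k=4):
    """Similarity-based demonstration selection: return the indices of
    the k candidates whose embeddings are closest to the query's."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each candidate
    return np.argsort(-sims)[:k]      # indices of the k most similar

# Toy example: the query is closest to candidates 0 and 2.
query = np.array([1.0, 0.0, 0.0])
cands = np.array([[0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.8, 0.0, 0.2],
                  [0.0, 0.0, 1.0]])
print(knn_select(query, cands, k=2))  # → [0 2]
```

The selected examples are then prepended, with their labels, to the prompt sent to the multimodal model.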
That assumption gets challenged in a new paper from researchers at the University of Cincinnati and UCLA. The paper, "Learning to Select Visual In-Context Demonstrations", accepted to the CVPR 2026 Findings Track, pits the kNN standard against an RL-trained alternative called LSD across five benchmarks spanning subjective and objective tasks. The result is a clean split that the researchers call a dichotomy: kNN is still right for some things. For others, it's the wrong tool.
The five benchmarks in the study cover both ends of the spectrum. UTKFace requires predicting a person's age from a photograph across a range from 0 to 116 years — an objective, factual quantity. AVA measures aesthetic ratings on a 1-to-10 scale. SCUT-FBP5500 asks raters to score facial attractiveness. KonIQ-10k and KADID-10k assess perceived image quality. The first is a measurement; the others are opinions.
On objective regression tasks, LSD outperforms kNN meaningfully. On UTKFace at K=4, kNN posts a mean absolute error of 7.27 years; LSD comes in at 6.27, according to the project's results page. On KonIQ-10k, the error drops from 0.44 to 0.40. On age prediction — predicting a fact, not a feeling — the learned agent is notably better.
On subjective tasks, the pattern reverses. On AVA aesthetics at K=8, kNN's MAE is 0.83 versus LSD's 0.98. On SCUT-FBP5500 attractiveness at K=4, kNN scores 0.39 versus LSD's 0.62. kNN wins. The intuitive approach — similarity retrieval — is the right answer when the target is a human preference that correlates with visual resemblance. When the target is an objective measurement that exists independent of who is looking, learned selection beats visual proximity.
The architecture behind LSD is a Dueling DQN agent with a query-centric Transformer Decoder. The query-centric design was not an accident: in early experiments, the team found that concatenating query and candidate embeddings directly caused what they call policy collapse — the agent learned to select the same generally-good demonstrations regardless of the query image. The decoder architecture forces the selection to be genuinely conditional on what is being asked.
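Those two ideas can be illustrated in a heavily simplified numpy sketch. This is not the paper's architecture (no training loop, no actual Transformer Decoder); it shows only the query-conditioned scoring and the dueling decomposition Q(s, a) = V(s) + A(s, a) − mean over a of A(s, a), with every weight matrix an illustrative stand-in:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dueling_q(query_emb, cand_embs, Wq, Wk, Wv, w_adv, w_val):
    """Toy sketch of query-conditional dueling Q-values.

    Candidate features are weighted by attention to the query, so the
    scores cannot ignore what is being asked; the dueling head then
    splits the estimate into a state value V and per-action advantages A.
    """
    q = query_emb @ Wq                          # projected query
    keys = cand_embs @ Wk
    vals = cand_embs @ Wv
    attn = softmax(keys @ q / np.sqrt(q.size))  # candidate-to-query attention
    feats = attn[:, None] * vals                # query-conditioned features
    adv = feats @ w_adv                         # advantage per candidate action
    val = (attn @ vals) @ w_val                 # scalar state-value estimate
    return val + adv - adv.mean()               # dueling aggregation

rng = np.random.default_rng(1)
d, n = 8, 5
Q = dueling_q(rng.normal(size=d), rng.normal(size=(n, d)),
              rng.normal(size=(d, d)), rng.normal(size=(d, d)),
              rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d))
pick = int(np.argmax(Q))  # the demonstration the agent would select next
```

Because the attention weights depend on the query, two different query images produce two different score vectors over the same candidate pool, which is exactly the conditionality the concatenation baseline lost.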
The agent uses a FAISS IVFPQ index to retrieve the top 200 candidates at each step, cutting the per-step action space from the full demonstration pool to a fixed shortlist; the index search itself runs in sublinear time. Training runs for 16,000 steps on a single NVIDIA A100 GPU, roughly seven hours per the paper. That is not a large compute budget by frontier-lab standards. The approach is reproducible at university-lab scale.
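The role the index plays can be sketched without FAISS itself. The stand-in below does an exact inner-product top-m search where IVFPQ would do an approximate one over a compressed, clustered index; the function name and toy pool are illustrative:

```python
import numpy as np

def prune_candidates(query_emb, pool_embs, top_m=200):
    """Shrink the agent's per-step action space to a top_m shortlist.

    Exact stand-in for an approximate FAISS IVFPQ lookup: score every
    pool item against the query, keep the top_m, best first.
    """
    scores = pool_embs @ query_emb
    m = min(top_m, len(pool_embs))
    top = np.argpartition(-scores, m - 1)[:m]   # unordered top-m
    return top[np.argsort(-scores[top])]        # best first

# Toy pool of 2-dim embeddings; the query points along the x-axis.
query = np.array([1.0, 0.0])
pool = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.0],
                 [0.5, 0.5], [-1.0, 0.0], [0.3, 0.7]])
print(prune_candidates(query, pool, top_m=3))  # → [2 0 3]
```

The RL agent then only has to score those 200 shortlisted candidates, not the entire demonstration pool.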
The most practically significant result may be cross-model generalization. The LSD policy was trained using reward signals from Gemma 3 4B-it. Evaluated without retraining, it kept its advantage on Qwen 2.5 7B and performed comparably on Phi-3.5-vision. A single policy, trained on one model family, transfers to others. Per the project page, "a single LSD agent trained using reward signals from Gemma 3 4B-it successfully transfers to unseen models."
This matters for deployment. If the demo selection strategy generalizes across architectures, teams do not need to retrain a selector for every new model they adopt — the learned policy carries over. That is a meaningful practical benefit built on a small compute budget.
The researchers also ran a shuffling experiment that deserves attention: they took the demonstrations LSD selected and randomly reordered them. The MAE barely moved. That result suggests LSD is learning to pick the right set, not the right sequence — the agent found which examples matter, not how to arrange them.
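The experiment's logic is simple to sketch. In the snippet below, `predict_fn` is a hypothetical hook around the model call (not the paper's code), and the toy predictor in the demo is deliberately order-insensitive so the flat-MAE outcome is visible:

```python
import numpy as np

def shuffle_sensitivity(predict_fn, demo_sets, eval_pairs, n_shuffles=5, seed=0):
    """Re-run a predictor with the selected demonstrations reordered.

    If mean absolute error barely moves across random reorderings, the
    selector's gain comes from *which* demonstrations it picked, not
    from the sequence they appear in.
    """
    rng = np.random.default_rng(seed)
    maes = []
    for _ in range(n_shuffles):
        errs = []
        for demos, (query, target) in zip(demo_sets, eval_pairs):
            order = rng.permutation(len(demos))
            pred = predict_fn([demos[i] for i in order], query)
            errs.append(abs(pred - target))
        maes.append(float(np.mean(errs)))
    return maes   # one MAE per shuffle; a flat list means order-insensitivity

# Toy check with an order-insensitive predictor (mean of the demo labels):
mean_pred = lambda demos, query: float(np.mean(demos))
maes = shuffle_sensitivity(mean_pred, [[1, 2, 3], [4, 5, 6]],
                           [(None, 2.0), (None, 5.0)])
print(maes)  # every shuffle yields the same MAE (0.0 here)
```

In the paper's version of this check, the demonstrations come from LSD and the predictor is the multimodal LLM itself; the near-flat MAE across shuffles is what licenses the set-not-sequence reading.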
Eugene Lee and Jiajie Diao at the University of Cincinnati and Yu-Chi Lin at UCLA conducted the research. The paper was accepted to the CVPR 2026 Findings Track; the main conference accepts roughly 26 to 28 percent of submissions, while Findings tracks at other venues typically gather sound work that falls just short of that bar.
For engineers building visual ICL pipelines, the practical takeaway is not to replace kNN wholesale but to match the selector to the task: use kNN for aesthetic, preference, and style tasks; consider LSD or a learned alternative for regression and measurement tasks where the target is a factual quantity. The authors make no claims beyond the benchmarks evaluated, and the approach has not been tested on domains beyond the five in the study.
What remains open is whether the dichotomy holds in other modalities — audio, document understanding, mixed inputs — and whether the RL agent's selections are interpretable enough to say why a particular demonstration helps. Those are the questions the next round of experiments should answer. The finding that the right tool depends on what you're measuring is not revolutionary. But in a field that defaulted to one answer for years, it is the kind of quiet correction that changes how people build.