When a new paper asks whether large language models can actually plan a trip — not just describe one — the answer, measured in validated plans, turns out to be nearly zero.
ItinBench, a benchmark developed by researchers at the University of Virginia, tests LLMs on itinerary planning across two cognitive dimensions simultaneously: verbal reasoning (understanding user preferences, constraints, time windows) and spatial reasoning (optimizing routes to minimize travel distance). The paper, authored by Tianlong Wang, Pinqiao Wang, Weili Shi, and Sheng Li, was posted to arXiv this week. The results are not kind to the current generation of frontier models — and the reason why is more interesting than the numbers alone.
On the combined task — verbal and spatial together — validated plan rates collapse to near-zero. Mistral Large: 0%. Llama 3.1 8B: 0%. GPT-4o: 4%. OpenAI's o1: 4%. Even o1, which leads the pack on reasoning tasks, drops from an 18% validated-plan rate on verbal reasoning alone to 4% once the spatial dimension is added.
Gemini 1.5 Pro failed the spatial component of Task 2 entirely — it couldn't produce a reasonable number of attractions for a given time window.
The benchmark grades harshly: a plan must satisfy all user constraints and produce a geometrically efficient route. Most plans fail on both counts. The partial-credit view isn't more forgiving — on spatial metrics alone (Total Distance Gap, Detour Ratio), every model trails a simple greedy heuristic. o1 comes closest, achieving a Total Distance Gap of 7.5km against the greedy baseline's 9.2km on the pre-filtered task. That's the brightest signal in the paper for reasoning-scale models, and one the wire coverage missed.
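To see what the models are losing to, it helps to know how simple the baseline is. The paper doesn't publish its exact heuristic here, but a nearest-neighbor greedy route — a plausible minimal sketch — fits in a few lines:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def greedy_route(start, attractions):
    """Nearest-neighbor ordering: repeatedly visit the closest unvisited stop."""
    route, total, current = [start], 0.0, start
    remaining = list(attractions)
    while remaining:
        nxt = min(remaining, key=lambda p: haversine_km(current, p))
        total += haversine_km(current, nxt)
        route.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return route, total
```

Nearest-neighbor is not optimal — it's the textbook cheap approximation for this kind of routing — which is exactly what makes "every frontier model trails it" a pointed result.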
The more important result is buried in the ablation. When researchers gave models pre-computed proximity clusters — text descriptions of which attractions are physically close to each other — spatial performance improved substantially. The paper describes this as a "shortcut" condition, and the authors note that under it, models can process spatial structure when it's handed to them as text, but struggle to infer it independently from coordinates or addresses.
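The paper doesn't detail how its clusters were built, but the "shortcut" condition is easy to picture: group attractions by pairwise distance and hand the model the groups as text. A rough sketch, with the threshold and single-linkage grouping rule as assumptions:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def proximity_clusters(attractions, threshold_km=1.0):
    """Greedy single-linkage grouping: an attraction joins a cluster if it is
    within threshold_km of any member; otherwise it starts a new cluster."""
    clusters = []
    for name, coord in attractions:
        for cluster in clusters:
            if any(haversine_km(coord, c) <= threshold_km for _, c in cluster):
                cluster.append((name, coord))
                break
        else:
            clusters.append([(name, coord)])
    return clusters

def verbalize(clusters):
    """Render clusters as the kind of text hint the ablation supplies."""
    return [f"Cluster {i + 1}: " + ", ".join(n for n, _ in c)
            for i, c in enumerate(clusters)]
```

The point of the ablation is that this trivial preprocessing step — a few lines of arithmetic — is the difference between near-zero spatial performance and substantial improvement.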
This sits alongside something that's been argued theoretically. Subbarao Kambhampati at Arizona State University has pushed the LLM-Modulo framework — the idea that LLMs are best understood as approximate knowledge stores and language interfaces, not planners, and that robust planning requires wrapping them in external verification and search. The ItinBench ablation result is consistent with that framing: give a model the spatial structure explicitly and it can use it; ask it to derive that structure from raw location data and performance collapses.
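In LLM-Modulo terms, the model proposes and external critics dispose. A schematic of the generate-verify loop — the interfaces here are illustrative, not taken from the paper or Kambhampati's implementations:

```python
def llm_modulo_plan(propose, verifiers, max_rounds=5):
    """Generate-test loop: an LLM proposes a plan, external verifiers check
    it, and any violations are fed back as critiques for the next round."""
    feedback = []
    for _ in range(max_rounds):
        plan = propose(feedback)
        violations = [msg for check in verifiers
                      for ok, msg in [check(plan)] if not ok]
        if not violations:
            return plan  # every verifier passed: a validated plan
        feedback = violations
    return None  # no validated plan within the round budget
```

The key design choice is that correctness lives in the verifiers — constraint checkers, a route-distance bound — not in the model. The LLM only needs to be a good proposer, which is the capability the ItinBench numbers suggest it actually has.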
The paper has real limits. It's a single-city benchmark (Philadelphia, Yelp data). No retries — each model gets one shot per task, which underestimates the performance of systems with self-correction loops or tool use. The model versions are dated: no GPT-4.5, no Claude 3.7, no Gemini 2.x. Any of those could score differently, particularly the reasoning-class models.
There's also an open question the paper raises but doesn't answer: what happens when you give models access to a routing API? The benchmark tests base model capability, not augmented systems. In production, you'd almost certainly offload route optimization to Google Maps or OSRM. The interesting engineering question is whether LLMs can correctly translate natural language constraints into API calls — which is a different capability entirely, and a more tractable one.
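What that translation step might look like: the LLM's job shrinks to filling a structured constraint object, and the geometry goes to a solver. A sketch against OSRM's public Trip service — the schema and field names are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class TripConstraints:
    """Structured constraints an LLM would extract from a user request.
    (Illustrative schema; not the paper's format.)"""
    start: tuple                      # (lon, lat)
    stops: list                       # [(lon, lat), ...]
    open_windows: dict = field(default_factory=dict)  # stop index -> (open, close)

def osrm_trip_url(c, host="http://router.project-osrm.org"):
    """Build an OSRM Trip-service request that offloads route ordering.
    Note what it can't do: OSRM optimizes distance/duration, so time-window
    constraints still have to be verified separately."""
    coords = ";".join(f"{lon},{lat}" for lon, lat in [c.start, *c.stops])
    return f"{host}/trip/v1/driving/{coords}?source=first&roundtrip=false"
```

Even with the solver in the loop, something still has to check opening hours and user preferences against the returned ordering — which is verbal reasoning, the half of the task the models are comparatively good at.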
Planning is one of the canonical test cases for AI capability claims. Every few months a paper or demo suggests that frontier models can handle complex, multi-step planning. ItinBench is a sobering correction: on a task that a competent travel agent handles daily, combining moderate verbal complexity with basic spatial optimization, current models fail more than 95% of the time.
The lesson isn't that planning is impossible for LLMs. It's that the gap between "can describe a plan" and "can produce a valid plan" is larger than most capability narratives acknowledge. Models that pass verbal-only evaluations look competent. Add a second cognitive dimension and the floor drops out.
That's a useful thing to know before you build a product on it.