Geometric Intuition for Robot Path Planning Still Defeats LLMs
Proving that a robot path-planning algorithm is within a known factor of optimal is not the kind of work you hand to an intern.

Image from FLUX 2.0 Pro
Proving that a robot path-planning algorithm is within a known factor of optimal is not the kind of work you hand to an intern. It requires domain-specific geometric intuition, multi-step mathematical reasoning, and the patience to work through constraints that took the original researchers months to construct. It is, in other words, exactly the kind of thing researchers have been hoping large language models might eventually be able to help with — or at least attempt.
A team from George Mason University and Florida Atlantic University has now built the first benchmark designed to test whether LLMs can actually do it. The result is both more encouraging and more sobering than the headline numbers suggest.
The preprint, posted to arXiv on March 19, 2026, is titled "Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification". The lead author is Zhengbang Yang, a PhD student at George Mason, and the senior author is Zhuangdi Zhu, an assistant professor in the university's Department of Cyber Security Engineering and a former senior data scientist at Microsoft. Co-authors Md. Tasin Tazwar and Minghan Wei are at Florida Atlantic University.
Path planning — figuring out how a robot gets from A to B, covers a territory, or completes a delivery route under energy or distance constraints — is foundational to almost every mobile robot in production. The catch is that the most interesting variants of the problem are NP-hard, meaning exact solutions don't scale. What practitioners rely on instead are approximation algorithms: algorithms with provable guarantees that their output is within some ratio of the true optimum. A warehouse robot navigating orders, a drone covering a search grid, a fleet of autonomous vehicles splitting up a delivery region — all of them are running some version of this. The quality of the approximation ratio directly affects real-world performance: energy consumption, mission completion times, cost.
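To make "approximation ratio" concrete, here is a classic textbook example from outside this benchmark: the double-tree heuristic for metric TSP, which provably returns a tour at most twice the optimal length. This is a standard result, not an algorithm from the paper; the sketch below is a minimal illustration of what a provable-ratio guarantee looks like in practice.

```python
def mst_double_tree_tour(dist):
    """2-approximation for metric TSP: build an MST (Prim's algorithm),
    then visit nodes in preorder of a walk over the tree. The triangle
    inequality guarantees the tour costs at most 2 * OPT."""
    n = len(dist)
    in_tree = {0}
    children = {i: [] for i in range(n)}
    # Prim's algorithm: repeatedly attach the cheapest outside node.
    while len(in_tree) < n:
        u, v = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        children[u].append(v)
        in_tree.add(v)
    # A preorder walk of the tree gives the tour order.
    tour, stack = [], [0]
    while stack:
        node = stack.pop()
        tour.append(node)
        stack.extend(reversed(children[node]))
    return tour

# Four points on a unit square (symmetric, metric distances); OPT = 4.
pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
d = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in pts] for ax, ay in pts]
tour = mst_double_tree_tour(d)
cost = sum(d[tour[i]][tour[(i + 1) % 4]] for i in range(4))
```

Proving that the ratio is 2 takes a few lines for this heuristic; the bounds in the benchmarked papers involve far more intricate geometric arguments, which is precisely what makes them a hard test.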
Designing those algorithms is hard. Proving their approximation bounds is harder. And that proof work — graduate-student-hours, faculty-weeks — is what the George Mason team asked LLMs to attempt.
Their benchmark consists of 34 proof tasks drawn from 11 peer-reviewed robotics papers. The tasks span three problem families: energy-constrained coverage path planning (grids, polygons, contours with charging constraints), distance- and capacity-constrained vehicle routing variants, and geometric neighborhood touring in three dimensions. Crucially, the benchmark anonymizes algorithm and problem names to prevent the models from reciting memorized proofs rather than reasoning from the provided material.
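Anonymization of this kind can be as simple as consistent placeholder substitution across the task text. The sketch below is a toy illustration of the idea, not the paper's pipeline; the algorithm names are invented.

```python
import re

def anonymize(text, names):
    """Replace known algorithm/problem names with stable placeholders
    so a model cannot pattern-match to memorized proofs."""
    mapping = {name: f"ALG-{i + 1}" for i, name in enumerate(names)}
    for name, placeholder in mapping.items():
        text = re.sub(re.escape(name), placeholder, text)
    return text, mapping

# Hypothetical example sentence and names, purely for illustration.
masked, mapping = anonymize(
    "We analyze GreedyCoverage against OptSweep on unit grids.",
    ["GreedyCoverage", "OptSweep"],
)
```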
Four models were evaluated: OpenAI's GPT-5.2, Google's Gemini 3 Pro, xAI's Grok 4.1, and Alibaba's open-source Qwen3.5. The paper notes that Anthropic's Claude 4.6 was excluded after it consistently timed out on the proof tasks — a notable omission given Claude's performance on other reasoning benchmarks. Each model was tested across four conditions: no context, domain-specific lemmas provided, the correct approximation ratio given as oracle knowledge, and both lemmas plus oracle ratio together.
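The four conditions amount to toggling two kinds of optional context. A hypothetical sketch of how such a prompt matrix might be assembled (the function, field wording, and sample lemma/ratio strings are illustrative assumptions, not taken from the paper):

```python
from itertools import product

def build_prompt(task_statement, lemmas=None, oracle_ratio=None):
    """Assemble one evaluation prompt; lemmas and oracle_ratio are
    the two optional context ingredients."""
    parts = [f"Prove the approximation bound for the following algorithm.\n{task_statement}"]
    if lemmas:
        parts.append("You may use these lemmas:\n" + "\n".join(lemmas))
    if oracle_ratio:
        parts.append(f"The correct approximation ratio is {oracle_ratio}; prove it.")
    return "\n\n".join(parts)

# The 2x2 condition grid: no context, oracle only, lemmas only, both.
conditions = list(product([None, ["Lemma 1: ..."]], [None, "2(1 + 1/k)"]))
prompts = [build_prompt("Task: <anonymized algorithm>", l, r) for l, r in conditions]
```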
The full results show a field that can occasionally manage graduate-level theorem work but cannot do it reliably. Without any scaffolding, GPT-5.2 led with a 26.47 percent success rate — meaning it produced a fully valid proof for roughly one in four tasks. Gemini 3 Pro hit 11.76 percent. Grok 4.1 reached 14.71 percent. Qwen3.5 managed 8.82 percent unaided.
The more interesting finding comes when you add domain-specific lemmas — essentially giving the model the geometric or algorithmic stepping stones that a human researcher would have internalized from prior coursework. With that context, Qwen3.5 surged to 44.12 percent, outperforming every proprietary model tested; Gemini 3 Pro reached 35.29 percent, while GPT-5.2 improved only modestly, to 32.35 percent. The open-source model, given the right hints, beat the flagships. Even so, the ceiling — with both lemmas and the answer provided as oracle knowledge — was 50 percent for GPT-5.2, which means that in the best-case scenario, with maximum assistance, the model produced a valid proof for only half the tasks.
Generic chain-of-thought prompting, the paper found, does not help and can actively hurt: it increases hallucination rates without improving logical rigor. Only structured domain context moves the needle.
The dominant failure mode identified in the paper's error analysis is what the authors call unsupported inference — the model takes a logical leap it cannot justify, asserts a relationship between quantities without deriving it, or skips a step that turns out to be the hard part. Other common failures include incorrect inequality directions, overgeneralization from specific cases, and the insertion of irrelevant assumptions. Hallucination, in this context, is not fabricating citations — it is fabricating proof steps.
This work sits adjacent to but distinct from an earlier direction in the field. A 2025 preprint presented at NAACL, "Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools", explored routing LLMs through SMT solvers to verify satisfiability of planning instances. That approach uses LLMs to formalize a problem so a solver can check it. The George Mason work asks a different and harder question: can the LLM itself produce the proof that an algorithm is provably near-optimal — the kind of mathematical argument a robotics researcher writes in a paper? These are not competing approaches; they are different rungs on a ladder.
The benchmark uses GPT-5.2 as both test subject and LLM-as-judge — a methodological concern worth flagging. GPT-5.2 evaluating its own outputs on 10-point scales across three dimensions (approximation ratio correctness, logical validity, and relevance to provided context) introduces a potential self-favorability bias, even if the paper reports that human researchers validated the judge's assessments. The benchmark's 34 tasks span only three problem families drawn from 11 papers, which is enough to establish the landscape but not enough to make strong claims about generalization. And the code repository remains under anonymous review, with no public data release yet.
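The three judging dimensions lend themselves to a simple rubric structure. The sketch below is an illustrative assumption about how such scores might be aggregated — the field names mirror the paper's dimensions, but the threshold rule is invented:

```python
from dataclasses import dataclass

@dataclass
class JudgeScores:
    """The paper's three rubric dimensions, each on a 10-point scale.
    The aggregation rule below is a hypothetical illustration."""
    ratio_correctness: int   # is the stated approximation ratio right?
    logical_validity: int    # does each proof step actually follow?
    context_relevance: int   # does the proof use the provided material?

    def is_valid_proof(self, threshold=8):
        # Assumed rule: every dimension must clear the threshold.
        return min(self.ratio_correctness, self.logical_validity,
                   self.context_relevance) >= threshold
```

The self-judging concern is structural: whatever the rubric, a proof with one unjustified leap can score well on two of three dimensions and still be worthless as mathematics.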
None of that undercuts the core result, which is genuinely useful: LLMs can produce valid approximation-optimality proofs at a non-trivial rate when given structured domain context, they fail reliably in characteristic ways, and those failure modes can be partially addressed with targeted augmentation. For researchers who spend weeks on this work, that is at minimum a diagnostic — and potentially the beginning of a useful tool.
Zhu's lab at George Mason has been building toward exactly this kind of infrastructure: the paper includes an automated extraction pipeline that uses GPT-5.2 to convert PDFs to structured LaTeX, previewing what a full-stack proof assistant for planning algorithms might eventually look like. Whether LLMs get from 44 percent to something operationally useful is the open question — and it will probably depend less on raw model capability than on how well researchers can encode domain knowledge into context that models can actually use.
Robotics · 5h 29m ago · 4 min read