Nvidia Says Its New AI Model Runs 4x Faster. The Academic Research Says 13x Slower.
Nvidia says its new language model runs four times faster than its own baseline. But independent academic tests of the same class of technology found a diffusion model running more than thirteen times slower than a conventional one. That contradiction is the story.
Nvidia released Nemotron-Labs-Diffusion on May 19, a 14 billion parameter model that can switch between three decoding modes: the standard approach (autoregressive), the new approach (diffusion), and a hybrid called self-speculation. In self-speculation mode, the company reported roughly 865 tokens per second on a B200 GPU, about four times faster than its own autoregressive baseline on identical hardware using the SGLang deployment framework, according to Hugging Face. Nvidia's internal benchmark measured performance at batch_size=1, meaning a single user request at a time.
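For scale, the two reported figures imply a rough autoregressive baseline. The sketch below simply divides the reported throughput by the reported speedup; both inputs are the approximate numbers quoted above, the function name is illustrative, and the result is an estimate rather than a figure Nvidia published.

```python
def implied_baseline(reported_tps: float, speedup: float) -> float:
    """Estimate baseline tokens/sec from a reported throughput and speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return reported_tps / speedup


# Approximate figures from the announcement: ~865 tokens/sec, ~4x speedup.
baseline = implied_baseline(865.0, 4.0)
print(f"Implied autoregressive baseline: about {baseline:.0f} tokens/sec")
```

Because both inputs are rounded, the output is best read as "a bit over 200 tokens per second" for a single request, not as a precise measurement.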
The independent tests paint a different picture. A November 2025 paper on arXiv tested diffusion language models and found that their speed advantage shrinks as batch size grows beyond one, the condition that governs most commercial deployment. LLaMA3-Instruct-8B, a standard autoregressive model, ran 13.7 times faster than its diffusion counterpart, LLaDA-Instruct-8B, in the same evaluation, according to the paper. This story did not independently reproduce Nvidia's benchmarks; findings reflect published literature and Nvidia's own disclosure.
"We find that these acceleration strategies yield significant gains at a batch size of 1, sometimes outperforming autoregressive models, but their advantage diminishes as batch size scales," the researchers wrote.
The gap matters because cloud providers and enterprises run inference at volume, batching requests to maximize GPU utilization. If the speedup holds only at batch size one, Nvidia is describing a single-user latency win that doesn't translate to the batched, high-volume serving that buyers actually pay for.
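A batch-size sweep is the kind of test that would settle the question. The sketch below is a minimal harness, not Nvidia's or the paper's methodology: `generate_batch` is a hypothetical callable standing in for whatever inference engine is under test, assumed to return one list of output tokens per prompt, and `fake_generate_batch` is a placeholder used only to show the shape of the call.

```python
import time
from typing import Callable, Dict, List, Sequence


def measure_throughput(
    generate_batch: Callable[[List[str]], List[List[int]]],
    prompts: Sequence[str],
    batch_sizes: Sequence[int],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[int, float]:
    """Return aggregate output tokens per second for each batch size."""
    results: Dict[int, float] = {}
    for batch_size in batch_sizes:
        if batch_size < 1 or batch_size > len(prompts):
            raise ValueError(
                f"batch size must be between 1 and {len(prompts)}, got {batch_size}"
            )
        batch = list(prompts[:batch_size])
        start = clock()
        # Hypothetical engine call: one list of output tokens per prompt.
        outputs = generate_batch(batch)
        elapsed = clock() - start
        if elapsed <= 0:
            raise RuntimeError("elapsed time too small to measure")
        total_tokens = sum(len(tokens) for tokens in outputs)
        results[batch_size] = total_tokens / elapsed
    return results


def fake_generate_batch(batch: List[str]) -> List[List[int]]:
    """Placeholder engine: emits 10 dummy tokens per prompt."""
    return [[0] * 10 for _ in batch]
```

Running the same sweep against an autoregressive and a diffusion configuration on identical hardware, then comparing tokens per second at each batch size, would show where any single-request advantage narrows. A real benchmark would also need warm-up runs, repeated trials, and fixed output lengths, all omitted here.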
Nvidia's self-speculation mode outperformed multi-token prediction methods in acceptance rate and device efficiency, according to the company's research paper. Nvidia also claims a 1.2 percent accuracy improvement over Qwen3 8B, with self-speculation pushing output per forward pass close to six times the autoregressive baseline, per Hugging Face. The competing inference engine vLLM has not yet merged diffusion decoding support, so operators who want to run the model's diffusion modes today face a narrower deployment toolkit centered on SGLang.
Speed-of-light analysis in Nvidia's research shows diffusion decoding can achieve 76.5 percent more tokens per forward pass under an optimal sampler, the theoretical ceiling of the approach. The gap between that ceiling and what independent researchers observed in practice suggests the technique is real but deployment-dependent in ways the announcement doesn't disclose.
What to watch: whether cloud providers and AI deployment platforms run batch-size benchmarks that validate or contradict Nvidia's claims, and whether the academic literature converges on the batch-size limitation as a settled question.