Your GPU is not too slow. It is waiting.
That distinction matters more than it sounds, and it is the core of what Hugging Face researchers Rémi Ouazan Reboul, Pedro Cuenca, and Aritra Roy Gosthipaty found when they profiled a synchronous continuous-batching loop running an 8B parameter model at batch size 32 over 8K tokens. The total generation time was 300.6 seconds. Of that, 72.1 seconds — 24% — was GPU idle. The GPU was not doing anything because the CPU had not finished preparing the next batch yet. This is not a hardware problem. It is a scheduling problem.
The finding, published May 14 as the second post in Hugging Face's ongoing inference-efficiency series, matters for anyone running LLM inference at scale. Continuous batching — the technique that packs variable-length requests into tight GPU batches to eliminate padding waste — solved one inefficiency. It did not solve this one. The CPU and GPU still take turns: while the GPU computes, the CPU waits. While the CPU prepares the next batch, the GPU waits. These gaps compound over hundreds of steps per second, and as Hugging Face's benchmark shows, they can consume nearly a quarter of total runtime.
The fix is not a new chip. It is a different way of programming the hardware.
The default stream problem
To understand the gap, you need to know what happens when you run PyTorch code without specifying a CUDA stream. Every GPU operation lands on the default stream, and the default stream is synchronizing: it waits for all other streams to flush before it runs anything. This means that if you transfer data from GPU to CPU using a default-stream operation, even one that is supposed to be non-blocking, the CPU blocks until every GPU task is done. The GPU, in turn, waits for the CPU to finish its batch update before it can start the next compute cycle. The result is that the CPU and the GPU are never doing useful work at the same time.
The fix, as Hugging Face describes it in their blog post, is to move all GPU operations off the default stream and use non-default streams, which return control to the CPU immediately after launching. The specific pattern is a three-stream pipeline: one stream handles host-to-device (H2D) data transfer, one handles GPU compute, and one handles device-to-host (D2H) transfer. Each stream is independent, so while the GPU is running batch N's compute, the CPU can simultaneously prepare and transfer batch N+1's inputs on the H2D stream. CUDA events — record and wait calls — enforce cross-stream ordering without blocking the CPU.
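The three-stream pattern can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the transformers implementation: the function name run_three_stream_pipeline and the model callable are hypothetical, and the function falls back to a sequential loop when no CUDA device is present so that it stays runnable anywhere.

```python
import torch


def run_three_stream_pipeline(cpu_batches, model):
    """Apply `model` to each CPU batch, overlapping H2D copy, compute, and D2H copy."""
    if not torch.cuda.is_available():
        # No GPU available: plain sequential loop so the sketch still runs.
        return [model(batch) for batch in cpu_batches]

    device = torch.device("cuda")
    h2d_stream = torch.cuda.Stream()      # host-to-device copies
    compute_stream = torch.cuda.Stream()  # model forward passes
    d2h_stream = torch.cuda.Stream()      # device-to-host copies

    outputs, finished_events = [], []
    for cpu_batch in cpu_batches:
        pinned = cpu_batch.pin_memory()  # pinned host memory allows async copies

        with torch.cuda.stream(h2d_stream):
            gpu_in = pinned.to(device, non_blocking=True)
        copied = torch.cuda.Event()
        copied.record(h2d_stream)

        # Compute may not start until the input copy is done; the CPU does not block.
        compute_stream.wait_event(copied)
        # Tell the caching allocator that another stream uses this memory.
        gpu_in.record_stream(compute_stream)
        with torch.cuda.stream(compute_stream):
            gpu_out = model(gpu_in)
        computed = torch.cuda.Event()
        computed.record(compute_stream)

        # The output copy may not start until compute is done.
        d2h_stream.wait_event(computed)
        gpu_out.record_stream(d2h_stream)
        with torch.cuda.stream(d2h_stream):
            host_out = torch.empty(gpu_out.shape, dtype=gpu_out.dtype, pin_memory=True)
            host_out.copy_(gpu_out, non_blocking=True)
        finished = torch.cuda.Event()
        finished.record(d2h_stream)

        outputs.append(host_out)
        finished_events.append(finished)

    # Block only once, at the end, until every output has landed on the host.
    for event in finished_events:
        event.synchronize()
    return outputs
```

Because every launch returns immediately, the loop can queue batch N+1's copy while batch N is still computing. The only place the CPU waits is the final synchronize on each output's event.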
If CPU overhead could be hidden entirely, total generation time would drop from 300.6 seconds to roughly 228.5 seconds (the measured total minus the 72.1 seconds of GPU idle time): a 24% reduction. That is what Hugging Face calls the optimistic view.
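The arithmetic behind that ceiling is short enough to check by hand. The snippet below uses only the two measured figures quoted from the post; the variable names are illustrative.

```python
total_s = 300.6  # measured wall-clock generation time from the post
idle_s = 72.1    # measured GPU idle time from the post

idle_fraction = idle_s / total_s
best_case_s = total_s - idle_s           # if every idle second were hidden
best_case_speedup = total_s / best_case_s

print(f"idle fraction: {idle_fraction:.1%}")          # idle fraction: 24.0%
print(f"best-case time: {best_case_s:.1f} s")         # best-case time: 228.5 s
print(f"best-case speedup: {best_case_speedup:.2f}x") # best-case speedup: 1.32x
```

Note that a 24% cut in runtime corresponds to roughly a 1.32x throughput gain, not 1.24x, because the saving is measured against the original total.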
The engineering below the headline
There is a reason this gap existed in widely deployed code. The async batching pattern requires solving problems that look simple until you try to run two batches simultaneously.
The first is a race condition. Batch N and batch N+1 cannot share the same device-side input buffers, or the CPU will overwrite data the GPU is still reading. The solution is double-slot tensor buffering: two alternating slots, where batch N processes from slot A while the CPU updates slot B for batch N+1. The cost is roughly doubled RAM and VRAM for input tensors.
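The alternating-slot idea can be shown without a GPU. In this minimal sketch, plain Python lists stand in for the preallocated device tensors, and the class name DoubleSlotBuffer is hypothetical rather than anything from the transformers codebase.

```python
class DoubleSlotBuffer:
    """Two fixed-size input slots used alternately by consecutive batches."""

    def __init__(self, slot_size):
        # Both slots are allocated up front, which is where the 2x memory cost comes from.
        self.slots = [[0] * slot_size, [0] * slot_size]

    def slot_for(self, batch_index):
        # Even batches use slot 0, odd batches use slot 1.
        return self.slots[batch_index % 2]

    def write(self, batch_index, values):
        slot = self.slot_for(batch_index)
        if len(values) > len(slot):
            raise ValueError("batch does not fit in the slot")
        slot[: len(values)] = values  # in-place update, the slot object never changes
        return slot


buffers = DoubleSlotBuffer(slot_size=4)
in_flight = buffers.write(0, [1, 2, 3, 4])  # batch 0: the "GPU" is reading this slot
buffers.write(1, [9, 9, 9, 9])              # batch 1: the "CPU" fills the other slot
print(in_flight)                            # batch 0's inputs are untouched
```

Batch 2 maps back to slot 0, so the scheme is only safe if batch 0 has finished reading before batch 2 is written. In a real pipeline that ordering has to be enforced, for example with the CUDA events described earlier.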
The second problem is CUDA graphs. Inference deployments often use CUDA graphs to reduce CPU-side dispatch overhead by pre-recording and replaying GPU execution sequences. A graph captured against slot A's memory addresses cannot be replayed against slot B's. The answer Hugging Face landed on is a memory pool: both graphs allocate from the same shared VRAM pool, and the only constraint is that they cannot execute concurrently — which is already guaranteed by the batch pipeline ordering. The total VRAM usage stays roughly at the single-graph level.
The tradeoffs are real. Async batching in the transformers implementation requires use_async_batching=True in ContinuousBatchingConfig, which in turn requires CUDA graphs (use_cuda_graph=True) and roughly doubles the VRAM needed for input tensors. It also currently requires FlashAttention, because FlashAttention does not need an attention mask. The mask is the largest input tensor, and if it had to be double-buffered along with the other inputs, the 2x VRAM overhead would become prohibitive.
Is this your bottleneck?
The 24% figure is Hugging Face's number, measured on a specific configuration. For builders evaluating whether this applies to their pipeline, the relevant questions are narrower than the headline suggests.
Is your batch size large enough that CPU preparation time is significant relative to GPU compute time? For small batches or short sequences, CPU overhead is smaller and the gain is smaller. Is your workload memory-bound rather than compute-bound? If the GPU is already starved for data, hiding CPU latency will not help. Are you running CUDA graphs and FlashAttention? If not, the implementation path is longer.
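The first question can be made concrete with a toy timing model. This is a simplification with assumed constant per-step costs, not a measurement: a synchronous loop pays CPU plus GPU time on every step, while a fully overlapped loop pays only the slower of the two once the pipeline is full. The function names are illustrative.

```python
def sync_time(n_steps, cpu_s, gpu_s):
    """CPU and GPU take turns: every step costs both."""
    return n_steps * (cpu_s + gpu_s)


def overlapped_time(n_steps, cpu_s, gpu_s):
    """CPU prepares step i+1 while the GPU runs step i."""
    if n_steps == 0:
        return 0.0
    # The first preparation and the last compute cannot be hidden.
    return cpu_s + (n_steps - 1) * max(cpu_s, gpu_s) + gpu_s


def fraction_saved(n_steps, cpu_s, gpu_s):
    baseline = sync_time(n_steps, cpu_s, gpu_s)
    return (baseline - overlapped_time(n_steps, cpu_s, gpu_s)) / baseline


# CPU preparation is about a quarter of each step: large saving.
print(f"{fraction_saved(1000, cpu_s=0.24, gpu_s=0.76):.1%}")
# CPU preparation is about 2% of each step: almost nothing to gain.
print(f"{fraction_saved(1000, cpu_s=0.02, gpu_s=0.98):.1%}")
```

In this model the saving can never exceed the CPU's share of a synchronous step, which is why the ratio of CPU preparation time to GPU compute time is the number to measure first.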
The profiling script Hugging Face published makes the diagnostic self-serve: instrument your own pipeline, dump CPU and GPU activity spans, and look at the ratio. The gap the blog post describes may be larger or smaller in your environment.
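Once you have activity spans, the ratio itself is a few lines of code. The sketch below assumes GPU busy periods are available as (start, end) pairs in seconds. That format is an assumption made for illustration, not the output format of the published script, and the function name gpu_idle_fraction is hypothetical.

```python
def gpu_idle_fraction(gpu_spans, wall_start, wall_end):
    """Fraction of the wall-clock window in which no GPU span was active."""
    total = wall_end - wall_start
    if total <= 0:
        raise ValueError("wall_end must be greater than wall_start")

    busy = 0.0
    cursor = wall_start  # everything before the cursor is already counted
    for start, end in sorted(gpu_spans):
        start = max(start, cursor)  # skip any part that overlaps earlier spans
        end = min(end, wall_end)
        if end > start:
            busy += end - start
            cursor = end
    return 1.0 - busy / total


# Three kernels, two of which overlap, inside a 10-second window.
spans = [(0.0, 3.0), (4.0, 7.0), (6.0, 9.0)]
print(f"{gpu_idle_fraction(spans, 0.0, 10.0):.0%}")  # 20%
```

A result near the post's 24% suggests the same scheduling gap. A result near zero suggests the GPU is already saturated and the time is better spent elsewhere.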
Hugging Face is not the only team working on this class of problem. vLLM, TensorRT-LLM, and SGLang have all made engineering investments in continuous batching and GPU-CPU overlap. The CUDA streams technique itself is general CUDA programming — not a transformers-specific invention. What Hugging Face did is land a clean, documented, reproducible implementation in the library that millions of researchers already use, which is genuinely useful regardless of whether the 24% figure transfers to any given workload.
The next inference frontier
The two-post series is worth noting as context. The first post, published November 2025, covered continuous batching from first principles: KV cache, FlashAttention, attention masks, ragged batching. This second installment is the payoff — the gap that remained after the obvious inefficiencies were fixed. The pattern across both posts points somewhere consistent: the gains left in LLM inference are increasingly in scheduler and memory management design, not in model architecture or hardware speed. For anyone building or investing in inference infrastructure, that is the more durable signal.
The GPU is waiting. Whether that wait is your problem depends on what else your pipeline is doing.