NVIDIA's BlueField-4 STX Puts KV Cache on the Critical Path for AI Inference
NVIDIA BlueField-4 STX Is Real Infrastructure — But You Cannot Buy It Yet

When an AI agent loses context mid-task because storage cannot keep pace with inference, that is not a model problem. It is a storage problem. That is the argument NVIDIA made at GTC 2026 on March 16, and it is the argument that makes BlueField-4 STX worth watching — not because it is shipping today, but because it is the most concrete hardware-level answer yet to a problem that every production AI system is quietly accumulating.
NVIDIA has not publicly defined what STX stands for. What the company has been explicit about is the architecture: a modular reference design for AI-native storage built around a new concept called CMX — Context Memory Storage. The first rack-scale implementation of CMX expands GPU memory with a high-performance context layer designed specifically to hold key-value cache data generated by large language models during inference.
Here is the bottleneck STX targets. Key-value cache is the stored record of what a model has already processed — the attention keys and values an LLM saves so it does not recompute them across the full context on every decoding step. It is what allows an agent to maintain coherent working memory across sessions, tool calls, and reasoning chains. As context windows have grown and agents have started taking more steps, that cache has grown with them. When KV cache data has to traverse a traditional storage path — through a general-purpose CPU, over a standard network fabric — to get back to the GPU, inference slows and expensive GPU cycles go idle. In a production AI factory running hundreds of concurrent agentic tasks, that idle time compounds fast.
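How large that cache gets is easy to estimate from first principles. The sketch below uses the standard KV-cache sizing formula; the model shape (a Llama-3-70B-class configuration with grouped-query attention) and FP16 precision are illustrative assumptions, not figures from NVIDIA's announcement.

```python
# Back-of-the-envelope KV cache sizing for one inference request.
# The model shape below is an illustrative Llama-3-70B-class assumption,
# not a figure from NVIDIA's STX announcement.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key-value cache for a single sequence.

    The leading 2 covers keys plus values; bytes_per_value=2 assumes FP16/BF16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head dimension 128.
per_request = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                             seq_len=128_000)
print(f"128k-token context: {per_request / 2**30:.1f} GiB per request")

# Hundreds of concurrent agentic tasks multiply that footprint.
print(f"500 concurrent requests: {500 * per_request / 2**40:.1f} TiB")
```

That works out to roughly 39 GiB for a single 128k-token request and around 19 TiB across 500 concurrent agents. At that scale the cache cannot live entirely in GPU memory, which is the gap a context memory tier like CMX is meant to fill.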
The architecture is built on the Vera Rubin platform with a storage-optimized BlueField-4 processor that combines NVIDIA's Vera CPU and the ConnectX-9 SuperNIC, running on Spectrum-X Ethernet and programmable through DOCA. The new DOCA Memo component, confirmed by NVIDIA VP Ian Buck in a press briefing, gives storage partners a software reference platform to build context-optimized storage on top of the BlueField-4 programmable substrate — not just a hardware blueprint but a code path.
Against traditional CPU-based storage, NVIDIA claims 5x token throughput, 4x energy efficiency, and 2x faster data ingestion. These figures are NVIDIA's own and should be treated as directional until a published baseline is available. Moving KV cache handling off the CPU onto a DPU with a SuperNIC and an RDMA fabric is a meaningful architectural shift, and the direction is plausible. But infrastructure buyers doing TCO analysis will need published comparison data that NVIDIA has not yet provided.
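To see what "directional" means in practice, here is a minimal cost-per-token sketch. Every input is a placeholder: the GPU-hour price and baseline throughput are hypothetical, and the 5x multiplier is NVIDIA's own unverified claim.

```python
# Directional cost-per-token sketch. Every input here is a placeholder:
# the $/GPU-hour rate and baseline throughput are hypothetical, and the
# 5x multiplier is NVIDIA's own unverified claim.

GPU_HOUR_USD = 3.00           # hypothetical cloud GPU-hour price
BASELINE_TOK_PER_SEC = 1_000  # hypothetical baseline token throughput
STX_MULTIPLIER = 5            # NVIDIA's claimed throughput uplift

def usd_per_million_tokens(tok_per_sec, gpu_hour_usd=GPU_HOUR_USD):
    tokens_per_hour = tok_per_sec * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000

baseline = usd_per_million_tokens(BASELINE_TOK_PER_SEC)
claimed = usd_per_million_tokens(BASELINE_TOK_PER_SEC * STX_MULTIPLIER)
print(f"baseline: ${baseline:.3f} per 1M tokens")
print(f"claimed : ${claimed:.3f} per 1M tokens")
# Until NVIDIA publishes the baseline configuration, the ratio, not the
# absolute numbers, is the claim under evaluation.
```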
The partner list is where STX gets more interesting as a signal. Storage providers co-designing STX-based infrastructure include Cloudian, DDN, Dell Technologies, Everpure, Hitachi Vantara, HPE, IBM, MinIO, NetApp, Nutanix, VAST Data, and WEKA. Manufacturing partners building STX-based systems include AIC, Supermicro, and Quanta Cloud Technology. On the cloud and AI side, CoreWeave, Crusoe, IREN, Lambda, Mistral AI, Nebius, Oracle Cloud Infrastructure, and Vultr have all committed to STX for context memory storage.
That combination of enterprise storage incumbents — Dell, HPE, NetApp, IBM — alongside AI-native cloud providers is the meaningful signal. NVIDIA is not positioning STX as a specialty product for hyperscalers alone. It is positioning it as a reference architecture for storage infrastructure serving agentic AI workloads.
IBM sits on both sides of the announcement. It is listed as a storage partner co-designing STX-based infrastructure, and NVIDIA separately confirmed that IBM Storage Scale System 6000 — validated on NVIDIA DGX platforms — serves as the high-performance storage foundation for NVIDIA's own GPU-native analytics infrastructure. IBM also announced expanded collaboration with NVIDIA, including GPU-accelerated integration between watsonx.data Presto and NVIDIA's cuDF library. A production proof of concept with Nestlé put numbers on the broader argument: a data refresh cycle across the company's Order-to-Cash data mart, covering 186 countries and 44 tables, dropped from 15 minutes to three, with IBM reporting 83% cost savings and a 30x price-performance improvement. That result is structured analytics, not agentic inference, but it makes the data-layer argument concrete.
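cuDF itself is public and well documented, so it is possible to show the flavor of the work being accelerated. The sketch below is a standalone cuDF aggregation with invented column names and file path; it is not IBM's Presto integration code, just the style of GPU-native rollup the Nestlé numbers describe.

```python
# Minimal standalone cuDF sketch: the kind of GPU-native aggregation the
# watsonx.data Presto + cuDF integration accelerates. Column names and
# the parquet path are invented for illustration; this is not IBM's code.
import cudf

# Load order data straight into GPU memory (hypothetical file).
orders = cudf.read_parquet("orders.parquet")

# A typical Order-to-Cash-style rollup: revenue by country, on the GPU.
summary = (
    orders.groupby("country")
          .agg({"net_amount": "sum", "order_id": "count"})
          .sort_values("net_amount", ascending=False)
)
print(summary.head(10))
```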
STX-based platforms will be available from partners in the second half of 2026. Given that most major storage vendors are already co-designing on the architecture, enterprises evaluating storage refreshes for AI infrastructure in the next 12 months should expect STX-based options from their existing vendor relationships. The question worth asking of any storage vendor currently in conversation: what is your STX roadmap?
This is not a product NVIDIA sells. It is a reference architecture — a blueprint and a software stack — handed to the partner ecosystem to build around. That distinction matters. Actual performance in production will depend on how well each partner implements the design, particularly the RDMA and RoCE configuration in the network fabric connecting CMX to the GPU. Storage is no longer a box at the end of a wire but a distributed layer of the network fabric itself — which means performance tuning now requires a team that understands both networking and storage, not one or the other.
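A rough time budget shows why that fabric tuning matters. The sketch below estimates how long reloading a KV cache takes at standard Ethernet line rates; the 40 GiB cache size (the single-request figure from the earlier sketch) and the 80% effective-throughput figure are illustrative assumptions, not measured values.

```python
# Rough KV-cache transfer-time budget. Link speeds are standard Ethernet
# rates; the 40 GiB cache size and 80% effective-throughput figure are
# illustrative assumptions, not measured values.

CACHE_GIB = 40     # roughly the 128k-token cache from the earlier sketch
EFFICIENCY = 0.8   # assumed fraction of line rate actually achieved

def transfer_seconds(cache_gib, link_gbps, efficiency=EFFICIENCY):
    bits = cache_gib * 2**30 * 8
    return bits / (link_gbps * 1e9 * efficiency)

for gbps in (100, 400, 800):
    print(f"{gbps:>3} Gb/s link: {transfer_seconds(CACHE_GIB, gbps):.2f} s")
# Roughly 4.3 s at 100 Gb/s, 1.1 s at 400 Gb/s, 0.5 s at 800 Gb/s:
# seconds per reload, per request, each spent with the GPU idle unless
# the fabric and the storage layer are tuned together.
```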
The storage layer becoming a first-class infrastructure decision — not an afterthought to GPU procurement — is the real story here. General-purpose NAS and object storage were not designed to serve KV cache data at inference latency requirements. STX is NVIDIA's answer to that gap, and the breadth of the partner commitment suggests the industry sees the gap as worth addressing.

