Google is betting that the real money in AI is not in building the models. It is in running them. At Google Cloud Next in Las Vegas this week, the company announced its first chip designed specifically for serving AI models, splitting its traditional TPU (Tensor Processing Unit, the custom silicon at the center of its AI infrastructure) into two specialized lines: one for training AI systems, one for running them at scale.
Training a large AI model is expensive but happens once. Serving it, answering billions of queries from millions of users, happens every day. That ongoing workload is called inference, and Google just bet that custom silicon purpose-built for inference can undercut the general-purpose chips that currently handle most of it. The question is whether that bet is real or expensive PR.
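A back-of-the-envelope model makes the asymmetry concrete. Every number in the sketch below is a hypothetical placeholder, not a figure from the announcement:

```python
# Hypothetical cost model: one-time training vs. ongoing inference.
# All numbers are illustrative assumptions, not reported figures.
TRAINING_COST = 100e6    # assume $100M to train a frontier model, once
COST_PER_QUERY = 0.002   # assume $0.002 of compute per served query
QUERIES_PER_DAY = 500e6  # assume 500M queries per day across all users

daily_serving_cost = COST_PER_QUERY * QUERIES_PER_DAY  # $1M/day here
breakeven_days = TRAINING_COST / daily_serving_cost

print(f"Daily serving cost: ${daily_serving_cost / 1e6:.1f}M")
print(f"Serving spend passes the training bill after {breakeven_days:.0f} days")
# With these placeholders, inference overtakes the entire training cost
# in 100 days, then keeps accruing indefinitely.
```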
Google expects to ship 4.3 million TPU units in 2026, scaling to more than 35 million by 2028, The Next Web reported, citing research firm TrendForce. That is not a hardware refresh cycle. That is a signal of where Google thinks the money flows. Anthropic has committed to up to one million of Google's next-generation TPUs, with access to approximately 3.5 gigawatts of compute starting in 2027, The Next Web reported. At Anthropic's scale, a 20 percent reduction in cost per query compounds into billions of dollars annually. That is the prize.
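The arithmetic behind "billions annually" is simple. The 20 percent figure comes from the scenario above; the query volume and unit cost below are hypothetical assumptions for illustration:

```python
# What a 20% cut in cost per query means at hyperscale.
# Query volume and unit cost are illustrative assumptions; only the
# 20% savings rate comes from the scenario in the text.
QUERIES_PER_YEAR = 2e12   # assume 2 trillion served queries per year
COST_PER_QUERY = 0.005    # assume $0.005 blended compute cost per query
SAVINGS_RATE = 0.20

baseline = QUERIES_PER_YEAR * COST_PER_QUERY  # $10B/year baseline spend
savings = baseline * SAVINGS_RATE             # $2B/year saved

print(f"Baseline serving spend: ${baseline / 1e9:.0f}B per year")
print(f"Saved at 20% lower cost per query: ${savings / 1e9:.0f}B per year")
```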
The inference chip, codenamed Zebrafish and officially called TPUv8i, is built by MediaTek, which has been Google's partner on the input/output side of its custom silicon strategy since the current-generation Ironwood chip, Wccftech reported. MediaTek's designs on Ironwood ran 20 to 30 percent cheaper than comparable alternatives, The Next Web reported. The training chip, codenamed Sunfish and officially called TPUv8t, is designed by Broadcom, which commands more than 70 percent of the custom AI accelerator market and projects $100 billion in AI chip revenue by 2027. Google is now running the most diversified custom AI supply chain in the industry, an explicit bet that splitting the problem is the way to win it.
Nvidia currently dominates both training and inference. Its H100 and H200 GPUs handle the vast majority of AI workloads in hyperscale data centers. If custom inference silicon can meaningfully undercut GPU economics, it erodes Nvidia's position in the higher-volume half of the AI compute market. Broadcom, which also designs custom AI chips for Google and other clients, benefits either way. Industry analysts project custom chip sales will grow 45 percent in 2026, compared with 16 percent growth in GPU shipments, a divergence that suggests the inference economics bet is not just Google's.
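If those growth rates held beyond 2026, the gap would compound quickly. A quick extrapolation, purely illustrative since the cited projections cover a single year:

```python
# Illustrative extrapolation only: the source projects growth for 2026
# alone; holding the rates constant for several years is an assumption.
CUSTOM_GROWTH = 0.45  # projected custom AI chip sales growth, 2026
GPU_GROWTH = 0.16     # projected GPU shipment growth, 2026

ratio = 1.0  # custom-chip sales relative to GPU sales, indexed to 1.0
for year in range(1, 4):
    ratio *= (1 + CUSTOM_GROWTH) / (1 + GPU_GROWTH)
    print(f"Year {year}: custom/GPU ratio at {ratio:.2f}x the starting level")
# The ratio grows about 1.25x per year, roughly doubling in three years
# if the divergence persisted.
```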
Both TPUv8 chips target TSMC's 2-nanometer manufacturing process, with tape-out expected in late 2027, Wccftech reported. The current-generation Ironwood chip delivers ten times the peak performance of TPU v5p and more than four times the per-chip performance of TPU v6e, Google said in its Cloud Next announcement. Each Ironwood chip carries 192 gigabytes of HBM3E memory with 7.2 terabytes per second of bandwidth, and the chips scale to 9,216 per superpod, for 42.5 exaflops of FP8 compute. Google Cloud Next runs through April 24 in Las Vegas.
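The per-chip numbers fall out of those pod totals, and the ratio of compute to memory bandwidth hints at why bandwidth matters for serving. The calculation below uses only figures from the announcement; the bandwidth-bound framing is an interpretation, not Google's claim:

```python
# Sanity-check Ironwood's announced pod figures.
POD_FP8_EXAFLOPS = 42.5  # FP8 compute per superpod, from Google
CHIPS_PER_POD = 9216     # chips per superpod, from Google
HBM_BANDWIDTH_TBS = 7.2  # HBM3E bandwidth per chip, from Google

per_chip_pflops = POD_FP8_EXAFLOPS * 1e3 / CHIPS_PER_POD
print(f"Implied per-chip FP8 compute: {per_chip_pflops:.2f} PFLOPS")  # ~4.61

# Peak FLOPs available per byte moved from HBM. Serving large models is
# often limited by memory bandwidth rather than raw compute, which is
# why the 7.2 TB/s figure matters as much as the exaflops headline.
flops_per_byte = (per_chip_pflops * 1e15) / (HBM_BANDWIDTH_TBS * 1e12)
print(f"Peak FLOPs per HBM byte: {flops_per_byte:.0f}")  # ~640
```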
No independent benchmarks for the TPUv8i exist yet. The cost and latency claims that will determine whether this is a real challenge to Nvidia or a press release cannot be tested until the chips come back from their late-2027 tape-out and customers run their own workloads. The full announcement is on Google's AI Blog.
What to watch: whether the MediaTek-designed inference chip actually delivers lower per-query cost at scale, and whether any major customer besides Anthropic commits to meaningful TPUv8i volumes. If the 2027 benchmarks hold, Nvidia's inference moat is at risk. If they do not, this was an expensive sideshow, and Google will have bet the agentic era on a chip that did not deliver.