Google DeepMind released Gemma 4 on April 3, and the headline is real: the 31B dense model sits at number three on the Arena AI open-models leaderboard with an Elo of 1452. But the benchmark picture obscures a throughput gap that community testers uncovered within 24 hours of release, and it lands squarely on the model Google is positioning as the efficient alternative.
The Gemma 4 26B mixture-of-experts model runs at roughly 11 tokens per second on consumer hardware, according to multiple independent testers on identical RTX 5060 Ti 16GB graphics cards. Qwen 3.5 35B-A3B, a comparable MoE model from Alibaba's Qwen team, runs at 60-plus tokens per second on the same hardware. That is better than a 5-to-1 throughput gap. Trending Topics reported the figures first; independent testers on the DEV Community and Hacker News confirmed them within hours of the release.
The culprit is familiar to anyone who has been watching Google's inference engineering: the 26B-A4B variant activates 3.8 billion of its 26 billion total parameters per token through a MoE routing mechanism: 128 experts, eight active per token, plus one shared expert. That is a large pool of experts being routed through a relatively thin shared layer, and the community consensus is that it creates memory-bandwidth bottlenecks on commodity hardware that Qwen's more compact MoE design avoids. Hugging Face's technical analysis confirms the expert count.
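To make the routing pattern concrete, here is a minimal sketch of standard top-k MoE routing with the counts the article reports: 128 routed experts, eight selected per token, plus one always-on shared expert. The hidden and expert widths (`HIDDEN`, `EXPERT_FF`) and the random router weights are illustrative assumptions, not Gemma 4's actual configuration.

```python
import numpy as np

N_EXPERTS = 128      # routed experts (per the article)
TOP_K = 8            # experts selected per token
HIDDEN = 2048        # hypothetical hidden size
EXPERT_FF = 1024     # hypothetical per-expert FFN width

rng = np.random.default_rng(0)
router_w = rng.standard_normal((N_EXPERTS, HIDDEN))  # router projection

def route(token_hidden: np.ndarray) -> np.ndarray:
    """Return indices of the top-k experts for one token."""
    logits = router_w @ token_hidden        # (N_EXPERTS,) router scores
    return np.argsort(logits)[-TOP_K:]      # the eight highest-scoring

token = rng.standard_normal(HIDDEN)
active = route(token)

# Per-token weight traffic: the 8 routed winners plus the shared expert
# must be streamed from memory while the other 120 experts sit idle --
# the bandwidth pressure the community testing points at.
params_per_expert = 2 * HIDDEN * EXPERT_FF  # up- and down-projection
active_params = (TOP_K + 1) * params_per_expert
total_params = (N_EXPERTS + 1) * params_per_expert
print(f"{len(active)} routed experts active")
print(f"active fraction: {active_params / total_params:.1%}")
```

Under these assumptions only about 7 percent of expert parameters are active per token, yet which 7 percent changes token to token, so the full expert pool must stay resident and reachable at high bandwidth.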
DeepMind's own blog frames the MoE result differently: the 26B-A4B variant achieves 97 percent of the dense 31B model's quality while activating only 3.8 billion parameters, and ranks number six on the Arena AI open-models leaderboard with an Elo of 1441. That is a valid engineering claim; parameter count is not the same as compute cost. But the framing obscures the inference-speed reality that community testing surfaced within 24 hours of release. The efficiency story holds on datacenter H100 clusters. It does not hold on a $450 graphics card.
The bigger surprise is the E2B model, a variant with 2.3 billion effective parameters that community testers confirmed beats Gemma 3 27B on most benchmarks. Let that land. A two-billion-parameter model matching one that required 27 billion parameters to run is genuinely unusual, and it points to something the dense-model benchmark comparisons miss entirely. DEV Community user dentity007 and independent testers on Hacker News flagged the E2B result within hours of the release. According to DeepMind's blog, the E2B model handles text, image, and audio inputs natively, a capability the larger models do not have: the edge model supports speech recognition directly, not just vision.
On the standard benchmarks Google cited, the 31B model posts credible numbers: 85.2 percent on MMLU-Pro, 84.3 percent on GPQA Diamond, 80 percent on LiveCodeBench, a 2150 Elo rating on Codeforces, and 89.2 percent on AIME 2026. Those figures are real and the model is genuinely capable. But on MMLU-Pro and GPQA Diamond specifically, the 31B trails Qwen 3.5 27B: 85.2 versus 86.1 percent on the former, 84.3 versus 85.5 percent on the latter. Google is number three on the leaderboard, and the gap to Qwen at number one is real on the two benchmarks most weighted toward academic knowledge. DeepMind's own blog has the full benchmark table.
There is one change that matters to the commercial users Google is courting with this release. Gemma 4 ships under Apache 2.0, replacing the Google-specific license that gated Gemma 3. That means fine-tuning, commercial deployment, and integration into proprietary products without the prior restrictions. The Gemmaverse — the community of developers who have downloaded Gemma models 400 million times and built more than 100,000 variants — now has a license that does not require them to ask Google's permission first. DeepMind confirmed the license change in the release post.
The hardware requirement for the dense 31B model is not trivial. Running it unquantized requires an NVIDIA H100 or an RTX 6000 Pro — workstation-class silicon that starts around $4,000 and goes up from there. Ars Technica reported the hardware spec after the release. This is not a model you are running on your laptop, even if the parameter count sounds small by frontier-model standards.
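A back-of-envelope check shows why the dense model lands in workstation territory. This sketch assumes bf16 weights at 2 bytes per parameter and ignores the KV cache, activations, and framework overhead, so the real requirement is somewhat higher.

```python
PARAMS = 31e9            # dense parameter count from the release
BYTES_PER_PARAM = 2      # bf16 / fp16 storage per weight

# Weights alone, in gigabytes; KV cache and activations come on top.
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: {weights_gb:.0f} GB")  # 62 GB
```

At 62 GB for the weights alone, no 16 GB or 24 GB consumer card can hold the model unquantized, which is consistent with the H100 and RTX 6000 Pro requirement Ars Technica reported.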
What to watch next: the community fine-tuning results. The Gemma series has historically punched above its weight class once researchers get their hands on the weights — the 100,000-plus community variants from prior Gemma releases are not an accident. The combination of Apache 2.0, the strong dense baseline, and the surprisingly capable E2B variant creates conditions for the kind of downstream innovation that does not show up in the day-one benchmark table. The latency problem on the 26B MoE is a real constraint for consumer deployment, but it is a software and hardware optimization problem, not a fundamental architecture problem. Google has shown in prior Gemma releases that it can close those gaps in subsequent versions. Whether it will is a different question.