Google's latest open model family can run multi-step autonomous tasks on your phone without a server round trip, and no cloud setup is required to get started.
Gemma 4 launched April 2 under the Apache 2.0 license — Google's first major move away from its custom Gemma license, which carried usage restrictions and terms Google could update unilaterally. The new license matches what Mistral, Qwen, and most of the open-weight ecosystem already use: no usage thresholds, no geographic restrictions, no acceptable use policy beyond what the law requires. For enterprise teams that had to route Gemma through legal review before deploying it commercially, that friction disappears.
The numbers support the capability story. The 31B dense model ranks #3 globally on the Arena AI open-weight leaderboard with a score of 1452. On AIME 2026, a rigorous math reasoning benchmark, it scores 89.2%. The 26B mixture-of-experts variant — which activates only 3.8 billion of its 25.2 billion parameters per inference step — sits at #6 with a score of 1441 and benchmarks competitively with dense models twice its effective size. The smaller variants punch above their weight: the E4B scores 42.5% on AIME 2026, the E2B reaches 37.5%.
On-device, the gains are significant. Google reports a 4x speed improvement over Gemma 3 on Android, with battery drain cut by up to 60%. Arm's own benchmarks, on chips using the SME2 extension, show a 5.5x average prefill speedup. On a Raspberry Pi 5 running the E2B variant on CPU, the model reaches 133 prefill and 7.6 decode tokens per second. With a Qualcomm Dragonwing NPU, that climbs to 3,700 prefill and 31 decode tokens per second.
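To make those throughput figures concrete, here is a back-of-envelope latency calculation from the numbers above. The 1,000-token prompt and 100-token response are arbitrary illustrative sizes, and the sketch assumes prefill and decode run sequentially with no runtime overhead:

```python
def latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Seconds to a full response: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Raspberry Pi 5, E2B on CPU: 133 prefill / 7.6 decode tokens per second
cpu = latency_s(1000, 100, 133, 7.6)
# Qualcomm Dragonwing NPU: 3,700 prefill / 31 decode tokens per second
npu = latency_s(1000, 100, 3700, 31)

print(f"CPU: {cpu:.1f}s, NPU: {npu:.1f}s")  # roughly 20.7s vs 3.5s
```

Decode throughput dominates on CPU, which is why the NPU's 4x decode gain matters more for interactive use than the headline prefill number.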
The architecture splits into four variants. E2B and E4B are built for edge devices — the "effective parameters" naming means only the parameters active during inference are counted. The quantized E2B occupies about 1.3 GB of storage and runs on devices with 6 GB of RAM; E4B needs roughly 2.5 GB and 8 GB of RAM. Both edge models have a 128K context window. The larger 26B and 31B models target servers and workstations with 256K context windows.
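As a sanity check on those footprints: quantized weight size scales with parameter count times bits per weight. The 4-bit width below is an illustrative assumption, not a published spec; the parameter counts are the ones quoted in this article:

```python
def weight_bytes_gb(params: float, bits_per_weight: int) -> float:
    """Approximate quantized weight size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

print(weight_bytes_gb(2e9, 4))     # 1.0 GB -- close to the ~1.3 GB reported;
                                   # some tensors likely stay at higher precision
print(weight_bytes_gb(25.2e9, 4))  # 12.6 GB -- a MoE must hold *all* 25.2B
                                   # parameters in memory, even though only
                                   # 3.8B are active per inference step
```

The second number is the catch with mixture-of-experts: the 26B variant buys 4B-class compute, not 4B-class memory.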
The agentic capabilities are the substantive shift. Gemma 4 supports native function calling — it can plan across multiple steps, call tools, and complete tasks autonomously without relying on instruction-following prompts to approximate structured behavior. The Google AI Edge Gallery app ships with what Google calls "Agent Skills": Wikipedia search, map interactions, auto-generated summaries, flashcards. The model can describe photos, turn voice input into visualizations, and integrate with other local models for tasks like text-to-speech or image generation. A demo skill describes and plays animal vocalizations by combining these capabilities.
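Native function calling generally follows a plan → call → observe loop. The sketch below shows the shape of that loop with a stubbed-out model; `run_model`, the message format, and the tool name are illustrative stand-ins, not the Gemma 4 API:

```python
import json

# Tool registry: name -> callable. The Wikipedia tool is a stub.
TOOLS = {
    "wikipedia_search": lambda query: f"Top article for {query!r}",
}

def run_model(messages):
    """Stub standing in for a local inference call: a function-calling model
    emits either a structured tool call or a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "wikipedia_search",
                "arguments": {"query": "Raspberry Pi 5"}}
    return {"answer": "Summary based on the tool result."}

def agent_loop(user_prompt, max_steps=4):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        out = run_model(messages)
        if "answer" in out:                 # model decided it is done
            return out["answer"]
        result = TOOLS[out["tool"]](**out["arguments"])  # execute the call
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_steps")

print(agent_loop("Tell me about the Raspberry Pi 5"))
```

The point of "native" function calling is that the model emits the structured call directly, so the loop above needs no prompt-engineering tricks to coax JSON out of free text.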
None of those individual features are novel. What matters is that a local model on a consumer phone handles them without a round trip to a server, and that the underlying model will ship as Gemini Nano 4 on new Android flagship devices later this year. Gemini Nano already runs on over 140 million Android devices, powering Smart Replies and audio summaries. If the Gemma 4 foundation carries forward, the on-device agentic infrastructure is about to be pervasive.
Gemma 4's τ2-bench agentic tool use score of 57.5% (E4B) versus Gemma 3's 6.6% closes the capability gap that made prior on-device claims feel like privacy novelties. The model still makes trade-offs relative to cloud-resident variants, but the delta has narrowed to the point where the trade-off is no longer disqualifying for a meaningful class of agent tasks.
The Developer and Enterprise Picture
Gemma 4 is available with CPU and GPU support across Android and iOS. Desktop support spans Windows, Linux, and macOS via Metal, plus WebGPU for browser execution. LiteRT-LM — the inference runtime built on the existing LiteRT stack trusted by millions of Android developers — handles deployment, with new GenAI-specific libraries layered on top.
For edge and IoT, the model runs on Raspberry Pi 5 and Qualcomm Dragonwing IQ8, which also powers the Arduino VENTUNO Q announced in March. Google also shipped a Python package and CLI tool for local experimentation, including tool calling support that mirrors Agent Skills.
The Gemma family has accumulated over 400 million downloads since the first generation, with more than 100,000 community variants on Hugging Face. The Google AI Edge Gallery app has climbed to fourth place among free productivity apps in the iOS App Store, behind only Claude, Gemini, and ChatGPT.
The License That Changes the Procurement Conversation
For the past two years, enterprise teams evaluating open-weight models faced a consistent trade-off: Gemma delivered strong performance but its custom license required legal review, created compliance ambiguity, and gave Google the right to change terms. Many teams chose Mistral or Qwen instead, where the license was already familiar to procurement.
Gemma 4 under Apache 2.0 removes that friction entirely. Fine-tuning on proprietary data, deploying commercially, and distributing derivative works all fall within standard Apache 2.0 terms — no call to legal required. Meta's Llama 4, by contrast, carries a community license with real restrictions: applications exceeding 700 million monthly active users require a separate commercial agreement, and the acceptable use policy restricts entire application categories. The Open Source Initiative has stated that Llama's license does not meet the Open Source Definition.
The timing is notable. As some Chinese AI labs — most notably Alibaba with its latest Qwen releases — have moved toward more restrictive terms, Google has opened Gemma 4 under the same permissive license the rest of the ecosystem uses. For teams that had been waiting for Google to compete on licensing terms as well as performance, the evaluation can begin without a procurement detour.
Google is also making a broader infrastructure bet here. Arm and Qualcomm optimizations are baked in from the start — this is not a reference implementation. The inference speed concern some have flagged for the 31B model on certain providers is real; the 26B MoE variant addresses it by delivering near-31B quality at 4B-class compute cost, and both workstation models can run serverless on Google Cloud Run with GPU support, scaling to zero when idle.
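A scale-to-zero GPU deployment on Cloud Run might look like the following Knative-style service spec. This is a sketch under assumptions: the image path, resource sizes, and the nvidia-l4 accelerator choice are placeholders, not an official recipe.

```yaml
# Sketch of a scale-to-zero Cloud Run GPU service (placeholder values).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: gemma-26b-moe
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"  # scale to zero when idle
        autoscaling.knative.dev/maxScale: "3"
    spec:
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4  # placeholder GPU type
      containers:
        - image: us-docker.pkg.dev/PROJECT/repo/gemma-server:latest
          resources:
            limits:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: "1"
```

With minScale at zero, idle periods cost nothing, at the price of a cold start that includes loading the model weights.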
For developers building agent systems, the implication is straightforward: the assumption that serious agents live in the cloud, where compute is available, needs updating. A model that plans, uses tools, and completes multi-step tasks on a device with 6–8 GB of RAM imposes a different design constraint than one that assumes a GPU cluster is always available. The stack that wins in the next two years will need to treat cloud-resident and device-resident models as first-class citizens — not as privacy-mode alternatives to a primary cloud deployment.
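One way to treat both tiers as first-class is a per-task router rather than a privacy-mode toggle. Everything in this sketch is illustrative: the backend names, context limits, and routing rules are assumptions, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    name: str
    max_context: int      # tokens the backend can hold
    needs_network: bool

# Hypothetical backends; names and limits are placeholders.
DEVICE = Backend("on-device-e4b", max_context=128_000, needs_network=False)
CLOUD = Backend("cloud-31b", max_context=256_000, needs_network=True)

def route(prompt_tokens: int, online: bool, sensitive: bool) -> Backend:
    """Prefer the device model; escalate to cloud only when it clearly wins."""
    if sensitive or not online:
        return DEVICE                 # privacy or offline: no choice to make
    if prompt_tokens > DEVICE.max_context:
        return CLOUD                  # context overflow forces the big model
    return DEVICE

print(route(4_000, online=True, sensitive=False).name)
```

The interesting design work is in the escalation rules; the structural point is that the device model is the default path, not the fallback.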