Google's new iPhone app can run a genuine AI model entirely on your device, no internet required. That's notable. What's more notable is what the app doesn't do: it doesn't remember anything.
Google AI Edge Gallery, released April 2 alongside Gemma 4, is the first official app from a major AI vendor designed for trying local language models on iOS. Simon Willison, a developer and writer who reviewed the app, described it as "the first time I have seen a local model vendor release an official app for trying out their models on iPhone." The app runs two Gemma 4 variants locally — the 2-billion parameter E2B and the 4-billion parameter E4B — with offline inference, image question-and-answer, and audio transcription up to 30 seconds.
The model sizes are small enough to download over cellular: E2B is a 2.54GB download and runs at 133 prefill (prompt-processing) and 7.6 decode (generation) tokens per second on a Raspberry Pi 5 CPU, according to Google's developer blog. On a Qualcomm Dragonwing NPU, performance reaches 3,700 prefill and 31 decode tokens per second. The app itself is 35.4MB and requires iOS 17.0 or later.
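To make those throughput figures concrete, here is a rough back-of-the-envelope latency estimate using the numbers above. This is illustrative arithmetic only — the prompt and output lengths are arbitrary, and real performance depends on hardware, quantization, and prompt shape:

```python
# Rough end-to-end latency estimate from the reported throughput figures.
# Illustrative only: actual performance varies with device and workload.

def estimated_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Time to process the prompt (prefill) plus generate the output (decode)."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Hypothetical request: 500-token prompt, 100-token reply.
# E2B on a Raspberry Pi 5 CPU: 133 prefill, 7.6 decode tokens/sec.
cpu = estimated_latency(500, 100, prefill_tps=133, decode_tps=7.6)
# Same request on a Qualcomm Dragonwing NPU: 3,700 prefill, 31 decode tokens/sec.
npu = estimated_latency(500, 100, prefill_tps=3_700, decode_tps=31)

print(f"CPU: ~{cpu:.1f}s, NPU: ~{npu:.1f}s")  # → CPU: ~16.9s, NPU: ~3.4s
```

The asymmetry is the point: decode speed, not prefill, dominates wait time on the CPU, which is why the NPU's fourfold decode improvement matters more in practice than its 28x prefill gain.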
But the real thing worth noticing is the absence of a feature every other consumer AI product considers mandatory: conversation history.
"Ephemeral conversations with this app are not stored," Willison noted. No logs, no context retention between sessions, no account-linked conversation archive. You start fresh every time.
That choice sits in uncomfortable contrast with the App Store privacy label. According to the listing, the data collected and linked to your identity includes identifiers, diagnostics, and other data. The app requires an internet connection to download models, and model downloads are separate from the app itself — meaning Apple's standard app review doesn't audit the model weights or what they contain.
Gemma 4 launched under the Apache 2.0 license on April 2, available in four sizes: E2B, E4B, a 26-billion parameter mixture-of-experts variant, and a 31-billion parameter dense model. Developers have downloaded Gemma more than 400 million times, Google said, and the model family has spawned over 100,000 variants in the Gemmaverse community. The 31B model ranks as the number three open model on Arena AI's text leaderboard; the 26B ranks sixth.
The larger Gemma 4 variants offer a 256,000-token context window. The edge-focused E2B and E4B models max out at 128,000 tokens — still substantial, but intentionally smaller for on-device deployment.
Google positions AI Edge Gallery as a demonstration platform: a way to try Gemma 4, explore its capabilities, and see what local inference feels like. The app includes what Google calls "skills" — pre-configured tool-calling demonstrations. Google's developer blog says LiteRT-LM, the runtime powering these skills, can process 4,000 input tokens across two distinct skills in under three seconds.
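That figure implies a sustained input-processing rate, which is worth sanity-checking against the per-device numbers cited earlier (simple arithmetic on the stated claim, nothing more):

```python
# Implied minimum throughput from the stated LiteRT-LM figure:
# 4,000 input tokens across two skills in under 3 seconds.
tokens, seconds = 4_000, 3.0
implied_tps = tokens / seconds
print(f"Implied prefill throughput: >= {implied_tps:.0f} tokens/sec")
# ~1,333 tokens/sec — an order of magnitude above the Pi 5 CPU
# prefill figure (133) but below the Dragonwing NPU figure (3,700),
# consistent with the demo running on accelerated mobile hardware.
```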
The ephemeral-by-design framing is either a genuine privacy stance or a convenient answer to a hard question. Local inference keeps data off servers — that's real and valuable. But an app that runs entirely on-device while still collecting identifiers and diagnostics is not the same thing as an app that keeps nothing. The question Google isn't answering explicitly is what those diagnostics contain, and whether a session that starts from zero still contributes to something cumulative.
The local AI distribution story is real. Gemma 4's Apache 2.0 license means anyone can ship it, modify it, or run it without API costs. That open distribution, combined with the ability to run on a phone without a server call, is genuinely new infrastructure for how AI reaches users. The conversation-history gap is also real — and it's a gap Google will eventually have to explain, one way or another.