Google released a text-to-speech model today that you control with plain English, not dropdowns and sliders. If you run a company that depends on expressive AI voice, that is a problem.
Gemini 3.1 Flash TTS, announced April 15, introduces a feature the company calls audio tags: natural language instructions embedded directly in the text you send the model. You want a radio host from Newcastle who sounds energized? You write "requires a charismatic Newcastle accent" and the model produces it. No voice preset to select, no speed knob to adjust, no SSML tags to learn. Just a prompt.
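That "just a prompt" claim can be made concrete. The sketch below shows what a prompt-controlled request might look like, assuming Google's existing Gemini API conventions; the model id "gemini-3.1-flash-tts", the field names, and the helper function are illustrative assumptions, not confirmed details from the announcement. Note that the delivery instruction lives inside the text itself.

```python
# Minimal sketch of a prompt-controlled TTS request. The model id and
# payload fields are assumptions modeled on Gemini API conventions,
# not confirmed details of the release.

def build_tts_request(style_instruction: str, script: str) -> dict:
    """Embed a plain-English delivery instruction directly in the text,
    instead of selecting a voice preset or writing SSML markup."""
    return {
        "model": "gemini-3.1-flash-tts",               # assumed model id
        "contents": f"{style_instruction}: {script}",  # instruction + script
        "config": {"response_modalities": ["AUDIO"]},  # request audio output
    }

request = build_tts_request(
    "Requires a charismatic Newcastle accent, energized radio-host delivery",
    "Good morning, you're listening to the breakfast show!",
)
```

Changing the delivery means editing one English phrase, not re-mapping a preset: swap "Newcastle" for "Brixton" in the first argument and the rest of the integration is untouched.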
Simon Willison, the independent developer and blogger, spent ten minutes with the API and reproduced the effect on camera: swapping one line in a scene-description prompt changed the accent from Brixton to Newcastle. His demo is worth watching because it is the rare case where a capability claim is also a verifiable test anyone can run.
The difference this represents is not quality. It is interface paradigm. Existing TTS APIs built their developer experience around discrete controls: select a voice, pick a style preset, adjust speed, optionally annotate with SSML markup. Prompt-based TTS works the way you prompt a language model. You describe what you want in prose. The model figures out how to deliver it. Those two approaches are not the same product. They require different engineering mental models, different integration patterns, and different expectations about what a "voice" even is.
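To make the paradigm gap concrete, here is a hedged side-by-side of the same line of dialogue under each interface. The SSML fragment uses standard SSML 1.1 elements; the preset name and request shapes are hypothetical and not copied from any vendor's documentation.

```python
# Preset/markup paradigm: delivery is encoded in discrete controls plus
# SSML attributes the developer must learn. (Standard SSML 1.1 elements;
# the voice name and request shape are hypothetical.)
preset_request = {
    "voice": "en-GB-News-K",   # hypothetical voice preset
    "speed": 1.15,             # explicit speed knob
    "ssml": (
        "<speak>"
        '<prosody rate="fast" pitch="+2st">'
        "Good morning, you're listening to the breakfast show!"
        "</prosody>"
        "</speak>"
    ),
}

# Prompt paradigm: delivery is described in prose inside the text itself,
# the same way you would prompt a language model.
prompt_request = (
    "Requires a charismatic Newcastle accent, fast and energized: "
    "Good morning, you're listening to the breakfast show!"
)
```

The first shape forces the developer to enumerate every expressive dimension the vendor exposes; the second delegates that mapping to the model, which is precisely why the two demand different integration patterns.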
Google says the model scores 1,211 on the Artificial Analysis TTS leaderboard, a benchmark the company describes as built on thousands of blind human preference comparisons. Google also places the model in its "most attractive quadrant" for quality versus price. All output carries SynthID watermarking, which embeds an imperceptible signal designed to flag AI-generated audio, and the model supports 70 or more languages.
Those are Google's numbers. The leaderboard page did not render scores in a way I could independently verify. And "preview" access means the API is not yet production-grade: it is available through Google AI Studio and the Gemini API, with Vertex AI access also in preview. The practical difference between "we shipped it" and "you can build with it in production" is not a detail.
The comparison that matters is to ElevenLabs and OpenAI's TTS APIs. Both have built commercial businesses on expressive, controllable AI voice using preset-based and SSML-style control surfaces. If natural language prompting becomes the expected developer interface for TTS, the competitive pressure on those companies is not about voice quality. It is about whether their interface feels like the old way of building software.
The analogy that developers already know is the one that actually fits: this is what happened when LLMs replaced template-based text generation. Describing the output you want in prose made hand-built templates look rigid, and natural language is still winning that argument. TTS is arriving at the same inflection point. Whether ElevenLabs and OpenAI respond by adopting the prompt paradigm or by differentiating on voice quality is the next thing to watch. What is already clear is that any developer building a product today on a traditional TTS API is building on an interface that will feel old-fashioned within months.
The announcement is real but the story is not finished. Gemini 3.1 Flash TTS is in preview, not GA. Google's claims about benchmark position and cost are self-reported. Simon Willison's demo works, but Willison is a developer who knows how to write good prompts. Whether the results generalize to average users building average products is the remaining question. That question will be answered by the developers who try it in the next thirty days.