Improved Gemini audio models for powerful voice experiences

Conversations with AI are about to become more natural and more capable. Following recent upgrades to its text-to-speech models, Google has released a significant update to Gemini 2.5 Flash Native Audio, its model for live voice agents. The goal is not just to generate speech but to sustain intelligent, fluid, context-aware conversations.

The new model is available across Google’s ecosystem, including Google AI Studio and Vertex AI, and is starting to roll out in Gemini Live and, for the first time, Search Live. You can brainstorm with Gemini in real time, get instant help through Search Live, or build enterprise-grade customer service agents.

The update goes beyond voice agents. Native audio also enables a new feature: live speech-to-speech translation for headphones. It preserves the speaker’s original intonation, pacing, and pitch, so translations sound natural. A beta experience launches today in the Google Translate app.

Enhanced Live Voice Agents

To power a wide range of applications, Gemini 2.5 Flash Native Audio has been refined in three critical areas:

Sharper Function Calling: The model is now more reliable at triggering external functions. It identifies when to fetch real-time data mid-conversation and weaves that information back into the audio response without disrupting the flow. It leads the ComplexFuncBench Audio evaluation with a score of 71.5%.
Robust Instruction Following: Handling complex instructions is now more precise, leading to higher user satisfaction. The model adheres to developer instructions 90% of the time, up from 84%, ensuring more reliable and complete outputs.
Smoother Conversations: Significant improvements have been made in multi-turn conversation quality. The model effectively retrieves context from previous exchanges, creating more cohesive and natural dialogues.
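
The function-calling behavior described above can be pictured from the application's side. The following is a minimal sketch under stated assumptions, not the Live API itself: it assumes the model emits a tool call as a plain dictionary with `name` and `args` fields, and the `get_order_status` tool, the `TOOLS` registry, and the `handle_tool_call` helper are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical tool: a real agent would query a live order system here.
def get_order_status(order_id: str) -> Dict[str, Any]:
    fake_orders = {"A100": "shipped", "B200": "processing"}
    return {"order_id": order_id, "status": fake_orders.get(order_id, "unknown")}

# Registry mapping tool names (as the model would refer to them) to callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_order_status": get_order_status,
}

def handle_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Run the requested tool and wrap the result for return to the model."""
    name = call.get("name")
    func = TOOLS.get(name)
    if func is None:
        return {"name": name, "error": f"unknown tool: {name}"}
    try:
        result = func(**call.get("args", {}))
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected arguments.
        return {"name": name, "error": str(exc)}
    return {"name": name, "response": result}

if __name__ == "__main__":
    print(handle_tool_call({"name": "get_order_status", "args": {"order_id": "A100"}}))
```

Returning errors as data rather than raising lets the conversation continue: the model can apologize or ask a clarifying question instead of the session failing mid-sentence.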

Real-World Impact: What Customers Are Saying

Companies already using the model describe its impact on their business:

Shopify: “Users often forget they’re talking to AI within a minute of using Sidekick… New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.” – David Wurtz, VP of Product.
United Wholesale Mortgage (UWM): “By integrating the Gemini 2.5 Flash Native Audio model… we’ve significantly enhanced Mia’s capabilities… This powerful combination has enabled us to generate over 14,000 loans for our broker partners.” – Jason Bressler, Chief Technology Officer.
Newo.ai: “Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI Receptionists to achieve unmatched conversational intelligence… They can identify the main speaker even in noisy settings, switch languages mid-conversation, and sound remarkably natural.” – David Yang, Co-founder.

Breakthrough: Live Speech Translation

Gemini now natively supports live speech translation designed for two primary modes:

Continuous Listening: Automatically translates speech from multiple languages into a single target language, allowing you to hear the world around you in your preferred language through headphones.
Two-Way Conversation: Handles real-time translation between two languages, automatically switching the output based on who is speaking.
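
The routing behind these two modes can be sketched as a pair of small rules. This illustrates the logic as described above and is not how Gemini implements it: `continuous_target` and `two_way_target` are hypothetical helpers, and the sketch assumes language detection has already produced a language code such as "en" or "es" for each utterance.

```python
from typing import Optional, Tuple

def continuous_target(detected: str, target: str) -> Optional[str]:
    """Continuous listening: translate everything into one target language.

    Returns None when the speech is already in the target language,
    so no translation is needed.
    """
    return None if detected == target else target

def two_way_target(detected: str, pair: Tuple[str, str]) -> str:
    """Two-way conversation: translate into the other language of the pair."""
    first, second = pair
    if detected == first:
        return second
    if detected == second:
        return first
    raise ValueError(f"{detected!r} is not part of the conversation pair {pair}")
```

For example, with the pair `("en", "es")`, an utterance detected as `"es"` is routed to `"en"` and vice versa, which is the automatic output switching the two-way mode describes.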

Key capabilities make this practical for real-world use:
* Language Coverage: Translates speech in over 70 languages and 2,000 language pairs.
* Style Transfer: Preserves the speaker’s vocal nuances for natural-sounding translations.
* Multilingual Input: Understands multiple languages in a single session.
* Auto-Detection: Identifies the spoken language to begin translation instantly.
* Noise Robustness: Filters ambient noise for clear conversations even in loud environments.
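
As a rough sanity check on the language coverage figures, the number of possible pairs among a set of languages can be computed directly. This is a back-of-the-envelope sketch: the article does not say how pairs are counted or whether every combination is supported, so the figure of exactly 70 languages and both counting conventions below are assumptions.

```python
import math

languages = 70  # assumed: the article says "over 70"

# Unordered pairs: English-Spanish and Spanish-English count once.
unordered_pairs = math.comb(languages, 2)

# Ordered pairs: each translation direction counts separately.
ordered_pairs = math.perm(languages, 2)

print(unordered_pairs)  # 2415
print(ordered_pairs)    # 4830
```

Under either convention, 70 languages yield more than 2,000 pairs, so the two figures quoted above are mutually consistent.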

Starting today, you can try the beta in the Google Translate app for real-time headphone translation on Android devices in the US, Mexico, and India, with iOS and more regions coming soon.

Get Started Today

Begin building advanced voice agents with Gemini 2.5 Flash Native Audio, now generally available on Vertex AI and in preview in the Gemini API. You can experiment with it directly in Google AI Studio. The Gemini 2.5 Flash and 2.5 Pro text-to-speech models are also available via the API.