OpenAI is introducing new speech-to-text and text-to-speech models that make it easier to build capable voice agents. These models are more powerful, customizable, and reliable than our previous audio models. The new speech-to-text models understand speech more accurately, even under challenging conditions such as strong accents, noisy environments, and varying speech speeds, which makes them well suited to tasks like transcribing customer calls or meeting notes.
We’ve also improved text-to-speech: developers can now instruct the model how to speak, for example asking it to sound like a sympathetic customer service agent. This opens up new possibilities, from more empathetic customer service to expressive storytelling.
These improvements build on years of work since we launched our first audio model in 2022. The new models raise the bar on accuracy, making them more useful in real-world applications.
New Speech-to-Text Models
We’re introducing two new models, gpt-4o-transcribe and gpt-4o-mini-transcribe, which outperform the original Whisper models in word-level accuracy and language understanding. They handle a wide range of languages and accents and stay robust with noisy or fast-paced audio. Both are available now in the speech-to-text API.
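As a minimal sketch, transcribing a file with the OpenAI Python SDK might look like the following; the model name comes from this announcement, while the client setup and the meeting.mp3 file name are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Transcribe a local recording with the new speech-to-text model.
# "meeting.mp3" is a placeholder; any supported audio format works.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcription.text)
```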
New Text-to-Speech Model
We’re also launching gpt-4o-mini-tts, a new text-to-speech model with steerable delivery. Developers can now tell the model not only what to say but how to say it, which makes it well suited to building personalized voice experiences. This model is available in the text-to-speech API.
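A minimal sketch with the Python SDK, assuming the instructions parameter carries the speaking directions; the voice name, instruction text, and output path are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Generate speech and stream it straight to a file. The instructions
# parameter steers delivery; "coral" is one of the built-in voices.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling. I completely understand, and I'm here to help.",
    instructions="Speak like a calm, empathetic customer service agent.",
) as response:
    response.stream_to_file("support_greeting.mp3")
```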
How Do These Models Work?
These new models draw on techniques such as reinforcement learning and extensive training on specialized, real-world audio datasets. That training makes them more accurate, especially when transcribing difficult speech.
API Availability
The new models are available to all developers through the API today, so adding speech-to-text and text-to-speech capabilities to an application is straightforward. There is also new tooling to help developers build voice agents, including support for low-latency speech-to-speech interactions.
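For voice-agent use, the new transcription models can also return text incrementally rather than all at once. The sketch below assumes the SDK accepts stream=True on the transcriptions endpoint for these models; the file name and the exact event type strings are assumptions and may differ:

```python
from openai import OpenAI

client = OpenAI()

# Stream transcription text as it is produced, which keeps a voice
# agent responsive instead of waiting for the full transcript.
with open("caller_audio.wav", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,  # assumption: incremental events for the new models
    )
    for event in stream:
        # Delta events carry incremental text; a final event carries
        # the completed transcript.
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
```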
In short, these updates make it easier for developers to create more accurate, customizable, and effective voice applications.