OpenAI Unveils Three New Audio Models for Real-Time Voice, Makes Realtime API Generally Available

OpenAI has launched three new audio models through its Realtime API, each targeting a distinct capability in live voice applications: GPT-Realtime-2 for voice agents with advanced reasoning, GPT-Realtime-Translate for live speech translation, and GPT-Realtime-Whisper for streaming transcription. The Realtime API has also officially exited beta and is now generally available, a signal to developers weighing it for production systems.

Image source: www.marktechpost.com

The models are available immediately through the OpenAI API and can be tested in the Playground. Together, they push voice applications beyond simple question-and-answer loops toward systems that can listen, reason, translate, transcribe, and act within a single conversation.

GPT-Realtime-2: Voice Reasoning with a 128K Context Window

The flagship model, GPT-Realtime-2, is described by the OpenAI team as its first voice model with GPT-5-class reasoning. It can process complex requests, manage interruptions, and maintain natural conversation flow. The context window has been expanded from 32K to 128K tokens, enabling longer interactions and more complex tasks without losing context.

Previous voice models often stalled on multi-step requests or dropped earlier context during extended sessions. GPT-Realtime-2 is designed to keep the conversation moving while reasoning through requests. Developers can enable short preamble phrases like “let me check that” or “one moment while I look into it,” so users know the agent is working.
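One plausible way to request such preambles is through the session instructions sent over the Realtime API. The sketch below builds a `session.update` event whose shape follows the API's general event pattern; treat the exact fields, and whether preambles can be steered this way, as assumptions rather than documented behavior:

```python
import json

# Example preamble phrases quoted in the announcement.
PREAMBLES = ["let me check that", "one moment while I look into it"]

def build_session_update(base_instructions: str) -> dict:
    """Build a session.update payload asking the agent to speak a short
    preamble before long-running tool calls. Field names follow the Realtime
    API's session.update pattern; exact support is an assumption."""
    preamble_hint = (
        "Before any tool call that may take more than a moment, say a short "
        "preamble such as: " + "; ".join(f'"{p}"' for p in PREAMBLES) + "."
    )
    return {
        "type": "session.update",
        "session": {"instructions": base_instructions + " " + preamble_hint},
    }

event = build_session_update("You are a concise support agent.")
print(json.dumps(event, indent=2))
```

In a live session, this event would be sent as JSON over the API's WebSocket connection after the session opens.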

“These features directly address one of the most common failure modes in deployed voice agents: awkward silence that makes the system feel broken,” said an OpenAI spokesperson. The model can also call multiple tools simultaneously and narrate its actions, providing a running commentary during multi-step tasks.

Adjustable Reasoning Effort and Tone Control

A key control for production builders is adjustable reasoning effort. Developers can dial reasoning across five levels: minimal, low, medium, high, and xhigh. The default is “low” to minimize latency for simple requests, while tougher tasks can tap into more compute. This allows teams to tune the performance-latency tradeoff at the session level depending on the use case — a quick customer lookup does not need the same depth as a multi-step travel booking workflow.
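The five effort names come straight from the announcement; how they are carried in a session payload does not, so the `reasoning.effort` field and the `gpt-realtime-2` model identifier below are assumptions for illustration:

```python
# Effort levels as described for GPT-Realtime-2 (names from the announcement;
# the session field that carries them is an assumption).
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_with_effort(effort: str = "low") -> dict:
    """Return a session.update payload pinning reasoning effort for the
    session. 'low' is the stated default, to minimize latency."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {
        "type": "session.update",
        "session": {"model": "gpt-realtime-2", "reasoning": {"effort": effort}},
    }

# A quick customer lookup stays at the default; a multi-step booking
# workflow can request more compute.
quick = session_with_effort()
deep = session_with_effort("xhigh")
```

Because the setting lives at the session level, a team can route simple and complex calls to sessions with different effort values rather than paying peak latency everywhere.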

GPT-Realtime-2 also adds tone control, adapting its speaking style based on the situation — remaining calm during problem-solving, shifting to empathetic when users are frustrated, and turning upbeat after a successful outcome. The model is also better at understanding industry-specific terminology, including healthcare vocabulary and proper nouns.


On benchmarks, gains are measurable. GPT-Realtime-2 with high reasoning scored 96.6% on Big Bench Audio, compared to 81.4% for GPT-Realtime-1.

GPT-Realtime-Translate and GPT-Realtime-Whisper

GPT-Realtime-Translate focuses on real-time speech translation, enabling cross-language conversations with minimal delay. GPT-Realtime-Whisper provides streaming transcription, converting spoken language to text in real time. Both models are available through the Realtime API.
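Since all three models share one API, an application can route each task to the right model with a small lookup. The display names below are taken from the article; the lowercase API identifiers are assumptions, as the announcement does not give concrete model strings:

```python
# Map each task to the model named in the announcement. Lowercase API
# identifiers are assumptions; the article gives only display names.
TASK_TO_MODEL = {
    "agent": "gpt-realtime-2",              # voice agents with reasoning
    "translate": "gpt-realtime-translate",  # live speech translation
    "transcribe": "gpt-realtime-whisper",   # streaming transcription
}

def pick_model(task: str) -> str:
    """Return the Realtime model for a task, or raise for unknown tasks."""
    try:
        return TASK_TO_MODEL[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_TO_MODEL)}"
        )
```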

Background

OpenAI’s Realtime API was originally launched in beta earlier this year, allowing developers to integrate voice capabilities into applications. However, limitations in context length and reasoning ability hampered adoption for complex use cases. The new models address these gaps, building on feedback from early adopters who needed more reliable performance in production environments.

“We’ve listened to developers and invested heavily in improvements that make voice agents more dependable and versatile,” the spokesperson added.

What This Means

With these releases, OpenAI is positioning the Realtime API as a serious platform for building voice-enabled applications across industries — from customer support and healthcare to education and entertainment. The general availability status gives enterprises confidence to invest in the API for mission-critical systems.

Developers can now create voice agents that handle complex reasoning, adapt tone to user emotions, and maintain context over long conversations. The adjustable reasoning effort and tone control offer fine-grained customization, making it easier to deploy voice AI in diverse scenarios. The combination of translation and transcription models further expands possibilities for multilingual and accessibility-focused applications.
