OpenAudio S1 Brings Natural Text to Speech Breakthrough

Vijay Kumari
Jun 04

OpenAudio has officially unveiled S1, its new text-to-speech (TTS) model, now available to try on Fish Audio. Trained on over 2 million hours of audio, OpenAudio S1 is positioned as a new standard in voice generation, with naturalness, expressiveness, and instruction-following that the company says closely mimic professional voice actors.

Setting a New Benchmark in Voice Synthesis

OpenAudio S1 pairs a 4-billion-parameter model with what the company describes as the largest dataset ever used for TTS, along with advanced training techniques such as in-house reward modeling and RLHF (Reinforcement Learning from Human Feedback) using the GRPO (Group Relative Policy Optimization) method. This approach overcomes the instability of modeling semantic and acoustic information together, a challenge that led to artifacts and incorrect words in previous models. As a result, S1 delivers superior audio quality, emotional depth, and speaker similarity.
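
As a rough illustration of the GRPO step mentioned above: the method samples several candidate outputs for the same prompt, scores each one, and rates every candidate relative to its own group rather than against a separately learned value function. The Python sketch below shows only that normalization. The reward scores are invented, the function name is hypothetical, and nothing here reflects OpenAudio's actual reward model or training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four renditions of the same sentence.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Candidates that score above their group's average receive a positive advantage and are reinforced; those below average are discouraged.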

Industry-Leading Performance

Word Error Rate (WER): 0.008 (0.8%) for S1, 0.011 (1.1%) for S1-mini
Character Error Rate (CER): 0.004 (0.4%) for S1, 0.005 (0.5%) for S1-mini
Human Subjective Evaluation: Ranked #1 on Hugging Face TTS-Arena-V2 as of June 3, 2025
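
For readers unfamiliar with these metrics, WER and CER are edit-distance ratios. For a TTS system, the generated audio is typically transcribed by a speech recognizer and the transcript is compared with the input text. The Python sketch below shows the standard calculation only; the article does not describe OpenAudio's evaluation setup, so the function names and the whitespace tokenization are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

On this scale, a WER of 0.008 corresponds to roughly eight word errors per thousand reference words.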

Unparalleled Emotional and Vocal Control

What truly sets OpenAudio S1 apart is its sophisticated understanding and reproduction of human emotion and vocal nuance. The model supports a rich array of markers for controlling synthesized speech, allowing users to infuse dialogue with specific emotions, tones, and even special vocalizations.

Key Features

Emotional Markers: (angry), (sad), (excited), (surprised), (sarcastic), (joyful), (empathetic), and more
Tone Markers: (in a hurry tone), (shouting), (screaming), (whispering), (soft tone)
Special Markers: (laughing), (chuckling), (sobbing), (sighing), (panting), (crowd laughing)
Onomatopoeia: Realistic vocalizations such as “Ha,ha,ha” for laughter and “Hmm,hmm” for chuckling
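
To show how such markers might be combined with a script, here is a small Python sketch that prefixes a line of dialogue with one or more markers. The marker names come from the list above. The helper name, the convention of placing markers before the sentence, and the allow-list check are assumptions for illustration, not a description of the Fish Audio API. Because the article says more markers exist, the allow-list here is deliberately incomplete.

```python
# Markers named in the article; the real set is larger ("and more").
EMOTION_MARKERS = {"angry", "sad", "excited", "surprised",
                   "sarcastic", "joyful", "empathetic"}
TONE_MARKERS = {"in a hurry tone", "shouting", "screaming",
                "whispering", "soft tone"}
SPECIAL_MARKERS = {"laughing", "chuckling", "sobbing", "sighing",
                   "panting", "crowd laughing"}
KNOWN_MARKERS = EMOTION_MARKERS | TONE_MARKERS | SPECIAL_MARKERS


def with_markers(text, *markers):
    """Return text prefixed with parenthesized markers, e.g. '(sad) Goodbye.'"""
    for marker in markers:
        if marker not in KNOWN_MARKERS:
            raise ValueError(f"marker not in this sketch's allow-list: {marker!r}")
    prefix = "".join(f"({marker})" for marker in markers)
    return f"{prefix} {text}" if prefix else text


line = with_markers("I can't believe we actually won!", "excited")
# line == "(excited) I can't believe we actually won!"
```

Checking markers against a list before sending text for synthesis catches typos early, since a misspelled marker might otherwise be read aloud as ordinary text.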

A soon-to-be-released proprietary speech-to-text model further enhances S1’s capabilities, enabling detailed captioning of audio with emotion, tone, and speaker information. Over 100,000 hours of audio were captioned using this technology to train S1 for instruction-following.

Making Advanced Voice Tech Accessible

OpenAudio bills S1 as the most affordable state-of-the-art TTS model on the market, priced at just $15 per million bytes of text (about $0.80 per hour of audio). This pricing makes high-quality voice generation accessible even for high-volume or budget-conscious developers. Continuous improvements in training and inference pipelines are expected to drive costs even lower, helping make AI-driven voice technology a foundation for next-generation human-computer interaction.
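
The pricing arithmetic can be sketched in a few lines of Python. The $15 rate and the roughly $0.80 hourly figure come from the article. Measuring text size as UTF-8 bytes is an assumption of this sketch, as are the function names.

```python
PRICE_PER_MILLION_BYTES_USD = 15.0  # rate quoted in the article


def estimate_cost_usd(text):
    """Estimate the cost of synthesizing text, assuming UTF-8 byte billing."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes / 1_000_000 * PRICE_PER_MILLION_BYTES_USD


def implied_bytes_per_hour(hourly_usd=0.80):
    """Bytes of text per hour of audio implied by the quoted hourly price."""
    return hourly_usd / PRICE_PER_MILLION_BYTES_USD * 1_000_000


# A 500-character ASCII paragraph is 500 bytes.
paragraph_cost = estimate_cost_usd("a" * 500)

# Non-ASCII scripts use more bytes per character in UTF-8.
chinese_bytes = len("你好".encode("utf-8"))
```

Under these assumptions, the quoted hourly price implies about 53,000 bytes of text per hour of speech. Languages written in non-Latin scripts would cost more per character, because each character occupies several bytes in UTF-8.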

Truly Global Voice Support

S1 natively supports a wide range of languages, enabling creators and developers to reach global audiences with consistent, high-quality output. Supported languages include: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.

Model Variants for Every Need

S1 (4B): The full-scale flagship model for the richest, most nuanced performance
S1-mini (0.5B): A distilled, efficient version for resource-constrained applications without significant quality loss

Built on the Qwen3 architecture, S1 is a natively multimodal model capable of TTS, speech-to-text (STT), TextQA, and AudioQA, though only TTS is currently available. The system uses an audio codec modeled on the Descript Audio Codec, enhanced with transformers for superior text modeling, and benefits from online RLHF using the GRPO approach.
