![Open Audio's S1]()
OpenAudio has officially unveiled S1, its groundbreaking text-to-speech (TTS) model, now available for trial at Fish Audio. Trained on over 2 million hours of audio, OpenAudio S1 sets a new standard in voice generation, offering unmatched naturalness, expressiveness, and instruction-following capabilities that closely mimic professional voice actors.
Setting a New Benchmark in Voice Synthesis
OpenAudio S1 is a 4-billion-parameter model trained on the largest dataset ever used for TTS, using advanced training techniques such as in-house reward modeling and RLHF (Reinforcement Learning from Human Feedback) with the GRPO (Group Relative Policy Optimization) method. This approach overcomes the instability of modeling semantic and acoustic information together, a challenge that led to artifacts and incorrect words in previous models. As a result, S1 delivers superior audio quality, emotional depth, and speaker similarity.
![TTS]()
Industry-Leading Performance
- Word Error Rate (WER): 0.008 (S1), 0.011 (S1-mini)
- Character Error Rate (CER): 0.004 (S1), 0.005 (S1-mini)
- Human Subjective Evaluation: Ranked #1 on HuggingFace TTS-Arena-V2 as of June 3, 2025
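For readers unfamiliar with the two error metrics above, both are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the transcribed output into the reference, divided by the reference length (in words for WER, in characters for CER). The sketch below assumes this standard definition; the article does not say which evaluation set or text normalization was used for the reported numbers, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Read this way, a WER of 0.008 corresponds to roughly 8 word errors per 1,000 reference words.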
Unparalleled Emotional and Vocal Control
What truly sets OpenAudio S1 apart is its sophisticated understanding and reproduction of human emotion and vocal nuance. The model supports a rich array of markers for controlling synthesized speech, allowing users to infuse dialogue with specific emotions, tones, and even special vocalizations.
Key Features
- Emotional Markers: (angry), (sad), (excited), (surprised), (sarcastic), (joyful), (empathetic), and more
- Tone Markers: (in a hurry tone), (shouting), (screaming), (whispering), (soft tone)
- Special Markers: (laughing), (chuckling), (sobbing), (sighing), (panting), (crowd laughing)
- Onomatopoeia: Realistic vocalizations such as “Ha,ha,ha” for laughter and “Hmm,hmm” for chuckling
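As a rough illustration of how these markers might be used, the sketch below prepends a parenthesized marker to a line of dialogue before it is sent for synthesis. It rests on several assumptions: the article lists the markers but does not specify where they must be placed in the text, so placing them at the start of the line is a guess; `with_marker` is a hypothetical helper, not part of any OpenAudio or Fish Audio API; and the emotion list is incomplete, since the article says "and more".

```python
# Markers copied from the feature list above. The emotion set is partial
# (the article says "and more"), so this validation is illustrative only.
EMOTION_MARKERS = {
    "angry", "sad", "excited", "surprised", "sarcastic", "joyful", "empathetic",
}
TONE_MARKERS = {
    "in a hurry tone", "shouting", "screaming", "whispering", "soft tone",
}
SPECIAL_MARKERS = {
    "laughing", "chuckling", "sobbing", "sighing", "panting", "crowd laughing",
}
KNOWN_MARKERS = EMOTION_MARKERS | TONE_MARKERS | SPECIAL_MARKERS


def with_marker(text: str, marker: str) -> str:
    """Return `text` prefixed with a parenthesized marker.

    Hypothetical helper: prefix placement is an assumption, not documented
    behavior of the model.
    """
    if marker not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker: {marker!r}")
    return f"({marker}) {text}"


script = [
    with_marker("I can't believe you did that!", "surprised"),
    with_marker("Keep your voice down.", "whispering"),
]
```

Keeping the marker vocabulary in one place makes it easier to catch typos before a request is sent, at the cost of needing an update whenever new markers are supported.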
A soon-to-be-released proprietary speech-to-text model further enhances S1’s capabilities, enabling detailed captioning of audio with emotion, tone, and speaker information. Over 100,000 hours of audio were captioned using this technology to train S1 for instruction-following.
Making Advanced Voice Tech Accessible
OpenAudio S1 is the most affordable state-of-the-art TTS model on the market, priced at just $15 per million bytes of input text (about $0.80 per hour of generated audio). This pricing makes high-quality voice generation accessible even for high-volume or budget-conscious developers. Continuous improvements in training and inference pipelines are expected to drive costs even lower, helping make AI-driven voice technology a foundation for next-generation human-computer interaction.
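The pricing arithmetic can be sketched as follows. This assumes billing counts the UTF-8 encoded bytes of the input text, which the article does not state explicitly; the function names are illustrative. It also back-calculates how many bytes of text the quoted per-hour figure implies.

```python
PRICE_PER_MILLION_BYTES_USD = 15.0  # from the article


def estimate_cost_usd(text: str) -> float:
    """Estimate synthesis cost, assuming UTF-8 bytes of input text are billed."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes * PRICE_PER_MILLION_BYTES_USD / 1_000_000


def implied_bytes_per_hour(price_per_hour_usd: float = 0.8) -> float:
    """Bytes of text per hour of audio implied by the quoted hourly price."""
    return price_per_hour_usd / PRICE_PER_MILLION_BYTES_USD * 1_000_000


print(f"{estimate_cost_usd('a' * 1_000_000):.2f} USD for 1,000,000 ASCII characters")
print(f"{implied_bytes_per_hour():.0f} bytes of text per hour of audio")
```

Under the UTF-8 assumption, cost per character varies by language: an ASCII letter is one byte, while a typical Chinese character is three, so the same number of characters can cost up to three times as much.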
![Price]()
Truly Global Voice Support
S1 natively supports a wide range of languages, enabling creators and developers to reach global audiences with consistent, high-quality output. Supported languages include: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.
![Fishaudio]()
Model Variants for Every Need
- S1 (4B): The full-scale flagship model for the richest, most nuanced performance
- S1-mini (0.5B): A distilled, efficient version for resource-constrained applications without significant quality loss
Built on the Qwen3 architecture, S1 is fundamentally a native multimodal model capable of TTS, speech-to-text (STT), TextQA, and AudioQA, though only TTS is currently available. The system uses a Descript Audio Codec-like architecture, enhanced with transformers for superior text modeling, and benefits from online RLHF using the GRPO approach.
![OpenaudioS1]()