NVIDIA Riva TTS Enhances Multilingual Speech and Voice Cloning
Rebeca Moen Jul 15, 2025 13:06
NVIDIA introduces Riva TTS models enhancing multilingual speech synthesis and voice cloning, with applications in AI agents, digital humans, and more, featuring advanced architecture and preference alignment.

NVIDIA has unveiled new Riva text-to-speech (TTS) models designed to improve multilingual speech synthesis and voice cloning. According to NVIDIA, the three models, Magpie TTS Multilingual, Magpie TTS Zeroshot, and Magpie TTS Flow, target applications such as AI voice agents and digital humans.
New TTS Models and Their Applications
The Riva TTS models leverage a streaming encoder-decoder transformer architecture, ensuring high-quality, natural-sounding speech synthesis across various languages and applications. The Magpie TTS Multilingual model supports English, Spanish, French, and German, making it ideal for multilingual interactive voice response (IVR) systems and digital human interactions. Meanwhile, Magpie TTS Zeroshot and Magpie TTS Flow focus on English, targeting live telephony, gaming non-player characters (NPCs), studio dubbing, and podcast narration.
Advanced Architecture and Preference Alignment
These models pair a non-autoregressive (NAR) encoder with an autoregressive (AR) decoder, and use NVIDIA's preference alignment framework together with classifier-free guidance (CFG) to improve accuracy and authenticity. The combination is intended to reduce synthesis errors, such as skipped or repeated words, and to make the generated audio follow the input text more closely.
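As a rough illustration of the classifier-free guidance idea mentioned above, the sketch below shows the standard CFG blend applied to one decoding step: the decoder is evaluated with and without the conditioning input, and the two sets of logits are extrapolated toward the conditional one. This is the generic technique, not NVIDIA's implementation; the function names, the toy logit values, and the guidance scale of 2.0 are assumptions made for the example.

```python
import math


def apply_cfg(cond_logits, uncond_logits, guidance_scale=2.0):
    """Generic classifier-free guidance blend for one decoding step.

    guidance_scale = 0 returns the unconditional logits, 1 returns the
    conditional logits, and values above 1 push further toward the
    conditional prediction. The default of 2.0 is an arbitrary choice.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a three-token vocabulary (hypothetical values).
cond = [2.0, 0.5, 0.0]    # decoder output with text conditioning
uncond = [1.0, 0.5, 0.0]  # decoder output with conditioning dropped
guided = apply_cfg(cond, uncond, guidance_scale=2.0)
probs = softmax(guided)
```

With a scale above 1, the token favored by the conditioned pass receives more probability mass than it would from the conditional logits alone, which is the mechanism by which CFG can tighten adherence to the input text.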
The Magpie TTS Flow model introduces an alignment-aware pretraining framework that uses discrete speech units, such as those derived from HuBERT, to learn text-speech alignment efficiently. This approach reduces dependency on large transcribed datasets, allowing for effective voice cloning with minimal data.
Collaboration for Safe Speech AI
NVIDIA is committed to the responsible development of synthetic speech technologies. As part of its Trustworthy AI initiative, NVIDIA collaborates with industry leaders such as Pindrop to address potential risks associated with voice cloning. These partnerships aim to establish standards for secure speech deployment, enhancing media integrity and preventing fraud in critical sectors.
Implications for Industry and Research
With the ability to synthesize voices from short audio samples, NVIDIA's Riva TTS models offer significant potential for industries such as healthcare and accessibility, where real-time, lifelike voice interaction is crucial. The models' flexibility and their low word error rates make them well suited to applications that require dynamic and adaptive audio output.
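For readers unfamiliar with the metric, word error rate (WER) for a TTS system is commonly obtained by transcribing the synthesized audio with a speech recognizer and comparing the transcript against the input text. The article does not say how NVIDIA measured it, so the following is only a minimal sketch of the standard word-level edit-distance definition; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length.

    Both arguments are plain strings; comparison is case-insensitive
    and splits on whitespace only (no punctuation normalization).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical input text and recognizer transcript of the synthesized audio.
input_text = "the cat sat on the mat"
transcript = "the cat sat on mat"
wer = word_error_rate(input_text, transcript)  # one dropped word out of six
```

A lower value means the synthesized speech was recognized as closer to the intended text; because insertions count as errors, WER can exceed 1.0 for very poor output.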
Overall, NVIDIA's Riva TTS models represent a significant step forward in the field of speech AI, providing powerful tools for developers and researchers aiming to create more interactive and engaging voice-based applications.