VoxCPM2 is an open-source text-to-speech system that does away with discrete speech tokenization: instead of predicting audio tokens, it generates continuous speech representations with a diffusion-based autoregressive architecture. Built on the MiniCPM model family, it produces natural, expressive, context-aware speech, inferring tone, emotion, and pacing directly from the input text. It is trained on large-scale multilingual data and supports dozens of languages and dialects while maintaining high audio fidelity. VoxCPM2 can clone a voice from a short reference sample, capturing not only the speaker's timbre but also rhythm, accent, and emotional delivery. It also offers voice design: users can create entirely new voices from natural language descriptions, with no reference audio required.
Features
- Tokenizer-free speech generation using diffusion autoregressive modeling
- Multilingual support across dozens of languages, with no explicit language tags required
- High-quality voice cloning from short reference audio samples
- Voice design from natural language descriptions without audio input
- Real-time streaming synthesis with low latency
- Studio-quality audio output with built-in super-resolution
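
To make the "tokenizer-free" idea concrete, here is a minimal toy sketch of generating continuous latent frames autoregressively, where each frame is refined from noise over several steps instead of being picked from a discrete vocabulary. It is not VoxCPM2's implementation or API: the names `condition`, `denoise`, `generate`, and `LATENT_DIM` are hypothetical, the conditioning function is a seeded random stand-in for a language-model backbone, and the refinement loop only imitates the shape of a diffusion sampler.

```python
import random

LATENT_DIM = 4  # hypothetical size of one continuous speech latent


def condition(text, history):
    """Toy stand-in for the language-model backbone (assumption): maps the
    text and the latents generated so far to a target for the next frame."""
    seed = sum(ord(c) for c in text) + len(history)
    rng = random.Random(seed)
    target = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]
    if history:
        # Blend with the previous latent so successive frames vary smoothly.
        target = [0.5 * t + 0.5 * p for t, p in zip(target, history[-1])]
    return target


def denoise(target, steps, rng):
    """Toy stand-in for a diffusion head (assumption): start from Gaussian
    noise and move halfway toward the conditioned target at every step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def generate(text, num_frames, steps=10, seed=0):
    """Autoregressive loop: each new frame is conditioned on all earlier
    frames and produced as a real-valued vector, never as a token id."""
    rng = random.Random(seed)
    latents = []
    for _ in range(num_frames):
        target = condition(text, latents)
        latents.append(denoise(target, steps, rng))
    return latents


if __name__ == "__main__":
    for frame in generate("hello world", num_frames=3):
        print([round(v, 3) for v in frame])
```

In a real system the conditioning would come from a trained network and the resulting latents would be decoded into a waveform; the sketch only shows the control flow, in which every output frame is a vector of floats rather than an index into a codebook.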