Open-source multilingual speech generation
Tokenizer-free TTS built for voice design and faithful cloning.
VoxCPM2 is a 2B-parameter speech model trained on 2M+ hours of multilingual audio. It synthesizes 30 languages and 9 Chinese dialects, creates new voices from natural-language prompts, and delivers controllable cloning with native 48kHz output.
- Voice Design
- Create a new voice from natural language alone.
- Controllable Cloning
- Preserve timbre while steering pace, emotion, and style.
- Ultimate Cloning
- Continue from prompt audio and transcript for higher fidelity.
- Streaming
- Real-time serving through Nano-vLLM or vLLM-Omni acceleration.
Design voices. Clone voices. Stream voices.
Highlights
The core capabilities that make VoxCPM stand out.
The official repository centers on six capabilities: multilingual synthesis, voice design, controllable cloning, high-fidelity continuation cloning, context-aware prosody, and fast streaming.
30-language synthesis
Generate speech from raw text across 30 languages without separate language tags.
Voice design
Describe a voice in natural language and synthesize from that prompt with no reference clip.
Controllable cloning
Clone a reference voice while layering style instructions for pace, emotion, and energy.
Ultimate cloning
Use prompt audio and transcript for continuation-style cloning that preserves micro-details.
Context-aware prosody
The model infers rhythm and expressiveness from text instead of relying on rigid templates.
Real-time serving
Pair with Nano-vLLM or vLLM-Omni for lower RTF and OpenAI-compatible serving workflows.
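RTF (real-time factor) is the standard serving metric here: wall-clock synthesis time divided by the duration of audio produced, with values below 1.0 meaning faster-than-real-time generation. A minimal helper for computing it (the timing numbers below are illustrative, not benchmarks):

```python
def real_time_factor(wall_seconds: float, num_samples: int, sample_rate: int = 48000) -> float:
    """RTF = synthesis wall-clock time / duration of the produced audio.
    RTF < 1.0 means the model generates audio faster than real time."""
    audio_seconds = num_samples / sample_rate
    return wall_seconds / audio_seconds

# Example: 0.5 s of compute producing 2 s of 48 kHz audio gives RTF 0.25.
rtf = real_time_factor(wall_seconds=0.5, num_samples=2 * 48000)
```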
Use Cases
Three ways teams usually approach VoxCPM.
VoxCPM2 is a strong fit when you need a direct path from text or a short voice reference to expressive, high-quality audio with open-source flexibility.
Build multilingual product voices
Start from text only, cover global markets, and keep a single synthesis stack for product narration, onboarding, or interactive assistants.
Prototype new speaker identities
Use voice design prompts to audition personas before collecting studio data, then refine toward the timbre and tone you actually want.
Clone with tighter expressive control
Combine a short reference clip with control instructions or prompt audio continuation when fidelity and style retention both matter.
Quick Start
Install once, generate speech immediately.
The official project targets Python 3.10+, PyTorch 2.5+, and CUDA 12.0+. For local testing, the shortest path is the `openbmb/VoxCPM2` checkpoint.
- `pip install voxcpm` for the base package.
- `openbmb/VoxCPM2` is the current recommended release.
- Use Read the Docs for full quick-start and deployment details.
- ModelScope mirrors are available for domestic network access.
```python
from voxcpm import VoxCPM
import soundfile as sf

# Download and load the recommended checkpoint.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,  # skip the optional denoiser for a lighter setup
)

wav = model.generate(
    text="VoxCPM2 is ready for realistic multilingual speech synthesis.",
    cfg_value=2.0,           # classifier-free guidance strength
    inference_timesteps=10,  # sampling steps: more steps, higher quality, slower
)

# Save the waveform at the model's native sample rate.
sf.write("demo.wav", wav, model.tts_model.sample_rate)
```
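If `soundfile` is not installed, the floating-point waveform can also be written with the standard library's `wave` module. This sketch uses one second of a synthetic 440 Hz sine as a stand-in for the array returned by `model.generate(...)`:

```python
import math
import struct
import wave

SAMPLE_RATE = 48000  # VoxCPM2's native output rate

# Stand-in for model output: 1 s of a 440 Hz sine in float range [-1.0, 1.0].
samples = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# Clip to [-1, 1], scale to little-endian 16-bit PCM, and write a mono WAV.
pcm = struct.pack(
    "<%dh" % len(samples),
    *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
)
with wave.open("demo_stdlib.wav", "wb") as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(pcm)
```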
Ecosystem
The shortest route to code, docs, models, demos, and community.
This landing page stays lightweight and points straight to the official resources you actually need to evaluate or ship VoxCPM.
Code
OpenBMB GitHub
Repository, releases, issues, and the canonical README for VoxCPM and VoxCPM2.
Open repository →
Docs
Read the Docs
Installation, quick start, API, CLI, and deployment guidance.
Read documentation →
Demo
Live Playground
Test text-to-speech, voice design, and cloning behavior before local setup.
Launch demo →
Weights
Hugging Face
Model checkpoints for the latest recommended release and older variants.
Browse models →
Mirror
ModelScope
Alternative model distribution for domestic access and mirrored downloads.
Open mirror →
Samples
Audio Demo Page
Listen to multilingual examples, cloning outputs, and qualitative comparisons.
Listen to samples →
FAQ
Common questions from people landing on VoxCPM for the first time.
What does “tokenizer-free” mean in VoxCPM?
VoxCPM directly models continuous speech representations instead of depending on a discrete audio tokenizer. That is the core architectural difference highlighted by the official project.
Can VoxCPM create a brand-new voice without any reference sample?
Yes. The voice design workflow accepts natural-language descriptions so you can generate a fresh voice without first providing a reference recording.
How do I get the best cloning similarity?
The official examples recommend combining reference audio with prompt audio and its transcript for the highest-fidelity continuation behavior.
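As a sketch only: the argument names below (`prompt_wav_path`, `prompt_text`) are assumptions for illustration, so verify them against the Read the Docs API reference before relying on them. The shape of a continuation-style cloning call would look roughly like this:

```python
# Hypothetical argument names for a continuation-style cloning request;
# check the official VoxCPM docs for the exact generate() signature.
clone_kwargs = dict(
    text="Sentence to synthesize in the cloned voice.",
    prompt_wav_path="reference.wav",                    # short reference clip (assumed name)
    prompt_text="Exact transcript of reference.wav.",   # its transcript (assumed name)
    cfg_value=2.0,
    inference_timesteps=10,
)
# wav = model.generate(**clone_kwargs)  # with a model loaded as in Quick Start
```

Supplying the transcript alongside the prompt audio is what lets the model treat generation as a continuation of the reference, which is why the official examples recommend it for the highest similarity.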
Where should I start if I only want to evaluate quality?
Open the Live Playground and the audio sample page first. Then move to the GitHub repo and Read the Docs once you want to install locally or deploy inference.