
Open-source multilingual speech generation

Tokenizer-free TTS built for voice design and faithful cloning.

VoxCPM2 is a 2B-parameter speech model trained on 2M+ hours of multilingual audio. It synthesizes 30 languages and 9 Chinese dialects, creates new voices from natural-language prompts, and delivers controllable cloning with native 48kHz output.

2B parameters · 2M+ training hours · 30 languages · 9 Chinese dialects · Apache-2.0
Voice Design
Create a new voice from natural language alone.
Controllable Cloning
Preserve timbre while steering pace, emotion, and style.
Ultimate Cloning
Continue from prompt audio and transcript for higher fidelity.
Streaming
Real-time serving through Nano-vLLM or vLLM-Omni acceleration.

VoxCPM2 / Continuous speech representations

Design voices. Clone voices. Stream voices.
Architecture: Diffusion-autoregressive generation without a discrete audio tokenizer.
Base: Built on MiniCPM-4 with open-source code and weights.
Output: Native 48kHz audio through AudioVAE V2 asymmetric encode/decode.
Serving: OpenAI-compatible streaming paths via official and community inference stacks.
Release: VoxCPM2
Coverage: 30 languages
Dialects: 9 Chinese dialects
Quality: 48kHz output
License: Apache-2.0

Highlights

The core capabilities that make VoxCPM stand out.

The official repository centers on six capabilities: multilingual synthesis, voice design, controllable cloning, extreme-fidelity continuation, context-aware prosody, and fast streaming.

01

30-language synthesis

Generate speech from raw text across 30 languages without separate language tags.

02

Voice design

Describe a voice in natural language and synthesize from that prompt with no reference clip.

03

Controllable cloning

Clone a reference voice while layering style instructions for pace, emotion, and energy.

04

Ultimate cloning

Use prompt audio and transcript for continuation-style cloning that preserves micro-details.

05

Context-aware prosody

The model infers rhythm and expressiveness from text instead of relying on rigid templates.

06

Real-time serving

Pair with Nano-vLLM or vLLM-Omni for lower RTF and OpenAI-compatible serving workflows.
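"Lower RTF" refers to the real-time factor: synthesis time divided by the duration of the audio produced, where values below 1.0 mean the system generates audio faster than playback. A minimal sketch of the metric (the numbers below are illustrative, not measured VoxCPM figures):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.

    RTF < 1.0 means audio is generated faster than real time, which is
    the baseline requirement for streaming playback without stalls.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative numbers: 2.5 s of compute for a 10 s clip -> RTF 0.25.
rtf = real_time_factor(2.5, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```

Accelerated serving stacks like Nano-vLLM or vLLM-Omni aim to push this ratio down so streaming stays ahead of playback.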

Use Cases

Three ways teams usually approach VoxCPM.

The project is strong when you need a direct path from text or a short voice reference to expressive, high-quality audio with open-source flexibility.

A

Build multilingual product voices

Start from text only, cover global markets, and keep a single synthesis stack for product narration, onboarding, or interactive assistants.

30 languages · Context-aware · 48kHz
B

Prototype new speaker identities

Use voice design prompts to audition personas before collecting studio data, then refine toward the timbre and tone you actually want.

Natural-language control · No reference audio · Creative direction
C

Clone with tighter expressive control

Combine a short reference clip with control instructions or prompt audio continuation when fidelity and style retention both matter.

Reference voice · Style prompts · Streaming deployment

Quick Start

Install once, generate speech immediately.

The official project targets Python 3.10+, PyTorch 2.5+, and CUDA 12.0+. For local testing, the shortest path is the `openbmb/VoxCPM2` checkpoint.

  • `pip install voxcpm` for the base package.
  • `openbmb/VoxCPM2` is the current recommended release.
  • Use Read the Docs for full quick-start and deployment details.
  • ModelScope mirrors are available for users in mainland China.
Python API
```python
from voxcpm import VoxCPM
import soundfile as sf

# Download and load the VoxCPM2 checkpoint from the Hugging Face Hub.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,  # skip the optional prompt-audio denoiser
)

# Plain text in, waveform out; no reference audio needed for a default voice.
wav = model.generate(
    text="VoxCPM2 is ready for realistic multilingual speech synthesis.",
    cfg_value=2.0,           # classifier-free guidance strength
    inference_timesteps=10,  # number of inference timesteps
)

# Write the waveform at the model's native sample rate (48kHz).
sf.write("demo.wav", wav, model.tts_model.sample_rate)
```
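For controllable or ultimate cloning, the VoxCPM 1.x API pairs a reference clip with its transcript via the `prompt_wav_path` and `prompt_text` keyword arguments to `generate()`. Assuming VoxCPM2 keeps that interface (check the official docs to confirm; the file names below are placeholders), a continuation-style cloning call can be sketched as:

```python
from pathlib import Path

def cloning_kwargs(text: str, prompt_wav: str, prompt_transcript: str) -> dict:
    """Build generate() keyword arguments for continuation-style cloning.

    Supplying the prompt audio together with its exact transcript is what
    the project recommends for the highest cloning fidelity.
    NOTE: prompt_wav_path / prompt_text follow the VoxCPM 1.x API and are
    assumed to carry over to VoxCPM2.
    """
    if not Path(prompt_wav).suffix:
        raise ValueError("prompt_wav should point to an audio file")
    return {
        "text": text,
        "prompt_wav_path": prompt_wav,     # reference voice to clone
        "prompt_text": prompt_transcript,  # exact transcript of that clip
        "cfg_value": 2.0,
        "inference_timesteps": 10,
    }

# With a loaded model (requires the checkpoint and a GPU):
#   wav = model.generate(**cloning_kwargs(
#       text="Continue in the reference speaker's voice.",
#       prompt_wav="reference.wav",
#       prompt_transcript="Transcript of the reference clip.",
#   ))
#   sf.write("cloned.wav", wav, model.tts_model.sample_rate)
```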

Ecosystem

The shortest route to code, docs, models, demos, and community.

This landing page stays lightweight and points straight to the official resources you actually need to evaluate or ship VoxCPM.

FAQ

Common questions from people landing on VoxCPM for the first time.

What does “tokenizer-free” mean in VoxCPM?

VoxCPM directly models continuous speech representations instead of depending on a discrete audio tokenizer. That is the core architectural difference highlighted by the official project.

Can VoxCPM create a brand-new voice without any reference sample?

Yes. The voice design workflow accepts natural-language descriptions so you can generate a fresh voice without first providing a reference recording.

How do I get the best cloning similarity?

The official examples recommend combining reference audio with prompt audio and its transcript for the highest-fidelity continuation behavior.

Where should I start if I only want to evaluate quality?

Open the Live Playground and the audio sample page first. Then move to the GitHub repo and Read the Docs once you want to install locally or deploy inference.