
Open-source multilingual speech generation

Tokenizer-free TTS built for voice design and faithful cloning.

VoxCPM2 is a 2B-parameter speech model trained on 2M+ hours of multilingual audio. It synthesizes 30 languages and 9 Chinese dialects, creates new voices from natural-language prompts, and delivers controllable cloning with native 48kHz output.

2B parameters · 2M+ training hours · 30 languages · 9 Chinese dialects · Apache-2.0
Voice Design
Create a new voice from natural language alone.
Controllable Cloning
Preserve timbre while steering pace, emotion, and style.
Ultimate Cloning
Continue from prompt audio and transcript for higher fidelity.
Streaming
Real-time serving through Nano-vLLM or vLLM-Omni acceleration.

VoxCPM2 / Continuous speech representations

Design voices. Clone voices. Stream voices.
Architecture: Diffusion-autoregressive generation without a discrete audio tokenizer.
Base: Built on MiniCPM-4 with open-source code and weights.
Output: Native 48kHz audio through AudioVAE V2 asymmetric encode/decode.
Serving: OpenAI-compatible streaming paths via official and community inference stacks.
Release: VoxCPM2
Coverage: 30 languages
Dialects: 9 Chinese dialects
Quality: 48kHz output
License: Apache-2.0

Highlights

The core capabilities that make VoxCPM stand out.

The official repository centers on six capabilities: multilingual synthesis, voice design, controllable cloning, extreme-fidelity continuation, context-aware prosody, and fast streaming.

01

30-language synthesis

Generate speech from raw text across 30 languages without separate language tags.

02

Voice design

Describe a voice in natural language and synthesize from that prompt with no reference clip.

03

Controllable cloning

Clone a reference voice while layering style instructions for pace, emotion, and energy.

04

Ultimate cloning

Use prompt audio and transcript for continuation-style cloning that preserves micro-details.

05

Context-aware prosody

The model infers rhythm and expressiveness from text instead of relying on rigid templates.

06

Real-time serving

Pair with Nano-vLLM or vLLM-Omni for lower RTF and OpenAI-compatible serving workflows.
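"Lower RTF" refers to the real-time factor: synthesis time divided by the duration of the audio produced, where values below 1.0 mean the system generates audio faster than playback. A minimal sketch of the metric (the numbers below are illustrative, not measured VoxCPM figures):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.

    RTF < 1.0 means audio is generated faster than real time, which is
    the baseline requirement for streaming playback without stalls.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative numbers: 2.5 s of compute for a 10 s clip -> RTF 0.25.
rtf = real_time_factor(2.5, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```

Accelerated serving stacks like Nano-vLLM or vLLM-Omni aim to push this ratio down so streaming stays ahead of playback.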

Use Cases

Three ways teams usually approach VoxCPM.

The project is strong when you need a direct path from text or a short voice reference to expressive, high-quality audio with open-source flexibility.

A

Build multilingual product voices

Start from text only, cover global markets, and keep a single synthesis stack for product narration, onboarding, or interactive assistants.

30 languages · Context-aware · 48kHz
B

Prototype new speaker identities

Use voice design prompts to audition personas before collecting studio data, then refine toward the timbre and tone you actually want.

Natural-language control · No reference audio · Creative direction
C

Clone with tighter expressive control

Combine a short reference clip with control instructions or prompt audio continuation when fidelity and style retention both matter.

Reference voice · Style prompts · Streaming deployment

Quick Start

Install once, generate speech immediately.

The official project targets Python 3.10+, PyTorch 2.5+, and CUDA 12.0+. For local testing, the shortest path is the `openbmb/VoxCPM2` checkpoint.

  • `pip install voxcpm` for the base package.
  • `openbmb/VoxCPM2` is the current recommended release.
  • Use Read the Docs for full quick-start and deployment details.
  • ModelScope mirrors are available for users in mainland China.
Python API
```python
from voxcpm import VoxCPM
import soundfile as sf

# Download and load the VoxCPM2 checkpoint from the Hugging Face Hub.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,  # skip the optional prompt-audio denoiser
)

# Plain text in, waveform out; no reference audio needed for a default voice.
wav = model.generate(
    text="VoxCPM2 is ready for realistic multilingual speech synthesis.",
    cfg_value=2.0,           # classifier-free guidance strength
    inference_timesteps=10,  # number of inference timesteps
)

# Write the waveform at the model's native sample rate (48kHz).
sf.write("demo.wav", wav, model.tts_model.sample_rate)
```
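For controllable or ultimate cloning, the VoxCPM 1.x API pairs a reference clip with its transcript via the `prompt_wav_path` and `prompt_text` keyword arguments to `generate()`. Assuming VoxCPM2 keeps that interface (check the official docs to confirm; the file names below are placeholders), a continuation-style cloning call can be sketched as:

```python
from pathlib import Path

def cloning_kwargs(text: str, prompt_wav: str, prompt_transcript: str) -> dict:
    """Build generate() keyword arguments for continuation-style cloning.

    Supplying the prompt audio together with its exact transcript is what
    the project recommends for the highest cloning fidelity.
    NOTE: prompt_wav_path / prompt_text follow the VoxCPM 1.x API and are
    assumed to carry over to VoxCPM2.
    """
    if not Path(prompt_wav).suffix:
        raise ValueError("prompt_wav should point to an audio file")
    return {
        "text": text,
        "prompt_wav_path": prompt_wav,     # reference voice to clone
        "prompt_text": prompt_transcript,  # exact transcript of that clip
        "cfg_value": 2.0,
        "inference_timesteps": 10,
    }

# With a loaded model (requires the checkpoint and a GPU):
#   wav = model.generate(**cloning_kwargs(
#       text="Continue in the reference speaker's voice.",
#       prompt_wav="reference.wav",
#       prompt_transcript="Transcript of the reference clip.",
#   ))
#   sf.write("cloned.wav", wav, model.tts_model.sample_rate)
```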

Ecosystem

The shortest route to code, docs, models, demos, and community.

This landing page stays lightweight and points straight to the official resources you actually need to evaluate or ship VoxCPM.

FAQ

Common questions from people landing on VoxCPM for the first time.

What does “tokenizer-free” mean in VoxCPM?

VoxCPM directly models continuous speech representations instead of depending on a discrete audio tokenizer. That is the core architectural difference highlighted by the official project.

Can VoxCPM create a brand-new voice without any reference sample?

Yes. The voice design workflow accepts natural-language descriptions so you can generate a fresh voice without first providing a reference recording.

How do I get the best cloning similarity?

The official examples recommend combining reference audio with prompt audio and its transcript for the highest-fidelity continuation behavior.

Where should I start if I only want to evaluate quality?

Open the Live Playground and the audio sample page first. Then move to the GitHub repo and Read the Docs once you want to install locally or deploy inference.