ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

Overview

ZipVoice is a high-quality zero-shot TTS model with a small model size and fast inference speed.

Key features:

  • Small and fast: only 123M parameters.

  • High-quality: state-of-the-art voice cloning performance in speaker similarity, intelligibility, and naturalness.

  • Multilingual: supports Chinese and English.

News

2025/06/16: 🔥 ZipVoice is released.

Installation

  • Clone the icefall repository and change to the zipvoice directory:
git clone https://github.com/k2-fsa/icefall.git
cd icefall/egs/zipvoice
  • Create a Python virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate
  • Install the required packages:
pip install -r requirements.txt

Usage

To generate speech with our pre-trained ZipVoice or ZipVoice-Distill models, use the following commands (the required models will be downloaded from HuggingFace):

1. Inference of a single sentence:

python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice_distill" \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav

# Example with a pre-defined prompt wav and text
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice_distill" \
    --prompt-wav assets/prompt-en.wav \
    --prompt-text "Some call me nature, others call me mother nature. I've been here for over four point five billion years, twenty two thousand five hundred times longer than you." \
    --text "Welcome to use our tts model, have fun!" \
    --res-wav-path result.wav
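
When the single-sentence command is driven from a script, it can help to assemble the arguments programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_infer_command` is hypothetical, and it uses only the flags shown above and the two model names this README lists.

```python
import shlex

# The two model names this README lists for --model-name.
MODEL_NAMES = ("zipvoice", "zipvoice_distill")


def build_infer_command(model_name, prompt_wav, prompt_text, text, res_wav_path):
    """Return the argument list for a single-sentence zipvoice_infer.py call.

    Hypothetical helper: it only mirrors the flags shown in the README.
    """
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {model_name!r}")
    return [
        "python3", "zipvoice/zipvoice_infer.py",
        "--model-name", model_name,
        "--prompt-wav", prompt_wav,
        "--prompt-text", prompt_text,
        "--text", text,
        "--res-wav-path", res_wav_path,
    ]


cmd = build_infer_command(
    "zipvoice_distill",
    "prompt.wav",
    "I am the transcription of the prompt wav.",
    "I am the text to be synthesized.",
    "result.wav",
)
# shlex.join quotes each argument so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the icefall/egs/zipvoice directory; passing a list rather than a string avoids shell-quoting problems with apostrophes in the prompt text.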

2. Inference of a list of sentences:

python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice_distill" \
    --test-list test.tsv \
    --res-dir results/test
  • --model-name can be zipvoice or zipvoice_distill, which are models before and after distillation, respectively.
  • Each line of test.tsv is in the format of {wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}.
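
The test.tsv layout described above can be produced with a few lines of Python. This is a sketch under stated assumptions: the helper name `write_test_list` and the sample entry are hypothetical, and rejecting tabs or newlines inside a field is a precaution of this sketch, not a documented requirement.

```python
# Hypothetical entries in the order the README gives:
# (wav_name, prompt_transcription, prompt_wav, text)
entries = [
    (
        "sample_0001",
        "I am the transcription of the prompt wav.",
        "prompt.wav",
        "I am the text to be synthesized.",
    ),
]


def write_test_list(rows, path):
    """Write one tab-separated line per utterance to `path`."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for row in rows:
            if len(row) != 4:
                raise ValueError(f"expected 4 fields, got {len(row)}")
            # A tab or newline inside a field would break the line format.
            if any("\t" in field or "\n" in field for field in row):
                raise ValueError(f"field contains a tab or newline: {row!r}")
            f.write("\t".join(row) + "\n")
```

Calling `write_test_list(entries, "test.tsv")` writes a file that can be passed to `--test-list`.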

Note: If you have trouble connecting to HuggingFace, try:

export HF_ENDPOINT=https://hf-mirror.com

Training Your Own Model

The following steps show how to train a model from scratch on the Emilia and LibriTTS datasets.

0. Install dependencies for training

# Install pytorch and k2.
# If you want to use different versions, please refer to https://k2-fsa.org/get-started/k2/ for details.
# For users in China mainland, please refer to https://k2-fsa.org/zh-CN/get-started/k2/

# Note: Make sure you have installed the correct version of PyTorch and k2 that matches your CUDA version.
# For example, if you want to use PyTorch 2.5.1 with CUDA 12.1, you can install PyTorch and k2 as follows:

pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install k2==1.24.4.dev20250208+cuda12.1.torch2.5.1 -f https://k2-fsa.github.io/k2/cuda.html

pip install -r ../../requirements.txt
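
The k2 version string in the command above encodes the CUDA and PyTorch versions it was built for. The sketch below checks such a string against a given PyTorch and CUDA version. It assumes the `<k2>+cuda<X.Y>.torch<A.B.C>` pattern inferred from the single example in this README; other builds (for example CPU-only wheels) are not handled, and both function names are hypothetical.

```python
import re

# Pattern inferred from the one example in the README (an assumption).
_K2_TAG = re.compile(
    r"(?P<k2>[^+]+)\+cuda(?P<cuda>\d+\.\d+)\.torch(?P<torch>\d+\.\d+\.\d+)"
)


def parse_k2_version(version):
    """Split a k2 version string into its k2, CUDA, and PyTorch parts."""
    match = _K2_TAG.fullmatch(version)
    if match is None:
        raise ValueError(f"unrecognized k2 version string: {version!r}")
    return match.groupdict()


def k2_matches(k2_version, torch_version, cuda_version):
    """Return True if the k2 build tag agrees with the given versions."""
    tag = parse_k2_version(k2_version)
    # A PyTorch version may carry a local suffix such as '+cu121';
    # only the release part is compared.
    torch_release = torch_version.split("+", 1)[0]
    return tag["torch"] == torch_release and tag["cuda"] == cuda_version
```

In an installed environment the three inputs could come from `importlib.metadata.version("k2")`, `torch.__version__`, and `torch.version.cuda`, assuming the installed wheel metadata keeps the `+cuda...` suffix.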

1. Data Preparation

1.1. Prepare the Emilia dataset

bash scripts/prepare_emilia.sh

See scripts/prepare_emilia.sh for step-by-step instructions.

1.2 Prepare the LibriTTS dataset

bash scripts/prepare_libritts.sh

See scripts/prepare_libritts.sh for step-by-step instructions.

2. Training

2.1 Training on Emilia

2.1.1 Train the ZipVoice model
  • Training:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
        --world-size 8 \
        --use-fp16 1 \
        --dataset emilia \
        --max-duration 500 \
        --lr-hours 30000 \
        --lr-batches 7500 \
        --token-file "data/tokens_emilia.txt" \
        --manifest-dir "data/fbank" \
        --num-epochs 11 \
        --exp-dir zipvoice/exp_zipvoice
  • Average the checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
      --epoch 11 \
      --avg 4 \
      --distill 0 \
      --token-file data/tokens_emilia.txt \
      --dataset "emilia" \
      --exp-dir ./zipvoice/exp_zipvoice
# The generated model is zipvoice/exp_zipvoice/epoch-11-avg-4.pt
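
As a conceptual illustration of what averaging checkpoints means, the sketch below takes the element-wise mean of several parameter dictionaries, with plain Python lists standing in for tensors. It shows uniform parameter averaging only; how generate_averaged_model.py selects and weights checkpoints for `--epoch` and `--avg` is not described in this README and may differ. The function name is hypothetical.

```python
def average_state_dicts(state_dicts):
    """Return the element-wise mean of several parameter dictionaries.

    Each dictionary maps a parameter name to a flat list of floats
    (a stand-in for a tensor in this sketch).
    """
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("state dicts have different parameter names")
    n = len(state_dicts)
    averaged = {}
    for key in keys:
        columns = [sd[key] for sd in state_dicts]
        if len({len(col) for col in columns}) != 1:
            raise ValueError(f"parameter {key!r} has inconsistent sizes")
        # zip(*columns) groups the i-th value from every checkpoint.
        averaged[key] = [sum(values) / n for values in zip(*columns)]
    return averaged
```

For example, averaging `{"w": [1.0, 2.0]}` and `{"w": [3.0, 4.0]}` gives `{"w": [2.0, 3.0]}`.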
2.1.2. Train the ZipVoice-Distill model (Optional)
  • The first-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
        --world-size 8 \
        --use-fp16 1 \
        --tensorboard 1 \
        --dataset "emilia" \
        --base-lr 0.0005 \
        --max-duration 500 \
        --token-file "data/tokens_emilia.txt" \
        --manifest-dir "data/fbank" \
        --teacher-model zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
        --num-updates 60000 \
        --distill-stage "first" \
        --exp-dir zipvoice/exp_zipvoice_distill_1stage
  • Average checkpoints for the second-stage initialization:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
      --iter 60000 \
      --avg 7 \
      --distill 1 \
      --token-file data/tokens_emilia.txt \
      --dataset "emilia" \
      --exp-dir ./zipvoice/exp_zipvoice_distill_1stage
# The generated model is zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt
  • The second-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
        --world-size 8 \
        --use-fp16 1 \
        --tensorboard 1 \
        --dataset "emilia" \
        --base-lr 0.0001 \
        --max-duration 200 \
        --token-file "data/tokens_emilia.txt" \
        --manifest-dir "data/fbank" \
        --teacher-model zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt \
        --num-updates 2000 \
        --distill-stage "second" \
        --exp-dir zipvoice/exp_zipvoice_distill_new

2.2 Training on LibriTTS

2.2.1 Train the ZipVoice model
  • Training:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
        --world-size 8 \
        --use-fp16 1 \
        --dataset libritts \
        --max-duration 250 \
        --lr-epochs 10 \
        --lr-batches 7500 \
        --token-file "data/tokens_libritts.txt" \
        --manifest-dir "data/fbank" \
        --num-epochs 60 \
        --exp-dir zipvoice/exp_zipvoice_libritts
  • Average the checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
      --epoch 60 \
      --avg 10 \
      --distill 0 \
      --token-file data/tokens_libritts.txt \
      --dataset "libritts" \
      --exp-dir ./zipvoice/exp_zipvoice_libritts
# The generated model is zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt
2.2.2 Train the ZipVoice-Distill model (Optional)
  • The first-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
        --world-size 8 \
        --use-fp16 1 \
        --tensorboard 1 \
        --dataset "libritts" \
        --base-lr 0.001 \
        --max-duration 250 \
        --token-file "data/tokens_libritts.txt" \
        --manifest-dir "data/fbank" \
        --teacher-model zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
        --num-epochs 6 \
        --distill-stage "first" \
        --exp-dir zipvoice/exp_zipvoice_distill_1stage_libritts
  • Average checkpoints for the second-stage initialization:
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
      --epoch 6 \
      --avg 3 \
      --distill 1 \
      --token-file data/tokens_libritts.txt \
      --dataset "libritts" \
      --exp-dir ./zipvoice/exp_zipvoice_distill_1stage_libritts
# The generated model is zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt
  • The second-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
        --world-size 8 \
        --use-fp16 1 \
        --tensorboard 1 \
        --dataset "libritts" \
        --base-lr 0.001 \
        --max-duration 250 \
        --token-file "data/tokens_libritts.txt" \
        --manifest-dir "data/fbank" \
        --teacher-model zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt \
        --num-epochs 6 \
        --distill-stage "second" \
        --exp-dir zipvoice/exp_zipvoice_distill_libritts
  • Average checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
      --epoch 6 \
      --avg 3 \
      --distill 1 \
      --token-file data/tokens_libritts.txt \
      --dataset "libritts" \
      --exp-dir ./zipvoice/exp_zipvoice_distill_libritts
# The generated model is ./zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt

3. Inference with the trained model

3.1 Inference with the model trained on Emilia

3.1.1 ZipVoice model (before distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
      --checkpoint zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
      --distill 0 \
      --token-file "data/tokens_emilia.txt" \
      --test-list test.tsv \
      --res-dir results/test \
      --num-step 16 \
      --guidance-scale 1
3.1.2 ZipVoice-Distill model (after distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
      --checkpoint zipvoice/exp_zipvoice_distill_new/checkpoint-2000.pt \
      --distill 1 \
      --token-file "data/tokens_emilia.txt" \
      --test-list test.tsv \
      --res-dir results/test_distill \
      --num-step 8 \
      --guidance-scale 3

3.2 Inference with the model trained on LibriTTS

3.2.1 ZipVoice model (before distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
      --checkpoint zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
      --distill 0 \
      --token-file "data/tokens_libritts.txt" \
      --test-list test.tsv \
      --res-dir results/test_libritts \
      --num-step 8 \
      --guidance-scale 1 \
      --target-rms 1.0 \
      --t-shift 0.7
3.2.2 ZipVoice-Distill model (after distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
      --checkpoint zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt \
      --distill 1 \
      --token-file "data/tokens_libritts.txt" \
      --test-list test.tsv \
      --res-dir results/test_distill_libritts \
      --num-step 4 \
      --guidance-scale 3 \
      --target-rms 1.0 \
      --t-shift 0.7

4. Evaluation on benchmarks

See local/evaluate.sh for details of objective metrics evaluation on three test sets, i.e., LibriSpeech-PC test-clean, Seed-TTS test-en and Seed-TTS test-zh.

Citation

@article{zhu-2025-zipvoice,
      title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
      author={Han Zhu and Wei Kang and Zengwei Yao and Liyong Guo and Fangjun Kuang and Zhaoqing Li and Weiji Zhuang and Long Lin and Daniel Povey},
      journal={arXiv preprint arXiv:2506.13053},
      year={2025},
}