---
tags:
  - text-to-speech
  - vietnamese
  - ai-model
  - deep-learning
license: cc-by-nc-sa-4.0
library_name: pytorch
datasets:
  - PhoAudioBook
  - ViVoice
  - UEH
model_name: ZipVoice-Vietnamese-2500h
language: vi
---

# 🛑 Important Note ⚠️  
This model is only intended for **research purposes**.  
**Access requests must be made using an institutional, academic, or corporate email**. Requests from public email providers will be denied. We appreciate your understanding.  

# πŸŽ™οΈ ZipVoice-Vietnamese-2500h
ZipVoice is a series of fast and high-quality zero-shot TTS models based on flow matching.

Key features:
1. Small and fast: only 123M parameters.

2. High-quality voice cloning: state-of-the-art performance in speaker similarity, intelligibility, and naturalness.

3. Multi-lingual: supports Chinese and English.

4. Multi-mode: supports both single-speaker and dialogue speech generation.

This checkpoint is a compact fine-tuned version of ZipVoice trained on 2500 hours of Vietnamese speech.  

🔗 For more fine-tuning and inference experiments, visit: https://github.com/k2-fsa/ZipVoice.  

📜 **License:** [CC-BY-NC-SA-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0). Non-commercial research use only.  

---

## 📌 Model Details

- **Datasets:** PhoAudioBook, ViVoice, TeacherDinh-UEH.
- **Total dataset duration:** 2500 hours
- **Data processing:**
  - Remove background music from all audio using Facebook's Demucs model: https://github.com/facebookresearch/demucs
  - Exclude audio files shorter than 1 second or longer than 30 seconds.
  - Keep the original punctuation marks unchanged.
  - Normalize transcripts to lowercase.
- **Training Configuration:**  
  - **Base Model:** ZipVoice with espeak-ng vi for tokenizer  
  - **GPU:** RTX 3090  
  - **Batch Size:** Max duration 200  
- **Training Progress:** Stopped at **525,000 steps at epoch 11**  
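
The duration filter and text normalization listed under data processing can be sketched roughly as follows. This is only an illustration, not the actual preprocessing script: the `Utterance` record, its field names, and the `prepare` helper are hypothetical, and the Demucs music-removal step is left out because it runs on the audio itself.

```python
from dataclasses import dataclass

MIN_SECONDS = 1.0   # from the list above: clips shorter than 1 second are excluded
MAX_SECONDS = 30.0  # from the list above: clips longer than 30 seconds are excluded


@dataclass
class Utterance:
    """Hypothetical record for one clip; the field names are illustrative."""
    audio_path: str
    duration: float  # length of the clip in seconds
    text: str        # transcript of the clip


def keep(utt: Utterance) -> bool:
    """Return True if the clip lies inside the 1 to 30 second window."""
    return MIN_SECONDS <= utt.duration <= MAX_SECONDS


def normalize_text(text: str) -> str:
    """Lowercase the transcript; punctuation is left exactly as it was."""
    return text.lower()


def prepare(utts):
    """Drop out-of-range clips and lowercase the transcripts of the rest."""
    return [
        Utterance(u.audio_path, u.duration, normalize_text(u.text))
        for u in utts
        if keep(u)
    ]
```

Clips of exactly 1 or 30 seconds are kept here, since the rule only excludes clips that are shorter or longer than those bounds.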

---

## 🛑 Update Note
Thank you, Teacher Định from the University of Economics Ho Chi Minh City (UEH), for providing me with an additional 50 hours of high-quality labeled data.

His contact: https://www.facebook.com/luudinhit93