hynt/ZipVoice-Vietnamese-2500h

hynt committed 7 days ago

Commit 20404a9 · verified · 1 parent: 4b65bde

Update README.md

Files changed (1)

README.md +8 -10

@@ -7,12 +7,10 @@ tags:
 license: cc-by-nc-sa-4.0
 library_name: pytorch
 datasets:
-  - VLSP2021
-  - VLSP2022
-  - VLSP2023
-  - vietTTS
   - UEH
-model_name: ZipVoice-Vietnamese-150h
 language: vi
 ---
@@ -20,7 +18,7 @@ language: vi
 This model is only intended for **research purposes**.
 **Access requests must be made using an institutional, academic, or corporate email**. Requests from public email providers will be denied. We appreciate your understanding.
-# 🎙️ ZipVoice-Vietnamese-150h
 ZipVoice is a series of fast and high-quality zero-shot TTS models based on flow matching.
 Key features:
@@ -32,7 +30,7 @@ Key features:
 4. Multi-mode: support both single-speaker and dialogue speech generation.
-This checkpoint is a compact fine-tuned version of ZipVoice trained on 150 hours of Vietnamese speech.
 🔗 For more fine-tuning and inference experiments, visit: https://github.com/k2-fsa/ZipVoice.
@@ -42,8 +40,8 @@ This checkpoint is a compact fine-tuned version of ZipVoice trained on 150 hours
 ## 📌 Model Details
-- **Dataset:** VLSP 2021, VLSP 2022, VLSP 2023, VietTTS, TeacherDinh-UEH and some speech sources from YouTube channels.
-- **Total dataset durations:** 150 hours
 - **Data processing Technique:**
   - Remove all music background from audios, using facebook demucs model: https://github.com/facebookresearch/demucs
   - Do not use audio files shorter than 1 second or longer than 30 seconds.
@@ -53,7 +51,7 @@ This checkpoint is a compact fine-tuned version of ZipVoice trained on 150 hours
   - **Base Model:** ZipVoice with espeak-ng vi for tokenizer
   - **GPU:** RTX 3090
   - **Batch Siz:** Max duration 200
-- **Training Progress:** Stopped at **96,000 steps at epoch 30**
 ---

 license: cc-by-nc-sa-4.0
 library_name: pytorch
 datasets:
+  - PhoAudioBook
+  - ViVoice
   - UEH
+model_name: ZipVoice-Vietnamese-2500h
 language: vi
 ---
 This model is only intended for **research purposes**.
 **Access requests must be made using an institutional, academic, or corporate email**. Requests from public email providers will be denied. We appreciate your understanding.
+# 🎙️ ZipVoice-Vietnamese-2500h
 ZipVoice is a series of fast and high-quality zero-shot TTS models based on flow matching.
 Key features:
 4. Multi-mode: support both single-speaker and dialogue speech generation.
+This checkpoint is a compact fine-tuned version of ZipVoice trained on 2500 hours of Vietnamese speech.
 🔗 For more fine-tuning and inference experiments, visit: https://github.com/k2-fsa/ZipVoice.
 ## 📌 Model Details
+- **Dataset:** PhoAudioBook, ViVoice, TeacherDinh-UEH.
+- **Total dataset durations:** 2500 hours
 - **Data processing Technique:**
   - Remove all music background from audios, using facebook demucs model: https://github.com/facebookresearch/demucs
   - Do not use audio files shorter than 1 second or longer than 30 seconds.
   - **Base Model:** ZipVoice with espeak-ng vi for tokenizer
   - **GPU:** RTX 3090
   - **Batch Siz:** Max duration 200
+- **Training Progress:** Stopped at **525,000 steps at epoch 11**
 ---
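
The README's duration rule (do not use audio files shorter than 1 second or longer than 30 seconds) could be sketched roughly as follows. This is a minimal illustration, not the author's preprocessing script: the names `wav_duration_seconds` and `filter_by_duration`, the inclusive bounds, and the assumption that clips are PCM WAV files are all hypothetical choices made for the example.

```python
import wave
from typing import Iterable, List, Tuple

MIN_SECONDS = 1.0   # lower bound stated in the README
MAX_SECONDS = 30.0  # upper bound stated in the README


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds (assumes WAV input)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def filter_by_duration(
    clips: Iterable[Tuple[str, float]],
    min_seconds: float = MIN_SECONDS,
    max_seconds: float = MAX_SECONDS,
) -> List[Tuple[str, float]]:
    """Keep (path, duration) pairs whose duration is within the bounds, inclusive."""
    return [(path, dur) for path, dur in clips if min_seconds <= dur <= max_seconds]
```

In practice the durations would probably come from `wav_duration_seconds` (or an equivalent reader for other formats) applied after the Demucs music-removal step; whether the original pipeline treats exactly 1.0 s and 30.0 s as valid is not stated in the README.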