🇻🇳 Vietnamese Text-to-Speech (TTS)

Model Description

This is a Vietnamese Text-to-Speech (TTS) model trained to generate natural-sounding Vietnamese speech from text. The model is designed for applications such as virtual assistants, audiobooks, and accessibility tools.

Model Name: zalopay/vietnamese-tts
Language: Vietnamese (vi)
Task: Text-to-Speech (TTS)
Framework: F5-TTS
License: CC-BY-4.0

Model Architecture

F5-TTS uses a Diffusion Transformer (DiT) with ConvNeXt V2 blocks, which allows faster training and inference.

Training Data

Dataset: This model was trained on 200+ hours of public Vietnamese voice data and YouTube audio.

Inference Example

from cached_path import cached_path

from f5_tts.model import DiT
from f5_tts.infer.utils_infer import (
    preprocess_ref_audio_text,
    load_vocoder,
    load_model,
    infer_process,
    save_spectrogram,
)


vocoder = load_vocoder()
# DiT configuration: dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512
model = load_model(
    DiT,
    dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
    ckpt_path=str(
        cached_path("hf://zalopay/vietnamese-tts/model_960000.pt")
    ),
    mel_spec_type="vocos",
    vocab_file=str(cached_path("hf://zalopay/vietnamese-tts/vocab.txt")),
)

...

# ref_audio_orig: path to a reference audio clip; ref_text: its transcript.
# gen_text: the Vietnamese text to synthesize; speed: speaking-rate multiplier.
ref_audio, ref_text = preprocess_ref_audio_text(ref_audio_orig, ref_text)
print("Reference text: {} with audio file {}".format(ref_text, ref_audio_orig))
final_wave, final_sample_rate, combined_spectrogram = infer_process(
    ref_audio,
    ref_text,
    gen_text,
    model,
    vocoder,
    cross_fade_duration=0.15,
    nfe_step=32,
    speed=speed,
)
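
The example above stops once `infer_process` returns the waveform and its sample rate. A minimal sketch of writing that result to a WAV file with only the Python standard library might look like the following. It assumes `final_wave` is a one-dimensional sequence of float samples in the range [-1, 1]. The helper name `save_wave`, the output file name, and the synthetic 24 kHz test tone are illustrative assumptions and not part of F5-TTS.

```python
import math
import os
import struct
import tempfile
import wave


def save_wave(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of samples written.
    """
    frames = bytearray()
    for s in samples:
        # Clip to the valid range before scaling to a signed 16-bit integer.
        s = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)  # mono
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(int(sample_rate))
        wf.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in for `final_wave`: one second of a 440 Hz tone.
# The 24 kHz rate is chosen only for this demo.
demo_rate = 24000
demo_wave = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate)]
demo_path = os.path.join(tempfile.gettempdir(), "vietnamese_tts_demo.wav")
n_written = save_wave(demo_path, demo_wave, demo_rate)
```

With the inference result, the call would be `save_wave("output.wav", final_wave, final_sample_rate)`. If the `soundfile` package is installed, `soundfile.write("output.wav", final_wave, final_sample_rate)` is a common alternative.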

Applications

Virtual assistants (e.g., chatbots, AI voice interactions)
Audiobooks and content narration
Accessibility tools for visually impaired users
Automated announcements and voiceovers

Limitations & Biases

May struggle with uncommon words or names.
Limited support for different accents or dialects.
Background noise or pronunciation inconsistencies may occur.
Duplicated (repeated) speech may occur.

Citation

If you use this model, please cite:

@misc{zalopay-vietnamese-tts,
  title={Zalopay Vietnamese Text-to-Speech Model},
  author={Zalopay},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/zalopay/vietnamese-tts}
}

Acknowledgments

Special thanks to the F5-TTS project for providing such a wonderful base model and framework.
