🇻🇳 Vietnamese Text-to-Speech (TTS)

Model Description

This is a Vietnamese Text-to-Speech (TTS) model trained to generate natural-sounding Vietnamese speech from text. The model is designed for applications such as virtual assistants, audiobooks, and accessibility tools.

  • Model Name: zalopay/vietnamese-tts
  • Language: Vietnamese (vi)
  • Task: Text-to-Speech (TTS)
  • Framework: F5-TTS
  • License: CC-BY-4.0

Model Architecture

  • F5-TTS uses a Diffusion Transformer (DiT) with ConvNeXt V2, which allows faster training and inference.
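
To give a rough sense of the model's scale, the sketch below takes the hyperparameters passed to load_model in the Inference Example section and derives two sizes from them. The derivation assumes a conventional Transformer layout (hidden width split evenly across attention heads, feed-forward inner width equal to dim * ff_mult); this card does not state that layout explicitly, so treat the derived numbers as illustrative.

```python
# Hyperparameters copied from the load_model call in this card's inference example.
dit_config = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

# Derived sizes under the ASSUMED conventional Transformer layout:
# the hidden width is split evenly across heads, and the feed-forward
# block expands the hidden width by a factor of ff_mult.
head_dim = dit_config["dim"] // dit_config["heads"]
ffn_inner_dim = dit_config["dim"] * dit_config["ff_mult"]

print(f"layers={dit_config['depth']}, head_dim={head_dim}, ffn_inner_dim={ffn_inner_dim}")
```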

Training Data

  • Dataset: the model was trained on more than 200 hours of public Vietnamese voice data and YouTube audio.

Inference Example

from cached_path import cached_path
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import (
    preprocess_ref_audio_text,
    load_vocoder,
    load_model,
    infer_process,
    save_spectrogram,
)

# Load the default vocoder, which turns mel spectrograms into waveforms.
vocoder = load_vocoder()

# DiT configuration: dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512
model = load_model(
    DiT,
    dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
    # Download (and cache) the checkpoint and vocabulary from the Hugging Face Hub.
    ckpt_path=str(
        cached_path("hf://zalopay/vietnamese-tts/model_960000.pt")
    ),
    mel_spec_type="vocos",
    vocab_file=str(cached_path("hf://zalopay/vietnamese-tts/vocab.txt")),
)

...

# ref_audio_orig, ref_text, gen_text, and speed must be defined before this point.
ref_audio, ref_text = preprocess_ref_audio_text(ref_audio_orig, ref_text)
print("Reference text: {} (audio file: {})".format(ref_text, ref_audio_orig))

# Synthesize gen_text in the voice of the reference audio.
final_wave, final_sample_rate, combined_spectrogram = infer_process(
    ref_audio,
    ref_text,
    gen_text,
    model,
    vocoder,
    cross_fade_duration=0.15,
    nfe_step=32,
    speed=speed,
)
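
The example above stops once the waveform is in memory. As a minimal sketch of one way to write it to disk, the helper below uses only the Python standard library. It assumes final_wave is a one-dimensional sequence of float samples in [-1.0, 1.0] and final_sample_rate is an integer rate in Hz. The name save_wav, the output path, and the test tone (standing in for real model output) are hypothetical and not part of F5-TTS.

```python
import math
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(bytes(frames))


# Stand-in for final_wave / final_sample_rate: a 0.5 s, 440 Hz test tone.
# The 24000 Hz rate is an assumption chosen for illustration only.
demo_rate = 24000
demo_wave = [0.3 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate // 2)]
save_wav("output.wav", demo_wave, demo_rate)
```

With real model output, the call would be save_wav("output.wav", final_wave, final_sample_rate). An audio library such as soundfile is a common alternative when it is already installed.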

Applications

  • Virtual assistants (e.g., chatbots, AI voice interactions)
  • Audiobooks and content narration
  • Accessibility tools for visually impaired users
  • Automated announcements and voiceovers

Limitations & Biases

  • May struggle with uncommon words or names.
  • Limited support for different accents or dialects.
  • Background noise or pronunciation inconsistencies may occur.
  • Repeated or duplicated speech segments may occur.

Citation

If you use this model, please cite:

@misc{zalopay-vietnamese-tts,
  title={Zalopay Vietnamese Text-to-Speech Model},
  author={Zalopay},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/zalopay/vietnamese-tts}
}

Acknowledgments

Special thanks to F5-TTS for providing such a wonderful base model and framework.

Base Model

  • SWivid/F5-TTS (this model is a fine-tuned version)