metadata

language:
  - vi
license: bsd-3-clause
base_model: facebook/mbart-large-50
tags:
  - generated_from_trainer
metrics:
  - bleu
model-index:
  - name: PhoTextNormalization
    results:
      - task:
          name: Translation
          type: translation
        metrics:
          - name: Bleu
            type: bleu
            value: 88.8267

PhoTextNormalization: Text normalization model for Vietnamese

PhoTextNormalization converts Vietnamese text from its written form to its spoken form, for example by spelling out numbers as words. The input "Một tháng có 30 hoặc 31 ngày, riêng tháng 2 có 28 ngày." is converted to "một tháng có ba mươi hoặc ba mươi mốt ngày, riêng tháng hai có hai tám ngày."

Training details are described in our ACL 2025 paper:

@inproceedings{vu2025zeroshottexttospeechvietnamese,
      title={Zero-Shot Text-to-Speech for Vietnamese}, 
      author={Thi Vu and Linh The Nguyen and Dat Quoc Nguyen},
      year={2025},
      booktitle={Proceedings of ACL},
}

Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "thivux/PhoTextNormalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

text = 'Một tháng có 30 hoặc 31 ngày, riêng tháng 2 có 28 ngày.'
inputs = tokenizer(text, return_tensors="pt", padding=True,
                    truncation=True, max_length=1024).to(device)

# Generate the normalized (spoken-form) token sequence with beam search
with torch.no_grad():
    translated_tokens = model.generate(
        **inputs, max_length=1024, num_beams=5)

# Decode the generated token IDs back into text
decoded_outputs = [tokenizer.decode(output, skip_special_tokens=True)
                    for output in translated_tokens]

# decoded_outputs: ['một tháng có ba mươi hoặc ba mươi mốt ngày, riêng tháng hai có hai tám ngày.']
print(f'decoded_outputs: {decoded_outputs}')
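
To normalize several sentences at once, the steps above can be wrapped in a small helper that processes the inputs in fixed-size batches. The following is a minimal sketch, not part of the released code: the names `batched` and `normalize_texts` and the default `batch_size=8` are assumptions made for illustration, while the tokenizer and generation arguments mirror the example above. It relies on `generate` disabling gradient tracking internally, as recent versions of transformers are understood to do; wrap the call in `torch.no_grad()` if in doubt.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def normalize_texts(texts: Sequence[str], tokenizer, model, device: str = "cpu",
                    batch_size: int = 8, max_length: int = 1024,
                    num_beams: int = 5) -> List[str]:
    """Hypothetical helper: normalize sentences batch by batch, keeping input order."""
    outputs: List[str] = []
    for batch in batched(texts, batch_size):
        # Pad to the longest sentence in the batch, as in the single-sentence example.
        inputs = tokenizer(list(batch), return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length).to(device)
        # Assumption: generate() runs without gradient tracking on its own.
        generated = model.generate(**inputs, max_length=max_length,
                                   num_beams=num_beams)
        # batch_decode turns every generated sequence back into a string.
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

With the `tokenizer`, `model`, and `device` objects loaded as above, a call such as `normalize_texts([text, text], tokenizer, model, device=device)` would return one normalized string per input sentence. A smaller `batch_size` lowers peak memory use at the cost of more `generate` calls.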