	---
	language:
	- vi
	license: bsd-3-clause
	base_model: facebook/mbart-large-50
	tags:
	- generated_from_trainer
	metrics:
	- bleu
	model-index:
	- name: PhoTextNormalization
	  results:
	  - task:
	      name: Translation
	      type: translation
	    metrics:
	    - name: Bleu
	      type: bleu
	      value: 88.8267
	---

	# PhoTextNormalization: Text normalization model for Vietnamese

	PhoTextNormalization converts Vietnamese text from written form to spoken form. For example, "Một tháng có 30 hoặc 31 ngày, riêng tháng 2 có 28 ngày." is converted to "một tháng có ba mươi hoặc ba mươi mốt ngày, riêng tháng hai có hai tám ngày."

	Details of the training can be found in our [ACL 2025 paper](https://arxiv.org/abs/2506.01322):

	```bibtex
	@inproceedings{vu2025zeroshottexttospeechvietnamese,
	  title={Zero-Shot Text-to-Speech for Vietnamese},
	  author={Thi Vu and Linh The Nguyen and Dat Quoc Nguyen},
	  year={2025},
	  booktitle={Proceedings of ACL},
	}
	```

	## Usage
	```python
	import torch
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	device = "cuda:0" if torch.cuda.is_available() else "cpu"

	model_name = "thivux/PhoTextNormalization"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

	text = 'Một tháng có 30 hoặc 31 ngày, riêng tháng 2 có 28 ngày.'
	inputs = tokenizer(text, return_tensors="pt", padding=True,
	                   truncation=True, max_length=1024).to(device)

	# Generate the spoken-form token IDs with beam search (no gradients needed)
	with torch.no_grad():
	    generated_tokens = model.generate(
	        **inputs, max_length=1024, num_beams=5)

	# Decode the token IDs back into text
	decoded_outputs = [tokenizer.decode(output, skip_special_tokens=True)
	                   for output in generated_tokens]

	# decoded_outputs: ['một tháng có ba mươi hoặc ba mươi mốt ngày, riêng tháng hai có hai tám ngày.']
	print(f'decoded_outputs: {decoded_outputs}')
	```
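
The snippet above normalizes a single sentence. To normalize many sentences, one option is to feed them to the model in fixed-size batches. The following is a minimal sketch of that idea, not part of the released code: the helper names `batched`, `normalize_all`, and `make_normalizer`, and the default `batch_size=8`, are assumptions chosen for illustration. `make_normalizer` reuses the same tokenizer and `generate` settings as the usage example.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def normalize_all(
    texts: Iterable[str],
    normalize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # hypothetical default; tune to available memory
) -> List[str]:
    """Apply `normalize_batch` batch by batch, preserving input order."""
    outputs: List[str] = []
    for batch in batched(texts, batch_size):
        results = normalize_batch(batch)
        if len(results) != len(batch):
            raise ValueError("normalize_batch must return one output per input")
        outputs.extend(results)
    return outputs


def make_normalizer(tokenizer, model, device, max_length=1024, num_beams=5):
    """Wrap a loaded tokenizer and model as a batch-normalization callable."""
    import torch  # imported lazily so the helpers above work without torch

    def normalize_batch(batch: List[str]) -> List[str]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            generated = model.generate(
                **inputs, max_length=max_length, num_beams=num_beams)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return normalize_batch
```

With `tokenizer`, `model`, and `device` defined as in the usage example, `normalize_all(sentences, make_normalizer(tokenizer, model, device))` would return one spoken-form string per input sentence, in the original order.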