A newer version of this model is available: B-K/umt5-thai-g2p-v2-0.5k

umt5-thai-g2p

This model is a fine-tuned version of google/umt5-small on the B-K/thai-g2p dataset for Thai Grapheme-to-Phoneme (G2P) conversion.

It achieves the following results on the sentence evaluation set:

  • Loss: 1.5449
  • CER: 0.094

Model Description

umt5-thai-g2p is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.

Intended uses & limitations

Intended Uses

  • Thai Grapheme-to-Phoneme (G2P) Conversion: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
  • Speech Synthesis Preprocessing: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing.

Limitations

  • Accuracy: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
  • Out-of-Distribution Data: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the B-K/thai-g2p training dataset. This includes very rare words, neologisms, or complex named entities.
  • Ambiguity: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
  • Sentence-Level vs. Word-Level: While trained on a dataset that includes sentences, its robustness on very long or highly complex sentences may vary; the average generated length observed during training was around 27 tokens. For much longer inputs, see the chunking sketch after this list.
  • Inherited Limitations: As a fine-tuned version of google/umt5-small, it inherits the general architectural limitations and scale of the base model.
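
One pragmatic workaround for inputs much longer than the training distribution is to split the text at whitespace (Thai typically places spaces between phrases or clauses) and transcribe each chunk separately. The sketch below is illustrative only; the max_chars threshold and the greedy packing heuristic are assumptions, not part of the model.

def chunk_thai(text, max_chars=60):
    """Greedily pack whitespace-separated phrases into chunks of bounded length."""
    chunks, current = [], ""
    for phrase in text.split():
        candidate = (current + " " + phrase).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = phrase
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to the model as shown in "How to use" below.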

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ" # Example Thai text
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

# Beam search with 3 beams; 48 new tokens covers typical sentence-level outputs.
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")

Training procedure

Training Hyperparameters

The following hyperparameters were used during training (a sketch of equivalent Seq2SeqTrainingArguments follows the list):

  • optimizer: adamw_torch
  • learning_rate: 5e-4, decaying to 5e-6 over training
  • lr_scheduler_type: cosine
  • num_train_epochs: approximately 200 in total (training settings were tuned across multiple runs)
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • weight_decay: 0.01, increased to 0.1 over the course of training
  • label_smoothing_factor: 0.1
  • max_grad_norm: 1.0
  • warmup_steps: 100
  • mixed_precision: bf16
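
The settings above can be approximated with Transformers' Seq2SeqTrainingArguments. This is a sketch, not the original training script; values that varied across runs (learning-rate floor, weight decay, epoch count) are pinned to single illustrative settings here.

from transformers import Seq2SeqTrainingArguments

# Sketch mirroring the hyperparameters listed above (not the original script).
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",
    optim="adamw_torch",
    learning_rate=5e-4,               # decayed toward ~5e-6 by the cosine schedule
    lr_scheduler_type="cosine",
    num_train_epochs=200,             # approximate; tuned across runs
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                # raised toward 0.1 in later runs
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    predict_with_generate=True,
    eval_strategy="epoch",
)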

Training results

Training Loss | Epoch | Step | Validation Loss | CER    | Gen Len
No log        | 1.0   |  134 | 1.5636          | 0.0917 | 27.1747
No log        | 2.0   |  268 | 1.5603          | 0.093  | 27.1781
No log        | 3.0   |  402 | 1.5566          | 0.0938 | 27.1729
1.1631        | 4.0   |  536 | 1.5524          | 0.0941 | 27.1678
1.1631        | 5.0   |  670 | 1.5508          | 0.0939 | 27.113
1.1631        | 6.0   |  804 | 1.5472          | 0.0932 | 27.1575
1.1631        | 7.0   |  938 | 1.5450          | 0.0933 | 27.1421
1.1603        | 8.0   | 1072 | 1.5449          | 0.094  | 27.0616
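
The CER column is a character error rate; it can be computed, for example, with the evaluate library's cer metric (which depends on jiwer). The prediction and reference strings below are placeholders, not actual dataset entries.

import evaluate  # pip install evaluate jiwer

cer_metric = evaluate.load("cer")
predictions = ["placeholder phoneme string"]  # model output (placeholder)
references  = ["placeholder phoneme string"]  # gold transcription (placeholder)
print(cer_metric.compute(predictions=predictions, references=references))  # 0.0 for identical strings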

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1
  • Datasets 3.6.0
  • Tokenizers 0.21.0