A newer version of this model is available: B-K/umt5-thai-g2p-v2-0.5k

umt5-thai-g2p

This model is a fine-tuned version of google/umt5-small on the B-K/thai-g2p dataset for Thai Grapheme-to-Phoneme (G2P) conversion.

It achieves the following results on the sentence evaluation set:

  • Loss: 1.5449
  • CER: 0.094

Model Description

umt5-thai-g2p is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.

Intended uses & limitations

Intended Uses

  • Thai Grapheme-to-Phoneme (G2P) Conversion: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
  • Speech Synthesis Preprocessing: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing.

Limitations

  • Accuracy: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
  • Out-of-Distribution Data: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the B-K/thai-g2p training dataset. This includes very rare words, neologisms, or complex named entities.
  • Ambiguity: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
  • Sentence-Level vs. Word-Level: While trained on a dataset that includes sentences, its robustness on very long or highly complex sentences may vary; the average generated length observed during training was around 27 tokens. For much longer inputs, see the chunking sketch after this list.
  • Inherited Limitations: As a fine-tuned version of google/umt5-small, it inherits the general architectural limitations and scale of the base model.
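
One pragmatic workaround for inputs much longer than the training distribution is to split the text at whitespace (Thai typically places spaces between phrases or clauses) and transcribe each chunk separately. The sketch below is illustrative only; the max_chars threshold and the greedy packing heuristic are assumptions, not part of the model.

def chunk_thai(text, max_chars=60):
    """Greedily pack whitespace-separated phrases into chunks of bounded length."""
    chunks, current = [], ""
    for phrase in text.split():
        candidate = (current + " " + phrase).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = phrase
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to the model as shown in "How to use" below.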

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ" # Example Thai text
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

# Beam search with 3 beams; 48 new tokens covers typical sentence-level outputs.
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")

Training procedure

Training Hyperparameters

The following hyperparameters were used during training (a sketch of equivalent Seq2SeqTrainingArguments follows the list):

  • optimizer: adamw_torch
  • learning_rate: 5e-4, decaying to 5e-6 over training
  • lr_scheduler_type: cosine
  • num_train_epochs: approximately 200 in total (training settings were tuned across multiple runs)
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • weight_decay: 0.01, increased to 0.1 over the course of training
  • label_smoothing_factor: 0.1
  • max_grad_norm: 1.0
  • warmup_steps: 100
  • mixed_precision: bf16
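
The settings above can be approximated with Transformers' Seq2SeqTrainingArguments. This is a sketch, not the original training script; values that varied across runs (learning-rate floor, weight decay, epoch count) are pinned to single illustrative settings here.

from transformers import Seq2SeqTrainingArguments

# Sketch mirroring the hyperparameters listed above (not the original script).
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",
    optim="adamw_torch",
    learning_rate=5e-4,               # decayed toward ~5e-6 by the cosine schedule
    lr_scheduler_type="cosine",
    num_train_epochs=200,             # approximate; tuned across runs
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                # raised toward 0.1 in later runs
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    predict_with_generate=True,
    eval_strategy="epoch",
)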

Training results

Training Loss | Epoch | Step | Validation Loss | CER    | Gen Len
No log        | 1.0   |  134 | 1.5636          | 0.0917 | 27.1747
No log        | 2.0   |  268 | 1.5603          | 0.093  | 27.1781
No log        | 3.0   |  402 | 1.5566          | 0.0938 | 27.1729
1.1631        | 4.0   |  536 | 1.5524          | 0.0941 | 27.1678
1.1631        | 5.0   |  670 | 1.5508          | 0.0939 | 27.113
1.1631        | 6.0   |  804 | 1.5472          | 0.0932 | 27.1575
1.1631        | 7.0   |  938 | 1.5450          | 0.0933 | 27.1421
1.1603        | 8.0   | 1072 | 1.5449          | 0.094  | 27.0616
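
The CER column is a character error rate; it can be computed, for example, with the evaluate library's cer metric (which depends on jiwer). The prediction and reference strings below are placeholders, not actual dataset entries.

import evaluate  # pip install evaluate jiwer

cer_metric = evaluate.load("cer")
predictions = ["placeholder phoneme string"]  # model output (placeholder)
references  = ["placeholder phoneme string"]  # gold transcription (placeholder)
print(cer_metric.compute(predictions=predictions, references=references))  # 0.0 for identical strings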

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1
  • Datasets 3.6.0
  • Tokenizers 0.21.0