A newer version of this model is available: B-K/umt5-thai-g2p-v2-0.5k
umt5-thai-g2p
This model is a fine-tuned version of google/umt5-small on the B-K/thai-g2p dataset for Thai Grapheme-to-Phoneme (G2P) conversion.
It achieves the following results on the sentence evaluation set:
- Loss: 1.5449
- CER: 0.094
Model Description
umt5-thai-g2p is designed to convert Thai text (words or sentences) into its corresponding phonemic International Phonetic Alphabet (IPA) representation.
Intended uses & limitations
Intended Uses
- Thai Grapheme-to-Phoneme (G2P) Conversion: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
- Speech Synthesis Preprocessing: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing (a minimal sketch follows this list).
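The sketch below illustrates the pipeline shape described above: a small helper that a TTS front end could call before its acoustic model. The helper name text_to_phonemes and the generation settings are illustrative assumptions, not part of this repository.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

def text_to_phonemes(text: str) -> str:
    # Hypothetical helper: convert Thai text to an IPA phoneme string
    # for a downstream acoustic model.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=3, max_new_tokens=48)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# A TTS front end would call this before its acoustic model, e.g.:
# acoustic_model(text_to_phonemes("สวัสดีครับ"))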
Limitations
- Accuracy: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes (see the evaluation sketch after this list).
- Out-of-Distribution Data: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the B-K/thai-g2p training dataset. This includes very rare words, neologisms, and complex named entities.
- Ambiguity: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities to the intended pronunciation in all contexts.
- Sentence-Level vs. Word-Level: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary. The average generated length observed during training was around 27 tokens.
- Inherited Limitations: As a fine-tuned version of google/umt5-small, it inherits the general architectural limitations and scale of the base model.
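To make the CER figure concrete, here is a minimal sketch of how CER can be computed with the Hugging Face evaluate library (assumed to be installed; it is not listed under Framework versions). The prediction and reference strings are placeholders, not examples from the B-K/thai-g2p dataset.

import evaluate

cer_metric = evaluate.load("cer")
predictions = ["model output phonemes"]  # placeholder model output
references = ["gold standard phonemes"]  # placeholder reference transcription
print(cer_metric.compute(predictions=predictions, references=references))

CER is the character-level edit distance between prediction and reference divided by the reference length, so a CER of 0.094 corresponds to roughly 9 character errors per 100 reference characters.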
How to use
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ"  # Example Thai text

# Tokenize and generate the IPA phoneme sequence with beam search
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")
Training procedure
Training Hyperparameters
The following hyperparameters were used during training:
- optimizer: adamw_torch
- learning_rate: 5e-4, decayed to 5e-6 over training
- lr_scheduler_type: cosine
- num_train_epochs: approximately 200 in total (the training settings were tuned across multiple runs)
- per_device_train_batch_size: 128
- per_device_eval_batch_size: 128
- weight_decay: 0.01, increased to 0.1 over training
- label_smoothing_factor: 0.1
- max_grad_norm: 1.0
- warmup_steps: 100
- mixed_precision: bf16
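The exact training script is not included in this card, so the following is only an illustrative reconstruction of the hyperparameters listed above using transformers' Seq2SeqTrainingArguments; output_dir, num_train_epochs, and predict_with_generate are assumptions.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",          # assumed
    optim="adamw_torch",
    learning_rate=5e-4,                  # decayed toward 5e-6 over the run
    lr_scheduler_type="cosine",
    num_train_epochs=200,                # approximate; tuned across runs
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                   # later runs used 0.1
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    predict_with_generate=True,          # assumed; needed to report CER/Gen Len
)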
Training results
| Training Loss | Epoch | Step | Validation Loss | CER | Gen Len |
|---|---|---|---|---|---|
| No log | 1.0 | 134 | 1.5636 | 0.0917 | 27.1747 |
| No log | 2.0 | 268 | 1.5603 | 0.093 | 27.1781 |
| No log | 3.0 | 402 | 1.5566 | 0.0938 | 27.1729 |
| 1.1631 | 4.0 | 536 | 1.5524 | 0.0941 | 27.1678 |
| 1.1631 | 5.0 | 670 | 1.5508 | 0.0939 | 27.113 |
| 1.1631 | 6.0 | 804 | 1.5472 | 0.0932 | 27.1575 |
| 1.1631 | 7.0 | 938 | 1.5450 | 0.0933 | 27.1421 |
| 1.1603 | 8.0 | 1072 | 1.5449 | 0.094 | 27.0616 |
Framework versions
- Transformers 4.47.0
- Pytorch 2.5.1
- Datasets 3.6.0
- Tokenizers 0.21.0
Dataset used to train B-K/umt5-thai-g2p: B-K/thai-g2p
Evaluation results
- Character Error Rate on B-K/thai-g2p (self-reported): 0.094
- Loss on B-K/thai-g2p (self-reported): 1.545