---
library_name: transformers
license: apache-2.0
model-index:
  - name: umt5-thai-g2p-9
    results:
      - task:
          type: text2text-generation
          name: Grapheme-to-Phoneme Conversion
        dataset:
          name: B-K/thai-g2p
          type: B-K/thai-g2p
          config: default
          split: sentence_validation
        metrics:
          - type: cer
            value: 0.094
            name: Character Error Rate
          - type: loss
            value: 1.5449
            name: Loss
datasets:
  - B-K/thai-g2p
language:
  - th
metrics:
  - cer
pipeline_tag: text2text-generation
widget:
  - text: สวัสดีครับ
    example_title: Thai G2P Example
new_version: B-K/umt5-thai-g2p-v2-0.5k
---

umt5-thai-g2p

This model is a fine-tuned version of google/umt5-small on the B-K/thai-g2p dataset for Thai Grapheme-to-Phoneme (G2P) conversion.

It achieves the following results on the sentence_validation evaluation split:

  • Loss: 1.5449
  • CER: 0.094
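
The reported CER can be recomputed with the Hugging Face evaluate library (its "cer" metric is backed by the jiwer package) once predictions have been generated for the validation split. A minimal sketch follows; the example strings are illustrative placeholders, not taken from the dataset:

import evaluate

# Illustrative strings only; in practice, use the model's decoded outputs and
# the reference transcriptions from the B-K/thai-g2p sentence_validation split.
predictions = ["sa . wat . diː . kʰrap"]
references = ["sa . wat . diː . kʰrap"]

cer_metric = evaluate.load("cer")  # character error rate metric
score = cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {score:.3f}")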

Model Description

umt5-thai-g2p is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.

Intended uses & limitations

Intended Uses

  • Thai Grapheme-to-Phoneme (G2P) Conversion: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
  • Speech Synthesis Preprocessing: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing.

Limitations

  • Accuracy: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
  • Out-of-Distribution Data: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the B-K/thai-g2p training dataset. This includes very rare words, neologisms, or complex named entities.
  • Ambiguity: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
  • Sentence-Level vs. Word-Level: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary. The average generated length observed during training was around 27 tokens.
  • Inherited Limitations: As a fine-tuned version of google/umt5-small, it inherits the general architectural limitations and scale of the base model.

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ" # Example Thai text
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

# Beam search with a generous cap; generated IPA sequences averaged about 27 tokens during training
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")

Training procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • optimizer: adamw_torch
  • learning_rate: 5e-4, decaying to about 5e-6 by the end of training
  • lr_scheduler_type: cosine
  • num_train_epochs: roughly 200 in total (the training configuration was adjusted several times across runs)
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • weight_decay: 0.01 initially, increased to 0.1 over the course of training
  • label_smoothing_factor: 0.1
  • max_grad_norm: 1.0
  • warmup_steps: 100
  • mixed_precision: bf16
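
For reference, a Seq2SeqTrainingArguments configuration approximating these settings is sketched below. It is only an approximation: the learning rate, weight decay, and epoch count were adjusted between runs, and output_dir plus the evaluation strategy are illustrative rather than taken from the original run.

from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the reported settings; values marked in the
# comments were changed between runs and are not exact.
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",       # illustrative
    optim="adamw_torch",
    learning_rate=5e-4,               # the card reports a final value near 5e-6
    lr_scheduler_type="cosine",
    num_train_epochs=200,             # approximate total across runs
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                # later runs reportedly used up to 0.1
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    eval_strategy="epoch",            # illustrative
    predict_with_generate=True,
)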

Training results

| Training Loss | Epoch | Step | Validation Loss | CER    | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|
| No log        | 1.0   | 134  | 1.5636          | 0.0917 | 27.1747 |
| No log        | 2.0   | 268  | 1.5603          | 0.093  | 27.1781 |
| No log        | 3.0   | 402  | 1.5566          | 0.0938 | 27.1729 |
| 1.1631        | 4.0   | 536  | 1.5524          | 0.0941 | 27.1678 |
| 1.1631        | 5.0   | 670  | 1.5508          | 0.0939 | 27.113  |
| 1.1631        | 6.0   | 804  | 1.5472          | 0.0932 | 27.1575 |
| 1.1631        | 7.0   | 938  | 1.5450          | 0.0933 | 27.1421 |
| 1.1603        | 8.0   | 1072 | 1.5449          | 0.094  | 27.0616 |

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1
  • Datasets 3.6.0
  • Tokenizers 0.21.0