English to Thai Transliteration Model
This model transliterates English text to Thai script, converting the sound of English words into Thai characters.
Model Description
This model is based on ByT5, a token-free sequence-to-sequence model that operates directly on UTF-8 bytes. It has been fine-tuned specifically for English to Thai transliteration using multiple data sources to improve accuracy and coverage.
- Developed by: yacht
- Model type: ByT5 (Sequence-to-Sequence)
- Language(s): English → Thai
- License: MIT (free for commercial use)
- FP16 Support: Yes (model supports half-precision inference)
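To make the byte-level input concrete, the sketch below reproduces in plain Python the layout usually described for the ByT5 vocabulary: each UTF-8 byte becomes one token id, shifted by 3 to leave room for the pad, end-of-sequence, and unknown tokens. The helper name `byt5_style_ids` and the constants are illustrative assumptions; the tokenizer shipped with the model remains authoritative.

```python
from typing import List

# Assumed layout of the standard ByT5 vocabulary: ids 0-2 are reserved for
# pad/eos/unk, so a byte value b maps to token id b + 3.
SPECIAL_TOKEN_OFFSET = 3
EOS_ID = 1  # assumption: end-of-sequence id in that vocabulary


def byt5_style_ids(text: str) -> List[int]:
    """Map text to one id per UTF-8 byte, followed by an end-of-sequence id."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


if __name__ == "__main__":
    for word in ("hello", "กราฟ"):
        ids = byt5_style_ids(word)
        print(f"{word!r}: {len(word)} characters -> {len(ids)} ids")
```

Each character in the Thai Unicode block occupies three UTF-8 bytes, so Thai output is about three times longer in tokens than in characters. This matters when choosing a generation length limit.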
Intended Uses & Limitations
Intended Uses
- Converting English names, places, and terms into Thai script
- Assisting with the transliteration of foreign words into Thai
- Educational purposes for learning Thai script
- Improving accessibility of English content for Thai speakers
Limitations
- The model may struggle with uncommon or complex English words
- Transliteration quality depends on the training data coverage
- The model focuses on phonetic conversion, not translation
Training and Evaluation
Training Data
The model was trained on a combined dataset of English-Thai transliteration pairs from multiple sources. The dataset includes:
- Common English words and their Thai transliterations
- Names of people, places, and organizations
- Technical terms and other domain-specific vocabulary
- Geological and scientific terminology
Training Procedure
- Training framework: Hugging Face Transformers
- Base model: google/byt5-base
- Training hyperparameters:
  - Learning rate: 2e-4
  - Batch size: 8
  - Number of epochs: 10
  - Optimizer: AdamW
  - Mixed precision: FP16 (model was trained with mixed precision)
  - Gradient clipping: Yes (max_grad_norm=1.0)
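The listed hyperparameters could be expressed with the Transformers `Seq2SeqTrainingArguments` class roughly as follows. This is a sketch, not the author's training script: the card does not say whether the batch size is per device, which AdamW implementation was used, or what output directory, scheduler, or warmup settings applied. The `output_dir` argument and the `build_training_arguments` helper are hypothetical.

```python
# Hyperparameters as listed in this model card. How they were passed to the
# trainer is NOT documented; the mapping below is an illustrative assumption.
LISTED_HYPERPARAMETERS = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 8,  # assumption: "batch size" is per device
    "num_train_epochs": 10,
    "optim": "adamw_torch",            # assumption: the PyTorch AdamW optimizer
    "max_grad_norm": 1.0,              # gradient clipping threshold
}


def build_training_arguments(output_dir: str):
    """Build Seq2SeqTrainingArguments from the listed values (sketch only)."""
    import torch
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU
        **LISTED_HYPERPARAMETERS,
    )
```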
Evaluation Results
- Accuracy: 0.7831
- Character Error Rate: 0.0591
- Mean Levenshtein Distance: 0.4654
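The card does not define how these metrics were computed. As a minimal sketch, the code below uses common definitions: accuracy is the share of exact matches, character error rate is the total edit distance divided by the total reference length, and mean Levenshtein distance is the average edit distance per word. Counting characters as Unicode code points is an assumption, and the `evaluate` helper is hypothetical.

```python
from typing import Dict, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance over Unicode code points (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def evaluate(pairs: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """pairs holds (prediction, reference) strings; returns the three metrics."""
    distances = [levenshtein(pred, ref) for pred, ref in pairs]
    total_ref_chars = sum(len(ref) for _, ref in pairs)
    return {
        "accuracy": sum(d == 0 for d in distances) / len(pairs),
        "cer": sum(distances) / total_ref_chars,
        "mean_levenshtein": sum(distances) / len(pairs),
    }


if __name__ == "__main__":
    # (prediction, reference) pairs taken from this card's examples and error list
    sample = [("กราฟ", "กราฟ"), ("กรูป", "กรุ๊ป"), ("โกล์ฟ", "กอล์ฟ")]
    print(evaluate(sample))
```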
How to Use
Standard Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load model and tokenizer
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Transliterate English to Thai
english_text = "hello"
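inputs = tokenizer(english_text, return_tensors="pt")
# ByT5 emits one token per UTF-8 byte (three per Thai character), so set an explicit output budget
outputs = model.generate(**inputs, max_new_tokens=64)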
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")
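For more than one word at a time, the inputs can be padded and passed to a single `generate()` call. The sketch below assumes the tokenizer and model are loaded as in the standard usage above. The `transliterate_batch` and `demo` helpers and the `max_new_tokens=64` budget are illustrative choices, not part of the model's API.

```python
from typing import List


def transliterate_batch(words: List[str], tokenizer, model, max_new_tokens: int = 64) -> List[str]:
    """Transliterate several words in one generate() call."""
    # padding=True pads shorter inputs; the attention mask marks the padding
    # positions so the model ignores them.
    inputs = tokenizer(words, return_tensors="pt", padding=True)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def demo() -> None:
    """Download the published checkpoint and transliterate two words."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "yacht/byt5-base-en2th-transliterator"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    words = ["hello", "computer"]
    for english, thai in zip(words, transliterate_batch(words, tokenizer, model)):
        print(f"English: {english} → Thai: {thai}")
```

Call `demo()` to run the helper against the published checkpoint.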
Using with FP16 for Faster Inference
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Use FP16 only when a CUDA GPU is available; fall back to FP32 on CPU
model_name = "yacht/byt5-base-en2th-transliterator"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype)
# Move the model to the same device as the inputs
model = model.to(device)
# Transliterate English to Thai
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")
Examples
English | Thai |
---|---|
hello | เฮลโล |
computer | คอมพิวเตอร์ |
thailand | ไทยแลนด์ |
bangkok | แบงค็อก |
graph | กราฟ |
grossular | กรอสซูลาร์ |
grossularite | กรอสซูลาไรต์ |
Performance Benefits of FP16
On a CUDA GPU with hardware FP16 support, half-precision inference typically offers:
- Up to 2x faster inference
- Roughly half the memory for model weights compared to FP32
- Minimal impact on transliteration quality
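The memory saving follows from the storage size of each weight: 4 bytes in FP32 versus 2 bytes in FP16. The sketch below works through that arithmetic for the weights alone. The parameter count of roughly 580 million for google/byt5-base is an assumption, and activations and generation buffers are not included.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Assumption: google/byt5-base has roughly 580 million parameters. For an exact
# count, sum p.numel() over model.parameters() on the loaded model.
APPROX_NUM_PARAMS = 580_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("fp32", "fp16"):
        print(f"{dtype}: {weight_memory_gib(APPROX_NUM_PARAMS, dtype):.2f} GiB")
```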
Multi-Dataset Training
This model was trained on multiple datasets merged into one training set, which provides several advantages:
- Broader vocabulary coverage across different domains
- Improved handling of edge cases and uncommon words
- More consistent transliteration patterns
- Better generalization to new, unseen words
Limitations and Bias
This model is designed specifically for transliteration, not translation. It attempts to convert the sounds of English words into Thai script, not to provide their Thai translations.
The model's performance may vary based on:
- The phonetic complexity of the input
- Whether the input contains sounds that are difficult to represent in Thai
- The coverage of similar words in the training data
Common Errors
Some common error patterns observed:
- group → กรูป (should be: กรุ๊ป)
- golf → โกล์ฟ (should be: กอล์ฟ)
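Errors like these often come down to a single vowel sign or tone mark, which is hard to spot visually because Thai combining marks stack on the preceding consonant. As an illustrative aid, the sketch below lists the differing code points by their Unicode names using only the Python standard library. The helper names `diff_code_points` and `describe` are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List, Tuple


def diff_code_points(predicted: str, reference: str) -> List[Tuple[str, str, str]]:
    """Return (operation, predicted_part, reference_part) for each differing span."""
    matcher = SequenceMatcher(a=predicted, b=reference, autojunk=False)
    return [
        (tag, predicted[i1:i2], reference[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def describe(text: str) -> str:
    """Spell out each code point by its Unicode name; useful for combining marks."""
    return ", ".join(unicodedata.name(ch, "UNKNOWN") for ch in text) or "(nothing)"


if __name__ == "__main__":
    # (model output, expected form) pairs from the error list above
    for predicted, reference in [("กรูป", "กรุ๊ป"), ("โกล์ฟ", "กอล์ฟ")]:
        for tag, got, want in diff_code_points(predicted, reference):
            print(f"{predicted} vs {reference}: {tag}: {describe(got)} -> {describe(want)}")
```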
License
MIT License
Copyright (c) 2025 yacht
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.