English to Thai Transliteration Model

This model transliterates English text to Thai script, converting the sound of English words into Thai characters.

Model Description

This model is based on ByT5, a token-free sequence-to-sequence model that operates directly on UTF-8 bytes. It has been fine-tuned specifically for English to Thai transliteration using multiple data sources to improve accuracy and coverage.
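To make the byte-level input concrete, the sketch below mimics how ByT5 maps text to token IDs: each UTF-8 byte becomes one token, shifted by 3 to leave room for the pad, EOS, and unknown tokens, and an EOS token (ID 1) is appended. The helper name `byt5_ids` is illustrative and not part of any library; in practice the tokenizer loaded with the model performs this step.

```python
from typing import List

def byt5_ids(text: str, offset: int = 3, eos_id: int = 1) -> List[int]:
    """Approximate ByT5 tokenization: one ID per UTF-8 byte, then EOS."""
    return [byte + offset for byte in text.encode("utf-8")] + [eos_id]

# ASCII letters take one byte each
print(byt5_ids("hello"))
# Thai characters take three bytes each, so sequences grow quickly
print(len(byt5_ids("เฮลโล")))
```

Because every Thai character costs three tokens, generation length limits for this model are effectively counted in bytes, not characters.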

Developed by: yacht
Model type: ByT5 (Sequence-to-Sequence)
Language(s): English → Thai
License: MIT (free for commercial use)
FP16 Support: Yes (model supports half-precision inference)

Intended Uses & Limitations

Intended Uses

Converting English names, places, and terms into Thai script
Assisting with the transliteration of foreign words into Thai
Educational purposes for learning Thai script
Improving accessibility of English content for Thai speakers

Limitations

The model may struggle with uncommon or complex English words
Transliteration quality depends on the training data coverage
The model focuses on phonetic conversion, not translation

Training and Evaluation

Training Data

The model was trained on a combined dataset of English-Thai transliteration pairs from multiple sources. The dataset includes:

Common English words and their Thai transliterations
Names of people, places, and organizations
Technical terms and other domain-specific vocabulary
Geological and scientific terminology

Training Procedure

Training framework: Hugging Face Transformers
Base model: google/byt5-base
Training hyperparameters:
- Learning rate: 2e-4
- Batch size: 8
- Number of epochs: 10
- Optimizer: AdamW
- Mixed precision: FP16
- Gradient clipping: Yes (max_grad_norm=1.0)
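As a rough sketch, the hyperparameters above could be expressed as keyword arguments for `Seq2SeqTrainingArguments` in Hugging Face Transformers. The original training script is not shown in this card, so the mapping is an assumption: in particular, treating the batch size of 8 as per device and choosing the `adamw_torch` optimizer variant are guesses.

```python
# Hypothetical mapping of the listed hyperparameters onto
# transformers.Seq2SeqTrainingArguments keyword arguments.
training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 8,  # assumption: listed batch size is per device
    "num_train_epochs": 10,
    "optim": "adamw_torch",            # assumption: PyTorch AdamW implementation
    "fp16": True,                      # mixed-precision training (needs a CUDA GPU)
    "max_grad_norm": 1.0,              # gradient clipping threshold
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

These values could then be unpacked into `Seq2SeqTrainingArguments` together with an output directory of your choice.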

Evaluation Results

Accuracy: 0.7831
Character Error Rate: 0.0591
Mean Levenshtein Distance: 0.4654
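The card does not define these metrics precisely. The sketch below shows one common way to compute them, assuming that accuracy means exact match, that character error rate is the total edit distance divided by the total reference length, and that the mean Levenshtein distance is averaged per example. The function names are illustrative, and the sample pairs only demonstrate the calculation; they are not the evaluation set.

```python
from typing import Dict, Sequence, Tuple

def levenshtein(a: str, b: str) -> int:
    """Edit distance over Unicode code points (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def evaluate(pairs: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Compute metrics for (prediction, reference) string pairs."""
    distances = [levenshtein(pred, ref) for pred, ref in pairs]
    return {
        "accuracy": sum(d == 0 for d in distances) / len(pairs),
        "cer": sum(distances) / sum(len(ref) for _, ref in pairs),
        "mean_levenshtein": sum(distances) / len(pairs),
    }

# Two incorrect predictions and one correct prediction
pairs = [("กรูป", "กรุ๊ป"), ("โกล์ฟ", "กอล์ฟ"), ("กราฟ", "กราฟ")]
print(evaluate(pairs))
```

Distances here are counted per code point, so a Thai vowel sign or tone mark counts as its own character.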

How to Use

Standard Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Transliterate English to Thai
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt")
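# ByT5 emits one token per UTF-8 byte (three per Thai character),
# so set an explicit output budget instead of relying on the default
outputs = model.generate(**inputs, max_new_tokens=128)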
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")

Using with FP16 for Faster Inference

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use fp16 on a CUDA GPU for faster inference; fall back to fp32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_name = "yacht/byt5-base-en2th-transliterator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move the model to the same device as the inputs
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype).to(device)

# Transliterate English to Thai
english_text = "hello"
inputs = tokenizer(english_text, return_tensors="pt").to(device)
# Output length is counted in bytes (three per Thai character)
outputs = model.generate(**inputs, max_new_tokens=128)
thai_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"English: {english_text} → Thai: {thai_text}")

Examples

English	Thai
hello	เฮลโล
computer	คอมพิวเตอร์
thailand	ไทยแลนด์
bangkok	แบงค็อก
graph	กราฟ
grossular	กรอสซูลาร์
grossularite	กรอสซูลาไรต์

Performance Benefits of FP16

Using FP16 (half-precision) can provide significant performance benefits:

Up to 2x faster inference on compatible GPUs
Reduced memory usage (approximately half compared to FP32)
Minimal impact on transliteration quality
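The memory claim follows from the storage size per parameter: 4 bytes in FP32 versus 2 bytes in FP16. The sketch below works through the arithmetic for the weights only. The parameter count of roughly 582 million for ByT5-base is an approximate figure used for illustration, and activations and framework overhead are ignored.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

NUM_PARAMS = 582_000_000  # assumption: approximate size of ByT5-base

fp32 = weight_memory_gib(NUM_PARAMS, 4)
fp16 = weight_memory_gib(NUM_PARAMS, 2)
print(f"FP32: {fp32:.2f} GiB, FP16: {fp16:.2f} GiB, ratio: {fp16 / fp32:.2f}")
```

Actual savings during inference depend on batch size and sequence length, since activation memory is not included in this estimate.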

Multi-Dataset Training

This model was trained on a combination of several datasets, which provides the following advantages:

Broader vocabulary coverage across different domains
Improved handling of edge cases and uncommon words
More consistent transliteration patterns
Better generalization to new, unseen words

Limitations and Bias

This model is designed specifically for transliteration, not translation. It attempts to convert the sounds of English words into Thai script, not to provide their Thai translations.

The model's performance may vary based on:

The phonetic complexity of the input
Whether the input contains sounds that are difficult to represent in Thai
The coverage of similar words in the training data

Common Errors

Some common error patterns observed (model output, followed by the expected transliteration):

group → กรูป (should be: กรุ๊ป)
golf → โกล์ฟ (should be: กอล์ฟ)
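To inspect such errors at the character level, a small sketch using Python's standard `difflib` module can show which code points differ between the model output and the expected spelling. The helper name `char_diff` is illustrative. Thai vowel signs and tone marks are separate code points, so a single visible syllable may span several characters in the result.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def char_diff(predicted: str, expected: str) -> List[Tuple[str, str, str]]:
    """Return (operation, predicted_part, expected_part) for each differing span."""
    matcher = SequenceMatcher(a=predicted, b=expected, autojunk=False)
    return [
        (tag, predicted[i1:i2], expected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# The two error examples listed above: (model output, expected)
for predicted, expected in [("กรูป", "กรุ๊ป"), ("โกล์ฟ", "กอล์ฟ")]:
    print(predicted, "→", expected, char_diff(predicted, expected))
```

Grouping many such diffs by the differing span is one way to see whether errors cluster around particular vowels or tone marks.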

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.