# XLM-RoBERTa for Khmer-English Language Processing

## Model Description
This is a custom-trained XLM-RoBERTa-base model specifically designed for Khmer (ខ្មែរ) and English language processing. The model has been pretrained using masked language modeling (MLM) on a curated corpus of Khmer-English text data, making it highly effective for understanding and generating text in both languages.
## Key Features

- 🌟 **Bilingual Proficiency**: Understands both Khmer and English with high accuracy
- 🚀 **State-of-the-art Architecture**: Based on RoBERTa with optimized training
- 📚 **Domain Versatile**: Trained on diverse text covering multiple domains
- 🔧 **Ready-to-use**: Can be fine-tuned for downstream tasks or used directly
- ⚡ **Efficient**: Optimized for both inference speed and model size
## Model Details

| Attribute | Value |
|---|---|
| Model Type | XLM-RoBERTa (Transformer) |
| Architecture | RoBERTa-base |
| Languages | Khmer (km), English (en) |
| Vocabulary Size | 30,000 tokens |
| Parameters | 109,113,648 |
| Max Sequence Length | 512 tokens |
| Training Steps | 3,000 |
| Tokenizer | SentencePiece |
| License | Apache 2.0 |
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import RobertaForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model_name = "metythorn/khmer-xlm-roberta-base"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = RobertaForMaskedLM.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

def predict_mask(text):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt")

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits

    # Find masked token position
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

    # Get top 5 predictions
    mask_token_logits = predictions[0, mask_token_index, :]
    top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

    return [tokenizer.decode([token]).strip() for token in top_5_tokens]

# Example usage
khmer_text = "ប្រទេសកម្ពុជា គឺជាប្រទេស <mask> នៅអាស៊ីអាគ្នេយ៍។"
english_text = "The capital of Cambodia is <mask>."

print("Khmer predictions:", predict_mask(khmer_text))
print("English predictions:", predict_mask(english_text))
```
## Advanced Usage

### Text Classification Fine-tuning

```python
from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments

# Load model for classification
model = RobertaForSequenceClassification.from_pretrained(
    "metythorn/khmer-xlm-roberta-base",
    num_labels=2,  # Adjust based on your task
)

# Fine-tune on your classification dataset
# ... (add your training data and training loop)
```
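The elided training loop might look roughly like the following. This is only a sketch, assuming the 🤗 `datasets` library is installed; the toy examples, label scheme, output directory, and hyperparameters are placeholders, not anything shipped with the model:

```python
from datasets import Dataset
from transformers import PreTrainedTokenizerFast, Trainer, TrainingArguments

tokenizer = PreTrainedTokenizerFast.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Toy placeholder data -- replace with your own labelled Khmer/English examples
raw = Dataset.from_dict({
    "text": ["ខ្ញុំពេញចិត្តសេវាកម្មនេះ", "The service was terrible."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="khmer-clf",           # hypothetical output directory
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# `model` is the RobertaForSequenceClassification loaded in the snippet above
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```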
### Feature Extraction

```python
import torch
from transformers import RobertaModel, PreTrainedTokenizerFast

model_name = "metythorn/khmer-xlm-roberta-base"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)

# Load model for feature extraction
model = RobertaModel.from_pretrained(model_name)

def get_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the CLS token embedding or pool all token embeddings
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    return embeddings

# Extract embeddings
khmer_emb = get_embeddings("នេះជាប្រយោគខ្មែរ។")
english_emb = get_embeddings("This is an English sentence.")
```
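The mean-pooled vectors can be compared directly, for example with cosine similarity. A small follow-up using the `khmer_emb` and `english_emb` tensors from the snippet above:

```python
import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings (roughly -1 to 1)
similarity = F.cosine_similarity(khmer_emb, english_emb).item()
print(f"Cosine similarity: {similarity:.3f}")
```

Note that embeddings from a pretrained MLM are not tuned for semantic similarity, so such scores are only a rough signal without task-specific fine-tuning.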
## Training Details

### Training Configuration

| Parameter | Value |
|---|---|
| Training Framework | 🤗 Transformers + PyTorch |
| Batch Size | 8 per device |
| Gradient Accumulation | 4 steps |
| Effective Batch Size | 32 |
| Learning Rate | 5e-05 |
| Weight Decay | 0.01 |
| Warmup Steps | 2,000 |
| Max Grad Norm | 1.0 |
| Mixed Precision | FP16 |
| Gradient Checkpointing | ✅ Enabled |
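For orientation, hyperparameters like these map roughly onto 🤗 `TrainingArguments` as shown below. This is a sketch of an equivalent configuration, not the actual script used for the release, and the output directory is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",          # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # effective batch size 8 x 4 = 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=2000,
    max_grad_norm=1.0,
    fp16=True,                         # mixed precision
    gradient_checkpointing=True,
)
```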
### Training Objective

The model was trained using Masked Language Modeling (MLM) with:

- **Masking Probability**: 0.15 (15%)
- **Dynamic Masking**: Applied during training for better generalization
- **Whole Word Masking**: Implemented for multi-token words
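In the 🤗 Transformers stack, this objective is usually wired up through a data collator. The snippet below is an illustrative setup with the standard MLM collator at a 15% masking probability (whole-word masking would instead use a whole-word-masking collator such as `DataCollatorForWholeWordMask`); it is not the exact collator used for this model:

```python
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Dynamic masking: a fresh 15% of tokens is masked each time a batch is sampled
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
```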
### Dataset

- **Source**: Custom curated Khmer-English corpus
- **Domains**: News, literature, government documents, web content, technical documents
- **Size**: Multiple GB of cleaned text data
- **Languages**: Khmer (ខ្មែរ) and English
- **Preprocessing**: Cleaned, deduplicated, and filtered for quality
### Infrastructure

- **GPUs**: Multi-GPU training setup
- **Framework**: PyTorch with Transformers
- **Optimization**: Memory-efficient training with gradient checkpointing
- **Monitoring**: Comprehensive logging and checkpointing
## Performance

### Evaluation Metrics

**Note**: Detailed evaluation metrics will be updated as they become available.

| Task | Metric | Score |
|---|---|---|
| Masked Language Modeling | Perplexity | TBD |
| Downstream Task Fine-tuning | F1-Score | TBD |
### Capabilities

✅ **Strong Performance On:**

- Khmer text understanding and generation
- English text processing
- Code-switching between Khmer and English
- Cultural and contextual understanding
- Technical and formal text

⚠️ **Limitations:**

- Performance may vary on very domain-specific text
- Limited training on informal/slang text
- May require fine-tuning for specific downstream tasks
## Use Cases

### 🎯 Direct Applications

- **Text Completion**: Fill in missing words in Khmer/English text
- **Language Understanding**: Extract meaningful representations
- **Similarity Computation**: Calculate text similarity scores
- **Feature Extraction**: Get embeddings for ML pipelines
### 🔧 Fine-tuning Applications

- **Text Classification**: Sentiment analysis, document categorization
- **Named Entity Recognition**: Extract persons, locations, organizations (see the sketch after this list)
- **Question Answering**: Build QA systems for Khmer/English
- **Text Summarization**: Summarize documents in both languages
- **Machine Translation**: Improve Khmer-English translation quality
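As an example of the NER use case above, the checkpoint can be loaded into a token-classification head and then fine-tuned on labelled data. The label set here is hypothetical and only for illustration:

```python
from transformers import RobertaForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # hypothetical tag set

model = RobertaForTokenClassification.from_pretrained(
    "metythorn/khmer-xlm-roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```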
## Technical Specifications

### Model Architecture

- **Base Architecture**: RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Hidden Size**: 768
- **Intermediate Size**: 3072
- **Position Embeddings**: 514
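These figures can be sanity-checked against the published configuration; a small snippet, assuming the config ships alongside the checkpoint:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("metythorn/khmer-xlm-roberta-base")

print(config.num_hidden_layers)         # 12 hidden layers
print(config.num_attention_heads)       # 12 attention heads
print(config.hidden_size)               # 768
print(config.intermediate_size)         # 3072
print(config.max_position_embeddings)   # 514
print(config.vocab_size)                # 30,000
```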
### Tokenizer Details

- **Type**: SentencePiece
- **Vocabulary**: 30,000 tokens
- **Special Tokens**: `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`
- **Supports**: Both Khmer Unicode and English text
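A quick way to inspect the special tokens and see how Khmer and English are segmented (the exact sub-word splits depend on the trained SentencePiece vocabulary, so the output is only illustrative):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("metythorn/khmer-xlm-roberta-base")

print(tokenizer.special_tokens_map)                        # <s>, </s>, <pad>, <unk>, <mask>
print(tokenizer.tokenize("នេះជាប្រយោគខ្មែរ។"))                # Khmer sub-word pieces
print(tokenizer.tokenize("This is an English sentence."))  # English sub-word pieces
```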
## Ethical Considerations & Limitations

### Intended Use

This model is intended for research and development purposes in NLP applications involving the Khmer and English languages. It can be used for:

- Academic research
- Commercial applications (subject to license terms)
- Educational purposes
- Building language technology for Khmer speakers
### Limitations

- **Bias**: May reflect biases present in training data
- **Domain Gaps**: Performance may vary across different domains
- **Cultural Context**: May not capture all cultural nuances
- **Evolving Language**: May not reflect very recent language changes
### Recommendations

- Evaluate model performance on your specific use case
- Consider fine-tuning for domain-specific applications
- Be aware of potential biases in outputs
- Validate results with domain experts when needed
## License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.
## Model Card Authors

- **Model Development**: Metythorn Penn
- **Training Infrastructure**: Server GPU
- **Model Card**: Generated automatically during training
Disclaimer: This model is provided as-is for research and development purposes. Users are responsible for ensuring appropriate use and compliance with applicable laws and regulations.
**Last Updated**: 2025-06-16 | **Training Step**: 3,000 | **Model Version**: 1.0