XLM-RoBERTa for Khmer-English Language Processing

Model Description

This is a custom-trained XLM-RoBERTa-base model designed for Khmer (ខ្មែរ) and English language processing. It was pretrained with masked language modeling (MLM) on a curated corpus of Khmer-English text, making it effective for understanding and representing text in both languages.

Key Features

🌟 Bilingual Proficiency: Understands both Khmer and English with high accuracy
🚀 State-of-the-art Architecture: Based on RoBERTa with optimized training
📚 Domain Versatile: Trained on diverse text covering multiple domains
🔧 Ready-to-use: Can be fine-tuned for downstream tasks or used directly
⚡ Efficient: Optimized for both inference speed and model size

Model Details

  • Model Type: XLM-RoBERTa (Transformer)
  • Architecture: RoBERTa-base
  • Languages: Khmer (km), English (en)
  • Vocabulary Size: 30,000 tokens
  • Parameters: 109,113,648
  • Max Sequence Length: 512 tokens
  • Training Steps: 3,000
  • Tokenizer: SentencePiece
  • License: Apache 2.0

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import RobertaForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model_name = "metythorn/khmer-xlm-roberta-base"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = RobertaForMaskedLM.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

def predict_mask(text):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt")
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits
    
    # Find masked token position
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    
    # Get top 5 predictions
    mask_token_logits = predictions[0, mask_token_index, :]
    top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
    
    return [tokenizer.decode([token]).strip() for token in top_5_tokens]

# Example usage
khmer_text = "ប្រទេសកម្ពុជា គឺជាប្រទេស <mask> នៅអាស៊ីអាគ្នេយ៍។"
english_text = "The capital of Cambodia is <mask>."

print("Khmer predictions:", predict_mask(khmer_text))
print("English predictions:", predict_mask(english_text))

Advanced Usage

Text Classification Fine-tuning

from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments

# Load model for classification
model = RobertaForSequenceClassification.from_pretrained(
    "metythorn/khmer-xlm-roberta-base", 
    num_labels=2  # Adjust based on your task
)

# Fine-tune on your classification dataset
# ... (add your training data and training loop)
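
A minimal end-to-end sketch of the fine-tuning step above, assuming a hypothetical CSV dataset with "text" and "label" columns and the 🤗 datasets library; the file paths, label count, and hyperparameters are placeholders to adapt to your task.

from datasets import load_dataset
from transformers import (
    PreTrainedTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "metythorn/khmer-xlm-roberta-base"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical dataset: CSV files with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "valid.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./khmer-classifier",   # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

trainer.train()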

Feature Extraction

from transformers import RobertaModel

# Load model for feature extraction
model = RobertaModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

def get_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Use CLS token embedding or pool all token embeddings
        embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    
    return embeddings

# Extract embeddings
khmer_emb = get_embeddings("នេះជាប្រយោគខ្មែរ។")
english_emb = get_embeddings("This is an English sentence.")
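
The mean-pooled embeddings above can be compared directly, for example with cosine similarity, to get a rough text-similarity score. This is a simple sketch; for production-grade similarity you would typically fine-tune a dedicated sentence encoder.

import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings computed above
similarity = F.cosine_similarity(khmer_emb, english_emb)
print(f"Cosine similarity: {similarity.item():.4f}")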

Training Details

Training Configuration

  • Training Framework: 🤗 Transformers + PyTorch
  • Batch Size: 8 per device
  • Gradient Accumulation: 4 steps
  • Effective Batch Size: 32
  • Learning Rate: 5e-05
  • Weight Decay: 0.01
  • Warmup Steps: 2,000
  • Max Grad Norm: 1.0
  • Mixed Precision: FP16
  • Gradient Checkpointing: ✅ Enabled
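
For reference, the configuration above corresponds roughly to the following 🤗 TrainingArguments. This is an illustrative sketch, not the exact training script, and the output directory is a placeholder.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./khmer-xlm-roberta-base",  # placeholder path
    per_device_train_batch_size=8,          # 8 per device
    gradient_accumulation_steps=4,          # effective batch size 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=2000,
    max_grad_norm=1.0,
    fp16=True,                              # mixed precision
    gradient_checkpointing=True,            # memory-efficient training
)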

Training Objective

The model was trained using Masked Language Modeling (MLM) with:

  • Masking Probability: 0.15 (15%)
  • Dynamic Masking: Applied during training for better generalization (see the sketch below)
  • Whole Word Masking: Implemented for multi-token words
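
Dynamic masking of this kind is commonly implemented with the 🤗 data collator shown below. This is an illustrative sketch rather than the exact training code; the whole-word masking mentioned above generally requires a tokenizer-aware custom collator on top of it.

from transformers import DataCollatorForLanguageModeling

# Dynamic masking: 15% of tokens are randomly selected and replaced with <mask>
# (or a random/original token) each time a batch is built, so the model sees
# different masks across epochs.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,   # tokenizer loaded as in the Quick Start section
    mlm=True,
    mlm_probability=0.15,
)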

Dataset

  • Source: Custom curated Khmer-English corpus
  • Domains: News, literature, government documents, web content, technical documents
  • Size: Multiple GB of cleaned text data
  • Languages: Khmer (ខ្មែរ) and English
  • Preprocessing: Cleaned, deduplicated, and filtered for quality

Infrastructure

  • GPUs: Multi-GPU training setup
  • Framework: PyTorch with Transformers
  • Optimization: Memory-efficient training with gradient checkpointing
  • Monitoring: Comprehensive logging and checkpointing

Performance

Evaluation Metrics

Note: Detailed evaluation metrics will be updated as they become available.

  • Masked Language Modeling (Perplexity): TBD
  • Downstream Task Fine-tuning (F1-Score): TBD

Capabilities

✅ Strong Performance On:

  • Khmer text understanding and generation
  • English text processing
  • Code-switching between Khmer and English
  • Cultural and contextual understanding
  • Technical and formal text

⚠️ Limitations:

  • Performance may vary on very domain-specific text
  • Limited training on informal/slang text
  • May require fine-tuning for specific downstream tasks

Use Cases

🎯 Direct Applications

  • Text Completion: Fill in missing words in Khmer/English text
  • Language Understanding: Extract meaningful representations
  • Similarity Computation: Calculate text similarity scores
  • Feature Extraction: Get embeddings for ML pipelines

🔧 Fine-tuning Applications

  • Text Classification: Sentiment analysis, document categorization
  • Named Entity Recognition: Extract persons, locations, organizations
  • Question Answering: Build QA systems for Khmer/English
  • Text Summarization: Summarize documents in both languages
  • Machine Translation: Improve Khmer-English translation quality

Technical Specifications

Model Architecture

  • Base Architecture: RoBERTa (Robustly Optimized BERT Pretraining Approach)
  • Attention Heads: 12
  • Hidden Layers: 12
  • Hidden Size: 768
  • Intermediate Size: 3072
  • Position Embeddings: 514
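
These values can be checked directly against the published configuration (assuming standard 🤗 Transformers usage):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("metythorn/khmer-xlm-roberta-base")
print(config.num_attention_heads)      # 12
print(config.num_hidden_layers)        # 12
print(config.hidden_size)              # 768
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 514
print(config.vocab_size)               # 30,000 (see Tokenizer Details)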

Tokenizer Details

  • Type: SentencePiece
  • Vocabulary: 30,000 tokens
  • Special Tokens: <s>, </s>, <pad>, <unk>, <mask>
  • Supports: Both Khmer Unicode and English text
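
A quick way to inspect the tokenizer and see how it segments Khmer and English text (the exact sub-tokens depend on the learned SentencePiece vocabulary):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Special tokens and vocabulary size
print(tokenizer.special_tokens_map)  # <s>, </s>, <pad>, <unk>, <mask>
print(len(tokenizer))                # 30,000

# Segmentation of Khmer and English text
print(tokenizer.tokenize("នេះជាប្រយោគខ្មែរ។"))
print(tokenizer.tokenize("This is an English sentence."))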

Ethical Considerations & Limitations

Intended Use

This model is intended for research and development purposes in NLP applications involving Khmer and English languages. It can be used for:

  • Academic research
  • Commercial applications (subject to license terms)
  • Educational purposes
  • Building language technology for Khmer speakers

Limitations

  • Bias: May reflect biases present in training data
  • Domain Gaps: Performance may vary across different domains
  • Cultural Context: May not capture all cultural nuances
  • Evolving Language: May not reflect very recent language changes

Recommendations

  • Evaluate model performance on your specific use case
  • Consider fine-tuning for domain-specific applications
  • Be aware of potential biases in outputs
  • Validate results with domain experts when needed

License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.

Model Card Authors

  • Model Development: Metythorn Penn
  • Training Infrastructure: Server GPU
  • Model Card: Generated automatically during training

Disclaimer: This model is provided as-is for research and development purposes. Users are responsible for ensuring appropriate use and compliance with applicable laws and regulations.

Last Updated: 2025-06-16 · Training Steps: 3,000 · Model Version: 1.0
