XLM-RoBERTa for Khmer-English Language Processing

Model Description

This is a custom-trained XLM-RoBERTa-base model designed for Khmer (ខ្មែរ) and English language processing. It was pretrained with masked language modeling (MLM) on a curated corpus of Khmer-English text, making it effective for understanding and representing text in both languages.

Key Features

🌟 Bilingual Proficiency: Understands both Khmer and English with high accuracy
🚀 State-of-the-art Architecture: Based on RoBERTa with optimized training
📚 Domain Versatile: Trained on diverse text covering multiple domains
🔧 Ready-to-use: Can be fine-tuned for downstream tasks or used directly
⚡ Efficient: Optimized for both inference speed and model size

Model Details

  • Model Type: XLM-RoBERTa (Transformer)
  • Architecture: RoBERTa-base
  • Languages: Khmer (km), English (en)
  • Vocabulary Size: 30,000 tokens
  • Parameters: 109,113,648
  • Max Sequence Length: 512 tokens
  • Training Steps: 3,000
  • Tokenizer: SentencePiece
  • License: Apache 2.0

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import RobertaForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model_name = "metythorn/khmer-xlm-roberta-base"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = RobertaForMaskedLM.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

def predict_mask(text):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt")
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits
    
    # Find masked token position
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    
    # Get top 5 predictions
    mask_token_logits = predictions[0, mask_token_index, :]
    top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
    
    return [tokenizer.decode([token]).strip() for token in top_5_tokens]

# Example usage
khmer_text = "ប្រទេសកម្ពុជា គឺជាប្រទេស <mask> នៅអាស៊ីអាគ្នេយ៍។"
english_text = "The capital of Cambodia is <mask>."

print("Khmer predictions:", predict_mask(khmer_text))
print("English predictions:", predict_mask(english_text))

Advanced Usage

Text Classification Fine-tuning

from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments

# Load model for classification
model = RobertaForSequenceClassification.from_pretrained(
    "metythorn/khmer-xlm-roberta-base", 
    num_labels=2  # Adjust based on your task
)

# Fine-tune on your classification dataset
# ... (add your training data and training loop)
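
A minimal end-to-end sketch of the fine-tuning step above, assuming a hypothetical CSV dataset with "text" and "label" columns and the 🤗 datasets library; the file paths, label count, and hyperparameters are placeholders to adapt to your task.

from datasets import load_dataset
from transformers import (
    PreTrainedTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "metythorn/khmer-xlm-roberta-base"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical dataset: CSV files with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "valid.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./khmer-classifier",   # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

trainer.train()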

Feature Extraction

from transformers import RobertaModel

# Load model for feature extraction
model = RobertaModel.from_pretrained("metythorn/khmer-xlm-roberta-base")

def get_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Use CLS token embedding or pool all token embeddings
        embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    
    return embeddings

# Extract embeddings
khmer_emb = get_embeddings("នេះជាប្រយោគខ្មែរ។")
english_emb = get_embeddings("This is an English sentence.")
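
The mean-pooled embeddings above can be compared directly, for example with cosine similarity, to get a rough text-similarity score. This is a simple sketch; for production-grade similarity you would typically fine-tune a dedicated sentence encoder.

import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings computed above
similarity = F.cosine_similarity(khmer_emb, english_emb)
print(f"Cosine similarity: {similarity.item():.4f}")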

Training Details

Training Configuration

  • Training Framework: 🤗 Transformers + PyTorch
  • Batch Size: 8 per device
  • Gradient Accumulation: 4 steps
  • Effective Batch Size: 32
  • Learning Rate: 5e-05
  • Weight Decay: 0.01
  • Warmup Steps: 2,000
  • Max Grad Norm: 1.0
  • Mixed Precision: FP16
  • Gradient Checkpointing: ✅ Enabled
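
For reference, the configuration above corresponds roughly to the following 🤗 TrainingArguments. This is an illustrative sketch, not the exact training script, and the output directory is a placeholder.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./khmer-xlm-roberta-base",  # placeholder path
    per_device_train_batch_size=8,          # 8 per device
    gradient_accumulation_steps=4,          # effective batch size 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=2000,
    max_grad_norm=1.0,
    fp16=True,                              # mixed precision
    gradient_checkpointing=True,            # memory-efficient training
)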

Training Objective

The model was trained using Masked Language Modeling (MLM) with:

  • Masking Probability: 0.15 (15%)
  • Dynamic Masking: Applied during training for better generalization (see the sketch below)
  • Whole Word Masking: Implemented for multi-token words
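
Dynamic masking of this kind is commonly implemented with the 🤗 data collator shown below. This is an illustrative sketch rather than the exact training code; the whole-word masking mentioned above generally requires a tokenizer-aware custom collator on top of it.

from transformers import DataCollatorForLanguageModeling

# Dynamic masking: 15% of tokens are randomly selected and replaced with <mask>
# (or a random/original token) each time a batch is built, so the model sees
# different masks across epochs.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,   # tokenizer loaded as in the Quick Start section
    mlm=True,
    mlm_probability=0.15,
)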

Dataset

  • Source: Custom curated Khmer-English corpus
  • Domains: News, literature, government documents, web content, technical documents
  • Size: Multiple GB of cleaned text data
  • Languages: Khmer (ខ្មែរ) and English
  • Preprocessing: Cleaned, deduplicated, and filtered for quality

Infrastructure

  • GPUs: Multi-GPU training setup
  • Framework: PyTorch with Transformers
  • Optimization: Memory-efficient training with gradient checkpointing
  • Monitoring: Comprehensive logging and checkpointing

Performance

Evaluation Metrics

Note: Detailed evaluation metrics will be updated as they become available.

  • Masked Language Modeling (Perplexity): TBD
  • Downstream Task Fine-tuning (F1-Score): TBD

Capabilities

✅ Strong Performance On:

  • Khmer text understanding and generation
  • English text processing
  • Code-switching between Khmer and English
  • Cultural and contextual understanding
  • Technical and formal text

⚠️ Limitations:

  • Performance may vary on very domain-specific text
  • Limited training on informal/slang text
  • May require fine-tuning for specific downstream tasks

Use Cases

🎯 Direct Applications

  • Text Completion: Fill in missing words in Khmer/English text
  • Language Understanding: Extract meaningful representations
  • Similarity Computation: Calculate text similarity scores
  • Feature Extraction: Get embeddings for ML pipelines

🔧 Fine-tuning Applications

  • Text Classification: Sentiment analysis, document categorization
  • Named Entity Recognition: Extract persons, locations, organizations
  • Question Answering: Build QA systems for Khmer/English
  • Text Summarization: Summarize documents in both languages
  • Machine Translation: Improve Khmer-English translation quality

Technical Specifications

Model Architecture

  • Base Architecture: RoBERTa (Robustly Optimized BERT Pretraining Approach)
  • Attention Heads: 12
  • Hidden Layers: 12
  • Hidden Size: 768
  • Intermediate Size: 3072
  • Position Embeddings: 514
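
These values can be checked directly against the published configuration (assuming standard 🤗 Transformers usage):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("metythorn/khmer-xlm-roberta-base")
print(config.num_attention_heads)      # 12
print(config.num_hidden_layers)        # 12
print(config.hidden_size)              # 768
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 514
print(config.vocab_size)               # 30,000 (see Tokenizer Details)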

Tokenizer Details

  • Type: SentencePiece
  • Vocabulary: 30,000 tokens
  • Special Tokens: <s>, </s>, <pad>, <unk>, <mask>
  • Supports: Both Khmer Unicode and English text
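
A quick way to inspect the tokenizer and see how it segments Khmer and English text (the exact sub-tokens depend on the learned SentencePiece vocabulary):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("metythorn/khmer-xlm-roberta-base")

# Special tokens and vocabulary size
print(tokenizer.special_tokens_map)  # <s>, </s>, <pad>, <unk>, <mask>
print(len(tokenizer))                # 30,000

# Segmentation of Khmer and English text
print(tokenizer.tokenize("នេះជាប្រយោគខ្មែរ។"))
print(tokenizer.tokenize("This is an English sentence."))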

Ethical Considerations & Limitations

Intended Use

This model is intended for research and development purposes in NLP applications involving Khmer and English languages. It can be used for:

  • Academic research
  • Commercial applications (subject to license terms)
  • Educational purposes
  • Building language technology for Khmer speakers

Limitations

  • Bias: May reflect biases present in training data
  • Domain Gaps: Performance may vary across different domains
  • Cultural Context: May not capture all cultural nuances
  • Evolving Language: May not reflect very recent language changes

Recommendations

  • Evaluate model performance on your specific use case
  • Consider fine-tuning for domain-specific applications
  • Be aware of potential biases in outputs
  • Validate results with domain experts when needed

License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.

Model Card Authors

  • Model Development: Metythorn Penn
  • Training Infrastructure: Server GPU
  • Model Card: Generated automatically during training

Disclaimer: This model is provided as-is for research and development purposes. Users are responsible for ensuring appropriate use and compliance with applicable laws and regulations.

Last Updated: 2025-06-16 · Training Steps: 3,000 · Model Version: 1.0
