MarianMT Indonesian-English Translation (Fine-Tuned)
This model is a fine-tuned version of Helsinki-NLP/opus-mt-id-en, specialized for translating Indonesian to English, particularly in contexts found in TED Talks.
Model Highlights
- Specialized Context: Fine-tuned on the TED Talks parallel corpus for better performance on formal and presentation-style language.
- Optimized Training: Utilizes modern training techniques like layer freezing and a cosine annealing scheduler for stable and effective fine-tuning.
- Production Ready: Can be easily integrated into applications using the transformers library.
Model Details
- Base Model: Helsinki-NLP/opus-mt-id-en
- Fine-tuned Dataset: Cleaned and aligned TED Talks parallel corpus (Indonesian-English).
- Training Date: 2025-06-16
- Languages: Indonesian (id) → English (en)
Training Configuration
Hyperparameters
- Learning Rate: 5e-6
- Weight Decay: 0.001
- Gradient Clipping: 0.5
- Max Sequence Length: 96-128 tokens
- Scheduler: Cosine Annealing with Warmup
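To make the scheduler entry concrete, here is a minimal pure-Python sketch of a cosine schedule with linear warmup. Only the peak learning rate (5e-6) comes from the list above; the function name, the total step count, and the warmup length are assumptions chosen for illustration, and this is not necessarily the exact schedule used for this model.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr=5e-6):
    """Return the learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1,000 optimizer steps, the first 100 used for warmup.
for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_lr_with_warmup(step, 1000, 100):.2e}")
```

In a real training loop, a library scheduler such as `get_cosine_schedule_with_warmup` from transformers produces the same shape; whether that helper was used here is not stated in this card.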
Architecture Optimizations
- Layer Freezing: Early encoder layers were frozen to preserve foundational language knowledge from the base model.
- Memory Optimization: Utilized gradient accumulation to simulate a larger batch size.
- Early Stopping: Implemented with a patience of 5 epochs to prevent overfitting.
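The layer freezing and early stopping described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the training script used for this model: the helper names are hypothetical, the number of frozen layers (2) is a guess because the card does not state it, the parameter-name prefix assumes the usual `MarianMTModel` layout (`model.encoder.layers.<i>.`), and validation loss is assumed to be the monitored metric. Only the patience of 5 epochs comes from the list above.

```python
def freeze_early_encoder_layers(model, num_frozen_layers=2):
    """Disable gradients for the first `num_frozen_layers` encoder layers.

    Works on any object exposing `named_parameters()`, as PyTorch modules do.
    The default of 2 layers is an assumption, not a value from the model card.
    """
    # The trailing dot keeps layer 1 from also matching layers 10, 11, ...
    prefixes = tuple(
        f"model.encoder.layers.{i}." for i in range(num_frozen_layers)
    )
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

A training loop would call `freeze_early_encoder_layers(model)` once before building the optimizer, then call `should_stop(val_loss)` at the end of each epoch and break when it returns `True`.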
Usage Example
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "dhintech/marian-tmx-tedtalks-id-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(text):
    # Tokenize the input, truncating it to at most 128 tokens
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    # Generate with beam search; gradients are not needed for inference
    with torch.no_grad():
        outputs = model.generate(**inputs, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Selamat pagi, mari kita mulai rapat hari ini."
english_translation = translate(indonesian_text)
print(f"ID: {indonesian_text}")
print(f"EN: {english_translation}")
Intended Use Cases
- Presentation Translation: Translating presentation scripts and materials.
- Formal Content: Translating articles, reports, and other formal documents.
- Educational Content: Assisting with the translation of academic and educational materials.
Performance Metrics
Performance metrics such as BLEU score, inference time, and human evaluation will be added here after the model has been fully trained and evaluated.
Limitations and Considerations
- Domain Specificity: While trained on a broad corpus, performance is best on formal language similar to TED Talks. It may not perform as well on very casual slang or regional dialects.
- Long Sequences: Performance might degrade for sentences significantly longer than the maximum length used in training (128 tokens). The usage example truncates inputs at that length, so any text beyond it is not translated.
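One way to work around the length limit is to split long text into sentence-sized chunks and translate each chunk separately. The sketch below is a hedged illustration: the function names are hypothetical, the sentence splitter is a naive regular expression, and by default tokens are approximated with a whitespace word count. A tokenizer-based counter can be passed in via `count_tokens` for a closer match to the 128-token limit.

```python
import re


def split_into_chunks(text, max_tokens=128, count_tokens=None):
    """Group sentences into chunks whose token count stays within `max_tokens`.

    A single sentence longer than `max_tokens` is kept as its own chunk,
    so it may still be truncated by the tokenizer.
    """
    # Fall back to a whitespace word count when no counter is supplied.
    count_tokens = count_tokens or (lambda s: len(s.split()))
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text, translate_fn, max_tokens=128, count_tokens=None):
    """Translate each chunk with `translate_fn` and join the results with spaces."""
    chunks = split_into_chunks(text, max_tokens, count_tokens)
    return " ".join(translate_fn(chunk) for chunk in chunks)
```

With the `translate` function and `tokenizer` from the usage example, a call might look like `translate_long_text(long_text, translate, count_tokens=lambda s: len(tokenizer(s)["input_ids"]))`. Translating chunks independently loses cross-sentence context, so results may differ from translating the passage as a whole.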
Contributing
Feedback and contributions are welcome! Please use the Community tab or open an issue on the repository if you encounter any problems or have suggestions for improvement.