---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
- translation
- mbart50
- english
- telugu
- hackhedron
- neural-machine-translation
- huggingface
license: apache-2.0
datasets:
- hackhedron
metrics:
- sacrebleu
language:
- en
- te
model-index:
- name: mbart50-en-te-hackhedron
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: HackHedron English-Telugu Parallel Corpus
      type: hackhedron
      args: en-te
    metrics:
    - name: SacreBLEU
      type: sacrebleu
      value: 66.924
---
# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from `facebook/mbart-large-50-many-to-many-mmt` on the HackHedron English-Telugu Parallel Corpus. It supports bidirectional English ↔ Telugu translation.
## 🧠 Model Architecture

- Base model: mBART50 (Multilingual BART with 50 languages)
- Type: Seq2Seq Transformer
- Tokenizer: `MBart50TokenizerFast`
- Languages Used: `en_XX` for English, `te_IN` for Telugu (see the snippet below)
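These codes can be checked directly on the tokenizer. A minimal sketch (the checkpoint id is the one used in the usage section below):

```python
from transformers import MBart50TokenizerFast

# Load the tokenizer and look up the token ids behind the two language codes
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
print(tokenizer.lang_code_to_id["en_XX"], tokenizer.lang_code_to_id["te_IN"])
```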
## 📚 Dataset

HackHedron English-Telugu Parallel Corpus

- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format:

```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
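A minimal preprocessing sketch for this format, assuming the corpus has been loaded as a 🤗 Dataset with `english` and `telugu` columns (the `dataset.map` call at the end is illustrative, not the exact training script):

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",
    tgt_lang="te_IN",
)

def preprocess(example):
    # Tokenize the English source and the Telugu target,
    # truncating to the 128 tokens listed under Training Details.
    return tokenizer(
        example["english"],
        text_target=example["telugu"],
        max_length=128,
        truncation=True,
    )

# tokenized = dataset.map(preprocess, remove_columns=["english", "telugu"])
```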
## 📈 Evaluation

| Metric | Score | Loss |
|---|---|---|
| SacreBLEU | 66.924 | 0.0511 |

🧪 Evaluation was done with the Hugging Face `evaluate` library on the validation set.
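For reference, a minimal sketch of how such a score can be computed with `evaluate` (the prediction/reference lists here are placeholders, not the actual evaluation script):

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# predictions: decoded model outputs; references: one or more reference translations per example
predictions = ["మీరు ఎలా ఉన్నారు?"]
references = [["మీరు ఎలా ఉన్నారు?"]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 3))
```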
## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target language
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```
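Since the card advertises bidirectional support, the reverse direction works by swapping the language codes. A sketch continuing from the snippet above (the Telugu input sentence is just an example):

```python
# Telugu → English: swap the source language and force English as the first generated token
tokenizer.src_lang = "te_IN"

telugu_text = "మీరు ఎలా ఉన్నారు?"
inputs = tokenizer(telugu_text, return_tensors="pt")
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```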
## 📦 How to Fine-Tune Further

Use the `Seq2SeqTrainer` from Hugging Face:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation.
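A minimal wiring sketch under those lines, assuming the data has been tokenized as in the preprocessing example above; `tokenized_train` / `tokenized_eval` are placeholder variables, and the hyperparameters simply mirror the Training Details section below:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Start from this checkpoint (or from the base facebook/mbart-large-50-many-to-many-mmt)
model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "koushik-reddy/mbart50-en-te-hackhedron", src_lang="en_XX", tgt_lang="te_IN"
)

# As noted above, force Telugu as the first generated token during generation
model.generation_config.forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    seed=42,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # placeholder: your tokenized training split
    eval_dataset=tokenized_eval,    # placeholder: your tokenized validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```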
## 🛠️ Training Details

- Optimizer: AdamW
- Learning rate: 2e-05
- Epochs: 1
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Truncation length: 128 tokens
- Framework: 🤗 Transformers + Datasets
- Scheduler: Linear
- Mixed precision: Enabled (fp16)
### Training results

| Training Loss | Epoch | Step | Validation Loss | Bleu |
|---|---|---|---|---|
| 0.0455 | 1.0 | 48808 | 0.0511 | 66.9240 |
### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1
## 🏷️ License

This model is licensed under the Apache 2.0 License.
## 🤝 Acknowledgements

- 🤗 Hugging Face Transformers
- Facebook AI for mBART50
- HackHedron Parallel Corpus contributors

Created by Koushik Reddy – Hugging Face Profile