# 🚀 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from `facebook/mbart-large-50-many-to-many-mmt` on the HackHedron English-Telugu Parallel Corpus. It supports translation in both directions, English ↔ Telugu.
## 🧠 Model Architecture
- Base model: mBART50 (Multilingual BART with 50 languages)
- Type: Seq2Seq Transformer
- Tokenizer: `MBart50TokenizerFast`
- Language codes: `en_XX` for English, `te_IN` for Telugu
## 📚 Dataset
HackHedron English-Telugu Parallel Corpus
- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format:

```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
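Records in this format can be flattened into (source, target) pairs before tokenization. A minimal sketch in plain Python — the helper name and the `reverse` flag are illustrative, not part of the dataset:

```python
# Convert HackHedron-style records into (source, target) pairs.
# Field names "english" and "telugu" match the format shown above;
# reverse=True yields Telugu -> English pairs for the other direction.
def to_translation_pairs(records, reverse=False):
    pairs = []
    for rec in records:
        src, tgt = rec["english"], rec["telugu"]
        if reverse:
            src, tgt = tgt, src
        pairs.append((src, tgt))
    return pairs

sample = [{"english": "How are you?", "telugu": "మీరు ఎలా ఉన్నారు?"}]
print(to_translation_pairs(sample))
# -> [('How are you?', 'మీరు ఎలా ఉన్నారు?')]
```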
## 📊 Evaluation

| Metric | Score | Loss |
|---|---|---|
| SacreBLEU | 66.924 | 0.0511 |

🧪 Evaluation was done using the Hugging Face `evaluate` library on the validation set.
## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target language
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
)
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
## 📦 How to Fine-Tune Further

Use the `Seq2SeqTrainer` from Hugging Face:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to properly set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation.
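A minimal setup sketch, assuming `train_dataset` and `eval_dataset` are already tokenized 🤗 Datasets splits (those two names are placeholders, not provided by this repo). The hyperparameters mirror the "Training Details" section below:

```python
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

# Values taken from the Training Details section of this card.
args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-finetuned",
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    fp16=True,
    lr_scheduler_type="linear",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: your tokenized training split
    eval_dataset=eval_dataset,    # placeholder: your tokenized validation split
    tokenizer=tokenizer,
)
trainer.train()
```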
## 🛠️ Training Details
- Optimizer: AdamW
- Learning Rate: 2e-05
- Epochs: 1
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- Truncation Length: 128 tokens
- Framework: 🤗 Transformers + Datasets
- Scheduler: Linear
- Mixed Precision: Enabled (fp16)
### Training results

| Training Loss | Epoch | Step | Validation Loss | Bleu |
|---|---|---|---|---|
| 0.0455 | 1.0 | 48808 | 0.0511 | 66.9240 |
### Framework versions
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1
## 🏷️ License
This model is licensed under the Apache 2.0 License.
## 🤝 Acknowledgements
- 🤗 Hugging Face Transformers
- Facebook AI for mBART50
- HackHedron Parallel Corpus Contributors

Created by Koushik Reddy – Hugging Face Profile