๐ŸŒ mBART50 English โ†” Telugu | HackHedron Dataset

This model is fine-tuned from facebook/mbart-large-50-many-to-many-mmt on the HackHedron English-Telugu Parallel Corpus. It supports bidirectional translation: English ↔ Telugu.

🧠 Model Architecture

  • Base model: mBART50 (Multilingual BART with 50 languages)
  • Type: Seq2Seq Transformer
  • Tokenizer: MBart50TokenizerFast
  • Languages Used:
    • en_XX for English
    • te_IN for Telugu

📚 Dataset

HackHedron English-Telugu Parallel Corpus

  • ~390,000 training sentence pairs
  • ~43,000 validation pairs
  • Format:
{
  "english": "Tom started his car and drove away.",
  "telugu": "เฐŸเฐพเฐฎเฑ เฐคเฐจ เฐ•เฐพเฐฐเฑเฐจเฑ เฐธเฑเฐŸเฐพเฐฐเฑเฐŸเฑ เฐšเฑ‡เฐธเฐฟ เฐฆเฑ‚เฐฐเฐ‚เฐ—เฐพ เฐจเฐกเฐฟเฐชเฐพเฐกเฑ."
}
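
If the corpus is available as JSON files in this format, it can be loaded with the 🤗 Datasets library. A minimal sketch; the file paths below are placeholders, not part of this model card, so point them at the actual corpus files:

from datasets import load_dataset

# Placeholder paths: substitute the real locations of the corpus files
dataset = load_dataset(
    "json",
    data_files={"train": "train.json", "validation": "validation.json"},
)
print(dataset["train"][0]["english"])  # "Tom started his car and drove away."
print(dataset["train"][0]["telugu"])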

📈 Evaluation

Metric      Score     Loss
SacreBLEU   66.924    0.0511

🧪 Evaluation was done with the Hugging Face evaluate library on the validation set.
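
For reference, a sketch of how a SacreBLEU score can be computed with evaluate; the predictions and references below are illustrative placeholders:

import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["టామ్ తన కార్ను స్టార్ట్ చేసి దూరంగా నడిపాడు."]   # decoded model outputs (illustrative)
references = [["టామ్ తన కార్ను స్టార్ట్ చేసి దూరంగా నడిపాడు."]]  # one list of reference translations per prediction
result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])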


💻 How to Use

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target languages so the tokenizer prepends the right language codes
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
)
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
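
Translation in the reverse direction works the same way. A minimal sketch, reusing the model and tokenizer loaded above with the language codes swapped; the Telugu input is illustrative:

# Telugu -> English: swap source and target language codes
tokenizer.src_lang = "te_IN"
tokenizer.tgt_lang = "en_XX"

telugu_text = "మీరు ఎలా ఉన్నారు?"  # "How are you?"
inputs = tokenizer(telugu_text, return_tensors="pt")
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])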

📦 How to Fine-Tune Further

Use the Seq2SeqTrainer from Hugging Face:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

Make sure to set forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"] whenever Telugu output is generated, e.g. during evaluation with predict_with_generate=True, as in the sketch below.
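
A minimal sketch of the preprocessing and trainer wiring, assuming the model and tokenizer loaded in the usage example above and a DatasetDict named dataset with the english/telugu fields shown earlier; all other names here are illustrative, not part of this model card:

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

def preprocess(batch):
    # Tokenize sources and targets, truncating to 128 tokens (see Training Details)
    model_inputs = tokenizer(batch["english"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["telugu"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["english", "telugu"])

# Force Telugu as the first generated token whenever the trainer runs generation
model.generation_config.forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mbart50-en-te-finetuned", predict_with_generate=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()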


๐Ÿ› ๏ธ Training Details

  • Optimizer: AdamW
  • Learning Rate: 2e-05
  • Epochs: 1
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • Truncation Length: 128 tokens
  • Framework: 🤗 Transformers + Datasets
  • Scheduler: Linear
  • Mixed Precision: Enabled (fp16)
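
Expressed as Seq2SeqTrainingArguments, these settings correspond roughly to the following sketch; output_dir is illustrative, and the 128-token truncation is applied at tokenization time rather than here:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-hackhedron",  # illustrative
    optim="adamw_torch",                    # AdamW
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    fp16=True,
)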

Training results

Training Loss   Epoch   Step    Validation Loss   BLEU
0.0455          1.0     48808   0.0511            66.9240

Framework versions

  • Transformers 4.51.3
  • PyTorch 2.6.0+cu124
  • Datasets 3.6.0
  • Tokenizers 0.21.1

๐Ÿท๏ธ License

This model is licensed under the Apache 2.0 License.


๐Ÿค Acknowledgements

  • 🤗 Hugging Face Transformers
  • Facebook AI for mBART50
  • HackHedron Parallel Corpus Contributors

Created by Koushik Reddy – Hugging Face Profile
