---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
  - translation
  - mbart50
  - english
  - telugu
  - hackhedron
  - neural-machine-translation
  - huggingface
license: apache-2.0
datasets:
  - hackhedron
metrics:
  - sacrebleu
language:
  - en
  - te
model-index:
  - name: mbart50-en-te-hackhedron
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: HackHedron English-Telugu Parallel Corpus
          type: hackhedron
          args: en-te
        metrics:
          - name: SacreBLEU
            type: sacrebleu
            value: 66.924
---

# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from [`facebook/mbart-large-50-many-to-many-mmt`](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the HackHedron English-Telugu Parallel Corpus and supports bidirectional translation between English and Telugu (en ↔ te).

## 🧠 Model Architecture

- Base model: mBART50 (multilingual BART covering 50 languages)
- Type: Seq2Seq Transformer
- Tokenizer: `MBart50TokenizerFast`
- Language codes used (see the snippet below):
  - `en_XX` for English
  - `te_IN` for Telugu
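
Each language code is a dedicated special token in the tokenizer. A minimal sketch for inspecting the mapping, assuming the checkpoint id used in the usage section below:

```python
from transformers import MBart50TokenizerFast

# Load the tokenizer shipped with this checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# The mBART50 language codes map to dedicated token ids
print(tokenizer.lang_code_to_id["en_XX"])  # id of the English language-code token
print(tokenizer.lang_code_to_id["te_IN"])  # id of the Telugu language-code token
```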

## 📚 Dataset

**HackHedron English-Telugu Parallel Corpus**

- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format: one JSON record per sentence pair (a loading sketch follows the example record)

```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
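
The corpus itself is not bundled with this model repository. Assuming you have it locally as JSON Lines files with the fields shown above (the file names below are placeholders), it can be loaded with 🤗 Datasets roughly like this:

```python
from datasets import load_dataset

# Placeholder file names; each line holds one {"english": ..., "telugu": ...} record
dataset = load_dataset(
    "json",
    data_files={
        "train": "hackhedron_train.jsonl",
        "validation": "hackhedron_validation.jsonl",
    },
)

print(dataset["train"][0])
```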

## 📈 Evaluation

| Metric    | Score  | Validation Loss |
|-----------|--------|-----------------|
| SacreBLEU | 66.924 | 0.0511          |

🧪 Evaluation was performed on the validation set using the Hugging Face `evaluate` library.
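
As a rough sketch (the sentences below are placeholders, not taken from the validation set), a SacreBLEU score can be computed like this:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Placeholder hypothesis/reference pair; in practice these come from model.generate()
predictions = ["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]
references = [["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]]

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
```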


## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target language
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```
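
The reverse direction works the same way. Continuing from the snippet above, swap the language codes (the Telugu input here is just the illustrative sentence from the dataset example):

```python
# Telugu → English: source is Telugu, force the English language code on the decoder
tokenizer.src_lang = "te_IN"
tokenizer.tgt_lang = "en_XX"

text = "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
inputs = tokenizer(text, return_tensors="pt")
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```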

## 📦 How to Fine-Tune Further

Use the `Seq2SeqTrainer` from Hugging Face:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation so the decoder starts with the Telugu language token; a fuller sketch follows below.
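
A minimal, untested sketch of what that could look like, reusing the hyperparameters listed under Training Details below. The data files and output directory are placeholders, not part of this repository:

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "koushik-reddy/mbart50-en-te-hackhedron"
model = MBartForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

# Per the note above: force the Telugu language token at the start of generation
model.generation_config.forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

# Placeholder file names; records look like {"english": ..., "telugu": ...}
dataset = load_dataset(
    "json",
    data_files={"train": "hackhedron_train.jsonl", "validation": "hackhedron_validation.jsonl"},
)

def preprocess(batch):
    # text_target tokenizes the Telugu side as labels, using tgt_lang set above
    return tokenizer(batch["english"], text_target=batch["telugu"], max_length=128, truncation=True)

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-hackhedron-finetuned",  # placeholder output path
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```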


## 🛠️ Training Details

- Optimizer: AdamW
- Learning rate: 2e-05
- Epochs: 1
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Truncation length: 128 tokens
- Framework: 🤗 Transformers + Datasets
- LR scheduler: linear
- Mixed precision: fp16

### Training results

| Training Loss | Epoch | Step  | Validation Loss | BLEU    |
|---------------|-------|-------|-----------------|---------|
| 0.0455        | 1.0   | 48808 | 0.0511          | 66.9240 |

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1

## 🏷️ License

This model is released under the Apache 2.0 License.


## 🤝 Acknowledgements

- 🤗 Hugging Face Transformers
- Facebook AI for mBART50
- HackHedron Parallel Corpus contributors

Created by Koushik Reddy (Koushim on the Hugging Face Hub).