Urdu Text Correction Model

This model fine-tunes facebook/mbart-large-50 for automatic correction of Urdu text, addressing spelling mistakes and grammatical errors while improving text fluency.

License: MIT
Base model: mBART-large-50

Overview

The Urdu Text Correction model is designed to automatically detect and correct errors in Urdu text, making it valuable for content editors, publishers, educational institutions, and applications requiring high-quality Urdu text.

Performance Metrics

The model achieves the following results on our evaluation set:

Metric                       Score
BLEU                         0.6996
METEOR                       0.8296
WER (Word Error Rate)        0.1795
CER (Character Error Rate)   0.0761
ROUGE-1                      0.2025
ROUGE-2                      0.0699
ROUGE-L                      0.2023
Exact Match                  0.1096
Generation Length            28.4033
Loss                         0.4305
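
These scores can be reproduced with the Hugging Face evaluate library. A minimal sketch, assuming predictions and references are lists of model outputs and gold corrections; the single-sentence lists below are placeholders, not the actual evaluation set:

import evaluate

# Placeholder data: a real run would use the full evaluation set.
predictions = ["یہ ایک اچھی بات ہے"]  # model outputs
references = ["یہ ایک اچھی بات ہے"]   # gold corrections

bleu = evaluate.load("bleu")
wer = evaluate.load("wer")
cer = evaluate.load("cer")

print("BLEU:", bleu.compute(predictions=predictions,
                            references=[[r] for r in references])["bleu"])
print("WER:", wer.compute(predictions=predictions, references=references))
print("CER:", cer.compute(predictions=predictions, references=references))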

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/urdu_text_correction")
model = AutoModelForSeq2SeqLM.from_pretrained("mahwizzzz/urdu_text_correction")

# Example text with errors
incorrect_text = "یہہ ایک اچھی بات ہے"

# Tokenize and generate correction
inputs = tokenizer(incorrect_text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_length=128)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {incorrect_text}")
print(f"Corrected: {corrected_text}")

Model Description

This is an encoder-decoder model (611M parameters) based on mBART-large-50, specifically fine-tuned on Urdu text correction pairs. The model learns to transform incorrect Urdu text into its corrected form, addressing issues such as:

  • Spelling mistakes
  • Grammar errors
  • Word order issues
  • Missing diacritics
  • Punctuation errors

Training Details

Data

The model was trained on a dataset of incorrect-correct Urdu text pairs.

Hyperparameters

The model was trained with the following hyperparameters (a Trainer-style configuration sketch follows the list):

  • Learning rate: 3e-05
  • Batch size: 128 (32 per device with gradient accumulation of 4)
  • Optimizer: AdamW with betas=(0.9, 0.999)
  • LR scheduler: Cosine with 500 warmup steps
  • Training epochs: 3
  • Mixed precision: Native AMP
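
For reference, these settings map onto the transformers Trainer API roughly as follows. This is a sketch only: output_dir is a placeholder, dataset and model wiring are omitted, and AdamW with betas=(0.9, 0.999) is the Trainer default optimizer, so it needs no explicit flag:

from transformers import Seq2SeqTrainingArguments

# Sketch of the reported hyperparameters in Trainer form; output_dir is
# a placeholder. AdamW with betas=(0.9, 0.999) is the default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="urdu-text-correction",  # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=32,     # effective batch size 128 with
    gradient_accumulation_steps=4,      # gradient accumulation of 4
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    fp16=True,                          # native AMP mixed precision
)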

Limitations

  • The model may not perform well on highly domain-specific text (technical, medical, etc.)
  • Very long texts may need to be split into smaller chunks due to the model's maximum sequence length (see the chunking sketch after this list)
  • The model may sometimes over-correct dialectal variations or stylistic choices
  • Performance is dependent on the quality and diversity of the training data
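
For long inputs, a simple workaround is to split the text into sentences, correct each one, and rejoin the results. A minimal sketch, reusing the tokenizer and model from the Usage section; splitting on the Urdu full stop "۔" is a simplification, and robust segmentation may need a dedicated sentence splitter:

# Minimal chunking sketch for long inputs. Splitting on the Urdu full
# stop "۔" is a simplification; robust sentence segmentation may need
# a dedicated splitter.
def correct_long_text(text, tokenizer, model, max_length=128):
    sentences = [s.strip() for s in text.split("۔") if s.strip()]
    corrected = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt",
                           max_length=max_length, truncation=True)
        outputs = model.generate(**inputs, max_length=max_length)
        corrected.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return "۔ ".join(corrected) + "۔"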

Future Work

  • Expanding the training dataset with more diverse text sources
  • Domain adaptation for specific use cases (legal, medical, etc.)
  • Performance optimization for faster inference
  • Improved handling of complex grammatical structures

Citation

If you use this model in your research or project, please cite:

@misc{urdu_text_correction,
  title = {Urdu Text Correction Model},
  author = {Mahwiz Khalil},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/mahwizzzz/urdu_text_correction}}
}