---
library_name: transformers
license: mit
base_model: facebook/mbart-large-50
tags:
- generated_from_trainer
metrics:
- wer
- bleu
- rouge
model-index:
- name: urdu_text_correction
results: []
datasets:
- mahwizzzz/urdu_error_correction
---
# Urdu Text Correction Model
This model fine-tunes [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) for automatic correction of Urdu text, fixing spelling mistakes and grammatical errors and improving fluency.

[License: MIT](https://opensource.org/licenses/MIT) · [Base model: facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50)
## Overview
The Urdu Text Correction model is designed to automatically detect and correct errors in Urdu text, making it valuable for content editors, publishers, educational institutions, and applications requiring high-quality Urdu text.
## Performance Metrics
The model achieves the following results on our evaluation set (a sketch of how such metrics can be computed follows the table):
| Metric | Score |
|--------|-------|
| BLEU | 0.6996 |
| METEOR | 0.8296 |
| WER (Word Error Rate) | 0.1795 |
| CER (Character Error Rate) | 0.0761 |
| ROUGE-1 | 0.2025 |
| ROUGE-2 | 0.0699 |
| ROUGE-L | 0.2023 |
| Exact Match | 0.1096 |
| Generation Length | 28.4033 |
| Loss | 0.4305 |
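Metrics of this kind can be reproduced with the Hugging Face `evaluate` library. The sketch below uses placeholder prediction/reference pairs rather than the actual evaluation set, so it only illustrates the computation:
```python
# Illustrative metric computation; the prediction/reference pairs are
# placeholders, not the model's actual evaluation set.
import evaluate

wer = evaluate.load("wer")    # requires the `jiwer` package
cer = evaluate.load("cer")
bleu = evaluate.load("bleu")

predictions = ["یہ ایک اچھی بات ہے"]
references = ["یہ ایک اچھی بات ہے"]

print("WER: ", wer.compute(predictions=predictions, references=references))
print("CER: ", cer.compute(predictions=predictions, references=references))
print("BLEU:", bleu.compute(predictions=predictions,
                            references=[[r] for r in references]))
```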
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/urdu_text_correction")
model = AutoModelForSeq2SeqLM.from_pretrained("mahwizzzz/urdu_text_correction")
# Example text with errors
incorrect_text = "یہہ ایک اچھی بات ہے"
# Tokenize and generate correction
inputs = tokenizer(incorrect_text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_length=128)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Original: {incorrect_text}")
print(f"Corrected: {corrected_text}")
```
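For several sentences at once, the same calls work in batched form. This is a sketch; `num_beams=4` and the second sentence are illustrative assumptions, not settings or data documented for this model:
```python
# Batched correction sketch; num_beams=4 is an assumption, not a
# documented generation setting for this model.
sentences = [
    "یہہ ایک اچھی بات ہے",  # same example as above
    "یہ دوسرا جملہ ہے",     # hypothetical second sentence
]
inputs = tokenizer(sentences, return_tensors="pt", padding=True,
                   truncation=True, max_length=128)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
for src, fixed in zip(sentences,
                      tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{src} -> {fixed}")
```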
## Model Description
This is an encoder-decoder model based on mBART-large-50, specifically fine-tuned on Urdu text correction pairs (see the note on language codes after the list below). The model learns to transform incorrect Urdu text into its corrected form, addressing issues such as:
- Spelling mistakes
- Grammar errors
- Word order issues
- Missing diacritics
- Punctuation errors
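Because mBART-50 is multilingual, its tokenizer prefixes inputs with a language code ("ur_PK" for Urdu). The fine-tuned tokenizer shipped with this model may already have this saved; if outputs look wrong, setting it explicitly is worth trying. This is an assumption about the setup, not a documented requirement:
```python
# mBART-50 identifies languages by code; "ur_PK" is Urdu.
# Whether this is needed depends on how the fine-tuned tokenizer was
# saved -- treat it as an assumption, not a documented requirement.
tokenizer.src_lang = "ur_PK"
outputs = model.generate(
    **inputs,
    max_length=128,
    forced_bos_token_id=tokenizer.lang_code_to_id["ur_PK"],
)
```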
## Training Details
### Data
The model was trained on [mahwizzzz/urdu_error_correction](https://huggingface.co/datasets/mahwizzzz/urdu_error_correction), a dataset of incorrect-correct Urdu text pairs.
### Hyperparameters
The model was trained with the following hyperparameters (a configuration sketch follows the list):
- Learning rate: 3e-05
- Batch size: 128 (32 per device with gradient accumulation of 4)
- Optimizer: AdamW with betas=(0.9, 0.999)
- LR scheduler: Cosine with 500 warmup steps
- Training epochs: 3
- Mixed precision: Native AMP
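For reference, these settings map onto `transformers` `Seq2SeqTrainingArguments` roughly as follows. This is a plausible reconstruction, not the exact training script; `output_dir` and the eval-time settings are placeholders:
```python
# A plausible mapping of the hyperparameters above onto
# Seq2SeqTrainingArguments; output_dir and predict_with_generate
# are assumptions, not taken from the actual training script.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="urdu_text_correction",   # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,       # effective batch size 128
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    optim="adamw_torch",                 # AdamW; betas default to (0.9, 0.999)
    fp16=True,                           # native AMP mixed precision
    predict_with_generate=True,          # assumption: generate during eval
)
```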
## Limitations
- The model may not perform well on highly domain-specific text (technical, medical, etc.)
- Very long texts may need to be split into smaller chunks due to the model's maximum sequence length (see the chunking sketch after this list)
- The model may sometimes over-correct dialectal variations or stylistic choices
- Performance depends on the quality and diversity of the training data
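One simple way to handle long inputs is to correct sentence-sized chunks independently and rejoin them. The sketch below splits on the Urdu full stop "۔", which is a simplification; real text may need a proper sentence splitter:
```python
# Minimal chunking sketch: split on the Urdu full stop, correct each
# sentence separately, and rejoin. Splitting on "۔" alone is a
# simplification and will miss other sentence boundaries.
def correct_long_text(text, tokenizer, model, max_length=128):
    sentences = [s.strip() for s in text.split("۔") if s.strip()]
    corrected = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt",
                           truncation=True, max_length=max_length)
        outputs = model.generate(**inputs, max_length=max_length)
        corrected.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return "۔ ".join(corrected) + "۔"
```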
## Future Work
- Expanding the training dataset with more diverse text sources
- Domain adaptation for specific use cases (legal, medical, etc.)
- Performance optimization for faster inference
- Improved handling of complex grammatical structures
## Citation
If you use this model in your research or project, please cite:
```bibtex
@misc{urdu_text_correction,
  title        = {Urdu Text Correction Model},
  author       = {Mahwiz Khalil},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/mahwizzzz/urdu_text_correction}}
}
```