---
license: gemma
language:
- ar
base_model:
- google/gemma-3-1b-pt
pipeline_tag: text-generation
tags:
- arabic
- grammatical-error-correction
- gemma
- unsloth
- arabic-nlp
---
# Gemma 3 1B Arabic Grammatical Error Correction v1
## Model Description
This model is a version of Google's Gemma 3 1B, fine-tuned by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes Arabic sentences as input and outputs their grammatically corrected versions.
**Developed by:** Bahjat Al Mostafa (Alnnahwi)
**Base Model:** google/gemma-3-1b-pt
**Task:** Grammatical Error Correction
**Language:** Arabic
**Version:** 1.0.0
**Organization:** [Alnnahwi](https://alnnahwi.com/)
## Quick Start
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"


def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after the "model" turn marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()
    # Alternative format (in case the formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()
    # If the markers are not found, return the original text
    return generated_text


# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add the Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection: prefer Apple MPS, then CUDA, then CPU
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create the generation pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)


def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Greedy decoding for reproducible corrections
        temperature=None,
        top_p=None,
        top_k=None,
    )
    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)


# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
```
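For reference, the chat template above turns a single user message into the prompt string below. This mirror uses plain string formatting so the prompt shape can be inspected without downloading the tokenizer (`build_prompt` is an illustrative helper, not part of the model's API):

```python
def build_prompt(user_text):
    """Mirror of the Jinja chat template above, for a single user turn
    with add_generation_prompt=True."""
    return (
        "<start_of_turn>user\n"
        f"{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(build_prompt("كيف حالكي اليوم؟"))
```

The trailing `<start_of_turn>model\n` is what `extract_model_response` later searches for when slicing the correction out of the generated text.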
### Example Corrections
| Input (Incorrect) | Output (Corrected) | Error Type |
|---|---|---|
| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |
## Model Details
### Training Data
- **Dataset**: Custom Arabic GEC dataset
- **Training Epochs**: 7
- **Base Architecture**: Gemma 3 (1B parameters)
### Performance
- Designed for Modern Standard Arabic (MSA).
- Handles common grammatical errors.
### Limitations
- Primarily trained on Modern Standard Arabic
- May not handle dialectal Arabic variations optimally
- Performance may vary with very long texts (>512 tokens)
- Context-dependent corrections may sometimes be imperfect
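For inputs that approach the 512-token limit, one workaround is to split the text into sentence-sized chunks, correct each chunk independently, and rejoin the results, at the cost of losing cross-sentence context. A naive sketch (the `max_chars` budget and the splitting regex are illustrative assumptions, not part of the model):

```python
import re

def split_into_chunks(text, max_chars=400):
    """Naively split text after sentence-final punctuation so each
    chunk stays well under the model's comfortable input length."""
    # Split after Arabic/Latin sentence-ending punctuation
    sentences = re.split(r"(?<=[.!؟?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through `correct_arabic_text` from the Quick Start and the corrected chunks joined back together.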
## Use Cases
- **Educational Tools**: Helping Arabic learners with gender agreement and number declension
- **Content Creation**: Proofreading Arabic content for grammatical accuracy
- **Text Processing**: Preprocessing Arabic text for downstream NLP tasks
- **Writing Assistance**: Supporting writers with:
- Proper number-noun agreement
- Correct case declensions
- Spelling standardization
- Punctuation normalization
- **Academic Writing**: Ensuring grammatical correctness in formal Arabic texts
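As a sketch of the text-preprocessing use case, a thin wrapper can run a GEC pass over a corpus before downstream tasks. The correction function is injected (it would be `correct_arabic_text` from the Quick Start; the identity function stands in below so the wrapper runs standalone):

```python
def preprocess_corpus(texts, correct_fn):
    """Apply a correction pass to each document; empty entries are
    preserved so corpus alignment survives."""
    cleaned = []
    for doc in texts:
        doc = " ".join(doc.split())  # collapse stray whitespace first
        cleaned.append(correct_fn(doc) if doc else "")
    return cleaned

# Identity function standing in for the model call:
print(preprocess_corpus(["  لاكن   ما رايكم ", ""], lambda s: s))
# → ['لاكن ما رايكم', '']
```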
## Training Details
- **Fine-tuning Framework**: Unsloth
- **Base Model**: Gemma 3 1B
- **Training Epochs**: 7
- **Optimization**: Memory-efficient fine-tuning techniques
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{gemma3-arabic-gec-v1,
  title        = {Gemma 3 1B Arabic Grammatical Error Correction v1},
  author       = {Bahjat Al Mostafa},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1}},
  note         = {Alnnahwi, \url{https://alnnahwi.com/}}
}
```
## License
This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.
**Important**: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.
## Acknowledgments
- Built upon Google's Gemma 3 1B model
- Fine-tuned using Unsloth framework
- Trained for Arabic Grammatical Error Correction
- Developed by Bahjat Al Mostafa at Alnnahwi
- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources
## Contact
**Author**: Bahjat Al Mostafa [@Bahjat](https://x.com/bahjat/)
**Email**: <[email protected]>
**Organization**: Alnnahwi
**Website**: [https://alnnahwi.com/](https://alnnahwi.com/)
For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.
---
**Model Version**: v1.0.0
**Last Updated**: May 2025
**Model Size**: ~2.0GB