---
license: gemma
language:
- ar
base_model:
- google/gemma-3-1b-pt
pipeline_tag: text-generation
tags:
- arabic
- grammatical-error-correction
- gemma
- unsloth
- arabic-nlp
---
# Gemma 3 1B Arabic Grammatical Error Correction v1

## Model Description

This model is a fine-tuned version of Google's Gemma 3 1B model, trained by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes an Arabic sentence as input and outputs a grammatically corrected version.

**Developed by:** Bahjat Al Mostafa (Alnnahwi)  
**Base Model:** google/gemma-3-1b-pt  
**Task:** Grammatical Error Correction  
**Language:** Arabic  
**Version:** 1.0.0  
**Organization:** [Alnnahwi](https://alnnahwi.com/)

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"

def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after "model" marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()

    # Alternative format (in case formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()

    # If markers not found, return the original text
    return generated_text

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Add Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)

def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Use greedy decoding for evaluation consistency
        temperature=None,
        top_p=None,
        top_k=None,
    )
    
    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)

# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
```
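
To make the prompt format concrete, the sketch below mirrors the chat template from the snippet above in plain Python, so the exact string sent to the model can be inspected without loading the tokenizer. The helper name `build_prompt` is hypothetical and not part of the model's API; it is only meant to reproduce what the template renders for a list of messages.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Plain-Python equivalent of the Gemma-style chat template shown above.

    `build_prompt` is an illustrative helper, not a function shipped with
    the model or with transformers.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <start_of_turn>ROLE\n ... <end_of_turn>\n
        parts.append(
            f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
        )
    if add_generation_prompt:
        # Open the model turn so generation continues from here
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "لاكن ما رايكم"}])
print(prompt)
```

The rendered prompt ends with an open `model` turn, which is why `extract_model_response` looks for the `model` marker to locate the start of the correction.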

### Example Corrections

| Input (Incorrect) | Output (Corrected) | Error Type |
|---|---|---|
| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |
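
When reviewing corrections like the ones in this table, it can help to see exactly which words changed. The following is a minimal sketch using Python's standard `difflib` module; the helper name `changed_words` is hypothetical, and whitespace tokenization is an assumption that may be too coarse for some Arabic text.

```python
import difflib


def changed_words(original, corrected):
    """Return (before, after) pairs for each word span that differs.

    Hypothetical helper for inspecting model output; it splits on
    whitespace only, which is a simplifying assumption.
    """
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


for before, after in changed_words("وجدنا سبعون حالة", "وجدنا سبعين حالة"):
    print(f"{before} -> {after}")
```

Because the comparison is word-based, a change that merges or splits tokens (such as `و سبعين` becoming `وسبعين`) is reported as a single multi-word replacement.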

## Model Details

### Training Data

- **Dataset**: Custom Arabic GEC dataset
- **Training Epochs**: 7
- **Base Architecture**: Gemma 3 1B parameters

### Performance

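- Designed for Modern Standard Arabic (MSA).
- Handles common grammatical errors, such as the gender agreement, number declension, spelling, and punctuation errors shown in the examples above.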

### Limitations

- Primarily trained on Modern Standard Arabic
- May not handle dialectal Arabic variations optimally
- Performance may vary with very long texts (>512 tokens)
- Context-dependent corrections may sometimes be imperfect
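
One possible way to work around the long-text limitation is to split the input into sentence groups and correct each group separately. The sketch below is an assumption, not part of the model's tooling: the helpers `split_sentences` and `chunk_text` are hypothetical, and the word count is only a rough proxy for the token count (a stricter version would count tokens with the model's tokenizer).

```python
import re

# Split on whitespace that follows sentence-final punctuation,
# covering both Latin marks and the Arabic question mark and semicolon.
_SENTENCE_END = re.compile(r"(?<=[.!?؟؛])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def chunk_text(text, max_words=100):
    """Group whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` is kept intact as its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("كيف حالكي اليوم؟ لاكن ما رايكم؟", max_words=3))
```

Each chunk can then be passed to `correct_arabic_text` and the results joined with spaces. Note that splitting removes cross-sentence context, so corrections that depend on earlier sentences may be affected.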

## Use Cases

- **Educational Tools**: Helping Arabic learners with gender agreement and number declension
- **Content Creation**: Proofreading Arabic content for grammatical accuracy
- **Text Processing**: Preprocessing Arabic text for downstream NLP tasks
- **Writing Assistance**: Supporting writers with:
  - Proper number-noun agreement
  - Correct case declensions
  - Spelling standardization
  - Punctuation normalization
- **Academic Writing**: Ensuring grammatical correctness in formal Arabic texts

## Training Details

- **Fine-tuning Framework**: Unsloth
- **Base Model**: Gemma 3 1B
- **Training Epochs**: 7
- **Optimization**: Memory-efficient fine-tuning techniques

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{gemma3-arabic-gec-v1,
  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
  author={Bahjat Al Mostafa},
  organization={Alnnahwi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
  website={https://alnnahwi.com/}
}
```

## License

This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

**Important**: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

## Acknowledgments

- Built upon Google's Gemma 3 1B model
- Fine-tuned using Unsloth framework
- Trained for Arabic Grammatical Error Correction
- Developed by Bahjat Al Mostafa at Alnnahwi
- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources

## Contact

**Author**: Bahjat Al Mostafa ([@Bahjat](https://x.com/bahjat/))  
**Email**: <[email protected]>  
**Organization**: Alnnahwi  
**Website**: [https://alnnahwi.com/](https://alnnahwi.com/)

For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.

---

**Model Version**: v1.0.0  
**Last Updated**: May 2025  
**Model Size**: ~2.0GB