---
license: gemma
language:
- ar
base_model:
- google/gemma-3-1b-pt
pipeline_tag: text-generation
tags:
- arabic
- grammatical-error-correction
- gemma
- unsloth
- arabic-nlp
---

# Gemma 3 1B Arabic Grammatical Error Correction v1

## Model Description

This model is a fine-tuned version of Google's Gemma 3 1B model, trained by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes an Arabic sentence as input and outputs the grammatically corrected version.

- **Developed by:** Bahjat Al Mostafa (Alnnahwi)
- **Base Model:** google/gemma-3-1b-pt
- **Task:** Grammatical Error Correction
- **Language:** Arabic
- **Version:** 1.0.0
- **Organization:** [Alnnahwi](https://alnnahwi.com/)

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"


def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after the "model" turn marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()

    # Alternative format (in case formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()

    # If markers not found, return the original text
    return generated_text


# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Add Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)


def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Use greedy decoding for evaluation consistency
        temperature=None,
        top_p=None,
        top_k=None,
    )

    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)


# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original:  {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original:  كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original:  وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original:  جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original:  لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
```
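
To make the marker parsing in `extract_model_response` easier to follow, here is a minimal sketch that reproduces in plain Python what the chat template above renders for a single user turn, and what the text looks like once the turn tokens are removed. The helper names `build_prompt` and `strip_special_tokens` are illustrative and not part of `transformers`. That the pipeline's decoded text no longer contains `<start_of_turn>` and `<end_of_turn>` is an assumption inferred from the markers the extraction function searches for.

```python
START, END = "<start_of_turn>", "<end_of_turn>"


def build_prompt(user_text: str) -> str:
    """Mirror the chat template for one user message plus the generation prompt."""
    return f"{START}user\n{user_text}{END}\n{START}model\n"


def strip_special_tokens(text: str) -> str:
    """Approximate decoding with the turn tokens dropped (assumption)."""
    return text.replace(START, "").replace(END, "")


prompt = build_prompt("لاكن ما رايكم")
# Simulate a full generation: the prompt followed by the model's answer.
decoded = strip_special_tokens(prompt + "لكن ما رأيكم؟" + END)

# The same slicing logic as extract_model_response, using the primary marker.
marker = "\nmodel\n"
response = decoded[decoded.find(marker) + len(marker):].strip()
print(response)
```

If the decoded text ever keeps the turn tokens, the primary marker `"\nmodel\n"` no longer matches, which is the case the fallback marker `"model\n"` covers.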

### Example Corrections

| Input (Incorrect) | Output (Corrected) | Error Type |
|---|---|---|
| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |

## Model Details

### Training Data

- **Dataset**: Custom Arabic GEC dataset
- **Training Epochs**: 7
- **Base Architecture**: Gemma 3 (1B parameters)

### Performance

- Designed for Modern Standard Arabic (MSA).
- Handles common grammatical errors such as gender agreement, number declension, spelling, and punctuation (see Example Corrections).

### Limitations

- Primarily trained on Modern Standard Arabic
- May not handle dialectal Arabic variations well
- Performance may vary with very long texts (>512 tokens)
- Context-dependent corrections may sometimes be imperfect
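
Because quality may vary on inputs longer than about 512 tokens, one possible workaround is to split long text at sentence-ending punctuation and correct each sentence separately. The sketch below is a heuristic, not part of the model or of `transformers`: `split_sentences` and `correct_long_text` are hypothetical helpers, and `correct_fn` stands for any callable that corrects a single sentence, such as `correct_arabic_text` from Basic Usage.

```python
import re
from typing import Callable, List

# Split on whitespace that follows Latin or Arabic sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the non-empty sentences of `text`, punctuation kept."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def correct_long_text(text: str, correct_fn: Callable[[str], str]) -> str:
    """Correct each sentence independently and rejoin with single spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))
```

Splitting by sentence keeps each request short, but it also removes cross-sentence context, so corrections that depend on an earlier sentence may be missed.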

## Use Cases

- **Educational Tools**: Helping Arabic learners with gender agreement and number declension
- **Content Creation**: Proofreading Arabic content for grammatical accuracy
- **Text Processing**: Preprocessing Arabic text for downstream NLP tasks
- **Writing Assistance**: Supporting writers with:
  - Proper number-noun agreement
  - Correct case declensions
  - Spelling standardization
  - Punctuation normalization
- **Academic Writing**: Ensuring grammatical correctness in formal Arabic texts
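
For educational or proofreading tools, it can help to show which words changed, not just the corrected sentence. The following is a minimal sketch using the standard library's `difflib`. `word_changes` is a hypothetical helper, and comparing on whitespace-separated words is a simplifying assumption, so a spacing fix shows up as a change to the neighbouring words.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_changes(original: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for every word span that differs."""
    src, dst = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    return [
        (" ".join(src[i1:i2]), " ".join(dst[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for before, after in word_changes("وجدنا سبعون حالة", "وجدنا سبعين حالة"):
    print(f"{before} -> {after}")
```

An empty `before` means words were inserted, and an empty `after` means words were removed.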

## Training Details

- **Fine-tuning Framework**: Unsloth
- **Base Model**: Gemma 3 1B
- **Training Epochs**: 7
- **Optimization**: Memory-efficient fine-tuning techniques

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{gemma3-arabic-gec-v1,
  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
  author={Bahjat Al Mostafa},
  organization={Alnnahwi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
  website={https://alnnahwi.com/}
}
```

## License

This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

**Important**: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

## Acknowledgments

- Built upon Google's Gemma 3 1B model
- Fine-tuned using the Unsloth framework
- Trained for Arabic Grammatical Error Correction
- Developed by Bahjat Al Mostafa at Alnnahwi
- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources

## Contact

- **Author**: Bahjat Al Mostafa [@Bahjat](https://x.com/bahjat/)
- **Email**: <[email protected]>
- **Organization**: Alnnahwi
- **Website**: [https://alnnahwi.com/](https://alnnahwi.com/)

For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.

---

- **Model Version**: v1.0.0
- **Last Updated**: May 2025
- **Model Size**: ~2.0GB