---
license: gemma
language:
- ar
base_model:
- google/gemma-3-1b-pt
pipeline_tag: text-generation
tags:
- arabic
- grammatical-error-correction
- gemma
- unsloth
- arabic-nlp
---
# Gemma 3 1B Arabic Grammatical Error Correction v1
## Model Description
This model is a version of Google's Gemma 3 1B, fine-tuned by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes Arabic sentences as input and outputs their grammatically corrected versions.
**Developed by:** Bahjat Al Mostafa (Alnnahwi)
**Base Model:** google/gemma-3-1b-pt
**Task:** Grammatical Error Correction
**Language:** Arabic
**Version:** 1.0.0
**Organization:** [Alnnahwi](https://alnnahwi.com/)
## Quick Start
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"


def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after the "model" turn marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()
    # Alternative format (in case the formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()
    # If the markers are not found, return the original text
    return generated_text


# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add the Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection: prefer Apple MPS, then CUDA, then CPU
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create the generation pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)


def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Greedy decoding for reproducible corrections
        temperature=None,
        top_p=None,
        top_k=None,
    )
    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)


# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
```
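For reference, the chat template above turns a single user message into the prompt string below. This mirror uses plain string formatting so the prompt shape can be inspected without downloading the tokenizer (`build_prompt` is an illustrative helper, not part of the model's API):

```python
def build_prompt(user_text):
    """Mirror of the Jinja chat template above, for a single user turn
    with add_generation_prompt=True."""
    return (
        "<start_of_turn>user\n"
        f"{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(build_prompt("كيف حالكي اليوم؟"))
```

The trailing `<start_of_turn>model\n` is what `extract_model_response` later searches for when slicing the correction out of the generated text.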
### Example Corrections
| Input (Incorrect) | Output (Corrected) | Error Type |
|---|---|---|
| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |
## Model Details
### Training Data
- **Dataset**: Custom Arabic GEC dataset
- **Training Epochs**: 7
- **Base Architecture**: Gemma 3 (1B parameters)
### Performance
- Designed for Modern Standard Arabic (MSA).
- Handles common grammatical errors.
### Limitations
- Primarily trained on Modern Standard Arabic
- May not handle dialectal Arabic variations optimally
- Performance may vary with very long texts (>512 tokens)
- Context-dependent corrections may sometimes be imperfect
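For inputs that approach the 512-token limit, one workaround is to split the text into sentence-sized chunks, correct each chunk independently, and rejoin the results, at the cost of losing cross-sentence context. A naive sketch (the `max_chars` budget and the splitting regex are illustrative assumptions, not part of the model):

```python
import re

def split_into_chunks(text, max_chars=400):
    """Naively split text after sentence-final punctuation so each
    chunk stays well under the model's comfortable input length."""
    # Split after Arabic/Latin sentence-ending punctuation
    sentences = re.split(r"(?<=[.!؟?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through `correct_arabic_text` from the Quick Start and the corrected chunks joined back together.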
## Use Cases
- **Educational Tools**: Helping Arabic learners with gender agreement and number declension
- **Content Creation**: Proofreading Arabic content for grammatical accuracy
- **Text Processing**: Preprocessing Arabic text for downstream NLP tasks
- **Writing Assistance**: Supporting writers with:
- Proper number-noun agreement
- Correct case declensions
- Spelling standardization
- Punctuation normalization
- **Academic Writing**: Ensuring grammatical correctness in formal Arabic texts
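As a sketch of the text-preprocessing use case, a thin wrapper can run a GEC pass over a corpus before downstream tasks. The correction function is injected (it would be `correct_arabic_text` from the Quick Start; the identity function stands in below so the wrapper runs standalone):

```python
def preprocess_corpus(texts, correct_fn):
    """Apply a correction pass to each document; empty entries are
    preserved so corpus alignment survives."""
    cleaned = []
    for doc in texts:
        doc = " ".join(doc.split())  # collapse stray whitespace first
        cleaned.append(correct_fn(doc) if doc else "")
    return cleaned

# Identity function standing in for the model call:
print(preprocess_corpus(["  لاكن   ما رايكم ", ""], lambda s: s))
# → ['لاكن ما رايكم', '']
```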
## Training Details
- **Fine-tuning Framework**: Unsloth
- **Base Model**: Gemma 3 1B
- **Training Epochs**: 7
- **Optimization**: Memory-efficient fine-tuning techniques
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{gemma3-arabic-gec-v1,
  title        = {Gemma 3 1B Arabic Grammatical Error Correction v1},
  author       = {Bahjat Al Mostafa},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1}},
  note         = {Alnnahwi, \url{https://alnnahwi.com/}}
}
```
## License
This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.
**Important**: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.
## Acknowledgments
- Built upon Google's Gemma 3 1B model
- Fine-tuned using Unsloth framework
- Trained for Arabic Grammatical Error Correction
- Developed by Bahjat Al Mostafa at Alnnahwi
- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources
## Contact
**Author**: Bahjat Al Mostafa [@Bahjat](https://x.com/bahjat/)
**Email**: <[email protected]>
**Organization**: Alnnahwi
**Website**: [https://alnnahwi.com/](https://alnnahwi.com/)
For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.
---
**Model Version**: v1.0.0
**Last Updated**: May 2025
**Model Size**: ~2.0GB