Gemma 3 1B Arabic Grammatical Error Correction v1

Model Description

This model is Google's Gemma 3 1B, fine-tuned by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes an Arabic sentence as input and returns a grammatically corrected version.

Developed by: Bahjat Al Mostafa (Alnnahwi)
Base Model: google/gemma-3-1b
Task: Grammatical Error Correction
Language: Arabic
Version: 1.0.0
Organization: Alnnahwi

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"

def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after "model" marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()

    # Alternative format (in case formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()

    # If markers not found, return the original text
    return generated_text

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Add Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)

def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Use greedy decoding for evaluation consistency
        temperature=None,
        top_p=None,
        top_k=None,
    )
    
    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)

# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
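To make the prompt format and the "\nmodel\n" marker used in extract_model_response concrete, the sketch below mirrors the chat template string from the usage code in plain Python. render_gemma_prompt and strip_special_tokens are hypothetical helpers for illustration, not part of transformers, and the sketch assumes the pipeline's decoded text drops the <start_of_turn> and <end_of_turn> tokens, which is what would leave bare "user" and "model" lines in the output.

```python
from typing import Dict, List

SPECIAL_TOKENS = ("<start_of_turn>", "<end_of_turn>")


def render_gemma_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Plain-Python mirror of the chat template string set on the tokenizer."""
    parts = [
        f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Opens the model's turn so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


def strip_special_tokens(text: str) -> str:
    """Assumption: approximates decoding with special tokens skipped."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text


prompt = render_gemma_prompt([{"role": "user", "content": "لاكن ما رايكم"}])

# Simulate the full decoded text: prompt followed by a model response.
decoded = strip_special_tokens(prompt + "لكن ما رأيكم؟")

# Same marker logic as extract_model_response.
marker = "\nmodel\n"
response = decoded[decoded.find(marker) + len(marker):].strip()
print(response)
```

Because the marker search takes the first occurrence, an input that itself contains a line consisting only of "model" would be split in the wrong place; that edge case is worth keeping in mind when feeding arbitrary text.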

Example Corrections

Input (Incorrect)	Output (Corrected)	Error Type
كيف حالكي اليوم؟	كيف حالك اليوم؟	Gender agreement
وجدنا سبعون حالة	وجدنا سبعين حالة	Number declension
جاء في تسعة و سبعين سورة.	جاء في تسع وسبعين سورة.	Number gender + spacing
لاكن ما رايكم	لكن ما رأيكم؟	Spelling + punctuation
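
To see which words a correction changed, as in the table above, a word-level diff can help. The sketch below uses Python's standard difflib; changed_spans is a hypothetical helper name, and whitespace tokenization is an assumption, so punctuation attached to a word is treated as part of that word.

```python
import difflib
from typing import List, Tuple


def changed_spans(source: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) word spans that differ between two sentences."""
    src_tokens = source.split()
    out_tokens = corrected.split()
    matcher = difflib.SequenceMatcher(a=src_tokens, b=out_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions have an empty "before"; deletions an empty "after".
            spans.append((" ".join(src_tokens[i1:i2]), " ".join(out_tokens[j1:j2])))
    return spans


for before, after in changed_spans("وجدنا سبعون حالة", "وجدنا سبعين حالة"):
    print(f"{before} -> {after}")
```

A diff like this only shows what changed, not why; the error-type labels in the table still have to come from a human or a separate classifier.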

Model Details

Training Data

Dataset: Custom Arabic GEC dataset
Training Epochs: 7
Base Architecture: Gemma 3 (1B parameters)

Performance

Designed for Modern Standard Arabic (MSA).
Handles common grammatical errors such as gender agreement, number declension, spelling, and punctuation (see Example Corrections).
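
To measure behavior on your own sentence pairs, one simple option is an exact-match rate. The sketch below is an assumption about how a reader might check the model, not the author's evaluation method (none is stated). exact_match_rate is a hypothetical helper, and correct stands for any callable that maps a sentence to its correction, such as correct_arabic_text from the Quick Start.

```python
from typing import Callable, Iterable, Tuple


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing noise does not count as an error."""
    return " ".join(text.split())


def exact_match_rate(pairs: Iterable[Tuple[str, str]],
                     correct: Callable[[str], str]) -> float:
    """Fraction of (source, reference) pairs whose correction equals the reference."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(
        1 for source, reference in pairs
        if _normalize(correct(source)) == _normalize(reference)
    )
    return hits / len(pairs)
```

Usage would look like exact_match_rate(my_pairs, correct_arabic_text). Exact match is strict: a correction that differs from the reference by a single optional punctuation mark counts as a miss, so it is best read as a lower bound.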

Limitations

Primarily trained on Modern Standard Arabic
May not handle dialectal Arabic variations optimally
Performance may vary with very long texts (>512 tokens)
Context-dependent corrections may sometimes be imperfect
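
Given the note on long texts, one workaround is to split the input into groups of sentences and correct each group separately. The following is a minimal sketch, assuming sentence-final punctuation is a reasonable split point and using whitespace word count as a rough stand-in for tokens (the real token count depends on the tokenizer). split_sentences and chunk_text are hypothetical helpers.

```python
import re
from typing import Callable, List

# Split on whitespace that follows sentence-final punctuation,
# including the Arabic question mark.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def chunk_text(text: str,
               max_units: int = 200,
               count_units: Callable[[str], int] = lambda s: len(s.split())
               ) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_units units.

    A single sentence longer than max_units is kept as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in split_sentences(text):
        n = count_units(sentence)
        if current and used + n > max_units:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed to correct_arabic_text and the results joined. Splitting this way drops cross-sentence context, so corrections that depend on an earlier sentence may get worse rather than better.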

Use Cases

Educational Tools: Helping Arabic learners with gender agreement and number declension
Content Creation: Proofreading Arabic content for grammatical accuracy
Text Processing: Preprocessing Arabic text for downstream NLP tasks
Writing Assistance: Supporting writers with:
- Proper number-noun agreement
- Correct case declensions
- Spelling standardization
- Punctuation normalization
Academic Writing: Ensuring grammatical correctness in formal Arabic texts

Training Details

Fine-tuning Framework: Unsloth
Base Model: Gemma 3 1B
Training Epochs: 7
Optimization: Memory-efficient fine-tuning techniques

Citation

If you use this model in your research or applications, please cite:

@misc{gemma3-arabic-gec-v1,
  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
  author={Bahjat Al Mostafa},
  organization={Alnnahwi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
  website={https://alnnahwi.com/}
}

License

This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

Important: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

Acknowledgments

Built upon Google's Gemma 3 1B model
Fine-tuned using the Unsloth framework
Trained for Arabic Grammatical Error Correction
Developed by Bahjat Al Mostafa at Alnnahwi
Visit Alnnahwi for more Arabic NLP resources

Contact

Author: Bahjat Al Mostafa (@Bahjat)
Email: [email protected]
Organization: Alnnahwi
Website: https://alnnahwi.com/

For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.

Model Version: v1.0.0
Last Updated: May 2025
Model Size: ~2.0GB
