Gemma 3 1B Arabic Grammatical Error Correction v1

Model Description

This model is a fine-tuned version of Google's Gemma 3 1B, trained by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes an Arabic sentence as input and returns a grammatically corrected version.

Developed by: Bahjat Al Mostafa (Alnnahwi)
Base Model: google/gemma-3-1b
Task: Grammatical Error Correction
Language: Arabic
Version: 1.0.0
Organization: Alnnahwi

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"

def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after "model" marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()

    # Alternative format (in case formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()

    # If markers not found, return the original text
    return generated_text

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Add Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)

def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Use greedy decoding for evaluation consistency
        temperature=None,
        top_p=None,
        top_k=None,
    )
    
    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)

# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
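
The marker search in extract_model_response relies on the literal text "model\n" appearing before the answer. A minimal sketch of an alternative follows, assuming the pipeline returns the prompt followed by the completion: strip the known prompt prefix instead of searching for a marker. The helper names render_gemma_prompt and strip_prompt are hypothetical and not part of this model's code. The first one mirrors the chat template string above in plain Python so that the example runs without downloading the model, and the pipeline output is simulated.

```python
def render_gemma_prompt(messages, add_generation_prompt=True):
    """Plain-Python mirror of the chat template set on the tokenizer above."""
    parts = [
        f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n"
        for m in messages
    ]
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


def strip_prompt(full_text, prompt):
    """Return only the completion when full_text starts with the prompt."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    # Fall back to the unchanged text if the prefix does not match.
    return full_text.strip()


prompt = render_gemma_prompt([{"role": "user", "content": "لاكن ما رايكم"}])
# Simulated pipeline output (assumption): the prompt followed by the answer.
simulated = prompt + "لكن ما رأيكم؟"
print(strip_prompt(simulated, prompt))
```

Another option that may be simpler in practice is passing return_full_text=False in the pipe(...) call, which asks the text-generation pipeline to return only the newly generated text.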

Example Corrections

Input (Incorrect) | Output (Corrected) | Error Type
كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement
وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension
جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing
لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation
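
To show a reader which words the model changed, the input and output can be compared token by token. The sketch below is one possible approach using Python's standard difflib module. The function name token_changes is hypothetical, and splitting on whitespace is a simplifying assumption that treats attached punctuation as part of the word.

```python
import difflib


def token_changes(original, corrected):
    """List (operation, original_tokens, corrected_tokens) for every edit."""
    src = original.split()
    dst = corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Join the affected tokens so multi-word edits stay together.
            changes.append((op, " ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return changes


for op, before, after in token_changes("وجدنا سبعون حالة", "وجدنا سبعين حالة"):
    print(op, repr(before), "->", repr(after))
```

For the number declension example this reports a single replace of سبعون with سبعين. Edits that merge or split words, such as the spacing fix in the third row, come back as one multi-token replacement.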

Model Details

Training Data

  • Dataset: Custom Arabic GEC dataset
  • Training Epochs: 7
  • Base Architecture: Gemma 3 (1B parameters)

Performance

  • Designed for Modern Standard Arabic (MSA).
  • Handles common grammatical errors, including gender agreement, number declension, spelling, and punctuation.

Limitations

  • Primarily trained on Modern Standard Arabic
  • May not handle dialectal Arabic variations optimally
  • Performance may vary with very long texts (>512 tokens)
  • Context-dependent corrections may sometimes be imperfect
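
One way to work around the length limitation is to correct long texts sentence by sentence. The sketch below is a rough illustration only: the helper names split_sentences and correct_long_text are hypothetical, the set of sentence-final marks is an assumption, and correct_fn stands for any function that maps a string to its correction, such as correct_arabic_text from the Quick Start.

```python
import re

# Whitespace that follows sentence-final punctuation, including the
# Arabic question mark and Arabic semicolon (assumed set of marks).
_SENTENCE_END = re.compile(r"(?<=[.!?؟؛])\s+")


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def correct_long_text(text, correct_fn):
    """Apply correct_fn to each sentence and rejoin with single spaces."""
    return " ".join(correct_fn(sentence) for sentence in split_sentences(text))


# Stand-in corrector for demonstration; replace with correct_arabic_text.
print(correct_long_text("first sentence. second one? third", str.upper))
```

This approach discards paragraph breaks and gives the model no context across sentence boundaries, so context-dependent corrections may get worse. A single very long sentence can also still exceed the limit.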

Use Cases

  • Educational Tools: Helping Arabic learners with gender agreement and number declension
  • Content Creation: Proofreading Arabic content for grammatical accuracy
  • Text Processing: Preprocessing Arabic text for downstream NLP tasks
  • Writing Assistance: Supporting writers with:
    • Proper number-noun agreement
    • Correct case declensions
    • Spelling standardization
    • Punctuation normalization
  • Academic Writing: Ensuring grammatical correctness in formal Arabic texts

Training Details

  • Fine-tuning Framework: Unsloth
  • Base Model: Gemma 3 1B
  • Training Epochs: 7
  • Optimization: Memory-efficient fine-tuning techniques

Citation

If you use this model in your research or applications, please cite:

@misc{gemma3-arabic-gec-v1,
  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
  author={Bahjat Al Mostafa},
  organization={Alnnahwi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
  website={https://alnnahwi.com/}
}

License

This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

Important: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

Acknowledgments

  • Built upon Google's Gemma 3 1B model
  • Fine-tuned using the Unsloth framework
  • Trained for Arabic Grammatical Error Correction
  • Developed by Bahjat Al Mostafa at Alnnahwi
  • Visit Alnnahwi for more Arabic NLP resources

Contact

Author: Bahjat Al Mostafa (@Bahjat)
Email: [email protected]
Organization: Alnnahwi
Website: https://alnnahwi.com/

For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.


Model Version: v1.0.0
Last Updated: May 2025
Model Size: ~2.0GB
