---
license: gemma
language:
- ar
base_model:
- google/gemma-3-1b-pt
pipeline_tag: text-generation
tags:
- arabic
- grammatical-error-correction
- gemma
- unsloth
- arabic-nlp
---

# Gemma 3 1B Arabic Grammatical Error Correction v1

## Model Description

This model is a fine-tuned version of Google's Gemma 3 1B, trained by Alnnahwi specifically for Arabic Grammatical Error Correction (GEC). It takes Arabic sentences as input and outputs their grammatically corrected versions.

- **Developed by**: Bahjat Al Mostafa (Alnnahwi)
- **Base Model**: google/gemma-3-1b-pt
- **Task**: Grammatical Error Correction
- **Language**: Arabic
- **Version**: 1.0.0
- **Organization**: [Alnnahwi](https://alnnahwi.com/)

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"


def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after the "model" turn marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()

    # Alternative format (in case formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()

    # If markers not found, return the original text
    return generated_text


# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)


def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # Use greedy decoding for evaluation consistency
        temperature=None,
        top_p=None,
        top_k=None,
    )

    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)


# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
```
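### Direct Usage Without the Pipeline (Sketch)

If you prefer not to use the `pipeline` wrapper, the model can also be driven directly with `AutoModelForCausalLM` and `generate()`. The snippet below is a minimal sketch that reuses the same manually set chat template as the Basic Usage example; the device choice and generation settings are illustrative, not requirements of the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Same manually added Gemma chat template as in the Basic Usage example.
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
model.eval()

messages = [{"role": "user", "content": "وجدنا سبعون حالة"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, i.e. the corrected sentence.
correction = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(correction.strip())  # Expected: وجدنا سبعين حالة
```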
### Example Corrections

| Input (Incorrect) | Output (Corrected) | Error Type |
|---|---|---|
| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |

## Model Details

### Training Data

- **Dataset**: Custom Arabic GEC dataset
- **Training Epochs**: 7
- **Base Architecture**: Gemma 3 (1B parameters)

### Performance

- Targets Modern Standard Arabic (MSA).
- Handles common grammatical error types such as gender agreement, number declension, spelling, and punctuation (see the examples above).

### Limitations

- Primarily trained on Modern Standard Arabic
- May not handle dialectal Arabic variations optimally
- Performance may vary with very long texts (>512 tokens); see the chunking sketch at the end of this card
- Context-dependent corrections may sometimes be imperfect

## Use Cases

- **Educational Tools**: Helping Arabic learners with gender agreement and number declension
- **Content Creation**: Proofreading Arabic content for grammatical accuracy
- **Text Processing**: Preprocessing Arabic text for downstream NLP tasks
- **Writing Assistance**: Supporting writers with:
  - Proper number-noun agreement
  - Correct case declensions
  - Spelling standardization
  - Punctuation normalization
- **Academic Writing**: Ensuring grammatical correctness in formal Arabic texts

## Training Details

- **Fine-tuning Framework**: Unsloth
- **Base Model**: Gemma 3 1B
- **Training Epochs**: 7
- **Optimization**: Memory-efficient fine-tuning techniques

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{gemma3-arabic-gec-v1,
  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
  author={Bahjat Al Mostafa},
  organization={Alnnahwi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
  website={https://alnnahwi.com/}
}
```

## License

This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

**Important**: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

## Acknowledgments

- Built upon Google's Gemma 3 1B model
- Fine-tuned using the Unsloth framework
- Trained for Arabic Grammatical Error Correction
- Developed by Bahjat Al Mostafa at Alnnahwi
- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources

## Contact

- **Author**: Bahjat Al Mostafa [@Bahjat](https://x.com/bahjat/)
- **Email**:
- **Organization**: Alnnahwi
- **Website**: [https://alnnahwi.com/](https://alnnahwi.com/)

For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.

---

**Model Version**: v1.0.0
**Last Updated**: May 2025
**Model Size**: ~2.0GB
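## Appendix: Correcting Long Texts (Sketch)

As noted under Limitations, quality may degrade on inputs longer than about 512 tokens. One simple workaround is to split the input on sentence boundaries and run each chunk through the `correct_arabic_text` function from the Basic Usage example. The sentence-splitting regex and the character budget below are illustrative assumptions, not part of the model or its training setup.

```python
import re

# Assumes correct_arabic_text() from the Basic Usage example is already defined.
# max_chars is a rough character proxy for the ~512-token limit (an assumption).

def correct_long_text(text, max_chars=1000):
    """Split long Arabic text on sentence boundaries and correct each chunk."""
    # Split after sentence-final punctuation (period, Arabic question mark, exclamation mark).
    sentences = re.split(r"(?<=[.؟!])\s+", text.strip())

    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)

    # Correct each chunk independently and stitch the results back together.
    return " ".join(correct_arabic_text(chunk) for chunk in chunks)
```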