alnnahwi committed
Commit 547be79 · verified · 1 Parent(s): ca72869

Initial upload: Arabic GEC Gemma 3 1B v1

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,206 @@
# Gemma 3 1B Arabic Grammatical Error Correction v1

## Model Description

This model is a fine-tuned version of Google's Gemma 3 1B, trained by Alnnahwi specifically for Arabic Grammatical Error Correction (GEC). It takes Arabic sentences as input and outputs their grammatically corrected versions.

**Developed by:** Bahjat Al Mostafa (Alnnahwi)
**Base Model:** google/gemma-3-1b
**Task:** Grammatical Error Correction
**Language:** Arabic
**Version:** 1.0.0
**Organization:** [Alnnahwi](https://alnnahwi.com/)

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import pipeline, AutoTokenizer
import torch

MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"

def extract_model_response(generated_text):
    """Extract just the model's response from the full generated text."""
    # Find the position after "model" marker
    model_marker = "\nmodel\n"
    if model_marker in generated_text:
        response_start = generated_text.find(model_marker) + len(model_marker)
        return generated_text[response_start:].strip()

    # Alternative format (in case formatting changes)
    alt_marker = "model\n"
    if alt_marker in generated_text:
        response_start = generated_text.find(alt_marker) + len(alt_marker)
        return generated_text[response_start:].strip()

    # If markers not found, return the original text
    return generated_text

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Add Gemma chat template manually
tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Create pipeline
pipe = pipeline(
    "text-generation",
    model=MODEL_NAME,
    tokenizer=tokenizer,
    device=device,
)

def correct_arabic_text(text):
    """Correct Arabic text using the fine-tuned model."""
    messages = [{"role": "user", "content": text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

    full_text = outputs[0]["generated_text"]
    return extract_model_response(full_text)

# Example usage with real outputs
test_inputs = [
    "كيف حالكي اليوم؟",
    "وجدنا سبعون حالة",
    "جاء في تسعة و سبعين سورة.",
    "لاكن ما رايكم",
]

for text in test_inputs:
    corrected = correct_arabic_text(text)
    print(f"Original: {text}")
    print(f"Corrected: {corrected}")
    print("-" * 50)

# Expected output:
# Original: كيف حالكي اليوم؟
# Corrected: كيف حالك اليوم؟
# --------------------------------------------------
# Original: وجدنا سبعون حالة
# Corrected: وجدنا سبعين حالة
# --------------------------------------------------
# Original: جاء في تسعة و سبعين سورة.
# Corrected: جاء في تسع وسبعين سورة.
# --------------------------------------------------
# Original: لاكن ما رايكم
# Corrected: لكن ما رأيكم؟
# --------------------------------------------------
```

### Example Corrections

| Input (Incorrect) | Output (Corrected) | Error Type |
|---|---|---|
| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |

## Model Details

### Training Data

- **Dataset**: Custom Arabic GEC dataset
- **Training Epochs**: 7
- **Base Architecture**: Gemma 3 (1B parameters)

### Performance

- Designed for Modern Standard Arabic (MSA).
- Handles common grammatical errors such as gender agreement, number declension, spelling, and punctuation (see the examples above).

### Limitations

- Primarily trained on Modern Standard Arabic
- May not handle dialectal Arabic variations optimally
- Performance may vary with very long texts (>512 tokens); a chunking workaround is sketched below
- Context-dependent corrections may sometimes be imperfect

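Since quality may degrade beyond roughly 512 tokens, one practical workaround is to split long passages into sentence-sized chunks and correct each chunk independently. A minimal sketch, assuming the `correct_arabic_text` helper from the Quick Start above; `correct_long_text` and `max_chars` are hypothetical names, and the character budget is only a rough proxy for the token limit:

```python
import re

def correct_long_text(text, max_chars=300):
    """Correct a long passage by chunking it at sentence boundaries."""
    # Split on Arabic or Latin sentence-final punctuation, keeping the delimiter.
    sentences = re.split(r"(?<=[.!?؟])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    # Correct each chunk independently and rejoin.
    return " ".join(correct_arabic_text(chunk) for chunk in chunks)
```

Note that chunking trades away cross-sentence context, so corrections that depend on neighboring sentences may be missed.
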
## Use Cases

- **Educational Tools**: Helping Arabic learners with gender agreement and number declension
- **Content Creation**: Proofreading Arabic content for grammatical accuracy
- **Text Processing**: Preprocessing Arabic text for downstream NLP tasks (see the batch sketch after this list)
- **Writing Assistance**: Supporting writers with:
  - Proper number-noun agreement
  - Correct case declensions
  - Spelling standardization
  - Punctuation normalization
- **Academic Writing**: Ensuring grammatical correctness in formal Arabic texts

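For the text-processing use case, the pipeline also accepts a list of prompts, which is convenient when cleaning a corpus in bulk. A minimal sketch, assuming the `pipe`, `tokenizer`, and `extract_model_response` objects from the Quick Start above; `raw_corpus` is a hypothetical input list:

```python
# Hypothetical corpus to be cleaned before downstream NLP processing.
raw_corpus = ["وجدنا سبعون حالة", "لاكن ما رايكم"]

# Build one chat-formatted prompt per document.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": text}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for text in raw_corpus
]

# The pipeline batches list inputs; outputs arrive in input order.
results = pipe(prompts, max_new_tokens=512, batch_size=2)
corrected_corpus = [extract_model_response(r[0]["generated_text"]) for r in results]
```
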
## Training Details

- **Fine-tuning Framework**: Unsloth
- **Base Model**: Gemma 3 1B
- **Training Epochs**: 7
- **Optimization**: Memory-efficient fine-tuning techniques (an illustrative recipe is sketched below)

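The card states only the framework (Unsloth), the base model, and the epoch count; the dataset construction, LoRA configuration, and hyperparameters below are illustrative assumptions, not the published recipe. A rough sketch of what a typical Unsloth LoRA fine-tune of this shape might look like:

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Hypothetical (source, corrected) GEC pairs rendered in the Gemma chat format.
pairs = [("لاكن ما رايكم", "لكن ما رأيكم؟")]
def to_chat(src, tgt):
    return (f"<start_of_turn>user\n{src}<end_of_turn>\n"
            f"<start_of_turn>model\n{tgt}<end_of_turn>\n")
dataset = Dataset.from_dict({"text": [to_chat(s, t) for s, t in pairs]})

# Load the base model with Unsloth's memory-efficient loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",  # assumption: exact base checkpoint not stated
    max_seq_length=512,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # assumed LoRA rank
    lora_alpha=16,   # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        num_train_epochs=7,              # matches the card
        per_device_train_batch_size=8,   # assumed
        learning_rate=2e-4,              # assumed
        output_dir="outputs",
    ),
)
trainer.train()
```
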
## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{gemma3-arabic-gec-v1,
  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
  author={Bahjat Al Mostafa},
  organization={Alnnahwi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
  website={https://alnnahwi.com/}
}
```

## License

This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

**Important**: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

## Acknowledgments

- Built upon Google's Gemma 3 1B model
- Fine-tuned using the Unsloth framework
- Trained for Arabic Grammatical Error Correction
- Developed by Bahjat Al Mostafa at Alnnahwi
- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources

## Contact

**Author**: Bahjat Al Mostafa
**Email**: <[email protected]>
**Organization**: Alnnahwi
**Website**: [https://alnnahwi.com/](https://alnnahwi.com/)

For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.

---

**Model Version**: v1.0.0
**Last Updated**: May 2025
**Model Size**: ~1.9 GB
added_tokens.json ADDED
@@ -0,0 +1,3 @@
{
  "<image_soft_token>": 262144
}
chat_template.jinja ADDED
@@ -0,0 +1,5 @@
{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user
{{ message['content'] }}<end_of_turn>
{% elif message['role'] == 'model' %}<start_of_turn>model
{{ message['content'] }}<end_of_turn>
{% endif %}{% endfor %}
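Unlike the template the README sets manually, this bundled template has no `add_generation_prompt` branch, so the trailing model-turn marker must be appended by hand. A minimal rendering sketch with `jinja2`, assuming a local checkout of the repo:

```python
from jinja2 import Template

# Render the bundled template directly (assumes chat_template.jinja is local).
with open("chat_template.jinja") as f:
    template = Template(f.read())

prompt = template.render(messages=[{"role": "user", "content": "لاكن ما رايكم"}])
# No add_generation_prompt branch here, so append the model turn manually.
prompt += "<start_of_turn>model\n"
print(prompt)
```
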
config.json ADDED
@@ -0,0 +1,36 @@
{
  "architectures": [
    "Gemma3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 1152,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 32768,
  "model_type": "gemma3_text",
  "num_attention_heads": 4,
  "num_hidden_layers": 26,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 256,
  "rms_norm_eps": 1e-06,
  "rope_local_base_freq": 10000,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": 512,
  "sliding_window_pattern": 6,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.4",
  "unsloth_fixed": true,
  "unsloth_version": "2025.5.9",
  "use_cache": true,
  "vocab_size": 262144
}
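The architecture fields above can be inspected programmatically without downloading the weights. A quick sketch, assuming the repo id from the README:

```python
from transformers import AutoConfig

# Fetches only config.json, not the weights.
cfg = AutoConfig.from_pretrained("alnnahwi/gemma-3-1b-arabic-gec-v1")
print(cfg.model_type)         # gemma3_text
print(cfg.hidden_size)        # 1152
print(cfg.num_hidden_layers)  # 26
print(cfg.vocab_size)         # 262144
```
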
generation_config.json ADDED
@@ -0,0 +1,14 @@
{
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "do_sample": true,
  "eos_token_id": [
    1,
    106
  ],
  "max_length": 32768,
  "pad_token_id": 0,
  "top_k": 64,
  "top_p": 0.95,
  "transformers_version": "4.52.4"
}
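These are the repo-level sampling defaults (sampling enabled with top_k=64 and top_p=0.95); per-call arguments such as the README's `temperature=0.7, top_p=0.9` override them at generation time. A quick sketch, assuming the repo id from the README:

```python
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("alnnahwi/gemma-3-1b-arabic-gec-v1")
print(gen_cfg.do_sample, gen_cfg.top_k, gen_cfg.top_p)  # True 64 0.95
```
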
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6175daad5548277650840a3f4d25e4967a8ca3d4b87393eefdb6ea8b8f6bc6df
size 1999811208
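The LFS pointer records the weights' digest, so a downloaded copy can be verified locally. A minimal sketch, assuming `model.safetensors` has been fetched into the working directory:

```python
import hashlib

# sha256 from the LFS pointer above.
EXPECTED_SHA256 = "6175daad5548277650840a3f4d25e4967a8ca3d4b87393eefdb6ea8b8f6bc6df"

h = hashlib.sha256()
with open("model.safetensors", "rb") as f:
    # Hash in 1 MiB chunks to keep memory use flat on a ~2 GB file.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
assert h.hexdigest() == EXPECTED_SHA256, "checksum mismatch"
print("model.safetensors verified")
```
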
special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
{
  "boi_token": "<start_of_image>",
  "bos_token": {
    "content": "<bos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eoi_token": "<end_of_image>",
  "eos_token": {
    "content": "<eos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "image_token": "<image_soft_token>",
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
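These entries surface as the tokenizer's special-token attributes once loaded. A quick sketch, assuming the repo id from the README:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("alnnahwi/gemma-3-1b-arabic-gec-v1")
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.unk_token)
# <bos> <eos> <pad> <unk>
```
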
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a872e3bb510a751b26bd65f61aad05f948c9cf78fe4f787aebd197b393cc4081
size 33384667
tokenizer.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
size 4689074
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff