	---
	license: gemma
	language:
	- ar
	base_model:
	- google/gemma-3-1b-pt
	pipeline_tag: text-generation
	tags:
	- arabic
	- grammatical-error-correction
	- gemma
	- unsloth
	- arabic-nlp
	---
	# Gemma 3 1B Arabic Grammatical Error Correction v1

	## Model Description

	This model is a fine-tuned version of Google's Gemma 3 1B model, trained by Alnnahwi for Arabic Grammatical Error Correction (GEC). It takes an Arabic sentence as input and outputs a grammatically corrected version.

	- Developed by: Bahjat Al Mostafa (Alnnahwi)
	- Base Model: google/gemma-3-1b-pt
	- Task: Grammatical Error Correction
	- Language: Arabic
	- Version: 1.0.0
	- Organization: [Alnnahwi](https://alnnahwi.com/)

	## Quick Start

	### Installation

	```bash
	pip install transformers torch
	```

	### Basic Usage

	```python
	from transformers import pipeline, AutoTokenizer
	import torch

	MODEL_NAME = "alnnahwi/gemma-3-1b-arabic-gec-v1"

	def extract_model_response(generated_text):
	    """Extract just the model's response from the full generated text."""
	    # Find the position after "model" marker
	    model_marker = "\nmodel\n"
	    if model_marker in generated_text:
	        response_start = generated_text.find(model_marker) + len(model_marker)
	        return generated_text[response_start:].strip()

	    # Alternative format (in case formatting changes)
	    alt_marker = "model\n"
	    if alt_marker in generated_text:
	        response_start = generated_text.find(alt_marker) + len(alt_marker)
	        return generated_text[response_start:].strip()

	    # If markers not found, return the original text
	    return generated_text

	# Initialize the tokenizer
	tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
	# Add Gemma chat template manually
	tokenizer.chat_template = """{% for message in messages %}{{'<start_of_turn>' + message['role'] + '\n' + message['content'] + '<end_of_turn>\n'}}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"""

	# Device selection
	if torch.backends.mps.is_available():
	    device = "mps"
	elif torch.cuda.is_available():
	    device = "cuda"
	else:
	    device = "cpu"

	# Create pipeline
	pipe = pipeline(
	    "text-generation",
	    model=MODEL_NAME,
	    tokenizer=tokenizer,
	    device=device,
	)

	def correct_arabic_text(text):
	    """Correct Arabic text using the fine-tuned model."""
	    messages = [{"role": "user", "content": text}]
	    prompt = tokenizer.apply_chat_template(
	        messages, tokenize=False, add_generation_prompt=True
	    )

	    outputs = pipe(
	        prompt,
	        max_new_tokens=512,
	        do_sample=False,  # Use greedy decoding for evaluation consistency
	        temperature=None,
	        top_p=None,
	        top_k=None,
	    )

	    full_text = outputs[0]["generated_text"]
	    return extract_model_response(full_text)

	# Example usage with real outputs
	test_inputs = [
	    "كيف حالكي اليوم؟",
	    "وجدنا سبعون حالة",
	    "جاء في تسعة و سبعين سورة.",
	    "لاكن ما رايكم",
	]

	for text in test_inputs:
	    corrected = correct_arabic_text(text)
	    print(f"Original: {text}")
	    print(f"Corrected: {corrected}")
	    print("-" * 50)

	# Expected output:
	# Original: كيف حالكي اليوم؟
	# Corrected: كيف حالك اليوم؟
	# --------------------------------------------------
	# Original: وجدنا سبعون حالة
	# Corrected: وجدنا سبعين حالة
	# --------------------------------------------------
	# Original: جاء في تسعة و سبعين سورة.
	# Corrected: جاء في تسع وسبعين سورة.
	# --------------------------------------------------
	# Original: لاكن ما رايكم
	# Corrected: لكن ما رأيكم؟
	# --------------------------------------------------
	```
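The chat template assigned to the tokenizer above yields a simple turn-based prompt string. As a rough illustration only (this is not the tokenizer's own code path), the sketch below rebuilds that string by hand so the format is easy to inspect. `build_gemma_prompt` is a hypothetical helper name introduced here, and any extra tokens added later during tokenization are not shown.

```python
def build_gemma_prompt(messages, add_generation_prompt=True):
    """Mirror the Jinja chat template from Basic Usage in plain Python.

    Hypothetical helper for illustration; the usage example itself relies on
    tokenizer.apply_chat_template.
    """
    parts = []
    for message in messages:
        # One turn: start marker, role, newline, content, end marker, newline.
        parts.append(
            "<start_of_turn>" + message["role"] + "\n"
            + message["content"] + "<end_of_turn>\n"
        )
    if add_generation_prompt:
        # Open the model's turn so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([{"role": "user", "content": "لاكن ما رايكم"}])
print(repr(prompt))
```

The prompt ends with `model\n`, which is the marker that `extract_model_response` searches for when it separates the model's answer from the echoed prompt.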

	### Example Corrections

	| Input (Incorrect) | Output (Corrected) | Error Type |
	|---|---|---|
	| كيف حالكي اليوم؟ | كيف حالك اليوم؟ | Gender agreement |
	| وجدنا سبعون حالة | وجدنا سبعين حالة | Number declension |
	| جاء في تسعة و سبعين سورة. | جاء في تسع وسبعين سورة. | Number gender + spacing |
	| لاكن ما رايكم | لكن ما رأيكم؟ | Spelling + punctuation |
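To see which words a correction actually changed, the input and output can be compared word by word. The following is a minimal sketch using Python's standard `difflib`; `changed_words` is a hypothetical helper name, and splitting on whitespace is an assumption that treats attached punctuation as part of the word.

```python
import difflib


def changed_words(original, corrected):
    """Return (before, after) pairs for the word spans that differ."""
    src, dst = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing span back into a readable phrase.
            changes.append((" ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return changes


# Pairs taken from the example table above.
pairs = [
    ("كيف حالكي اليوم؟", "كيف حالك اليوم؟"),
    ("وجدنا سبعون حالة", "وجدنا سبعين حالة"),
]
for original, corrected in pairs:
    print(changed_words(original, corrected))
```

A word-level comparison like this is enough to highlight edits in a proofreading interface. It does not classify the error type.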

	## Model Details

	### Training Data

	- Dataset: Custom Arabic GEC dataset
	- Training Epochs: 7
	- Base Architecture: Gemma 3 (1B parameters)

	### Performance

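	- Designed for Modern Standard Arabic (MSA).
	- Handles common grammatical errors such as gender agreement, number declension, spelling, and punctuation, as shown in the example corrections.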

	### Limitations

	- Primarily trained on Modern Standard Arabic
	- May not handle dialectal Arabic variations optimally
	- Performance may vary with very long texts (>512 tokens)
	- Context-dependent corrections may sometimes be imperfect
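One possible way to work around the long-text limitation is to correct a document sentence by sentence. The sketch below is an assumption-laden illustration, not part of the model's own tooling: `split_sentences` and `correct_long_text` are hypothetical helpers, the splitter only looks at `.`, `!`, `?`, and `؟`, and `correct_fn` stands for any function that corrects a single sentence (for instance `correct_arabic_text` from Basic Usage).

```python
import re

# Split on whitespace that follows Arabic or Latin sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text):
    """Split text into sentences, keeping each closing punctuation mark."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def correct_long_text(text, correct_fn):
    """Correct a long text one sentence at a time and rejoin the results.

    correct_fn: callable mapping one sentence to its corrected form
    (hypothetical parameter; pass the model-backed function in practice).
    """
    return " ".join(correct_fn(sentence) for sentence in split_sentences(text))


# Stand-in corrector so the sketch runs without downloading the model.
def identity(sentence):
    return sentence


sample = "كيف حالك اليوم؟ وجدنا سبعين حالة. لكن ما رأيكم؟"
print(split_sentences(sample))
print(correct_long_text(sample, identity))
```

Splitting keeps each request short, but it also removes cross-sentence context, so corrections that depend on a previous sentence may get worse rather than better.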

	## Use Cases

	- Educational Tools: Helping Arabic learners with gender agreement and number declension
	- Content Creation: Proofreading Arabic content for grammatical accuracy
	- Text Processing: Preprocessing Arabic text for downstream NLP tasks
	- Writing Assistance: Supporting writers with:
	  - Proper number-noun agreement
	  - Correct case declensions
	  - Spelling standardization
	  - Punctuation normalization
	- Academic Writing: Ensuring grammatical correctness in formal Arabic texts

	## Training Details

	- Fine-tuning Framework: Unsloth
	- Base Model: Gemma 3 1B
	- Training Epochs: 7
	- Optimization: Memory-efficient fine-tuning techniques

	## Citation

	If you use this model in your research or applications, please cite:

	```bibtex
	@misc{gemma3-arabic-gec-v1,
	  title={Gemma 3 1B Arabic Grammatical Error Correction v1},
	  author={Bahjat Al Mostafa},
	  organization={Alnnahwi},
	  year={2025},
	  publisher={Hugging Face},
	  url={https://huggingface.co/alnnahwi/gemma-3-1b-arabic-gec-v1},
	  website={https://alnnahwi.com/}
	}
	```

	## License

	This model is released under the same license as the base Gemma model. Please refer to Google's Gemma license for usage terms and conditions.

	Important: This model is based on Google's Gemma and is subject to Google's AI Principles and licensing terms.

	## Acknowledgments

	- Built upon Google's Gemma 3 1B model
	- Fine-tuned using the Unsloth framework
	- Trained for Arabic Grammatical Error Correction
	- Developed by Bahjat Al Mostafa at Alnnahwi
	- Visit [Alnnahwi](https://alnnahwi.com/) for more Arabic NLP resources

	## Contact

	- Author: Bahjat Al Mostafa [@Bahjat](https://x.com/bahjat/)
	- Email: <[email protected]>
	- Organization: Alnnahwi
	- Website: [https://alnnahwi.com/](https://alnnahwi.com/)

	For questions, issues, or collaboration opportunities, please open an issue in this repository or visit our website.

	---

	- Model Version: v1.0.0
	- Last Updated: May 2025
	- Model Size: ~2.0 GB