---
base_model: google/gemma-3-1b-it
tags:
- ellora
- lora
- reasoning
- chain-of-thought
- grpo
- thinking
- preference-learning
- self-improvement
- peft
- gemma
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
inference: true
model_type: gemma
datasets:
- codelion/gemma-3-1b-it-magpie-reasoning
---

# codelion/gemma-3-1b-it-reasoning-grpo-lora

## Reasoning LoRA with GRPO Training

This LoRA adapter enhances google/gemma-3-1b-it with structured reasoning capabilities using `<think></think>` tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.

## Key Features

- **Structured Thinking**: Teaches the model to use `<think></think>` tags for chain-of-thought reasoning
- **GRPO Training**: Uses preference learning to optimize reasoning quality
- **Self-Generated Data**: No external datasets required; training data is generated with the Magpie approach
- **Multi-Domain**: Effective across mathematics, logic, science, and problem-solving

## Performance Metrics

- **Base Model**: google/gemma-3-1b-it
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **LoRA Rank**: 64 (see the configuration sketch below)
- **LoRA Alpha**: 128
- **Training Samples**: 107
- **Thinking Tag Usage**: 60.0%
- **Average Quality Score**: 5.60

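The authoritative settings live in this repository's `adapter_config.json`; the snippet below is only a rough sketch of a PEFT configuration matching the rank and alpha listed above. The dropout value and target modules are assumptions (the usual Gemma attention and MLP projections), not values read from this repo.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative configuration only: r / lora_alpha match this card,
# dropout and target_modules are assumptions.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```
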
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Load reasoning LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Use with thinking prompt
prompt = '''Think step by step and use <think></think> tags to show your reasoning process.

Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?

Response:'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,      # enable sampling so temperature takes effect
    temperature=0.2,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

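Because the reasoning is wrapped in `<think></think>` tags, it is straightforward to separate the thinking from the final answer. A minimal post-processing sketch (the helper name and regex are illustrative, not part of the adapter):

```python
import re

def split_thinking(text: str):
    """Separate <think>...</think> content from the final answer (illustrative helper)."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return [t.strip() for t in thoughts], answer

thoughts, answer = split_thinking(response)
print("Reasoning:", thoughts[0] if thoughts else "(none)")
print("Answer:", answer)
```

For deployment without a PEFT dependency at inference time, the adapter can also be folded into the base weights with `model.merge_and_unload()` and then saved as a regular model.
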
## Expected Output Format

The model will generate responses with structured thinking:

```
<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph

For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles

Total distance = 120 + 90 = 210 miles
</think>

To solve this step by step:

First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph

Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 × 1 = 90 miles

Total distance = 120 + 90 = 210 miles
```

## Training Details

- **Method**: GRPO (Group Relative Policy Optimization)
- **Data Generation**: Magpie approach with reasoning-focused prompts
- **Preference Learning**: Multiple responses per prompt, ranked by reasoning quality
- **Domains**: Mathematics, logic puzzles, science, programming, philosophy
- **Quality Scoring**: Based on thinking tag usage, reasoning markers, and structure (a rough scorer along these lines is sketched below)

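The exact scoring code used for ranking is not reproduced on this card. The sketch below only illustrates what a heuristic scorer of that kind might look like; the marker list, weights, and function name are assumptions:

```python
import re

# Illustrative heuristic: rewards <think> usage, reasoning markers, and a clean final answer.
# Marker list and weights are assumptions, not the values used in training.
REASONING_MARKERS = ("because", "therefore", "since", "first", "next", "then", "finally")

def score_response(text: str) -> float:
    score = 0.0
    has_think = bool(re.search(r"<think>.*?</think>", text, flags=re.DOTALL))
    if has_think:
        score += 3.0  # structured thinking present
    lowered = text.lower()
    score += sum(0.5 for marker in REASONING_MARKERS if marker in lowered)
    if has_think and text.split("</think>")[-1].strip():
        score += 1.0  # a final answer follows the thinking block
    return score
```
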
## Training Data

The model was trained on self-generated reasoning problems across multiple domains:

- Mathematical problem-solving
- Logic puzzles and riddles
- Scientific analysis
- Programming challenges
- Philosophical reasoning
- Decision-making scenarios

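The generated problems and responses are published as the dataset linked in the Related section below. A minimal loading sketch follows; the split name is an assumption, so check the dataset viewer for the actual schema:

```python
from datasets import load_dataset

# Dataset id is from this card; the "train" split is an assumption.
ds = load_dataset("codelion/gemma-3-1b-it-magpie-reasoning", split="train")
print(ds)      # inspect the available columns
print(ds[0])   # look at one example
```
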
## Reasoning Patterns Learned

- **Step-by-step analysis**: Breaking complex problems into smaller parts
- **Causal reasoning**: Using "because", "therefore", "since" connections
- **Sequential thinking**: "First", "next", "then", "finally" progression
- **Structured output**: Clear separation of thinking and final response

## Evaluation

The adapter was evaluated on diverse reasoning tasks:

- Thinking tag usage rate: 60.0%
- Average reasoning quality score: 5.60
- Response comprehensiveness: 454 words average

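For reference, aggregate numbers like the tag-usage rate and the average response length can be recomputed from a batch of generations in a few lines. This is only an illustrative sketch, not the evaluation script behind the figures above:

```python
# Illustrative aggregation over generated responses (not the original evaluation code).
responses = [
    "<think>2 + 2 = 4</think> The answer is 4.",
    "The answer is 4.",
]

tag_rate = sum(("<think>" in r and "</think>" in r) for r in responses) / len(responses)
avg_words = sum(len(r.split()) for r in responses) / len(responses)
print(f"Thinking tag usage: {tag_rate:.1%}, average length: {avg_words:.0f} words")
```
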
## Related

- **Dataset**: [codelion/gemma-3-1b-it-magpie-reasoning](https://huggingface.co/datasets/codelion/gemma-3-1b-it-magpie-reasoning)
- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Framework**: [PEFT](https://github.com/huggingface/peft)
- **Training Method**: GRPO (Group Relative Policy Optimization)

---

*This adapter is part of the [Ellora project](https://github.com/codelion/ellora): standardized recipes for enhancing LLM capabilities.*