---
base_model: google/gemma-3-1b-it
tags:
- ellora
- lora
- reasoning
- chain-of-thought
- grpo
- thinking
- preference-learning
- self-improvement
- peft
- gemma
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
inference: true
model_type: gemma
datasets:
- codelion/gemma-3-1b-it-magpie-reasoning
---
# codelion/gemma-3-1b-it-reasoning-grpo-lora
## 🧠 Reasoning LoRA with GRPO Training
This LoRA adapter enhances google/gemma-3-1b-it with structured reasoning capabilities using `<think>` tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.
## 🎯 Key Features
- **Structured Thinking**: Teaches the model to use `<think>` tags for chain-of-thought reasoning
- **GRPO Training**: Uses preference learning to optimize reasoning quality
- **Self-Generated Data**: No external datasets required; training prompts are self-generated with the Magpie approach (sketched after this list)
- **Multi-Domain**: Effective across mathematics, logic, science, and problem-solving
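
The Magpie idea is worth a one-line illustration: an aligned chat model, given only the part of its chat template that normally precedes a user message, will invent a plausible user instruction on its own. Below is a minimal sketch assuming Gemma's standard `<start_of_turn>user` template; the actual prompts and sampling settings used for this adapter are not published.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Magpie trick: feed only the pre-query part of the chat template (the tokens
# that normally precede a user message) and sample. The aligned model then
# "fills in" a plausible user instruction by itself.
pre_query = "<start_of_turn>user\n"
inputs = tokenizer(pre_query, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0)

# Strip the prompt tokens to keep only the self-generated instruction
instruction = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(instruction)  # later answered multiple times and ranked for GRPO
```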
## 📊 Performance Metrics
- **Base Model**: google/gemma-3-1b-it
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **LoRA Rank**: 64
- **LoRA Alpha**: 128
- **Training Samples**: 107
- **Thinking Tag Usage**: 60.0%
- **Average Quality Score**: 5.60
## 🔧 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Load the reasoning LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Ask the model to reason inside <think> tags before answering
prompt = '''Think step by step and use <think> tags to show your reasoning process.
Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?
Response:'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # temperature only takes effect when sampling is enabled
    temperature=0.2,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
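
If you prefer a standalone checkpoint without the PEFT dependency at inference time, the adapter can be folded into the base weights with PEFT's `merge_and_unload`. The output path below is just an example.

```python
# Merge the LoRA weights into the base model for adapter-free inference
merged = model.merge_and_unload()  # returns a plain transformers model
merged.save_pretrained("gemma-3-1b-it-reasoning-merged")      # example path
tokenizer.save_pretrained("gemma-3-1b-it-reasoning-merged")
```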
## 📝 Expected Output Format
The model will generate responses with structured thinking:
```
<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph
For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles
Total distance = 120 + 90 = 210 miles
</think>

To solve this step by step:

First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph

Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 × 1 = 90 miles

Total distance = 120 + 90 = 210 miles
```
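
Downstream code usually wants the final answer without the scratchpad. A small helper like the one below (hypothetical, but matching the `<think>` tag this card trains for) splits the two:

```python
import re

def split_thinking(text: str):
    """Separate the <think>...</think> scratchpad from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thinking, answer

# `response` is the decoded generation from the usage example above
thinking, answer = split_thinking(response)
print(answer)
```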
## 🧪 Training Details
- **Method**: GRPO (Group Relative Policy Optimization)
- **Data Generation**: Magpie approach with reasoning-focused prompts
- **Preference Learning**: Multiple responses ranked by reasoning quality
- **Domains**: Mathematics, logic puzzles, science, programming, philosophy
- **Quality Scoring**: Based on thinking tag usage, reasoning markers, and structure
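
The exact reward function is not published. The sketch below illustrates the kind of heuristic scorer the last bullet describes, plus the group-relative advantage normalization that gives GRPO its name; all thresholds and marker lists here are assumptions.

```python
import re
import statistics

REASONING_MARKERS = ("because", "therefore", "since", "first", "next", "then", "finally")

def quality_score(response: str) -> float:
    """Heuristic reasoning-quality score (illustrative, not the exact one used)."""
    score = 0.0
    if "<think>" in response and "</think>" in response:
        score += 3.0                               # reward structured thinking
    lowered = response.lower()
    score += 0.5 * sum(lowered.count(m) for m in REASONING_MARKERS)
    score += min(response.count("\n") * 0.1, 1.0)  # mild bonus for structure
    return score

def group_relative_advantages(responses: list[str]) -> list[float]:
    """GRPO core idea: normalize each response's reward against its own group."""
    rewards = [quality_score(r) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0        # avoid division by zero
    return [(r - mean) / std for r in rewards]
```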
## 📚 Training Data
The model was trained on self-generated reasoning problems across multiple domains:
- Mathematical problem-solving
- Logic puzzles and riddles
- Scientific analysis
- Programming challenges
- Philosophical reasoning
- Decision-making scenarios
## 🎭 Reasoning Patterns Learned
- **Step-by-step analysis**: Breaking complex problems into smaller parts
- **Causal reasoning**: Using "because", "therefore", "since" connections
- **Sequential thinking**: "First", "next", "then", "finally" progression
- **Structured output**: Clear separation of thinking and final response
## 🔬 Evaluation
The adapter was evaluated on diverse reasoning tasks:
- Thinking tag usage rate: 60.0%
- Average reasoning quality score: 5.60
- Response comprehensiveness: 454 words per response on average
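
As a rough illustration, the two rates above can be computed over a set of generations like this (the metric definitions are assumptions based on their names):

```python
def thinking_tag_usage_rate(responses: list[str]) -> float:
    """Fraction of responses containing a complete <think>...</think> block."""
    used = sum(1 for r in responses if "<think>" in r and "</think>" in r)
    return used / len(responses)

def average_word_count(responses: list[str]) -> float:
    """Mean whitespace-delimited word count across responses."""
    return sum(len(r.split()) for r in responses) / len(responses)
```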
## 🏷️ Related
- **Dataset**: [codelion/gemma-3-1b-it-magpie-reasoning](https://huggingface.co/datasets/codelion/gemma-3-1b-it-magpie-reasoning)
- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Framework**: [PEFT](https://github.com/huggingface/peft)
- **Training Method**: GRPO (Group Relative Policy Optimization)
---
*This adapter is part of the [Ellora project](https://github.com/codelion/ellora) - standardized recipes for enhancing LLM capabilities.*