---
base_model: google/gemma-3-1b-it
tags:
- ellora
- lora
- reasoning
- chain-of-thought
- grpo
- thinking
- preference-learning
- self-improvement
- peft
- gemma
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
inference: true
model_type: gemma
datasets:
- codelion/gemma-3-1b-it-magpie-reasoning
---
# codelion/gemma-3-1b-it-reasoning-grpo-lora
## Reasoning LoRA with GRPO Training
This LoRA adapter enhances google/gemma-3-1b-it with structured reasoning capabilities expressed in `<think></think>` tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.
## Key Features
- **Structured Thinking**: Teaches models to use `<think></think>` tags for chain-of-thought reasoning
- **GRPO Training**: Uses preference learning to optimize reasoning quality
- **Self-Generated Data**: No external datasets required; training prompts are produced with the Magpie approach
- **Multi-Domain**: Effective across mathematics, logic, science, and problem-solving
## Performance Metrics
- **Base Model**: google/gemma-3-1b-it
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **LoRA Rank**: 64
- **LoRA Alpha**: 128
- **Training Samples**: 107
- **Thinking Tag Usage**: 60.0%
- **Average Quality Score**: 5.60
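The full adapter configuration is not reproduced in this card. Below is a minimal `LoraConfig` sketch consistent with the rank and alpha listed above; the remaining fields (dropout, `target_modules`) are assumptions rather than published values:
```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter configuration.
# r and lora_alpha come from the metrics above; every other value is assumed.
lora_config = LoraConfig(
    r=64,                 # LoRA rank (from this card)
    lora_alpha=128,       # LoRA alpha (from this card)
    lora_dropout=0.05,    # assumption
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[      # assumption: typical attention/MLP projections for Gemma
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```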
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Attach the reasoning LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Ask the model to show its reasoning inside <think></think> tags
prompt = '''Think step by step and use <think></think> tags to show your reasoning process.
Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?
Response:'''

# Move inputs to the model's device and sample with a low temperature
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
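The example above uses a raw text prompt. Because the base model is instruction-tuned, routing the request through the chat template may work better; this variant reuses `model` and `tokenizer` from the snippet above and is a suggestion, not part of the original card:
```python
messages = [{
    "role": "user",
    "content": (
        "Think step by step and use <think></think> tags to show your reasoning process. "
        "Problem: If a train travels 120 miles in 2 hours, then increases its speed by "
        "30 mph for the next hour, how many total miles does it travel?"
    ),
}]

# apply_chat_template wraps the message in the model's turn markers and adds the generation prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.2)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```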
## Expected Output Format
The model will generate responses with structured thinking:
```
<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph
For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles
Total distance = 120 + 90 = 210 miles
</think>
To solve this step by step:
First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph
Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 Γ 1 = 90 miles
Total distance = 120 + 90 = 210 miles
```
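When consuming this output programmatically, the thinking block can be separated from the final answer. The tag names come from this card; the helper itself is illustrative and not part of the released code:
```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a response into its <think>...</think> reasoning and the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return thinking, answer

# Example with the `response` string produced by the Usage snippet above
thinking, answer = split_thinking(response)
print("Reasoning:\n", thinking)
print("Answer:\n", answer)
```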
## Training Details
- **Method**: GRPO (Group Relative Policy Optimization)
- **Data Generation**: Magpie approach with reasoning-focused prompts
- **Preference Learning**: Multiple responses ranked by reasoning quality
- **Domains**: Mathematics, logic puzzles, science, programming, philosophy
- **Quality Scoring**: Based on thinking tag usage, reasoning markers, and structure (see the sketch below)
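The training script is not included in this card, so the following is only a sketch of how such a quality score could be computed and used as a reward signal (for example via the `reward_funcs` argument of TRL's `GRPOTrainer`); the specific markers and weights are assumptions:
```python
import re

REASONING_MARKERS = ("because", "therefore", "since", "first", "next", "then", "finally")

def reasoning_quality_reward(completions: list[str], **kwargs) -> list[float]:
    """Score completions for thinking-tag usage, reasoning markers, and structure (illustrative weights)."""
    scores = []
    for text in completions:
        score = 0.0
        # Reward a well-formed <think>...</think> block
        if re.search(r"<think>.*?</think>", text, flags=re.DOTALL):
            score += 3.0
        # Reward explicit reasoning connectives
        lowered = text.lower()
        score += sum(0.5 for marker in REASONING_MARKERS if marker in lowered)
        # Reward having a final answer outside the thinking block
        answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
        if answer:
            score += 1.0
        scores.append(score)
    return scores

# In a GRPO setup this could be passed as a reward function, e.g.:
# trainer = GRPOTrainer(model=..., reward_funcs=[reasoning_quality_reward], ...)
```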
## Training Data
The model was trained on self-generated reasoning problems across multiple domains (see the generation sketch after the list):
- Mathematical problem-solving
- Logic puzzles and riddles
- Scientific analysis
- Programming challenges
- Philosophical reasoning
- Decision-making scenarios
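The generation pipeline itself is not published here. As a rough illustration of the Magpie idea, the instruction-tuned base model is given only the opening of a user turn and left to invent the problem itself; the turn markers and stopping logic below are assumptions about Gemma's chat template, not the project's actual code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Magpie-style prompt: only the user-turn prefix, with no actual instruction.
magpie_prefix = "<start_of_turn>user\n"  # assumed Gemma turn marker
inputs = tokenizer(magpie_prefix, return_tensors="pt").to(model.device)

# Sample freely so the model invents a reasoning problem on its own.
generated = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0)
text = tokenizer.decode(generated[0], skip_special_tokens=False)

# Keep only the invented problem: drop the prefix and cut at the end-of-turn marker.
problem = text.split("user\n", 1)[-1].split("<end_of_turn>")[0].strip()
print(problem)
```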
## Reasoning Patterns Learned
- **Step-by-step analysis**: Breaking complex problems into smaller parts
- **Causal reasoning**: Using "because", "therefore", "since" connections
- **Sequential thinking**: "First", "next", "then", "finally" progression
- **Structured output**: Clear separation of thinking and final response
## Evaluation
The adapter was evaluated on diverse reasoning tasks:
- Thinking tag usage rate: 60.0%
- Average reasoning quality score: 5.60
- Response comprehensiveness: 454 words average
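The exact evaluation harness is not included in this card; a minimal sketch of how the tag-usage rate and average response length could be computed over a batch of generated responses (variable names are illustrative):
```python
import re

def evaluate_responses(responses: list[str]) -> dict:
    """Compute thinking-tag usage rate and average response length in words."""
    tagged = sum(bool(re.search(r"<think>.*?</think>", r, flags=re.DOTALL)) for r in responses)
    avg_words = sum(len(r.split()) for r in responses) / max(len(responses), 1)
    return {
        "thinking_tag_usage": tagged / max(len(responses), 1),
        "avg_response_words": avg_words,
    }

# Example: metrics = evaluate_responses(list_of_generated_responses)
```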
## Related
- **Dataset**: [codelion/gemma-3-1b-it-magpie-reasoning](https://huggingface.co/datasets/codelion/gemma-3-1b-it-magpie-reasoning)
- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Framework**: [PEFT](https://github.com/huggingface/peft)
- **Training Method**: GRPO (Group Relative Policy Optimization)
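The training dataset linked above can be pulled with the `datasets` library; the split and column names are not documented in this card (a "train" split is assumed), so inspect the dataset object before relying on them:
```python
from datasets import load_dataset

# Load the self-generated reasoning dataset used for this adapter
ds = load_dataset("codelion/gemma-3-1b-it-magpie-reasoning", split="train")
print(ds)     # shows the available columns and number of rows
print(ds[0])  # inspect one example to see the actual field names
```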
---
*This adapter is part of the [Ellora project](https://github.com/codelion/ellora) - standardized recipes for enhancing LLM capabilities.*