---
base_model: google/gemma-3-1b-it
tags:
- ellora
- lora
- reasoning
- chain-of-thought
- grpo
- thinking
- preference-learning
- self-improvement
- peft
- gemma
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
inference: true
model_type: gemma
datasets:
- codelion/gemma-3-1b-it-magpie-reasoning
---

# codelion/gemma-3-1b-it-reasoning-grpo-lora

## 🧠 Reasoning LoRA with GRPO Training

This LoRA adapter enhances google/gemma-3-1b-it with structured reasoning expressed in `<think></think>` tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.

## 🎯 Key Features

- **Structured Thinking**: Teaches models to use `<think></think>` tags for chain-of-thought reasoning
- **GRPO Training**: Uses preference learning to optimize reasoning quality
- **Self-Generated Data**: No external datasets required; training data is produced with the Magpie approach
- **Multi-Domain**: Effective across mathematics, logic, science, and problem-solving

## 📊 Performance Metrics

- **Base Model**: google/gemma-3-1b-it
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **LoRA Rank**: 64
- **LoRA Alpha**: 128 (a configuration sketch follows this list)
- **Training Samples**: 107
- **Thinking Tag Usage**: 60.0%
- **Average Quality Score**: 5.60
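
For reference, an adapter configuration matching the rank and alpha above can be written with PEFT's `LoraConfig`. This is a minimal sketch; the target modules and dropout are illustrative assumptions, not values taken from the actual training run.

```python
from peft import LoraConfig

# Sketch of a LoRA configuration matching the hyperparameters above.
# target_modules and lora_dropout are assumptions, not the exact
# settings used to train this adapter.
lora_config = LoraConfig(
    r=64,            # LoRA rank reported above
    lora_alpha=128,  # LoRA alpha reported above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```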

## 🔧 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Load reasoning LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Use with thinking prompt
prompt = '''Think step by step and use <think></think> tags to show your reasoning process.

Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?

Response:'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
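
If you want to serve the model without loading PEFT at inference time, the adapter can be merged into the base weights. A short sketch using standard PEFT calls, assuming `model` and `tokenizer` were loaded as in the snippet above (the output directory name is illustrative):

```python
# Fold the LoRA weights into the base model for adapter-free inference,
# then save the merged model and tokenizer locally.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-3-1b-it-reasoning-merged")
tokenizer.save_pretrained("gemma-3-1b-it-reasoning-merged")
```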

## 📈 Expected Output Format

The model will generate responses with structured thinking:

```
<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph

For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles

Total distance = 120 + 90 = 210 miles
</think>

To solve this step by step:

First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph

Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 Γ— 1 = 90 miles

Total distance = 120 + 90 = 210 miles
```
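
When consuming generations programmatically, the thinking section can be separated from the final answer with a simple parse. A minimal sketch, assuming at most one `<think></think>` block per response (the helper name is hypothetical):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a generation into (thinking, final answer) using <think></think> tags."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No thinking block found; treat the whole text as the answer.
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()

# Example shaped like the output above.
sample = "<think>120 / 2 = 60 mph; next hour 90 mph, so 90 more miles.</think>\nTotal distance = 210 miles."
thinking, answer = split_thinking(sample)
print(thinking)
print(answer)
```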

## 🧪 Training Details

- **Method**: GRPO (Group Relative Policy Optimization)
- **Data Generation**: Magpie approach with reasoning-focused prompts
- **Preference Learning**: Multiple responses ranked by reasoning quality
- **Domains**: Mathematics, logic puzzles, science, programming, philosophy
- **Quality Scoring**: Based on thinking tag usage, reasoning markers, and structure (see the sketch after this list)
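
The exact ranking function is not reproduced here, but a heuristic along these lines, rewarding thinking-tag usage, reasoning markers, and structure, conveys the idea. This is an illustrative sketch with assumed weights, not the training script's actual scorer:

```python
import re

REASONING_MARKERS = ("because", "therefore", "since", "first", "next", "then", "finally")

def reasoning_quality_score(response: str) -> float:
    """Heuristic score combining thinking tags, reasoning markers, and structure.

    Illustrative only; weights and components are assumptions, not the exact
    scoring used to build the GRPO preference groups.
    """
    score = 0.0
    text = response.lower()
    # Reward a well-formed <think></think> block.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        score += 3.0
    # Reward causal / sequential reasoning markers.
    score += sum(0.5 for marker in REASONING_MARKERS if marker in text)
    # Reward structure: multiple paragraphs or bulleted / numbered lines.
    if response.count("\n\n") >= 2 or re.search(r"^\s*[-*\d]", response, flags=re.MULTILINE):
        score += 1.0
    return score

# Candidate completions would be ranked by this score to form preference groups.
candidates = ["<think>First, 2 + 2 = 4.</think>\n\nTherefore the answer is 4.", "The answer is 4."]
print(sorted(candidates, key=reasoning_quality_score, reverse=True)[0])
```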

## 📚 Training Data

The model was trained on self-generated reasoning problems across multiple domains (a Magpie-style generation sketch follows this list):
- Mathematical problem-solving
- Logic puzzles and riddles
- Scientific analysis
- Programming challenges
- Philosophical reasoning
- Decision-making scenarios
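
The Magpie approach produces these problems by prompting the base model with only the user-turn prefix of its chat template, so the model completes the turn with a problem of its own invention. A rough sketch of that idea, reusing `model` and `tokenizer` from the Usage section; the prefix string assumes Gemma's chat format, and the sampling settings are illustrative:

```python
# Magpie-style self-generation: feed only the user-turn prefix and let the
# model invent the instruction. The prefix below assumes Gemma's chat
# template; adjust it if the template differs.
magpie_prefix = "<start_of_turn>user\n"

inputs = tokenizer(magpie_prefix, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8)
new_tokens = generated[0][inputs["input_ids"].shape[1]:]
synthetic_prompt = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(synthetic_prompt)  # a self-generated reasoning problem, later used as a training prompt
```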

## 🎭 Reasoning Patterns Learned

- **Step-by-step analysis**: Breaking complex problems into smaller parts
- **Causal reasoning**: Using "because", "therefore", "since" connections
- **Sequential thinking**: "First", "next", "then", "finally" progression
- **Structured output**: Clear separation of thinking and final response

## 🔬 Evaluation

The adapter was evaluated on diverse reasoning tasks:
- Thinking tag usage rate: 60.0%
- Average reasoning quality score: 5.60
- Response comprehensiveness: 454 words average (a sketch for reproducing these metrics follows this list)
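
These aggregate numbers are straightforward to reproduce on your own generations. A small sketch (the helper below is hypothetical, not the project's evaluation script):

```python
def eval_metrics(responses: list[str]) -> dict[str, float]:
    """Compute thinking-tag usage rate and average word count over generations.

    Illustrative helper, not the script that produced the numbers above.
    """
    n = len(responses)
    tag_rate = sum("<think>" in r and "</think>" in r for r in responses) / n
    avg_words = sum(len(r.split()) for r in responses) / n
    return {"thinking_tag_usage": tag_rate, "avg_words": avg_words}

print(eval_metrics(["<think>2 + 2 = 4</think>\nThe answer is 4.", "It depends."]))
```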

## 🏷️ Related

- **Dataset**: [codelion/gemma-3-1b-it-magpie-reasoning](https://huggingface.co/datasets/codelion/gemma-3-1b-it-magpie-reasoning)
- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Framework**: [PEFT](https://github.com/huggingface/peft)
- **Training Method**: GRPO (Group Relative Policy Optimization)

---

*This adapter is part of the [Ellora project](https://github.com/codelion/ellora) - standardized recipes for enhancing LLM capabilities.*