---
base_model: google/gemma-3-1b-it
tags:
- ellora
- lora
- reasoning
- chain-of-thought
- grpo
- thinking
- preference-learning
- self-improvement
- peft
- gemma
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
inference: true
model_type: gemma
datasets:
- codelion/gemma-3-1b-it-magpie-reasoning
---

# codelion/gemma-3-1b-it-reasoning-grpo-lora

## Reasoning LoRA with GRPO Training

This LoRA adapter enhances google/gemma-3-1b-it with structured reasoning capabilities using `<think></think>` tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.

## Key Features

- **Structured Thinking**: Teaches the model to use `<think></think>` tags for chain-of-thought reasoning
- **GRPO Training**: Uses preference learning to optimize reasoning quality
- **Self-Generated Data**: No external datasets required; training data is generated with the Magpie approach
- **Multi-Domain**: Effective across mathematics, logic, science, and problem-solving

## Performance Metrics

- **Base Model**: google/gemma-3-1b-it
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **LoRA Rank**: 64 (see the configuration sketch below)
- **LoRA Alpha**: 128
- **Training Samples**: 107
- **Thinking Tag Usage**: 60.0%
- **Average Quality Score**: 5.60

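The authoritative settings live in this repository's `adapter_config.json`; the snippet below is only a rough sketch of a PEFT configuration matching the rank and alpha listed above. The dropout value and target modules are assumptions (the usual Gemma attention and MLP projections), not values read from this repo.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative configuration only: r / lora_alpha match this card,
# dropout and target_modules are assumptions.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```
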
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Load reasoning LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Use with thinking prompt
prompt = '''Think step by step and use <think></think> tags to show your reasoning process.

Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?

Response:'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,      # enable sampling so temperature takes effect
    temperature=0.2,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

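Because the reasoning is wrapped in `<think></think>` tags, it is straightforward to separate the thinking from the final answer. A minimal post-processing sketch (the helper name and regex are illustrative, not part of the adapter):

```python
import re

def split_thinking(text: str):
    """Separate <think>...</think> content from the final answer (illustrative helper)."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return [t.strip() for t in thoughts], answer

thoughts, answer = split_thinking(response)
print("Reasoning:", thoughts[0] if thoughts else "(none)")
print("Answer:", answer)
```

For deployment without a PEFT dependency at inference time, the adapter can also be folded into the base weights with `model.merge_and_unload()` and then saved as a regular model.
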
## Expected Output Format

The model will generate responses with structured thinking:

```
<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph

For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles

Total distance = 120 + 90 = 210 miles
</think>

To solve this step by step:

First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph

Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 × 1 = 90 miles

Total distance = 120 + 90 = 210 miles
```

## Training Details

- **Method**: GRPO (Group Relative Policy Optimization)
- **Data Generation**: Magpie approach with reasoning-focused prompts
- **Preference Learning**: Multiple responses per prompt, ranked by reasoning quality
- **Domains**: Mathematics, logic puzzles, science, programming, philosophy
- **Quality Scoring**: Based on thinking tag usage, reasoning markers, and structure (a rough scorer along these lines is sketched below)

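The exact scoring code used for ranking is not reproduced on this card. The sketch below only illustrates what a heuristic scorer of that kind might look like; the marker list, weights, and function name are assumptions:

```python
import re

# Illustrative heuristic: rewards <think> usage, reasoning markers, and a clean final answer.
# Marker list and weights are assumptions, not the values used in training.
REASONING_MARKERS = ("because", "therefore", "since", "first", "next", "then", "finally")

def score_response(text: str) -> float:
    score = 0.0
    has_think = bool(re.search(r"<think>.*?</think>", text, flags=re.DOTALL))
    if has_think:
        score += 3.0  # structured thinking present
    lowered = text.lower()
    score += sum(0.5 for marker in REASONING_MARKERS if marker in lowered)
    if has_think and text.split("</think>")[-1].strip():
        score += 1.0  # a final answer follows the thinking block
    return score
```
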
## Training Data

The model was trained on self-generated reasoning problems across multiple domains:

- Mathematical problem-solving
- Logic puzzles and riddles
- Scientific analysis
- Programming challenges
- Philosophical reasoning
- Decision-making scenarios

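The generated problems and responses are published as the dataset linked in the Related section below. A minimal loading sketch follows; the split name is an assumption, so check the dataset viewer for the actual schema:

```python
from datasets import load_dataset

# Dataset id is from this card; the "train" split is an assumption.
ds = load_dataset("codelion/gemma-3-1b-it-magpie-reasoning", split="train")
print(ds)      # inspect the available columns
print(ds[0])   # look at one example
```
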
## Reasoning Patterns Learned

- **Step-by-step analysis**: Breaking complex problems into smaller parts
- **Causal reasoning**: Using "because", "therefore", "since" connections
- **Sequential thinking**: "First", "next", "then", "finally" progression
- **Structured output**: Clear separation of thinking and final response

## Evaluation

The adapter was evaluated on diverse reasoning tasks:

- Thinking tag usage rate: 60.0%
- Average reasoning quality score: 5.60
- Response comprehensiveness: 454 words average

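For reference, aggregate numbers like the tag-usage rate and the average response length can be recomputed from a batch of generations in a few lines. This is only an illustrative sketch, not the evaluation script behind the figures above:

```python
# Illustrative aggregation over generated responses (not the original evaluation code).
responses = [
    "<think>2 + 2 = 4</think> The answer is 4.",
    "The answer is 4.",
]

tag_rate = sum(("<think>" in r and "</think>" in r) for r in responses) / len(responses)
avg_words = sum(len(r.split()) for r in responses) / len(responses)
print(f"Thinking tag usage: {tag_rate:.1%}, average length: {avg_words:.0f} words")
```
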
## Related

- **Dataset**: [codelion/gemma-3-1b-it-magpie-reasoning](https://huggingface.co/datasets/codelion/gemma-3-1b-it-magpie-reasoning)
- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Framework**: [PEFT](https://github.com/huggingface/peft)
- **Training Method**: GRPO (Group Relative Policy Optimization)

---

*This adapter is part of the [Ellora project](https://github.com/codelion/ellora): standardized recipes for enhancing LLM capabilities.*