---
base_model: google/gemma-3-1b-it
tags:
- ellora
- lora
- reasoning
- chain-of-thought
- grpo
- thinking
- preference-learning
- self-improvement
- peft
- gemma
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: text-generation
inference: true
model_type: gemma
datasets:
- codelion/gemma-3-1b-it-magpie-reasoning
---

# codelion/gemma-3-1b-it-reasoning-grpo-lora

## 🧠 Reasoning LoRA with GRPO Training

This LoRA adapter enhances google/gemma-3-1b-it with structured reasoning capabilities using `<think>` tags. It was trained with GRPO (Group Relative Policy Optimization) on self-generated preference data.

## 🎯 Key Features

- **Structured Thinking**: Teaches the model to use `<think>` tags for chain-of-thought reasoning
- **GRPO Training**: Uses preference learning to optimize reasoning quality
- **Self-Generated Data**: No external datasets required - uses the Magpie approach
- **Multi-Domain**: Effective across mathematics, logic, science, and problem-solving

## 📊 Performance Metrics

- **Base Model**: google/gemma-3-1b-it
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **LoRA Rank**: 64
- **LoRA Alpha**: 128
- **Training Samples**: 107
- **Thinking Tag Usage**: 60.0%
- **Average Quality Score**: 5.60

## 🔧 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Load reasoning LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/gemma-3-1b-it-reasoning-grpo-lora")

# Use with a thinking prompt
prompt = '''Think step by step and use <think> tags to show your reasoning process.

Problem: If a train travels 120 miles in 2 hours, then increases its speed by 30 mph for the next hour, how many total miles does it travel?

Response:'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
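If you prefer to serve the model without keeping PEFT in the loop, the adapter can also be merged into the base weights. This is a minimal sketch that reuses the `model` and `tokenizer` objects loaded above; the output directory name is just a placeholder.

```python
# Optional: merge the LoRA weights into the base model and save a standalone copy.
# Reuses `model` (the PeftModel) and `tokenizer` from the snippet above;
# the output path is a placeholder, not an official checkpoint name.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gemma-3-1b-it-reasoning-merged")
tokenizer.save_pretrained("gemma-3-1b-it-reasoning-merged")
```

The merged checkpoint then loads with `AutoModelForCausalLM.from_pretrained` like any regular model, at the cost of no longer being able to detach the adapter.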
## 📈 Expected Output Format

The model will generate responses with structured thinking:

```
<think>
First, I need to find the train's initial speed.
Speed = Distance / Time = 120 miles / 2 hours = 60 mph

For the first 2 hours: 120 miles
For the next hour, speed increases by 30 mph: 60 + 30 = 90 mph
Distance in third hour: 90 mph × 1 hour = 90 miles

Total distance = 120 + 90 = 210 miles
</think>

To solve this step by step:

First, I'll find the train's initial speed:
- Distance = 120 miles, Time = 2 hours
- Speed = 120 ÷ 2 = 60 mph

Next, I'll calculate the distance for each segment:
- First 2 hours: 120 miles (given)
- Third hour: speed increases by 30 mph → 60 + 30 = 90 mph
- Distance in third hour: 90 × 1 = 90 miles

Total distance = 120 + 90 = 210 miles
```

## 🧪 Training Details

- **Method**: GRPO (Group Relative Policy Optimization)
- **Data Generation**: Magpie approach with reasoning-focused prompts
- **Preference Learning**: Multiple responses ranked by reasoning quality
- **Domains**: Mathematics, logic puzzles, science, programming, philosophy
- **Quality Scoring**: Based on thinking tag usage, reasoning markers, and structure (see the illustrative sketch at the end of this card)

## 📚 Training Data

The model was trained on self-generated reasoning problems across multiple domains:

- Mathematical problem-solving
- Logic puzzles and riddles
- Scientific analysis
- Programming challenges
- Philosophical reasoning
- Decision-making scenarios

## 🎭 Reasoning Patterns Learned

- **Step-by-step analysis**: Breaking complex problems into smaller parts
- **Causal reasoning**: Using "because", "therefore", "since" connections
- **Sequential thinking**: "First", "next", "then", "finally" progression
- **Structured output**: Clear separation of thinking and final response

## 🔬 Evaluation

The adapter was evaluated on diverse reasoning tasks:

- Thinking tag usage rate: 60.0%
- Average reasoning quality score: 5.60
- Average response length: 454 words

## 🏷️ Related

- **Dataset**: [codelion/gemma-3-1b-it-magpie-reasoning](https://huggingface.co/datasets/codelion/gemma-3-1b-it-magpie-reasoning)
- **Base Model**: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it)
- **Framework**: [PEFT](https://github.com/huggingface/peft)
- **Training Method**: GRPO (Group Relative Policy Optimization)

---

*This adapter is part of the [Ellora project](https://github.com/codelion/ellora) - standardized recipes for enhancing LLM capabilities.*
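## 🧮 Appendix: Quality-Scoring Sketch

The "Quality Scoring" bullet under Training Details describes ranking responses by thinking-tag usage, reasoning markers, and structure. As a closing illustration, here is a minimal sketch of what such a heuristic could look like. Everything in it (function name, marker list, weights) is invented for this example; it is not the scoring code used to train this adapter, which lives in the Ellora repository.

```python
import re

# Illustrative only: a toy heuristic in the spirit of the quality scoring
# described above. Weights, markers, and names are placeholders.
REASONING_MARKERS = ["first", "next", "then", "finally", "because", "therefore", "since"]

def score_response(response: str) -> float:
    """Score a completion on thinking-tag usage, reasoning markers, and structure."""
    score = 0.0

    # Reward a properly paired <think> ... </think> block.
    thinking = re.findall(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if thinking:
        score += 2.0

    # Reward step-by-step connective language.
    lowered = response.lower()
    score += sum(0.5 for marker in REASONING_MARKERS if marker in lowered)

    # Reward structure: a visible answer remaining after the thinking block.
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if thinking and visible:
        score += 1.0

    return score

if __name__ == "__main__":
    example = "<think>First, find the speed. Therefore it is 60 mph.</think>\nThe answer is 210 miles."
    print(score_response(example))  # 4.0 with these toy weights
```

In a GRPO setup, scores like these are computed for a group of sampled completions per prompt, and the relative differences within the group drive the policy update.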