Model Card for Infused Phi-2
A reasoning-enhanced version of Microsoft's Phi-2 model, fine-tuned using Group Relative Policy Optimization (GRPO) to develop step-by-step reasoning capabilities expressed with <think> and <answer> tags.
Model Details
Model Description
Infused Phi-2 is a 2.7B parameter reasoning model built upon Microsoft's Phi-2 foundation model. Through reinforcement learning with GRPO (Group Relative Policy Optimization), the model has been trained to generate structured reasoning responses with an explicit thinking process before the final answer. The model follows a specific format, using <think> and <answer> tags to separate the thought process from the conclusion.
Unlike models that rely on pre-collected reasoning data, Infused Phi-2 learned to reason through pure reinforcement learning, developing its own reasoning patterns and occasionally exhibiting "aha moments" where it spontaneously improves its problem-solving approach.
- Developed by: pierizvi
- Model type: Causal Language Model with Reasoning Enhancement
- Language(s): English
- License: MIT (inherits from base Phi-2 model)
- Finetuned from model: microsoft/phi-2
- Training method: GRPO (Group Relative Policy Optimization) with LoRA
- Parameters: 2.7B (with LoRA adapters)
- Context length: 768 tokens (training), up to 2048 (inference)
Model Sources
- Base Repository: https://huggingface.co/microsoft/phi-2
- Training Framework: Custom GRPO implementation with PyTorch + PEFT
- Inspiration: DeepSeek-R1 methodology and research
Uses
Direct Use
Infused Phi-2 is designed for tasks requiring explicit reasoning and step-by-step problem solving:
- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning tasks
- Multi-step analysis and planning
- Educational assistance where showing work is important
- Problem decomposition and verification
The model generates responses in this format:
<think>
Let me think about this step by step.
First, I need to understand what the problem is asking...
Then I'll work through the calculations...
</think>
<answer>
The final answer is 42.
</answer>
Downstream Use
This model can be further fine-tuned for domain-specific reasoning tasks such as:
- Mathematical tutoring systems
- Logic puzzle solving
- Scientific problem analysis
- Step-by-step troubleshooting guides
- Educational content generation
Out-of-Scope Use
- Real-time applications requiring immediate responses (due to longer generation time for reasoning)
- Tasks not requiring reasoning where the base model might be more efficient
- Factual knowledge retrieval without reasoning requirements
- Creative writing where structured reasoning format may be inappropriate
- Critical applications without human verification of reasoning steps
Bias, Risks, and Limitations
- Inherited biases from the base Phi-2 model
- Reasoning format dependency - may struggle with tasks not suited to explicit reasoning
- Computational overhead from longer response generation
- Training scope - primarily trained on mathematical reasoning (GSM8K dataset)
- Potential for verbose responses that may include unnecessary reasoning steps
- Format compliance - may occasionally deviate from expected tag structure
- Mathematical accuracy - reasoning steps should be verified for critical applications
Recommendations
Users should:
- Validate reasoning steps for critical applications
- Be aware that longer reasoning doesn't guarantee correctness
- Test on domain-specific examples before deployment
- Consider computational costs of inference
- Implement output parsing to handle format variations
- Use for educational/learning purposes where showing work is valuable
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Load the fine-tuned adapter
model = PeftModel.from_pretrained(base_model, "pierizvi/infused-phi2")

# Set up for inference
model.eval()
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# System prompt (important for consistent behavior)
system_prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively."""

# Example usage
def generate_response(question):
    prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{question}\n<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=384,
            temperature=0.2,  # Lower for more consistent reasoning
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )
    response = tokenizer.decode(outputs[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return response

# Test the model
question = "If I have 12 apples and give away 3 to my friend and eat 2 myself, how many apples do I have left?"
response = generate_response(question)
print(response)
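As noted under Recommendations, responses can occasionally deviate from the expected tag structure, so it is worth parsing outputs defensively. The following is a minimal illustrative sketch, not part of the released code: a hypothetical `parse_reasoning_response` helper that extracts the two sections with regular expressions and falls back to the raw text when a tag is missing.

```python
import re

def parse_reasoning_response(response):
    """Split a response into its thinking and answer sections.

    Falls back gracefully when a tag is missing or malformed.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return {
        "thinking": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else response.strip(),
    }

# Reuse the response generated above
parsed = parse_reasoning_response(response)
print(parsed["answer"])
```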
Training Details
Training Data
- Primary dataset: GSM8K (Grade School Math 8K) - mathematical word problems
- Dataset size: 2,000 examples (subset for efficient training)
- Data format: Question-answer pairs with step-by-step solutions
- Filtering: Responses validated for proper tag structure and mathematical consistency
Training Procedure
Training Framework: Custom GRPO (Group Relative Policy Optimization) implementation
- Base architecture: LoRA (Low-Rank Adaptation) on microsoft/phi-2
- LoRA rank: 16
- LoRA alpha: 32
- Target modules: ["q_proj", "k_proj", "v_proj", "dense"]
- Dropout: 0.05
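For reference, the adapter configuration listed above corresponds roughly to the PEFT setup below. This is a sketch reconstructed from the documented hyperparameters, not the exact training script; loading the base model in FP16 here is a simplification (training used 4-bit quantization, see Training Infrastructure).

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

# Adapter settings matching the values listed above
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                  # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```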
Training Hyperparameters
- Learning rate: 1e-5
- Training steps: 2,000 max steps
- Batch size: 1 (with gradient accumulation)
- Gradient accumulation steps: 8
- Max sequence length: 768 tokens
- Max new tokens: 384
- Temperature: 0.8 (during training generation)
- Number of generations per prompt: 3
- Optimizer: AdamW with weight decay 0.01
- Scheduler: Cosine with warmup (3% of total steps)
- Gradient clipping: 1.0
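A hedged sketch of how the optimizer and schedule above fit together, assuming `model` is the LoRA-wrapped model from the previous sketch; the surrounding training loop (gradient accumulation over 8 micro-batches) is only indicated in comments.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 2000                       # max training steps
warmup_steps = int(0.03 * total_steps)   # 3% warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# Per update: accumulate gradients over 8 micro-batches, then clip and step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```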
GRPO-Specific Settings
Generation per training step:
- 3 responses generated for each question
- Each response evaluated using multiple reward functions
- Advantages calculated as (individual_reward - group_average)
- Model trained to maximize high-advantage responses
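The group-relative advantage described above reduces to subtracting the group mean from each response's reward. A minimal sketch:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Compute advantages for one prompt's group of sampled responses.

    rewards: tensor of shape (num_generations,), e.g. 3 rewards per question.
    Each advantage is the response's reward minus the group average, so
    above-average responses are reinforced and below-average ones penalized.
    """
    return rewards - rewards.mean()

# Example: three responses to the same question
rewards = torch.tensor([1.7, 0.9, 0.3])
print(group_relative_advantages(rewards))  # tensor([ 0.7333, -0.0667, -0.6667])
```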
Reward Functions
Three main reward functions guided the training:
Correctness reward (weight: 1.0)
- Exact answer matching with expected solution
- Partial credit for numerical matches
- Normalized answer comparison
Reasoning reward (weight: 0.7)
- Presence of thinking content within <think> tags
- Quality indicators: step-by-step structure, calculations
- Length bonus for detailed reasoning (following DeepSeek findings)
Format reward (weight: 0.3)
- Proper use of <think> and <answer> tags
- Correct ordering (thinking before answering)
- Balanced tag structure
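Put together, the weighted combination of the three rewards looks roughly like the sketch below. The individual scoring heuristics (partial credit, length-bonus scaling) are simplified assumptions for illustration; only the tag checks and the 1.0 / 0.7 / 0.3 weights come from the description above.

```python
import re

def correctness_reward(response: str, expected: str) -> float:
    """Exact match on the extracted answer (the training code also gives partial credit)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == expected.strip() else 0.0

def reasoning_reward(response: str) -> float:
    """Reward presence of thinking content, with a capped length bonus."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if not match:
        return 0.0
    n_words = len(match.group(1).split())
    return min(1.0, 0.5 + n_words / 200)  # hypothetical length-bonus scaling

def format_reward(response: str) -> float:
    """Reward proper <think>...</think> followed by <answer>...</answer> structure."""
    has_think = "<think>" in response and "</think>" in response
    has_answer = "<answer>" in response and "</answer>" in response
    if has_think and has_answer and response.find("</think>") < response.find("<answer>"):
        return 1.0
    return 0.25 * (has_think + has_answer)  # partial credit for incomplete structure

def total_reward(response: str, expected: str) -> float:
    return (1.0 * correctness_reward(response, expected)
            + 0.7 * reasoning_reward(response)
            + 0.3 * format_reward(response))
```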
Training Infrastructure
- Hardware: Google Colab Free Tier / Consumer GPU (8-16GB VRAM)
- Memory optimization: 4-bit quantization with BitsAndBytesConfig
- VRAM usage: ~8GB for Phi-2 2.7B with LoRA
- Training time: 8-12 hours for full training
- Checkpointing: Every 50 steps with Google Drive backup
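A sketch of the 4-bit quantized loading used to keep the base model within roughly 8GB of VRAM. The card only states that BitsAndBytesConfig 4-bit quantization was used; the specific flags here (NF4, double quantization, FP16 compute) are common defaults and should be treated as assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumed; common default
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,        # assumed; common default
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for LoRA training
base_model = prepare_model_for_kbit_training(base_model)
```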
Special Features
- Aha moment detection: Automatic detection of significant improvements in reasoning length/quality
- Memory-efficient implementation: Custom GRPO loss computation to fit in limited VRAM
- Robust error handling: Graceful handling of generation failures during training
- Progress monitoring: Real-time tracking of reward trends and response characteristics
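One plausible way to implement the aha-moment detection described above is to track a running average of thinking length per training step and flag sudden jumps. The class and threshold below are hypothetical, for illustration only.

```python
from collections import deque
from typing import List

class AhaMomentDetector:
    """Flag training steps where the average thinking length jumps sharply."""

    def __init__(self, window: int = 20, jump_ratio: float = 1.5):
        self.history = deque(maxlen=window)   # recent per-step average lengths
        self.jump_ratio = jump_ratio          # hypothetical jump threshold

    def update(self, thinking_lengths: List[int]) -> bool:
        """Record this step's thinking lengths; return True on a sudden jump."""
        current = sum(thinking_lengths) / max(len(thinking_lengths), 1)
        baseline = (sum(self.history) / len(self.history)) if self.history else current
        self.history.append(current)
        return baseline > 0 and current > self.jump_ratio * baseline
```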
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Primary: GSM8K validation set (mathematical reasoning)
- Custom test cases: Simple arithmetic and word problems
- Format evaluation: Tag structure compliance assessment
Factors
Evaluation considers:
- Answer correctness - Exact match with expected numerical results
- Reasoning quality - Presence and coherence of logical steps
- Format compliance - Proper <think> and <answer> tag usage
- Response completeness - Both reasoning and answer sections present
- Reasoning length - Tracking development of longer, more detailed thinking
Metrics
- Correctness rate - Percentage of mathematically correct final answers
- Format compliance rate - Percentage of properly formatted responses
- Reasoning presence rate - Percentage with meaningful thinking content
- Average thinking length - Words in reasoning sections (key metric for "aha moments")
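Given responses formatted with <think> and <answer> tags, these metrics reduce to simple counts. A minimal sketch, assuming `pairs` is a list of (response, expected_answer) tuples:

```python
import re

def evaluate(pairs):
    """Compute the metrics above over (response, expected_answer) pairs."""
    n = len(pairs)
    correct = formatted = reasoned = 0
    thinking_lengths = []
    for response, expected in pairs:
        think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
        answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        if think and answer:
            formatted += 1
        if think and think.group(1).strip():
            reasoned += 1
            thinking_lengths.append(len(think.group(1).split()))
        if answer and answer.group(1).strip() == expected.strip():
            correct += 1
    return {
        "correctness_rate": correct / n,
        "format_compliance_rate": formatted / n,
        "reasoning_presence_rate": reasoned / n,
        "avg_thinking_length": sum(thinking_lengths) / max(len(thinking_lengths), 1),
    }
```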
Results
Based on training progression and limited evaluation:
- Answer correctness: 65-80% on GSM8K-style problems (varies by complexity)
- Format compliance: 90%+ of responses follow expected tag structure
- Reasoning presence: 95%+ of responses include substantive thinking content
- Aha moment detection: Multiple instances of spontaneous reasoning improvement
- Thinking length growth: 3x increase in average reasoning length during training
Notable achievements:
- Spontaneous development of multi-step verification
- Self-correction behaviors emerging without explicit training
- Consistent format adoption after GRPO training
- Evidence of "aha moments" similar to those observed in DeepSeek-R1-Zero
Note: Comprehensive benchmarking pending. Results based on training monitoring and qualitative assessment.
Model Examination
Emergent Behaviors
The model demonstrates several interesting reasoning patterns that emerged during GRPO training:
- Step-by-step decomposition of complex problems
- Self-verification attempts where the model checks its own work
- Alternative approach exploration when initial methods seem insufficient
- Metacognitive awareness - reasoning about its own reasoning process
- Progressive refinement of answers through extended thinking
Aha Moments
During training, the model exhibited several "aha moments" - sudden improvements in reasoning capability:
- Significant increases in thinking length (tracked automatically)
- Spontaneous adoption of verification strategies
- Development of more sophisticated problem-solving approaches
- Self-correction behaviors without explicit training data
Areas for Improvement
- Conciseness vs. thoroughness - balancing detailed reasoning with efficiency
- Mathematical accuracy - occasional arithmetic errors in complex calculations
- Domain transfer - extending reasoning capabilities beyond mathematics
- Consistency - maintaining reasoning quality across all problem types
Environmental Impact
Training was conducted with efficiency in mind:
- Hardware Type: Consumer GPU (equivalent to RTX 3090/4090)
- Hours used: ~12 hours total training time
- Training method: Parameter-efficient LoRA fine-tuning
- Memory optimization: 4-bit quantization reduces energy consumption
- Carbon footprint: Minimal due to efficient training approach and short duration
Technical Specifications
Model Architecture and Objective
- Base architecture: Microsoft Phi-2 (2.7B parameters)
- Adaptation method: LoRA (Low-Rank Adaptation)
- Training objective: GRPO loss maximizing reward-weighted policy gradients
- Context length: 768 tokens (training), expandable to 2048 (inference)
- Vocabulary size: 51,200 tokens
- Precision: Mixed precision (FP16) with 4-bit quantization during training
Compute Infrastructure
Hardware
- Training: Single GPU (8-16GB VRAM sufficient)
- Inference: CPU or single GPU (4GB+ VRAM recommended)
- Memory requirements: 6-8GB for inference, 8-12GB for training
Software
- Framework: PyTorch 2.0+, Transformers 4.36.2+, PEFT 0.15.2
- Quantization: BitsAndBytesConfig for 4-bit training
- Optimization: Gradient checkpointing, memory-efficient attention
- Platform compatibility: Google Colab, local GPU setups, cloud instances
Citation
If you use this model in your research or applications, please cite:
BibTeX:
@misc{infused_phi2_2024,
  title={Infused Phi-2: A Reasoning-Enhanced Language Model via GRPO},
  author={pierizvi},
  year={2024},
  howpublished={HuggingFace Model Hub},
  note={Fine-tuned from microsoft/phi-2 using Group Relative Policy Optimization},
  url={https://huggingface.co/pierizvi/infused-phi2}
}
APA: pierizvi. (2024). Infused Phi-2: A Reasoning-Enhanced Language Model via GRPO. HuggingFace Model Hub. https://huggingface.co/pierizvi/infused-phi2
Acknowledgments
- Microsoft for the excellent Phi-2 base model
- DeepSeek for pioneering GRPO methodology and demonstrating the potential of RL-based reasoning
- OpenAI for the GSM8K dataset used in training
- HuggingFace community for the transformers and PEFT libraries
- Google Colab for providing accessible compute resources
Model Card Authors
pierizvi - Model development, training implementation, and documentation
Model Card Contact
For questions, issues, or collaboration opportunities related to this model, please reach out through:
- HuggingFace: @pierizvi
- GitHub: Issues and discussions welcome
Framework Versions
- PyTorch: 2.0+
- Transformers: 4.36.2+
- PEFT: 0.15.2
- bitsandbytes: 0.41.1+
- Python: 3.8+
This model represents an exploration into reasoning enhancement through reinforcement learning, inspired by DeepSeek's groundbreaking work on R1. While not achieving the scale of DeepSeek-R1, it demonstrates that meaningful reasoning capabilities can emerge even in smaller models through careful training.