Model Card for Infused Phi-2
A reasoning-enhanced version of Microsoft's Phi-2 model, fine-tuned using Group Relative Policy Optimization (GRPO) to develop step-by-step reasoning capabilities expressed with <think> and <answer> tags.
Model Details
Model Description
Infused Phi-2 is a 2.7B parameter reasoning model built upon Microsoft's Phi-2 foundation model. Through reinforcement learning with GRPO (Group Relative Policy Optimization), the model has been trained to generate structured reasoning responses with an explicit thinking process before the final answer. The model follows a specific format, using <think> and <answer> tags to separate the thought process from the conclusion.
Unlike models that rely on pre-collected reasoning data, Infused Phi-2 learned to reason through pure reinforcement learning, developing its own reasoning patterns and occasionally exhibiting "aha moments" where it spontaneously improves its problem-solving approach.
- Developed by: pierizvi
- Model type: Causal Language Model with Reasoning Enhancement
- Language(s): English
- License: MIT (inherits from base Phi-2 model)
- Finetuned from model: microsoft/phi-2
- Training method: GRPO (Group Relative Policy Optimization) with LoRA
- Parameters: 2.7B (with LoRA adapters)
- Context length: 768 tokens (training), up to 2048 (inference)
Model Sources
- Base Repository: https://huggingface.co/microsoft/phi-2
- Training Framework: Custom GRPO implementation with PyTorch + PEFT
- Inspiration: DeepSeek-R1 methodology and research
Uses
Direct Use
Infused Phi-2 is designed for tasks requiring explicit reasoning and step-by-step problem solving:
- Mathematical problem solving (arithmetic, algebra, word problems)
- Logical reasoning tasks
- Multi-step analysis and planning
- Educational assistance where showing work is important
- Problem decomposition and verification
The model generates responses in this format:
<think>
Let me think about this step by step.
First, I need to understand what the problem is asking...
Then I'll work through the calculations...
</think>
<answer>
The final answer is 42.
</answer>
Downstream Use
This model can be further fine-tuned for domain-specific reasoning tasks such as:
- Mathematical tutoring systems
- Logic puzzle solving
- Scientific problem analysis
- Step-by-step troubleshooting guides
- Educational content generation
Out-of-Scope Use
- Real-time applications requiring immediate responses (due to longer generation time for reasoning)
- Tasks not requiring reasoning where the base model might be more efficient
- Factual knowledge retrieval without reasoning requirements
- Creative writing where structured reasoning format may be inappropriate
- Critical applications without human verification of reasoning steps
Bias, Risks, and Limitations
- Inherited biases from the base Phi-2 model
- Reasoning format dependency - may struggle with tasks not suited to explicit reasoning
- Computational overhead from longer response generation
- Training scope - primarily trained on mathematical reasoning (GSM8K dataset)
- Potential for verbose responses that may include unnecessary reasoning steps
- Format compliance - may occasionally deviate from expected tag structure
- Mathematical accuracy - reasoning steps should be verified for critical applications
Recommendations
Users should:
- Validate reasoning steps for critical applications
- Be aware that longer reasoning doesn't guarantee correctness
- Test on domain-specific examples before deployment
- Consider computational costs of inference
- Implement output parsing to handle format variations
- Use for educational/learning purposes where showing work is valuable
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Load the fine-tuned adapter
model = PeftModel.from_pretrained(base_model, "pierizvi/infused-phi2")

# Set up for inference
model.eval()
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# System prompt (important for consistent behavior)
system_prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively."""

# Example usage
def generate_response(question):
    prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{question}\n<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=384,
            temperature=0.2,  # Lower for more consistent reasoning
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )
    response = tokenizer.decode(outputs[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return response

# Test the model
question = "If I have 12 apples and give away 3 to my friend and eat 2 myself, how many apples do I have left?"
response = generate_response(question)
print(response)
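As noted under Recommendations, responses can occasionally deviate from the expected tag structure, so it is worth parsing outputs defensively. The following is a minimal illustrative sketch, not part of the released code: a hypothetical `parse_reasoning_response` helper that extracts the two sections with regular expressions and falls back to the raw text when a tag is missing.

```python
import re

def parse_reasoning_response(response):
    """Split a response into its thinking and answer sections.

    Falls back gracefully when a tag is missing or malformed.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return {
        "thinking": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else response.strip(),
    }

# Reuse the response generated above
parsed = parse_reasoning_response(response)
print(parsed["answer"])
```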
Training Details
Training Data
- Primary dataset: GSM8K (Grade School Math 8K) - mathematical word problems
- Dataset size: 2,000 examples (subset for efficient training)
- Data format: Question-answer pairs with step-by-step solutions
- Filtering: Responses validated for proper tag structure and mathematical consistency
Training Procedure
Training Framework: Custom GRPO (Group Relative Policy Optimization) implementation
- Base architecture: LoRA (Low-Rank Adaptation) on microsoft/phi-2
- LoRA rank: 16
- LoRA alpha: 32
- Target modules: ["q_proj", "k_proj", "v_proj", "dense"]
- Dropout: 0.05
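For reference, the adapter configuration listed above corresponds roughly to the PEFT setup below. This is a sketch reconstructed from the documented hyperparameters, not the exact training script; loading the base model in FP16 here is a simplification (training used 4-bit quantization, see Training Infrastructure).

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

# Adapter settings matching the values listed above
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                  # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```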
Training Hyperparameters
- Learning rate: 1e-5
- Training steps: 2,000 max steps
- Batch size: 1 (with gradient accumulation)
- Gradient accumulation steps: 8
- Max sequence length: 768 tokens
- Max new tokens: 384
- Temperature: 0.8 (during training generation)
- Number of generations per prompt: 3
- Optimizer: AdamW with weight decay 0.01
- Scheduler: Cosine with warmup (3% of total steps)
- Gradient clipping: 1.0
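A hedged sketch of how the optimizer and schedule above fit together, assuming `model` is the LoRA-wrapped model from the previous sketch; the surrounding training loop (gradient accumulation over 8 micro-batches) is only indicated in comments.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 2000                       # max training steps
warmup_steps = int(0.03 * total_steps)   # 3% warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# Per update: accumulate gradients over 8 micro-batches, then clip and step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```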
GRPO-Specific Settings
Generation per training step:
- 3 responses generated for each question
- Each response evaluated using multiple reward functions
- Advantages calculated as (individual_reward - group_average)
- Model trained to maximize high-advantage responses
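The group-relative advantage described above reduces to subtracting the group mean from each response's reward. A minimal sketch:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Compute advantages for one prompt's group of sampled responses.

    rewards: tensor of shape (num_generations,), e.g. 3 rewards per question.
    Each advantage is the response's reward minus the group average, so
    above-average responses are reinforced and below-average ones penalized.
    """
    return rewards - rewards.mean()

# Example: three responses to the same question
rewards = torch.tensor([1.7, 0.9, 0.3])
print(group_relative_advantages(rewards))  # tensor([ 0.7333, -0.0667, -0.6667])
```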
Reward Functions
Three main reward functions guided the training:
Correctness reward (weight: 1.0)
- Exact answer matching with expected solution
- Partial credit for numerical matches
- Normalized answer comparison
Reasoning reward (weight: 0.7)
- Presence of thinking content within <think> tags
- Quality indicators: step-by-step structure, calculations
- Length bonus for detailed reasoning (following DeepSeek findings)
Format reward (weight: 0.3)
- Proper use of <think> and <answer> tags
- Correct ordering (thinking before answering)
- Balanced tag structure
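Put together, the weighted combination of the three rewards looks roughly like the sketch below. The individual scoring heuristics (partial credit, length-bonus scaling) are simplified assumptions for illustration; only the tag checks and the 1.0 / 0.7 / 0.3 weights come from the description above.

```python
import re

def correctness_reward(response: str, expected: str) -> float:
    """Exact match on the extracted answer (the training code also gives partial credit)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == expected.strip() else 0.0

def reasoning_reward(response: str) -> float:
    """Reward presence of thinking content, with a capped length bonus."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if not match:
        return 0.0
    n_words = len(match.group(1).split())
    return min(1.0, 0.5 + n_words / 200)  # hypothetical length-bonus scaling

def format_reward(response: str) -> float:
    """Reward proper <think>...</think> followed by <answer>...</answer> structure."""
    has_think = "<think>" in response and "</think>" in response
    has_answer = "<answer>" in response and "</answer>" in response
    if has_think and has_answer and response.find("</think>") < response.find("<answer>"):
        return 1.0
    return 0.25 * (has_think + has_answer)  # partial credit for incomplete structure

def total_reward(response: str, expected: str) -> float:
    return (1.0 * correctness_reward(response, expected)
            + 0.7 * reasoning_reward(response)
            + 0.3 * format_reward(response))
```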
Training Infrastructure
- Hardware: Google Colab Free Tier / Consumer GPU (8-16GB VRAM)
- Memory optimization: 4-bit quantization with BitsAndBytesConfig
- VRAM usage: ~8GB for Phi-2 2.7B with LoRA
- Training time: 8-12 hours for full training
- Checkpointing: Every 50 steps with Google Drive backup
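A sketch of the 4-bit quantized loading used to keep the base model within roughly 8GB of VRAM. The card only states that BitsAndBytesConfig 4-bit quantization was used; the specific flags here (NF4, double quantization, FP16 compute) are common defaults and should be treated as assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumed; common default
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,        # assumed; common default
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the quantized model for LoRA training
base_model = prepare_model_for_kbit_training(base_model)
```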
Special Features
- Aha moment detection: Automatic detection of significant improvements in reasoning length/quality
- Memory-efficient implementation: Custom GRPO loss computation to fit in limited VRAM
- Robust error handling: Graceful handling of generation failures during training
- Progress monitoring: Real-time tracking of reward trends and response characteristics
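One plausible way to implement the aha-moment detection described above is to track a running average of thinking length per training step and flag sudden jumps. The class and threshold below are hypothetical, for illustration only.

```python
from collections import deque
from typing import List

class AhaMomentDetector:
    """Flag training steps where the average thinking length jumps sharply."""

    def __init__(self, window: int = 20, jump_ratio: float = 1.5):
        self.history = deque(maxlen=window)   # recent per-step average lengths
        self.jump_ratio = jump_ratio          # hypothetical jump threshold

    def update(self, thinking_lengths: List[int]) -> bool:
        """Record this step's thinking lengths; return True on a sudden jump."""
        current = sum(thinking_lengths) / max(len(thinking_lengths), 1)
        baseline = (sum(self.history) / len(self.history)) if self.history else current
        self.history.append(current)
        return baseline > 0 and current > self.jump_ratio * baseline
```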
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Primary: GSM8K validation set (mathematical reasoning)
- Custom test cases: Simple arithmetic and word problems
- Format evaluation: Tag structure compliance assessment
Factors
Evaluation considers:
- Answer correctness - Exact match with expected numerical results
- Reasoning quality - Presence and coherence of logical steps
- Format compliance - Proper <think> and <answer> tag usage
- Response completeness - Both reasoning and answer sections present
- Reasoning length - Tracking development of longer, more detailed thinking
Metrics
- Correctness rate - Percentage of mathematically correct final answers
- Format compliance rate - Percentage of properly formatted responses
- Reasoning presence rate - Percentage with meaningful thinking content
- Average thinking length - Words in reasoning sections (key metric for "aha moments")
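Given responses formatted with <think> and <answer> tags, these metrics reduce to simple counts. A minimal sketch, assuming `pairs` is a list of (response, expected_answer) tuples:

```python
import re

def evaluate(pairs):
    """Compute the metrics above over (response, expected_answer) pairs."""
    n = len(pairs)
    correct = formatted = reasoned = 0
    thinking_lengths = []
    for response, expected in pairs:
        think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
        answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        if think and answer:
            formatted += 1
        if think and think.group(1).strip():
            reasoned += 1
            thinking_lengths.append(len(think.group(1).split()))
        if answer and answer.group(1).strip() == expected.strip():
            correct += 1
    return {
        "correctness_rate": correct / n,
        "format_compliance_rate": formatted / n,
        "reasoning_presence_rate": reasoned / n,
        "avg_thinking_length": sum(thinking_lengths) / max(len(thinking_lengths), 1),
    }
```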
Results
Based on training progression and limited evaluation:
- Answer correctness: 65-80% on GSM8K-style problems (varies by complexity)
- Format compliance: 90%+ of responses follow expected tag structure
- Reasoning presence: 95%+ of responses include substantive thinking content
- Aha moment detection: Multiple instances of spontaneous reasoning improvement
- Thinking length growth: 3x increase in average reasoning length during training
Notable achievements:
- Spontaneous development of multi-step verification
- Self-correction behaviors emerging without explicit training
- Consistent format adoption after GRPO training
- Evidence of "aha moments" similar to those observed in DeepSeek-R1-Zero
Note: Comprehensive benchmarking pending. Results based on training monitoring and qualitative assessment.
Model Examination
Emergent Behaviors
The model demonstrates several interesting reasoning patterns that emerged during GRPO training:
- Step-by-step decomposition of complex problems
- Self-verification attempts where the model checks its own work
- Alternative approach exploration when initial methods seem insufficient
- Metacognitive awareness - reasoning about its own reasoning process
- Progressive refinement of answers through extended thinking
Aha Moments
During training, the model exhibited several "aha moments" - sudden improvements in reasoning capability:
- Significant increases in thinking length (tracked automatically)
- Spontaneous adoption of verification strategies
- Development of more sophisticated problem-solving approaches
- Self-correction behaviors without explicit training data
Areas for Improvement
- Conciseness vs. thoroughness - balancing detailed reasoning with efficiency
- Mathematical accuracy - occasional arithmetic errors in complex calculations
- Domain transfer - extending reasoning capabilities beyond mathematics
- Consistency - maintaining reasoning quality across all problem types
Environmental Impact
Training was conducted with efficiency in mind:
- Hardware Type: Consumer GPU (equivalent to RTX 3090/4090)
- Hours used: ~12 hours total training time
- Training method: Parameter-efficient LoRA fine-tuning
- Memory optimization: 4-bit quantization reduces energy consumption
- Carbon footprint: Minimal due to efficient training approach and short duration
Technical Specifications
Model Architecture and Objective
- Base architecture: Microsoft Phi-2 (2.7B parameters)
- Adaptation method: LoRA (Low-Rank Adaptation)
- Training objective: GRPO loss maximizing reward-weighted policy gradients
- Context length: 768 tokens (training), expandable to 2048 (inference)
- Vocabulary size: 51,200 tokens
- Precision: Mixed precision (FP16) with 4-bit quantization during training
Compute Infrastructure
Hardware
- Training: Single GPU (8-16GB VRAM sufficient)
- Inference: CPU or single GPU (4GB+ VRAM recommended)
- Memory requirements: 6-8GB for inference, 8-12GB for training
Software
- Framework: PyTorch 2.0+, Transformers 4.36.2+, PEFT 0.15.2
- Quantization: BitsAndBytesConfig for 4-bit training
- Optimization: Gradient checkpointing, memory-efficient attention
- Platform compatibility: Google Colab, local GPU setups, cloud instances
Citation
If you use this model in your research or applications, please cite:
BibTeX:
@misc{infused_phi2_2024,
  title={Infused Phi-2: A Reasoning-Enhanced Language Model via GRPO},
  author={pierizvi},
  year={2024},
  howpublished={HuggingFace Model Hub},
  note={Fine-tuned from microsoft/phi-2 using Group Relative Policy Optimization},
  url={https://huggingface.co/pierizvi/infused-phi2}
}
APA: pierizvi. (2024). Infused Phi-2: A Reasoning-Enhanced Language Model via GRPO. HuggingFace Model Hub. https://huggingface.co/pierizvi/infused-phi2
Acknowledgments
- Microsoft for the excellent Phi-2 base model
- DeepSeek for pioneering GRPO methodology and demonstrating the potential of RL-based reasoning
- OpenAI for the GSM8K dataset used in training
- HuggingFace community for the transformers and PEFT libraries
- Google Colab for providing accessible compute resources
Model Card Authors
pierizvi - Model development, training implementation, and documentation
Model Card Contact
For questions, issues, or collaboration opportunities related to this model, please reach out through:
- HuggingFace: @pierizvi
- GitHub: Issues and discussions welcome
Framework Versions
- PyTorch: 2.0+
- Transformers: 4.36.2+
- PEFT: 0.15.2
- bitsandbytes: 0.41.1+
- Python: 3.8+
This model represents an exploration into reasoning enhancement through reinforcement learning, inspired by DeepSeek's groundbreaking work on R1. While not achieving the scale of DeepSeek-R1, it demonstrates that meaningful reasoning capabilities can emerge even in smaller models through careful training.