Qwen2.5-3B-UFO

This model is based on Qwen2.5-3B-Instruct and trained with PPO (Proximal Policy Optimization) on the MetaMathQA dataset for mathematical reasoning.

GitHub: https://github.com/lichengliu03/unary-feedback

Model Info

  • Base model: Qwen/Qwen2.5-3B-Instruct
  • Training method: PPO (full-parameter fine-tuning, not LoRA)
  • Training data: MATH_MetaMathQA
  • Training steps: 200
  • Framework: VERL
  • Tensor parallel: distributed training across 2 GPUs
  • Model size: ~6 GB
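
The ~6 GB checkpoint size is consistent with FP16 storage, which uses 2 bytes per parameter. A quick back-of-the-envelope check (the ~3.09B parameter count is an assumption based on the base model's size):

params = 3.09e9              # approximate parameter count of Qwen2.5-3B (assumption)
bytes_per_param = 2          # FP16 stores 2 bytes per parameter
print(f"{params * bytes_per_param / 1e9:.1f} GB")  # ~6.2 GB, matching the ~6 GB above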

Training Config

  • Micro Batch Size: 1 per GPU
  • PPO Mini Batch Size: 8
  • Actor Learning Rate: auto
  • Critic Learning Rate: auto
  • KL Penalty: 0.001
  • Clip Ratio: 0.2-0.28 (both plug into the PPO objective sketched after this list)
  • Temperature: 1.0 (train), 0.5 (eval)
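
For reference, here is a minimal sketch of the PPO clipped surrogate loss and KL-penalized reward that the clip ratio and KL penalty above plug into. Function names and shapes are illustrative, not taken from the actual VERL training code:

import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    # Probability ratio between the updated policy and the rollout policy
    ratio = torch.exp(logp_new - logp_old)
    # PPO clipped surrogate objective; clip_ratio matches the 0.2-0.28 range above
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_reward(task_reward, logp_policy, logp_ref, kl_coef=0.001):
    # Per-token KL estimate against the frozen reference model,
    # subtracted from the task reward with the 0.001 coefficient above
    kl = logp_policy - logp_ref
    return task_reward - kl_coef * kl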

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
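
Since the base model is instruction-tuned, applying the tokenizer's chat template usually yields better outputs than a raw prompt. A sketch using the same model and tokenizer as above (the 0.5 temperature follows the eval setting from the training config):

# Build a chat-formatted prompt for the instruction-tuned model
messages = [{"role": "user", "content": prompt}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        chat_inputs,
        max_new_tokens=512,
        temperature=0.5,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))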

Features

This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:

  • ✅ Math problem understanding
  • ✅ Logical reasoning accuracy
  • ✅ Clarity of solution steps
  • ✅ Calculation accuracy

Technical Details

  • Tensor parallel training: 2 GPUs, distributed
  • Memory optimization: gradient checkpointing and mixed precision
  • Reward modeling: based on MetaMathQA answer correctness and reasoning quality (a sketch follows this list)
  • Policy optimization: PPO for stable training
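
The exact reward function is not published; the following is a hypothetical sketch of the kind of correctness-based reward described above, matching a model's final numeric answer against the MetaMathQA reference:

import re

def correctness_reward(response: str, gold_answer: str) -> float:
    # Hypothetical reward: take the last number in the response as the
    # final answer and compare it with the reference answer
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer.strip() else 0.0

# Example: reward 1.0 for a correct final answer
print(correctness_reward("The area is 25*pi, approximately 78.54", "78.54"))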

Limitations

  • Mainly optimized for mathematical reasoning
  • May not perform as well on general tasks
  • Recommended for math, logic, and reasoning tasks

License

This model is licensed under Apache 2.0.
