sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH
📄 Paper | 🌐 Project Page | 💻 GitHub
Description:
This model is a GRPO-fine-tuned version of Qwen3-14B, specifically trained on the MATH dataset. It is part of the Intuitor project, presented in the paper "Learning to Reason without External Rewards".
Intuitor is a novel reinforcement learning method that leverages self-certainty—the model’s own internal confidence—as its sole reward signal to fine-tune large language models (LLMs). This approach falls under a new framework called Reinforcement Learning from Internal Feedback (RLIF), which enables LLMs to learn effectively from intrinsic signals, circumventing the need for costly external rewards, gold labels, or verifiers. This makes RLIF a scalable and domain-agnostic alternative to traditional RL methods, particularly useful when verifiable rewards are unavailable.
This particular model demonstrates Intuitor's ability to match GRPO's performance on mathematical benchmarks while showing superior generalization to out-of-domain tasks like code generation, all without requiring gold solutions or test cases.
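For intuition, here is a minimal sketch of how a self-certainty-style score could be computed from a model's output logits, using the KL divergence between the model's next-token distribution and a uniform distribution over the vocabulary, averaged over the generated tokens. This is an illustrative approximation, not the Intuitor training code; the function name self_certainty_score and the exact reduction are assumptions (see the paper for the precise formulation).
import torch
import torch.nn.functional as F

def self_certainty_score(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty: mean KL(p || U) over generated tokens.

    logits: (seq_len, vocab_size) logits for the generated tokens.
    A peaked (confident) distribution diverges more from uniform, giving a
    higher score; the exact formulation used by Intuitor may differ.
    """
    log_probs = F.log_softmax(logits, dim=-1)  # log p(token | context)
    probs = log_probs.exp()
    vocab_size = logits.size(-1)
    log_uniform = -torch.log(torch.tensor(float(vocab_size)))  # log(1/V)
    # KL(p || U) = sum_v p_v * (log p_v - log(1/V)), computed per position
    kl_per_token = (probs * (log_probs - log_uniform)).sum(dim=-1)
    return kl_per_token.mean()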
Usage
You can use this model with the `transformers` library for text generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()
# Build the prompt with the model's chat template, as is typical for instruction-tuned models like Qwen.
# Adjust the prompt format as needed for your specific use case.
messages = [
    {"role": "user", "content": "Solve the following equation: $x + 7 = 15$. Show your steps."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,  # pass input_ids and attention_mask together
    max_new_tokens=512,  # leave room for step-by-step reasoning; adjust as needed
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id
)
# Decode only the newly generated tokens, excluding the prompt.
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generated_text)
Citation
If you use Intuitor in your research, please cite our paper:
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}