
Model Details

The generative reward model used in the paper "Expanding RL with Verifiable Rewards Across Diverse Domains".

Given a question, a reference answer (label), and a response to be evaluated, the model judges whether the response is correct.

Quick start

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
model = AutoModelForCausalLM.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")

PROMPT= '''
Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.  
The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.  
**The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**  

Your task:  
- Compare the final output of the solution process with the reference answer.  
- If they **match exactly**, output **YES**.  
- If they **do not match**, output **NO**.  
- If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.  

Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.  

---

**Question:**  
{question}  

**Solution Process (Final Step Only):**  
{response}  

**Reference Answer:**  
{reference}  

**Output:**  
'''


question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (  )."
label="Chen Heqin"
answer="heqin chen"

prompt_question = PROMPT.format(question=question, reference=label, response=answer)
messages=[
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": prompt_question},
         ]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, do_sample=False, max_new_tokens=8)
judgement = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)  # strictly "YES" or "NO"
print("Model judgement:", judgement)

Use as a remote reward

# launch a remote reward
bash launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}

# MODEL_PATH: the path of our generative reward model.
# ANSWER_PATH: the path of the training data.
# METRIC: greedy/prob
# This will launch a reward at http://127.0.0.1:8000/get_reward

# train
bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}

# Both train.sh and launch_reward.sh can be found in the model directory.
# We will release our GitHub repo soon!
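
For the prob metric, the reward is presumably a soft score derived from the probability the verifier assigns to "YES" rather than a hard YES/NO judgement; the exact computation lives in launch_reward.sh. A hedged sketch of one way such a soft reward could be computed locally with the model and tokenizer loaded above:

import torch

def soft_reward(model, tokenizer, prompt_question):
    # Hypothetical "prob"-style soft reward: probability mass on "YES"
    # (vs. "NO") for the first generated token. The released scripts may
    # compute this differently.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_question},
    ]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]  # logits for the first answer token
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # probability of "YES", in [0, 1]

print("Soft reward:", soft_reward(model, tokenizer, prompt_question))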

Citation

@misc{su2025expandingrlverifiablerewards,
      title={Expanding RL with Verifiable Rewards Across Diverse Domains}, 
      author={Yi Su and Dian Yu and Linfeng Song and Juntao Li and Haitao Mi and Zhaopeng Tu and Min Zhang and Dong Yu},
      year={2025},
      eprint={2503.23829},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.23829}, 
}