---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
pipeline_tag: text-generation
tags:
- Reward
- RewardModel
- RewardReasoning
- Reasoning
- RLHF
- Best-of-N
---
### Introduction
This repository contains the released reward reasoning models for the paper [GRAM-R^2: Self-Training Generative Foundation Reward Models for Reward Reasoning 📝](https://arxiv.org/abs/2509.02492).
<img src="https://raw.githubusercontent.com/wangclnlp/GRAM/refs/heads/main/gram-rr.png" width="1000px"></img>
We propose a self-training approach that enables reward models to elicit reward reasoning from both rationale-free labeled data and unlabeled data. This approach avoids the need for costly rationale-based annotations, enabling scalability in building foundation reward models. Specifically, we first train a preference-proving model that, given an input, a response pair, and a preference label, generates a proof explaining why the labeled preference holds. For rationale-free labeled data, this model is used to synthesize rationales for each example. For unlabeled data, the reward model improves its reasoning capability through an iterative self-training loop: (1) predicting preference labels for unlabeled examples, (2) generating corresponding rationales with the preference-proving model, and (3) updating the reward model using the synthesized data. This process scales reward reasoning by leveraging large amounts of unlabeled data. The dataset is available at this [link](https://huggingface.co/datasets/wangclnlp/GRAM-RR-TrainingData).
This reward model is fine-tuned from [LLaMA-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
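For intuition, one round of the iterative loop can be sketched roughly as follows. This is a minimal sketch with hypothetical interfaces (`predict_preference`, `prove`, `finetune`), not the released training code; the actual code is available via the repository linked in the Usage section.
```python
# A rough sketch of one self-training round over unlabeled preference data.
# All interfaces below are hypothetical placeholders for illustration only.
def self_training_round(reward_model, preference_prover, unlabeled_pairs):
    synthesized = []
    for ex in unlabeled_pairs:
        # (1) predict a preference label for the unlabeled response pair
        label = reward_model.predict_preference(ex.prompt, ex.response_a, ex.response_b)
        # (2) generate a rationale (proof) explaining why that label holds
        rationale = preference_prover.prove(ex.prompt, ex.response_a, ex.response_b, label)
        synthesized.append((ex, label, rationale))
    # (3) update the reward model on the synthesized rationale-augmented data
    reward_model.finetune(synthesized)
    return reward_model
```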
### Evaluation Results
We evaluate our model on two challenging reward benchmarks, [RM-Bench](https://github.com/THU-KEG/RM-Bench) and [JudgeBench](https://huggingface.co/datasets/ScalerLab/JudgeBench). We compare its performance against four categories of baselines: (1) LLM-as-a-Judge approaches that prompt large language models to generate preferences, (2) open-source reward models, (3) reasoning reward models, and (4) reward models trained using unlabeled data.
- Results on RM-Bench.
| **Model** | **Params.** | **Chat** | **Math** | **Code** | **Safety** | **Overall** |
|:-|-:|:-:|:-:|:-:|:-:|:-:|
|**LLM-as-a-Judge**||||||
|GPT-4o |- |67.2 | 67.5 | 63.6 | 91.7 | 72.5|
|Claude-3.5-Sonnet|- |62.5 | 62.6 | 54.4 | 64.4 | 61.0|
|DeepSeek-R1-0528 |671B|76.7 | 74.3 | 51.0 | 89.2 | 72.8|
|**Open-Source Reward Models**||||||
|Llama-3.1-Nemotron-70B-Reward | 70B | 70.7 | 64.3 | 57.4 | 90.3 | 70.7|
|Skywork-Reward-Gemma-2-27B | 27B | 71.8 | 59.2 | 56.6 | 94.3 | 70.5|
|Skywork-Reward-Llama-3.1-8B | 8B | 69.5 | 60.6 | 54.5 | 95.7 | 70.1|
|Nemotron-Super | 49B | 73.7 | 91.4 | 75.0 | 90.6 | 82.7 |
|Nemotron-Super-Multilingual | 49B | **77.2** | **91.9** | 74.7 | 92.9 | 84.2|
|**Reasoning Reward Models**||||||
|RM-R1-Distilled-Qwen-32B | 32B | 74.2 | 91.8 | 74.1 | 95.4 | 83.9 |
|RM-R1-Distilled-Qwen-14B | 14B | 71.8 | 90.5 | 69.5 | 94.1 | 81.5 |
|RRM-32B | 32B | 66.6 | 81.4 | 65.2 | 79.4 | 73.1 |
|**Training with Unlabeled Preference Data**||||||
|GRAM-Qwen3-14B | 14B | 67.4 | 55.2 | 62.8 | 94.3 | 69.9 |
|GRAM-Qwen3-8B | 8B | 63.5 | 53.9 | 62.9 | 92.8 | 68.3 |
|**Ours**||||||
|GRAM-RR-LLaMA-3.2-3B-RewardModel | 3B | 74.4 | 88.8 | 76.6 | 95.5 | 83.8 |
|+voting@16 | 3B | 74.8 | 89.4 | 78.4 | 95.7 | 84.6 |
|GRAM-RR-LLaMA-3.1-8B-RewardModel | 8B | 76.0 | 89.8 | 80.6 | 96.2 | 85.7 |
|+voting@16 | 8B | 76.3 | 90.4 | **81.2** | **96.4** | **86.1** |
- Results on JudgeBench.
| **Model** | **Params.** | **Knowl.** | **Reason.** | **Math** | **Coding** | **Overall** |
|:-|-:|:-:|:-:|:-:|:-:|:-:|
|**LLM-as-a-Judge**||||||
|GPT-4o |- |50.6 | 54.1 | 75.0 | 59.5 | 59.8 |
|Claude-3.5-Sonnet|- |62.3 | 66.3 | 66.1 | 64.3 | 64.8|
|DeepSeek-R1-0528 |671B|59.1 | 82.7 | 80.4 | **92.9** | 78.8|
|**Open-Source Reward Models**||||||
|Llama-3.1-Nemotron-70B-Reward | 70B | 62.3 | 72.5 | 76.8 | 57.1 | 67.2|
|Skywork-Reward-Gemma-2-27B | 27B | 59.7 | 66.3 | 83.9 | 50.0 | 65.0|
|Skywork-Reward-Llama-3.1-8B | 8B | 59.1 | 64.3 | 76.8 | 50.0 | 62.5|
|Nemotron-Super | 49B | 71.4 | 73.5 | 87.5 | 76.2 | 77.2 |
|Nemotron-Super-Multilingual | 49B | 64.9 | 74.5 | 87.5 | 73.8 | 75.2|
|**Reasoning Reward Models**||||||
|RM-R1-Distilled-Qwen-32B | 32B | 76.0 | 80.6 | 88.1 | 70.5 | 78.8 |
|RM-R1-Distilled-Qwen-14B | 14B | 68.1 | 72.4 | 87.8 | 84.2 | 78.1 |
|RRM-32B | 32B | 79.9 | 70.4 | 87.5 | 65.0 | 75.7 |
|**Training with Unlabeled Preference Data**||||||
|GRAM-Qwen3-14B | 14B | 63.0 | 64.3 | **89.3** | 69.1 | 71.4 |
|GRAM-Qwen3-8B | 8B | 62.3 | 64.3 | 80.4 | 64.3 | 67.8 |
|**Ours**||||||
|GRAM-RR-LLaMA-3.2-3B-RewardModel | 3B | 93.0 | 78.1 | 81.6 | 68.5 | 80.3 |
|+voting@16 | 3B | **93.5** | 78.6 | 82.1 | 69.0 | 80.8 |
|GRAM-RR-LLaMA-3.1-8B-RewardModel | 8B | 90.9 | 83.7 | 87.5 | 61.9 | 81.0 |
|+voting@16 | 8B | 91.2 | **84.3** | 88.1 | 62.8 | **81.6** |
### Usage
You can run GRAM-R^2 directly using the demo provided below. You can also train GRAM-R^2 with the code available [here](https://github.com/NiuTrans/GRAM).
```python
import torch
import accelerate
from functools import cmp_to_key
from transformers import AutoTokenizer, AutoModelForCausalLM
pairwise_prompt = '''
You are a helpful assistant in evaluating the quality of the responses for a given instruction. Your goal is to select the better response for the given user question.
Select Response A or Response B, that is better for the given instruction. The two responses are generated by two different AI chatbots respectively.
Do NOT say both / neither are good.
Here are some rules of the evaluation:
(1) If the instruction does not contain harmful content, you should prioritize evaluating whether the output honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
(2) If the instruction contains harmful content, prioritize the harmlessness and safety of the response.
(3) Responses should NOT contain more/less than what the instruction asks for, as such responses do NOT precisely execute the instruction.
(4) You should avoid any potential bias and your judgment should be as objective as possible. Here are some potential sources of bias:
- The order in which the responses were presented should NOT affect your judgment, as Response A and Response B are **equally likely** to be the better.
- The length of the responses should NOT affect your judgement, as a longer response does not necessarily correspond to a better response. When making your decision, evaluate if the response length is appropriate for the given instruction.
Your reply should strictly follow this format:
<think>
Follow this format:
Feedback:
<provide free-text feedback on the overall helpfulness of the assistant response>
Comparision:
<give a brief analysis on which is better>
Conclusion:
<make your conclusion>
</think>
<answer>
A or B
</answer>
Here is the data.
[User Question]
{user_input}
[The Start of Assistant A's Response]
{response_1}
[The End of Assistant A's Response]
[The Start of Assistant B's Response]
{response_2}
[The End of Assistant B's Response]
'''.strip()
# an input example
user_input = '10 words to apologize for being late.'
responses = [
    "My sincere apologies for being late today.",
    "Apologies for making you wait; punctuality isn't my strong suit.",
    "I'm sorry I couldn’t be on time today; unexpected issues delayed me, and I appreciate your patience."
]
print('='*25 + '\n' + 'The user input is:\n\n' + user_input + '\n\n' + '='*25 + '\n')
for idx, response in enumerate(responses):
    print('='*25 + '\n' + f'The response {idx} is:\n\n' + response + '\n\n' + '='*25 + '\n')
# init model
model_name = "/path/to/the/model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# pairwise ranking
# 1 for response_1 is better, -1 for response_2 is better, 0 for no answer
def pairwise_ranking(user_input, response_1, response_2):
    messages = [
        {
            "role": "user",
            "content": pairwise_prompt.format(
                user_input=user_input,
                response_1=response_1,
                response_2=response_2
            )
        }
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=16384
    )
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
    model_res = tokenizer.decode(output_ids, skip_special_tokens=True)
    # extract the final verdict between the <answer> and </answer> tags
    answer = model_res.split("<answer>")[-1].split("</answer>")[0].strip().upper()
    if answer.startswith("A"):
        return 1
    if answer.startswith("B"):
        return -1
    return 0
# the better one between responses[0] and responses[1]
better_response = 0 if pairwise_ranking(user_input, responses[0], responses[1])>0 else 1
print(f'Response {better_response} is better between response 0 and response 1.')
# listwise ranking (better responses ranked first)
responses_id = [idx for idx, _ in enumerate(responses)]
responses_id = sorted(
    responses_id,
    # negate the comparison so that the better response sorts earlier
    key=cmp_to_key(lambda idx_1, idx_2: -pairwise_ranking(user_input, responses[idx_1], responses[idx_2]))
)
print(f"The ranking among responses: {' > '.join([str(i) for i in responses_id])}")
# best-of-n
best = 0
for idx in range(1, len(responses)):
    best = idx if pairwise_ranking(user_input, responses[idx], responses[best]) > 0 else best
print(f"The best response is response {best}.")
# vote in k (take pairwise ranking as an example)
k = 8
res = [pairwise_ranking(user_input, responses[0], responses[1]) for _ in range(k)]
# majority vote: a result of 1 favors response 0, -1 favors response 1
better = 0 if max(set(res), key=res.count) > 0 else 1
print(f"The better response is response {better} in {k} votes.")
```
Tip: To accelerate inference, GRAM-R^2 can be run with [vLLM](https://github.com/vllm-project/vllm) using multiple processes and threads. A reference implementation is provided [here](https://github.com/wangclnlp/GRAM/tree/main/extensions/GRAM-RR).
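For reference, a minimal vLLM sketch is shown below; it reuses the `pairwise_prompt`, `user_input`, and `responses` from the demo above, and the sampling settings and model path are illustrative assumptions rather than the settings used in the linked script.
```python
# A minimal vLLM sketch (illustrative only; see the linked reference script for
# the actual multi-process setup). Model path and sampling settings are assumptions.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "/path/to/the/model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=16384)

# build the same chat-formatted pairwise prompt as in the demo above
prompt = pairwise_prompt.format(
    user_input=user_input,
    response_1=responses[0],
    response_2=responses[1]
)
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
```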
### Citation
```
@misc{wang2025gramr2,
title={GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning},
author={Chenglong Wang and Yongyu Mu and Hang Zhou and Yifu Huo and Ziming Zhu and Jiali Zeng and Murun Yang and Bei Li and Tong Xiao and Xiaoyang Hao and Chunliang Zhang and Fandong Meng and Jingbo Zhu},
year={2025},
eprint={2509.02492},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.02492},
}
``` |