Reward-Reasoning/RRM-32B

Reward-Reasoning committed on May 21

Commit 5f7beed · verified · 1 Parent(s): 0b24fae

Update README.md

Files changed (1)

README.md +19 -0

@@ -3,6 +3,11 @@
 <!-- markdownlint-disable html -->
 <!-- markdownlint-disable no-duplicate-header -->
 ## 1. Introduction
 We propose **Reward Reasoning Models (RRMs)**. Unlike existing reward models, RRMs frame reward modeling as a reasoning task, wherein the model first produces a long chain-of-thought reasoning process before generating the final rewards.
@@ -190,3 +195,17 @@ The generated text:
 Okay, so I need to figure out which assistant is better for the given instruction. The instruction is asking, \"How do I detail a car?\" \n\nLooking at Assistant 1's response, it gives a brief overview of what detailing involves. It mentions washing the exterior and interior, polishing and waxing the exterior, and some interior cleaning steps like vacuuming, cleaning upholstery, and dusting. It also notes that polishing and waxing depend on the paint condition. But it's pretty short and doesn't go into much detail about the process.\n\nAssistant 2, on the other hand, provides a much more comprehensive guide. It breaks down the process into exterior and interior detailing with specific steps. For the exterior, it includes washing, drying, clay bar treatment, polishing, waxing, cleaning windows, and applying tire and trim dressing. For the interior, it covers removing trash, vacuuming, shampooing carpets, cleaning hard surfaces, windows, air vents, and final touches. It also adds additional tips about working in the shade, using separate buckets, gentle materials, and systematic cleaning.\n\nThe instruction is straightforward, asking how to detail a car. Assistant 1 gives a general idea but lacks the detailed steps that someone might need to actually perform the task. Assistant 2 provides a thorough, step-by-step guide that would be more helpful for someone looking to detail their car properly. \n\nConsidering the evaluation rules, the response should precisely execute the instruction. Assistant 2 does this by providing a detailed, organized approach, which is more helpful and accurate. It doesn't add unnecessary information but covers all necessary steps. 
\n\nTherefore, Assistant 2 is better because it offers a more precise, helpful, and detailed response to the instruction.\n</think>\n\nThe better response is provided by Assistant 2, as it offers a comprehensive and detailed guide to car detailing, which is more helpful and precise than the brief overview given by Assistant 1.\n\n\\boxed{Assistant 2}
 ```

 <!-- markdownlint-disable html -->
 <!-- markdownlint-disable no-duplicate-header -->
+<center>
+  <h2><a href="https://arxiv.org/abs/2505.14674">Paper Link👀</a></h2>
+</center>
 ## 1. Introduction
 We propose **Reward Reasoning Models (RRMs)**. Unlike existing reward models, RRMs frame reward modeling as a reasoning task, wherein the model first produces a long chain-of-thought reasoning process before generating the final rewards.
 Okay, so I need to figure out which assistant is better for the given instruction. The instruction is asking, \"How do I detail a car?\" \n\nLooking at Assistant 1's response, it gives a brief overview of what detailing involves. It mentions washing the exterior and interior, polishing and waxing the exterior, and some interior cleaning steps like vacuuming, cleaning upholstery, and dusting. It also notes that polishing and waxing depend on the paint condition. But it's pretty short and doesn't go into much detail about the process.\n\nAssistant 2, on the other hand, provides a much more comprehensive guide. It breaks down the process into exterior and interior detailing with specific steps. For the exterior, it includes washing, drying, clay bar treatment, polishing, waxing, cleaning windows, and applying tire and trim dressing. For the interior, it covers removing trash, vacuuming, shampooing carpets, cleaning hard surfaces, windows, air vents, and final touches. It also adds additional tips about working in the shade, using separate buckets, gentle materials, and systematic cleaning.\n\nThe instruction is straightforward, asking how to detail a car. Assistant 1 gives a general idea but lacks the detailed steps that someone might need to actually perform the task. Assistant 2 provides a thorough, step-by-step guide that would be more helpful for someone looking to detail their car properly. \n\nConsidering the evaluation rules, the response should precisely execute the instruction. Assistant 2 does this by providing a detailed, organized approach, which is more helpful and accurate. It doesn't add unnecessary information but covers all necessary steps. 
\n\nTherefore, Assistant 2 is better because it offers a more precise, helpful, and detailed response to the instruction.\n</think>\n\nThe better response is provided by Assistant 2, as it offers a comprehensive and detailed guide to car detailing, which is more helpful and precise than the brief overview given by Assistant 1.\n\n\\boxed{Assistant 2}
 ```
+## 6. Citation
+```
+@misc{rewardreasoningmodel,
+      title={Reward Reasoning Model},
+      author={Jiaxin Guo and Zewen Chi and Li Dong and Qingxiu Dong and Xun Wu and Shaohan Huang and Furu Wei},
+      year={2025},
+      eprint={2505.14674},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2505.14674},
+}
+```
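
The generated text in the README context above ends its reasoning with `</think>` and then states a verdict as `\boxed{Assistant 2}`. A minimal sketch of how such an output might be turned into a pairwise preference follows. The function name `extract_verdict` is hypothetical, and the assumption that the verdict always takes the form `\boxed{Assistant 1}` or `\boxed{Assistant 2}` is inferred from this single example rather than from a documented output format.

```python
import re
from typing import Optional

# Assumed verdict format, inferred from the one example in the README:
# "\boxed{Assistant 1}" or "\boxed{Assistant 2}". One or more backslashes are
# accepted because the README shows the text in escaped form ("\\boxed").
_VERDICT = re.compile(r"\\+boxed\{\s*Assistant\s*([12])\s*\}")


def extract_verdict(generated: str) -> Optional[int]:
    """Return 1 or 2 for the preferred assistant, or None if no verdict is found."""
    # Keep only the text after the last "</think>" so that mentions of either
    # assistant inside the chain of thought are ignored. If the tag is absent,
    # rpartition returns the whole string as the third element.
    _, _, answer = generated.rpartition("</think>")
    matches = _VERDICT.findall(answer)
    # Use the last match in case the answer restates the verdict.
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    sample = (
        "Assistant 1 is brief. Assistant 2 is thorough.\n</think>\n\n"
        "The better response is provided by Assistant 2.\n\n\\boxed{Assistant 2}"
    )
    print(extract_verdict(sample))
```

Returning `None` instead of raising lets a caller decide how to treat outputs that were truncated before the verdict was produced.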