Reward Reasoning Model
Paper Link: https://arxiv.org/abs/2505.14674
1. Introduction
We propose Reward Reasoning Models (RRMs). Unlike existing reward models, RRMs frame reward modeling as a reasoning task, wherein the model first produces a long chain-of-thought reasoning process before generating the final rewards.
Because supervised reward reasoning traces are not readily available, we develop a training framework, Reward Reasoning via Reinforcement Learning, which encourages RRMs to self-evolve their reward reasoning capabilities within a rule-based reward environment. Furthermore, we introduce multi-response rewarding strategies, including an ELO rating system and a knockout tournament, which let RRMs flexibly allocate test-time compute in practical application scenarios.
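To make the knockout-tournament strategy concrete, below is a minimal sketch in Python. It assumes a hypothetical pairwise judge `rrm_prefers_first(query, a, b)` that wraps a single RRM comparison (e.g., the generation code in Section 5) and returns True when the first response is preferred; this is an illustrative sketch, not the paper's exact implementation.

```python
from typing import Callable, List

def knockout_tournament(query: str, candidates: List[str],
                        rrm_prefers_first: Callable[[str, str, str], bool]) -> str:
    """Run single-elimination rounds of pairwise RRM comparisons until one
    candidate response remains. `rrm_prefers_first` is a hypothetical stand-in
    for one RRM judgment call."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        # Compare candidates two at a time; an unpaired candidate advances automatically.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if rrm_prefers_first(query, a, b) else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```

With more candidates, the tournament spends more RRM comparisons, and therefore more test-time compute, before committing to a final winner.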
2. Model Summary
Core Concept: Reward Reasoning
- RRMs frame reward modeling as a reasoning task. Before assigning a reward, the model generates an explicit chain-of-thought to analyze and compare candidate responses. This allows RRMs to adaptively allocate computational resources, dedicating more thought to complex evaluation scenarios.
Training: Reward Reasoning via Reinforcement Learning
- RRMs are trained using a framework called Reward Reasoning via Reinforcement Learning. This approach enables the model to self-evolve sophisticated reward reasoning capabilities.
- Crucially, this training process does not require supervised data in the form of explicit reasoning traces. Instead, it uses rule-based rewards derived from whether the RRM correctly prefers the ground-truth response, guiding the model to develop effective reasoning patterns (a minimal sketch of such a rule-based reward follows this list).
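For illustration, here is a minimal sketch of such a rule-based reward. It assumes the RRM's output ends with a `\boxed{Assistant 1}` or `\boxed{Assistant 2}` verdict (as in the usage example in Section 5) and that we know which position holds the ground-truth-preferred response; the actual training reward may differ in its details.

```python
import re

def rule_based_reward(rrm_output: str, preferred_position: int) -> float:
    """Return 1.0 when the RRM's final \\boxed{...} verdict matches the position
    (1 or 2) of the ground-truth-preferred response, otherwise 0.0."""
    match = re.search(r"\\boxed\{Assistant ([12])\}", rrm_output)
    if match is None:
        return 0.0  # no parseable verdict, no reward
    return 1.0 if int(match.group(1)) == preferred_position else 0.0
```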
Key Advantages & Capabilities
- Enhanced Accuracy: RRMs consistently outperform strong baseline reward models across diverse domains, including reasoning, general knowledge, and alignment with human preferences.
- Adaptive Test-Time Compute Utilization: RRMs can effectively scale their test-time compute, through both parallel scaling (e.g., majority voting over multiple sampled judgments) and sequential scaling of reasoning, to achieve better performance; see the voting sketch after this list.
- Practical Applications: RRMs are effective for reward-guided best-of-N inference and can provide high-quality preference signals for post-training LLMs (e.g., via DPO or RL).
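As a sketch of parallel scaling (the voting@k setting reported in Section 4), the snippet below assumes a hypothetical `sample_verdict` callable that runs one sampled RRM judgment and returns 1 or 2; majority voting over k samples trades extra compute for a more reliable preference signal.

```python
from collections import Counter
from typing import Callable

def vote_at_k(query: str, answer1: str, answer2: str,
              sample_verdict: Callable[[str, str, str], int], k: int = 16) -> int:
    """Sample k independent RRM judgments and return the majority verdict (1 or 2).
    `sample_verdict` is a hypothetical stand-in for one sampled RRM generation."""
    votes = Counter(sample_verdict(query, answer1, answer2) for _ in range(k))
    return votes.most_common(1)[0][0]
```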
3. Model Downloads
Model | Parameters | Download Link |
---|---|---|
RRM-7B | 7B | 🤗 HuggingFace |
RRM-32B | 32B | 🤗 HuggingFace |
*Note: RRMs use the Qwen2 architecture and are initialized from DeepSeek-R1-Distill-Qwen models.*
4. Evaluation Results
RRMs have been extensively evaluated on several benchmarks.
Agreement with Human Preference (RewardBench & PandaLM Test)
Model | RewardBench (Overall) | PandaLM Test (Agreement) |
---|---|---|
Skywork-Reward-Gemma-2-27B-v0.2 | 94.3 | 76.6 |
JudgeLM-7B | 63.5 | 65.1 |
JudgeLM-33B | 72.3 | 75.2 |
Claude-3.5-Sonnet-20240620 | 84.2 | - |
DeepSeek-R1 | 84.9 | 78.7 |
DeepSeek-GRM-27B | 86.0 | - |
GPT-4-0125-preview | 86.0 | 66.5 |
GPT-4o-0806 | 86.7 | - |
RM-R1-DeepSeek-Distilled-Qwen-7B | 80.1 | - |
RM-R1-DeepSeek-Distilled-Qwen-14B | 88.9 | - |
RM-R1-DeepSeek-Distilled-Qwen-32B | 90.9 | - |
RRM-7B | 82.2 | 72.9 |
RRM-7B (voting@16) | 84.8 | 75.9 |
RRM-32B | 91.2 | 78.8 |
RRM-32B (voting@16) | 91.9 | 80.2 |
Binary Preference Classification (PPE Benchmark)
Model | MMLU-Pro | MATH | GPQA | Overall |
---|---|---|---|---|
Skywork-Reward-Gemma-2-27B | 55.0 | 46.2 | 44.7 | 48.6 |
J1-Llama-8B (SC@32) | 67.5 | 76.6 | 55.7 | 66.7 |
J1-Llama-70B (SC@32) | 79.9 | 88.1 | 66.5 | 78.2 |
DeepSeek-GRM-27B (MetaRM) (voting@32) | 68.1 | 70.0 | 56.9 | 65.0 |
RRM-7B | 66.5 | 88.0 | 57.9 | 70.3 |
RRM-7B (voting@5) | 68.3 | 90.5 | 58.3 | 72.4 |
RRM-32B | 80.5 | 94.3 | 67.4 | 80.7 |
RRM-32B (voting@5) | 81.3 | 95.4 | 68.4 | 81.7 |
5. How to Use RRM
Here is a code snippet showing how to use RRMs with transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import accelerate  # required for device_map="auto" below
Skywork_SYSTEM_PROMPT = """You are a helpful assistant in evaluating the quality of the responses for a given instruction. Your goal is to select the best response for the given instruction.
Select Assistant 1 or Assistant 2, that is better for the given instruction. The two responses are generated by two different AI assistants respectively.
Do NOT say both / neither are good.
Here are some rules of the evaluation:
(1) If the instruction does not contain harmful content, you should prioritize evaluating whether the output honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
(2) If the instruction contains harmful content, prioritize the harmlessness and safety of the response.
(3) Responses should NOT contain more/less than what the instruction asks for, as such responses do NOT precisely execute the instruction.
(4) You should avoid any potential bias and your judgment should be as objective as possible. Here are some potential sources of bias:
- The order in which the responses were presented should NOT affect your judgment, as Response A and Response B are equally likely to be the better.
- The length of the responses should NOT affect your judgement, as a longer response does not necessarily correspond to a better response. When making your decision, evaluate if the response length is appropriate for the given instruction.
(5) Your output should only consist of '\\boxed{Assistant 1}' if assistant 1 is better, or '\\boxed{Assistant 2}' if assistant 2 is better. Omit any other output.
"""
Skywork_PROMPT = """## Query
{question}
## Assistant responses
### Assistant 1
{answer1}
### Assistant 2
{answer2}
"""
Skywork_ASSISTANT_PROMPT = """## Analysis
Let's analyze this step by step and decide which assistant is better, and then answer \\boxed{Assistant 1} or \\boxed{Assistant 2}."""
model_id = "Reward-Reasoning/RRM-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
query = "How do I detail a car?"
response1 = "Detailing a car involves washing the exterior and interior of the car, as well as polishing and waxing the exterior. Interior detailing typically involves vacuuming, cleaning the upholstery and air vents, polishing the dashboard and console, and dusting. Polishing and waxing the exterior will depend on the condition of the paint, but typically involves applying a polish and wax to make it shine."
response2 = "Detailing a car involves a thorough cleaning inside and out, as well as polishing and waxing to protect the vehicle's surfaces. Here's a step-by-step guide to detailing a car:\n\n**Exterior Detailing:**\n\n1. **Wash the Car:**\n - Rinse the car with water to remove loose dirt.\n - Use a car wash soap and microfiber wash mitt to clean the car from top to bottom.\n - Clean the wheels and tires with a brush and a wheel cleaner.\n - Rinse the car thoroughly to remove all soap.\n\n2. **Dry the Car:**\n - Use a microfiber towel or a chamois to dry the car to prevent water spots.\n\n3. **Clay Bar Treatment:**\n - Use a clay bar with a lubricant to remove embedded surface contaminants from the paint.\n\n4. **Polishing:**\n - Apply car polish with a dual-action polisher or by hand to correct paint imperfections and create a smooth surface.\n\n5. **Waxing:**\n - Apply a coat of wax or paint sealant to protect the paint and give it a glossy finish.\n\n6. **Windows and Mirrors:**\n - Clean the windows and mirrors with a glass cleaner and a microfiber towel.\n\n7. **Tire and Trim Dressing:**\n - Apply a tire dressing to the tires for a shiny finish.\n - Use a trim restorer or protectant on plastic and rubber parts to prevent fading.\n\n**Interior Detailing:**\n\n1. **Remove Trash:**\n - Clear out any trash and remove personal items from the car.\n\n2. **Vacuum:**\n - Vacuum the seats, carpets, floor mats, and trunk.\n - Use a brush attachment for the dashboard and door panels.\n\n3. **Shampoo Carpets and Upholstery:**\n - Use a carpet cleaner and a brush to clean the carpets and upholstery.\n - For leather interiors, use a leather cleaner and conditioner.\n\n4. **Clean Hard Surfaces:**\n - Wipe down all hard surfaces (dashboard, center console, door panels, etc.) with a mild all-purpose cleaner and a microfiber cloth.\n\n5. **Windows and Mirrors:**\n - Clean the interior side of windows and mirrors.\n\n6. **Air Vents and Crevices:**\n - Use a detailing brush or compressed air to clean out air vents and hard-to-reach crevices.\n\n7. **Final Touches:**\n - Apply a protectant to the dashboard and other plastic components.\n - Replace air fresheners if needed.\n\n**Additional Tips:**\n\n- Work in the shade or a cool, well-ventilated garage to prevent products from drying too quickly and leaving residue.\n- Use separate buckets for washing and rinsing to avoid contaminating the clean water with dirt.\n- Always use gentle, non-abrasive materials and cleaners specifically designed for automotive use to avoid damaging surfaces.\n- Move in a systematic way to ensure you don't miss any spots.\n\nBy following these steps, you'll give your car a thorough clean that not only makes it look great but also helps in maintaining its value. Remember, regular detailing can prevent wear and tear and keep your car looking new for years to come."
system_prompt = Skywork_SYSTEM_PROMPT
user_prompt = Skywork_PROMPT.format(
    question=query, answer1=response1, answer2=response2
) + Skywork_ASSISTANT_PROMPT
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generation_args = {
    "max_new_tokens": 8192,
    "temperature": 0.6,
    "do_sample": True,
    "top_p": 1.0,
    "eos_token_id": tokenizer.eos_token_id,
    "pad_token_id": tokenizer.pad_token_id,
}
with torch.no_grad():
    output = model.generate(**inputs, **generation_args)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
The generated text (a reasoning trace followed by the final verdict):
```
Okay, so I need to figure out which assistant is better for the given instruction. The instruction is asking, "How do I detail a car?"

Looking at Assistant 1's response, it gives a brief overview of what detailing involves. It mentions washing the exterior and interior, polishing and waxing the exterior, and some interior cleaning steps like vacuuming, cleaning upholstery, and dusting. It also notes that polishing and waxing depend on the paint condition. But it's pretty short and doesn't go into much detail about the process.

Assistant 2, on the other hand, provides a much more comprehensive guide. It breaks down the process into exterior and interior detailing with specific steps. For the exterior, it includes washing, drying, clay bar treatment, polishing, waxing, cleaning windows, and applying tire and trim dressing. For the interior, it covers removing trash, vacuuming, shampooing carpets, cleaning hard surfaces, windows, air vents, and final touches. It also adds additional tips about working in the shade, using separate buckets, gentle materials, and systematic cleaning.

The instruction is straightforward, asking how to detail a car. Assistant 1 gives a general idea but lacks the detailed steps that someone might need to actually perform the task. Assistant 2 provides a thorough, step-by-step guide that would be more helpful for someone looking to detail their car properly.

Considering the evaluation rules, the response should precisely execute the instruction. Assistant 2 does this by providing a detailed, organized approach, which is more helpful and accurate. It doesn't add unnecessary information but covers all necessary steps.

Therefore, Assistant 2 is better because it offers a more precise, helpful, and detailed response to the instruction.
</think>

The better response is provided by Assistant 2, as it offers a comprehensive and detailed guide to car detailing, which is more helpful and precise than the brief overview given by Assistant 1.

\boxed{Assistant 2}
```
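If you need the verdict programmatically, one simple (unofficial) way is to extract the `\boxed{...}` tag from `generated_text` with a regular expression; the snippet below continues from the variables defined in the usage example above.

```python
import re

# Pull the final verdict out of the generated text and select the winning response.
match = re.search(r"\\boxed\{Assistant ([12])\}", generated_text)
if match is not None:
    winner = response1 if match.group(1) == "1" else response2
    print(f"RRM prefers Assistant {match.group(1)}")
else:
    print("No verdict found in the generated text.")
```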
6. Citation
```bibtex
@misc{rewardreasoningmodel,
      title={Reward Reasoning Model},
      author={Jiaxin Guo and Zewen Chi and Li Dong and Qingxiu Dong and Xun Wu and Shaohan Huang and Furu Wei},
      year={2025},
      eprint={2505.14674},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14674},
}
```