This is the Thinking Reward Model of SophiaVL-R1 (https://arxiv.org/abs/2505.17018).

This model is fine-tuned on the SophiaVL-R1-Thinking-156k dataset, with Qwen2.5-VL-3B as the base model.

The Thinking Reward Model takes a question (optionally with an image) and a model response as input, and outputs a score between 0 and 1 indicating the thinking quality of the response.

We provide a command to deploy the Thinking Reward Model using vLLM:

python3 -m vllm.entrypoints.openai.api_server --port 80 --model /path/to/thinking/reward/model --served-model-name thinking-reward-model --tensor-parallel-size 2 --max-num-seqs 64 --max-model-len 32768
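
As a quick sanity check, you can list the served models through the server's OpenAI-compatible /v1/models endpoint. This is a minimal sketch that assumes the server above is reachable at http://localhost:80; adjust the host and port to your deployment:

import httpx

# Assumed base URL of the vLLM server launched above; adjust host/port as needed.
base_url = "http://localhost:80"

# The OpenAI-compatible server exposes /v1/models; once the weights are loaded,
# the response should include the served model name "thinking-reward-model".
resp = httpx.get(f"{base_url}/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])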

We also provide a Python script to query the deployed model for the thinking reward:

import httpx
import time
import base64

# URL of the vLLM chat completions endpoint, e.g. "http://<host>:80/v1/chat/completions"
openai_api_base = "vllm-url"
# Must match the --served-model-name used when launching the server
reward_model = "thinking-reward-model"
question = "your question"
image = "your image path"
answer = "your model response"

def encode_image_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def get_process_reward(prompt_str, reasoning_str, image_path=None):
    """Query the deployed reward model and return a thinking-quality score in [0, 1]."""
    image_base64 = None
    if image_path is not None:
        image_base64 = encode_image_base64(image_path)
        # Insert an image placeholder into the prompt when an image is provided.
        if "<image>" not in prompt_str:
            prompt_str = f"<image> {prompt_str}"

    prompt = f"""You are an expert reasoning evaluator. I will give you a multimodal question and an answer. Your goal is to judge a reward process and give a score between 0 and 1. You should focus on whether the reasoning process is good rather than whether the final answer is correct.### Evaluation Criteria:\n- **Logical Soundness**: Does each step follow logically from the previous one?\n- **Correct Reasoning**: Are the methods and steps used appropriate and valid? Are the facts and lemmas correctly stated and applied?\n- **Error Identification**: Are there any logical fallacies, unsupported assumptions, or incorrect steps?\n- **Language Consistency**: Is the reasoning process conducted in a single, consistent language without mixing different languages?\n- **Redundancy**: Is the reasoning concise, without unnecessary repetition or extraneous steps?\nProvide a single score from **{{0, 0.1, 0.2, ..., 1.0}}** based on the reasoning quality, where:\n - **0**: Completely flawed reasoning\n- **1**: Perfectly sound reasoning\n- Intermediate values (e.g., 0.3, 0.7) should reflect partial correctness or minor errors.\nBe strict, reward the good process and punish the bad one. You should only output the score without any explanation.
    Question: {prompt_str}
    Reasoning process: {reasoning_str}
    """

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [{"type": "text", "text": prompt}]},
    ]

    if image_base64 is not None:
        messages[1]["content"].append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_base64}"},
        })

    payload = {
        "model": reward_model,
        "messages": messages,
        "temperature": 0.0,
    }

    attempt = 0
    max_retry = 10
    while attempt < max_retry:
        try:
            response = httpx.post(openai_api_base, headers={"Content-Type": "application/json"}, json=payload, timeout=60)
            response.raise_for_status()
            result = response.json()["choices"][0]["message"]["content"]
            # The reward model is prompted to output only the numeric score, e.g. "0.7".
            return float(result.strip())
        except Exception as e:
            print(f"[Attempt {attempt+1}] get_process_reward failed: {e}, message: {prompt_str}")
            attempt += 1
            time.sleep(1)
    # Return 0 (lowest thinking reward) if all attempts fail.
    return 0

get_process_reward(question, answer, image)
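
As an illustrative usage, assuming the vLLM server is running and openai_api_base points at its chat completions endpoint, you can compare the thinking reward of two candidate responses to the same text-only question. The question and responses below are made up purely for demonstration:

# Illustrative example: score two candidate responses to a text-only question.
question = "What is 17 * 24?"
candidates = [
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",  # step-by-step reasoning
    "The answer is 408.",  # bare answer with no reasoning
]
for response in candidates:
    # No image is passed, so the prompt is sent as text only.
    score = get_process_reward(question, response)
    print(f"thinking reward = {score} for response: {response!r}")

A higher score for the first candidate would indicate that the reward model favors explicit step-by-step reasoning over a bare final answer.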