Model Card for SPARE-PRM

A Process Reward Model (PRM) built on Llama-3-8b, as used in our SPARE paper. The model was trained only on generations from the MATH dataset.

Model Details

Input: instruction + question + step-by-step solution, with the special step tag ки marking the end of each step.

Output: logits. Post-process them to obtain a score for each step.
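
To make the post-processing concrete, the following is a minimal sketch in plain Python of the same idea the Usage code applies with tensors: at every position holding the step tag, take the logits of the "-" and "+" tokens, apply a two-way softmax, and keep the probability of "+". The function name `step_scores`, the id `STEP_TAG_ID = 99`, and the toy logits are illustrative assumptions, not values produced by the model.

```python
import math

def step_scores(token_ids, pair_logits, step_tag_id):
    """Return P(correct) at each position that holds the step tag.

    token_ids:   list of token ids for one sequence.
    pair_logits: list of (incorrect_logit, correct_logit), one pair per position.
    """
    scores = []
    for tok, (neg, pos) in zip(token_ids, pair_logits):
        if tok != step_tag_id:
            continue  # only step-tag positions carry a step score
        # Two-way softmax, shifted by the max for numerical stability.
        m = max(neg, pos)
        e_neg, e_pos = math.exp(neg - m), math.exp(pos - m)
        scores.append(e_pos / (e_neg + e_pos))
    return scores

# Toy inputs (hypothetical, not real model output).
STEP_TAG_ID = 99
token_ids = [5, 7, STEP_TAG_ID, 8, STEP_TAG_ID]
pair_logits = [(0.0, 0.0), (1.0, -1.0), (-2.0, 2.0), (0.3, 0.1), (0.0, 0.0)]

toy_scores = step_scores(token_ids, pair_logits, STEP_TAG_ID)
print(toy_scores)  # one score per step tag, each in (0, 1)
```

Restricting the softmax to the two target tokens means each step score reads as the probability of "+" relative to "-", rather than relative to the full vocabulary.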

Usage

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import torch

incorrect_token = "-"
correct_token = "+"
step_tag = " ки" # the leading space is required for correct Llama tokenization

tokenizer = AutoTokenizer.from_pretrained("UKPLab/Llama-3-8b-spare-prm-math")

step_target_ids = tokenizer.convert_tokens_to_ids([incorrect_token, correct_token])
step_tag_id = tokenizer.encode(step_tag)[-1] # id of the step-tag token (last id in the encoding)

device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained("UKPLab/Llama-3-8b-spare-prm-math").to(device).eval()

# Include this instruction verbatim; it was used unchanged during PRM training.
instruction = "You are an expert at solving challenging math problems spanning across various categories and difficulties such as Algebra, Number Theory, Geometry, Counting and Probability, Precalculus etc. For a given math problem, your task is to generate a step-by-step reasoning-based solution providing an answer to the question. Identify the correct concepts, formulas and heuristics that needs to be applied and then derive the contents of the reasoning steps from the given contexts and accurate calculations from the previous reasoning steps."
question = "Yann and Camille go to a restaurant. </S>\nIf there are 10 items on the menu, and each orders one dish, how many different combinations of meals can Yann and Camille order if they refuse to order the same dish? (It does matter who orders what---Yann ordering chicken and Camille ordering fish is different from Yann ordering fish and Camille ordering chicken.)"
correct_generation = "Let's think step by step.\nYann can order 1 of the 10 dishes. ки\nWhen he picks a dish, there are 9 left for Camille to choose from. ки\nThus, there are $10\\cdot 9=\\boxed{90}$ possible combinations.\nHence, the answer is 90. ки\n"
incorrect_generation = "Let's think step by step.\nWithout any restrictions, Yann and Camille could both order the same dish out of the 10 options, for a total of $10 \\cdot 9$ dishes. ки\nHowever, since Yann orders one of the 9 dishes that Camille didn't order (and vice versa), the number of possible combinations becomes $10 \\cdot 9 - 8 = \\boxed{72}$.\nHence, the answer is 72. ки\n"

for generation in (correct_generation, incorrect_generation):
    message = [
        dict(role="system", content=instruction),
        dict(role="user", content=question),
        dict(role="assistant", content=generation),
    ]

    input_ids = tokenizer.apply_chat_template(message, tokenize=True, return_tensors="pt").to(device)

    with torch.no_grad():
        logits = model(input_ids).logits[:,:,step_target_ids]
        scores = logits.softmax(dim=-1)[:,:,1] # probability of correct_token (index 1 in step_target_ids) at every position
        step_scores = scores[input_ids == step_tag_id]
        print(step_scores)
        
# tensor([0.9617, 0.9487, 0.8938]) - correct_generation
# tensor([0.5794, 0.4910]) - incorrect_generation
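
When several candidate solutions need to be ranked, the per-step scores are often reduced to one number per solution. The sketch below shows two common reductions, the minimum and the product of the step scores, applied to the two score lists printed above. The choice of reduction and the helper name `aggregate` are assumptions made for illustration; this model card does not prescribe an aggregation rule.

```python
import math

def aggregate(step_scores, how="min"):
    """Reduce per-step scores to a single solution-level score."""
    if not step_scores:
        raise ValueError("no step scores to aggregate")
    if how == "min":
        return min(step_scores)        # the weakest step decides
    if how == "prod":
        return math.prod(step_scores)  # every step must be plausible
    raise ValueError(f"unknown aggregation: {how}")

# Step scores copied from the example output above.
candidates = {
    "correct_generation": [0.9617, 0.9487, 0.8938],
    "incorrect_generation": [0.5794, 0.4910],
}

# Rank candidates from best to worst by their minimum step score.
ranked = sorted(candidates, key=lambda name: aggregate(candidates[name], "min"), reverse=True)
print(ranked[0])
```

The minimum is insensitive to the number of steps, whereas the product tends to penalize longer solutions; which behaviour is preferable depends on the application.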

Citation

Please cite this work using:

@misc{rizvi2025sparesinglepassannotationreferenceguided,
      title={SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling}, 
      author={Md Imbesat Hassan Rizvi and Xiaodan Zhu and Iryna Gurevych},
      year={2025},
      eprint={2506.15498},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15498}, 
}