CompassJudger-2


Introduction

We introduce CompassJudger-2, a series of generalist judge models designed to overcome the narrow specialization and limited robustness of existing LLM-as-judge solutions. Current judge models often struggle to evaluate comprehensively across tasks and domains; CompassJudger-2 addresses this with a training paradigm that combines task-driven, multi-domain data curation with verifiable reward-guided training.

Key contributions of our work include:

Advanced Data Strategy: We employ a task-driven, multi-domain data curation and synthesis strategy to enhance the model's robustness and domain adaptability.
Verifiable Reward-Guided Training: We supervise judgment tasks with verifiable rewards, guiding the model's intrinsic reasoning through chain-of-thought (CoT) and rejection sampling. A refined margin policy gradient loss further enhances performance.
Superior Performance: CompassJudger-2 achieves state-of-the-art results across multiple judge and reward benchmarks, and our 7B model reaches accuracy competitive with significantly larger models.
JudgerBenchV2: We introduce a new, comprehensive benchmark with 10,000 questions across 10 scenarios, using a Mixture-of-Judgers (MoJ) consensus for more reliable ground truth.
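
As a rough illustration of two ideas from the list above, the sketch below shows (1) a majority vote across several judges, in the spirit of a Mixture-of-Judgers consensus, and (2) rejection sampling that keeps only chain-of-thought judgments whose final verdict matches a verifiable label. This is a minimal sketch under assumptions: the `Judgment` class, the `"A"`/`"B"` verdict labels, and all function names are hypothetical, and the margin policy gradient loss and the actual training pipeline are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Judgment:
    """One sampled judgment (hypothetical structure, for illustration only)."""
    reasoning: str  # chain-of-thought text produced by the judge
    verdict: str    # final choice, assumed here to be "A" or "B"


def consensus_label(votes: Sequence[str]) -> Optional[str]:
    """Majority vote over judge verdicts; returns None when the top two tie."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority, so no label is produced
    return ranked[0][0]


def verifiable_reward(judgment: Judgment, gold: str) -> float:
    """Reward is 1.0 when the verdict matches the known label, else 0.0."""
    return 1.0 if judgment.verdict == gold else 0.0


def rejection_sample(candidates: List[Judgment], gold: str) -> List[Judgment]:
    """Keep only the sampled judgments that earn the full reward."""
    return [c for c in candidates if verifiable_reward(c, gold) == 1.0]


if __name__ == "__main__":
    gold = consensus_label(["A", "A", "B"])
    samples = [
        Judgment("Answer A is more complete.", "A"),
        Judgment("Answer B is shorter.", "B"),
    ]
    kept = rejection_sample(samples, gold)
    print(gold, [j.reasoning for j in kept])
```

Returning `None` on a tie is a design choice in this sketch: an item without a clear majority is treated as having no reliable label rather than being assigned one arbitrarily.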

This repository contains the CompassJudger-2 series of models, fine-tuned from the Qwen2.5-Instruct series.

Model Downloads

Model Name	Size	Base Model	Download	Notes
👉 CompassJudger-2-7B-Instruct	7B	Qwen2.5-7B-Instruct	🤗 Model	Fine-tuned for generalist judge capabilities.
👉 CompassJudger-2-32B-Instruct	32B	Qwen2.5-32B-Instruct	🤗 Model	A larger, more powerful judge model.

Requirements

Install the latest versions of transformers, accelerate, and torch:

pip install -U transformers accelerate torch
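
To confirm that the packages are visible to the current Python interpreter, a small check like the following can help. It is only a sketch: `installed_versions` is a hypothetical helper written for this example, built on the standard-library `importlib.metadata` module.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    required = ["transformers", "accelerate", "torch"]
    for name, version in installed_versions(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```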

Quickstart

Here is a simple example demonstrating how to load the model and use it for pairwise evaluation.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-2-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

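# Replace the placeholder with your full judging prompt
# (the question, the candidate responses, and the judging instructions).
prompt = """your prompt"""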

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
# Drop the prompt tokens so that only the newly generated judgment is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
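
The quickstart leaves the prompt as a placeholder. The sketch below shows one possible way to build a pairwise prompt, extract a verdict from the response, and reduce position bias by judging both answer orders. Everything here is an assumption made for illustration: the template wording, the `[[A]]`/`[[B]]` verdict format, and the helper names are hypothetical and are not the official CompassJudger-2 prompt. The `generate` argument is any callable that maps a prompt string to a response string, for example a thin wrapper around the quickstart code.

```python
import re
from typing import Callable, Optional

# Hypothetical template; NOT the official CompassJudger-2 prompt.
PAIRWISE_TEMPLATE = """Please act as an impartial judge and compare the two answers to the question.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Explain your reasoning, then give your final verdict as [[A]] or [[B]]."""


def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the (assumed) template with a question and two candidate answers."""
    return PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(response: str) -> Optional[str]:
    """Return the last [[A]] or [[B]] marker in the response, or None."""
    matches = re.findall(r"\[\[([AB])\]\]", response)
    return matches[-1] if matches else None


def judge_both_orders(
    question: str,
    answer_1: str,
    answer_2: str,
    generate: Callable[[str], str],
) -> Optional[int]:
    """Judge twice with the answers swapped; return 1 or 2 only if both runs agree."""
    first = parse_verdict(generate(build_pairwise_prompt(question, answer_1, answer_2)))
    second = parse_verdict(generate(build_pairwise_prompt(question, answer_2, answer_1)))
    if first == "A" and second == "B":
        return 1
    if first == "B" and second == "A":
        return 2
    return None  # inconsistent or unparseable verdicts
```

Taking the last marker in the response is a deliberate choice in this sketch, since a chain-of-thought answer may mention both options before committing to one.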

Evaluation

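In the comparison below, CompassJudger-2 attains the highest average score in both size classes (72.11 for the 7B model and 73.32 for the 32B model), ahead of the listed general models and other specialized judge models. Individual benchmarks are not always led by CompassJudger-2; for example, Skywork-Critic-Llama-3.1-70B scores higher on RewardBench and DeepSeek-V3-0324 scores higher on JudgerBench V2 and RMB.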

Model	JudgerBenchV2	JudgeBench	RMB	RewardBench	Average
7B Judge Models
CompassJudger-1-7B-Instruct	57.96	46.00	38.18	80.74	55.72
Con-J-7B-Instruct	52.35	38.06	71.50	87.10	62.25
RISE-Judge-Qwen2.5-7B	46.12	40.48	72.64	88.20	61.61
CompassJudger-2-7B-Instruct	60.52	63.06	73.90	90.96	72.11
32B+ Judge Models
CompassJudger-1-32B-Instruct	60.33	62.29	77.63	86.17	71.61
Skywork-Critic-Llama-3.1-70B	52.41	50.65	65.50	93.30	65.47
RISE-Judge-Qwen2.5-32B	56.42	63.87	73.70	92.70	71.67
CompassJudger-2-32B-Instruct	62.21	65.48	72.98	92.62	73.32
General Models (for reference)
Qwen2.5-32B-Instruct	62.97	59.84	74.99	85.61	70.85
DeepSeek-V3-0324	64.43	59.68	78.16	85.17	71.86
Qwen3-235B-A22B	61.40	65.97	75.59	84.68	71.91
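
The Average column for the CompassJudger-2 rows is consistent with an unweighted mean of the four benchmark scores. A short check under that assumption, with the scores copied from the table above and `benchmark_average` as a hypothetical helper:

```python
from statistics import mean
from typing import Sequence

# Scores in table order: JudgerBench V2, JudgeBench, RMB, RewardBench.
SCORES = {
    "CompassJudger-2-7B-Instruct": [60.52, 63.06, 73.90, 90.96],
    "CompassJudger-2-32B-Instruct": [62.21, 65.48, 72.98, 92.62],
}


def benchmark_average(values: Sequence[float]) -> float:
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(mean(values), 2)


if __name__ == "__main__":
    for model, values in SCORES.items():
        print(model, benchmark_average(values))
    # CompassJudger-2-7B-Instruct 72.11
    # CompassJudger-2-32B-Instruct 73.32
```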

For detailed benchmark performance and methodology, please refer to our 📑 Paper.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Citation

If you find our work helpful, please consider citing our paper: