CompassJudger-2

TODO.

Introduction

We introduce CompassJudger-2, a novel series of generalist judge models designed to overcome the narrow specialization and limited robustness of existing LLM-as-judge solutions. Current judge models often struggle with comprehensive evaluation, but CompassJudger-2 addresses these limitations with a powerful new training paradigm.

Key contributions of our work include:

  • Advanced Data Strategy: We employ a task-driven, multi-domain data curation and synthesis strategy to enhance the model's robustness and domain adaptability.
  • Verifiable Reward-Guided Training: We supervise judgment tasks with verifiable rewards, guiding the model's intrinsic reasoning through chain-of-thought (CoT) and rejection sampling. A refined margin policy gradient loss further enhances performance.
  • Superior Performance: CompassJudger-2 achieves state-of-the-art results across multiple judge and reward benchmarks. Our 7B model demonstrates competitive accuracy with models that are significantly larger.
  • JudgerBenchV2: We introduce a new, comprehensive benchmark with 10,000 questions across 10 scenarios, using a Mixture-of-Judgers (MoJ) consensus for more reliable ground truth.
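The rejection sampling step mentioned under Verifiable Reward-Guided Training can be illustrated with a minimal sketch. It assumes each sampled judgment ends in a verdict that can be compared against a known label; the `Judgment` class, `verifiable_reward`, and `rejection_sample` are hypothetical names for illustration and are not taken from the CompassJudger-2 training code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Judgment:
    reasoning: str  # chain-of-thought text produced by the judge
    verdict: str    # final choice, e.g. "A" or "B"


def verifiable_reward(judgment: Judgment, gold_verdict: str) -> float:
    """Return 1.0 when the final verdict matches the known label, else 0.0."""
    return 1.0 if judgment.verdict == gold_verdict else 0.0


def rejection_sample(
    sample_fn: Callable[[], Judgment],
    gold_verdict: str,
    num_samples: int = 8,
) -> List[Judgment]:
    """Draw num_samples judgments and keep only those with reward 1.0."""
    candidates = [sample_fn() for _ in range(num_samples)]
    return [c for c in candidates if verifiable_reward(c, gold_verdict) == 1.0]


# Stub sampler standing in for a real judge model (assumption for the demo).
_canned = iter([
    Judgment("Response A is more accurate.", "A"),
    Judgment("Response B is more detailed.", "B"),
    Judgment("Response A follows the instruction.", "A"),
])
kept = rejection_sample(lambda: next(_canned), gold_verdict="A", num_samples=3)
print(len(kept))  # 2
```

In this sketch the kept judgments would serve as training targets, since their reasoning led to a verdict that agrees with the label.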

This repository contains the CompassJudger-2 series of models, fine-tuned from the Qwen2.5-Instruct series.

Model Downloads

| Model Name | Size | Base Model | Download | Notes |
| --- | --- | --- | --- | --- |
| 👉 CompassJudger-2-7B-Instruct | 7B | Qwen2.5-7B-Instruct | 🤗 Model | Fine-tuned for generalist judge capabilities. |
| 👉 CompassJudger-2-32B-Instruct | 32B | Qwen2.5-32B-Instruct | 🤗 Model | A larger, more powerful judge model. |

Requirements

You will need to install the latest versions of transformers, accelerate, and torch:

pip install -U transformers accelerate torch

Quickstart

Here is a simple example demonstrating how to load the model and use it for pairwise evaluation.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-2-7B-Instruct"

# Load the model with its stored dtype and let accelerate place it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder: replace with your own evaluation prompt.
prompt = """your prompt"""

messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template and append the assistant turn marker.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
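The quickstart leaves the prompt as a placeholder. One possible way to fill it for pairwise evaluation is sketched below. The template wording, the "Verdict: A/B" output convention, and the helper names `build_pairwise_prompt` and `parse_verdict` are assumptions made for illustration; they are not an official CompassJudger-2 prompt format.

```python
import re
from typing import Optional

# Hypothetical pairwise template; adjust the wording to your own evaluation setup.
PAIRWISE_TEMPLATE = """Please act as an impartial judge and compare two responses to the question below.
Explain your reasoning briefly, then end with a final line of the form "Verdict: A" or "Verdict: B".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""


def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the template with one question and two candidate answers."""
    return PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(response: str) -> Optional[str]:
    """Return "A" or "B" from the last "Verdict: X" in the response, or None if absent."""
    matches = re.findall(r"Verdict:\s*([AB])\b", response)
    return matches[-1] if matches else None


prompt = build_pairwise_prompt(
    question="What is the capital of France?",
    answer_a="Paris is the capital of France.",
    answer_b="The capital of France is Lyon.",
)
```

The resulting `prompt` string can be passed as the user message in the quickstart, and `parse_verdict(response)` applied to the decoded output. Because the model is not guaranteed to follow this output convention, a `None` result should be handled explicitly.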

Evaluation

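CompassJudger-2 achieves the highest average score in each judge-model size class in the table below: the 7B model averages 72.11 and the 32B model averages 73.32 across JudgerBench V2, JudgeBench, RMB, and RewardBench. Both averages also exceed those of the general models listed for reference, although results on individual benchmarks vary (for example, DeepSeek-V3-0324 scores higher on JudgerBench V2 and RMB).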

| Model | JudgerBench V2 | JudgeBench | RMB | RewardBench | Average |
| --- | --- | --- | --- | --- | --- |
| 7B Judge Models | | | | | |
| CompassJudger-1-7B-Instruct | 57.96 | 46.00 | 38.18 | 80.74 | 55.72 |
| Con-J-7B-Instruct | 52.35 | 38.06 | 71.50 | 87.10 | 62.25 |
| RISE-Judge-Qwen2.5-7B | 46.12 | 40.48 | 72.64 | 88.20 | 61.61 |
| CompassJudger-2-7B-Instruct | 60.52 | 63.06 | 73.90 | 90.96 | 72.11 |
| 32B+ Judge Models | | | | | |
| CompassJudger-1-32B-Instruct | 60.33 | 62.29 | 77.63 | 86.17 | 71.61 |
| Skywork-Critic-Llama-3.1-70B | 52.41 | 50.65 | 65.50 | 93.30 | 65.47 |
| RISE-Judge-Qwen2.5-32B | 56.42 | 63.87 | 73.70 | 92.70 | 71.67 |
| CompassJudger-2-32B-Instruct | 62.21 | 65.48 | 72.98 | 92.62 | 73.32 |
| General Models (for reference) | | | | | |
| Qwen2.5-32B-Instruct | 62.97 | 59.84 | 74.99 | 85.61 | 70.85 |
| DeepSeek-V3-0324 | 64.43 | 59.68 | 78.16 | 85.17 | 71.86 |
| Qwen3-235B-A22B | 61.40 | 65.97 | 75.59 | 84.68 | 71.91 |
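As a quick sanity check on how to read the Average column, the short sketch below assumes it is the unweighted mean of the four benchmark scores, rounded to two decimals. This is an assumption based on the reported numbers rather than a documented formula; the scores are copied from the CompassJudger-2 rows above.

```python
from statistics import mean
from typing import Dict, List

# Per-benchmark scores in column order:
# JudgerBench V2, JudgeBench, RMB, RewardBench
scores: Dict[str, List[float]] = {
    "CompassJudger-2-7B-Instruct": [60.52, 63.06, 73.90, 90.96],
    "CompassJudger-2-32B-Instruct": [62.21, 65.48, 72.98, 92.62],
}


def average_score(values: List[float]) -> float:
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(mean(values), 2)


for name, values in scores.items():
    print(name, average_score(values))
# CompassJudger-2-7B-Instruct 72.11
# CompassJudger-2-32B-Instruct 73.32
```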

For detailed benchmark performance and methodology, please refer to our 📑 Paper. TODO.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details. TODO.

Citation

If you find our work helpful, please consider citing our paper:

TODO.
