CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Abstract
CompassVerifier is a lightweight, robust model for verifying LLM outputs across various domains, supported by VerifierBench, a comprehensive benchmark dataset.
Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on regex-based matching or employ general-purpose LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability to transfer across domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, and can process various answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal or invalid responses. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate research on answer verification, evaluation protocols, and reinforcement learning. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
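To make the intended usage concrete, below is a minimal sketch of how a verifier model of this kind might be invoked to judge a candidate response against a gold answer. The checkpoint name, prompt template, and CORRECT/INCORRECT/INVALID label scheme are illustrative assumptions rather than the released CompassVerifier interface; consult the linked repository for the official prompts and model weights.

```python
# Minimal sketch: calling a (hypothetical) verifier checkpoint to judge whether a
# model response matches a gold answer. The checkpoint name and prompt template
# are illustrative assumptions, not the official CompassVerifier interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "open-compass/CompassVerifier-3B"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")


def verify(question: str, gold_answer: str, response: str) -> str:
    """Ask the verifier to label a response as CORRECT, INCORRECT, or INVALID."""
    prompt = (
        "Judge whether the candidate response answers the question correctly.\n"
        f"Question: {question}\n"
        f"Gold answer: {gold_answer}\n"
        f"Candidate response: {response}\n"
        "Reply with exactly one label: CORRECT, INCORRECT, or INVALID."
    )
    # Assumes an instruction-tuned checkpoint that ships with a chat template.
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=16, do_sample=False)
    # Decode only the newly generated tokens, i.e. the verdict.
    return tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    ).strip()


print(verify("What is 12 * 9?", "108", "The product of 12 and 9 is 108."))
```

The same call pattern could plausibly serve as the outcome reward described in the abstract, e.g. by mapping the CORRECT/INCORRECT/INVALID verdict to a scalar reward inside an RL training loop.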
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains (2025)
- UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases (2025)
- RLPR: Extrapolating RLVR to General Domains without Verifiers (2025)
- VerIF: Verification Engineering for Reinforcement Learning in Instruction Following (2025)
- Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers (2025)
- One Token to Fool LLM-as-a-Judge (2025)
- TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models (2025)