arxiv:2507.01001

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Published on Jul 1
Submitted by yilunzhao on Jul 2
#3 Paper of the day

AI-generated summary

SciArena is a community-driven platform for evaluating foundation models on scientific literature tasks, using collective voter judgments to rank models and address the need for reliable automated evaluation.

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse and aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss results and insights from the model-ranking leaderboard. To further promote research on model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
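The SciArena-Eval setup described above boils down to a pairwise-agreement measure: how often a model judge's preference matches the human vote. The sketch below illustrates that kind of metric under stated assumptions; the field names, example data, and stand-in judge function are hypothetical placeholders, not the paper's actual implementation or data format.

def pairwise_judge_accuracy(examples, judge_fn):
    """Fraction of pairwise comparisons where a judge agrees with the human vote.

    examples : list of dicts with keys 'question', 'answer_a', 'answer_b',
               and 'human_vote' in {"A", "B"}  (hypothetical schema)
    judge_fn : callable (question, answer_a, answer_b) -> "A" or "B"
    """
    correct = 0
    for ex in examples:
        pred = judge_fn(ex["question"], ex["answer_a"], ex["answer_b"])
        correct += int(pred == ex["human_vote"])
    return correct / max(len(examples), 1)

# Toy usage with a trivial stand-in judge that prefers the longer answer.
examples = [
    {"question": "q1", "answer_a": "short", "answer_b": "a longer, cited answer", "human_vote": "B"},
    {"question": "q2", "answer_a": "a detailed, grounded reply", "answer_b": "ok", "human_vote": "A"},
]
longer_judge = lambda q, a, b: "A" if len(a) >= len(b) else "B"
print(pairwise_judge_accuracy(examples, longer_judge))  # 1.0 on this toy data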

Community

Paper author and submitter (yilunzhao):

Scientific literature is expanding at an unprecedented rate, making it challenging for researchers to stay current and synthesize new knowledge. Foundation models are increasingly being used to help with this, but evaluating their capabilities on open-ended scientific tasks remains a significant challenge. Traditional benchmarks are often unsuited to such nuanced evaluation: they are static, limited in scale, and quickly become outdated. To address these limitations, we present SciArena, an open and collaborative platform that directly engages the scientific research community in evaluating foundation models on scientific literature tasks. This crowdsourced, head-to-head evaluation approach for LLMs has been successfully pioneered in the general domain by platforms such as Chatbot Arena.
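For readers unfamiliar with Arena-style evaluation, here is a minimal sketch of how pairwise votes are commonly turned into a leaderboard using a Bradley-Terry model. The vote data, model names, and fitting routine are illustrative assumptions, not SciArena's actual ranking code.

import numpy as np

def bradley_terry_scores(votes, models, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) vote pairs via MM updates."""
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros((n, n))            # wins[i, j] = number of times i beat j
    for winner, loser in votes:
        wins[idx[winner], idx[loser]] += 1

    p = np.ones(n)                     # initial strengths
    for _ in range(iters):
        total_wins = wins.sum(axis=1)
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                n_ij = wins[i, j] + wins[j, i]
                if n_ij > 0:
                    denom[i] += n_ij / (p[i] + p[j])
        p = total_wins / np.maximum(denom, 1e-12)
        p = p / p.sum()                # normalize for identifiability
    return dict(zip(models, p))

# Toy usage with hypothetical model names and votes (winner listed first).
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
print(bradley_terry_scores(votes, ["model-a", "model-b", "model-c"]))

Fitted strengths (or their logarithms, which behave like Elo-style ratings) can then be sorted to produce the leaderboard.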
