arxiv:2504.16074

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Published on Apr 22 · Submitted by StarThomas1002 on Apr 24
Abstract

We introduce PHYBench, a novel, high-quality benchmark designed for evaluating the reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.

Community


🎉 Major Release: PHYBench Benchmark is Now Public! 🔥

📄 Paper: https://arxiv.org/abs/2504.16074

🌐 Website: PHYBench Official Demo (https://phybench-official.github.io/phybench-demo/)

📦 Dataset: Hugging Face – PHYBench

📰 Featured in Hugging Face Daily Papers


We proudly present PHYBench, a high-quality physics reasoning benchmark developed with great dedication by the School of Physics at Peking University.

PHYBench contains 500 carefully curated, challenging physics problems, designed to rigorously evaluate models' true understanding of physical concepts and their ability to perform complex reasoning.

Unlike traditional benchmarks that rely on multiple-choice or simple numerical answers, PHYBench adopts expression-based answers, familiar to physics competition participants. To evaluate accuracy more effectively, we introduce the Expression Edit Distance (EED) Score: the closer a model's answer is to the ground-truth expression, the higher the score. This partial-credit scoring improves sample efficiency by 200%, making 500 problems equivalent to over 1,500 binary-scored items.
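
To make the idea concrete, here is a minimal sketch of expression-tree edit-distance scoring. It is not the official PHYBench implementation: it assumes SymPy for parsing answers into expression trees and the zss package (Zhang-Shasha tree edit distance) for the distance itself, and the helper names `to_tree` and `eed_score` are hypothetical; the actual EED Score's cost model and score mapping may differ.

```python
# Illustrative sketch of an expression-edit-distance score (not the official PHYBench code).
# Assumptions: sympy parses answers into expression trees; the zss package provides
# Zhang-Shasha tree edit distance. Install with: pip install sympy zss
import sympy as sp
from zss import Node, simple_distance


def to_tree(expr: sp.Basic) -> Node:
    """Turn a SymPy expression into a zss tree whose labels are operator/atom names."""
    if expr.args:  # composite node, e.g. Add, Mul, Pow, sin, ...
        node = Node(expr.func.__name__)
        for arg in expr.args:
            node.addkid(to_tree(arg))
        return node
    return Node(str(expr))  # leaf: symbol or number


def tree_size(node: Node) -> int:
    """Count the nodes in a zss tree."""
    return 1 + sum(tree_size(child) for child in node.children)


def eed_score(answer: str, ground_truth: str) -> float:
    """Illustrative score in [0, 1]: 1 for a symbolically exact match, otherwise
    decaying with the relative tree edit distance (the real EED mapping may differ)."""
    ans, gt = sp.sympify(answer), sp.sympify(ground_truth)
    if sp.simplify(ans - gt) == 0:  # exact match up to algebraic simplification
        return 1.0
    dist = simple_distance(to_tree(ans), to_tree(gt))
    return max(0.0, 1.0 - dist / tree_size(to_tree(gt)))


if __name__ == "__main__":
    print(eed_score("m*g*h", "m*g*h"))      # 1.0: identical expressions
    print(eed_score("m*g*h/2", "m*g*h"))    # partial credit: structurally close
    print(eed_score("k*q/r**2", "m*g*h"))   # 0.0: structurally unrelated
```

Under a scheme like this, a near-miss such as m*g*h/2 against the ground truth m*g*h still earns substantial credit rather than the zero a binary grader would assign, which is what makes the metric more sample-efficient than right/wrong scoring.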

Our findings show that even the strongest reasoning model to date, Gemini 2.5 Pro, achieves only 36.9% accuracy, while human experts exceed 60% on the same benchmark.

This project is the result of one and a half months of work by 180 students from the School of Physics and partner departments at Peking University.

We have publicly released our dataset and website, and warmly welcome everyone to follow, cite, and share PHYBench. Let's push the boundaries of AI in physics reasoning together!
