Model Evaluation
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon (arXiv:2502.07445)
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (arXiv:2502.04689)
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (arXiv:2502.03032)
Preference Leakage: A Contamination Problem in LLM-as-a-judge (arXiv:2502.01534)
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models (arXiv:2502.01639)
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (arXiv:2502.09621)
Logical Reasoning in Large Language Models: A Survey (arXiv:2502.09100)
IHEval: Evaluating Language Models on Following the Instruction Hierarchy (arXiv:2502.08745)
InductionBench: LLMs Fail in the Simplest Complexity Class (arXiv:2502.15823)
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM (arXiv:2503.04504)
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders (arXiv:2503.03601)
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence (arXiv:2503.05037)
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? (arXiv:2503.12349)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey (arXiv:2503.12605)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity (arXiv:2503.11557)
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era (arXiv:2503.12329)
Where do Large Vision-Language Models Look at when Answering Questions? (arXiv:2503.13891)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders (arXiv:2503.18878)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning (arXiv:2506.05523)
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent (arXiv:2506.17612)
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test (arXiv:2506.21551)
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study (arXiv:2506.19794)
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation (arXiv:2506.21876)
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models (arXiv:2507.07484)
Hidden in plain sight: VLMs overlook their visual representations (arXiv:2506.08008)
PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models (arXiv:2507.13428)
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models (arXiv:2507.12806)
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding (arXiv:2507.15028)
Pixels, Patterns, but No Poetry: To See The World like Humans (arXiv:2507.16863)
AgroBench: Vision-Language Model Benchmark in Agriculture (arXiv:2507.20519)
Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? (arXiv:2508.03644)
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts (arXiv:2508.09848)
A Survey on Large Language Model Benchmarks (arXiv:2508.15361)
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks (arXiv:2509.01396)
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs (arXiv:2509.04013)
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval (arXiv:2510.09510)
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models (arXiv:2510.16641)
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark (arXiv:2510.26802)
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks (arXiv:2510.25760)
UniREditBench: A Unified Reasoning-based Image Editing Benchmark (arXiv:2511.01295)
SO-Bench: A Structural Output Evaluation of Multimodal LLMs (arXiv:2511.21750)
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence (arXiv:2512.02622)
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle (arXiv:2512.04324)