Model Evaluation
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon (arXiv:2502.07445)
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (arXiv:2502.04689)
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (arXiv:2502.03032)
Preference Leakage: A Contamination Problem in LLM-as-a-judge (arXiv:2502.01534)
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models (arXiv:2502.01639)
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (arXiv:2502.09621)
Logical Reasoning in Large Language Models: A Survey (arXiv:2502.09100)
IHEval: Evaluating Language Models on Following the Instruction Hierarchy (arXiv:2502.08745)
InductionBench: LLMs Fail in the Simplest Complexity Class (arXiv:2502.15823)
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM (arXiv:2503.04504)
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders (arXiv:2503.03601)
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence (arXiv:2503.05037)
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? (arXiv:2503.12349)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey (arXiv:2503.12605)
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity (arXiv:2503.11557)
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era (arXiv:2503.12329)
Where do Large Vision-Language Models Look at when Answering Questions? (arXiv:2503.13891)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders (arXiv:2503.18878)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning (arXiv:2506.05523)
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent (arXiv:2506.17612)
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test (arXiv:2506.21551)
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study (arXiv:2506.19794)
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation (arXiv:2506.21876)
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models (arXiv:2507.07484)
Hidden in plain sight: VLMs overlook their visual representations (arXiv:2506.08008)
PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models (arXiv:2507.13428)
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models (arXiv:2507.12806)
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding (arXiv:2507.15028)
Pixels, Patterns, but No Poetry: To See The World like Humans (arXiv:2507.16863)
AgroBench: Vision-Language Model Benchmark in Agriculture (arXiv:2507.20519)
Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? (arXiv:2508.03644)
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts (arXiv:2508.09848)
A Survey on Large Language Model Benchmarks (arXiv:2508.15361)
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks (arXiv:2509.01396)
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs (arXiv:2509.04013)
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval (arXiv:2510.09510)
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models (arXiv:2510.16641)
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark (arXiv:2510.26802)
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks (arXiv:2510.25760)
UniREditBench: A Unified Reasoning-based Image Editing Benchmark (arXiv:2511.01295)
SO-Bench: A Structural Output Evaluation of Multimodal LLMs (arXiv:2511.21750)
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence (arXiv:2512.02622)
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle (arXiv:2512.04324)