btjhjeon's Collections
Multimodal Benchmarks
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (arXiv:2407.07053, 47 upvotes)
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models (arXiv:2407.12772, 36 upvotes)
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models (arXiv:2407.11691, 14 upvotes)
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (arXiv:2408.02718, 62 upvotes)
Teaching CLIP to Count to Ten (arXiv:2302.12066)
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models (arXiv:2408.11817, 9 upvotes)
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (arXiv:2408.13257, 27 upvotes)
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios (arXiv:2408.17267, 24 upvotes)
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images (arXiv:2408.16176, 8 upvotes)
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (arXiv:2409.02813, 31 upvotes)
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (arXiv:2409.07703, 69 upvotes)
OmniBench: Towards The Future of Universal Omni-Language Models (arXiv:2409.15272, 31 upvotes)
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models (arXiv:2409.13592, 52 upvotes)
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos (arXiv:2410.02763, 7 upvotes)
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks (arXiv:2410.12381, 45 upvotes)
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation (arXiv:2410.12722, 5 upvotes)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models (arXiv:2410.10139, 53 upvotes)
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks (arXiv:2410.10563, 39 upvotes)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content (arXiv:2410.10783, 28 upvotes)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models (arXiv:2410.10818, 17 upvotes)
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models (arXiv:2410.09733, 9 upvotes)
TVBench: Redesigning Video-Language Evaluation (arXiv:2410.07752, 6 upvotes)
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures (arXiv:2410.13754, 76 upvotes)
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio (arXiv:2410.12787, 32 upvotes)
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation (arXiv:2410.17250, 15 upvotes)
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples (arXiv:2410.14669, 38 upvotes)
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark (arXiv:2410.18976, 12 upvotes)
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts (arXiv:2410.18071, 7 upvotes)
CLEAR: Character Unlearning in Textual and Visual Modalities (arXiv:2410.18057, 210 upvotes)
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (arXiv:2410.19168, 20 upvotes)
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays (arXiv:2410.21969, 10 upvotes)
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models (arXiv:2410.23266, 20 upvotes)
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models (arXiv:2411.00836, 15 upvotes)
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models (arXiv:2411.04075, 17 upvotes)
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework (arXiv:2411.06176, 46 upvotes)
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (arXiv:2411.07975, 31 upvotes)
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models (arXiv:2411.17451, 11 upvotes)
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment (arXiv:2411.17188, 23 upvotes)
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format (arXiv:2411.17991, 5 upvotes)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs (arXiv:2411.15296, 22 upvotes)
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (arXiv:2412.00947, 8 upvotes)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (arXiv:2412.02611, 24 upvotes)
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark (arXiv:2412.07825, 11 upvotes)
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations (arXiv:2412.07626, 22 upvotes)
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions (arXiv:2412.08737, 54 upvotes)
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities (arXiv:2412.07769, 28 upvotes)
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models (arXiv:2412.12606, 42 upvotes)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain (arXiv:2412.13018, 42 upvotes)
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces (arXiv:2412.14171, 24 upvotes)
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks (arXiv:2412.18072, 19 upvotes)
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models (arXiv:2501.02955, 45 upvotes)
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? (arXiv:2501.05510, 44 upvotes)
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs (arXiv:2501.06186, 66 upvotes)
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents (arXiv:2501.08828, 32 upvotes)
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot (arXiv:2501.09012, 10 upvotes)
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos (arXiv:2501.09781, 29 upvotes)
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding (arXiv:2501.12380, 86 upvotes)
MSTS: A Multimodal Safety Test Suite for Vision-Language Models (arXiv:2501.10057, 8 upvotes)
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos (arXiv:2501.13826, 26 upvotes)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents (arXiv:2501.11858, 7 upvotes)
Redundancy Principles for MLLMs Benchmarks (arXiv:2501.13953, 29 upvotes)
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding (arXiv:2501.16411, 18 upvotes)
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models (arXiv:2502.00698, 24 upvotes)
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation (arXiv:2502.08168, 12 upvotes)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents (arXiv:2502.09560, 36 upvotes)
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency (arXiv:2502.09621, 27 upvotes)
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data (arXiv:2502.08468, 13 upvotes)
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models (arXiv:2502.09696, 43 upvotes)
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment (arXiv:2502.10391, 34 upvotes)
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching (arXiv:2502.12852, 3 upvotes)
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking (arXiv:2502.13766, 3 upvotes)
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues (arXiv:2502.12084, 29 upvotes)
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding (arXiv:2502.14949, 7 upvotes)
Evaluating Multimodal Generative AI with Korean Educational Standards (arXiv:2502.15422, 9 upvotes)
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models (arXiv:2502.16033, 17 upvotes)
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment (arXiv:2502.15167, 2 upvotes)
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference (arXiv:2502.18411, 73 upvotes)
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding (arXiv:2502.19400, 48 upvotes)
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension (arXiv:2503.08689, 4 upvotes)
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering (arXiv:2503.06492, 10 upvotes)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning (arXiv:2503.10291, 34 upvotes)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (arXiv:2503.10615, 16 upvotes)
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research (arXiv:2503.13399, 20 upvotes)
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM (arXiv:2503.14478, 44 upvotes)
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding (arXiv:2503.12797, 29 upvotes)
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification (arXiv:2503.12505, 9 upvotes)
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models (arXiv:2503.12545, 5 upvotes)
Judge Anything: MLLM as a Judge Across Any Modality (arXiv:2503.17489, 19 upvotes)
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models (arXiv:2503.18923, 12 upvotes)
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation (arXiv:2503.19622, 29 upvotes)
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? (arXiv:2503.19990, 33 upvotes)
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks (arXiv:2410.19100, 6 upvotes)
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks (arXiv:2501.11733, 29 upvotes)
ViLBench: A Suite for Vision-Language Process Reward Modeling (arXiv:2503.20271, 7 upvotes)
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness (arXiv:2503.21755, 31 upvotes)
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding (arXiv:2503.17827, 8 upvotes)
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation (arXiv:2503.14941, 6 upvotes)
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language (arXiv:2503.23730, 4 upvotes)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (arXiv:2503.24376, 37 upvotes)
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts (arXiv:2503.22952, 18 upvotes)
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing (arXiv:2504.02826, 67 upvotes)
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation (arXiv:2504.02782, 54 upvotes)
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models (arXiv:2504.03641, 13 upvotes)
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation (arXiv:2504.00043, 8 upvotes)
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning (arXiv:2504.07956, 42 upvotes)