SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement Paper • 2504.07934 • Published Apr 10, 2025 • 19
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation Paper • 2503.20672 • Published Mar 26, 2025 • 14
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities Paper • 2408.00765 • Published Aug 1, 2024 • 14
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published Jun 12, 2024 • 29
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Paper • 2404.16375 • Published Apr 25, 2024 • 18
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation Paper • 2311.07562 • Published Nov 13, 2023 • 15
MM-VID: Advancing Video Understanding with GPT-4V(ision) Paper • 2310.19773 • Published Oct 30, 2023 • 20
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design Paper • 2310.15144 • Published Oct 23, 2023 • 14
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation Paper • 2310.07749 • Published Oct 11, 2023 • 5
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation Paper • 2310.08541 • Published Oct 12, 2023 • 18
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities Paper • 2308.02490 • Published Aug 4, 2023 • 17
Aligning Large Multi-Modal Model with Robust Instruction Tuning Paper • 2306.14565 • Published Jun 26, 2023 • 6
DisCo: Disentangled Control for Referring Human Dance Generation in Real World Paper • 2307.00040 • Published Jun 30, 2023 • 25
GIT: A Generative Image-to-text Transformer for Vision and Language Paper • 2205.14100 • Published May 27, 2022 • 1