Submitted by minghaowu · 39 · The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks · 10 authors · 1
Submitted by longlian · 24 · Describe Anything: Detailed Localized Image and Video Captioning · 11 authors · 2
Submitted by zhangysk · 14 · IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs · 20 authors · 1
Submitted by yueyang2000 · 11 · CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning · 6 authors · 1
Submitted by Neph0s · 10 · BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation · 6 authors · 1
Submitted by Zilence006 · 6 · Vidi: Large Multimodal Models for Video Understanding and Editing · 22 authors · 1
Submitted by chenjoya · 5 · LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale · 6 authors · 1
Submitted by sayakpaul · 4 · From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning · 9 authors · 1
Submitted by thomasschmied · 4 · LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities · 5 authors · 1
Submitted by zhoutianyi · 3 · WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents · 7 authors · 2
Submitted by ziqipang · 1 · MR. Video: "MapReduce" is the Principle for Long Video Understanding · 2 authors · 1
Submitted by QiYao-Wang · 1 · IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property · 23 authors · 1
Submitted by theFoxofSky · 1 · RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild · 8 authors · 1