Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation (arXiv:2506.21876, published Jun 27, 2025)
Can Vision Language Models Infer Human Gaze Direction? A Controlled Study (arXiv:2506.05412, published Jun 4, 2025)
4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time (arXiv:2506.18890, published Jun 23, 2025)
Learning Video Representations without Natural Videos (arXiv:2410.24213, published Oct 31, 2024)
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv:2506.17218, published Jun 20, 2025)
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation (arXiv:2504.07962, published Apr 10, 2025)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (arXiv:2506.10128, published Jun 11, 2025)
Frame In-N-Out: Unbounded Controllable Image-to-Video Generation (arXiv:2505.21491, published May 27, 2025)
VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation (arXiv:2503.14350, published Mar 18, 2025)
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation (arXiv:2504.16060, published Apr 22, 2025)
DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences (arXiv:2406.03008, published Jun 5, 2024)
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models (arXiv:2407.07035, published Jul 9, 2024)