Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations Paper • 2506.18898 • Published 3 days ago • 23
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs Paper • 2505.21327 • Published 30 days ago • 83
Unleashing Vecset Diffusion Model for Fast Shape Generation Paper • 2503.16302 • Published Mar 20 • 44
Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model Paper • 2502.16779 • Published Feb 24 • 3
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation Paper • 2502.16707 • Published Feb 23 • 13
Chimera: Improving Generalist Model with Domain-Specific Experts Paper • 2412.05983 • Published Dec 8, 2024 • 9
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 24
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant Paper • 2410.13360 • Published Oct 17, 2024 • 9
Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations Paper • 2410.08049 • Published Oct 10, 2024 • 8