StreamDiT: Real-Time Streaming Text-to-Video Generation • Paper • 2507.03745 • Published 5 days ago • 20
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks • Paper • 2507.01955 • Published 7 days ago • 27
Energy-Based Transformers are Scalable Learners and Thinkers • Paper • 2507.02092 • Published 6 days ago • 41
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective • Paper • 2507.01925 • Published 7 days ago • 29
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers • Paper • 2506.23918 • Published 9 days ago • 74
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings • Paper • 2506.23115 • Published 10 days ago • 36
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning • Paper • 2507.01006 • Published 8 days ago • 174
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs • Paper • 2506.21656 • Published 13 days ago • 13
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing • Paper • 2506.17450 • Published 18 days ago • 60