StreamDiT: Real-Time Streaming Text-to-Video Generation • Paper • 2507.03745 • Published 4 days ago • 15
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks • Paper • 2507.01955 • Published 6 days ago • 25
Energy-Based Transformers are Scalable Learners and Thinkers • Paper • 2507.02092 • Published 6 days ago • 41
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective • Paper • 2507.01925 • Published 6 days ago • 29
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers • Paper • 2506.23918 • Published 9 days ago • 73
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings • Paper • 2506.23115 • Published 10 days ago • 36
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning • Paper • 2507.01006 • Published 7 days ago • 174
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs • Paper • 2506.21656 • Published 12 days ago • 13
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing • Paper • 2506.17450 • Published 18 days ago • 60
Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales • Paper • 2506.19713 • Published 15 days ago • 13
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning • Paper • 2506.16141 • Published 20 days ago • 27
Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition • Paper • 2506.17201 • Published 18 days ago • 52
DreamCube: 3D Panorama Generation via Multi-plane Synchronization • Paper • 2506.17206 • Published 18 days ago • 21
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning • Paper • 2506.09049 • Published 28 days ago • 34