Yume-1.5: A Text-Controlled Interactive World Generation Model • Paper • 2512.22096 • Published 5 days ago • 53
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer • Paper • 2511.22699 • Published Nov 27, 2025 • 217
MeViS • Collection • MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation • 2 items • Updated Nov 14, 2025
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI • Paper • 2510.05684 • Published Oct 7, 2025 • 141
Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents • Paper • 2510.23691 • Published Oct 27, 2025 • 53 • 10
MOVE • Collection • Motion-Guided Few-Shot Video Object Segmentation • 2 items • Updated Sep 28, 2025
OmniAVS • Collection • Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation • 3 items • Updated Sep 28, 2025