SeerAttention-R: Sparse Attention Adaptation for Long Reasoning Paper • 2506.08889 • Published Jun 10 • 24
VideoDeepResearch: Long Video Understanding With Agentic Tool Using Paper • 2506.10821 • Published Jun 12 • 19
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers Paper • 2506.07986 • Published Jun 9 • 19
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs Paper • 2506.05344 • Published Jun 5 • 16
zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression Paper • 2506.01084 • Published Jun 1 • 7
Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers Paper • 2506.03065 • Published Jun 3 • 27
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers Paper • 2505.21541 • Published May 24 • 7
Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model Paper • 2505.17561 • Published May 23 • 31
LaViDa: A Large Diffusion Language Model for Multimodal Understanding Paper • 2505.16839 • Published May 22 • 12
Training-Free Efficient Video Generation via Dynamic Token Carving Paper • 2505.16864 • Published May 22 • 22
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning Paper • 2505.16933 • Published May 22 • 33
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Paper • 2505.16175 • Published May 22 • 42
X-Fusion: Introducing New Modality to Frozen Large Language Models Paper • 2504.20996 • Published Apr 29 • 13
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14 • 96