STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis Paper • 2506.06276 • Published 23 days ago • 20
Mercury: Ultra-Fast Language Models Based on Diffusion Paper • 2506.17298 • Published 12 days ago • 1
USAD: Universal Speech and Audio Representation via Distillation Paper • 2506.18843 • Published 6 days ago • 10
Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales Paper • 2506.19713 • Published 6 days ago • 12
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders Paper • 2407.13036 • Published Jul 17, 2024 • 3
LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones Paper • 2409.03460 • Published Sep 5, 2024 • 1
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens Paper • 2506.17218 • Published 9 days ago • 19
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding Paper • 2506.16035 • Published 11 days ago • 82
Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression Paper • 2506.09482 • Published 19 days ago • 46
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning Paper • 2506.09985 • Published 18 days ago • 26
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model Paper • 2506.13642 • Published 14 days ago • 26
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 116