Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis Paper • 2505.10046 • Published 15 days ago • 9
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop Paper • 2503.09595 • Published Mar 12
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published 15 days ago • 85
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper • 2412.14171 • Published Dec 18, 2024 • 24
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 61
Image Sculpting: Precise Object Editing with 3D Geometry Control Paper • 2401.01702 • Published Jan 2, 2024 • 21
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition Paper • 2203.07996 • Published Feb 24, 2022
Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models Paper • 2211.10950 • Published Nov 20, 2022
Kosmos-G: Generating Images in Context with Multimodal Large Language Models Paper • 2310.02992 • Published Oct 4, 2023 • 4