Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs Paper • 2504.15280 • Published 1 day ago • 13
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper • 2503.06749 • Published Mar 9 • 28
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Paper • 2311.06242 • Published Nov 10, 2023 • 93
view article Article Memory-efficient Diffusion Transformers with Quanto and Diffusers Jul 30, 2024 • 66
InternVL2.0 Collection Expanding Performance Boundaries of Open-Source MLLM • 15 items • Updated 3 days ago • 91
λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space Paper • 2402.05195 • Published Feb 7, 2024 • 19
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering Paper • 2403.09622 • Published Mar 14, 2024 • 18
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis Paper • 2310.00426 • Published Sep 30, 2023 • 60
Sora Reference Papers Collection A collection of all papers referenced in OpenAI's "Video generation models as world simulators" technical report • openai.com/sora • 30 items • Updated Oct 3, 2024 • 52