PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling Paper • 2506.20936 • Published 19 days ago • 11
view article Article Introducing Cosmos Predict-2: A Foundation For Your Own World Model By nvidia and 2 others • 28 days ago • 8
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech Paper • 2506.02863 • Published Jun 3 • 8
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published May 14 • 95
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published May 5 • 84
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Paper • 2505.02625 • Published May 5 • 22
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction Paper • 2505.02471 • Published May 5 • 12
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer Paper • 2504.20690 • Published Apr 29 • 20
Compass Control: Multi Object Orientation Control for Text-to-Image Generation Paper • 2504.06752 • Published Apr 9 • 10
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published Apr 10 • 49
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation Paper • 2503.09641 • Published Mar 12 • 40
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k Paper • 2503.09642 • Published Mar 12 • 18
A Multimodal Symphony: Integrating Taste and Sound through Generative AI Paper • 2503.02823 • Published Mar 4 • 3
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion Paper • 2503.01183 • Published Mar 3 • 28
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation Paper • 2502.20583 • Published Feb 27 • 13