Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Paper • 2412.15322 • Published Dec 19, 2024 • 18
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published 12 days ago • 79
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Paper • 2505.02625 • Published 12 days ago • 20
Fast Text-to-Audio Generation with Adversarial Post-Training Paper • 2505.08175 • Published 4 days ago • 15
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Paper • 2505.07916 • Published 5 days ago • 108