Bringing MLLMs into the Embodied World
Demo for multimodal understanding and generation
VideoRefer x VideoLLaMA3