Abstract
AutoMV, a multi-agent system, generates coherent full-length music videos directly from songs, outperforming existing methods and narrowing the gap to professional videos.
Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, fail to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for the downstream agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call an image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
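The abstract's pipeline (feature extraction → screenwriter/director agents with a shared character bank → scene-specific generators → verifier) can be sketched as a minimal orchestration loop. This is a hypothetical illustration only: every name below (SongContext, screenwriter_agent, director_agent, verifier_agent, automv) is an assumed stand-in, not the paper's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the AutoMV multi-agent pipeline.
# All names and data shapes are illustrative assumptions.

@dataclass
class SongContext:
    structure: list      # e.g. ["intro", "verse", "chorus"]
    lyrics: list         # (timestamp_sec, line) pairs
    vocal_track: str     # path to the separated vocal stem

def extract_music_features(song_path: str) -> SongContext:
    """Stand-in for the music processing tools (structure analysis,
    source separation, time-aligned lyric transcription)."""
    return SongContext(
        structure=["intro", "verse", "chorus"],
        lyrics=[(0.0, "..."), (12.5, "...")],
        vocal_track=song_path + ".vocals.wav",
    )

def screenwriter_agent(ctx: SongContext, character_bank: dict) -> list:
    # Drafts a short script per song section; registers characters in the
    # shared external bank so later scenes reuse consistent profiles.
    character_bank.setdefault("lead_singer", {"profile": "..."})
    return [{"section": s, "scene": f"scene for {s}"} for s in ctx.structure]

def director_agent(script: list) -> list:
    # Adds camera instructions and routes each scene to a
    # "singer" generator (performance shots) or a "story" generator.
    return [dict(scene, camera="medium shot",
                 generator="singer" if scene["section"] == "chorus" else "story")
            for scene in script]

def verifier_agent(shots: list) -> list:
    # Accepts or rejects generated shots; rejected ones would be retried.
    return [s for s in shots if s["generator"] in {"story", "singer"}]

def automv(song_path: str) -> list:
    character_bank = {}  # shared state across agents
    ctx = extract_music_features(song_path)
    script = screenwriter_agent(ctx, character_bank)
    shots = director_agent(script)
    return verifier_agent(shots)
```

The key design point the abstract emphasizes is the shared external character bank: because the screenwriter writes into state that later agents read, characters stay visually consistent across a long-form video rather than drifting clip to clip.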
Community
arxiv: https://arxiv.org/abs/2512.12196v1
GitHub: https://github.com/multimodal-art-projection/AutoMV
Website: https://m-a-p.ai/AutoMV/
Apache-2.0 license
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- YingVideo-MV: Music-Driven Multi-Stage Video Generation (2025)
- Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation (2025)
- Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model (2025)
- FoleyBench: A Benchmark For Video-to-Audio Models (2025)
- VABench: A Comprehensive Benchmark for Audio-Video Generation (2025)
- UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist (2025)
- SegTune: Structured and Fine-Grained Control for Song Generation (2025)
Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0