AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Abstract
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that derives the conditioning signal from feature extractors pretrained for other tasks, AV-Link directly leverages features obtained from the complementary modality within a single framework, i.e., video features to generate audio or audio features to generate video. We extensively evaluate our design choices and demonstrate that our method generates synchronized, high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/
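The abstract does not spell out how the Fusion Block aligns the two modalities in time. Purely as an illustration, the NumPy sketch below shows one plausible reading: video and audio tokens are concatenated, a single self-attention runs over the joint sequence, and a rotary position encoding driven by each token's timestamp gives tokens from the same instant the same rotation. Everything here is an assumption rather than the authors' implementation: the function names `rotary` and `fused_self_attention`, the shared feature width `d`, the use of seconds as positions, and the choice of rotary encoding itself.

```python
import numpy as np

def rotary(x, t, base=10000.0):
    """Rotate feature pairs of x (n, d) by angles proportional to timestamps t (n,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per feature pair
    ang = t[:, None] * freqs[None, :]           # (n, half) rotation angles
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang),
         x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

def fused_self_attention(video, audio, t_video, t_audio, w_q, w_k, w_v):
    """Joint self-attention over video and audio tokens (hypothetical sketch).

    video: (Nv, d), audio: (Na, d); both assumed already projected to width d.
    t_video, t_audio: per-token timestamps in seconds (an assumption).
    """
    tokens = np.concatenate([video, audio], axis=0)   # (Nv + Na, d)
    t = np.concatenate([t_video, t_audio])
    q = rotary(tokens @ w_q, t)                       # time-aware queries
    k = rotary(tokens @ w_k, t)                       # time-aware keys
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    out = weights @ v
    n_video = len(video)
    return out[:n_video], out[n_video:], weights

# Toy usage: 4 video frames and 16 audio latents spanning the same second.
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(4, d))
audio = rng.normal(size=(16, d))
t_video = np.arange(4) / 4.0
t_audio = np.arange(16) / 16.0
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
v_out, a_out, weights = fused_self_attention(
    video, audio, t_video, t_audio, w_q, w_k, w_v)
```

In this sketch the rotation depends only on a token's timestamp, so the position-dependent part of each attention score is a function of the time difference between two tokens regardless of which modality they come from. That is one simple way to express temporal alignment across streams with different token rates.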
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation (2024)
- Tell What You Hear From What You See -- Video to Audio Generation Through Text (2024)
- Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation (2024)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation (2024)
- YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls (2024)
- Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration (2024)
- LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis (2024)