## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)

Paper (Soon) | Webpage | Models | Demo

[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

MMAudio generates synchronized audio from video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. In addition, a synchronization module aligns the generated audio with the video frames.

## Make sure we are using a GPU

If not, go to Runtime -> Change runtime type and select T4.



In [None]:
!nvidia-smi

import torch

if torch.cuda.is_available():
    print('Using GPU')
    device = 'cuda'
else:
    print('CUDA not available. Please connect to a GPU instance if possible.')
    device = 'cpu'

Tue Dec 10 21:37:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Using GPU
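
The same check can be written so that it degrades gracefully when `torch` or the NVIDIA tools are not installed. The following is a minimal sketch, not part of MMAudio: the helper names `gpu_name_from_nvidia_smi` and `pick_device` are hypothetical, and the sketch assumes `nvidia-smi` supports the `--query-gpu` and `--format` options found in recent drivers.

```python
import shutil
import subprocess


def gpu_name_from_nvidia_smi():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # NVIDIA driver tools are not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # the tool exists but could not talk to a GPU
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return names[0] if names else None


def pick_device():
    """Return 'cuda' if torch sees a GPU, otherwise 'cpu' (also when torch is missing)."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"device={device}, gpu={gpu_name_from_nvidia_smi()}")
```

On a T4 runtime like the one above, this would be expected to report `cuda` and the GPU name; on a CPU-only runtime it reports `cpu` and `None`.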


## Install dependencies

In [None]:
!pip install torch torchvision torchaudio
!git clone https://github.com/hkchengrex/MMAudio.git
%cd /content/MMAudio
!pip install -e .

fatal: destination path 'MMAudio' already exists and is not an empty directory.
/content/MMAudio
Obtaining file:///content/MMAudio
 Installing build dependencies ... done
 Checking if build backend supports build_editable ... done
 Getting requirements to build editable ... done
 Installing backend dependencies ... done
 Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: mmaudio
 Building editable for mmaudio (pyproject.toml) ... done
 Created wheel for mmaudio: filename=mmaudio-1.0.0-py3-none-any.whl size=4529 sha256=18ccab9f9e09fe24644a2d7c9c769e099a220964790e659fbd07d1b90f569884
 Stored in directory: /tmp/pip-ephem-wheel-cache-1nou9kfy/wheels/5f/51/e8/a8b3bd781dd4b9595fbcf3d3841b547397a4bff996716d464c
Successfully built mmaudio
Installing collected packages: mmaudio
 Attempting uninstall: mmaudio
 Found existing installation: mmaudio 1.0.0
 Uninstalling mmaudio-1.0.0:
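
The `fatal: destination path 'MMAudio' already exists` message in the output above appears when the cell is re-run after the repository has already been cloned; it is harmless here because the following commands still run. If a quieter re-run is preferred, the clone can be skipped when the directory is already populated. A minimal sketch, assuming `git` is on `PATH`; `clone_if_missing` and its `run` parameter are hypothetical names introduced for illustration.

```python
import os
import subprocess

REPO_URL = "https://github.com/hkchengrex/MMAudio.git"


def clone_if_missing(repo_url, dest, run=subprocess.run):
    """Clone repo_url into dest unless dest already exists and is non-empty.

    Returns True if a clone was attempted and False if it was skipped.
    The `run` argument exists only so the function can be exercised without git.
    """
    if os.path.isdir(dest) and os.listdir(dest):
        return False  # already cloned; nothing to do
    run(["git", "clone", repo_url, dest], check=True)
    return True
```

In the notebook, this would be called as `clone_if_missing(REPO_URL, "/content/MMAudio")` before the `%cd` and `pip install -e .` steps.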
 

## Load some data

In [None]:
%cd /content/MMAudio
!curl https://i.imgur.com/8xHJTzI.mp4 -o video.mp4

from IPython.display import HTML
from base64 import b64encode
data_url = "data:video/mp4;base64," + b64encode(open('video.mp4', 'rb').read()).decode()
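# Embed the clip inline as a base64 data URL so it plays inside the notebook
HTML("""
<video controls>
  <source src="%s" type="video/mp4">
</video>
""" % data_url)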

/content/MMAudio
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  832k  100  832k    0     0  4246k      0 --:--:-- --:--:-- --:--:-- 4248k
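
The preview cell works by base64-encoding the file and placing it in a `data:` URL, which the browser can play without a separate file server. Since the same pattern is used again for the generated output, it can be wrapped in a small helper. A minimal sketch; `video_data_url` and `video_tag` are hypothetical helper names, and the `<video>` markup is an assumption about how the clip should be displayed.

```python
from base64 import b64encode


def video_data_url(path, mime="video/mp4"):
    """Read a file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        payload = b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def video_tag(path, width=400):
    """Build an HTML <video> element that embeds the file inline."""
    return (
        f'<video width="{width}" controls>'
        f'<source src="{video_data_url(path)}" type="video/mp4">'
        "</video>"
    )
```

In a notebook cell, `HTML(video_tag('video.mp4'))` would then display the clip. Base64 inflates the payload by roughly a third, so this approach is best kept to short clips such as the 832k example downloaded above.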


## Run the model (models will be downloaded automatically)
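
The cell below calls `demo.py` with three flags: `--duration`, `--video`, and `--prompt`. When trying several prompts on the same clip, it can be convenient to assemble that command programmatically so that quoting is handled for you. A minimal sketch that only builds and prints the command; `build_demo_command` is a hypothetical helper, and only the flags shown in the cell below are assumed to exist.

```python
import shlex
import sys


def build_demo_command(video, prompt, duration=10):
    """Return the argument list for the demo.py call used in this notebook."""
    return [
        sys.executable, "demo.py",
        f"--duration={duration}",
        f"--video={video}",
        "--prompt", prompt,  # kept as one argument, so spaces need no manual quoting
    ]


cmd = build_demo_command("video.mp4", "waves and seagulls")
print(shlex.join(cmd))  # shell-safe rendering of the same command
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from inside `/content/MMAudio` would run the same command as the `!python` line.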

In [None]:
!python demo.py --duration=10 --video=video.mp4 --prompt "waves and seagulls"


from IPython.display import HTML
from base64 import b64encode
data_url = "data:video/mp4;base64," + b64encode(open('./output/video.mp4', 'rb').read()).decode()
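# Embed the generated video (with audio) inline as a base64 data URL
HTML("""
<video controls>
  <source src="%s" type="video/mp4">
</video>
""" % data_url)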

INFO | Downloading mmaudio_large_44k_v2.pth to weights/mmaudio_large_44k_v2.pth...
100% 4.12G/4.12G [03:35<00:00, 19.1MiB/s]
INFO | Downloading v1-44.pth to ext_weights/v1-44.pth...
100% 1.22G/1.22G [00:27<00:00, 43.7MiB/s]
INFO | Downloading synchformer_state_dict.pth to ext_weights/synchformer_state_dict.pth...
100% 950M/950M [00:12<00:00, 74.4MiB/s]
INFO | Loaded weights from weights/mmaudio_large_44k_v2.pth
open_clip_pytorch_model.bin: 100% 3.95G/3.95G [01:33<00:00, 42.0MB/s]
open_clip_config.json: 100% 735/735 [00:00<00:00, 6.15MB/s]
INFO | Loaded hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 model config.
INFO | Loading pretrained hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 weights (/root/.cache/huggingface/hub/models--apple--DFN5B-CLIP-ViT-H-14-384/snapshots/f17177bb05c69c6336d0adbfc97e06d69f876904/open_clip_pytorch_model.bin).
INFO | Loading MotionFormer confi