{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)\n",
"\n",
"Paper (Soon) | Webpage | Models | Demo\n",
"\n",
"[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)\n",
"\n",
"University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation"
],
"metadata": {
"id": "o_cHOIk6fkrC"
}
},
{
"cell_type": "markdown",
"source": [
"MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames."
],
"metadata": {
"id": "MWrQTB4qf7kb"
}
},
{
"cell_type": "markdown",
"source": [
"## Make sure we are using a GPU\n",
"\n",
"If not, go to Runtime -> Change runtime type and select a T4 GPU."
],
"metadata": {
"id": "kmoaRe0ff_Jn"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f3P_MH7IWMlX",
"outputId": "d3a8337e-97ee-4e64-ff5e-1a0e71d76127"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Tue Dec 10 21:37:48 2024 \n",
"+---------------------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |\n",
"|-----------------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|=========================================+======================+======================|\n",
"| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n",
"| N/A 47C P8 11W / 70W | 0MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-----------------------------------------+----------------------+----------------------+\n",
" \n",
"+---------------------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=======================================================================================|\n",
"| No running processes found |\n",
"+---------------------------------------------------------------------------------------+\n",
"Using GPU\n"
]
}
],
"source": [
"!nvidia-smi\n",
"\n",
"import torch\n",
"\n",
"# Pick the device once so that later cells can reuse it\n",
"if torch.cuda.is_available():\n",
"    print('Using GPU')\n",
"    device = 'cuda'\n",
"else:\n",
"    print('CUDA not available. Please connect to a GPU instance if possible.')\n",
"    device = 'cpu'"
]
},
{
"cell_type": "markdown",
"source": [
"## Install dependencies"
],
"metadata": {
"id": "nq1Ytxi_gJnh"
}
},
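{
"cell_type": "code",
"source": [
"!pip install torch torchvision torchaudio\n",
"%cd /content\n",
"# Clone only if the repository is not already present, so this cell can be re-run safely\n",
"![ -d MMAudio ] || git clone https://github.com/hkchengrex/MMAudio.git\n",
"%cd /content/MMAudio\n",
"!pip install -e ."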
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tTiNYTWSgJKB",
"outputId": "c08726cb-485b-4eb6-aa26-6414263d5cfa"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n",
"Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.20.1+cu121)\n",
"Requirement already satisfied: torchaudio in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\n",
"Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n",
"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.10.0)\n",
"Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)\n",
"Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.26.4)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (11.0.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.5)\n",
"fatal: destination path 'MMAudio' already exists and is not an empty directory.\n",
"/content/MMAudio\n",
"Obtaining file:///content/MMAudio\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Checking if build backend supports build_editable ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build editable ... \u001b[?25l\u001b[?25hdone\n",
" Installing backend dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Preparing editable metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: auraloss in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.4.0)\n",
"Requirement already satisfied: colorlog in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (6.9.0)\n",
"Requirement already satisfied: cython in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (3.0.11)\n",
"Requirement already satisfied: einops>=0.6 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.8.0)\n",
"Requirement already satisfied: gitpython>=3.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (3.1.43)\n",
"Requirement already satisfied: gradio>=3.34 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (5.8.0)\n",
"Requirement already satisfied: hydra-colorlog in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.2.0)\n",
"Requirement already satisfied: hydra-core>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.3.2)\n",
"Requirement already satisfied: librosa>=0.8.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.10.2.post1)\n",
"Requirement already satisfied: nitrous-ema in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.0.1)\n",
"Requirement already satisfied: numpy<2.1,>=1.21 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.26.4)\n",
"Requirement already satisfied: open-clip-torch in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.29.0)\n",
"Requirement already satisfied: opencv-python>=4.8 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (4.10.0.84)\n",
"Requirement already satisfied: pillow>=9.5 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (11.0.0)\n",
"Requirement already satisfied: python-dotenv in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.0.1)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.32.3)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.4.5)\n",
"Requirement already satisfied: scipy>=1.7 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.13.1)\n",
"Requirement already satisfied: soundfile in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.12.1)\n",
"Requirement already satisfied: tensorboard>=2.11 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.17.1)\n",
"Requirement already satisfied: tensordict in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.6.2)\n",
"Requirement already satisfied: torch>=2.5.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.5.1+cu121)\n",
"Requirement already satisfied: torchdiffeq in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.2.5)\n",
"Requirement already satisfied: tqdm>=4.66.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (4.66.6)\n",
"Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from gitpython>=3.1->mmaudio==1.0.0) (4.0.11)\n",
"Requirement already satisfied: aiofiles<24.0,>=22.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (23.2.1)\n",
"Requirement already satisfied: anyio<5.0,>=3.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (3.7.1)\n",
"Requirement already satisfied: fastapi<1.0,>=0.115.2 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.115.6)\n",
"Requirement already satisfied: ffmpy in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.4.0)\n",
"Requirement already satisfied: gradio-client==1.5.1 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (1.5.1)\n",
"Requirement already satisfied: httpx>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.28.0)\n",
"Requirement already satisfied: huggingface-hub>=0.25.1 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.26.3)\n",
"Requirement already satisfied: jinja2<4.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (3.1.4)\n",
"Requirement already satisfied: markupsafe~=2.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.1.5)\n",
"Requirement already satisfied: orjson~=3.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (3.10.12)\n",
"Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (24.2)\n",
"Requirement already satisfied: pandas<3.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.2.2)\n",
"Requirement already satisfied: pydantic>=2.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.10.3)\n",
"Requirement already satisfied: pydub in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.25.1)\n",
"Requirement already satisfied: python-multipart>=0.0.18 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.0.19)\n",
"Requirement already satisfied: pyyaml<7.0,>=5.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (6.0.2)\n",
"Requirement already satisfied: ruff>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.8.2)\n",
"Requirement already satisfied: safehttpx<0.2.0,>=0.1.6 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.1.6)\n",
"Requirement already satisfied: semantic-version~=2.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.10.0)\n",
"Requirement already satisfied: starlette<1.0,>=0.40.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.41.3)\n",
"Requirement already satisfied: tomlkit<0.14.0,>=0.12.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.13.2)\n",
"Requirement already satisfied: typer<1.0,>=0.12 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.15.0)\n",
"Requirement already satisfied: typing-extensions~=4.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (4.12.2)\n",
"Requirement already satisfied: uvicorn>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.32.1)\n",
"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from gradio-client==1.5.1->gradio>=3.34->mmaudio==1.0.0) (2024.10.0)\n",
"Requirement already satisfied: websockets<15.0,>=10.0 in /usr/local/lib/python3.10/dist-packages (from gradio-client==1.5.1->gradio>=3.34->mmaudio==1.0.0) (14.1)\n",
"Requirement already satisfied: omegaconf<2.4,>=2.2 in /usr/local/lib/python3.10/dist-packages (from hydra-core>=1.3.2->mmaudio==1.0.0) (2.3.0)\n",
"Requirement already satisfied: antlr4-python3-runtime==4.9.* in /usr/local/lib/python3.10/dist-packages (from hydra-core>=1.3.2->mmaudio==1.0.0) (4.9.3)\n",
"Requirement already satisfied: audioread>=2.1.9 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (3.0.1)\n",
"Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.5.2)\n",
"Requirement already satisfied: joblib>=0.14 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.4.2)\n",
"Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (4.4.2)\n",
"Requirement already satisfied: numba>=0.51.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (0.60.0)\n",
"Requirement already satisfied: pooch>=1.1 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.8.2)\n",
"Requirement already satisfied: soxr>=0.3.2 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (0.5.0.post1)\n",
"Requirement already satisfied: lazy-loader>=0.1 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (0.4)\n",
"Requirement already satisfied: msgpack>=1.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.1.0)\n",
"Requirement already satisfied: cffi>=1.0 in /usr/local/lib/python3.10/dist-packages (from soundfile->mmaudio==1.0.0) (1.17.1)\n",
"Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (1.4.0)\n",
"Requirement already satisfied: grpcio>=1.48.2 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (1.68.1)\n",
"Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (3.7)\n",
"Requirement already satisfied: protobuf!=4.24.0,>=3.19.6 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (4.25.5)\n",
"Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (75.1.0)\n",
"Requirement already satisfied: six>1.9 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (1.16.0)\n",
"Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (0.7.2)\n",
"Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (3.1.3)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=2.5.1->mmaudio==1.0.0) (3.16.1)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=2.5.1->mmaudio==1.0.0) (3.4.2)\n",
"Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch>=2.5.1->mmaudio==1.0.0) (1.13.1)\n",
"Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch>=2.5.1->mmaudio==1.0.0) (1.3.0)\n",
"Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (0.20.1+cu121)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (2024.9.11)\n",
"Requirement already satisfied: ftfy in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (6.3.1)\n",
"Requirement already satisfied: timm in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (1.0.12)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (3.4.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (3.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (2.2.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (2024.8.30)\n",
"Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from tensordict->mmaudio==1.0.0) (3.1.0)\n",
"Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5.0,>=3.0->gradio>=3.34->mmaudio==1.0.0) (1.3.1)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5.0,>=3.0->gradio>=3.34->mmaudio==1.0.0) (1.2.2)\n",
"Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.0->soundfile->mmaudio==1.0.0) (2.22)\n",
"Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from gitdb<5,>=4.0.1->gitpython>=3.1->mmaudio==1.0.0) (5.0.1)\n",
"Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.24.1->gradio>=3.34->mmaudio==1.0.0) (1.0.7)\n",
"Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.24.1->gradio>=3.34->mmaudio==1.0.0) (0.14.0)\n",
"Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.0->librosa>=0.8.1->mmaudio==1.0.0) (0.43.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas<3.0,>=1.0->gradio>=3.34->mmaudio==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas<3.0,>=1.0->gradio>=3.34->mmaudio==1.0.0) (2024.2)\n",
"Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas<3.0,>=1.0->gradio>=3.34->mmaudio==1.0.0) (2024.2)\n",
"Requirement already satisfied: platformdirs>=2.5.0 in /usr/local/lib/python3.10/dist-packages (from pooch>=1.1->librosa>=0.8.1->mmaudio==1.0.0) (4.3.6)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->gradio>=3.34->mmaudio==1.0.0) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->gradio>=3.34->mmaudio==1.0.0) (2.27.1)\n",
"Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->librosa>=0.8.1->mmaudio==1.0.0) (3.5.0)\n",
"Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (8.1.7)\n",
"Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (1.5.4)\n",
"Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (13.9.4)\n",
"Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from ftfy->open-clip-torch->mmaudio==1.0.0) (0.2.13)\n",
"Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (3.0.0)\n",
"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (2.18.0)\n",
"Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (0.1.2)\n",
"Building wheels for collected packages: mmaudio\n",
" Building editable for mmaudio (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for mmaudio: filename=mmaudio-1.0.0-py3-none-any.whl size=4529 sha256=18ccab9f9e09fe24644a2d7c9c769e099a220964790e659fbd07d1b90f569884\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-1nou9kfy/wheels/5f/51/e8/a8b3bd781dd4b9595fbcf3d3841b547397a4bff996716d464c\n",
"Successfully built mmaudio\n",
"Installing collected packages: mmaudio\n",
" Attempting uninstall: mmaudio\n",
" Found existing installation: mmaudio 1.0.0\n",
" Uninstalling mmaudio-1.0.0:\n",
" Successfully uninstalled mmaudio-1.0.0\n",
"Successfully installed mmaudio-1.0.0\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Load some data"
],
"metadata": {
"id": "NmJDTxMGhQdi"
}
},
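{
"cell_type": "code",
"source": [
"%cd /content/MMAudio\n",
"!curl https://i.imgur.com/8xHJTzI.mp4 -o video.mp4\n",
"\n",
"from base64 import b64encode\n",
"from IPython.display import HTML\n",
"\n",
"# Embed the downloaded clip inline as a base64 data URL\n",
"with open('video.mp4', 'rb') as f:\n",
"    data_url = \"data:video/mp4;base64,\" + b64encode(f.read()).decode()\n",
"HTML(\"\"\"\n",
"<video width=400 controls>\n",
"  <source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"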
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 317
},
"id": "3SYoUcUghlal",
"outputId": "bd8df759-60e1-40f9-f1bd-c51ad8aee2c4"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content/MMAudio\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 832k 100 832k 0 0 4246k 0 --:--:-- --:--:-- --:--:-- 4248k\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
],
"text/html": [
"\n",
"\n"
]
},
"metadata": {},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"source": [
"## Run the model (models will be downloaded automatically)"
],
"metadata": {
"id": "v_BM2f7niHm1"
}
},
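{
"cell_type": "code",
"source": [
"!python demo.py --duration=10 --video=video.mp4 --prompt \"waves and seagulls\"\n",
"\n",
"from base64 import b64encode\n",
"from IPython.display import HTML\n",
"\n",
"# Embed the generated video (with the synthesized audio track) inline\n",
"with open('./output/video.mp4', 'rb') as f:\n",
"    data_url = \"data:video/mp4;base64,\" + b64encode(f.read()).decode()\n",
"HTML(\"\"\"\n",
"<video width=400 controls>\n",
"  <source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"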
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 710
},
"id": "xnhmnCkJiiAU",
"outputId": "22a6c5a7-4687-40e8-8f23-37ad3e013a24"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" \u001b[32mINFO \u001b[0m | \u001b[32mDownloading mmaudio_large_44k_v2.pth to weights/mmaudio_large_44k_v2.pth...\u001b[0m\n",
"100% 4.12G/4.12G [03:35<00:00, 19.1MiB/s]\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mDownloading v1-44.pth to ext_weights/v1-44.pth...\u001b[0m\n",
"100% 1.22G/1.22G [00:27<00:00, 43.7MiB/s]\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mDownloading synchformer_state_dict.pth to ext_weights/synchformer_state_dict.pth...\u001b[0m\n",
"100% 950M/950M [00:12<00:00, 74.4MiB/s]\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mLoaded weights from weights/mmaudio_large_44k_v2.pth\u001b[0m\n",
"open_clip_pytorch_model.bin: 100% 3.95G/3.95G [01:33<00:00, 42.0MB/s]\n",
"open_clip_config.json: 100% 735/735 [00:00<00:00, 6.15MB/s]\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mLoaded hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 model config.\u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mLoading pretrained hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 weights (/root/.cache/huggingface/hub/models--apple--DFN5B-CLIP-ViT-H-14-384/snapshots/f17177bb05c69c6336d0adbfc97e06d69f876904/open_clip_pytorch_model.bin).\u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mLoading MotionFormer config from /content/MMAudio/mmaudio/ext/synchformer/divided_224_16x4.yaml\u001b[0m\n",
"config.json: 100% 1.40k/1.40k [00:00<00:00, 11.0MB/s]\n",
"Loading weights from nvidia/bigvgan_v2_44khz_128band_512x\n",
"bigvgan_generator.pt: 100% 489M/489M [00:11<00:00, 42.2MB/s]\n",
"Removing weight norm...\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mUsing video video.mp4\u001b[0m\n",
" \u001b[33mWARNING \u001b[0m | \u001b[33mClip video is too short: 5.00 < 10.00\u001b[0m\n",
" \u001b[33mWARNING \u001b[0m | \u001b[33mTruncating to 5.00 sec\u001b[0m\n",
" \u001b[33mWARNING \u001b[0m | \u001b[33mSync video is too short: 4.96 < 5.00\u001b[0m\n",
" \u001b[33mWARNING \u001b[0m | \u001b[33mTruncating to 4.96 sec\u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mPrompt: waves and seagulls\u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mNegative prompt: \u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mAudio saved to output/video.flac\u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mVideo saved to output/output/video.mp4\u001b[0m\n",
" \u001b[32mINFO \u001b[0m | \u001b[32mMemory usage: 5.84 GB\u001b[0m\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
],
"text/html": [
"\n",
"\n"
]
},
"metadata": {},
"execution_count": 6
}
]
}
]
}