{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)\n", "\n", "Paper (Soon) | Webpage | Models | Demo\n", "\n", "[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)\n", "\n", "University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation" ], "metadata": { "id": "o_cHOIk6fkrC" } }, { "cell_type": "markdown", "source": [ "MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames." 
], "metadata": { "id": "MWrQTB4qf7kb" } }, { "cell_type": "markdown", "source": [ "## Make sure we are using GPU\n", "\n", "If not, Runtime -> Change runtime type -> T4\n", "\n" ], "metadata": { "id": "kmoaRe0ff_Jn" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f3P_MH7IWMlX", "outputId": "d3a8337e-97ee-4e64-ff5e-1a0e71d76127" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Tue Dec 10 21:37:48 2024 \n", "+---------------------------------------------------------------------------------------+\n", "| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |\n", "|-----------------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|=========================================+======================+======================|\n", "| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n", "| N/A 47C P8 11W / 70W | 0MiB / 15360MiB | 0% Default |\n", "| | | N/A |\n", "+-----------------------------------------+----------------------+----------------------+\n", " \n", "+---------------------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=======================================================================================|\n", "| No running processes found |\n", "+---------------------------------------------------------------------------------------+\n", "Using GPU\n" ] } ], "source": [ "!nvidia-smi\n", "\n", "import torch\n", "\n", "if torch.cuda.is_available():\n", " print('Using GPU')\n", " device = 'cuda'\n", "else:\n", " print('CUDA not available. 
Please connect to a GPU instance if possible.')\n", " device = 'cpu'" ] }, { "cell_type": "markdown", "source": [ "## Install dependencies" ], "metadata": { "id": "nq1Ytxi_gJnh" } }, { "cell_type": "code", "source": [ "!pip install torch torchvision torchaudio\n", "!git clone https://github.com/hkchengrex/MMAudio.git\n", "%cd /content/MMAudio\n", "!pip install -e ." ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tTiNYTWSgJKB", "outputId": "c08726cb-485b-4eb6-aa26-6414263d5cfa" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n", "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.20.1+cu121)\n", "Requirement already satisfied: torchaudio in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)\n", "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)\n", "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.10.0)\n", "Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)\n", "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.26.4)\n", "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (11.0.0)\n", "Requirement already 
satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.5)\n", "fatal: destination path 'MMAudio' already exists and is not an empty directory.\n", "/content/MMAudio\n", "Obtaining file:///content/MMAudio\n", " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", " Checking if build backend supports build_editable ... \u001b[?25l\u001b[?25hdone\n", " Getting requirements to build editable ... \u001b[?25l\u001b[?25hdone\n", " Installing backend dependencies ... \u001b[?25l\u001b[?25hdone\n", " Preparing editable metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", "Requirement already satisfied: auraloss in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.4.0)\n", "Requirement already satisfied: colorlog in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (6.9.0)\n", "Requirement already satisfied: cython in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (3.0.11)\n", "Requirement already satisfied: einops>=0.6 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.8.0)\n", "Requirement already satisfied: gitpython>=3.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (3.1.43)\n", "Requirement already satisfied: gradio>=3.34 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (5.8.0)\n", "Requirement already satisfied: hydra-colorlog in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.2.0)\n", "Requirement already satisfied: hydra-core>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.3.2)\n", "Requirement already satisfied: librosa>=0.8.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.10.2.post1)\n", "Requirement already satisfied: nitrous-ema in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.0.1)\n", "Requirement already satisfied: numpy<2.1,>=1.21 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.26.4)\n", "Requirement 
already satisfied: open-clip-torch in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.29.0)\n", "Requirement already satisfied: opencv-python>=4.8 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (4.10.0.84)\n", "Requirement already satisfied: pillow>=9.5 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (11.0.0)\n", "Requirement already satisfied: python-dotenv in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.0.1)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.32.3)\n", "Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.4.5)\n", "Requirement already satisfied: scipy>=1.7 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (1.13.1)\n", "Requirement already satisfied: soundfile in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.12.1)\n", "Requirement already satisfied: tensorboard>=2.11 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.17.1)\n", "Requirement already satisfied: tensordict in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.6.2)\n", "Requirement already satisfied: torch>=2.5.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (2.5.1+cu121)\n", "Requirement already satisfied: torchdiffeq in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (0.2.5)\n", "Requirement already satisfied: tqdm>=4.66.1 in /usr/local/lib/python3.10/dist-packages (from mmaudio==1.0.0) (4.66.6)\n", "Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from gitpython>=3.1->mmaudio==1.0.0) (4.0.11)\n", "Requirement already satisfied: aiofiles<24.0,>=22.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (23.2.1)\n", "Requirement already satisfied: anyio<5.0,>=3.0 in /usr/local/lib/python3.10/dist-packages (from 
gradio>=3.34->mmaudio==1.0.0) (3.7.1)\n", "Requirement already satisfied: fastapi<1.0,>=0.115.2 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.115.6)\n", "Requirement already satisfied: ffmpy in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.4.0)\n", "Requirement already satisfied: gradio-client==1.5.1 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (1.5.1)\n", "Requirement already satisfied: httpx>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.28.0)\n", "Requirement already satisfied: huggingface-hub>=0.25.1 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.26.3)\n", "Requirement already satisfied: jinja2<4.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (3.1.4)\n", "Requirement already satisfied: markupsafe~=2.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.1.5)\n", "Requirement already satisfied: orjson~=3.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (3.10.12)\n", "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (24.2)\n", "Requirement already satisfied: pandas<3.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.2.2)\n", "Requirement already satisfied: pydantic>=2.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.10.3)\n", "Requirement already satisfied: pydub in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.25.1)\n", "Requirement already satisfied: python-multipart>=0.0.18 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.0.19)\n", "Requirement already satisfied: pyyaml<7.0,>=5.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (6.0.2)\n", "Requirement 
already satisfied: ruff>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.8.2)\n", "Requirement already satisfied: safehttpx<0.2.0,>=0.1.6 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.1.6)\n", "Requirement already satisfied: semantic-version~=2.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (2.10.0)\n", "Requirement already satisfied: starlette<1.0,>=0.40.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.41.3)\n", "Requirement already satisfied: tomlkit<0.14.0,>=0.12.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.13.2)\n", "Requirement already satisfied: typer<1.0,>=0.12 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.15.0)\n", "Requirement already satisfied: typing-extensions~=4.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (4.12.2)\n", "Requirement already satisfied: uvicorn>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from gradio>=3.34->mmaudio==1.0.0) (0.32.1)\n", "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from gradio-client==1.5.1->gradio>=3.34->mmaudio==1.0.0) (2024.10.0)\n", "Requirement already satisfied: websockets<15.0,>=10.0 in /usr/local/lib/python3.10/dist-packages (from gradio-client==1.5.1->gradio>=3.34->mmaudio==1.0.0) (14.1)\n", "Requirement already satisfied: omegaconf<2.4,>=2.2 in /usr/local/lib/python3.10/dist-packages (from hydra-core>=1.3.2->mmaudio==1.0.0) (2.3.0)\n", "Requirement already satisfied: antlr4-python3-runtime==4.9.* in /usr/local/lib/python3.10/dist-packages (from hydra-core>=1.3.2->mmaudio==1.0.0) (4.9.3)\n", "Requirement already satisfied: audioread>=2.1.9 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (3.0.1)\n", "Requirement already satisfied: scikit-learn>=0.20.0 in 
/usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.5.2)\n", "Requirement already satisfied: joblib>=0.14 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.4.2)\n", "Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (4.4.2)\n", "Requirement already satisfied: numba>=0.51.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (0.60.0)\n", "Requirement already satisfied: pooch>=1.1 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.8.2)\n", "Requirement already satisfied: soxr>=0.3.2 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (0.5.0.post1)\n", "Requirement already satisfied: lazy-loader>=0.1 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (0.4)\n", "Requirement already satisfied: msgpack>=1.0 in /usr/local/lib/python3.10/dist-packages (from librosa>=0.8.1->mmaudio==1.0.0) (1.1.0)\n", "Requirement already satisfied: cffi>=1.0 in /usr/local/lib/python3.10/dist-packages (from soundfile->mmaudio==1.0.0) (1.17.1)\n", "Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (1.4.0)\n", "Requirement already satisfied: grpcio>=1.48.2 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (1.68.1)\n", "Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (3.7)\n", "Requirement already satisfied: protobuf!=4.24.0,>=3.19.6 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (4.25.5)\n", "Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (75.1.0)\n", "Requirement already satisfied: six>1.9 in 
/usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (1.16.0)\n", "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (0.7.2)\n", "Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard>=2.11->mmaudio==1.0.0) (3.1.3)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=2.5.1->mmaudio==1.0.0) (3.16.1)\n", "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=2.5.1->mmaudio==1.0.0) (3.4.2)\n", "Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch>=2.5.1->mmaudio==1.0.0) (1.13.1)\n", "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch>=2.5.1->mmaudio==1.0.0) (1.3.0)\n", "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (0.20.1+cu121)\n", "Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (2024.9.11)\n", "Requirement already satisfied: ftfy in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (6.3.1)\n", "Requirement already satisfied: timm in /usr/local/lib/python3.10/dist-packages (from open-clip-torch->mmaudio==1.0.0) (1.0.12)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (3.4.0)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (3.10)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (2.2.3)\n", "Requirement already satisfied: certifi>=2017.4.17 in 
/usr/local/lib/python3.10/dist-packages (from requests->mmaudio==1.0.0) (2024.8.30)\n", "Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from tensordict->mmaudio==1.0.0) (3.1.0)\n", "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5.0,>=3.0->gradio>=3.34->mmaudio==1.0.0) (1.3.1)\n", "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5.0,>=3.0->gradio>=3.34->mmaudio==1.0.0) (1.2.2)\n", "Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.0->soundfile->mmaudio==1.0.0) (2.22)\n", "Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from gitdb<5,>=4.0.1->gitpython>=3.1->mmaudio==1.0.0) (5.0.1)\n", "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.24.1->gradio>=3.34->mmaudio==1.0.0) (1.0.7)\n", "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.24.1->gradio>=3.34->mmaudio==1.0.0) (0.14.0)\n", "Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.0->librosa>=0.8.1->mmaudio==1.0.0) (0.43.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas<3.0,>=1.0->gradio>=3.34->mmaudio==1.0.0) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas<3.0,>=1.0->gradio>=3.34->mmaudio==1.0.0) (2024.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas<3.0,>=1.0->gradio>=3.34->mmaudio==1.0.0) (2024.2)\n", "Requirement already satisfied: platformdirs>=2.5.0 in /usr/local/lib/python3.10/dist-packages (from pooch>=1.1->librosa>=0.8.1->mmaudio==1.0.0) (4.3.6)\n", "Requirement already satisfied: 
annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->gradio>=3.34->mmaudio==1.0.0) (0.7.0)\n", "Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->gradio>=3.34->mmaudio==1.0.0) (2.27.1)\n", "Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->librosa>=0.8.1->mmaudio==1.0.0) (3.5.0)\n", "Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (8.1.7)\n", "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (1.5.4)\n", "Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (13.9.4)\n", "Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from ftfy->open-clip-torch->mmaudio==1.0.0) (0.2.13)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (3.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (2.18.0)\n", "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0,>=0.12->gradio>=3.34->mmaudio==1.0.0) (0.1.2)\n", "Building wheels for collected packages: mmaudio\n", " Building editable for mmaudio (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n", " Created wheel for mmaudio: filename=mmaudio-1.0.0-py3-none-any.whl size=4529 sha256=18ccab9f9e09fe24644a2d7c9c769e099a220964790e659fbd07d1b90f569884\n", " Stored in directory: /tmp/pip-ephem-wheel-cache-1nou9kfy/wheels/5f/51/e8/a8b3bd781dd4b9595fbcf3d3841b547397a4bff996716d464c\n", "Successfully built mmaudio\n", "Installing collected packages: mmaudio\n", " Attempting uninstall: mmaudio\n", " Found existing installation: mmaudio 1.0.0\n", " Uninstalling mmaudio-1.0.0:\n", " Successfully uninstalled mmaudio-1.0.0\n", "Successfully installed mmaudio-1.0.0\n" ] } ] }, { "cell_type": "markdown", "source": [ "## Load some data" ], "metadata": { "id": "NmJDTxMGhQdi" } }, { "cell_type": "code", "source": [ "%cd /content/MMAudio\n", "!curl https://i.imgur.com/8xHJTzI.mp4 -o video.mp4\n", "\n", "from IPython.display import HTML\n", "from base64 import b64encode\n", "data_url = \"data:video/mp4;base64,\" + b64encode(open('video.mp4', 'rb').read()).decode()\n", "HTML(\"\"\"\n", "<video width=400 controls>\n", " <source src=\"%s\" type=\"video/mp4\">\n", "</video>\n", "\"\"\" % data_url)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 317 }, "id": "3SYoUcUghlal", "outputId": "bd8df759-60e1-40f9-f1bd-c51ad8aee2c4" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "/content/MMAudio\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 832k 100 832k 0 0 4246k 0 --:--:-- --:--:-- --:--:-- 4248k\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "## Run the model (models will be downloaded automatically)" ], "metadata": { "id": "v_BM2f7niHm1" } }, { "cell_type": "code", "source": [ "!python demo.py --duration=10 --video=video.mp4 --prompt \"waves and seagulls\"\n", "\n", "\n", "from IPython.display import HTML\n", "from base64 import b64encode\n", 
"data_url = \"data:video/mp4;base64,\" + b64encode(open('./output/video.mp4', 'rb').read()).decode()\n", "HTML(\"\"\"\n", "<video width=400 controls>\n", " <source src=\"%s\" type=\"video/mp4\">\n", "</video>\n", "\"\"\" % data_url)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 710 }, "id": "xnhmnCkJiiAU", "outputId": "22a6c5a7-4687-40e8-8f23-37ad3e013a24" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " \u001b[32mINFO \u001b[0m | \u001b[32mDownloading mmaudio_large_44k_v2.pth to weights/mmaudio_large_44k_v2.pth...\u001b[0m\n", "100% 4.12G/4.12G [03:35<00:00, 19.1MiB/s]\n", " \u001b[32mINFO \u001b[0m | \u001b[32mDownloading v1-44.pth to ext_weights/v1-44.pth...\u001b[0m\n", "100% 1.22G/1.22G [00:27<00:00, 43.7MiB/s]\n", " \u001b[32mINFO \u001b[0m | \u001b[32mDownloading synchformer_state_dict.pth to ext_weights/synchformer_state_dict.pth...\u001b[0m\n", "100% 950M/950M [00:12<00:00, 74.4MiB/s]\n", " \u001b[32mINFO \u001b[0m | \u001b[32mLoaded weights from weights/mmaudio_large_44k_v2.pth\u001b[0m\n", "open_clip_pytorch_model.bin: 100% 3.95G/3.95G [01:33<00:00, 42.0MB/s]\n", "open_clip_config.json: 100% 735/735 [00:00<00:00, 6.15MB/s]\n", " \u001b[32mINFO \u001b[0m | \u001b[32mLoaded hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 model config.\u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mLoading pretrained hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 weights (/root/.cache/huggingface/hub/models--apple--DFN5B-CLIP-ViT-H-14-384/snapshots/f17177bb05c69c6336d0adbfc97e06d69f876904/open_clip_pytorch_model.bin).\u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mLoading MotionFormer config from /content/MMAudio/mmaudio/ext/synchformer/divided_224_16x4.yaml\u001b[0m\n", "config.json: 100% 1.40k/1.40k [00:00<00:00, 11.0MB/s]\n", "Loading weights from nvidia/bigvgan_v2_44khz_128band_512x\n", "bigvgan_generator.pt: 100% 489M/489M [00:11<00:00, 42.2MB/s]\n", "Removing weight norm...\n", " \u001b[32mINFO \u001b[0m | \u001b[32mUsing video video.mp4\u001b[0m\n", " \u001b[33mWARNING \u001b[0m | 
\u001b[33mClip video is too short: 5.00 < 10.00\u001b[0m\n", " \u001b[33mWARNING \u001b[0m | \u001b[33mTruncating to 5.00 sec\u001b[0m\n", " \u001b[33mWARNING \u001b[0m | \u001b[33mSync video is too short: 4.96 < 5.00\u001b[0m\n", " \u001b[33mWARNING \u001b[0m | \u001b[33mTruncating to 4.96 sec\u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mPrompt: waves and seagulls\u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mNegative prompt: \u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mAudio saved to output/video.flac\u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mVideo saved to output/output/video.mp4\u001b[0m\n", " \u001b[32mINFO \u001b[0m | \u001b[32mMemory usage: 5.84 GB\u001b[0m\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n" ] }, "metadata": {}, "execution_count": 6 } ] } ] }