{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Use V-JEPA 2"
      ],
      "metadata": {
        "id": "02ruu54h4yLc"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "V-JEPA 2 is an open 1.2B-parameter video embedding model from Meta that aims to learn a model of the physical world from video ⏯️\n",
        "\n",
        "The model can be fine-tuned for downstream video tasks such as video classification, or used for any task that relies on embeddings (similarity, retrieval and more!).\n",
        "\n",
        "You can check all V-JEPA 2 checkpoints and the datasets that come with this release [in this collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6)."
      ],
      "metadata": {
        "id": "ol0IGYCd4hg4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "We need to install the release-specific branch of transformers."
      ],
      "metadata": {
        "id": "kIIBxYOA41Ga"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q git+https://github.com/huggingface/[email protected]"
      ],
      "metadata": {
        "id": "4D4D1hC940yX"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from huggingface_hub import login # to later push the model\n",
        "\n",
        "login()"
      ],
      "metadata": {
        "id": "Ne2rU68Ep1On"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "At the time of writing, Colab supports torchcodec==0.2.1, which is compatible with torch==2.6.0."
      ],
      "metadata": {
        "id": "dJWXmFu53Ap6"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q torch==2.6.0 torchvision==0.21.0\n",
        "!pip install -q torchcodec==0.2.1\n",
        "\n",
        "import torch\n",
        "print(\"Torch:\", torch.__version__)\n",
        "from torchcodec.decoders import VideoDecoder # verify that torchcodec imports correctly"
      ],
      "metadata": {
        "id": "JIoq84ze2_Ls"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Initialize the model and the processor"
      ],
      "metadata": {
        "id": "-7OATf5S20U_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import AutoVideoProcessor, AutoModel\n",
        "\n",
        "hf_repo = \"facebook/vjepa2-vitl-fpc64-256\"\n",
        "\n",
        "model = AutoModel.from_pretrained(hf_repo).to(\"cuda\")\n",
        "processor = AutoVideoProcessor.from_pretrained(hf_repo)"
      ],
      "metadata": {
        "id": "K8oSsy7Y2zQK"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Extract video embeddings from the model"
      ],
      "metadata": {
        "id": "ZJ_DUR9f22Uc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "from torchcodec.decoders import VideoDecoder\n",
        "import numpy as np\n",
        "\n",
        "video_url = \"https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4\"\n",
        "vr = VideoDecoder(video_url)\n",
        "frame_idx = np.arange(0, 64) # take the first 64 frames; a more complex sampling strategy can be defined here\n",
        "video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W\n",
        "video = processor(video, return_tensors=\"pt\").to(model.device)\n",
        "with torch.no_grad():\n",
        "    video_embeddings = model.get_vision_features(**video)\n",
        "\n",
        "print(video_embeddings.shape)"
      ],
      "metadata": {
        "id": "kAgWZJHt24px"
      },
      "execution_count": null,
      "outputs": []
    }
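    ,
    {
      "cell_type": "markdown",
      "source": [
        "The extraction cell above takes the first 64 frames of the clip. As a minimal sketch of an alternative sampling strategy, the hypothetical helper `sample_frame_indices` below spreads the indices evenly over the whole clip. The helper name, the uniform-sampling choice and the placeholder frame count are assumptions for illustration; for a real video the frame count has to come from the decoder."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "# hypothetical helper (not part of torchcodec or transformers): spread indices evenly over the clip\n",
        "def sample_frame_indices(num_video_frames, num_samples=64):\n",
        "    # linspace includes both endpoints; rounding gives valid integer indices\n",
        "    # (indices repeat if the clip has fewer than num_samples frames)\n",
        "    return np.linspace(0, num_video_frames - 1, num=num_samples).round().astype(int)\n",
        "\n",
        "frame_idx = sample_frame_indices(300) # 300 is a placeholder frame count\n",
        "print(frame_idx[:5], frame_idx[-1])"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }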
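    ,
    {
      "cell_type": "markdown",
      "source": [
        "The embeddings can be compared across videos for similarity or retrieval. The sketch below mean-pools over the token dimension and computes the cosine similarity between two videos. It uses random tensors as stand-ins for two `video_embeddings` outputs; the tensor sizes and the mean-pooling step are assumptions for illustration, not a recipe from the model authors."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "import torch.nn.functional as F\n",
        "\n",
        "# random stand-ins for two model outputs of shape (batch, tokens, hidden); the sizes are placeholders\n",
        "emb_a = torch.randn(1, 16, 1024)\n",
        "emb_b = torch.randn(1, 16, 1024)\n",
        "\n",
        "# mean-pool over the token dimension to get one vector per video\n",
        "vec_a = emb_a.mean(dim=1)\n",
        "vec_b = emb_b.mean(dim=1)\n",
        "\n",
        "# cosine similarity lies in [-1, 1]; higher means more similar\n",
        "similarity = F.cosine_similarity(vec_a, vec_b, dim=-1)\n",
        "print(similarity)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }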
  ]
}