{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "# Use V-JEPA 2" ], "metadata": { "id": "02ruu54h4yLc" } }, { "cell_type": "markdown", "source": [ "V-JEPA 2 is a new open 1.2B video embedding model by Meta that aims to learn a model of the physical world from video ⏯️\n", "\n", "The model can be used for a range of video tasks: it can be fine-tuned for downstream tasks such as video classification, or used for any task involving embeddings (similarity, retrieval, and more!).\n", "\n", "You can find all V-JEPA 2 checkpoints and the datasets released with them [in this collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6)." ], "metadata": { "id": "ol0IGYCd4hg4" } }, { "cell_type": "markdown", "source": [ "We need to install the release-specific branch of transformers." ], "metadata": { "id": "kIIBxYOA41Ga" } }, { "cell_type": "code", "source": [ "!pip install -q git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview" ], "metadata": { "id": "4D4D1hC940yX" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from huggingface_hub import login  # log in so the model can be pushed to the Hub later\n", "\n", "login()" ], "metadata": { "id": "Ne2rU68Ep1On" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "As of now, Colab supports torchcodec==0.2.1, which is compatible with torch==2.6.0."
], "metadata": { "id": "dJWXmFu53Ap6" } }, { "cell_type": "code", "source": [ "!pip install -q torch==2.6.0 torchvision==0.21.0\n", "!pip install -q torchcodec==0.2.1\n", "\n", "import torch\n", "print(\"Torch:\", torch.__version__)\n", "from torchcodec.decoders import VideoDecoder  # verify that torchcodec imports correctly" ], "metadata": { "id": "JIoq84ze2_Ls" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Initialize the model and the processor" ], "metadata": { "id": "-7OATf5S20U_" } }, { "cell_type": "code", "source": [ "from transformers import AutoVideoProcessor, AutoModel\n", "\n", "hf_repo = \"facebook/vjepa2-vitl-fpc64-256\"\n", "\n", "model = AutoModel.from_pretrained(hf_repo).to(\"cuda\")\n", "processor = AutoVideoProcessor.from_pretrained(hf_repo)" ], "metadata": { "id": "K8oSsy7Y2zQK" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Extract video embeddings from the model" ], "metadata": { "id": "ZJ_DUR9f22Uc" } }, { "cell_type": "code", "source": [ "import torch\n", "from torchcodec.decoders import VideoDecoder\n", "import numpy as np\n", "\n", "video_url = \"https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4\"\n", "vr = VideoDecoder(video_url)\n", "frame_idx = np.arange(0, 64)  # take the first 64 frames; a more complex sampling strategy can be defined here\n", "video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W\n", "video = processor(video, return_tensors=\"pt\").to(model.device)\n", "with torch.no_grad():  # inference only, no gradients needed\n", "    video_embeddings = model.get_vision_features(**video)\n", "\n", "print(video_embeddings.shape)" ], "metadata": { "id": "kAgWZJHt24px" }, "execution_count": null, "outputs": [] } ] }