{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Use VJEPA 2"
],
"metadata": {
"id": "02ruu54h4yLc"
}
},
{
"cell_type": "markdown",
"source": [
"V-JEPA 2 is a new open 1.2B video embedding model by Meta, which attempts to capture the physical world modelling through video ⏯️\n",
"\n",
"The model can be used for various tasks for video: fine-tuning for downstream tasks like video classification, or any task involving embeddings (similarity, retrieval and more!).\n",
"\n",
"You can check all V-JEPA 2 checkpoints and the datasets that come with this release [in this collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6)."
],
"metadata": {
"id": "ol0IGYCd4hg4"
}
},
{
"cell_type": "markdown",
"source": [
"We need to install transformers' release specific branch."
],
"metadata": {
"id": "kIIBxYOA41Ga"
}
},
{
"cell_type": "code",
"source": [
"!pip install -q git+https://github.com/huggingface/[email protected]"
],
"metadata": {
"id": "4D4D1hC940yX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from huggingface_hub import login # to later push the model\n",
"\n",
"login()"
],
"metadata": {
"id": "Ne2rU68Ep1On"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"As of now, Colab supports torchcodec==0.2.1 which supports torch==2.6.0."
],
"metadata": {
"id": "dJWXmFu53Ap6"
}
},
{
"cell_type": "code",
"source": [
"!pip install -q torch==2.6.0 torchvision==0.21.0\n",
"!pip install -q torchcodec==0.2.1\n",
"\n",
"import torch\n",
"print(\"Torch:\", torch.__version__)\n",
"from torchcodec.decoders import VideoDecoder # verify"
],
"metadata": {
"id": "JIoq84ze2_Ls"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Initialize the model and the processor"
],
"metadata": {
"id": "-7OATf5S20U_"
}
},
{
"cell_type": "code",
"source": [
"from transformers import AutoVideoProcessor, AutoModel\n",
"\n",
"hf_repo = \"facebook/vjepa2-vitl-fpc64-256\"\n",
"\n",
"model = AutoModel.from_pretrained(hf_repo).to(\"cuda\")\n",
"processor = AutoVideoProcessor.from_pretrained(hf_repo)"
],
"metadata": {
"id": "K8oSsy7Y2zQK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Extract video embeddings from the model"
],
"metadata": {
"id": "ZJ_DUR9f22Uc"
}
},
{
"cell_type": "code",
"source": [
"import torch\n",
"from torchcodec.decoders import VideoDecoder\n",
"import numpy as np\n",
"\n",
"video_url = \"https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4\"\n",
"vr = VideoDecoder(video_url)\n",
"frame_idx = np.arange(0, 64) # choosing some frames. here, you can define more complex sampling strategy\n",
"video = vr.get_frames_at(indices=frame_idx).data # T x C x H x W\n",
"video = processor(video, return_tensors=\"pt\").to(model.device)\n",
"with torch.no_grad():\n",
" video_embeddings = model.get_vision_features(**video)\n",
"\n",
"print(video_embeddings.shape)"
],
"metadata": {
"id": "kAgWZJHt24px"
},
"execution_count": null,
"outputs": []
}
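,
{
"cell_type": "markdown",
"source": [
"The embeddings above can be used for the similarity and retrieval use cases mentioned at the start. Below is a minimal sketch (not the official recipe): it mean-pools the token-level features into a single clip-level vector and compares two clips with cosine similarity. The `embed_video` helper and the second video URL are placeholders for illustration, so replace the URL with a clip of your own."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import torch\n",
"import torch.nn.functional as F\n",
"import numpy as np\n",
"from torchcodec.decoders import VideoDecoder\n",
"\n",
"def embed_video(url, num_frames=64):\n",
"    # decode the first `num_frames` frames, preprocess and encode them,\n",
"    # then mean-pool the token features into one clip-level vector\n",
"    decoder = VideoDecoder(url)\n",
"    frames = decoder.get_frames_at(indices=np.arange(0, num_frames)).data  # T x C x H x W\n",
"    inputs = processor(frames, return_tensors=\"pt\").to(model.device)\n",
"    with torch.no_grad():\n",
"        features = model.get_vision_features(**inputs)  # assumed shape: (batch, num_tokens, hidden_dim)\n",
"    return features.mean(dim=1)  # (batch, hidden_dim)\n",
"\n",
"emb_a = embed_video(video_url)  # the archery clip from the previous cell\n",
"emb_b = embed_video(\"https://path/to/another/clip.mp4\")  # placeholder: point this at your own clip\n",
"print(F.cosine_similarity(emb_a, emb_b).item())"
],
"metadata": {},
"execution_count": null,
"outputs": []
}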
]
}