Fine-Tuned V-JEPA 2 on a UCF101 Subset

V-JEPA 2 is a frontier video understanding model developed by FAIR at Meta. It extends the pretraining objectives of V-JEPA and, by scaling both data and model size, achieves state-of-the-art video understanding capabilities. The code is released in the official V-JEPA 2 repository.

The base model we used is vjepa2-vitl-fpc16-256-ssv2, a V-JEPA 2 model pretrained on the Something-Something-V2 dataset. We further fine-tuned it on a subset of UCF101 ("UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild"). The subset contains 400 short videos in total, spread across 10 action categories.
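
For reference, a fine-tuning run like this one can be set up roughly as follows. This is a minimal sketch, assuming transformers is installed (see Installation below), that the base checkpoint lives at facebook/vjepa2-vitl-fpc16-256-ssv2, and that the UCF101 subset is available as an iterable train_clips of (frames, label) pairs; the exact hyperparameters used for this checkpoint are not reproduced here.

import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

base_repo = "facebook/vjepa2-vitl-fpc16-256-ssv2"  # base checkpoint (assumed repo id)

# Swap the SSV2 classification head for a fresh 10-way head matching the UCF101 subset
model = AutoModelForVideoClassification.from_pretrained(
    base_repo,
    num_labels=10,                 # 10 categories in the subset
    ignore_mismatched_sizes=True,  # the SSV2 head has a different number of classes
)
processor = AutoVideoProcessor.from_pretrained(base_repo)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # hypothetical learning rate
model.train()
for frames, label in train_clips:  # train_clips: hypothetical iterable of (frames tensor, int label)
    inputs = processor(frames, return_tensors="pt")
    outputs = model(**inputs, labels=torch.tensor([label]))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()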

Installation

To run this V-JEPA 2 model, make sure you have the latest version of transformers installed:

pip install -U git+https://github.com/huggingface/transformers
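
A quick way to confirm the installed build exposes the classes used below is to import them directly (a minimal sanity check, nothing model-specific):

import transformers
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

print(transformers.__version__)  # should report the development version installed from source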

Video classification code snippet

import torch
import numpy as np

from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and video preprocessor
# Note: this example uses the original SSV2 checkpoint; a variant using the
# fine-tuned UCF101 checkpoint from this card is shown after the output below.
hf_repo = "facebook/vjepa2-vitg-fpc64-384-ssv2"

model = AutoModelForVideoClassification.from_pretrained(hf_repo).to(device)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# To load a video, sample the number of frames according to the model.
# For this model, we use 64.
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/bowling/-WH-lxmGJVY_000005_000015.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, model.config.frames_per_clip * 2, 2) # 64 frames sampled at a stride of 2; you can define a more complex sampling strategy
video = vr.get_frames_at(indices=frame_idx).data  # frames x channels x height x width

# Preprocess and run inference
inputs = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

print("Top 5 predicted class names:")
top5_indices = logits.topk(5).indices[0]
top5_probs = torch.softmax(logits, dim=-1).topk(5).values[0]
for idx, prob in zip(top5_indices, top5_probs):
    text_label = model.config.id2label[idx.item()]
    print(f" - {text_label}: {prob:.2f}")

Output:

Top 5 predicted class names:
 - Putting [something] onto [something]: 0.39
 - Putting [something similar to other things that are already on the table]: 0.23
 - Stacking [number of] [something]: 0.07
 - Putting [something] into [something]: 0.04
 - Putting [number of] [something] onto [something]: 0.03
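
To run the fine-tuned checkpoint from this card instead, point the same pipeline at this repository. A minimal sketch, assuming a local UCF101-style clip at a hypothetical path; the predictions will be over the 10 UCF101 categories rather than the SSV2 classes shown above:

import torch
import numpy as np

from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

hf_repo = "ariG23498/vjepa2-vitl-fpc16-256-ssv2-uvf101"  # this card's fine-tuned checkpoint
model = AutoModelForVideoClassification.from_pretrained(hf_repo).to(device)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# This checkpoint uses 16 frames per clip (fpc16), so sample 16 frames
vr = VideoDecoder("path/to/your_ucf101_clip.mp4")  # hypothetical local video path
frame_idx = np.arange(0, model.config.frames_per_clip * 2, 2)  # 16 frames at a stride of 2
video = vr.get_frames_at(indices=frame_idx).data

inputs = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[pred])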

Citation

@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
  Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
  Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
  Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
  Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
  Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}