A VideoMAE (large) model fine-tuned for camera motion classification on an internal dataset. The task is multi-label video classification: a single video can belong to several classes simultaneously.

The model predicts 18 camera motion classes: 'arc_left', 'arc_right', 'dolly_in', 'dolly_out', 'pan_left', 'pan_right', 'pedestal_down', 'pedestal_up', 'roll_left', 'roll_right', 'static', 'tilt_down', 'tilt_up', 'truck_left', 'truck_right', 'undefined', 'zoom_in', 'zoom_out'; and 3 shot type classes: 'pov', 'shake', 'track'.

The model was trained to associate an entire video with camera labels, not to detect frame-level motions(!): [input video] -> one or more labels (because the task is multi-label) for the whole video. So, if a camera motion is present throughout all frames of the video, the model should predict that motion; otherwise it should predict 'undefined'.
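As a minimal sketch of how multi-label outputs are usually decoded, the snippet below applies an independent sigmoid to each class logit and keeps every class above a threshold. It assumes the model emits one logit per class, as is typical for multi-label heads. The label order in `LABELS`, the helper name `decode_multilabel`, and the 0.5 threshold are illustrative assumptions, not values taken from the checkpoint; for real use, read the mapping from the checkpoint's `config.id2label`.

```python
import math

# Class names as listed in this card. The index order is an ASSUMPTION;
# the authoritative mapping is the checkpoint's config.id2label.
LABELS = [
    "arc_left", "arc_right", "dolly_in", "dolly_out", "pan_left", "pan_right",
    "pedestal_down", "pedestal_up", "roll_left", "roll_right", "static",
    "tilt_down", "tilt_up", "truck_left", "truck_right", "undefined",
    "zoom_in", "zoom_out", "pov", "shake", "track",
]


def decode_multilabel(logits, threshold=0.5):
    """Return (label, probability) pairs for classes whose sigmoid score
    reaches the threshold. The threshold value is a hypothetical default."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per class")
    # Each class is scored independently, so several labels can fire at once.
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [(label, p) for label, p in zip(LABELS, probs) if p >= threshold]
```

Because the classes are scored independently, a clip can receive both a motion label and a shot type label (for example 'pan_left' together with 'shake'), or no label at all if nothing clears the threshold.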

The model is configured to process config.num_frames=16 frames per input clip. These frames are sampled uniformly from the input video, regardless of its original duration. For videos longer than 2 seconds, processing the entire video as a single clip may miss temporal nuances (e.g., camera motion that changes over time). The recommended workflow for such videos is as follows:

(a) split the video into non-overlapping 2-second segments (or sliding windows with optional overlap).

(b) run inference independently on each segment.

(c) post-process results.
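The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names (`segment_bounds`, `uniform_frame_indices`, `merge_segment_labels`) are hypothetical, the exact frame sampling used in training is not documented here, and the post-processing shown (merging touching segments that share a label into time spans) is only one possible choice. Running the model on each segment is left out.

```python
def segment_bounds(duration_s, window_s=2.0, stride_s=None):
    """Step (a): (start, end) times in seconds for each segment.
    stride_s=None gives non-overlapping windows; a smaller stride overlaps them."""
    stride_s = window_s if stride_s is None else stride_s
    if window_s <= 0 or stride_s <= 0:
        raise ValueError("window_s and stride_s must be positive")
    bounds, k = [], 0
    while k * stride_s < duration_s:
        start = k * stride_s
        # The last segment may be shorter than window_s.
        bounds.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break
        k += 1
    return bounds


def uniform_frame_indices(first, last, num_frames=16):
    """Evenly spaced frame indices in [first, last], inclusive.
    Indices repeat when the segment has fewer than num_frames frames."""
    if last < first or num_frames < 2:
        raise ValueError("need last >= first and num_frames >= 2")
    step = (last - first) / (num_frames - 1)
    return [first + round(i * step) for i in range(num_frames)]


def merge_segment_labels(bounds, labels_per_segment):
    """Step (c), one possible post-processing: merge touching or overlapping
    segments that share a label into [start, end] spans per label."""
    timeline = {}
    for (start, end), labels in zip(bounds, labels_per_segment):
        for label in labels:
            spans = timeline.setdefault(label, [])
            if spans and start <= spans[-1][1]:
                spans[-1][1] = max(spans[-1][1], end)
            else:
                spans.append([start, end])
    return timeline
```

For a 5-second video, `segment_bounds(5.0)` yields two full 2-second segments and a trailing 1-second segment; whether to keep, pad, or drop that short tail is a design choice this card does not specify. Step (b) would decode the frames chosen by `uniform_frame_indices` for each segment, run the model on them, and pass the predicted labels per segment to `merge_segment_labels`.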

Model accuracy on an internal test dataset of 2-second videos is 75%; when the 'pov', 'shake', and 'track' classes are ignored, it is about 84%.

Inference example can be found here
