VideoMAE (large) variant finetuned for multi-label video classification (a video can belong to multiple classes simultaneously) of camera motion on an internal dataset.

The model predicts 18 camera motion classes: 'arc_left', 'arc_right', 'dolly_in', 'dolly_out', 'pan_left', 'pan_right', 'pedestal_down', 'pedestal_up', 'roll_left', 'roll_right', 'static', 'tilt_down', 'tilt_up', 'truck_left', 'truck_right', 'undefined', 'zoom_in', 'zoom_out', and 3 shot type classes: 'pov', 'shake', 'track'.
The model was trained to associate the entire video with camera labels, not frame-level motions: [input video] -> one or more labels (because the task is multi-label) for the whole video. So the model should predict a given camera motion only if that motion persists across all video frames; otherwise it should predict 'undefined'.
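The whole-video multi-label semantics above can be sketched as a simple decoding step: apply a sigmoid to the per-class logits, keep every class above a threshold, and fall back to 'undefined' when nothing clears it. The label order and the 0.5 threshold here are illustrative assumptions, not taken from the released model config.

```python
import numpy as np

# Label list from the model card; the index order here is an assumption.
MOTION_LABELS = [
    "arc_left", "arc_right", "dolly_in", "dolly_out", "pan_left", "pan_right",
    "pedestal_down", "pedestal_up", "roll_left", "roll_right", "static",
    "tilt_down", "tilt_up", "truck_left", "truck_right", "undefined",
    "zoom_in", "zoom_out",
]

def predict_labels(logits, threshold=0.5):
    """Turn raw multi-label logits into label names via sigmoid + threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    picked = [MOTION_LABELS[i] for i, p in enumerate(probs) if p >= threshold]
    # If no motion clears the threshold, fall back to 'undefined'.
    return picked or ["undefined"]
```

In a real pipeline the logits would come from the classification head of the finetuned VideoMAE model; only the thresholding logic is shown here.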
The model is configured to process config.num_frames=16
frames per input clip. These frames are extracted uniformly from the input video, regardless of its original duration. For videos longer than 2 seconds, processing the entire video as a single clip may miss temporal nuances (e.g., varying camera motions). So, the recommended workflow for such videos is as follows:
(a) split the video into non-overlapping 2-second segments (or sliding windows with optional overlap).
(b) run inference independently on each segment.
(c) post-process results.
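Steps (a)–(c) can be sketched with plain index arithmetic: split the frame range into 2-second windows, sample 16 uniformly spaced frame indices per window (mirroring `config.num_frames=16`), and merge the per-segment predictions. The union-based post-processing in `aggregate` is one reasonable choice, not the card's prescribed method.

```python
import numpy as np

def split_into_segments(total_frames, fps, segment_seconds=2.0):
    """Step (a): non-overlapping 2-second windows as (start, end) frame ranges."""
    step = int(round(fps * segment_seconds))
    return [(start, min(start + step, total_frames))
            for start in range(0, total_frames, step)]

def uniform_frame_indices(num_frames_in_segment, num_samples=16):
    """Step (b) helper: 16 uniformly spaced frame indices within a segment."""
    return np.linspace(0, num_frames_in_segment - 1, num_samples).round().astype(int)

def aggregate(segment_labels):
    """Step (c): merge per-segment label lists; here, a simple sorted union."""
    return sorted({label for labels in segment_labels for label in labels})
```

Sliding windows with overlap would only change `split_into_segments` (stride smaller than `step`); the sampling and aggregation stay the same.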
Model accuracy on an internal test dataset of 2-second videos is 75%; ignoring the 'pov', 'shake', 'track' classes, it is ~84%.
An inference example can be found here.
Model: ai-forever/kandinsky-videomae-large-camera-motion
Base model: MCG-NJU/videomae-large