---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# SkyCaptioner-V1: A Structural Video Captioning Model
📑 Technical Report · 👋 Playground · 💬 Discord · 🤗 Hugging Face · 🤖 ModelScope · 🌐 GitHub
---

Welcome to the SkyCaptioner-V1 repository! Here you'll find the weights and inference code for our structural video captioning model, which labels video data efficiently and comprehensively.

## 🔥🔥🔥 News!!

* Apr 21, 2025: 👋 We release the [vllm](https://github.com/vllm-project/vllm) batch inference code for the SkyCaptioner-V1 model, along with the caption fusion inference code.
* Apr 21, 2025: 👋 We release the first shot-aware video captioning model, [SkyCaptioner-V1](https://huggingface.co/Skywork/SkyCaptioner-V1). For more details, please check our [paper](https://arxiv.org/pdf/2504.13074).

## 📑 TODO List

- SkyCaptioner-V1
  - [x] Checkpoints
  - [x] Batch Inference Code
  - [x] Caption Fusion Method
  - [ ] Web Demo (Gradio)

## 🌟 Overview

SkyCaptioner-V1 is a structural video captioning model designed to generate high-quality, structural descriptions for video data. It integrates specialized sub-expert models and multimodal large language models (MLLMs) with human annotations to address the limitations of general captioners in capturing professional, film-related details. Key aspects include:

1. **Structural Representation**: Combines general video descriptions (from MLLMs) with sub-expert captioners (e.g., for shot types, shot angles, shot positions, and camera motions) and human annotations.
2. **Knowledge Distillation**: Distills the expertise of the sub-expert captioners into a single unified model.
3. **Application Flexibility**: Generates dense captions for text-to-video (T2V) and concise prompts for image-to-video (I2V) tasks.

## 🔑 Key Features

### Structural Captioning Framework

Our video captioning model captures multi-dimensional details:

* **Subjects**: Appearance, action, expression, position, and hierarchical categorization.
* **Shot Metadata**: Shot type (e.g., close-up, long shot), shot angle, shot position, camera motion, environment, lighting, etc.
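To make the structural representation concrete, the sketch below shows what one such caption could look like. The field names mirror the metrics in the benchmark table further down; the values and the exact nesting are illustrative assumptions, not the model's exact output schema.

```python
# Hypothetical structural caption for a single shot. Field names follow the
# benchmark metrics below; values and nesting are illustrative assumptions.
structural_caption = {
    "subjects": [
        {
            "TYPES": {"type": "Human", "sub_type": "Woman"},
            "appearance": "long dark hair, red wool coat",
            "action": "walks slowly toward the window",
            "expression": "slight smile, low intensity",
            "position": "left third of the frame",
            "is_main_subject": True,
        }
    ],
    "shot_type": "close-up",
    "shot_angle": "eye level",
    "shot_position": "over-the-shoulder",
    "camera_motion": "slow pan to the right",
    "environment": "dimly lit living room",
    "lighting": "warm practical light from a table lamp",
}
```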
### Sub-Expert Integration

* **Shot Captioner**: Classifies shot type, angle, and position with high precision.
* **Expression Captioner**: Analyzes facial expressions, emotion intensity, and temporal dynamics.
* **Camera Motion Captioner**: Tracks 6DoF camera movements and composite motion types.

### Training Pipeline

* Trained on \~2M high-quality, concept-balanced videos curated from 10M raw samples.
* Fine-tuned from Qwen2.5-VL-7B-Instruct with a global batch size of 512 across 32 A800 GPUs.
* Optimized using AdamW (learning rate: 1e-5) for 2 epochs.

### Dynamic Caption Fusion

* Adapts output length to the application (T2V/I2V).
* Employs an LLM to fuse the structural fields into a natural, fluent caption for downstream tasks, as sketched below.
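As a rough illustration of this fusion step, the sketch below assembles the structural fields into a prompt for an instruction-tuned LLM. The prompt wording and the `build_fusion_prompt` helper are assumptions for illustration only; the released caption fusion inference code is the authoritative version.

```python
import json

def build_fusion_prompt(structural_caption: dict, task: str = "t2v") -> str:
    """Illustrative sketch: build an LLM prompt that fuses a structural
    caption into natural prose. Task "t2v" asks for a dense caption,
    "i2v" for a concise, motion-focused prompt."""
    style = (
        "a dense, detailed caption"
        if task == "t2v"
        else "a short, concise prompt focused on subject and camera motion"
    )
    return (
        f"Rewrite the following structural video annotation as {style} "
        "in natural, fluent English. Preserve all shot and camera "
        "terminology.\n\n"
        f"{json.dumps(structural_caption, indent=2)}"
    )

# Reusing `structural_caption` from the earlier sketch; feed the resulting
# prompt to any instruction-tuned LLM to obtain the final caption.
prompt = build_fusion_prompt(structural_caption, task="i2v")
```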
## 📊 Benchmark Results

SkyCaptioner-V1 demonstrates significant improvements over existing models on key film-specific captioning tasks, particularly in **shot-language understanding** and **domain-specific precision**. The gains stem from its structural architecture and expert-guided training:

1. **Superior shot-language understanding**:
   * SkyCaptioner-V1 outperforms Qwen2.5-VL-72B by +11.2% in shot type, +16.1% in shot angle, and +50.4% in shot position accuracy, because its specialized shot classifiers surpass generalist MLLMs that lack film-domain fine-tuning.
   * It achieves +28.5% accuracy in camera motion vs. Tarsier2-recap-7B (88.8% vs. 41.5%): its 6DoF motion analysis and active-learning pipeline resolve ambiguities in composite motions (e.g., tracking + panning) that challenge generic captioners.
2. **High domain-specific precision**:
   * Expression accuracy reaches 68.8% vs. 54.3% for Tarsier2-recap-7B, leveraging a temporal-aware S2D framework to capture dynamic facial changes.

| Metric | Qwen2.5-VL-7B-Ins. | Qwen2.5-VL-72B-Ins. | Tarsier2-recap-7B | SkyCaptioner-V1 |
|---|---|---|---|---|
| Avg accuracy | 51.4% | 58.7% | 49.4% | 76.3% |
| shot type | 76.8% | 82.5% | 60.2% | 93.7% |
| shot angle | 60.0% | 73.7% | 52.4% | 89.8% |
| shot position | 28.4% | 32.7% | 23.6% | 83.1% |
| camera motion | 62.0% | 61.2% | 45.3% | 85.3% |
| expression | 43.6% | 51.5% | 54.3% | 68.8% |
| TYPES_type | 43.5% | 49.7% | 47.6% | 82.5% |
| TYPES_sub_type | 38.9% | 44.9% | 45.9% | 75.4% |
| appearance | 40.9% | 52.0% | 45.6% | 59.3% |
| action | 32.4% | 52.0% | 69.8% | 68.8% |
| position | 35.4% | 48.6% | 45.5% | 57.5% |
| is_main_subject | 58.5% | 68.7% | 69.7% | 80.9% |
| environment | 70.4% | 72.7% | 61.4% | 70.5% |
| lighting | 77.1% | 80.0% | 21.2% | 76.5% |
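For reference, here is a minimal sketch of batch captioning with [vllm](https://github.com/vllm-project/vllm). It assumes a vLLM build with Qwen2.5-VL video support; the multimodal input format and prompt template vary across vLLM releases, and the `load_video_frames` helper and captioning prompt below are illustrative assumptions, so treat the released batch inference code as authoritative.

```python
import numpy as np
from decord import VideoReader
from vllm import LLM, SamplingParams

def load_video_frames(path: str, num_frames: int = 16) -> np.ndarray:
    """Decode `num_frames` evenly spaced RGB frames as a (T, H, W, 3) uint8 array."""
    vr = VideoReader(path)
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(idx).asnumpy()

# Qwen2.5-VL-style chat prompt with video placeholder tokens (assumed here;
# the released inference code defines the exact captioning prompt).
PROMPT = (
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Describe this video with a structural caption.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

llm = LLM(model="Skywork/SkyCaptioner-V1", max_model_len=8192)
sampling_params = SamplingParams(temperature=0.1, max_tokens=1024)

requests = [
    {"prompt": PROMPT, "multi_modal_data": {"video": load_video_frames(p)}}
    for p in ["clip_0001.mp4", "clip_0002.mp4"]  # hypothetical input clips
]

outputs = llm.generate(requests, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```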