qingy2024 committed on
Commit 923823f · verified · 1 Parent(s): 7fe23f6

Update README.md

Files changed (1)
  1. README.md +31 -13
README.md CHANGED
@@ -8,20 +8,38 @@ pipeline_tag: video-classification
  ---
 
  # InternVideo2-B14
- ### Cross-Modal and Vision-Language Model Checkpoints
 
- This repository hosts pre-trained model checkpoints for cross-modal video-text understanding, vision-language alignment, and efficient deployment. Below is a summary of included files:
 
- | Filename | Size | Description |
- |-------------------------|---------|-----------------------------------------------------------------------------|
- | cross_mamba_film_warmup.pt | 504 MB | A cross-modal checkpoint combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers |
- | mamba_mobileclip_ckpt.pt | 500 MB | Mamba trained to aggregate MobileCLIP embeddings. Checkpoint 6900, no FiLM |
- | internvideo2_clip.pt | 5.55 MB | CLIP component of **InternVideo2-B14** |
- | internvideo2_vision.pt | 205 MB | Vision encoder backbone for **InternVideo2-B14** |
- | mobileclip_blt.pt | 599 MB | Lightweight **MobileCLIP** variant (BLT) |
 
- | spfs_r64 files | Size | Description |
- |-------------------------|---------|---------------------|
- | /confidence_head.pt | 144 MB | Confidence head for SPFS, saved at step 14,000 (Epoch 2) |
- | /predictor_head.pt | 3.04 MB | Predictor head for SPFS, saved at the end of Epoch 1 |
  ---
 
  # InternVideo2-B14
+ ### Vision-Language Model and __StreamMamba__ Checkpoints
 
+ <details>
+ <summary>License: Apache-2.0</summary>
+ This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.
+ </details>
 
+ ---
+
+ ## Overview
+ **InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for various downstream tasks, including video classification and frame-skipping systems.
+
+ **Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)
+
+ **Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)
+
+ ---
+
+ ## Model Details
 
+ ### Included Checkpoints
+ | Filename | Size | Description |
+ |-------------------------|----------|------------------------------------------------------------------|
+ | `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
+ | `mamba_mobileclip_ckpt.pt` | 500 MB | Mamba-based temporal aggregator trained on MobileCLIP embeddings (no FiLM); checkpoint 6900. |
+ | `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
+ | `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
+ | `mobileclip_blt.pt` | 599 MB | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
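FiLM, as referenced in `cross_mamba_film_warmup.pt`, conditions one modality on another by applying a per-channel scale and shift to feature maps. A minimal pure-Python sketch of the idea (shapes, values, and the `film` name are illustrative assumptions, not this repository's API):

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale (gamma) and
    shift (beta), typically predicted from a conditioning input such
    as a text embedding."""
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in features]

# Toy example: 2 frames x 3 channels of visual features, modulated by
# (hypothetical) text-derived gamma/beta vectors.
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
gamma = [2.0, 1.0, 0.5]   # assumed text-derived scale
beta = [0.0, 1.0, -1.0]   # assumed text-derived shift

print(film(visual, gamma, beta))  # [[2.0, 3.0, 0.5], [8.0, 6.0, 2.0]]
```

In the actual checkpoint, gamma and beta would come from a learned projection of the text embedding rather than fixed vectors.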
 
+ #### Self-Predictive Frame Skipping (SPFS)
+ The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. It includes:
+ - MobileCLIP vision/text encoders
+ - InternVideo2-B14 vision encoder weights
+ - Mamba temporal aggregator (from `mamba_mobileclip_ckpt.pt`)
+ - SPFS-specific weights for frame selection
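The components above suggest a simple decode loop: a lightweight predictor head forecasts the next frame's embedding, and a confidence head decides whether running the full vision encoder can be skipped. A hedged pure-Python sketch (all function names, stubs, and the threshold are assumptions, not the checkpoints' actual interface):

```python
def spfs_loop(frames, encode, predict_next, confidence, threshold=0.9):
    """Toy SPFS-style loop: encode a frame only when the confidence
    head does not trust the predictor's forecast of its embedding."""
    embeddings = []
    skipped = 0
    prev = encode(frames[0])          # always encode the first frame
    embeddings.append(prev)
    for frame in frames[1:]:
        guess = predict_next(prev)    # cheap predictor head
        if confidence(guess) >= threshold:
            prev = guess              # trust the forecast; skip the encoder
            skipped += 1
        else:
            prev = encode(frame)      # fall back to the full vision encoder
        embeddings.append(prev)
    return embeddings, skipped

# Stub components standing in for the real checkpoints.
frames = [0.0, 1.0, 2.0, 3.0]
embs, skipped = spfs_loop(
    frames,
    encode=lambda f: f * 2.0,        # stand-in vision encoder
    predict_next=lambda e: e + 2.0,  # stand-in predictor head
    confidence=lambda g: 0.95,       # stand-in confidence head
)
print(skipped)  # 3
```

With a constant high confidence, every frame after the first is skipped; in practice the confidence head would be the `confidence_head.pt` network scoring the predictor's output, trading accuracy for fewer encoder invocations.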