---

# InternVideo2-B14

### Vision-Language Model and **StreamMamba** checkpoints

<details>
<summary>License: Apache-2.0</summary>

This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.

</details>

---

## Overview

**InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for various downstream tasks, including video classification and frame-skipping systems.

**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)

**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)

---

## Model Details

### Included Checkpoints

| Filename | Size | Description |
|----------|------|-------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | Mamba-based temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
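For intuition on the FiLM conditioning used by the cross-modal checkpoint: FiLM scales and shifts visual features with parameters predicted from a text embedding. The sketch below is a toy illustration only (the projection weights are random stand-ins, not the trained model's layers):

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift visual features
    with parameters derived from a conditioning (text) embedding."""
    return gamma * features + beta

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 512))  # 8 frame embeddings, dim 512
text = rng.standard_normal(512)         # text embedding

# In the real model, gamma/beta come from learned projections of the
# text embedding; fixed random toy projections are used here.
W_gamma = rng.standard_normal((512, 512)) * 0.01
W_beta = rng.standard_normal((512, 512)) * 0.01
gamma = 1.0 + text @ W_gamma
beta = text @ W_beta

modulated = film(frames, gamma, beta)
print(modulated.shape)  # (8, 512)
```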

#### Self-Predictive Frame Skipping (SPFS)

The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. It includes:

- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
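The shipped SPFS weights implement a learned, self-predictive selector; purely to illustrate the general idea of adaptive frame skipping, here is a toy similarity-based stand-in that drops frames whose embeddings are near-duplicates of the last kept frame:

```python
import numpy as np

def select_frames(embeddings, threshold=0.95):
    """Keep a frame only if its cosine similarity to the last kept
    frame falls below `threshold`; otherwise skip it as redundant.
    (Illustrative heuristic, not the learned SPFS predictor.)"""
    kept = [0]
    for i in range(1, len(embeddings)):
        a, b = embeddings[kept[-1]], embeddings[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
# Simulate a clip where frames 0-3 are near-duplicates and 4-7 change.
static = rng.standard_normal(64)
frames = np.stack(
    [static + 0.01 * rng.standard_normal(64) for _ in range(4)]
    + [rng.standard_normal(64) for _ in range(4)]
)
print(select_frames(frames))  # near-duplicate frames are skipped
```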