---

# InternVideo2-B14

### Vision-Language Model and **StreamMamba** checkpoints

<details>
<summary>License: Apache-2.0</summary>

This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.

</details>

---

## Overview

**InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for various downstream tasks, including video classification and frame-skipping systems.

**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)

**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)

---

## Model Details

### Included Checkpoints

| Filename | Size | Description |
|----------|------|-------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | Mamba-based temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
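For intuition on the FiLM conditioning used by the cross-modal checkpoint: FiLM scales and shifts visual features with parameters predicted from a text embedding. The sketch below is a toy illustration only (the projection weights are random stand-ins, not the trained model's layers):

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift visual features
    with parameters derived from a conditioning (text) embedding."""
    return gamma * features + beta

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 512))  # 8 frame embeddings, dim 512
text = rng.standard_normal(512)         # text embedding

# In the real model, gamma/beta come from learned projections of the
# text embedding; fixed random toy projections are used here.
W_gamma = rng.standard_normal((512, 512)) * 0.01
W_beta = rng.standard_normal((512, 512)) * 0.01
gamma = 1.0 + text @ W_gamma
beta = text @ W_beta

modulated = film(frames, gamma, beta)
print(modulated.shape)  # (8, 512)
```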

#### Self-Predictive Frame Skipping (SPFS)

The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. It includes:

- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
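The shipped SPFS weights implement a learned, self-predictive selector; purely to illustrate the general idea of adaptive frame skipping, here is a toy similarity-based stand-in that drops frames whose embeddings are near-duplicates of the last kept frame:

```python
import numpy as np

def select_frames(embeddings, threshold=0.95):
    """Keep a frame only if its cosine similarity to the last kept
    frame falls below `threshold`; otherwise skip it as redundant.
    (Illustrative heuristic, not the learned SPFS predictor.)"""
    kept = [0]
    for i in range(1, len(embeddings)):
        a, b = embeddings[kept[-1]], embeddings[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
# Simulate a clip where frames 0-3 are near-duplicates and 4-7 change.
static = rng.standard_normal(64)
frames = np.stack(
    [static + 0.01 * rng.standard_normal(64) for _ in range(4)]
    + [rng.standard_normal(64) for _ in range(4)]
)
print(select_frames(frames))  # near-duplicate frames are skipped
```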