JenJSun committed (verified)
Commit 6186ebf · Parent: d843420

Update README.md

Files changed (1):
  1. README.md (+44 -5)
README.md CHANGED
@@ -19,6 +19,16 @@ VideoPrism is a foundational video encoder that enables state-of-the-art perform
 
  ## Model details
 
+ We release the following model variants:
+
+ | Model Name | Configuration Name | Model Type | Backbone | #Params | File Size | Checkpoint |
+ | -------- | -------- | ------- | :-------: | :-------: | :-------: | :-------: |
+ | VideoPrism-B | `videoprism_public_v1_base` | Video encoder | ViT-B | 114M | 458MB | [link](https://huggingface.co/google/videoprism-base-f16r288) |
+ | VideoPrism-L | `videoprism_public_v1_large` | Video encoder | ViT-L | 354M | 1.42GB | [link](https://huggingface.co/google/videoprism-large-f8r288) |
+ | VideoPrism-LvT-B | `videoprism_lvt_public_v1_base` | Video-text encoders | ViT-B | 248M | 991MB | [link](https://huggingface.co/google/videoprism-lvt-base-f16r288) |
+ | VideoPrism-LvT-L | `videoprism_lvt_public_v1_large` | Video-text encoders | ViT-L | 580M | 2.30GB | [link](https://huggingface.co/google/videoprism-lvt-large-f8r288) |
+
+
  ### Model description
 
  VideoPrism-B/L are the composition of a Vision Transformer image encoder and four temporal-attention Transformer layers. The image encoder and text encoder are initialized from [CoCa](https://arxiv.org/abs/2205.01917), which is trained on WebLI following the CoCa recipes. VideoPrism is based on the [ViViT](https://arxiv.org/abs/2103.15691) factorized video encoder architecture.
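For reference, the checkpoints linked in the table above can also be fetched programmatically. The sketch below uses the standard `huggingface_hub` package, which is not part of the VideoPrism codebase; only the repo id comes from the table.

```python
# Minimal sketch: download a released checkpoint from the Hugging Face Hub.
# `snapshot_download` is standard huggingface_hub tooling; the repo id below
# is the VideoPrism-B entry from the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="google/videoprism-base-f16r288")
print("Checkpoint files downloaded to:", local_dir)
```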
@@ -26,6 +36,7 @@ VideoPrism-B/L are the composition of a Vision Transformer image encoder and fou
  ### Inputs and outputs
  The models take videos with shape (num_frames, 288, 288, 3) as inputs and output embeddings with shape (num_frames * 16 * 16, feature_channels), which can be reshaped into (num_frames, 16, 16, feature_channels) for spatiotemporal representations. During model training, num_frames is set to 16 and 8 for VideoPrism-B and VideoPrism-L, respectively. Both models are expected to work with arbitrary num_frames by interpolating the temporal positional embeddings.
 
+ In the video-text models, both the video and text encoders produce global embeddings with shape `(feature_channels)`, whose similarity can be measured by cosine distance. We use the `c4_en` [SentencePiece](https://github.com/google/sentencepiece) model for text tokenization. During inference, the embedding computation for either modality can be skipped by providing `None` as the corresponding input.
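To make the shapes above concrete, here is a small NumPy sketch. The arrays are random placeholders rather than real encoder outputs, and `feature_channels = 768` is an assumed value, not one specified by this model card.

```python
import numpy as np

# Illustrative shapes only: random stand-ins for encoder outputs.
num_frames, feature_channels = 16, 768  # feature_channels is an assumed value
tokens = np.random.randn(num_frames * 16 * 16, feature_channels).astype(np.float32)

# Recover the spatiotemporal layout described above.
spatiotemporal = tokens.reshape(num_frames, 16, 16, feature_channels)

# Global embeddings from the video-text models can be compared by cosine similarity.
video_emb = np.random.randn(feature_channels).astype(np.float32)
text_emb = np.random.randn(feature_channels).astype(np.float32)
cosine_sim = np.dot(video_emb, text_emb) / (
    np.linalg.norm(video_emb) * np.linalg.norm(text_emb)
)
print(spatiotemporal.shape, float(cosine_sim))
```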
 
 
  ## Uses
  VideoPrism has a wide range of applications across various video understanding scenarios. The following list covers some primary use cases but is not comprehensive; it is intended to provide context on the use cases the model creators considered during model training and development.
@@ -43,7 +54,14 @@ The model inherits the safety benefits and safety risks associated with the imag
 
 
  ## How to get started with the model
- Use the code at our GitHub to get started with the model: https://github.com/google-deepmind/videoprism.
+ To get started with our models, please see the code and examples in our [GitHub repository](https://github.com/google-deepmind/videoprism).
+
+ ### Feedback and questions
+
+ We welcome all questions and feedback! If you find a bug, have a feature request, or want to ask a question, please don't hesitate to **open an issue** on our GitHub repository.
+
+ We're excited to see what you build with VideoPrism! 🚀
+
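To complement the pointer above, here is a rough sketch of loading one of the released video encoders with that repository's code. The module path and helper names (`videoprism.models`, `get_model`, `load_pretrained_weights`) follow the repository's README at the time of writing and should be treated as assumptions rather than a stable API; the repository itself is the authoritative reference.

```python
# Illustrative only: helper names follow the google-deepmind/videoprism README
# and may differ from the current version of the repository.
import jax
import jax.numpy as jnp
from videoprism import models as vp

model_name = "videoprism_public_v1_base"  # configuration name from the table above
flax_model = vp.get_model(model_name)
loaded_state = vp.load_pretrained_weights(model_name)

@jax.jit
def forward_fn(inputs):
    # Assumed input layout: a float32 batch of frames resized to 288x288.
    return flax_model.apply(loaded_state, inputs, train=False)

dummy_clip = jnp.zeros((1, 16, 288, 288, 3), dtype=jnp.float32)
embeddings = forward_fn(dummy_clip)  # token embeddings, see "Inputs and outputs"
```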
 
 
  ## Training details
 
@@ -63,19 +81,41 @@ VideoPrism is pre-trained on a wide range of videos (36M video-caption pairs and
 
  ## Evaluation
 
+ In the tables below, "Public" denotes the models released in this repository, while "Paper" and "Prior SOTA" denote our models and the previous best-performing models reported in the paper, respectively. The public models perform slightly worse than the paper models because, subject to data policy, they were pre-trained on different image-text data.
+
+
  ### Results on video-focused tasks with frozen backbones
 
  | Dataset | K400 | MiT | SSv2 | D48 | Charades | ActivityNet | AVA | AVA-K |
  | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
- | VideoPrism-B (public) | 82.9 | 39.7 | 62.2 | 64.3 | 43.5 | 36.5 | 28.3 | 30.8 |
- | VideoPrism-L (public) | 85.0 | 43.3 | 64.6 | 67.6 | 53.2 | 37.0 | 32.4 | 34.5 |
+ | **VideoPrism-B (public)** | 82.9 | 39.7 | 62.2 | 64.3 | 43.5 | 36.5 | 28.3 | 30.8 |
+ | **VideoPrism-L (public)** | 85.0 | 43.3 | 64.6 | 67.6 | 53.2 | 37.0 | 32.4 | 34.5 |
  | VideoPrism-B (paper) | 84.2 | 40.8 | 63.6 | 67.4 | 40.4 | 36.6 | 30.6 | 31.8 |
  | VideoPrism-g (paper) | 87.2 | 45.5 | 68.5 | 71.3 | 62.3 | 37.8 | 36.2 | 37.3 |
  | Prior SOTA (B) | 77.1 | 34.0 | 58.2 | 55.6 | 33.3 | 35.8 | 21.1 | 25.9 |
  | Prior SOTA (L+) | 82.8 | 40.3 | 67.4 | 69.6 | 39.9 | 36.7 | 24.4 | 26.2 |
 
- "Public" denotes models we released in this repository. "Paper" and "Prior SOTA" denote our models and previous best-performing models reported in the paper, respectively. Our public models perform slightly worse than the paper models due to different pre-training image-text data we used subject to data policy.
+ ### Zero-shot video-text retrieval
+
+ | Models | MSRVTT-1K (v2t) | MSRVTT-1K (t2v) | VATEX (v2t) | VATEX (t2v) | ActivityNet (v2t) | ActivityNet (t2v) |
+ | -------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
+ | **VideoPrism-LvT-B (public)** | 49.8 | 50.1 | 73.1 | 56.2 | 47.9 | 48.8 |
+ | **VideoPrism-LvT-L (public)** | 50.6 | 50.1 | 75.0 | 57.2 | 49.1 | 51.3 |
+ | VideoPrism-LvT-B (paper) | 50.2 | 51.4 | 76.2 | 57.7 | 47.9 | 49.6 |
+ | VideoPrism-LvT-g (paper) | 51.7 | 52.7 | 77.1 | 62.5 | 50.3 | 52.7 |
+ | Prior SOTA (B) | - | 34.0 | - | - | - | 30.6 |
+ | Prior SOTA (L+) | 45.4 | 43.9 | 73.6 | 53.2 | 40.7 | 42.8 |
+
+ ### Zero-shot video classification
+
+ | Models | K400 | SSv2 (Temporal) | SSv2 (Events) | NExT-QA (Hard) | Charades | Charades (STA) |
+ | -------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
+ | **VideoPrism-LvT-B (public)** | 69.2 | 14.6 | 11.3 | 31.1 | 26.9 | 48.6 |
+ | **VideoPrism-LvT-L (public)** | 72.4 | 18.0 | 12.4 | 32.1 | 32.4 | 50.2 |
+ | VideoPrism-LvT-B (paper) | 71.3 | 16.1 | 11.9 | 31.3 | 29.2 | 50.0 |
+ | VideoPrism-LvT-g (paper) | 74.6 | 18.6 | 15.7 | 32.7 | 32.4 | 50.4 |
+ | Prior SOTA (B) | - | 9.8 | 6.4 | 27.6 | 21.1 | - |
+ | Prior SOTA (L+) | 72.0 | 15.2 | 11.4 | 25.2 | 25.8 | 47.2 |
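For context, zero-shot classification with video-text encoders is typically done by ranking class-name prompts by the cosine similarity between their text embeddings and the video embedding. The sketch below is hypothetical: the class names, the assumed `feature_channels`, and the random embeddings are placeholders rather than outputs of the released models.

```python
import numpy as np

# Hypothetical zero-shot classification sketch using cosine similarity over
# global embeddings; all arrays are random placeholders.
feature_channels = 768  # assumed value
class_names = ["archery", "juggling balls", "playing violin"]
text_embs = np.random.randn(len(class_names), feature_channels).astype(np.float32)
video_emb = np.random.randn(feature_channels).astype(np.float32)

# Normalize and score each class prompt against the video embedding.
text_embs /= np.linalg.norm(text_embs, axis=-1, keepdims=True)
video_emb /= np.linalg.norm(video_emb)
scores = text_embs @ video_emb
print("Predicted class:", class_names[int(np.argmax(scores))])
```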
 
 
  ## Implementation information
@@ -116,4 +156,3 @@ VideoGLUE benchmarks:
  ```
 
 
-
 