JenJSun tingianliu committed on
Commit 9255df1 · verified · 1 Parent(s): 26318a4

Update README.md (#2)


- Update README.md (4617ff7ba0e27042279002d021de0795ea334a9d)


Co-authored-by: Ting Liu <[email protected]>

Files changed (1)
  1. README.md +18 -24
README.md CHANGED
@@ -7,19 +7,19 @@ tags:
 
 # VideoPrism Model Card
 
-**Paper page**: https://huggingface.co/papers/2402.13217
+**Paper**: https://huggingface.co/papers/2402.13217
 
-**Arxiv**: https://arxiv.org/pdf/2402.13217
+**arXiv**: https://arxiv.org/pdf/2402.13217
 
-**Github link**: https://github.com/google-deepmind/videoprism
+**GitHub**: https://github.com/google-deepmind/videoprism
 
-**Blogpost**: https://research.google/blog/videoprism-a-foundational-visual-encoder-for-video-understanding/
+**Blog**: https://research.google/blog/videoprism-a-foundational-visual-encoder-for-video-understanding/
 
 VideoPrism is a foundational video encoder that enables state-of-the-art performance on a large variety of video understanding tasks. It takes video frames as input and outputs compact embeddings of the frames, which one can conveniently feed into classifiers, LLMs, retrieval models, etc. When tested on 33 public video understanding benchmarks over four task categories, a single frozen VideoPrism checkpoint outperforms previous best-performing foundation models on 31 of them, with no fine-tuning on target task datasets.
 
-## Model Details
+## Model details
 
-### Model Description
+### Model description
 
 VideoPrism-B/L are the composition of a Vision Transformer image encoder and four temporal-attention Transformer layers. The image encoder and text encoder are initialized from [CoCa](https://arxiv.org/abs/2205.01917), which is trained on WebLI following the CoCa recipes. VideoPrism is based on the [ViViT](https://arxiv.org/abs/2103.15691) factorized video encoder architecture.
 
@@ -28,29 +28,26 @@ The models take videos with shape (num_frames, 288, 288, 3) as inputs and output
 
 
 ## Uses
-
 VideoPrism has a wide range of applications across various video understanding scenarios. The following lists some primary use cases and yet is not comprehensive. The purpose of this list is to provide contextual information the model creators considered as part of model training and development.
 * **Video classification**: By feeding the video embeddings to a lightweight classifier, we can tackle video action recognition, a fundamental task in video understanding, under various scenarios.
-* **Temporal and spatiotemporal localization**: We can also use the model to localize actions of interest spatially across time by equipping it with a bounding box proposal.
+* **Temporal and spatiotemporal localization**: We can also use the model to localize actions of interest spatially across time by equipping it with a bounding box proposal.
 * **Video retrieval and open-set classification**: By pairing up the video embeddings with a text encoder in the CLIP fashion, we can do text-video retrieval and open-set video classification.
 
 
-## Ethical Considerations and Risks
-
+## Ethical considerations and risks
 The model inherits the safety benefits and safety risks associated with the image encoder CoCa and the training datasets described above. We recommend that the model should not be used for downstream applications without prior assessment and mitigation of downstream application-specific security and fairness concerns.
-* Data Bias: Large datasets scraped from the internet can contain inherent biases, leading to skewed model performance and potentially discriminatory outputs. The presence of "noisy parallel text" like ASR transcripts introduces potential inaccuracies and biases from the speech-to-text process.
-* Content Moderation: The sheer volume of data (36M video-caption pairs and 582M video clips) raises concerns about the presence of objectionable or inappropriate content within the training data, which could lead to harmful model outputs.
-* Ethical Use: As with any powerful video understanding model, there are risks of misuse, such as in surveillance or the propagation of misinformation.
-* Limitations: The reliance on potentially noisy text data can limit the models understanding of the true video content. Further research is needed to refine the models ability to understand long form videos, geometric information in videos, and non-semantic cues.
-
+* Data bias: Large datasets scraped from the internet can contain inherent biases, leading to skewed model performance and potentially discriminatory outputs. The presence of "noisy parallel text" like ASR transcripts introduces potential inaccuracies and biases from the speech-to-text process.
+* Content moderation: The sheer volume of data (36M video-caption pairs and 582M video clips) raises concerns about the presence of objectionable or inappropriate content within the training data, which could lead to harmful model outputs.
+* Ethical use: As with any powerful video understanding model, there are risks of misuse, such as in surveillance or the propagation of misinformation.
+* Limitations: The reliance on potentially noisy text data can limit the models understanding of the true video content. Further research is needed to refine the models ability to understand long form videos, geometric information in videos, and non-semantic cues.
 
-## How to Get Started with the Model
 
-Use the code at our Github to get started with the model: https://github.com/google-deepmind/videoprism.
+## How to get started with the model
+Use the code at our GitHub to get started with the model: https://github.com/google-deepmind/videoprism.
 
-## Training Details
+## Training details
 
-### Training Data
+### Training data
 
 VideoPrism is pre-trained on a wide range of videos (36M video-caption pairs and 582M video clips), including the datasets below. Note that the number of clips are subject to change due to wipeout according to policy.
 
@@ -81,12 +78,9 @@ VideoPrism is pre-trained on a wide range of videos (36M video-caption pairs and
 "Public" denotes models we released in this repository. "Paper" and "Prior SOTA" denote our models and previous best-performing models reported in the paper, respectively. Our public models perform slightly worse than the paper models due to different pre-training image-text data we used subject to data policy.
 
 
+## Implementation information
 
-## Implementation Information
-
-Details about the model internals.
-
-### Model Architecture
+### Model architecture
 
 Vision model is a [ViViT](https://arxiv.org/abs/2103.15691) factorized video encoder architecture, initialized from the Vision Transformer image encoder ([CoCa](https://arxiv.org/abs/2205.01917)) followed by four temporal-attention Transformer layers.
 
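
As context for the "Uses" and "How to get started" sections in the updated README: the model consumes clips of shape (num_frames, 288, 288, 3) and returns embeddings that a lightweight classifier can consume. The sketch below only illustrates the shapes and the linear-probe idea; the `encode_video` helper is a hypothetical placeholder, not the actual VideoPrism API, which lives in the GitHub repository linked above.

```python
import numpy as np

# Hypothetical placeholder for the real VideoPrism encoder; the actual loading
# and inference code lives at https://github.com/google-deepmind/videoprism.
def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in encoder: (num_frames, 288, 288, 3) -> (num_tokens, embed_dim)."""
    num_tokens, embed_dim = 4096, 768  # illustrative sizes only
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_tokens, embed_dim)).astype(np.float32)

# A 16-frame clip at the expected 288x288 resolution.
frames = np.zeros((16, 288, 288, 3), dtype=np.float32)

# Frozen VideoPrism features for the clip, pooled to a single clip embedding.
tokens = encode_video(frames)          # (num_tokens, embed_dim)
clip_embedding = tokens.mean(axis=0)   # (embed_dim,)

# Lightweight classifier ("linear probe") on top of the frozen embedding.
num_classes = 400                      # e.g. an action-recognition label set
rng = np.random.default_rng(1)
weights = 0.01 * rng.standard_normal((clip_embedding.shape[0], num_classes))
bias = np.zeros(num_classes)

logits = clip_embedding @ weights + bias
predicted_class = int(np.argmax(logits))
print("predicted class index:", predicted_class)
```

In practice only the probe weights would be trained on the target dataset; the VideoPrism encoder stays frozen, which is how the benchmark results quoted in the model card were obtained.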
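
The "Video retrieval and open-set classification" bullet pairs the video embeddings with a text encoder in the CLIP fashion. A minimal sketch of that scoring step, again with hypothetical `encode_video_clip` and `encode_text` stand-ins for the real paired encoders:

```python
import numpy as np

EMBED_DIM = 768  # illustrative size only
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paired video and text encoders described above.
def encode_video_clip(frames: np.ndarray) -> np.ndarray:
    """Stand-in video encoder: (num_frames, 288, 288, 3) -> (EMBED_DIM,)."""
    return rng.standard_normal(EMBED_DIM).astype(np.float32)

def encode_text(caption: str) -> np.ndarray:
    """Stand-in text encoder: caption string -> (EMBED_DIM,)."""
    return rng.standard_normal(EMBED_DIM).astype(np.float32)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One query video and a few candidate captions.
video = np.zeros((16, 288, 288, 3), dtype=np.float32)
captions = [
    "a person playing guitar",
    "a dog catching a frisbee",
    "someone cooking pasta",
]

video_emb = l2_normalize(encode_video_clip(video))                      # (EMBED_DIM,)
text_embs = l2_normalize(np.stack([encode_text(c) for c in captions]))  # (3, EMBED_DIM)

# Cosine similarities; the highest-scoring caption is the retrieval result.
scores = text_embs @ video_emb
print(captions[int(np.argmax(scores))], scores)
```

Open-set classification works the same way: the captions are simply class-name prompts, and the highest-scoring prompt is taken as the predicted label.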