nielsr (HF Staff) committed
Commit 2f58d7f · verified · 1 Parent(s): f9264a3

Add video-text-to-text pipeline tag


This PR improves the model card by adding the `pipeline_tag` and linking to the paper. This ensures the model can be found at https://huggingface.co/models?pipeline_tag=video-text-to-text&sort=trending.
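
For reference, the same filter can be queried programmatically. A minimal sketch, assuming `huggingface_hub` is installed; `filter` matches on model tags (pipeline tags are indexed as tags), mirroring the `?pipeline_tag=video-text-to-text` URL filter above:

```python
# List models carrying the video-text-to-text pipeline tag.
# `filter` matches model tags; pipeline tags are indexed as tags as well,
# so this mirrors https://huggingface.co/models?pipeline_tag=video-text-to-text.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    filter="video-text-to-text",  # tag set by this PR's pipeline_tag change
    sort="downloads",             # the Hub URL above sorts by trending instead
    direction=-1,                 # descending
    limit=5,
)
for model in models:
    print(model.id, model.pipeline_tag)
```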

  Files changed (1)
  1. README.md +11 -12
README.md CHANGED
@@ -1,25 +1,24 @@
  ---
- license: apache-2.0
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL2.5-8B
- base_model_relation: merge
+ - OpenGVLab/InternVL2.5-8B
  language:
- - multilingual
+ - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
  tags:
- - Sa2VA
- - custom_code
+ - Sa2VA
+ - custom_code
+ base_model_relation: merge
  ---

  # Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

+ This repository contains the models based on [Sa2VA paper](https://arxiv.org/abs/2501.04001).
+
  [\[📂 GitHub\]](https://github.com/magic-research/Sa2VA)
- [\[📜 Sa2VA paper\]](https://arxiv.org/abs/2501.04001)
  [\[🚀 Quick Start\]](#quick-start)

-
-
  ## Introduction

  Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves comparable performance to SOTA MLLMs Qwen2-VL and InternVL2.5 on question-answering benchmarks. Additionally, Sa2VA possesses the visual prompt understanding and dense object segmentation capabilities that SOTA MLLMs Qwen2-VL and InternVL2.5 lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.
@@ -156,4 +155,4 @@ If you find this project useful in your research, please consider citing:
  journal={arXiv preprint},
  year={2025}
  }
- ```
+ ```
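
As context for the retained metadata, `library_name: transformers` together with the `custom_code` tag means the checkpoint ships its own modeling code and is loaded with `trust_remote_code=True`. A minimal, hedged sketch of that loading pattern; the repo id below is a hypothetical placeholder, and the card's Quick Start section is the authoritative reference:

```python
# Minimal sketch of loading a `custom_code` transformers checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "org-name/Sa2VA-checkpoint"  # hypothetical placeholder for this repo's id

# trust_remote_code=True is required because the repo carries custom modeling code
# (the `custom_code` tag in the front matter above).
model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
```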