Add video-text-to-text pipeline tag

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +11 -12
README.md CHANGED
@@ -1,25 +1,24 @@
  ---
- license: apache-2.0
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL2.5-8B
- base_model_relation: merge
+ - OpenGVLab/InternVL2.5-8B
  language:
- - multilingual
+ - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
  tags:
- - Sa2VA
- - custom_code
+ - Sa2VA
+ - custom_code
+ base_model_relation: merge
  ---

  # Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

+ This repository contains the models based on [Sa2VA paper](https://arxiv.org/abs/2501.04001).
+
  [\[📂 GitHub\]](https://github.com/magic-research/Sa2VA)
- [\[📜 Sa2VA paper\]](https://arxiv.org/abs/2501.04001)
  [\[🚀 Quick Start\]](#quick-start)

-
-
  ## Introduction

  Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves comparable performance to SOTA MLLMs Qwen2-VL and InternVL2.5 on question-answering benchmarks. Additionally, Sa2VA possesses the visual prompt understanding and dense object segmentation capabilities that SOTA MLLMs Qwen2-VL and InternVL2.5 lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.
@@ -156,4 +155,4 @@ If you find this project useful in your research, please consider citing:
  journal={arXiv preprint},
  year={2025}
  }
- ```
+ ```
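
For context, the `pipeline_tag` edit above is plain model-card metadata, so the same kind of change can also be made programmatically. The sketch below is illustrative only and not part of this PR; the repository id is a placeholder, and it uses the `huggingface_hub` metadata helper instead of editing README.md by hand.

```python
# Illustrative sketch (not part of this PR): switch a model card's
# pipeline_tag to video-text-to-text via huggingface_hub and open the
# change as a pull request. The repo id below is a placeholder.
from huggingface_hub import metadata_update

metadata_update(
    repo_id="your-org/your-sa2va-checkpoint",  # placeholder, not from this page
    metadata={"pipeline_tag": "video-text-to-text"},
    overwrite=True,   # an existing image-text-to-text value must be overwritten
    create_pr=True,   # propose the edit as a PR instead of committing to main
)
```

Proposing the edit as a PR, as done here, lets the repository owners review the tag switch before it changes how the model is surfaced on the Hub.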
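
Separately, since the card keeps `library_name: transformers` together with the `custom_code` tag, the model described in the Introduction above is loaded through `AutoModel`/`AutoTokenizer` with `trust_remote_code=True`. The minimal sketch below assumes a placeholder repository id and mirrors the project's published examples for the custom `predict_forward` entry point; the Quick Start section linked in the card remains the authoritative reference.

```python
# Minimal loading sketch under stated assumptions: the repo id is a
# placeholder, and predict_forward is the custom inference method
# shipped with the repository's remote code (custom_code tag).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "your-org/Sa2VA-8B"  # placeholder; substitute the actual repo id

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # required because the model ships custom code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

image = Image.open("example.jpg").convert("RGB")
result = model.predict_forward(
    image=image,
    text="<image>Please describe the image.",
    past_text="",
    mask_prompts=None,
    tokenizer=tokenizer,
)
print(result["prediction"])  # the text answer produced by the model
```

For video inputs and the segmentation outputs mentioned in the Introduction, see the repository's Quick Start for the corresponding calls.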