Add video-text-to-text pipeline tag

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +11 -12
README.md CHANGED
@@ -1,25 +1,24 @@
  ---
- license: apache-2.0
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL2.5-8B
- base_model_relation: merge
+ - OpenGVLab/InternVL2.5-8B
  language:
- - multilingual
+ - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
  tags:
- - Sa2VA
- - custom_code
+ - Sa2VA
+ - custom_code
+ base_model_relation: merge
  ---

  # Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

+ This repository contains the models based on [Sa2VA paper](https://arxiv.org/abs/2501.04001).
+
  [\[📂 GitHub\]](https://github.com/magic-research/Sa2VA)
- [\[📜 Sa2VA paper\]](https://arxiv.org/abs/2501.04001)
  [\[🚀 Quick Start\]](#quick-start)

-
-
  ## Introduction

  Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves comparable performance to SOTA MLLMs Qwen2-VL and InternVL2.5 on question-answering benchmarks. Additionally, Sa2VA possesses the visual prompt understanding and dense object segmentation capabilities that SOTA MLLMs Qwen2-VL and InternVL2.5 lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.
@@ -156,4 +155,4 @@ If you find this project useful in your research, please consider citing:
  journal={arXiv preprint},
  year={2025}
  }
- ```
+ ```
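
For context, the `pipeline_tag` edit above is plain model-card metadata, so the same kind of change can also be made programmatically. The sketch below is illustrative only and not part of this PR; the repository id is a placeholder, and it uses the `huggingface_hub` metadata helper instead of editing README.md by hand.

```python
# Illustrative sketch (not part of this PR): switch a model card's
# pipeline_tag to video-text-to-text via huggingface_hub and open the
# change as a pull request. The repo id below is a placeholder.
from huggingface_hub import metadata_update

metadata_update(
    repo_id="your-org/your-sa2va-checkpoint",  # placeholder, not from this page
    metadata={"pipeline_tag": "video-text-to-text"},
    overwrite=True,   # an existing image-text-to-text value must be overwritten
    create_pr=True,   # propose the edit as a PR instead of committing to main
)
```

Proposing the edit as a PR, as done here, lets the repository owners review the tag switch before it changes how the model is surfaced on the Hub.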
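
Separately, since the card keeps `library_name: transformers` together with the `custom_code` tag, the model described in the Introduction above is loaded through `AutoModel`/`AutoTokenizer` with `trust_remote_code=True`. The minimal sketch below assumes a placeholder repository id and mirrors the project's published examples for the custom `predict_forward` entry point; the Quick Start section linked in the card remains the authoritative reference.

```python
# Minimal loading sketch under stated assumptions: the repo id is a
# placeholder, and predict_forward is the custom inference method
# shipped with the repository's remote code (custom_code tag).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "your-org/Sa2VA-8B"  # placeholder; substitute the actual repo id

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # required because the model ships custom code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

image = Image.open("example.jpg").convert("RGB")
result = model.predict_forward(
    image=image,
    text="<image>Please describe the image.",
    past_text="",
    mask_prompts=None,
    tokenizer=tokenizer,
)
print(result["prediction"])  # the text answer produced by the model
```

For video inputs and the segmentation outputs mentioned in the Introduction, see the repository's Quick Start for the corresponding calls.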