Improve model card: Add `library_name`, usage example, and tags
This PR enhances the model card by:
- Adding the `library_name: transformers` metadata, which enables the "Use in Transformers" widget on the model page.
- Adding a quick-start code snippet for using the model with Hugging Face.
- Adding the tags `visual-grounding` and `spatial-reasoning` for better discoverability.
- Incorporating a more detailed description of the model from the GitHub README.
README.md
CHANGED
Removed from the previous version of the card:

- The front-matter entries `pipeline_tag: image-text-to-text` and `license: apache-2.0` at the top of the metadata block (re-added after `base_model` in the new layout).
- Original LLaVA: [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
- VPP-LLaVA Enhancements: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/pdf/2503.15426)
- **Global VPP**: Provides a global position reference by overlaying learnable, axis-like embeddings onto the input image.
- **Local VPP**: Focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.

Updated README.md:
---
base_model:
- liuhaotian/llava-v1.5-7b
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- visual-grounding
- spatial-reasoning
---

# VPP-LLaVA: Visual Position Prompt for MLLM based Visual Grounding

This repository contains the VPP-LLaVA model, an enhanced multimodal large language model built upon the LLaVA architecture and designed to improve visual grounding by incorporating Visual Position Prompts (VPP).

The model was presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426).

**Code**: [https://github.com/WayneTomas/VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)

## Model Details

**Model Date**: The VPP-LLaVA-7b enhancements, built on the LLaVA-v1.5-7B model, were developed and trained in Feb. 2025.

## About VPP-LLaVA

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction prioritizes global context over fine-grained spatial details, leading to weak localization capability.

To address these issues, VPP-LLaVA introduces an MLLM enhanced with a Visual Position Prompt (VPP) to improve its grounding capability. It integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries that suggest probable object locations for fine-grained localization. To effectively train the model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Its compact format enables efficient training and is significantly smaller than the datasets used by other MLLMs (e.g., ~21M samples for MiniGPT-v2), yet it still provides a strong performance boost. The resulting model achieves state-of-the-art results on standard visual grounding benchmarks and also demonstrates strong zero-shot generalization to challenging unseen datasets.

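To make the global VPP idea concrete, here is a minimal, hypothetical sketch of an axis-like overlay added to the input image tensor. The class name `GlobalVPP`, the tensor shapes, and the zero initialization are illustrative assumptions for this example, not the released implementation; see the GitHub repository for the actual code.

```python
# Illustrative sketch only: a learnable, axis-like overlay added to the input image,
# mirroring the idea of the global Visual Position Prompt described above.
import torch
import torch.nn as nn

class GlobalVPP(nn.Module):
    def __init__(self, image_size: int = 336):
        super().__init__()
        # One learnable "axis" per spatial dimension; their broadcast sum forms a
        # coordinate-grid-like prompt shared across the RGB channels.
        self.row_axis = nn.Parameter(torch.zeros(1, 1, image_size, 1))
        self.col_axis = nn.Parameter(torch.zeros(1, 1, 1, image_size))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) pixel tensor already resized to image_size.
        overlay = self.row_axis + self.col_axis  # (1, 1, H, W)
        return images + overlay                  # broadcast over batch and channels

# Quick shape check with dummy data:
# GlobalVPP(336)(torch.randn(2, 3, 336, 336)).shape -> torch.Size([2, 3, 336, 336])
```
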
## Examples of VPP-LLaVA

<img src="https://github.com/WayneTomas/VPP-LLaVA/raw/main/images/visualization_gseval.jpg" alt="VPP-LLaVA Examples" style="width:100%; max-width:100%; height:auto;">

Our method shows strong zero-shot capability on the more challenging GSEval-BBox dataset, especially in part-object and multi-object scenarios. In the visualizations, <font color="green">green</font> marks the ground truth (GT), <font color="red">red</font> marks our VPP-LLaVA-7B, and <font color="purple">purple</font> marks Qwen2.5-VL-7B.

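Comparison figures like the one above can be reproduced with plain PIL once ground-truth and predicted boxes are available. The sketch below uses dummy box values, and the `[x1, y1, x2, y2]` pixel format is an assumption about how the coordinates are stored.

```python
# Minimal sketch for drawing GT vs. predicted boxes; box values are placeholders.
from PIL import Image, ImageDraw

image = Image.open("path/to/your/image.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

gt_box = [40, 60, 220, 300]    # ground truth, drawn in green
pred_box = [45, 70, 215, 290]  # model prediction, drawn in red

draw.rectangle(gt_box, outline="green", width=3)
draw.rectangle(pred_box, outline="red", width=3)
image.save("comparison.jpg")
```
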
## Quick Start With Hugging Face

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
import torch
from PIL import Image

model_path = "wayneicloud/VPP-LLaVA-7b"  # or "wayneicloud/VPP-LLaVA-13b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Example usage for visual grounding.
# (Note: the exact input format and preprocessing may vary; refer to the GitHub
# repository for the full implementation.)
prompt = "Describe the image and locate the object 'tree' (with bbox)."
image_file = "path/to/your/image.jpg"  # Replace with your image path

image = Image.open(image_file).convert("RGB")
# The image must be processed according to VPP-LLaVA's input requirements,
# e.g. with helpers from llava.mm_utils or a custom preprocessor, producing `image_tensor`.

# Assuming `image_tensor` is prepared and the prompt is tokenized into `input_ids`:
# input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').to(model.device)
# outputs = model.generate(input_ids, images=image_tensor, image_sizes=[image.size], max_new_tokens=100)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

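The commented-out generation step relies on additional helpers from the LLaVA codebase. Continuing from the snippet above, the following is a hedged sketch of one plausible way to complete it, assuming VPP-LLaVA keeps the standard LLaVA-v1.5 utilities (`process_images`, `tokenizer_image_token`, the `llava_v1` conversation template and constants); verify the exact calls against the VPP-LLaVA repository.

```python
# Hedged sketch: completes the quick-start example with standard LLaVA-v1.5 utilities.
# VPP-LLaVA may override parts of this pipeline; treat this as a starting point, not
# the official inference script.
import torch
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Wrap the user prompt in the LLaVA v1 conversation template with the image token.
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + prompt)
conv.append_message(conv.roles[1], None)
full_prompt = conv.get_prompt()

# Preprocess the PIL image with the model's image processor.
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

input_ids = tokenizer_image_token(
    full_prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=100,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```
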
## Training Dataset

VPP-LLaVA is trained on the VPP-SFT dataset, available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT). It contains about 0.6M high-quality visual grounding samples in a compact format designed for efficient training.

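As a starting point, the dataset repository can be fetched locally with `huggingface_hub`; the sketch below only downloads and lists the files, since the exact file layout and annotation format are not described here.

```python
# Sketch: download the VPP-SFT dataset repository and list its files.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="wayneicloud/VPP-SFT", repo_type="dataset")
for path in sorted(Path(local_dir).rglob("*")):
    if path.is_file():
        print(path.relative_to(local_dir))
```
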
## Evaluation Dataset

VPP-LLaVA is evaluated on the following visual grounding benchmarks:
- **RefCOCO**
- **RefCOCO+**
- **RefCOCOg**
- **ReferIt**
- **GSEval-BBox**

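These benchmarks are commonly scored with Acc@0.5: a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5. Below is a minimal reference implementation of that metric, assuming boxes in `[x1, y1, x2, y2]` format.

```python
# Minimal IoU / Acc@0.5 helpers for grounding evaluation; boxes are [x1, y1, x2, y2].
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(predictions, ground_truths):
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: one correct and one incorrect prediction -> Acc@0.5 of 0.5
print(acc_at_05([[0, 0, 10, 10], [0, 0, 2, 2]], [[1, 1, 11, 11], [5, 5, 9, 9]]))
```
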
## License

The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.

## Citation

If you find this work helpful, please cite our paper:

```bibtex
@misc{tang2025visualpositionpromptmllm,
      title={Visual Position Prompt for MLLM based Visual Grounding},
      author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
      year={2025},
      eprint={2503.15426},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.15426},
}
```