nielsr HF Staff commited on
Commit
55b2746
·
verified ·
1 Parent(s): bd361d3

Improve model card: Add `library_name`, usage example, and tags

Browse files

This PR enhances the model card by:
- Adding the `library_name: transformers` metadata, which enables the "Use in Transformers" widget on the model page.
- Adding a code snippet for quick start with Hugging Face.
- Adding tags `visual-grounding` and `spatial-reasoning` for better discoverability.
- Incorporating a more detailed description of the model from the GitHub README.

Files changed (1) hide show
  1. README.md +72 -30
README.md CHANGED
@@ -1,10 +1,21 @@
1
  ---
2
- pipeline_tag: image-text-to-text
3
- license: apache-2.0
4
  base_model:
5
  - liuhaotian/llava-v1.5-7b
 
 
 
 
 
 
6
  ---
7
- # VPP-LLaVA Model Card
 
 
 
 
 
 
 
8
 
9
  ## Model Details
10
 
@@ -12,47 +23,78 @@ base_model:
12
 
13
  **Model Date**: The VPP-LLaVA-7b enhancements were developed and tested based on the LLaVA-v1.5-7B model, which was trained in Feb. 2025.
14
 
15
- **Paper or Resources for More Information**:
16
- - Original LLaVA: [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
17
- - VPP-LLaVA Enhancements: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/pdf/2503.15426)
18
 
19
- ## License
20
 
21
- The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.
22
 
23
- ## Where to Send Questions or Comments about the Model
24
 
25
- For questions or comments about VPP-LLaVA, please refer to the GitHub repository: [VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)
 
26
 
27
- ## Intended Use
28
 
29
- **Primary Intended Uses**: The primary use of VPP-LLaVA is for research on large multimodal models, particularly focusing on improving visual grounding and spatial reasoning capabilities. It aims to enhance the performance of LLaVA in tasks that require precise alignment of spatial information within images.
 
 
 
 
30
 
31
- **Primary Intended Users**: The primary intended users of VPP-LLaVA are researchers and hobbyists in the fields of computer vision, natural language processing, machine learning, and artificial intelligence, who are interested in exploring advanced multimodal models and improving visual grounding performance.
32
 
33
- ## Training Dataset
 
 
 
 
34
 
35
- The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT/tree/main). This dataset contains about 0.6M high-quality visual grounding samples, designed to efficiently train the model for improved visual grounding tasks. Please refer to our [VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA) for more details.
 
 
 
36
 
37
- ## Evaluation Dataset
 
 
 
38
 
39
- The evaluation dataset for VPP-LLaVA includes the following benchmarks:
40
- - **RefCOCO**
41
- - **RefCOCO+**
42
- - **RefCOCOg**
43
- - **ReferIt**
44
- - **GSEval-BBox**
45
 
46
- ## Model Enhancements
47
 
48
- VPP-LLaVA introduces Visual Position Prompts (VPP) to the original LLaVA architecture to enhance visual grounding capabilities. The enhancements are based on the research presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/pdf/2503.15426). The VPP mechanism includes:
49
- - **Global VPP**: Provides a global position reference by overlaying learnable, axis-like embeddings onto the input image.
50
- - **Local VPP**: Focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.
51
 
52
- These enhancements enable VPP-LLaVA to achieve state-of-the-art performance in visual grounding tasks, even when trained on a relatively smaller dataset compared to other models.
53
 
54
- ## Zero-Shot Performance on Unseen Dataset (GSeval)
 
 
 
 
 
 
 
55
 
56
- VPP-LLaVA demonstrates remarkable zero-shot performance on unseen datasets, particularly in challenging scenarios involving part-object and multi-object situations. This capability is crucial for real-world applications where the model may encounter previously unseen objects or complex scenes. The model's ability to generalize and accurately ground visual references in these scenarios highlights its robustness and adaptability.
57
 
58
- VPP-LLaVA paper link: https://arxiv.org/abs/2503.15426
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
 
 
2
  base_model:
3
  - liuhaotian/llava-v1.5-7b
4
+ license: apache-2.0
5
+ pipeline_tag: image-text-to-text
6
+ library_name: transformers
7
+ tags:
8
+ - visual-grounding
9
+ - spatial-reasoning
10
  ---
11
+
12
+ # VPP-LLaVA: Visual Position Prompt for MLLM based Visual Grounding
13
+
14
+ This repository contains the VPP-LLaVA model, an enhanced multimodal large language model built upon the LLaVA architecture, designed to improve visual grounding capabilities by incorporating Visual Position Prompts (VPP).
15
+
16
+ The model was presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426).
17
+
18
+ **Code**: [https://github.com/WayneTomas/VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)
19
 
20
  ## Model Details
21
 
 
23
 
24
  **Model Date**: The VPP-LLaVA-7b enhancements were developed and tested based on the LLaVA-v1.5-7B model, which was trained in Feb. 2025.
25
 
26
+ ## About VPP-LLaVA
 
 
27
 
28
+ Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability.
29
 
30
+ To address these issues, VPP-LLaVA introduces an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets.
31
 
32
+ ## Examples of VPP-LLaVA
33
 
34
+ <img src="https://github.com/WayneTomas/VPP-LLaVA/raw/main/images/visualization_gseval.jpg" alt="VPP-LLaVA Examples" style="width:100%; max-width:100%; height:auto;">
35
+ Our method shows strong zero-shot capability on the more complicated dataset of GSEval-BBox, especially when dealing with part-object and multi-object scenarios. In the visualizations, <font color="green">green</font> represents the ground truth (GT), <font color="red">red</font> represents our VPP-LLaVA-7B, and <font color="purple">purple</font> represents Qwen2.5-VL-7B.
36
 
37
+ ## Quick Start With HuggingFace
38
 
39
+ ```python
40
+ from llava.model.builder import load_pretrained_model
41
+ from llava.mm_utils import get_model_name_from_path
42
+ import torch
43
+ from PIL import Image
44
 
45
+ model_path = "wayneicloud/VPP-LLaVA-7b" # or "wayneicloud/VPP-LLaVA-13b"
46
 
47
+ tokenizer, model, image_processor, context_len = load_pretrained_model(
48
+ model_path=model_path,
49
+ model_base=None,
50
+ model_name=get_model_name_from_path(model_path)
51
+ )
52
 
53
+ # Example usage for visual grounding
54
+ # (Note: Specific input format and processing details might vary, refer to original GitHub for full implementation)
55
+ prompt = "Describe the image and locate the object 'tree' (with bbox)."
56
+ image_file = "path/to/your/image.jpg" # Replace with your image path
57
 
58
+ image = Image.open(image_file).convert("RGB")
59
+ # You'll need to process the image according to VPP-LLaVA's input requirements
60
+ # This might involve functions from llava.mm_utils or a custom preprocessor
61
+ # For simplicity, this example assumes a basic image processing step leading to `image_tensor`
62
 
63
+ # Assuming `image_tensor` is prepared and `prompt_ids` are tokenized
64
+ # For a full example, refer to the project's GitHub repository.
65
+ # input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').to(model.device)
66
+ # outputs = model.generate(input_ids, images=image_tensor, image_sizes=[image.size], max_new_tokens=100)
67
+ # print(tokenizer.decode(outputs[0], skip_special_tokens=True))
68
+ ```
69
 
70
+ ## Training Dataset
71
 
72
+ The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT). This dataset contains about 0.6M high-quality visual grounding samples, designed to efficiently train the model for improved visual grounding tasks.
 
 
73
 
74
+ ## Evaluation Dataset
75
 
76
+ The evaluation dataset for VPP-LLaVA includes the following benchmarks:
77
+ - **RefCOCO**
78
+ - **RefCOCO+**
79
+ - **RefCOCOg**
80
+ - **ReferIt**
81
+ - **GSEval-BBox**
82
+
83
+ ## License
84
 
85
+ The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.
86
 
87
+ ## Citation
88
+
89
+ If you find this work helpful, please cite our paper:
90
+ ```bibtex
91
+ @misc{tang2025visualpositionpromptmllm,
92
+ title={Visual Position Prompt for MLLM based Visual Grounding},
93
+ author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
94
+ year={2025},
95
+ eprint={2503.15426},
96
+ archivePrefix={arXiv},
97
+ primaryClass={cs.CV},
98
+ url={https://arxiv.org/abs/2503.15426},
99
+ }
100
+ ```