Improve model card: Add `library_name`, usage example, and tags

This PR enhances the model card by:
- Adding the `library_name: transformers` metadata, which enables the "Use in Transformers" widget on the model page.
- Adding a code snippet for quick start with Hugging Face.
- Adding tags `visual-grounding` and `spatial-reasoning` for better discoverability.
- Incorporating a more detailed description of the model from the GitHub README.

Files changed (1) hide show

README.md +72 -30

README.md CHANGED Viewed

@@ -1,10 +1,21 @@
 ---
-pipeline_tag: image-text-to-text
-license: apache-2.0
 base_model:
 - liuhaotian/llava-v1.5-7b
 ---
-# VPP-LLaVA Model Card
 ## Model Details
@@ -12,47 +23,78 @@ base_model:
 **Model Date**: The VPP-LLaVA-7b enhancements were developed and tested based on the LLaVA-v1.5-7B model, which was trained in Feb. 2025.
-**Paper or Resources for More Information**:
-- Original LLaVA: [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
-- VPP-LLaVA Enhancements: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/pdf/2503.15426)
-## License
-The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.
-## Where to Send Questions or Comments about the Model
-For questions or comments about VPP-LLaVA, please refer to the GitHub repository: [VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)
-## Intended Use
-**Primary Intended Uses**: The primary use of VPP-LLaVA is for research on large multimodal models, particularly focusing on improving visual grounding and spatial reasoning capabilities. It aims to enhance the performance of LLaVA in tasks that require precise alignment of spatial information within images.
-**Primary Intended Users**: The primary intended users of VPP-LLaVA are researchers and hobbyists in the fields of computer vision, natural language processing, machine learning, and artificial intelligence, who are interested in exploring advanced multimodal models and improving visual grounding performance.
-## Training Dataset
-The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT/tree/main). This dataset contains about 0.6M high-quality visual grounding samples, designed to efficiently train the model for improved visual grounding tasks. Please refer to our [VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA) for more details.
-## Evaluation Dataset
-The evaluation dataset for VPP-LLaVA includes the following benchmarks:
-- **RefCOCO**
-- **RefCOCO+**
-- **RefCOCOg**
-- **ReferIt**
-- **GSEval-BBox**
-## Model Enhancements
-VPP-LLaVA introduces Visual Position Prompts (VPP) to the original LLaVA architecture to enhance visual grounding capabilities. The enhancements are based on the research presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/pdf/2503.15426). The VPP mechanism includes:
-- **Global VPP**: Provides a global position reference by overlaying learnable, axis-like embeddings onto the input image.
-- **Local VPP**: Focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.
-These enhancements enable VPP-LLaVA to achieve state-of-the-art performance in visual grounding tasks, even when trained on a relatively smaller dataset compared to other models.
-## Zero-Shot Performance on Unseen Dataset (GSeval)
-VPP-LLaVA demonstrates remarkable zero-shot performance on unseen datasets, particularly in challenging scenarios involving part-object and multi-object situations. This capability is crucial for real-world applications where the model may encounter previously unseen objects or complex scenes. The model's ability to generalize and accurately ground visual references in these scenarios highlights its robustness and adaptability.
-VPP-LLaVA paper link: https://arxiv.org/abs/2503.15426

 ---
 base_model:
 - liuhaotian/llava-v1.5-7b
+license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- visual-grounding
+- spatial-reasoning
 ---
+# VPP-LLaVA: Visual Position Prompt for MLLM based Visual Grounding
+This repository contains the VPP-LLaVA model, an enhanced multimodal large language model built upon the LLaVA architecture, designed to improve visual grounding capabilities by incorporating Visual Position Prompts (VPP).
+The model was presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426).
+**Code**: [https://github.com/WayneTomas/VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)
 ## Model Details
 **Model Date**: The VPP-LLaVA-7b enhancements were developed and tested based on the LLaVA-v1.5-7B model, which was trained in Feb. 2025.
+## About VPP-LLaVA
+Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability.
+To address these issues, VPP-LLaVA introduces an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets.
+## Examples of VPP-LLaVA
+<img src="https://github.com/WayneTomas/VPP-LLaVA/raw/main/images/visualization_gseval.jpg" alt="VPP-LLaVA Examples" style="width:100%; max-width:100%; height:auto;">
+Our method shows strong zero-shot capability on the more complicated dataset of GSEval-BBox, especially when dealing with part-object and multi-object scenarios. In the visualizations, <font color="green">green</font> represents the ground truth (GT), <font color="red">red</font> represents our VPP-LLaVA-7B, and <font color="purple">purple</font> represents Qwen2.5-VL-7B.
+## Quick Start With HuggingFace
+```python
+from llava.model.builder import load_pretrained_model
+from llava.mm_utils import get_model_name_from_path
+import torch
+from PIL import Image
+model_path = "wayneicloud/VPP-LLaVA-7b" # or "wayneicloud/VPP-LLaVA-13b"
+tokenizer, model, image_processor, context_len = load_pretrained_model(
+    model_path=model_path,
+    model_base=None,
+    model_name=get_model_name_from_path(model_path)
+)
+# Example usage for visual grounding
+# (Note: Specific input format and processing details might vary, refer to original GitHub for full implementation)
+prompt = "Describe the image and locate the object 'tree' (with bbox)."
+image_file = "path/to/your/image.jpg" # Replace with your image path
+image = Image.open(image_file).convert("RGB")
+# You'll need to process the image according to VPP-LLaVA's input requirements
+# This might involve functions from llava.mm_utils or a custom preprocessor
+# For simplicity, this example assumes a basic image processing step leading to `image_tensor`
+# Assuming `image_tensor` is prepared and `prompt_ids` are tokenized
+# For a full example, refer to the project's GitHub repository.
+# input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').to(model.device)
+# outputs = model.generate(input_ids, images=image_tensor, image_sizes=[image.size], max_new_tokens=100)
+# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Training Dataset
+The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT). This dataset contains about 0.6M high-quality visual grounding samples, designed to efficiently train the model for improved visual grounding tasks.
+## Evaluation Dataset
+The evaluation dataset for VPP-LLaVA includes the following benchmarks:
+-   **RefCOCO**
+-   **RefCOCO+**
+-   **RefCOCOg**
+-   **ReferIt**
+-   **GSEval-BBox**
+## License
+The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.
+## Citation
+If you find this work helpful, please cite our paper:
+```bibtex
+@misc{tang2025visualpositionpromptmllm,
+      title={Visual Position Prompt for MLLM based Visual Grounding},
+      author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
+      year={2025},
+      eprint={2503.15426},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2503.15426},
+}
+```