---
license: apache-2.0
base_model:
- liuhaotian/llava-v1.5-7b
---


# LISA++ (LISA_Plus_7b): An Improved Baseline for Reasoning Segmentation with Large Language Model


🤗[Data](https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e) | 📄[Paper](https://arxiv.org/abs/2312.17240)

# Model Card for LISA++ (LISA_Plus_7b)

## Model Details

- **Developed by**: Senqiao Yang, The Chinese University of Hong Kong & SmartMore  
- **Model Type**: Large Vision-Language Model (VLM) for reasoning segmentation  
- **Language(s)**: Supports natural language queries in English  
- **License**: Apache 2.0  
- **Base Model**: Finetuned from [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b)  

## Model Description

LISA++ (LISA_Plus_7b) is an improved baseline for reasoning segmentation with large language models. It enhances the capabilities of its predecessor by incorporating instance segmentation and enabling more natural, multi-turn dialogues through Segmentation in Dialogue (SiD). These advancements are achieved without structural changes or additional data sources, relying instead on curated samples from existing segmentation datasets.

### Key Enhancements:

1. **Instance Segmentation**: Differentiates between different instances of the same category, providing more detailed scene analysis alongside existing multi-region semantic segmentation.
2. **Segmentation in Dialogue (SiD)**: Supports multi-turn dialogue and incorporates segmentation results directly into text responses, leading to more natural and flexible conversations.
3. **Refined Data Curation**: Uses datasets like COCO and ADE20K to improve segmentation and dialogue integration.
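
To make the difference between instance and semantic output concrete, here is a minimal sketch. It assumes masks are plain binary grids; `merge_instances` and the toy masks are hypothetical illustrations, not part of the LISA++ code.

```python
from typing import List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def merge_instances(instance_masks: List[Mask]) -> Mask:
    """Union per-instance masks into one semantic mask for the category."""
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    return [
        [int(any(mask[r][c] for mask in instance_masks)) for c in range(width)]
        for r in range(height)
    ]


# Two hypothetical "cat" instances in a 2x4 image.
cat_a = [[1, 1, 0, 0], [0, 0, 0, 0]]
cat_b = [[0, 0, 0, 1], [0, 0, 0, 1]]

instance_output = [cat_a, cat_b]                    # one mask per object
semantic_output = merge_instances(instance_output)  # one mask per category
print(len(instance_output), semantic_output)
```

Instance segmentation keeps `cat_a` and `cat_b` separate, while semantic segmentation only reports which pixels belong to the category.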

## Intended Uses & Limitations

### Direct Use
- Interactive image understanding and segmentation  
- Multi-turn reasoning about segmented objects in images  
- Visual question-answering with spatial awareness  

### Out-of-Scope Use
- Real-time medical or security applications without further validation  
- Applications requiring precise 3D object segmentation  

## How to Use

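The model is not currently available via the Hugging Face Inference API. To use it locally, first download the checkpoint:

```python
from huggingface_hub import snapshot_download

# Download the LISA++ checkpoint files into the local Hugging Face cache.
checkpoint_dir = snapshot_download(repo_id="Senqiao/LISA_Plus_7b")
print(checkpoint_dir)

# Example inputs for a reasoning-segmentation request.
image_path = "example.jpg"
query = "Highlight all the cats in the image."

# Note: the generic transformers "image-segmentation" pipeline takes no text
# query, so pass these inputs to LISA++-compatible inference code instead.
```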

For further details, refer to the [model repository](https://huggingface.co/Senqiao/LISA_Plus_7b).

## Training Data

LISA++ is trained on curated samples from:

- **COCO Dataset**: Common Objects in Context  
- **ADE20K Dataset**: Scene parsing dataset  
- **Extended ReasonSeg Dataset**: Enhanced for multi-target instance segmentation  

The training data is structured to improve segmentation and dialogue capabilities.

## Training Procedure

- **Base Model**: Finetuned from [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b)  
- **Optimizer**: [Specify optimizer, e.g., AdamW]  
- **Training Datasets**: Trained on the ReasonSeg-Inst and ReasonSeg-Sem datasets  
- **Hardware**: Trained on GPUs [Specify model, e.g., NVIDIA A100]  
- **Loss Functions**: Combination of segmentation and language modeling losses  
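
This card does not spell out the loss terms. As a rough sketch only, the block below assumes a LISA-style objective in which a language modeling loss is added to a mask loss built from binary cross-entropy and Dice terms. The function names and the weights are illustrative placeholders, not values reported for LISA++.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over flattened mask pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-6) -> float:
    """1 - Dice coefficient between predicted probabilities and the target mask."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def combined_loss(text_loss: float, probs: Sequence[float], targets: Sequence[int],
                  w_text: float = 1.0, w_bce: float = 2.0, w_dice: float = 0.5) -> float:
    """Weighted sum of text and mask losses (weights are hypothetical)."""
    mask_loss = w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)
    return w_text * text_loss + mask_loss


# A perfect mask leaves (almost) only the text loss; a wrong mask adds a penalty.
good = combined_loss(0.5, [1.0, 0.0], [1, 0])
bad = combined_loss(0.5, [0.0, 1.0], [1, 0])
print(good, bad)
```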

## Evaluation Results

LISA++ outperforms its predecessor, LISA, on both ReasonSeg benchmarks:

- **ReasonSeg-Inst (Instance Segmentation Performance)**:
  - AP50: **34.1%** (vs. 13.7% in LISA-7B)
  - AP75: **22.1%** (vs. 6.6% in LISA-7B)
  - mAP: **21.5%** (vs. 7.2% in LISA-7B)

- **ReasonSeg-Sem (Semantic Segmentation Performance)**:
  - gIoU: **64.2%** (vs. 53.6% in LISA)
  - cIoU: **68.1%** (vs. 52.3% in LISA)

These results highlight LISA++'s enhanced capabilities in both instance and semantic segmentation tasks.
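
For readers unfamiliar with the two semantic metrics, the sketch below assumes the definitions commonly used for reasoning segmentation: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union. The helper names and toy masks are hypothetical, and this is not the official evaluation script.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask stored as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def giou_ciou(preds: List[Mask], targets: List[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a list of predicted and target masks."""
    per_image_ious = []
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: empty prediction on an empty target scores 1.0.
        per_image_ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_ious) / len(per_image_ious)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Two toy 2x2 images: the first prediction is half right, the second exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, targets)
print(giou, ciou)
```

Because cIoU pools pixels across images, it weights large objects more heavily than gIoU does, which is why the two numbers can differ.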

## Bias, Risks, and Limitations

- **Bias**: The model may inherit biases present in its training datasets (COCO, ADE20K).  
- **Limitations**: May struggle with unseen object categories or highly cluttered scenes.  
- **Ethical Considerations**: Users should verify outputs before deploying in critical applications.  

## Environmental Impact

- **Hardware Used**: NVIDIA A100 GPUs (or equivalent)  
- **Training Duration**: [Specify training time, if available]  
- **Estimated Carbon Emissions**: [Estimate, if available]  

## Citation

If you use LISA_Plus_7b in your research, please cite:

```bibtex
@article{yang2024lisa++,
  title={LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model},
  author={Senqiao Yang},
  journal={arXiv preprint arXiv:2312.17240},
  year={2024}
}
```

## Contact Information

For questions or feedback, contact:

- **Author**: Senqiao Yang

---

This AI-generated model card provides an overview of LISA_Plus_7b's capabilities, training methodology, and evaluation metrics, based on the Hugging Face model repository and the arXiv paper.