---
base_model:
- liuhaotian/llava-v1.5-7b
license: cc-by-nc-nd-4.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- chain-of-thought
---

# UV-CoT: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

This repository hosts the **UV-CoT** model, presented in the paper [Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization](https://huggingface.co/papers/2504.18397).

* **Project page:** [https://kesenzhao.github.io/my_project/projects/UV-CoT.html](https://kesenzhao.github.io/my_project/projects/UV-CoT.html)
* **Code:** [https://github.com/UV-CoT](https://github.com/UV-CoT)

## Overview

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). Existing approaches focus primarily on textual CoT, limiting their ability to leverage visual cues. Unsupervised Visual CoT (UV-CoT) introduces a novel framework for image-level CoT reasoning via preference optimization, eliminating the need for extensive labeled bounding-box data.

UV-CoT achieves this by performing preference comparisons between model-generated bounding boxes. Preference data is generated automatically, and an evaluator MLLM (e.g., OmniLLM-12B) ranks the responses produced from those regions; the rankings then serve as supervision for training the target MLLM (e.g., LLaVA-1.5-7B). This approach emulates human perception, first identifying key regions and then reasoning over them, which improves visual comprehension, particularly on spatial reasoning tasks.
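
In short, the optimization loop is: the target model proposes candidate regions, responses are generated from them, the evaluator scores those responses, and higher-scored candidates are paired against lower-scored ones for a DPO-style update. The sketch below illustrates that pairing and a standard DPO loss; the paper's exact objective differs in its details, and every name and data structure here is illustrative rather than part of the released code:

```python
import torch.nn.functional as F

def build_preference_pairs(candidates, scores, margin=0.0):
    """Pair candidate regions/responses by evaluator score (higher score = preferred).

    `candidates` and `scores` stand in for the model-generated bounding-box responses
    and the evaluator MLLM's ratings; both are illustrative placeholders.
    """
    pairs = []
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            if scores[i] - scores[j] > margin:
                pairs.append((candidates[i], candidates[j]))  # (preferred, dispreferred)
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on the log-probabilities of the chosen/rejected responses."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()
```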
<div align="center">
<img src="./images/fig1.svg" alt="Figure 1: UV-CoT Overview" width="1200px" />
</div>

## Visualizations

Qualitative examples demonstrating UV-CoT's visual reasoning:

<div align="center">
<img src="./images/fig5_v1.2.svg" alt="Figure 5: UV-CoT Visualization 1" width="1200px" />
</div>

<div align="center">
<img src="./images/fig6_v1.2.svg" alt="Figure 6: UV-CoT Visualization 2" width="1200px" />
</div>

## Installation

To set up the environment and install the necessary packages, follow these steps:

1. Clone this repository and navigate to the `UV-CoT` folder:

    ```bash
    git clone https://github.com/UV-CoT
    cd UV-CoT
    ```

2. Create a conda environment and install the package:

    ```bash
    conda create -n uv-cot python=3.10 -y
    conda activate uv-cot
    pip install -e .
    ```

3. Install the required spaCy model (an optional sanity check follows these steps):

    ```bash
    wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
    pip install en_core_web_trf-3.7.3.tar.gz
    ```
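
Optionally, you can verify that the spaCy pipeline installed correctly. This is a minimal sanity check, not part of the official instructions:

```python
# Sanity check: the transformer-based spaCy pipeline should load and tag text.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("The bird is perched on a thin branch.")
print([(token.text, token.pos_) for token in doc])
```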

## Usage

You can load and use the UV-CoT model with the `transformers` library. For detailed information on preference data curation, training, and evaluation, please refer to the [official GitHub repository](https://github.com/UV-CoT).

Here's a basic example of how to use the model for inference:

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model and processor
model_id = "kesenZhaoNTU/UV-CoT"  # Use this model_id to load UV-CoT
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Load an example image
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Define the conversation prompt; the <image> placeholder marks where the image goes
prompt = "Describe the image in detail."
messages = [
    {"role": "user", "content": f"<image>\n{prompt}"}
]

# Apply the chat template to format the prompt for the model
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Prepare inputs for the model
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Generate a response
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```
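
UV-CoT's image-level reasoning is naturally run as a two-step loop: first ask the model which region of the image is most relevant to the question, then crop that region and feed it back together with the full image for the final answer. The sketch below (reusing `model`, `processor`, and `image` from the example above) illustrates this loop; the region-selection prompt and the coordinate format are illustrative assumptions rather than the official prompts, so adjust the parsing to whatever the checkpoint actually emits and see the official repository for the exact inference pipeline:

```python
import re

def ask(image_list, text):
    """Run one generation turn; images are matched to <image> placeholders in order."""
    msgs = [{"role": "user", "content": text}]
    chat = processor.apply_chat_template(msgs, add_generation_prompt=True)
    inputs = processor(text=chat, images=image_list, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(out[0], skip_special_tokens=True)

question = "What is the bird perched on?"

# Step 1: ask for the most helpful region (the prompt wording is an assumption).
region_reply = ask(
    [image],
    f"<image>\n{question}\nPlease provide the bounding box of the region that helps answer the question.",
)

# Step 2: parse the box, crop it, and answer using both the full image and the crop.
# Assumes four normalized [0, 1] coordinates; adapt if the model returns pixel values.
coords = [float(c) for c in re.findall(r"[0-9]*\.?[0-9]+", region_reply)[-4:]]
if len(coords) == 4:
    w, h = image.size
    box = (int(coords[0] * w), int(coords[1] * h), int(coords[2] * w), int(coords[3] * h))
    crop = image.crop(box)
    print(ask([image, crop], f"<image>\n{question}\nKey region: <image>"))
```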

## Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

```bibtex
@misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
      title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
      author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
      year={2025},
      eprint={2504.18397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18397},
}
```