---
base_model:
- liuhaotian/llava-v1.5-7b
license: cc-by-nc-nd-4.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- chain-of-thought
---
# UV-CoT: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
This repository hosts the **UV-CoT** model, presented in the paper [Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization](https://huggingface.co/papers/2504.18397).
* **Project page:** [https://kesenzhao.github.io/my_project/projects/UV-CoT.html](https://kesenzhao.github.io/my_project/projects/UV-CoT.html)
* **Code:** [https://github.com/UV-CoT](https://github.com/UV-CoT)
## Overview
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). Existing approaches primarily focus on text CoT, limiting their ability to leverage visual cues. Unsupervised Visual CoT (UV-CoT) introduces a novel framework for image-level CoT reasoning via preference optimization, eliminating the need for extensive labeled bounding-box data.
UV-CoT achieves this by performing preference comparisons between model-generated bounding boxes. It generates preference data automatically and uses an evaluator MLLM (e.g., OmniLMM-12B) to rank the resulting responses; these rankings serve as the supervision for training the target MLLM (e.g., LLaVA-1.5-7B). The approach emulates human perception (identifying key regions and reasoning based on them), thereby improving visual comprehension, particularly in spatial reasoning tasks.
<div align="center">
<img src="./images/fig1.svg" alt="Figure 1: UV-CoT Overview" width="1200px" />
</div>
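To make the idea above concrete, the sketch below shows one way ranked bounding-box responses could be turned into chosen/rejected pairs and scored with a standard DPO objective. This is a minimal illustration only: the helper names are hypothetical, the loss shown is vanilla DPO rather than the exact objective used by UV-CoT, and the real preference-data pipeline is defined in the paper and the GitHub repository.

```python
import torch.nn.functional as F

def build_preference_pair(candidates, evaluator_score):
    """Rank candidate (bounding box, answer) responses with an evaluator MLLM
    and keep the best and worst as a chosen/rejected pair.
    `evaluator_score` is a hypothetical callable returning a scalar score."""
    ranked = sorted(candidates, key=evaluator_score, reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO objective: prefer the higher-ranked bounding-box response
    over the lower-ranked one, relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

Because the comparisons are produced and ranked automatically, no human-labeled bounding boxes are required.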
## Visualizations
Qualitative examples demonstrating UV-CoT's visual reasoning:
<div align="center">
<img src="./images/fig5_v1.2.svg" alt="Figure 5: UV-CoT Visualization 1" width="1200px" />
</div>
<div align="center">
<img src="./images/fig6_v1.2.svg" alt="Figure 6: UV-CoT Visualization 2" width="1200px" />
</div>
## Installation
To set up the environment and install necessary packages, follow these steps:
1. Clone this repository and navigate to the `UV-CoT` folder:
```bash
git clone https://github.com/UV-CoT
cd UV-CoT
```
2. Create a conda environment and install the package:
```bash
conda create -n uv-cot python=3.10 -y
conda activate uv-cot
pip install -e .
```
3. Install the required spaCy model:
```bash
wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
pip install en_core_web_trf-3.7.3.tar.gz
```
## Usage
You can load and use the UV-CoT model with the `transformers` library. For detailed information on preference data curation, training, and evaluation, please refer to the [official GitHub repository](https://github.com/UV-CoT).
Here's a basic example of how to use the model for inference:
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch
# Load model and processor
model_id = "kesenZhaoNTU/UV-CoT" # Use this model_id to load UV-CoT
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)
# Load an example image
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
# Define the conversation prompt
prompt = "Describe the image in detail."
messages = [
    {"role": "user", "content": f"<image>\n{prompt}"}
]
# Apply the chat template to format the prompt for the model
text = processor.apply_chat_template(messages, add_generation_prompt=True)
# Prepare inputs for the model
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
# Generate response
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```
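Since UV-CoT is trained to ground its reasoning in image regions, you can also prompt it for the key region before asking the final question. The snippet below continues the example above (it reuses `model`, `processor`, and `image`); the prompt wording is only illustrative, so please refer to the GitHub repository for the exact templates used in training and evaluation.

```python
# Hypothetical two-turn visual-CoT interaction (continues the example above):
# first ask for the key region; the answer can then be conditioned on it.
question = "What is the bird perched on?"
region_request = (
    f"<image>\n{question} Please provide the bounding box of the region "
    "that helps answer the question."
)
messages = [{"role": "user", "content": region_request}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
region = processor.decode(
    model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True
)
print(region)  # expected to contain a bounding box for the relevant region
```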
## Citation
If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
      title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
      author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
      year={2025},
      eprint={2504.18397},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.18397},
}
```