Improve model card: Add library, links, detailed sections, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +103 -2
README.md CHANGED
@@ -1,8 +1,109 @@
  ---
- license: cc-by-nc-nd-4.0
  base_model:
  - liuhaotian/llava-v1.5-7b
+ license: cc-by-nc-nd-4.0
  pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - chain-of-thought
  ---

- Paper page: https://huggingface.co/papers/2504.18397
+ # UV-CoT: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
+
+ This repository hosts the **UV-CoT** model, presented in the paper [Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization](https://huggingface.co/papers/2504.18397).
+
+ * **Project page:** [https://kesenzhao.github.io/my_project/projects/UV-CoT.html](https://kesenzhao.github.io/my_project/projects/UV-CoT.html)
+ * **Code:** [https://github.com/UV-CoT/UV-CoT](https://github.com/UV-CoT/UV-CoT)
+
+ ## Overview
+
+ Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). Existing approaches focus primarily on text-based CoT, limiting their ability to leverage visual cues. Unsupervised Visual CoT (UV-CoT) introduces a novel framework for image-level CoT reasoning via preference optimization, eliminating the need for extensive labeled bounding-box data.
+
+ UV-CoT achieves this by performing preference comparisons between model-generated bounding boxes: preference data is generated automatically, and an evaluator MLLM (e.g., OmniLMM-12B) ranks the responses grounded in each box; the resulting rankings supervise training of the target MLLM (e.g., LLaVA-1.5-7B). This emulates human perception, first identifying key regions and then reasoning over them, which improves visual comprehension, particularly on spatial reasoning tasks.
+
+ ![Figure 1: UV-CoT Overview](https://raw.githubusercontent.com/UV-CoT/UV-CoT/main/images/fig1.svg)
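+
+ As a concrete illustration of the training signal, the snippet below sketches a standard DPO-style preference objective over two responses tied to different candidate bounding boxes: the response grounded in the evaluator-preferred box acts as the "chosen" sample and a lower-ranked one as the "rejected" sample. This is only a minimal sketch, not the paper's exact loss or code; the function name, the `beta` value, and the toy log-probabilities are illustrative placeholders.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def dpo_preference_loss(policy_chosen_logps, policy_rejected_logps,
+                         ref_chosen_logps, ref_rejected_logps, beta=0.1):
+     """DPO-style objective: prefer the response grounded in the higher-ranked box."""
+     chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
+     rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
+     return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
+
+ # Toy values: summed log-probabilities of the answer under the policy and a frozen
+ # reference model, conditioned on the preferred vs. the lower-ranked bounding box.
+ loss = dpo_preference_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
+                            torch.tensor([-13.0]), torch.tensor([-14.9]))
+ print(loss)
+ ```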
+
+ ## Visualizations
+
+ Qualitative examples demonstrating UV-CoT's visual reasoning:
+
+ ![Figure 5: UV-CoT Visualization 1](https://raw.githubusercontent.com/UV-CoT/UV-CoT/main/images/fig5_v1.2.svg)
+ ![Figure 6: UV-CoT Visualization 2](https://raw.githubusercontent.com/UV-CoT/UV-CoT/main/images/fig6_v1.2.svg)
+
+ ## Installation
+
+ To set up the environment and install the necessary packages, follow these steps:
+
+ 1. Clone this repository and navigate to the `UV-CoT` folder:
+ ```bash
+ git clone https://github.com/UV-CoT/UV-CoT.git
+ cd UV-CoT
+ ```
+
+ 2. Create a conda environment and install the package:
+ ```bash
+ conda create -n uv-cot python=3.10 -y
+ conda activate uv-cot
+ pip install -e .
+ ```
+
+ 3. Install the required spaCy model:
+ ```bash
+ wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
+ pip install en_core_web_trf-3.7.3.tar.gz
+ ```
+
+ ## Usage
+
+ You can load and use the UV-CoT model with the `transformers` library. For detailed information on preference data curation, training, and evaluation, please refer to the [official GitHub repository](https://github.com/UV-CoT/UV-CoT).
+
+ Here's a basic example of how to use the model for inference:
+
+ ```python
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model_id = "kesenZhaoNTU/UV-CoT"  # Use this model_id to load UV-CoT
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Load an example image
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
+ image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+
+ # Define the conversation prompt
+ prompt = "Describe the image in detail."
+ messages = [
+     {"role": "user", "content": f"<image>\n{prompt}"}
+ ]
+
+ # Apply the chat template to format the prompt for the model
+ text = processor.apply_chat_template(messages, add_generation_prompt=True)
+
+ # Prepare inputs for the model
+ inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
+
+ # Generate response
+ output = model.generate(**inputs, max_new_tokens=200)
+ print(processor.decode(output[0], skip_special_tokens=True))
+ ```
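+
+ The example above is a single-turn query. To mirror the region-then-reason behavior described in the Overview, the hypothetical sketch below first asks the model for the key region, crops it, and then answers conditioned on the crop. It reuses `model`, `processor`, and `image` from the snippet above and assumes the region comes back as four integers in `[x1, y1, x2, y2]` pixel coordinates; the exact prompts and bounding-box format used by UV-CoT may differ, so consult the GitHub repository and adjust the parsing accordingly.
+
+ ```python
+ import re
+
+ # Step 1: ask for the most relevant region (the requested output format is an assumption).
+ question = "What is the bird perched on?"
+ region_prompt = (f"<image>\n{question}\nFirst, give the bounding box of the most "
+                  "relevant region as [x1, y1, x2, y2] in pixel coordinates.")
+ region_text = processor.apply_chat_template(
+     [{"role": "user", "content": region_prompt}], add_generation_prompt=True
+ )
+ region_inputs = processor(text=region_text, images=image, return_tensors="pt").to(model.device)
+ region_output = processor.decode(
+     model.generate(**region_inputs, max_new_tokens=64)[0], skip_special_tokens=True
+ )
+
+ # Step 2: crop the predicted region (fall back to the full image if no box is parsed)
+ # and answer the question conditioned on the cropped view.
+ match = re.search(r"(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", region_output)
+ region = image.crop(tuple(map(int, match.groups()))) if match else image
+
+ answer_text = processor.apply_chat_template(
+     [{"role": "user", "content": f"<image>\n{question}"}], add_generation_prompt=True
+ )
+ answer_inputs = processor(text=answer_text, images=region, return_tensors="pt").to(model.device)
+ output = model.generate(**answer_inputs, max_new_tokens=100)
+ print(processor.decode(output[0], skip_special_tokens=True))
+ ```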
+
+ ## Citation
+
+ If our work assists your research, feel free to give us a star ⭐ or cite us using:
+
+ ```bibtex
+ @misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
+       title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
+       author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
+       year={2025},
+       eprint={2504.18397},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2504.18397},
+ }
+ ```