Improve model card: Add library, links, detailed sections, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +103 -2
README.md CHANGED
@@ -1,8 +1,109 @@
  ---
- license: cc-by-nc-nd-4.0
  base_model:
  - liuhaotian/llava-v1.5-7b
+ license: cc-by-nc-nd-4.0
  pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - chain-of-thought
  ---

- Paper page: https://huggingface.co/papers/2504.18397
+ # UV-CoT: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
+
+ This repository hosts the **UV-CoT** model, presented in the paper [Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization](https://huggingface.co/papers/2504.18397).
+
+ * **Project page:** [https://kesenzhao.github.io/my_project/projects/UV-CoT.html](https://kesenzhao.github.io/my_project/projects/UV-CoT.html)
+ * **Code:** [https://github.com/UV-CoT/UV-CoT](https://github.com/UV-CoT/UV-CoT)
+
+ ## Overview
+
+ Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). Existing approaches focus primarily on text-based CoT, limiting their ability to leverage visual cues. Unsupervised Visual CoT (UV-CoT) introduces a novel framework for image-level CoT reasoning via preference optimization, eliminating the need for extensive labeled bounding-box data.
+
+ UV-CoT achieves this by performing preference comparisons between model-generated bounding boxes: preference data is generated automatically, and an evaluator MLLM (e.g., OmniLMM-12B) ranks the responses grounded in each box; the resulting rankings supervise training of the target MLLM (e.g., LLaVA-1.5-7B). This emulates human perception, first identifying key regions and then reasoning over them, which improves visual comprehension, particularly on spatial reasoning tasks.
+
+ ![Figure 1: UV-CoT Overview](https://raw.githubusercontent.com/UV-CoT/UV-CoT/main/images/fig1.svg)
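+
+ As a concrete illustration of the training signal, the snippet below sketches a standard DPO-style preference objective over two responses tied to different candidate bounding boxes: the response grounded in the evaluator-preferred box acts as the "chosen" sample and a lower-ranked one as the "rejected" sample. This is only a minimal sketch, not the paper's exact loss or code; the function name, the `beta` value, and the toy log-probabilities are illustrative placeholders.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def dpo_preference_loss(policy_chosen_logps, policy_rejected_logps,
+                         ref_chosen_logps, ref_rejected_logps, beta=0.1):
+     """DPO-style objective: prefer the response grounded in the higher-ranked box."""
+     chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
+     rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
+     return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
+
+ # Toy values: summed log-probabilities of the answer under the policy and a frozen
+ # reference model, conditioned on the preferred vs. the lower-ranked bounding box.
+ loss = dpo_preference_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
+                            torch.tensor([-13.0]), torch.tensor([-14.9]))
+ print(loss)
+ ```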
+
+ ## Visualizations
+
+ Qualitative examples demonstrating UV-CoT's visual reasoning:
+
+ ![Figure 5: UV-CoT Visualization 1](https://raw.githubusercontent.com/UV-CoT/UV-CoT/main/images/fig5_v1.2.svg)
+ ![Figure 6: UV-CoT Visualization 2](https://raw.githubusercontent.com/UV-CoT/UV-CoT/main/images/fig6_v1.2.svg)
+
+ ## Installation
+
+ To set up the environment and install the necessary packages, follow these steps:
+
+ 1. Clone this repository and navigate to the `UV-CoT` folder:
+ ```bash
+ git clone https://github.com/UV-CoT/UV-CoT.git
+ cd UV-CoT
+ ```
+
+ 2. Create a conda environment and install the package:
+ ```bash
+ conda create -n uv-cot python=3.10 -y
+ conda activate uv-cot
+ pip install -e .
+ ```
+
+ 3. Install the required spaCy model:
+ ```bash
+ wget https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz
+ pip install en_core_web_trf-3.7.3.tar.gz
+ ```
+
+ ## Usage
+
+ You can load and use the UV-CoT model with the `transformers` library. For detailed information on preference data curation, training, and evaluation, please refer to the [official GitHub repository](https://github.com/UV-CoT/UV-CoT).
+
+ Here's a basic example of how to use the model for inference:
+
+ ```python
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model_id = "kesenZhaoNTU/UV-CoT"  # Use this model_id to load UV-CoT
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Load an example image
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird.jpg"
+ image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+
+ # Define the conversation prompt
+ prompt = "Describe the image in detail."
+ messages = [
+     {"role": "user", "content": f"<image>\n{prompt}"}
+ ]
+
+ # Apply the chat template to format the prompt for the model
+ text = processor.apply_chat_template(messages, add_generation_prompt=True)
+
+ # Prepare inputs for the model
+ inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
+
+ # Generate response
+ output = model.generate(**inputs, max_new_tokens=200)
+ print(processor.decode(output[0], skip_special_tokens=True))
+ ```
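+
+ The example above is a single-turn query. To mirror the region-then-reason behavior described in the Overview, the hypothetical sketch below first asks the model for the key region, crops it, and then answers conditioned on the crop. It reuses `model`, `processor`, and `image` from the snippet above and assumes the region comes back as four integers in `[x1, y1, x2, y2]` pixel coordinates; the exact prompts and bounding-box format used by UV-CoT may differ, so consult the GitHub repository and adjust the parsing accordingly.
+
+ ```python
+ import re
+
+ # Step 1: ask for the most relevant region (the requested output format is an assumption).
+ question = "What is the bird perched on?"
+ region_prompt = (f"<image>\n{question}\nFirst, give the bounding box of the most "
+                  "relevant region as [x1, y1, x2, y2] in pixel coordinates.")
+ region_text = processor.apply_chat_template(
+     [{"role": "user", "content": region_prompt}], add_generation_prompt=True
+ )
+ region_inputs = processor(text=region_text, images=image, return_tensors="pt").to(model.device)
+ region_output = processor.decode(
+     model.generate(**region_inputs, max_new_tokens=64)[0], skip_special_tokens=True
+ )
+
+ # Step 2: crop the predicted region (fall back to the full image if no box is parsed)
+ # and answer the question conditioned on the cropped view.
+ match = re.search(r"(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", region_output)
+ region = image.crop(tuple(map(int, match.groups()))) if match else image
+
+ answer_text = processor.apply_chat_template(
+     [{"role": "user", "content": f"<image>\n{question}"}], add_generation_prompt=True
+ )
+ answer_inputs = processor(text=answer_text, images=region, return_tensors="pt").to(model.device)
+ output = model.generate(**answer_inputs, max_new_tokens=100)
+ print(processor.decode(output[0], skip_special_tokens=True))
+ ```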
+
+ ## Citation
+
+ If our work assists your research, feel free to give us a star ⭐ or cite us using:
+
+ ```bibtex
+ @misc{zhao2025unsupervisedvisualchainofthoughtreasoning,
+       title={Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization},
+       author={Kesen Zhao and Beier Zhu and Qianru Sun and Hanwang Zhang},
+       year={2025},
+       eprint={2504.18397},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2504.18397},
+ }
+ ```