bfshi-nvidia committed
Commit 6cb168a · verified · 1 Parent(s): 230603b

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/example_selection_maps/bottom_up_selection_prob.png filter=lfs diff=lfs merge=lfs -text
+assets/example_selection_maps/top_down_selection_prob_1.png filter=lfs diff=lfs merge=lfs -text
+assets/example_selection_maps/top_down_selection_prob_2.png filter=lfs diff=lfs merge=lfs -text
+assets/test_images/dock.jpg filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,320 @@
---
language:
- en
base_model:
- google/siglip-so400m-patch14-384
pipeline_tag: image-feature-extraction
---
<div align="center">

# Scaling Vision Pre-Training to 4K Resolution

[![website](https://img.shields.io/badge/website-76b900?style=for-the-badge&logo=safari&labelColor=555555)](https://nvlabs.github.io/PS3/)
[![Arxiv](https://img.shields.io/badge/Arxiv-b31b1b?style=for-the-badge&logo=arxiv&labelColor=555555)](https://arxiv.org/abs/2503.19903)
[![VILA-HD Demo](https://img.shields.io/badge/-VILA--HD_Demo-brightgreen?style=for-the-badge&logo=huggingface&labelColor=555555&color=ff6e00)](https://huggingface.co/spaces/bfshi/VILA-HD-demo)
[![PS3 Models](https://img.shields.io/badge/PS3%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
[![VILA-HD Models](https://img.shields.io/badge/VILA--HD%20Models%20-ffd21e?style=for-the-badge&logo=huggingface&labelColor=555555)](https://huggingface.co/collections/nvidia/ps3-scaling-vision-pre-training-to-4k-resolution-682d0535b61c07afd45242e9)
[![PS3 Code](https://img.shields.io/badge/PS3%20Code%20-181717?style=for-the-badge&logo=github&labelColor=555555)](https://github.com/NVlabs/PS3)
[![VILA-HD Code](https://img.shields.io/badge/VILA--HD%20Code%20-181717?style=for-the-badge&logo=github&labelColor=555555)](https://github.com/NVlabs/VILA/tree/main/vila_hd)

<div style="font-family: charter;">
<a href="https://bfshi.github.io" target="_blank" style="color: #6f6f6f; text-decoration: none;">Baifeng Shi</a><sup style="font-size: 0.6em;">1,2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://sites.google.com/site/boyilics/home" target="_blank" style="color: #6f6f6f; text-decoration: none;">Boyi Li</a><sup style="font-size: 0.6em;">1,2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://han-cai.github.io/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Han Cai</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://scholar.google.com/citations?user=OI7zFmwAAAAJ&hl=en/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Yao Lu</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://sifeiliu.net/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Sifei Liu</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://research.nvidia.com/person/marco-pavone" target="_blank" style="color: #6f6f6f; text-decoration: none;">Marco Pavone</a><sup style="font-size: 0.6em;">2</sup>
<br>
<a href="https://jankautz.com/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Jan Kautz</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://hanlab.mit.edu/songhan/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Song Han</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://people.eecs.berkeley.edu/~trevor/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Trevor Darrell</a><sup style="font-size: 0.6em;">1</sup>&nbsp;&nbsp;&nbsp;
<a href="https://www.pmolchanov.com/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Pavlo Molchanov</a><sup style="font-size: 0.6em;">2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://hongxu-yin.github.io/" target="_blank" style="color: #6f6f6f; text-decoration: none;">Hongxu Yin</a><sup style="font-size: 0.6em;">2</sup>
<br>
<sup style="font-size: 0.6em;">1</sup> UC Berkeley&nbsp;&nbsp;&nbsp;
<sup style="font-size: 0.6em;">2</sup> NVIDIA&nbsp;&nbsp;&nbsp;
</div>

</div>

<hr style="border: 2px solid gray;"></hr>

## Pre-Trained Models

### PS3 models

| Vision Model | Max Resolution | Pre-Trained Weights |
|-----------------|----------------|--------------------------------------------------------------------------|
| PS3-1.5K-SigLIP | 1512 * 1512 | [nvidia/PS3-1.5K-SigLIP](https://huggingface.co/nvidia/PS3-1.5K-SigLIP) |
| PS3-4K-SigLIP | 3780 * 3780 | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) |

<hr style="border: 2px solid gray;"></hr>

## Performance

### Performance of PS3 models

See Table 1 in the paper for full results.

| Vision Model | Pre-Trained Weights | Max Resolution | # High-Res Tokens | TextVQA | ChartQA | DocVQA | InfoVQA | OCRBench | V*Bench | RealWorldQA | Avg |
|---------------------|--------------------------------------------------------------------------|----------------|-------------------|---------|---------|--------|---------|----------|---------|-------------|------|
| SigLIP | | 378 | 0 | 62.3 | 56.6 | 51.9 | 30.7 | 387 | 51.8 | 57.1 | 49.9 |
| SigLIP + AnyRes | | 1512 | 3136 | 67.4 | 58.4 | 67.9 | 34.1 | 468 | 60.2 | 59.0 | 56.3 |
| SigLIP + S2 | | 1512 | 2916 | 66.1 | 71.0 | 78.3 | 41.1 | 526 | 55.2 | 61.0 | 60.8 |
| **PS3-1.5K-SigLIP** | [nvidia/PS3-1.5K-SigLIP](https://huggingface.co/nvidia/PS3-1.5K-SigLIP) | 1512 | 3645 | 69.3 | 71.1 | 79.4 | 41.3 | 534 | 64.0 | 63.8 | 63.2 |
| SigLIP + AnyRes | | 3780 | 19600 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| SigLIP + S2 | | 3780 | 18225 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| **PS3-4K-SigLIP** | [nvidia/PS3-4K-SigLIP](https://huggingface.co/nvidia/PS3-4K-SigLIP) | 3780 | 3840 | 69.8 | 70.9 | 79.1 | 40.5 | 543 | 67.8 | 64.7 | 63.9 |


## Installation

Install through pip to use PS3 out of the box.
```bash
pip install ps3-torch
```

If you would like to make changes to the PS3 code, clone the [PS3 repo](https://github.com/NVlabs/PS3) and install it in editable mode.
```bash
git clone https://github.com/NVlabs/PS3
cd PS3
pip install -e .
```

<hr style="border: 2px solid gray;"></hr>


## Quick Start

Here we show example usage, including:
- loading the model
- selectively encoding a high-res image based on image saliency (bottom-up selection) and visualizing the selection probabilities
- selectively encoding a high-res image based on text prompts (top-down selection) and visualizing the selection probabilities
- formatting the encoded features into (masked) feature maps

### 1. Load Model and Image
```python
from PIL import Image
from ps3 import PS3VisionModel, PS3ImageProcessor

# Load the PS3 model and processor.
vision_model = PS3VisionModel.from_pretrained("nvidia/PS3-4K-SigLIP")
processor = PS3ImageProcessor.from_pretrained("nvidia/PS3-4K-SigLIP")
vision_model.cuda().eval()

# You can replace it with your own image.
image = Image.open("assets/test_images/dock.jpg")

# Preprocess the image.
x = processor(image)["pixel_values"][0].unsqueeze(0).cuda()
```

### 2. Encode High-Res Image with Bottom-Up Selection

PS3 can select important high-res patches based on visual saliency and encode those patches.

**You can encode the whole high-res image using PS3.**
```python
outs = vision_model(x, num_look_close="all")
features = outs.last_hidden_state
print(features.shape)  # (1, 88209, 1152)
```
Note that the PS3-4K model processes the image at multiple scales: 378 (low-res), 756, 1512, and 3780, with a patch size of 14.

The number of tokens at each scale is therefore (378/14)^2 = 729, (756/14)^2 = 2916, (1512/14)^2 = 11664, and (3780/14)^2 = 72900.

The output hidden state concatenates the tokens of all scales along the sequence dimension.
That gives 729 + 2916 + 11664 + 72900 = 88209 tokens in total.

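As a quick sanity check of the arithmetic above, you can reproduce the per-scale token counts in a few lines (a standalone sketch; the scale and patch-size numbers are taken from the description above, not queried from the model):
```python
# Sanity-check the token counts above (standalone sketch; numbers come from the text, not the model).
patch_size = 14
scales = [378, 756, 1512, 3780]  # low-res scale first, then the high-res scales
tokens_per_scale = [(s // patch_size) ** 2 for s in scales]
print(tokens_per_scale)       # [729, 2916, 11664, 72900]
print(sum(tokens_per_scale))  # 88209, matching features.shape[1] when num_look_close="all"
```
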
129
+ **You can encode parts of the high-res image by setting `num_look_close`, i.e., how many times to run the high-res selection and encoding.**
130
+ ```python
131
+ outs = vision_model(x, num_look_close=2)
132
+ features = outs.last_hidden_state
133
+ print(features.shape) # (1, 5849, 1152)
134
+ ```
135
+ In this example, it only runs the high-res selection and encoding for twice.
136
+
137
+ Note that PS3 processes at most 2560 high-res patches at a time. Then running high-res selection and encoding for twice gives us 2560 * 2 = 5120 high-res tokens. There is also 729 low-res tokens. That gives us 729 + 5120 = 5849 tokens in total.
138
+
139
+ **You can also decide how many high-res tokens to process by setting `num_token_look_close`.**
140
+ ```python
141
+ outs = vision_model(x, num_token_look_close=3000)
142
+ features = outs.last_hidden_state
143
+ print(features.shape) # (1, 3729, 1152)
144
+ ```
145
+ In this example, it only processes 3000 high-res tokens. Note that PS3 only processes 2560 high-res patches at a time. This means it needs to run the high-res selection and encoding for twice, with the first time processing 2560 high-res tokens and the second time processing 440 tokens. In the end it outputs 3729 tokens (3000 high-res + 729 low-res).
146
+
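The token-count arithmetic above can be summarized in a small helper. This is an illustrative sketch only (it is not part of the `ps3` API and ignores the clamping that happens when you ask for more passes than the image needs):
```python
# Illustrative helper mirroring the arithmetic above (not part of the ps3 API; ignores clamping).
def expected_num_tokens(num_look_close=None, num_token_look_close=None,
                        low_res_tokens=729, patches_per_pass=2560):
    if num_token_look_close is not None:
        return low_res_tokens + num_token_look_close
    return low_res_tokens + num_look_close * patches_per_pass

print(expected_num_tokens(num_look_close=2))           # 5849
print(expected_num_tokens(num_token_look_close=3000))  # 3729
```
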
**Visualize the bottom-up patch selection probabilities.**
```python
############## Helper functions for visualization ##############

# Install cv2, matplotlib, scipy for visualization purposes.
import os
os.system("pip install opencv-python matplotlib scipy")

from torchvision import transforms
import numpy as np
import cv2
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

def create_heatmap_overlay(image, heatmap, alpha=0.4, colormap=plt.cm.jet, sigma=10.0):
    if len(image.shape) == 2:
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)

    # Smooth and normalize the heatmap to [0, 1], then colorize it.
    smoothed_heatmap = gaussian_filter(heatmap.astype(np.float32), sigma=sigma)
    smoothed_heatmap = (smoothed_heatmap - smoothed_heatmap.min()) / \
                       (smoothed_heatmap.max() - smoothed_heatmap.min())
    colored_heatmap = (colormap(smoothed_heatmap) * 255).astype(np.uint8)

    if colored_heatmap.shape[-1] == 4:
        colored_heatmap = colored_heatmap[:, :, :3]

    # Blend the heatmap with the original image.
    overlay = cv2.addWeighted(image, 1 - alpha, colored_heatmap, alpha, 0)
    return Image.fromarray(overlay)

def save_visualization(selection_probs, image, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    resize_transform = transforms.Resize(image.size[::-1])
    for i, prob in enumerate(selection_probs):
        prob = (prob - prob.min()) / (prob.max() - prob.min() + 1e-6)
        prob = resize_transform(prob)
        prob = prob.squeeze(0).detach().cpu().numpy()
        # Overlay the selection probability map on the original image.
        overlay = create_heatmap_overlay(np.array(image), prob)
        overlay.save(os.path.join(output_dir, f"selection_prob_scale_{i}.png"))
    image.save(os.path.join(output_dir, "image.png"))

#################### End of helper functions ####################

selection_probs = outs.selection_probs
print([p.shape for p in selection_probs])  # [(1, 54, 54), (1, 108, 108), (1, 270, 270)]
save_visualization(selection_probs, image, "save_path/bottom_up_selection_probs")
```
`selection_probs` contains the selection probability map for each scale. In this case, the feature maps of the three high-res scales have shapes 54x54, 108x108, and 270x270. The selection probability reflects how salient/important each patch is, and patches with higher probability are selected first. You can visit the demo for more visualizations.

![Bottom-Up Selection Probabilities](assets/example_selection_maps/bottom_up_selection_prob.png)

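Beyond visualization, the probability maps can be inspected directly. For example, the following illustrative snippet (not part of the `ps3` API) lists the grid coordinates of the most salient patches at the finest scale:
```python
# Illustrative only: find the most salient patches at the finest (3780) scale.
import torch

probs = selection_probs[-1][0]  # (270, 270) bottom-up selection probabilities
topk = torch.topk(probs.flatten(), k=5)
rows = topk.indices // probs.shape[1]
cols = topk.indices % probs.shape[1]
print(list(zip(rows.tolist(), cols.tolist())))  # grid coordinates of the 5 most salient patches
```
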
### 3. Encode High-Res Image with Top-Down Selection

PS3 can also select important high-res patches based on any text prompt.

First, load the text model and encode the text prompt.
```python
from ps3 import PS3Tokenizer, PS3TextModel

tokenizer = PS3Tokenizer.from_pretrained("nvidia/PS3-4K-SigLIP")
text_model = PS3TextModel.from_pretrained("nvidia/PS3-4K-SigLIP")
text_model.cuda().eval()

text = ["A tall spire with a cross at the top of the building."]
text = tokenizer(text).cuda()
prompt = text_model(text).prompt
```

Then PS3 can select important high-res patches based on the text prompt and encode those patches.
```python
outs = vision_model(x, num_look_close=2, prompt=prompt)
features = outs.last_hidden_state
print(features.shape)  # (1, 5849, 1152)
```

You can visualize the top-down selection probabilities. Usually the regions related to the text prompt have higher selection probabilities.
```python
selection_probs = outs.selection_probs
save_visualization(selection_probs, image, "save_path/top_down_selection_probs_1")
```

![Top-Down Selection Probabilities](assets/example_selection_maps/top_down_selection_prob_1.png)

You can change to another text prompt and see different selection probabilities.
```python
text = ["A green rope on the green and red boat."]
text = tokenizer(text).cuda()
prompt = text_model(text).prompt
outs = vision_model(x, num_look_close=2, prompt=prompt)
selection_probs = outs.selection_probs
save_visualization(selection_probs, image, "save_path/top_down_selection_probs_2")
```

![Top-Down Selection Probabilities](assets/example_selection_maps/top_down_selection_prob_2.png)

### 4. Format the Encoded Features into (Masked) Feature Maps

The features returned above are the concatenation of all the low-res and high-res features.

You can format the features into masked feature maps for each scale.
```python
feature_maps = vision_model.vision_model.format_features_into_feature_maps(outs.last_hidden_state, outs.selection_maps)
print([x.shape for x in feature_maps])  # [(1, 1152, 27, 27), (1, 1152, 54, 54), (1, 1152, 108, 108), (1, 1152, 270, 270)]
```
This creates `feature_maps`, a list of masked feature maps (B * C * H * W), one for each scale. Each feature map contains the actual features for the patches selected at that scale and zero vectors for the unselected patches.

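Since unselected positions hold zero vectors, you can recover which positions were actually encoded at each scale. Here is a minimal sketch, assuming no selected patch happens to produce an exactly-zero feature vector:
```python
# Illustrative only: count encoded (non-zero) positions in each masked feature map.
for i, fmap in enumerate(feature_maps):
    selected = fmap.abs().sum(dim=1) > 0  # (B, H, W): True where a patch was selected and encoded
    print(f"scale {i}: {int(selected.sum())} / {selected[0].numel()} positions contain features")
```
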
<hr style="border: 2px solid gray;"></hr>

## Inference

[Quick Start](#quick-start) gives some examples of how to use PS3 to encode an image. Below are more detailed explanations of the arguments for model inference.

```python
class PS3VisionModel(PS3PreTrainedModel):
    ...
    def forward(
        self,
        pixel_values,
        num_look_close,
        num_token_look_close=None,
        prompt=None,
        gt_selection_maps=None,
        smooth_selection_prob=False,
        only_select_first_n_scale=None,
        is_global_text=None,
        pool_gt_token_only=False,
    ):
        ...
```
`pixel_values`: the input images with shape (B, C, H, W).

`num_look_close`: how many times to run high-res selection and encoding. PS3 selects and processes 2560 patches each time. If set to `all`, it selects all the high-res patches. If set to `0`, PS3 only returns the low-res features. If set to a larger number than needed to encode all the high-res patches, PS3 clamps it to the maximum number needed.

`num_token_look_close`: (optional) how many high-res patches to select and process. Similar to `num_look_close` but counts the number of high-res tokens instead of the number of high-res encoding runs.

`prompt`: (optional) the prompt embedding used to select high-res patches. The prompt embedding can be the embedding of some text, or an embedding output by an LLM (see paper). The shape of the prompt embedding is (B, C), where B is the batch size (same as in `pixel_values`) and C is the embedding dimension (same as the PS3 token embedding dimension). If `prompt=None`, PS3 selects high-res patches based on visual saliency (bottom-up selection).

`gt_selection_maps`: (optional) the ground-truth selection maps for the image. It should be a tensor of 0/1 values with shape (B, h, w). Regions with value 1 should be selected. When selecting high-res patches, PS3 interpolates `gt_selection_maps` to the same size as the feature map at each scale, prioritizes selecting the tokens where the value is 1, and, if there is still budget for more tokens, selects the rest based on the original selection probability.

`smooth_selection_prob`: (optional) smooth the selection probability map so that the selected patches are not distributed too sparsely in each round of high-res selection. It occasionally gives a slight improvement when selecting all the patches but usually hurts when selecting only part of them.

`only_select_first_n_scale`: (optional) only select from the first n high-res scales. For example, for the PS3-4K model, if `only_select_first_n_scale=2`, only the 756 and 1512 scales are selected and processed, and the 3780 scale is ignored.

`is_global_text`: (optional) only return the pooled low-res features. *It is only used during pre-training.*

`pool_gt_token_only`: (optional) only pool the tokens inside the ground-truth selection regions. *It is only used during pre-training.*

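As an illustration of how these arguments combine, here is a hedged sketch reusing `x` and `prompt` from the Quick Start; the region chosen for `gt_selection_maps` below is arbitrary:
```python
import torch

# Low-res features only.
outs_low = vision_model(x, num_look_close=0)

# Top-down selection restricted to the first two high-res scales (756 and 1512).
outs_topdown = vision_model(x, num_look_close=2, prompt=prompt, only_select_first_n_scale=2)

# Prioritize a manually specified region with a 0/1 ground-truth selection map.
gt_map = torch.zeros(1, 27, 27, device=x.device)
gt_map[:, :, 14:] = 1  # arbitrary example: prioritize the right half of the image
outs_gt = vision_model(x, num_look_close=2, gt_selection_maps=gt_map)
```
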
<hr style="border: 2px solid gray;"></hr>


## More Details
Please refer to the [PS3 codebase](https://github.com/NVlabs/PS3) for more details.

## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@article{shi2025scaling,
  title={Scaling Vision Pre-Training to 4K Resolution},
  author={Shi, Baifeng and Li, Boyi and Cai, Han and Lu, Yao and Liu, Sifei and Pavone, Marco and Kautz, Jan and Han, Song and Darrell, Trevor and Molchanov, Pavlo and others},
  journal={arXiv preprint arXiv:2503.19903},
  year={2025}
}
```
assets/example_selection_maps/bottom_up_selection_prob.png ADDED

Git LFS Details

  • SHA256: 497792f28e133233b02988881b2cd4600ebc9fab29fd74120697c8f527b2ed5a
  • Pointer size: 132 Bytes
  • Size of remote file: 1.21 MB
assets/example_selection_maps/top_down_selection_prob_1.png ADDED

Git LFS Details

  • SHA256: d8f193d0a1a8065070b86cefa41dcefd27389cf88693a7b272547b8a98a2001e
  • Pointer size: 132 Bytes
  • Size of remote file: 1.2 MB
assets/example_selection_maps/top_down_selection_prob_2.png ADDED

Git LFS Details

  • SHA256: 0e191dc01285341f6e353e147ed88f492a74cd282135a2e6bc3af6ee2741c54f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.18 MB
assets/test_images/dock.jpg ADDED

Git LFS Details

  • SHA256: 2c35ed6357e5eed620bbcfda31315dead1e9ad8b6a4f324131705f61f489d99d
  • Pointer size: 132 Bytes
  • Size of remote file: 1.71 MB