Any documentation for the inputs to the vision model?
#2, opened by PaulTheHuman
For the vision part of the model only (phi-3-v-128k-instruct-vision.onnx), is there any documentation for the inputs?
For example, it has two inputs: image_sizes and pixel_values.
One dimension of pixel_values is named max_num_crops. Can you explain what this number does, and if it is set to more than 1, what data should be sent in pixel_values? Is it a selection of crops from the same image? I'm a bit confused by this.
Also, what are the restrictions on image size? Do the width and height have to be multiples of 336 pixels? What are the smallest and largest sizes allowed?
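For anyone who wants to check the declared input names and shapes themselves, here is a minimal sketch. It assumes the onnxruntime Python package is installed and the .onnx file (plus any external weight files it needs) is available locally. The helper names describe_inputs and format_input are made up for illustration. It only reports what the graph declares and does not answer what max_num_crops means.

```python
from typing import List, Optional, Sequence, Union

# ONNX dimensions are either fixed ints, symbolic names (str), or unknown (None).
Dim = Union[int, str, None]


def format_input(name: str, shape: Sequence[Dim], dtype: str) -> str:
    """Render one input as 'name: dtype[dim0, dim1, ...]'; unknown dims show as '?'."""
    dims = ", ".join("?" if d is None else str(d) for d in shape)
    return f"{name}: {dtype}[{dims}]"


def describe_inputs(model_path: str) -> List[str]:
    """Return one formatted line per input declared by the ONNX model at model_path."""
    # Imported lazily so format_input stays usable without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return [format_input(i.name, i.shape, i.type) for i in session.get_inputs()]
```

Calling describe_inputs("phi-3-v-128k-instruct-vision.onnx") and printing each returned line should list image_sizes and pixel_values along with their declared dimensions, including any symbolic ones.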
Could someone please share the steps to convert the Microsoft/Phi-3.5-Vision-Instruct model to ONNX-DirectML? Any insights or guidance would be greatly appreciated!