I'd like to know which output format the model is better at returning:
the bounding box of the UI element, or the center coordinates of the element? And should the coordinates be normalized?
We haven't done a rigorous ablation study on this. But based on SeeClick's experience, at least within a certain model scale, I think points work better than boxes. I chose points because such data are easier to obtain, and the point does not need to fall perfectly at the center of the element (for example, the point data in PixMo is actually quite noisy).
And in the long run, vision-only agents should ideally go beyond element-based interactions, for example smoothly clicking and dragging the cursor inside a text box, working in Google Docs and Slides, etc. In those cases, we will need truly visually grounded agents that can act accurately and freely beyond boxed elements.
Even Operator is not perfect at this point.
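For concreteness, here is a minimal sketch of how box annotations could be turned into point targets with a bit of noise; the 0-1000 coordinate grid, the jitter amount, and the helper name are illustrative assumptions, not a description of UGround-V1's actual data pipeline.

```python
# Illustrative sketch only: convert a bounding-box annotation into a point target.
# The 0-1000 grid and the jitter are assumptions for illustration.
import random

def box_to_point(box, img_w, img_h, jitter=0.1, scale=1000):
    """box = (x1, y1, x2, y2) in pixels; returns a point on a 0..scale grid."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    # The point need not be the exact center: add small noise within the box,
    # mimicking the imperfect point labels mentioned above (e.g. PixMo-style data).
    cx += random.uniform(-jitter, jitter) * (x2 - x1)
    cy += random.uniform(-jitter, jitter) * (y2 - y1)
    return round(cx / img_w * scale), round(cy / img_h * scale)

# Example: an 80x40 px button on a 1280x720 screenshot.
print(box_to_point((120, 40, 200, 80), img_w=1280, img_h=720))
```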
I understand now.
So is it better to output normalized coordinates or absolute coordinates? Do I need to explicitly specify normalized or absolute coordinates in the prompt?
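(For reference, the two conventions in the question differ only by a rescaling with the image size; a minimal sketch, assuming a 0-1000 normalized grid, which is not necessarily what UGround-V1 uses:)

```python
# Sketch of converting between normalized and absolute coordinates;
# the 0-1000 grid is an assumed convention for illustration only.
def normalized_to_absolute(x, y, img_w, img_h, scale=1000):
    return x / scale * img_w, y / scale * img_h

def absolute_to_normalized(x, y, img_w, img_h, scale=1000):
    return round(x / img_w * scale), round(y / img_h * scale)

print(normalized_to_absolute(500, 500, img_w=1920, img_h=1080))  # (960.0, 540.0)
```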
One more thing I want to know: Qwen2-VL, as the base model, does not restrict the input image size. So for the UGround-V1 model, does inference performance remain unaffected even without a size limit on the input?
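In case it helps, Qwen2-VL's Hugging Face processor exposes `min_pixels`/`max_pixels` to cap the dynamic-resolution input; whether and how UGround-V1 constrains resolution is exactly the question above, so the model id and pixel budgets below are illustrative only.

```python
# Sketch of capping Qwen2-VL's dynamic-resolution input via the Hugging Face processor.
# The pixel budgets and model id are illustrative; this is not a statement about
# how UGround-V1 was trained or should be served.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,   # lower bound on the image pixel budget
    max_pixels=1280 * 28 * 28,  # upper bound; larger screenshots are rescaled down
)
```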