I'd like to know which output format the model is better at returning:
the bounding box of the UI element, or the center coordinates of the element? And should the coordinates be normalized?
We haven't done a rigorous ablation study on this. But based on SeeClick's experience, at least within a certain model scale, I think points work better than boxes. I chose points because such data are easier to obtain, and the point does not need to fall perfectly at the center of the element (for example, the point data in PixMo is actually quite noisy).
And in the long run, vision-only agents should ideally go beyond element-based interactions, for example smoothly clicking and dragging the cursor inside a text box, working in Google Docs and Slides, etc. In those cases, we will need truly visually grounded agents that can act accurately and freely beyond boxed elements.
Even Operator is not perfect at this point.
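For concreteness, here is a minimal sketch of how box annotations could be turned into point targets with a bit of noise; the 0-1000 coordinate grid, the jitter amount, and the helper name are illustrative assumptions, not a description of UGround-V1's actual data pipeline.

```python
# Illustrative sketch only: convert a bounding-box annotation into a point target.
# The 0-1000 grid and the jitter are assumptions for illustration.
import random

def box_to_point(box, img_w, img_h, jitter=0.1, scale=1000):
    """box = (x1, y1, x2, y2) in pixels; returns a point on a 0..scale grid."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    # The point need not be the exact center: add small noise within the box,
    # mimicking the imperfect point labels mentioned above (e.g. PixMo-style data).
    cx += random.uniform(-jitter, jitter) * (x2 - x1)
    cy += random.uniform(-jitter, jitter) * (y2 - y1)
    return round(cx / img_w * scale), round(cy / img_h * scale)

# Example: an 80x40 px button on a 1280x720 screenshot.
print(box_to_point((120, 40, 200, 80), img_w=1280, img_h=720))
```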
I understand now.
So is it better to output normalized coordinates or absolute coordinates? Do I need to explicitly specify normalized or absolute coordinates in the prompt?
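(For reference, the two conventions in the question differ only by a rescaling with the image size; a minimal sketch, assuming a 0-1000 normalized grid, which is not necessarily what UGround-V1 uses:)

```python
# Sketch of converting between normalized and absolute coordinates;
# the 0-1000 grid is an assumed convention for illustration only.
def normalized_to_absolute(x, y, img_w, img_h, scale=1000):
    return x / scale * img_w, y / scale * img_h

def absolute_to_normalized(x, y, img_w, img_h, scale=1000):
    return round(x / img_w * scale), round(y / img_h * scale)

print(normalized_to_absolute(500, 500, img_w=1920, img_h=1080))  # (960.0, 540.0)
```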
One more thing I want to know: Qwen2-VL, as the base model, does not restrict the input image size. So for the UGround-V1 model, does inference performance remain unaffected even without a size limit on the input?
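In case it helps, Qwen2-VL's Hugging Face processor exposes `min_pixels`/`max_pixels` to cap the dynamic-resolution input; whether and how UGround-V1 constrains resolution is exactly the question above, so the model id and pixel budgets below are illustrative only.

```python
# Sketch of capping Qwen2-VL's dynamic-resolution input via the Hugging Face processor.
# The pixel budgets and model id are illustrative; this is not a statement about
# how UGround-V1 was trained or should be served.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,   # lower bound on the image pixel budget
    max_pixels=1280 * 28 * 28,  # upper bound; larger screenshots are rescaled down
)
```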