UGround
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (ICLR'25 Oral)
osunlp/UGround-V1-Data
Dataset • Note: The training data used in the paper
osunlp/UGround-V1-Data-Box
Dataset • Note: Data with bounding box coordinates
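As a reference for the two dataset repos above, here is a minimal sketch of loading them with the Hugging Face `datasets` library. The repo ids are taken from this collection; the `"train"` split name and streaming mode are assumptions, and the exact schema should be checked on the dataset cards.

```python
# Minimal sketch (assumptions noted above): stream the UGround training data.
from datasets import load_dataset

data = load_dataset("osunlp/UGround-V1-Data", split="train", streaming=True)
data_box = load_dataset("osunlp/UGround-V1-Data-Box", split="train", streaming=True)

# Inspect the fields of one example from each variant.
print(next(iter(data)).keys())
print(next(iter(data_box)).keys())
```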
osunlp/UGround-V1-2B
Image-Text-to-Text • 2B • Note: Based on Qwen2-VL-2B-Instruct
osunlp/UGround-V1-7B
Image-Text-to-Text • 8B • Note: Based on Qwen2-VL-7B-Instruct
osunlp/UGround-V1-72B
Image-Text-to-Text • 73B • Note: Based on Qwen2-VL-72B-Instruct. Full training without LoRA.
osunlp/UGround-V1-72B-Preview
Image-Text-to-Text • 73B • Note: Based on Qwen2-VL-72B-Instruct. Trained with LoRA.
osunlp/UGround
Image-Text-to-Text • 7B • Note: The initial model, based on the modified LLaVA architecture (CLIP + Vicuna-7B) described in the paper
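For the Qwen2-VL-based V1 checkpoints above (the initial osunlp/UGround uses the LLaVA-style stack noted in its entry and is loaded differently), here is a minimal usage sketch. It assumes the checkpoints load with the standard Qwen2-VL classes in Transformers and that a plain image-plus-instruction chat prompt is sufficient; the precise prompt template and coordinate output format should be taken from the model cards.

```python
# Minimal sketch (not the official inference code): ask UGround-V1-2B where a
# GUI element is in a screenshot. File name and query below are hypothetical.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "osunlp/UGround-V1-2B"  # smallest V1 checkpoint in this collection
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

screenshot = Image.open("screenshot.png")   # hypothetical GUI screenshot
query = "the search button in the top bar"  # hypothetical element description

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": query},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (the predicted location of the element).
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)
```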
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Paper • 2410.05243 • Note: A low-cost, scalable, and effective data synthesis pipeline for GUI visual grounding; UGround, a SOTA GUI visual grounding model; SeeAct-V, a purely vision-only (modular) GUI agent framework; and the first demonstration of SOTA performance by vision-only GUI agents.
UGround
📱 Space • Note: Paused. Will open a new one for the Qwen2-VL-based UGround
UGround-V1-2B
📱 Space • Note: Paused. Trying to figure out how to accelerate the inference.