ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World


We introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments for diverse exploration. Given only screenshots and a fixed instruction that encourages exploration, the model learns to interact effectively with the screen environment.
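For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage at its core: each rollout's reward is normalized against the mean and standard deviation of a group of rollouts sampled for the same state. This is a minimal, generic illustration, not the exact reward or training code used for ScreenExplorer.

```python
# Minimal sketch of the group-relative advantage used in GRPO.
# Assumes a group of scalar rewards collected for rollouts from the same
# screenshot/instruction; function and variable names are illustrative.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each rollout's reward against its group mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts sampled for one screen state.
print(grpo_advantages(np.array([0.2, 0.8, 0.5, 0.1])))
```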

This repo contains the LoRA checkpoints saved during the training of ScreenExplorer-3B-E1 and ScreenExplorer-7B-E1, as well as the LoRA checkpoints of ScreenExplorer-3B-Distill.
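A minimal sketch of loading one of these LoRA checkpoints with PEFT is shown below. The base model name is an assumption (a Qwen2.5-VL Instruct model matching the 3B/7B variant), and the checkpoints in this repo may live in per-run or per-step subfolders, so adjust the adapter path to the repository layout.

```python
# Sketch: attach a ScreenExplorer LoRA adapter to an assumed Qwen2.5-VL base model.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed base model; match the 3B/7B variant
adapter_id = "niurl/ScreenExplorer"            # this repository (adapter may be in a subfolder)

# Load the base VLM and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model_id)

# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
```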

Citation

@misc{niu2025screenexplorertrainingvisionlanguagemodel,
      title={ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World}, 
      author={Runliang Niu and Jinglong Ji and Yi Chang and Qi Wang},
      year={2025},
      eprint={2505.19095},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.19095}, 
}