Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H

Published June 3, 2025

Today, at H Company, we are releasing Holo1, a family of Action Vision Language Models (VLMs), and WebClick, a new multimodal localization benchmark, on the Hugging Face Hub.

Surfer-H, a web-native agent that interacts with the browser like a human, relies on Holo1.

Technical Report

Holo1

Holo1 is the first family of open-source Action VLMs designed specifically for deep web UI understanding and precise localization. The family includes the Holo1-3B and Holo1-7B models, with the latter achieving 76.2% average accuracy on common UI localization benchmarks, the highest among small-size models. H Company has released these models as open source on Hugging Face, along with the WebClick benchmark containing 1,639 human-like UI tasks.
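If you want to explore the benchmark itself, it can be loaded with the datasets library. The snippet below is a minimal sketch: the dataset id Hcompany/WebClick, the split name, and the field names are assumptions, so check the dataset card on the Hub for the exact values.

from datasets import load_dataset

# Assumed dataset id and split; verify on the WebClick dataset card.
webclick = load_dataset("Hcompany/WebClick", split="test")
print(len(webclick))       # expected: 1,639 human-like UI tasks
print(webclick[0].keys())  # inspect available fields (screenshot, instruction, ...)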

Use with Transformers

Holo1 models are based on the Qwen2.5-VL architecture and are fully compatible with Transformers. Here is a simple usage example. Load the model and the processor as follows:

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

# Load the model in bfloat16. flash_attention_2 requires the flash-attn
# package and a compatible GPU; drop the argument if it is unavailable.
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("Hcompany/Holo1-3B")

Next, define the image, the localization prompt, and the chat messages.

image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg"

guidelines = "Localize an element on the GUI image according to my instructions and output a click position as Click(x, y) with x num pixels from the left edge and y num pixels from the top edge."
instruction = "Select July 14th as the check-out date"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": image_url,
            },
            {"type": "text", "text": f"{guidelines}\n{instruction}"},
        ],
    }
]


# apply_chat_template tokenizes the conversation, fetches and preprocesses
# the image, and returns tensors ready for generation.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

We can now run inference and decode only the newly generated tokens.

generated_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens so that only the model's answer is decoded.
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
decoded = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded[0])
# Click(352, 348)
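Since the model is prompted to answer in the Click(x, y) format, the coordinates can be recovered with a small parser. The parse_click helper below is our own illustration, not part of the Holo1 API.

import re

def parse_click(text: str) -> tuple[int, int] | None:
    # Extract the first "Click(x, y)" occurrence from the model output.
    match = re.search(r"Click\((\d+),\s*(\d+)\)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_click("Click(352, 348)"))  # (352, 348)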

Surfer-H

Web automation represents one of AI's most practical applications for businesses, but until now, solutions have often sacrificed cost-efficiency for performance. By making our Holo1 Action Models available on Hugging Face, users can now build web automation solutions that achieve 92.2% accuracy on real-world web tasks at only $0.13 per task.

Surfer-H relies on the Holo1 family of open-weights models. It is a modular architecture for complete web task automation, which performs reading, thinking, clicking, scrolling, typing, and validating. It is composed of three independent components: a Policy model that plans and drives the agent's behavior, a Localizer model that understands visual UIs for precise interactions, and a Validator model that confirms whether tasks are completed successfully. Unlike other agents that rely on custom APIs or brittle wrappers, Surfer-H operates purely through the browser, just like a real user. A simplified sketch of this loop is shown below.
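To make the division of labor concrete, here is a highly simplified, hypothetical sketch of such a three-component loop. All class and method names (policy, localizer, validator, browser, and their methods) are illustrative assumptions, not the actual Surfer-H implementation; see the technical report for the real design.

# Illustrative pseudocode for a Policy / Localizer / Validator agent loop.
def run_task(task: str, policy, localizer, validator, browser, max_steps: int = 30):
    for _ in range(max_steps):
        screenshot = browser.screenshot()

        # 1. Policy: decide the next high-level action from the task and screenshot.
        action = policy.next_action(task, screenshot)  # e.g. "click 'check-out date'"

        if action.kind == "answer":
            # 2. Validator: confirm the task is actually done before returning.
            if validator.is_complete(task, screenshot, action.answer):
                return action.answer
            continue  # validation failed: keep acting

        if action.kind == "click":
            # 3. Localizer (Holo1): turn the element description into pixel coordinates.
            x, y = localizer.locate(screenshot, action.target)
            browser.click(x, y)
        elif action.kind == "type":
            browser.type(action.text)
        elif action.kind == "scroll":
            browser.scroll(action.direction)

    return None  # gave up after max_steps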

Together, these solutions represent a new frontier in web automation, achieving state-of-the-art localization performance and setting the Pareto frontier in cost-efficient web navigation on the WebVoyager benchmark (see the technical report for the accuracy-versus-cost plot).

We're looking forward to seeing what you'll build with Holo1! Let's meet in the discussion tab of this blog post and the model repository!

Citation

@misc{andreux2025surferhmeetsholo1costefficient,
      title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights}, 
      author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
      year={2025},
      eprint={2506.02865},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.02865}, 
}
