|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
tags: |
|
- multimodal |
|
- action |
|
- agent |
|
--- |
|
# Holo1-7B |
|
|
|
## Model Description |
|
|
|
Holo1 is an Action Vision-Language Model (VLM) developed by [HCompany](https://www.hcompany.ai/) for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user. |
|
|
|
As part of a broader agentic architecture, Holo1 acts as a policy, localizer, or validator, helping the agent understand and act in digital environments. |
|
|
|
Trained on a mix of open-access, synthetic, and self-generated data, Holo1 enables state-of-the-art (SOTA) performance on the [WebVoyager](https://arxiv.org/pdf/2401.13919) benchmark, offering the best accuracy/cost tradeoff among current models. |
|
It also excels in UI localization tasks such as [Screenspot](https://huggingface.co/datasets/rootsautomation/ScreenSpot), [Screenspot-V2](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2), [Screenspot-Pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), [GroundUI-Web](https://huggingface.co/datasets/agent-studio/GroundUI-1K), and our own newly introduced |
|
benchmark [WebClick](https://huggingface.co/datasets/Hcompany/WebClick). |
|
|
|
Holo1 is optimized for both accuracy and cost-efficiency, making it a strong open-source alternative to existing VLMs. |
|
|
|
For more details, check out our paper and our blog post, linked below.
|
|
|
- **Developed by:** [HCompany](https://www.hcompany.ai/) |
|
- **Model type:** Action Vision-Language Model |
|
- **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct |
|
- **Paper:** https://arxiv.org/abs/2506.02865 |
|
- **Blog Post:** https://www.hcompany.ai/surfer-h |
|
- **License:** Apache 2.0 |
|
|
|
## Results |
|
|
|
### Surfer-H: Pareto-Optimal Performance on [WebVoyager](https://arxiv.org/pdf/2401.13919) |
|
|
|
Surfer-H is designed to be flexible and modular. It is composed of three independent components: |
|
- A Policy model that plans, decides, and drives the agent's behavior |
|
- A Localizer model that sees and understands visual UIs to drive precise interactions |
|
- A Validator model that checks whether the answer is valid |
|
|
|
The agent thinks before acting, takes notes, and can retry if its answer is rejected. It can operate with different models for each module, allowing for tradeoffs between accuracy, speed, and cost. |
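The loop below is only an illustrative sketch of this three-component design, not the actual Surfer-H implementation; all names (`policy_step`, `localize`, `validate`, `browser`) are hypothetical placeholders for the policy, localizer, validator, and browser layer described above.

```python
# Illustrative sketch of the Surfer-H loop (not the official implementation).
# `policy_step`, `localize`, `validate`, and `browser` are hypothetical placeholders.
def surfer_h(task: str, browser, policy_step, localize, validate, max_steps: int = 30):
    notes: list[str] = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()
        # 1. Policy: think, take notes, and decide the next action.
        step = policy_step(task, screenshot, notes)
        notes.append(step.note)
        if step.is_answer():
            # 3. Validator: accept the answer, or reject it so the agent retries.
            if validate(task, step.answer, screenshot):
                return step.answer
            continue
        # 2. Localizer: turn the intended UI element into click coordinates.
        x, y = localize(screenshot, step.element)
        browser.click(x, y)
    return None
```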
|
|
|
We evaluated Surfer-H on the [WebVoyager](https://arxiv.org/pdf/2401.13919) benchmark: 643 real-world web tasks ranging from retrieving prices to finding news or scheduling events. |
|
|
|
<div style="text-align: center;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/kO_4DlW_O45Wi7eK9-r8v.png" width="800"/> |
|
</div> |
|
|
|
We’ve tested multiple configurations, from GPT-4-powered agents to 100% open Holo1 setups. Among them, the fully Holo1-based agents offered the strongest tradeoff between accuracy and cost: |
|
- Surfer-H + Holo1-7B: 92.2% accuracy at $0.13 per task |
|
- Surfer-H + GPT-4.1: 92.0% at $0.54 per task |
|
- Surfer-H + Holo1-3B: 89.7% at $0.11 per task |
|
- Surfer-H + GPT-4.1-mini: 88.8% at $0.26 per task |
|
|
|
This places Holo1-powered agents on the Pareto frontier, delivering the best accuracy per dollar. |
|
Unlike other agents that rely on custom APIs or brittle wrappers, Surfer-H operates purely through the browser — just like a real user. Combined with Holo1, it becomes a powerful, general-purpose, cost-efficient web automation system. |
|
|
|
### Holo1: State-of-the-Art UI Localization |
|
|
|
A key skill for the real-world utility of our VLMs within agents is localization: the ability to identify precise |
|
coordinates on a user interface (UI) so the agent can interact with it to complete a task or follow an instruction. To assess
|
this capability, we evaluated our Holo1 models on several established localization benchmarks, including |
|
[Screenspot](https://huggingface.co/datasets/rootsautomation/ScreenSpot), [Screenspot-V2](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2), [Screenspot-Pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), [GroundUI-Web](https://huggingface.co/datasets/agent-studio/GroundUI-1K), and our own newly introduced |
|
benchmark [WebClick](https://huggingface.co/datasets/Hcompany/WebClick). |
|
|
|
<div style="text-align: center;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/UutD2Meevd5Xw0_mhX2wK.png" width="600"/> |
|
</div> |
|
|
|
<div style="text-align: center;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/NhzkB8xnEQYMqiGxPnJSt.png" width="600"/> |
|
</div> |
|
|
|
## Get Started with the Model |
|
|
|
We provide two Hugging Face Spaces to experiment with localization and navigation:
|
- https://huggingface.co/spaces/Hcompany/Holo1-Navigation |
|
- https://huggingface.co/spaces/Hcompany/Holo1-Localization |
|
|
|
We provide starter code for the localization task, i.e. image + instruction -> click coordinates.
|
|
|
We also provide code to reproduce screenspot evaluations: screenspot_eval.py |
|
|
|
### Prepare model, processor |
|
|
|
Holo1 models are based on the Qwen2.5-VL architecture, which is supported by transformers. Here we provide a simple usage example.
|
You can load the model and the processor as follows: |
|
|
|
```python |
|
from typing import Any
|
|
|
from transformers import AutoModelForImageTextToText, AutoProcessor |
|
|
|
# default: Load the model on the available device(s) |
|
# We recommend enabling flash_attention_2 for better acceleration and memory saving. |
|
model = AutoModelForImageTextToText.from_pretrained( |
|
"Hcompany/Holo1-7B", |
|
torch_dtype="auto", |
|
# torch_dtype=torch.bfloat16, |
|
# attn_implementation="flash_attention_2", |
|
device_map="auto", |
|
) |
|
|
|
# default processor |
|
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B") |
|
# The default range for the number of visual tokens per image in the model is 4-1280. |
|
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost. |
|
# processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B", min_pixels=min_pixels, max_pixels=max_pixels)
|
|
|
# Helper function to run inference.
# Note: it reads the global `image` prepared in the "Prepare image and instruction" section below.
def run_inference(messages: list[dict[str, Any]]) -> list[str]:
|
# Preparation for inference |
|
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
inputs = processor( |
|
text=[text], |
|
images=image, |
|
padding=True, |
|
return_tensors="pt", |
|
) |
|
    inputs = inputs.to(model.device)  # move inputs to the same device as the model
|
|
|
generated_ids = model.generate(**inputs, max_new_tokens=128) |
|
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)] |
|
return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False) |
|
``` |
|
|
|
### Prepare image and instruction |
|
|
|
WARNING: Holo1 predicts absolute coordinates (in pixels), while the Hugging Face processor resizes images internally. To keep the predicted coordinates consistent with the image you display or act on, resize the image with smart_resize first.
|
|
|
```python |
|
import requests

from PIL import Image
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
|
|
|
# Prepare image and instruction |
|
image_url = "https://huggingface.co/Hcompany/Holo1-7B/resolve/main/calendar_example.jpg" |
|
image = Image.open(requests.get(image_url, stream=True).raw) |
|
|
|
# Resize the image so that predicted absolute coordinates match the size of the image. |
|
image_processor = processor.image_processor |
|
resized_height, resized_width = smart_resize( |
|
image.height, |
|
image.width, |
|
factor=image_processor.patch_size * image_processor.merge_size, |
|
min_pixels=image_processor.min_pixels, |
|
max_pixels=image_processor.max_pixels, |
|
) |
|
image = image.resize(size=(resized_width, resized_height), resample=None) # type: ignore |
|
``` |
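The model's click coordinates refer to this resized image. If you need to act on the original, full-resolution screenshot, you can map a prediction back with a per-axis scale factor. Below is a minimal sketch (not part of the official starter code), assuming `original_width` / `original_height` are the image dimensions before the resize above:

```python
# Minimal sketch: map a click predicted on the smart_resized image back to the
# original screenshot resolution. `original_width` / `original_height` are the
# dimensions before resizing; `resized_width` / `resized_height` come from smart_resize.
def to_original_coordinates(
    x: int,
    y: int,
    original_width: int,
    original_height: int,
    resized_width: int,
    resized_height: int,
) -> tuple[int, int]:
    # Scale each axis independently, since smart_resize may change the aspect ratio slightly.
    return (
        round(x * original_width / resized_width),
        round(y * original_height / resized_height),
    )
```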
|
|
|
### Navigation with Structured Output |
|
|
|
```python |
|
import json |
|
from . import navigation |
|
|
|
task = "Book a hotel in Paris on August 3rd for 3 nights" |
|
prompt = navigation.get_navigation_prompt(task, image, step=1) |
|
navigation_str = run_inference(prompt)[0] |
|
navigation_step = navigation.NavigationStep(**json.loads(navigation_str))
print(navigation_step)
|
# Expected NavigationStep(note='', thought='I need to select the check-out date as August 3rd and then proceed to search for hotels.', action=ClickElementAction(action='click_element', element='August 3rd on the calendar', x=777, y=282)) |
|
``` |
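The predicted step can then be handed to whatever browser layer you use. The sketch below is hypothetical and not part of this repository: `browser` and its `click` method are placeholders, and only the `click_element` action from the example above is handled.

```python
# Hypothetical sketch: execute a predicted navigation step with your own browser layer.
# `browser` and its `click` method are placeholders, not part of this repository.
def execute_step(step: navigation.NavigationStep, browser) -> None:
    action = step.action
    if action.action == "click_element":
        # Coordinates are absolute pixels on the smart_resized screenshot.
        browser.click(action.x, action.y)
    else:
        raise NotImplementedError(f"Unhandled action type: {action.action}")
```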
|
|
|
### Localization with click(x, y) |
|
|
|
```python |
|
from . import localization |
|
|
|
instruction = "Select July 14th as the check-out date" |
|
prompt = localization.get_localization_prompt(image, instruction) |
|
coordinates = run_inference(prompt)[0] |
|
print(coordinates) |
|
# Expected Click(352, 348) |
|
``` |
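The plain-text localization prompt returns the click as a string such as `Click(352, 348)`. If you prefer to work with integers directly, a small parser is enough; this is a minimal sketch, assuming the output always follows the `Click(x, y)` format shown above:

```python
import re


def parse_click(output: str) -> tuple[int, int]:
    # Extract the two integers from a string like "Click(352, 348)".
    match = re.search(r"Click\((\d+),\s*(\d+)\)", output)
    if match is None:
        raise ValueError(f"Unexpected localization output: {output!r}")
    return int(match.group(1)), int(match.group(2))


x, y = parse_click(coordinates)
```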
|
|
|
### Localization with Structured Output |
|
|
|
We trained Holo1 as an Action VLM with extensive use of JSON and tool calls. As a result, it can be queried reliably with structured output:
|
|
|
```python |
|
import json |
|
from . import localization |
|
|
|
instruction = "Select July 14th as the check-out date" |
|
prompt = localization.get_localization_prompt_structured_output(image, instruction) |
|
coordinates_structured_str = run_inference(prompt)[0] |
|
coordinates_structured = localization.ClickAction(**json.loads(coordinates_structured_str)) |
|
print(coordinates_structured) |
|
# Expected ClickAction(action='click', x=352, y=340) |
|
``` |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
``` |
|
@misc{andreux2025surferhmeetsholo1costefficient, |
|
title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights}, |
|
author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu}, |
|
year={2025}, |
|
eprint={2506.02865}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.AI}, |
|
url={https://arxiv.org/abs/2506.02865}, |
|
} |
|
``` |
|
|
|
|