---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- action
- agent
---
# Holo1-7B

## Model Description

Holo1 is an Action Vision-Language Model (VLM) developed by [HCompany](https://www.hcompany.ai/) for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user.

As part of a broader agentic architecture, Holo1 acts as a policy, localizer, or validator, helping the agent understand and act in digital environments.

Trained on a mix of open-access, synthetic, and self-generated data, Holo1 enables state-of-the-art (SOTA) performance on the [WebVoyager](https://arxiv.org/pdf/2401.13919) benchmark, offering the best accuracy/cost tradeoff among current models.
It also excels in UI localization tasks such as [Screenspot](https://huggingface.co/datasets/rootsautomation/ScreenSpot), [Screenspot-V2](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2), [Screenspot-Pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), [GroundUI-Web](https://huggingface.co/datasets/agent-studio/GroundUI-1K), and our own newly introduced
benchmark [WebClick](https://huggingface.co/datasets/Hcompany/WebClick).

Holo1 is optimized for both accuracy and cost-efficiency, making it a strong open-source alternative to existing VLMs.

For more details, see our paper and blog post.

- **Developed by:** [HCompany](https://www.hcompany.ai/)
- **Model type:** Action Vision-Language Model
- **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct
- **Paper:** https://arxiv.org/abs/2506.02865
- **Blog Post:** https://www.hcompany.ai/surfer-h
- **License:** Apache 2.0

## Results

### Surfer-H: Pareto-Optimal Performance on [WebVoyager](https://arxiv.org/pdf/2401.13919)

Surfer-H is designed to be flexible and modular. It is composed of three independent components:
- A Policy model that plans, decides, and drives the agent's behavior
- A Localizer model that sees and understands visual UIs to drive precise interactions
- A Validator model that checks whether the answer is valid

The agent thinks before acting, takes notes, and can retry if its answer is rejected. It can operate with different models for each module, allowing for tradeoffs between accuracy, speed, and cost.
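
As a rough, purely illustrative sketch of this modularity (hypothetical names and signatures, not the actual Surfer-H implementation), each module can be thought of as an independently swappable model:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SurferHConfig:
    """Hypothetical sketch: each module is an independent, swappable model."""
    policy: Callable[..., str]                 # plans, decides, and drives the agent's behavior
    localizer: Callable[..., tuple[int, int]]  # grounds an element description to (x, y) on a screenshot
    validator: Callable[..., bool]             # accepts or rejects the agent's answer, triggering a retry

# Mixing models per module (e.g. a GPT-4.1 policy with a Holo1-7B localizer,
# or a fully Holo1-based setup) trades off accuracy, speed, and cost.
```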

We evaluated Surfer-H on the [WebVoyager](https://arxiv.org/pdf/2401.13919) benchmark: 643 real-world web tasks ranging from retrieving prices to finding news or scheduling events.

<div style="text-align: center;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/kO_4DlW_O45Wi7eK9-r8v.png" width="800"/>
</div>

We tested multiple configurations, from GPT-4.1-powered agents to fully open Holo1 setups. Among them, the fully Holo1-based agents offered the strongest tradeoff between accuracy and cost:
- Surfer-H + Holo1-7B: 92.2% accuracy at $0.13 per task
- Surfer-H + GPT-4.1: 92.0% at $0.54 per task
- Surfer-H + Holo1-3B: 89.7% at $0.11 per task
- Surfer-H + GPT-4.1-mini: 88.8% at $0.26 per task

This places Holo1-powered agents on the Pareto frontier, delivering the best accuracy per dollar.
Unlike other agents that rely on custom APIs or brittle wrappers, Surfer-H operates purely through the browser — just like a real user. Combined with Holo1, it becomes a powerful, general-purpose, cost-efficient web automation system.

### Holo1: State-of-the-Art UI Localization

A key capability for the real-world utility of our VLMs within agents is localization: identifying the precise
coordinates on a user interface (UI) that must be interacted with to complete a task or follow an instruction. To assess
this capability, we evaluated our Holo1 models on several established localization benchmarks, including
[Screenspot](https://huggingface.co/datasets/rootsautomation/ScreenSpot), [Screenspot-V2](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2), [Screenspot-Pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), [GroundUI-Web](https://huggingface.co/datasets/agent-studio/GroundUI-1K), and our own newly introduced
benchmark [WebClick](https://huggingface.co/datasets/Hcompany/WebClick).

<div style="text-align: center;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/UutD2Meevd5Xw0_mhX2wK.png" width="600"/>
</div>

<div style="text-align: center;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/NhzkB8xnEQYMqiGxPnJSt.png" width="600"/>
</div>

## Get Started with the Model

We provide two Hugging Face Spaces to experiment with navigation and localization:
 - https://huggingface.co/spaces/Hcompany/Holo1-Navigation
 - https://huggingface.co/spaces/Hcompany/Holo1-Localization

We provide starter code for the localization task, i.e. image + instruction -> click coordinates.

We also provide code to reproduce the Screenspot evaluations: screenspot_eval.py

### Prepare the model and processor

Holo1 models are based on the Qwen2.5-VL architecture, which is supported in transformers. Here we provide a simple usage example.
You can load the model and the processor as follows:

```python
from typing import Any

from transformers import AutoModelForImageTextToText, AutoProcessor

# Default: load the model on the available device(s).
# We recommend enabling flash_attention_2 for better acceleration and memory savings.
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-7B",
    torch_dtype="auto",
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B")
# The default range for the number of visual tokens per image is 4-1280.
# You can set min_pixels and max_pixels to balance performance and cost,
# e.g. a token range of 256-1280:
# processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B", min_pixels=min_pixels, max_pixels=max_pixels)

# Helper function to run inference.
# Note: `image` is the smart_resize'd PIL image prepared in the next section.
def run_inference(messages: list[dict[str, Any]]) -> list[str]:
    # Build the chat-formatted prompt and batch it with the image.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text],
        images=image,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Strip the prompt tokens so only the newly generated text is decoded.
    generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
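
The repository's navigation.py and localization.py helpers (used below) build the prompts Holo1 was trained on. If you only want to sanity-check `run_inference`, a message in the standard Qwen2.5-VL chat format looks like the sketch below; the instruction wording is illustrative and not one of the trained prompt templates:

```python
# Illustrative message only; `image` is the smart_resize'd PIL image prepared
# in the next section and passed to the processor inside run_inference.
example_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Select July 14th as the check-out date"},
        ],
    }
]
# raw_output = run_inference(example_messages)[0]
```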

### Prepare image and instruction

WARNING: Holo1 predicts absolute coordinates (in pixels), while the Hugging Face processor resizes images internally. For the predicted coordinates to match the image you display or act on, resize the image with smart_resize first.

```python
import requests
from PIL import Image
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# Prepare image and instruction
image_url = "https://huggingface.co/Hcompany/Holo1-7B/resolve/main/calendar_example.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Resize the image so that predicted absolute coordinates match the size of the image.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height))
```
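
If you need to act at the page's original resolution (for example, to drive a real browser), the predicted coordinates have to be scaled back from the resized image to the original one. A minimal sketch, assuming you stored `original_width, original_height = image.size` before the resize above:

```python
# Hypothetical helper: map a click predicted on the resized image back to the
# original image's pixel space. Assumes original_width and original_height were
# captured before calling image.resize above.
def to_original_coordinates(x: int, y: int) -> tuple[int, int]:
    scale_x = original_width / resized_width
    scale_y = original_height / resized_height
    return round(x * scale_x), round(y * scale_y)
```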

### Navigation with Structured Output

```python
import json

from . import navigation

task = "Book a hotel in Paris on August 3rd for 3 nights"
prompt = navigation.get_navigation_prompt(task, image, step=1)
navigation_str = run_inference(prompt)[0]
# Parse the JSON answer into a NavigationStep (avoid shadowing the navigation module).
navigation_step = navigation.NavigationStep(**json.loads(navigation_str))
print(navigation_step)
# Expected NavigationStep(note='', thought='I need to select the check-out date as August 3rd and then proceed to search for hotels.', action=ClickElementAction(action='click_element', element='August 3rd on the calendar', x=777, y=282))
```

### Localization with click(x, y)

```python
from . import localization

instruction = "Select July 14th as the check-out date"
prompt = localization.get_localization_prompt(image, instruction)
coordinates = run_inference(prompt)[0]
print(coordinates)
# Expected Click(352, 348)
```
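
The model answers with a plain string of the form `Click(x, y)`. A minimal parsing sketch, assuming this exact output format (the structured-output variant below avoids string parsing entirely):

```python
import re

def parse_click(raw: str) -> tuple[int, int]:
    # Parse a string like "Click(352, 348)" into integer coordinates.
    match = re.search(r"Click\((\d+),\s*(\d+)\)", raw)
    if match is None:
        raise ValueError(f"Unexpected localization output: {raw!r}")
    return int(match.group(1)), int(match.group(2))

x, y = parse_click(coordinates)  # e.g. (352, 348)
```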

### Localization with Structured Output

We trained Holo1 as an Action VLM with extensive use of JSON and tool calls. It can therefore be queried reliably for structured output:

```python
import json
from . import localization

instruction = "Select July 14th as the check-out date"
prompt = localization.get_localization_prompt_structured_output(image, instruction)
coordinates_structured_str = run_inference(prompt)[0]
coordinates_structured = localization.ClickAction(**json.loads(coordinates_structured_str))
print(coordinates_structured)
# Expected ClickAction(action='click', x=352, y=340)
```
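
The `ClickAction` model used above ships with the repository (localization.py). Purely for illustration, a schema consistent with the repr printed above could be sketched like this; it is not the repository's definition:

```python
from typing import Literal

from pydantic import BaseModel

class ClickActionSketch(BaseModel):
    """Illustrative only; see the repository's localization.py for the real ClickAction."""
    action: Literal["click"] = "click"
    x: int
    y: int

# ClickActionSketch(**json.loads(coordinates_structured_str)) validates the same JSON payload.
```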

## Citation

**BibTeX:**

```
@misc{andreux2025surferhmeetsholo1costefficient,
      title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights}, 
      author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
      year={2025},
      eprint={2506.02865},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.02865}, 
}
```