|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
tags: |
|
- multimodal |
|
- action |
|
- agent |
|
--- |
|
# Holo1-7B |
|
|
|
## Model Description |
|
|
|
Holo1 is an Action Vision-Language Model (VLM) developed by [HCompany](https://www.hcompany.ai/) for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user. |
|
|
|
As part of a broader agentic architecture, Holo1 acts as a policy, localizer, or validator, helping the agent understand and act in digital environments. |
|
|
|
Trained on a mix of open-access, synthetic, and self-generated data, Holo1 enables state-of-the-art (SOTA) performance on the [WebVoyager](https://arxiv.org/pdf/2401.13919) benchmark, offering the best accuracy/cost tradeoff among current models. |
|
It also excels in UI localization tasks such as [Screenspot](https://huggingface.co/datasets/rootsautomation/ScreenSpot), [Screenspot-V2](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2), [Screenspot-Pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), [GroundUI-Web](https://huggingface.co/datasets/agent-studio/GroundUI-1K), and our own newly introduced |
|
benchmark [WebClick](https://huggingface.co/datasets/Hcompany/WebClick). |
|
|
|
Holo1 is optimized for both accuracy and cost-efficiency, making it a strong open-source alternative to existing VLMs. |
|
|
|
For more details, check out our paper and our blog post, linked below.
|
|
|
- **Developed by:** [HCompany](https://www.hcompany.ai/) |
|
- **Model type:** Action Vision-Language Model |
|
- **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct |
|
- **Paper:** https://arxiv.org/abs/2506.02865 |
|
- **Blog Post:** https://www.hcompany.ai/surfer-h |
|
- **License:** Apache 2.0 |
|
|
|
## Results |
|
|
|
### Surfer-H: Pareto-Optimal Performance on [WebVoyager](https://arxiv.org/pdf/2401.13919) |
|
|
|
Surfer-H is designed to be flexible and modular. It is composed of three independent components: |
|
- A Policy model that plans, decides, and drives the agent's behavior |
|
- A Localizer model that sees and understands visual UIs to drive precise interactions |
|
- A Validator model that checks whether the answer is valid |
|
|
|
The agent thinks before acting, takes notes, and can retry if its answer is rejected. It can operate with different models for each module, allowing for tradeoffs between accuracy, speed, and cost. |
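The loop below is only an illustrative sketch of this three-component design, not the actual Surfer-H implementation; all names (`policy_step`, `localize`, `validate`, `browser`) are hypothetical placeholders for the policy, localizer, validator, and browser layer described above.

```python
# Illustrative sketch of the Surfer-H loop (not the official implementation).
# `policy_step`, `localize`, `validate`, and `browser` are hypothetical placeholders.
def surfer_h(task: str, browser, policy_step, localize, validate, max_steps: int = 30):
    notes: list[str] = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()
        # 1. Policy: think, take notes, and decide the next action.
        step = policy_step(task, screenshot, notes)
        notes.append(step.note)
        if step.is_answer():
            # 3. Validator: accept the answer, or reject it so the agent retries.
            if validate(task, step.answer, screenshot):
                return step.answer
            continue
        # 2. Localizer: turn the intended UI element into click coordinates.
        x, y = localize(screenshot, step.element)
        browser.click(x, y)
    return None
```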
|
|
|
We evaluated Surfer-H on the [WebVoyager](https://arxiv.org/pdf/2401.13919) benchmark: 643 real-world web tasks ranging from retrieving prices to finding news or scheduling events. |
|
|
|
<div style="text-align: center;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/kO_4DlW_O45Wi7eK9-r8v.png" width="800"/> |
|
</div> |
|
|
|
We’ve tested multiple configurations, from GPT-4-powered agents to 100% open Holo1 setups. Among them, the fully Holo1-based agents offered the strongest tradeoff between accuracy and cost: |
|
- Surfer-H + Holo1-7B: 92.2% accuracy at $0.13 per task |
|
- Surfer-H + GPT-4.1: 92.0% at $0.54 per task |
|
- Surfer-H + Holo1-3B: 89.7% at $0.11 per task |
|
- Surfer-H + GPT-4.1-mini: 88.8% at $0.26 per task |
|
|
|
This places Holo1-powered agents on the Pareto frontier, delivering the best accuracy per dollar. |
|
Unlike other agents that rely on custom APIs or brittle wrappers, Surfer-H operates purely through the browser — just like a real user. Combined with Holo1, it becomes a powerful, general-purpose, cost-efficient web automation system. |
|
|
|
### Holo1: State-of-the-Art UI Localization |
|
|
|
A key skill for the real-world utility of our VLMs within agents is localization: the ability to identify precise |
|
coordinates on a user interface (UI) so the agent can interact with it to complete a task or follow an instruction. To assess
|
this capability, we evaluated our Holo1 models on several established localization benchmarks, including |
|
[Screenspot](https://huggingface.co/datasets/rootsautomation/ScreenSpot), [Screenspot-V2](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2), [Screenspot-Pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), [GroundUI-Web](https://huggingface.co/datasets/agent-studio/GroundUI-1K), and our own newly introduced |
|
benchmark [WebClick](https://huggingface.co/datasets/Hcompany/WebClick). |
|
|
|
<div style="text-align: center;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/UutD2Meevd5Xw0_mhX2wK.png" width="600"/> |
|
</div> |
|
|
|
<div style="text-align: center;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/682c3e22650f6bbe33bb9d94/NhzkB8xnEQYMqiGxPnJSt.png" width="600"/> |
|
</div> |
|
|
|
## Get Started with the Model |
|
|
|
We provide two Hugging Face Spaces to experiment with localization and navigation:
|
- https://huggingface.co/spaces/Hcompany/Holo1-Navigation |
|
- https://huggingface.co/spaces/Hcompany/Holo1-Localization |
|
|
|
We provide starter code for the localization task, i.e. image + instruction -> click coordinates.
|
|
|
We also provide code to reproduce screenspot evaluations: screenspot_eval.py |
|
|
|
### Prepare model, processor |
|
|
|
Holo1 models are based on the Qwen2.5-VL architecture, which is supported by transformers. Here we provide a simple usage example.
|
You can load the model and the processor as follows: |
|
|
|
```python |
|
from typing import Any
|
|
|
from transformers import AutoModelForImageTextToText, AutoProcessor |
|
|
|
# default: Load the model on the available device(s) |
|
# We recommend enabling flash_attention_2 for better acceleration and memory saving. |
|
model = AutoModelForImageTextToText.from_pretrained( |
|
"Hcompany/Holo1-7B", |
|
torch_dtype="auto", |
|
# torch_dtype=torch.bfloat16, |
|
# attn_implementation="flash_attention_2", |
|
device_map="auto", |
|
) |
|
|
|
# default processor |
|
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B") |
|
# The default range for the number of visual tokens per image in the model is 4-1280. |
|
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost. |
|
# processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B", min_pixels=min_pixels, max_pixels=max_pixels)
|
|
|
# Helper function to run inference.
# Note: it reads the global `image` prepared in the "Prepare image and instruction" section below.
def run_inference(messages: list[dict[str, Any]]) -> list[str]:
|
# Preparation for inference |
|
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
inputs = processor( |
|
text=[text], |
|
images=image, |
|
padding=True, |
|
return_tensors="pt", |
|
) |
|
    inputs = inputs.to(model.device)  # move inputs to the same device as the model
|
|
|
generated_ids = model.generate(**inputs, max_new_tokens=128) |
|
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)] |
|
return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False) |
|
``` |
|
|
|
### Prepare image and instruction |
|
|
|
WARNING: Holo1 predicts absolute coordinates (in pixels), while the Hugging Face processor resizes images internally. To keep the predicted coordinates consistent with the image you display or act on, resize the image with smart_resize first.
|
|
|
```python |
|
import requests

from PIL import Image
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
|
|
|
# Prepare image and instruction |
|
image_url = "https://huggingface.co/Hcompany/Holo1-7B/resolve/main/calendar_example.jpg" |
|
image = Image.open(requests.get(image_url, stream=True).raw) |
|
|
|
# Resize the image so that predicted absolute coordinates match the size of the image. |
|
image_processor = processor.image_processor |
|
resized_height, resized_width = smart_resize( |
|
image.height, |
|
image.width, |
|
factor=image_processor.patch_size * image_processor.merge_size, |
|
min_pixels=image_processor.min_pixels, |
|
max_pixels=image_processor.max_pixels, |
|
) |
|
image = image.resize(size=(resized_width, resized_height), resample=None) # type: ignore |
|
``` |
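The model's click coordinates refer to this resized image. If you need to act on the original, full-resolution screenshot, you can map a prediction back with a per-axis scale factor. Below is a minimal sketch (not part of the official starter code), assuming `original_width` / `original_height` are the image dimensions before the resize above:

```python
# Minimal sketch: map a click predicted on the smart_resized image back to the
# original screenshot resolution. `original_width` / `original_height` are the
# dimensions before resizing; `resized_width` / `resized_height` come from smart_resize.
def to_original_coordinates(
    x: int,
    y: int,
    original_width: int,
    original_height: int,
    resized_width: int,
    resized_height: int,
) -> tuple[int, int]:
    # Scale each axis independently, since smart_resize may change the aspect ratio slightly.
    return (
        round(x * original_width / resized_width),
        round(y * original_height / resized_height),
    )
```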
|
|
|
### Navigation with Structured Output |
|
|
|
```python |
|
import json |
|
from . import navigation |
|
|
|
task = "Book a hotel in Paris on August 3rd for 3 nights" |
|
prompt = navigation.get_navigation_prompt(task, image, step=1) |
|
navigation_str = run_inference(prompt)[0] |
|
navigation_step = navigation.NavigationStep(**json.loads(navigation_str))
print(navigation_step)
|
# Expected NavigationStep(note='', thought='I need to select the check-out date as August 3rd and then proceed to search for hotels.', action=ClickElementAction(action='click_element', element='August 3rd on the calendar', x=777, y=282)) |
|
``` |
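The predicted step can then be handed to whatever browser layer you use. The sketch below is hypothetical and not part of this repository: `browser` and its `click` method are placeholders, and only the `click_element` action from the example above is handled.

```python
# Hypothetical sketch: execute a predicted navigation step with your own browser layer.
# `browser` and its `click` method are placeholders, not part of this repository.
def execute_step(step: navigation.NavigationStep, browser) -> None:
    action = step.action
    if action.action == "click_element":
        # Coordinates are absolute pixels on the smart_resized screenshot.
        browser.click(action.x, action.y)
    else:
        raise NotImplementedError(f"Unhandled action type: {action.action}")
```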
|
|
|
### Localization with click(x, y) |
|
|
|
```python |
|
from . import localization |
|
|
|
instruction = "Select July 14th as the check-out date" |
|
prompt = localization.get_localization_prompt(image, instruction) |
|
coordinates = run_inference(prompt)[0] |
|
print(coordinates) |
|
# Expected Click(352, 348) |
|
``` |
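The plain-text localization prompt returns the click as a string such as `Click(352, 348)`. If you prefer to work with integers directly, a small parser is enough; this is a minimal sketch, assuming the output always follows the `Click(x, y)` format shown above:

```python
import re


def parse_click(output: str) -> tuple[int, int]:
    # Extract the two integers from a string like "Click(352, 348)".
    match = re.search(r"Click\((\d+),\s*(\d+)\)", output)
    if match is None:
        raise ValueError(f"Unexpected localization output: {output!r}")
    return int(match.group(1)), int(match.group(2))


x, y = parse_click(coordinates)
```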
|
|
|
### Localization with Structured Output |
|
|
|
We trained Holo1 as an Action VLM with extensive use of JSON and tool calls. As a result, it can be queried reliably with structured output:
|
|
|
```python |
|
import json |
|
from . import localization |
|
|
|
instruction = "Select July 14th as the check-out date" |
|
prompt = localization.get_localization_prompt_structured_output(image, instruction) |
|
coordinates_structured_str = run_inference(prompt)[0] |
|
coordinates_structured = localization.ClickAction(**json.loads(coordinates_structured_str)) |
|
print(coordinates_structured) |
|
# Expected ClickAction(action='click', x=352, y=340) |
|
``` |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
``` |
|
@misc{andreux2025surferhmeetsholo1costefficient, |
|
title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights}, |
|
author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu}, |
|
year={2025}, |
|
eprint={2506.02865}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.AI}, |
|
url={https://arxiv.org/abs/2506.02865}, |
|
} |
|
``` |
|
|
|
|