---
license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - multimodal
  - action
  - agent
---

Holo1-7B

Model Description

Holo-1 is an Action Vision-Language Model (VLM) developed by HCompany for use in the Runner-H web agent system. It is designed to interact with web interfaces like a human user.

As part of a broader agentic architecture, Holo-1 acts as a policy, localizer, or validator, helping the agent understand and act in digital environments.

Trained on a mix of open-access, synthetic, and self-generated data, Holo-1 enables state-of-the-art (SOTA) performance on the WebVoyager benchmark, offering the best accuracy/cost tradeoff among current models. It also excels on UI localization benchmarks such as Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and our newly introduced benchmark, WebClick.

Holo-1 is optimized for both accuracy and cost-efficiency, making it a strong open-source alternative to existing VLMs.

For more details, see our paper and our blog post.

  • Developed by: HCompany
  • Model type: Action Vision-Language Model
  • Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct
  • Paper:
  • Blog Post:
  • License: Apache 2.0

Results

Runner-H: Pareto-Optimal Performance on WebVoyager

Runner-H is designed to be flexible and modular. It is composed of three independent components:

  • A Policy model that plans, decides, and drives the agent's behavior
  • A Localizer model that sees and understands visual UIs to drive precise interactions
  • A Validator model that checks whether the answer is valid

The agent thinks before acting, takes notes, and can retry if its answer is rejected. It can operate with different models for each module, allowing for tradeoffs between accuracy, speed, and cost.
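
To make this division of labor concrete, here is a minimal, illustrative sketch of such a loop in Python. Every name in it (the policy, localizer, validator, and browser objects and their methods) is a hypothetical placeholder, not a released Runner-H API.

# Illustrative sketch only: hypothetical interfaces, not the actual Runner-H implementation.
def run_task(task: str, browser, policy, localizer, validator, max_attempts: int = 3):
    notes: list[str] = []
    for _ in range(max_attempts):
        answer = None
        while answer is None:
            screenshot = browser.screenshot()
            step = policy.next_step(task, screenshot, notes)  # plan and decide the next action
            if step.kind == "note":
                notes.append(step.text)  # the agent takes notes as it browses
            elif step.kind == "click":
                x, y = localizer.locate(screenshot, step.target)  # ground the action in pixel coordinates
                browser.click(x, y)
            elif step.kind == "answer":
                answer = step.text
        if validator.is_valid(task, answer, notes):  # accept or reject the proposed answer
            return answer
        notes.append(f"Previous answer was rejected: {answer}")  # retry with this feedback
    return None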

We evaluated Runner-H on the WebVoyager benchmark: 643 real-world web tasks ranging from retrieving prices to finding news or scheduling events.

We tested multiple configurations, from agents powered by GPT-4-class models to fully open Holo1 setups. Among them, the fully Holo1-based agents offered the strongest tradeoff between accuracy and cost:

  • Runner-H + Holo1-7B: 92.2% accuracy at only $0.13 per task
  • Runner-H + GPT-4o: 84.3% accuracy at $0.71 per task
  • Runner-H + GPT-4.1-mini: 88.8% accuracy at $0.26 per task
  • Runner-H + Holo1-3B: 89.7% accuracy at $0.11 per task

This places Holo1-powered agents on the Pareto frontier, delivering the best accuracy per dollar. Unlike other agents that rely on custom APIs or brittle wrappers, Runner-H operates purely through the browser — just like a real user. Combined with Holo1, it becomes a powerful, general-purpose, cost-efficient web automation system.

Holo1: State-of-the-Art UI Localization

A key skill underpinning the real-world utility of our VLMs within agents is localization: the ability to identify the precise coordinates on a user interface (UI) that must be interacted with to complete a task or follow an instruction. To assess this capability, we evaluated our Holo1 models on several established localization benchmarks, including Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and our newly introduced benchmark, WebClick.

Get Started with the Model

We provide starter code for the localization task, i.e., image + instruction -> click coordinates.

We trained Holo1 as an Action VLM with extensive use of JSON and tool calls. Therefore, it can be queried reliably with structured output:

We also provide code to reproduce the Screenspot evaluations: screenspot_eval.py

import json
import os
from typing import Any, Literal

from PIL import Image
from pydantic import BaseModel, ConfigDict
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# default: Load the model on the available device(s)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
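# (Note: attn_implementation="flash_attention_2" additionally requires the flash-attn package to be installed,
# and passing torch_dtype=torch.bfloat16 explicitly requires `import torch` above.)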
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-3B",
    torch_dtype="auto",
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B")
# The default range for the number of visual tokens per image in the model is 4-1280.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)
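# For illustration (values are an assumption, following the Qwen2.5-VL convention where one visual token
# covers a 28x28 pixel patch), a 256-1280 token budget would be:
# processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B", min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28)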


def run_inference(messages: list[dict[str, Any]]) -> list[str]:
    # Preparation for inference: render the chat template and tokenize text + image
    # (note: `image` refers to the globally prepared screenshot defined below).
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text],
        images=image,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)


# Prepare image and instruction
script_dir = os.path.dirname(os.path.abspath(__file__))
image_path = os.path.join(script_dir, "calendar_example.jpg")
image = Image.open(image_path)

image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height), resample=None)  # type: ignore
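# Note (our addition): the model sees this resized image, so the coordinates it predicts live in the
# resized pixel space. To click on the original-resolution screenshot, keep the original size around and
# rescale, e.g. x_orig = x * original_width / resized_width (and likewise for y).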

instruction = "Select July 14th as the check-out date"


# Localization as click(x, y)
def get_localization_prompt(image, instruction: str) -> list[dict[str, Any]]:
    guidelines: str = "Localize an element on the GUI image according to my instructions and output a click position as Click(x, y) with x num pixels from the left edge and y num pixels from the top edge."

    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": f"{guidelines}\n{instruction}"},
            ],
        }
    ]


messages = get_localization_prompt(image, instruction)
coordinates_str = run_inference(messages)[0]
print(coordinates_str)
# Expected Click(352, 348)
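
# Optional helper (our addition, not part of the original starter code): parse the raw
# "Click(x, y)" answer into integer pixel coordinates.
import re


def parse_click(answer: str) -> tuple[int, int]:
    match = re.search(r"Click\((\d+),\s*(\d+)\)", answer)
    if match is None:
        raise ValueError(f"Unexpected model output: {answer!r}")
    return int(match.group(1)), int(match.group(2))


x, y = parse_click(coordinates_str)
print(x, y)
# Expected 352 348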


# Localization as structured output
class FunctionDefinition(BaseModel):
    """Function definition data structure.

    Attributes:
        name: name of the function.
        description: description of the function.
        parameters: JSON schema for the function parameters.
        strict: Whether to enable strict schema adherence when generating the function call.
    """

    name: str
    description: str = ""
    parameters: dict[str, Any] = {}
    strict: bool = True


class ClickAction(BaseModel):
    """Click at specific coordinates on the screen."""

    model_config = ConfigDict(
        extra="forbid",
        json_schema_serialization_defaults_required=True,
        json_schema_mode_override="serialization",
        use_attribute_docstrings=True,
    )

    action: Literal["click"] = "click"
    x: int
    """The x coordinate, number of pixels from the left edge."""
    y: int
    """The y coordinate, number of pixels from the top edge."""


function_definition = FunctionDefinition(
    name="click_action",
    description=ClickAction.__doc__ or "",
    parameters=ClickAction.model_json_schema(),
    strict=True,
)


def get_localization_prompt_structured_output(image, instruction: str) -> list[dict[str, Any]]:
    guidelines: str = "Localize an element on the GUI image according to my instructions and output a click position. You must output a valid JSON format."

    return [
        {
            "role": "system",
            "content": json.dumps([function_definition.model_dump()]),
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": f"{guidelines}\n{instruction}"},
            ],
        },
    ]


messages = get_localization_prompt_structured_output(image, instruction)
coordinates_str = run_inference(messages)[0]
coordinates = ClickAction.model_validate(json.loads(coordinates_str)["arguments"])
print(coordinates)
# Expected ClickAction(action='click', x=352, y=340)

Citation

BibTeX:

[More Information Needed]

APA:

[More Information Needed]