---
license: apache-2.0
language:
- de
- en
base_model:
- allenai/Molmo-7B-D-0924
pipeline_tag: image-text-to-text
library_name: transformers
---
# ELAM-7B
ELAM (Evaluative Large Action Model) is a Large Action Model (LAM) based on Molmo-7B-D that can also evaluate user expectations against screenshots of user interfaces.
It was fine-tuned on 17,708 instructions and evaluations covering 6,230 automotive UI images that contain German and English text.
All training prompts were written in English; German UI content was either translated or quoted verbatim.
The evaluation dataset [AutomotiveUI-Bench-4K](https://huggingface.co/datasets/sparks-solutions/AutomotiveUI-Bench-4K) is available on Hugging Face.
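The benchmark can be pulled directly with the `datasets` library. A minimal sketch; the split name `"test"` is an assumption, so check the dataset card for the actual configuration:
```python
from datasets import load_dataset

# Load AutomotiveUI-Bench-4K from the Hugging Face Hub.
# The split name "test" is an assumption; see the dataset card for the real split names.
benchmark = load_dataset("sparks-solutions/AutomotiveUI-Bench-4K", split="test")
print(benchmark)            # row count and column names
print(benchmark[0].keys())  # fields of a single sample
```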
# Results
## AutomotiveUI-Bench-4K
| Model | Test Action Grounding | Expected Result Grounding | Expected Result Evaluation |
|---|---|---|---|
| InternVL2.5-8B | 26.6 | 5.7 | 64.8 |
| TinyClick | 61.0 | 54.6 | - |
| UGround-V1-7B (Qwen2-VL) | 69.4 | 55.0 | - |
| Molmo-7B-D-0924 | 71.3 | 71.4 | 66.9 |
| LAM-270M (TinyClick) | 73.9 | 59.9 | - |
| ELAM-7B (Molmo) | **87.6** | **77.5** | **78.2** |
# Quick-Start
```bash
conda create -n elam python=3.10 -y
conda activate elam
pip install datasets==3.5.0 einops==0.8.1 torchvision==0.20.1 accelerate==1.6.0
pip install transformers==4.48.2
```
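Optionally, a quick import check confirms that the pinned versions resolved as expected:
```python
# Optional sanity check for the environment created above.
import datasets
import torch
import torchvision
import transformers

print("torch:", torch.__version__)                # torchvision 0.20.1 is built against torch 2.5.x
print("torchvision:", torchvision.__version__)    # expected 0.20.1
print("transformers:", transformers.__version__)  # expected 4.48.2
print("datasets:", datasets.__version__)          # expected 3.5.0
```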
```python
import re
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
# Load processor
model_name = "sparks-solutions/ELAM-7B"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto")
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto"
)


def preprocess_elam_prompt(user_request: str, label_class: str):
    """Apply the ELAM prompt template that matches the request class."""
    if label_class == "Expected Result":
        return f"Evaluate this statement about the image:\n'{user_request}'\nThink step by step, conclude whether the evaluation is 'PASSED' or 'FAILED' and point to the UI element that corresponds to this evaluation."
    elif label_class == "Test Action":
        return f"Identify and point to the UI element that corresponds to this test action:\n{user_request}"
    raise ValueError(f"Unknown label class: {label_class!r}")


def postprocess_response_elam(response: str):
    """Parse Molmo-style point coordinates from a response string and return them as floats in [0, 1]."""
    pattern = r'<point x="(?P<x>\d+\.\d+)" y="(?P<y>\d+\.\d+)"'
    match = re.search(pattern, response)
    if match:
        # Molmo emits coordinates as percentages of the image size; scale them to [0, 1].
        x_coord_raw = float(match.group("x"))
        y_coord_raw = float(match.group("y"))
        x_coord = x_coord_raw / 100
        y_coord = y_coord_raw / 100
        return [x_coord, y_coord]
    else:
        # No point found, e.g. for a FAILED evaluation without a referenced element.
        return [-1, -1]
```
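As a quick sanity check, the parser can be exercised on a hypothetical Molmo-style response (the strings below are made up for illustration; real responses come from the model):
```python
# Hypothetical responses for illustration only.
example_response = 'The home button is in the lower left corner. <point x="12.5" y="93.4" alt="home button">home button</point>'
print(postprocess_response_elam(example_response))                      # -> [0.125, 0.934]
print(postprocess_response_elam("FAILED: no matching element found."))  # -> [-1, -1]
```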
Two prompt types were fine-tuned for UI testing:
1. *Test Action*: These prompts take an instruction (e.g., "tap music not in bottom navigation bar") and return the corresponding tap coordinates.
2. *Expected Result*: These prompts take an expectation (e.g., "notification toggle switch is disabled") and return "PASSED" or "FAILED" along with the coordinates of the relevant UI element.
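For illustration, `preprocess_elam_prompt` renders the two request types into the following prompts (example requests taken from the snippet below):
```python
print(preprocess_elam_prompt("Tap home button", "Test Action"))
# Identify and point to the UI element that corresponds to this test action:
# Tap home button

print(preprocess_elam_prompt("The home icon is white", "Expected Result"))
# Evaluate this statement about the image:
# 'The home icon is white'
# Think step by step, conclude whether the evaluation is 'PASSED' or 'FAILED' and point to the UI element that corresponds to this evaluation.
```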
```python
image_path = "path/to/your/ui/image"
user_request = "Tap home button" # or "The home icon is white"
request_type = "Test Action" # or "Expected Result"
image = Image.open(image_path)
elam_prompt = preprocess_elam_prompt(user_request, request_type)
inputs = processor.process(
    images=[image],
    text=elam_prompt,
)
# Move inputs to the correct device and make a batch of size 1, cast to bfloat16
inputs_bfloat16 = {}
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs_bfloat16[k] = v.to(model.device).to(torch.bfloat16).unsqueeze(0)
    else:
        inputs_bfloat16[k] = v.to(model.device).unsqueeze(0)
inputs = inputs_bfloat16 # Replace original inputs with the correctly typed inputs
# Generate output
output = model.generate_from_batch(
    inputs, GenerationConfig(max_new_tokens=2048, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer
)
# Only get generated tokens; decode them to text
generated_tokens = output[0, inputs["input_ids"].size(1) :]
response = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
coordinates = postprocess_response_elam(response)
# Print outputs
print(f"ELAM response: {response}")
print(f"Got coordinates: {coordinates}")
```
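`postprocess_response_elam` returns coordinates normalized to [0, 1], so they have to be scaled by the image dimensions before they can be used as a tap target or drawn onto the screenshot. A minimal sketch reusing the variables from the snippet above:
```python
# Convert the normalized [0, 1] coordinates back to pixel positions on the screenshot.
if coordinates != [-1, -1]:
    x_px = int(coordinates[0] * image.width)
    y_px = int(coordinates[1] * image.height)
    print(f"Predicted UI element at pixel position: ({x_px}, {y_px})")
else:
    print("No point found in the response.")
```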
# Citation
If you find ELAM useful in your research, please cite the following paper:
```bibtex
@misc{ernhofer2025leveragingvisionlanguagemodelsvisual,
title={Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI},
author={Benjamin Raphael Ernhofer and Daniil Prokhorov and Jannica Langner and Dominik Bollmann},
year={2025},
eprint={2505.05895},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.05895},
}
```
# Acknowledgements
## Funding
This work was supported by the German Federal Ministry of Education and Research (BMBF) within the scope of the project "KI4BoardNet".
## Models and Code
- ELAM is based on [Molmo](https://github.com/allenai/molmo) by the Allen Institute for AI.
- Training was conducted using [ms-swift](https://github.com/modelscope/ms-swift) by ModelScope.