---
base_model:
- allenai/Molmo-7B-D-0924
pipeline_tag: image-text-to-text
library_name: transformers
---
# ELAM-7B

ELAM (Evaluative Large Action Model) is a Large Action Model (LAM) based on Molmo-7B-D that can also evaluate user expectations on screenshots of user interfaces. It was fine-tuned specifically on 17,708 automotive UI images in German and English.
The evaluation dataset [AutomotiveUI-Bench-4K](https://huggingface.co/datasets/sparks-solutions/AutomotiveUI-Bench-4K) is available on Hugging Face.
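
If you want to inspect or reproduce the evaluation, the benchmark can be loaded with the `datasets` library installed in the Quick-Start environment below. This is only a minimal sketch; splits and column names follow the dataset card, not this example.

```python
from datasets import load_dataset

# Minimal sketch: download and inspect the benchmark.
# Split and column names are defined by the dataset card, not assumed here.
ds = load_dataset("sparks-solutions/AutomotiveUI-Bench-4K")
print(ds)
```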
# Quick-Start

```bash
conda create -n elam python=3.10 -y
conda activate elam
pip install datasets==3.5.0 einops==0.8.1 torchvision==0.20.1 accelerate==1.6.0
pip install transformers==4.48.2
```
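
The model is loaded in bfloat16 with `device_map="auto"` below, so a CUDA-capable GPU with enough memory for a 7B model is assumed. This optional check (not part of the original quick-start) confirms that before downloading the weights:

```python
import torch

# Optional sanity check before loading the 7B model in bfloat16.
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```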

```python
import re

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Load processor
model_name = "sparks-solutions/ELAM-7B"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto")

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto"
)


def preprocess_elam_prompt(user_request: str, label_class: str):
    """Apply the ELAM prompt template that matches the request class."""
    if label_class == "Expected Result":
        return f"Evaluate this statement about the image:\n'{user_request}'\nThink step by step, conclude whether the evaluation is 'PASSED' or 'FAILED' and point to the UI element that corresponds to this evaluation."
    elif label_class == "Test Action":
        return f"Identify and point to the UI element that corresponds to this test action:\n{user_request}"


def postprocess_response_elam(response: str):
    """Parse Molmo-style point coordinates from the response and return [x, y] floats in [0, 1]."""
    pattern = r'<point x="(?P<x>\d+\.\d+)" y="(?P<y>\d+\.\d+)"'
    match = re.search(pattern, response)
    if match:
        x_coord_raw = float(match.group("x"))
        y_coord_raw = float(match.group("y"))
        x_coord = x_coord_raw / 100
        y_coord = y_coord_raw / 100
        return [x_coord, y_coord]
    else:
        return [-1, -1]
```
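
To make the point format concrete: the model embeds coordinates as percentages of the image size in a `<point x="..." y="...">` tag, which `postprocess_response_elam` converts to normalized values in [0, 1]. The response string below is made up purely for illustration; actual model output will be worded differently.

```python
sample_response = 'The home button is at the bottom left. <point x="12.3" y="87.5">home button</point>'
print(postprocess_response_elam(sample_response))  # [0.123, 0.875]
```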

Two prompt types were fine-tuned for UI testing (the short snippet after this list shows the rendered prompt for each type):

1. *Test Action*: These prompts take an instruction (e.g., "tap music note in bottom navigation bar") and return the corresponding tap coordinates.
2. *Expected Result*: These prompts take an expectation (e.g., "notification toggle switch is disabled") and return "PASSED" or "FAILED" along with the coordinates of the relevant UI element.
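
Both requests below are taken verbatim from the full example that follows; this just prints what the model actually receives for each type:

```python
print(preprocess_elam_prompt("Tap home button", "Test Action"))
print(preprocess_elam_prompt("The home icon is white", "Expected Result"))
```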

```python
image_path = "path/to/your/ui/image"
user_request = "Tap home button"  # or "The home icon is white"
request_type = "Test Action"  # or "Expected Result"

image = Image.open(image_path)
elam_prompt = preprocess_elam_prompt(user_request, request_type)

inputs = processor.process(
    images=[image],
    text=elam_prompt,
)

# Move inputs to the correct device, make a batch of size 1, and cast floats to bfloat16
inputs_bfloat16 = {}
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs_bfloat16[k] = v.to(model.device).to(torch.bfloat16).unsqueeze(0)
    else:
        inputs_bfloat16[k] = v.to(model.device).unsqueeze(0)

inputs = inputs_bfloat16  # Replace original inputs with the correctly typed inputs

# Generate output
output = model.generate_from_batch(
    inputs, GenerationConfig(max_new_tokens=2048, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer
)

# Only keep the generated tokens and decode them to text
generated_tokens = output[0, inputs["input_ids"].size(1) :]
response = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
coordinates = postprocess_response_elam(response)

# Print outputs
print(f"ELAM response: {response}")
print(f"Got coordinates: {coordinates}")
```
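
The returned coordinates are normalized to [0, 1]. To act on the screenshot (e.g., to issue a tap), they still need to be scaled back to pixel space; a minimal sketch, assuming `image` and `coordinates` from the example above:

```python
# Scale normalized coordinates back to pixel positions on the screenshot.
if coordinates != [-1, -1]:
    x_px = round(coordinates[0] * image.width)
    y_px = round(coordinates[1] * image.height)
    print(f"Tap target in pixels: ({x_px}, {y_px})")
```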