Using pre-computed embeddings for images/frames as input
Is it possible to feed pre-computed embeddings from, e.g., the pretrained model google/siglip-so400m-patch14-384, possibly even for frames of a video? Or alternatively, can you pre-compute embeddings using SmolVLM2's SigLIP model for use at a later stage (to speed up inference time)?
If you have access to the images in advance, load the SmolVLMVisionTransformer (a wrapper class for the SigLIP encoder), use it to extract the embeddings, then pass them as image_hidden_states instead of pixel_values.
You have to use the SmolVLM2 encoder weights, as the encoder was unfrozen during training; the original SigLIP weights will not work.
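A minimal sketch of that recipe, pre-computing the features with SmolVLM2's own encoder and caching them to disk. The attribute names (model.model.vision_model, model.model.connector) follow the current transformers implementation and may differ in your version:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SmolVLMForConditionalGeneration

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = SmolVLMForConditionalGeneration.from_pretrained(model_id)

vision_model = model.model.vision_model  # SmolVLM2's fine-tuned SigLIP encoder
connector = model.model.connector        # pixel shuffle + projection into the LM embedding space

# Pre-process the image exactly as the full pipeline would
image = Image.open("image1.png").convert("RGB")
pixel_values = processor(images=[image], return_tensors="pt")["pixel_values"]  # (batch, num_crops, C, H, W)

batch, crops, channels, height, width = pixel_values.shape
with torch.no_grad():
    features = vision_model(pixel_values=pixel_values.view(batch * crops, channels, height, width)).last_hidden_state
    image_hidden_states = connector(features)

# Cache to disk; at generation time, load this tensor and pass it as
# image_hidden_states instead of pixel_values.
torch.save(image_hidden_states, "image1_hidden_states.pt")
```

Make sure the prompt you build later uses the same processor settings (image splitting, longest-edge size), otherwise the number of image tokens the chat template expands to will no longer match the cached features.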
It is not trivial to perform retrieval with encoders that are post-trained with the next-token-prediction (NTP) loss, as they are no longer aligned with their original text encoder. To do this, I would:
a) Test retrieval with the original SigLIP text encoder. Performance would probably be lower than before due to the lost text-vision embedding alignment (a rough sketch of this pairing is shown after this list).
b) Test retrieval with the SmolLM2 text tokenizer.
c) Potentially re-align the SigLIP text encoder with the (frozen) vision encoder using LoRA or similar.
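A rough sketch of option (a), pairing the original SigLIP text encoder with SmolVLM2's vision encoder. Mean-pooling the patch tokens is an assumption on my part, since the SmolVLM2 wrapper does not expose SigLIP's attention-pooling head, so treat the scores as a lower bound:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, SiglipTextModel, SmolVLMForConditionalGeneration

siglip_id = "google/siglip-so400m-patch14-384"
smolvlm_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

image_processor = AutoImageProcessor.from_pretrained(siglip_id)
text_tokenizer = AutoTokenizer.from_pretrained(siglip_id)
text_encoder = SiglipTextModel.from_pretrained(siglip_id)
vision_encoder = SmolVLMForConditionalGeneration.from_pretrained(smolvlm_id).model.vision_model

image = Image.open("image1.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]
text_inputs = text_tokenizer(
    ["a photo of a cat", "a photo of a dog"], padding="max_length", max_length=64, return_tensors="pt"
)

with torch.no_grad():
    patch_features = vision_encoder(pixel_values=pixel_values).last_hidden_state  # (1, num_patches, hidden)
    image_embedding = patch_features.mean(dim=1)                                  # crude pooled image embedding
    text_embeddings = text_encoder(**text_inputs).pooler_output                   # (2, hidden)

scores = torch.nn.functional.cosine_similarity(image_embedding, text_embeddings)
print(scores)  # expect weaker separation than with the original SigLIP vision tower
```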
@maxlun Hi! I'm trying to use SmolVLM2 with pre-encoded images. So far, I've managed to pass the images through the vision encoder to produce the image_hidden_states, but I cannot figure out how to pass these to the VLM's generate method. Would you mind sharing a small snippet of your code to give me a working example, please? That would be immensely appreciated.
I figured it out, and decided to leave my code here if anyone wants a template to get started.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, SmolVLMForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model = SmolVLMForConditionalGeneration.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct").to(device)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

vision_encoder = model.model.vision_model  # Vision encoder
connector = model.model.connector          # Projects vision features to text space

# Load your images
image1 = Image.open("image1.png")
image2 = Image.open("image2.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Consider the following two images:\n"},
            {"type": "text", "text": "Image 1:\n"},
            {"type": "image"},  # First image
            {"type": "text", "text": "Image 2:\n"},
            {"type": "image"},  # Second image
            {"type": "text", "text": "Describe the differences between the images."},
        ],
    }
]

# Apply chat template
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print("Formatted prompt:", repr(prompt))

# Process the inputs - pass images as a LIST
inputs = processor(
    text=prompt,
    images=[image1, image2],
    return_tensors="pt",
)

# Get image embeddings
pixel_values = inputs["pixel_values"].to(device)
batch_size, num_images, channels, height, width = pixel_values.shape
pixel_values_reshaped = pixel_values.view(batch_size * num_images, channels, height, width)

with torch.no_grad():
    # Get vision encoder outputs
    vision_outputs = vision_encoder(
        pixel_values=pixel_values_reshaped,
        return_dict=True,
    )
    raw_vision_features = vision_outputs.last_hidden_state
    # Project the vision features into the language model's embedding space
    projected_vision_features = connector(raw_vision_features)

print(f"Projected vision features shape: {projected_vision_features.shape}")

# Move inputs to GPU
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Generate response, feeding the pre-computed features as image_hidden_states
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_hidden_states=projected_vision_features,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.1,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the new part
input_length = inputs["input_ids"].shape[1]
generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print("Response:", response)
```