Using pre-computed embeddings for images/frames as input
Is it possible to feed pre-computed embeddings from, e.g., the pretrained model google/siglip-so400m-patch14-384, possibly even for frames of a video? Or alternatively, can you pre-compute embeddings using SmolVLM2's SigLIP model for use at a later stage (to speed up inference time)?
If you have access to the images in advance, load the SmolVLMVisionTransformer (a wrapper class for the SigLIP encoder), use it to extract the embeddings, then pass them as image_hidden_states instead of pixel_values.
You have to use the SmolVLM2 encoder weights, as the encoder was unfrozen during training; the original SigLIP weights will not work.
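A minimal sketch of that recipe, pre-computing the features with SmolVLM2's own encoder and caching them to disk. The attribute names (model.model.vision_model, model.model.connector) follow the current transformers implementation and may differ in your version:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SmolVLMForConditionalGeneration

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = SmolVLMForConditionalGeneration.from_pretrained(model_id)

vision_model = model.model.vision_model  # SmolVLM2's fine-tuned SigLIP encoder
connector = model.model.connector        # pixel shuffle + projection into the LM embedding space

# Pre-process the image exactly as the full pipeline would
image = Image.open("image1.png").convert("RGB")
pixel_values = processor(images=[image], return_tensors="pt")["pixel_values"]  # (batch, num_crops, C, H, W)

batch, crops, channels, height, width = pixel_values.shape
with torch.no_grad():
    features = vision_model(pixel_values=pixel_values.view(batch * crops, channels, height, width)).last_hidden_state
    image_hidden_states = connector(features)

# Cache to disk; at generation time, load this tensor and pass it as
# image_hidden_states instead of pixel_values.
torch.save(image_hidden_states, "image1_hidden_states.pt")
```

Make sure the prompt you build later uses the same processor settings (image splitting, longest-edge size), otherwise the number of image tokens the chat template expands to will no longer match the cached features.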
It is not trivial to perform retrieval with encoders that are post-trained with the next-token-prediction (NTP) loss, as they are no longer aligned with their original text encoder. To do this, I would:
a) Test retrieval with the original SigLIP text encoder. Performance would probably be lower than before due to the lost text-vision embedding alignment (a rough sketch of this pairing is shown after this list).
b) Test retrieval with the SmolLM2 text tokenizer.
c) Potentially re-align the SigLIP text encoder with the (frozen) vision encoder using LoRA or similar.
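A rough sketch of option (a), pairing the original SigLIP text encoder with SmolVLM2's vision encoder. Mean-pooling the patch tokens is an assumption on my part, since the SmolVLM2 wrapper does not expose SigLIP's attention-pooling head, so treat the scores as a lower bound:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, SiglipTextModel, SmolVLMForConditionalGeneration

siglip_id = "google/siglip-so400m-patch14-384"
smolvlm_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

image_processor = AutoImageProcessor.from_pretrained(siglip_id)
text_tokenizer = AutoTokenizer.from_pretrained(siglip_id)
text_encoder = SiglipTextModel.from_pretrained(siglip_id)
vision_encoder = SmolVLMForConditionalGeneration.from_pretrained(smolvlm_id).model.vision_model

image = Image.open("image1.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]
text_inputs = text_tokenizer(
    ["a photo of a cat", "a photo of a dog"], padding="max_length", max_length=64, return_tensors="pt"
)

with torch.no_grad():
    patch_features = vision_encoder(pixel_values=pixel_values).last_hidden_state  # (1, num_patches, hidden)
    image_embedding = patch_features.mean(dim=1)                                  # crude pooled image embedding
    text_embeddings = text_encoder(**text_inputs).pooler_output                   # (2, hidden)

scores = torch.nn.functional.cosine_similarity(image_embedding, text_embeddings)
print(scores)  # expect weaker separation than with the original SigLIP vision tower
```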
@maxlun Hi! I'm trying to use SmolVLM2 with pre-encoded images. So far, I've managed to pass the images through the vision encoder to produce the image_hidden_states, but I cannot figure out how to pass these to the VLM's generate method. Would you mind sharing a small snippet of your code to give me a working example, please? That would be immensely appreciated.
I figured it out, and decided to leave my code here if anyone wants a template to get started.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, SmolVLMForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model = SmolVLMForConditionalGeneration.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct").to(device)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

vision_encoder = model.model.vision_model  # Vision encoder
connector = model.model.connector          # Projects vision features to text space

# Load your images
image1 = Image.open("image1.png")
image2 = Image.open("image2.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Consider the following two images:\n"},
            {"type": "text", "text": "Image 1:\n"},
            {"type": "image"},  # First image
            {"type": "text", "text": "Image 2:\n"},
            {"type": "image"},  # Second image
            {"type": "text", "text": "Describe the differences between the images."},
        ],
    }
]

# Apply chat template
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print("Formatted prompt:", repr(prompt))

# Process the inputs - pass images as a LIST
inputs = processor(
    text=prompt,
    images=[image1, image2],
    return_tensors="pt",
)

# Get image embeddings
pixel_values = inputs["pixel_values"].to(device)
batch_size, num_images, channels, height, width = pixel_values.shape
pixel_values_reshaped = pixel_values.view(batch_size * num_images, channels, height, width)

with torch.no_grad():
    # Get vision encoder outputs
    vision_outputs = vision_encoder(
        pixel_values=pixel_values_reshaped,
        return_dict=True,
    )
    raw_vision_features = vision_outputs.last_hidden_state
    # Project the vision features into the language model's embedding space
    projected_vision_features = connector(raw_vision_features)

print(f"Projected vision features shape: {projected_vision_features.shape}")

# Move inputs to GPU
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Generate response, feeding the pre-computed features as image_hidden_states
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_hidden_states=projected_vision_features,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.1,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the new part
input_length = inputs["input_ids"].shape[1]
generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print("Response:", response)
```