Is it possible to get Image Embeddings from the VLM?
I'm working on a project that requires multi-modal embeddings, and I'm planning to use Qwen2.5-VL to generate the embeddings for images.
Is it possible to directly extract image embeddings from this VLM?
Hello! Yes, you can, by using the vision transformer part of the model. Here is the code I used to get the embeddings:
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id,attn_implementation='eager')
processor = AutoProcessor.from_pretrained(model_id)
inputs = processor.image_processor(images=img, return_tensors="pt")
pixel_values = inputs["pixel_values"].to(device) #vision model only takes pixel values
grid_thw = inputs["image_grid_thw"].to(device) #vision model only takes pixel values
model.visual.to(pixel_values.device)
with torch.no_grad():
vision_outputs = model.visual(pixel_values, grid_thw)
print(vision_outputs)
visual_embeds = vision_outputs.squeeze(0).cpu()
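Note that model.visual returns one embedding per (merged) image patch, not a single vector per image. If your project needs a fixed-size embedding per image (e.g. for retrieval), one common approach is to mean-pool over the patch dimension. Here is a minimal sketch along those lines, reusing the model, processor, and device from above; the image paths are just placeholders:

import torch.nn.functional as F

def embed_image(img):
    """Mean-pool the per-patch vision features into one embedding vector."""
    inputs = processor.image_processor(images=img, return_tensors="pt")
    with torch.no_grad():
        feats = model.visual(
            inputs["pixel_values"].to(device),
            inputs["image_grid_thw"].to(device),
        )
    return feats.mean(dim=0)  # (hidden_dim,)

# Example: cosine similarity between two images for retrieval-style comparison
emb_a = embed_image(Image.open("image_a.jpg"))  # placeholder paths
emb_b = embed_image(Image.open("image_b.jpg"))
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
print(similarity.item())

Mean pooling is just one option; depending on your downstream task, you may get better results with other pooling strategies, so it's worth evaluating on your own data.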