Is it possible to get Image Embeddings from the VLM?
I'm working on a project that requires multi-modal embeddings, and I'm planning to use Qwen2.5-VL to generate the embeddings for images.
Is it possible to directly extract image embeddings from this VLM?
Hello! Yes, you can, by using the vision transformer part of the model. Here is the code I used to get the embeddings:
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id,attn_implementation='eager')
processor = AutoProcessor.from_pretrained(model_id)
inputs = processor.image_processor(images=img, return_tensors="pt")
pixel_values = inputs["pixel_values"].to(device) #vision model only takes pixel values
grid_thw = inputs["image_grid_thw"].to(device) #vision model only takes pixel values
model.visual.to(pixel_values.device)
with torch.no_grad():
vision_outputs = model.visual(pixel_values, grid_thw)
print(vision_outputs)
visual_embeds = vision_outputs.squeeze(0).cpu()
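Note that model.visual returns one embedding per (merged) image patch, not a single vector per image. If your project needs a fixed-size embedding per image (e.g. for retrieval), one common approach is to mean-pool over the patch dimension. Here is a minimal sketch along those lines, reusing the model, processor, and device from above; the image paths are just placeholders:

import torch.nn.functional as F

def embed_image(img):
    """Mean-pool the per-patch vision features into one embedding vector."""
    inputs = processor.image_processor(images=img, return_tensors="pt")
    with torch.no_grad():
        feats = model.visual(
            inputs["pixel_values"].to(device),
            inputs["image_grid_thw"].to(device),
        )
    return feats.mean(dim=0)  # (hidden_dim,)

# Example: cosine similarity between two images for retrieval-style comparison
emb_a = embed_image(Image.open("image_a.jpg"))  # placeholder paths
emb_b = embed_image(Image.open("image_b.jpg"))
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
print(similarity.item())

Mean pooling is just one option; depending on your downstream task, you may get better results with other pooling strategies, so it's worth evaluating on your own data.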