How to perform retrieval using fused [image, text] as the input query?

#65
by ququwowo - opened

Hi All!

Could you please advise on the best option for a user query that combines [text, image]? More generally, how can I generate a "fused" embedding for [text, image] that works well with jina-v4?

For example, a user wants to retrieve documents using the query ["can you identify the mechanical tool type in this image and how should I operate this tool?" + img_of_tool]. In this case both the image and the text are important, and I want jina-v4 to return documents that discuss both the tool and its operating procedure.

Thank you!

Jina AI org

Hi @ququwowo ,

We haven’t trained or tested the model on "fused" embeddings, which is why the model class does not support them. However, if you still want to try it, you can modify the prompt here when encoding an image and pass the desired text, for example:
<|im_start|>user\n Can you identify the mechanical tool type in this image and how should I operate this tool?<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n
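A minimal sketch of how that prompt string could be assembled for an arbitrary query, following the token layout in the reply above. The function name `build_fused_prompt` is hypothetical and not part of the model's code, and trimming whitespace around the query text is an assumption. The result still has to be wired into the image-encoding prompt mentioned in the reply, which is not shown here, and since the model was not trained on this setup the retrieval quality is untested.

```python
# Special tokens copied from the prompt template in the reply above.
USER_START = "<|im_start|>user\n"
IMAGE_SLOT = "<|vision_start|><|image_pad|><|vision_end|>"
TURN_END = "<|im_end|>\n"


def build_fused_prompt(query_text: str) -> str:
    """Place the query text in front of the image placeholder tokens.

    Hypothetical helper: it only builds the string and does not call the model.
    """
    text = query_text.strip()  # assumption: surrounding whitespace is not meaningful
    if not text:
        raise ValueError("query_text must not be empty")
    return f"{USER_START}{text}{IMAGE_SLOT}{TURN_END}"


if __name__ == "__main__":
    prompt = build_fused_prompt(
        "Can you identify the mechanical tool type in this image "
        "and how should I operate this tool?"
    )
    print(repr(prompt))
```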
