Can we embed multiple images and text into a single embedding?

#38
by srinivasbilla - opened

Can we have like 5 images and 5 sentences in one embedding?

Hey, if you want to encode 5 sentences into one embedding, you can just concatenate them. For images, the encode function does not support this, so you need to implement a function yourself that converts the images into a sequence of tokens that you can pass to the model. Basically, you need to reimplement what the encode function [1] does, but pass multiple images (that should not be too complicated). If you want to encode both text and images into a single embedding, you can do it in a similar way. Nevertheless, the model is only trained to encode a single image or pure text into one embedding representation, so I don't know if multi-modal inputs or inputs with multiple images will produce good embeddings.

[1] https://huggingface.co/jinaai/jina-embeddings-v4/blob/main/modeling_jina_embeddings_v4.py#L487-L546
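
For the text-only case, the concatenation idea can be sketched roughly as below. This is only an illustration: `embed_as_one` and `encode_fn` are hypothetical names, and `encode_fn` stands in for whatever text encoding call you already use (it is assumed to take a list of strings and return one vector per string). It does not cover the multi-image case, which needs the custom token preparation described above.

```python
from typing import Callable, List, Sequence


def embed_as_one(
    sentences: Sequence[str],
    encode_fn: Callable[[List[str]], List[List[float]]],
    sep: str = " ",
) -> List[float]:
    """Join several sentences into one string and embed it as a single input.

    encode_fn is a placeholder (assumption): any callable that maps a list of
    strings to a list of embedding vectors, one vector per string.
    """
    # Drop empty entries and surrounding whitespace before joining.
    joined = sep.join(s.strip() for s in sentences if s.strip())
    # One input string in, so exactly one embedding comes back.
    return encode_fn([joined])[0]
```

Keep in mind that the joined text still has to fit into the model's maximum input length, otherwise the tail gets truncated.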

We also plan to support encoding multiple images at the same time into multiple embeddings, i.e., late chunking for images, e.g., to preserve context between PDF pages of the same document by using the late chunking method [1]. But first we need to run some experiments to see how well this works.

[1] https://github.com/jina-ai/late-chunking
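
To make the late chunking idea a bit more concrete, here is a minimal sketch of just the pooling step: the whole input is encoded once so every token sees the full context, and afterwards the token-level embeddings are mean-pooled per chunk. `late_chunk_pool` is a hypothetical helper, not part of the model's code. It assumes you already have token embeddings and know the token span of each chunk; for images, treating each page as one span is an untested assumption.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool contextualized token embeddings over each (start, end) span.

    Spans are half-open token index ranges, e.g. one per sentence or per page.
    """
    pooled = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span ({start}, {end})")
        chunk = token_embeddings[start:end]
        dim = len(chunk[0])
        # Average each embedding dimension over the tokens in this span.
        pooled.append(
            [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled
```

Each returned vector is one chunk embedding, so a document with N spans yields N embeddings from a single forward pass.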

srinivasbilla changed discussion status to closed

Thank you for your reply! Makes sense. Super interesting work though! Thank you for sharing
