Does the Llama-3.2 Vision model support multiple images?
Does this model support multiple images? If so, can it be used like this?
import requests
from PIL import Image

# Image.open expects a file path or file-like object, not a URL string,
# so fetch the URLs first
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "Please describe these two images."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)
Thanks for the Q! We recommend using one image for inference; the model doesn't work reliably with multiple images.
Ok~ Thanks for your reply!
Hey Sanyam,
Thanks for the response.
Any idea why this is happening?
Is it a limitation of the model size or the lack of training?
What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.
I get a CUDA out-of-memory message when I use multiple images.
I have the same question: can this model run inference on video files? For example, using cv2 to extract a set of frames?
I have the same question. I am trying to run inference on video files, extracting frames and transcripts to understand the video as a whole. However, this needs an accumulation of frame-level understanding rather than single-frame inference, and Llama 3.2 Vision seems unable to do that.
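A common workaround is to sample a handful of frames, run single-image inference on each, and then summarize the per-frame captions in a text-only pass. The frame sampling itself is simple; a sketch of uniform index selection (assuming `total_frames` would come from `cv2.CAP_PROP_FRAME_COUNT`):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples evenly spaced frame indices from a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # take the midpoint of each segment so frames are evenly spread
    return [int(step * i + step / 2) for i in range(num_samples)]
```

Each sampled frame would then be converted to a PIL image and captioned in its own single-image call, which stays within the one-image recommendation above.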
Same here, I would also like to have multi image support in 1 conversation. What is the ETA on this? Will it be supported in the future?
And what about images across history?
messages = [
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "please describe the image"}
        ]
    },
    {
        "role": "assistant", "content": "It shows a cat fighting with a dog"
    },
    {
        "role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you explain more? Here's another perspective"}
        ]
    },
]
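Whatever the model's reliability with multiple images, the number of images passed to the processor has to match the number of `{"type": "image"}` placeholders across the whole chat history. A small validation helper (a hypothetical name, not part of the transformers API) can catch mismatches before the processor call:

```python
def count_image_placeholders(messages: list[dict]) -> int:
    """Count {"type": "image"} entries across all turns of a chat history."""
    count = 0
    for msg in messages:
        content = msg.get("content")
        # assistant turns may be plain strings; only list contents hold images
        if isinstance(content, list):
            count += sum(1 for part in content if part.get("type") == "image")
    return count
```

For the history above this returns 2, so the processor call would need to receive both images, e.g. `processor([image1, image2], input_text, ...)`.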
Did we get an answer to this?
I have a set of images and a set of context passages that I got from my retriever engine. I now need to pass these into my generation model [any vision model] to get the final response.
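One way to wire retriever output into a vision chat template is to pair the images with the retrieved context in a single user turn. A sketch (the function name, prompt wording, and `contexts` format are assumptions, not a fixed API):

```python
def build_rag_messages(num_images: int, contexts: list[str], question: str) -> list[dict]:
    """Build a chat-template message list pairing images with retrieved context."""
    # one image placeholder per retrieved image
    content: list[dict] = [{"type": "image"} for _ in range(num_images)]
    context_block = "\n\n".join(contexts)
    content.append({
        "type": "text",
        "text": f"Context:\n{context_block}\n\nQuestion: {question}",
    })
    return [{"role": "user", "content": content}]
```

The resulting list would go through `processor.apply_chat_template(...)` as in the examples above, with the images passed alongside. Note that, per the maintainer reply, multi-image turns may still be unreliable with this model; one image per call is the safer pattern.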
"Our training pipeline consists of multiple stages, starting from pretrained Llama 3.1 text models. First, we add image adapters and encoders, then pretrain on large-scale noisy (image, text) pair data. Next, we train on medium-scale high quality in-domain and knowledge-enhanced (image, text) pair data." - from LLAMA 3.2 blog by Meta
I don't think the Llama 3.2 models can handle multiple images, as they were not trained that way. I am planning to use MIVC (https://assets.amazon.science/5b/f4/131b6a25445fae6d1fec2befbb84/mivc-multiple-instance-visual-component-for-visual-language-models.pdf) with Llama 3.2 models to aggregate the embeddings of multiple images into one embedding. If anyone is interested, you can join me.
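For anyone curious what that aggregation looks like: MIVC pools per-image embeddings into a single embedding with gated attention. A minimal numpy sketch of that pooling step (shapes and parameter names are my assumptions; in MIVC the matrices `V`, `U` and vector `w` are learned):

```python
import numpy as np

def gated_attention_pool(h: np.ndarray, V: np.ndarray,
                         U: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Gated-attention pooling over per-image embeddings (MIVC-style sketch).

    h: (n, d) per-image embeddings
    V, U: (k, d) projection matrices; w: (k,) scoring vector
    Returns a single (d,) aggregated embedding.
    """
    gate = 1.0 / (1.0 + np.exp(-(U @ h.T)))          # sigmoid gate, (k, n)
    scores = w @ (np.tanh(V @ h.T) * gate)           # one score per image, (n,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                   # softmax attention weights
    return a @ h                                      # weighted sum over images
```

With a single image the pooled vector is just that image's embedding, so the downstream model's single-image interface is unchanged.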
There are a few other VLMs that allow multiple images at inference time: NVLM, LLaVA, GPT-4o.