Does the Llama-3.2 Vision model support multiple images?

#43
by JOJOHuang - opened

Does this model support multiple images? If so, something like this?

from PIL import Image
import requests

# processor and model assumed to be loaded as in the model card example
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "Please describe these two images."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

Meta Llama org

Thanks for the Q! We recommend using a single image for inference; the model doesn't work reliably with multiple images.

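For reference, the recommended single-image flow looks roughly like the sketch below, following the pattern in the model card (the 11B Instruct checkpoint and the local image path are assumptions):

import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_image.jpg")  # placeholder path

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe this image."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))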

Ok~ Thanks for your reply!

Sanyam changed discussion status to closed

Hey Sanyam,

Thanks for the response.

Any idea why this is happening?

Is it a limitation of the model size or the lack of training?

What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.

I get a CUDA out-of-memory error when I use multiple images.

I have the same question: can this model run inference on video files? For example, by using cv2 to extract a set of frames?
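Not an official answer, but a minimal sketch of turning a video into frames with cv2 (the path and sampling stride are placeholders); each frame can then be sent through the single-image flow above:

import cv2
from PIL import Image

def sample_frames(video_path, every_n=30):
    # Keep every Nth frame as a PIL image (OpenCV reads BGR, so convert to RGB).
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")  # placeholder path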

I have the same question. I am trying to run inference on video files, extracting frames and transcripts to reason over the video as a whole. However, this needs an accumulation of understanding across frames rather than single-frame inference, and Llama 3.2 Vision doesn't seem able to do this.
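Given the single-image limitation, one workaround is to caption each frame separately and then aggregate at the text level. A rough sketch, reusing model, processor, and sample_frames from the snippets above (the prompts and token budgets are arbitrary):

captions = []
for frame in sample_frames("clip.mp4"):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this frame in one sentence."}
    ]}]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(frame, text, add_special_tokens=False, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=60)
    # Keep only the newly generated tokens, not the prompt.
    captions.append(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# Aggregate the per-frame captions with a text-only prompt (this step could also
# be routed through a text-only Llama 3.1 model instead of the vision checkpoint).
summary_prompt = ("These are captions of consecutive video frames:\n"
                  + "\n".join(captions)
                  + "\nSummarize what happens in the video.")
summary_messages = [{"role": "user", "content": [{"type": "text", "text": summary_prompt}]}]
summary_text = processor.apply_chat_template(summary_messages, add_generation_prompt=True)
summary_inputs = processor(text=summary_text, add_special_tokens=False, return_tensors="pt").to(model.device)
summary = model.generate(**summary_inputs, max_new_tokens=200)
print(processor.decode(summary[0], skip_special_tokens=True))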

Same here; I would also like multi-image support within one conversation. What is the ETA on this? Will it be supported in the future?

And what about images across the conversation history?

messages = [
  {
    "role": "user", "content": [
      {"type": "image"},
      {"type": "text", "text": "please describe the image"}
    ]
  },
  {
    "role": "assistant", "content": "It shows a cat fighting with a dog"
  },
  {
    "role": "user", "content": [
      {"type": "image"},
      {"type": "text", "text": "Can you explain more? Here's another perspective"}
    ]
  },
]
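Mechanically, the processor matches one image to each {"type": "image"} placeholder in reading order across the whole conversation, so the call would look roughly like this (image1 and image2 as loaded earlier; the single-image reliability caveat from the Meta reply above still applies):

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# One image per {"type": "image"} placeholder, in the order they appear in `messages`.
inputs = processor([image1, image2], input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0], skip_special_tokens=True))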

Did we get an answer to this?
I have a set of images and a set of context passages from my retriever engine; I now need to pass these to my generation model (any vision model) to get the final response.
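One way to fit this into the single-image constraint is to put the retrieved text passages directly into the prompt and attach only the top-ranked image. A rough sketch, reusing model and processor from above (retrieved_passages, retrieved_images, and question are placeholder names for the retriever outputs and the user query):

context = "\n\n".join(retrieved_passages)   # text chunks from the retriever
top_image = retrieved_images[0]             # PIL image ranked most relevant

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": f"Context:\n{context}\n\nQuestion: {question}"}
]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(top_image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
answer = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(answer[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))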

"Our training pipeline consists of multiple stages, starting from pretrained Llama 3.1 text models. First, we add image adapters and encoders, then pretrain on large-scale noisy (image, text) pair data. Next, we train on medium-scale high quality in-domain and knowledge-enhanced (image, text) pair data." - from LLAMA 3.2 blog by Meta

I don't think the Llama 3.2 models can handle multiple images, as they were not trained that way. I am planning to use MIVC (https://assets.amazon.science/5b/f4/131b6a25445fae6d1fec2befbb84/mivc-multiple-instance-visual-component-for-visual-language-models.pdf) on the Llama 3.2 models to aggregate the embeddings of multiple images into one embedding. If anyone is interested, you can join me.
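For anyone curious what that aggregation looks like, here is a self-contained sketch of gated attention pooling over per-image embeddings in the spirit of MIVC (the embedding dimension and the source of the embeddings are assumptions; wiring the pooled embedding into Llama 3.2's vision pathway is the hard part and is not shown):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Gated attention pooling (multiple-instance learning style): N embeddings -> 1.
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.v = nn.Linear(dim, hidden)
        self.u = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (num_images, dim)
        scores = self.w(torch.tanh(self.v(x)) * torch.sigmoid(self.u(x)))  # (N, 1)
        weights = torch.softmax(scores, dim=0)
        return (weights * x).sum(dim=0)  # (dim,)

# e.g. pool embeddings of 3 images from some vision encoder (random stand-ins here)
pooled = AttentionPool()(torch.randn(3, 1024))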

There are a few other VLMs that allow multiple images at inference time: NVLM, LLaVA, GPT-4o.
