Does the Llama-3.2 Vision model support multiple images?

#43
by JOJOHuang - opened

Does this model support multiple images? If so, something like this?

from PIL import Image
import requests

# processor and model assumed to be loaded as in the model card example
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "Please describe these two images."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

Meta Llama org

Thanks for the Q! We recommend using a single image for inference; the model doesn't work reliably with multiple images.

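For reference, the recommended single-image flow looks roughly like the sketch below, following the pattern in the model card (the 11B Instruct checkpoint and the local image path are assumptions):

import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_image.jpg")  # placeholder path

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe this image."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))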

Ok~ Thanks for your reply!

Sanyam changed discussion status to closed

Hey Sanyam,

Thanks for the response.

Any idea why this is happening?

Is it a limitation of the model size or the lack of training?

What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.

I get a CUDA out-of-memory error when I use multiple images.

I have the same question: can this model run inference on video files? For example, by using cv2 to extract a set of frames?
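Not an official answer, but a minimal sketch of turning a video into frames with cv2 (the path and sampling stride are placeholders); each frame can then be sent through the single-image flow above:

import cv2
from PIL import Image

def sample_frames(video_path, every_n=30):
    # Keep every Nth frame as a PIL image (OpenCV reads BGR, so convert to RGB).
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")  # placeholder path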

I have the same question. I am trying to run inference on video files, extracting frames and transcripts to reason over the video as a whole. However, this needs an accumulation of understanding across frames rather than single-frame inference, and Llama 3.2 Vision doesn't seem able to do this.
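Given the single-image limitation, one workaround is to caption each frame separately and then aggregate at the text level. A rough sketch, reusing model, processor, and sample_frames from the snippets above (the prompts and token budgets are arbitrary):

captions = []
for frame in sample_frames("clip.mp4"):
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this frame in one sentence."}
    ]}]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(frame, text, add_special_tokens=False, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=60)
    # Keep only the newly generated tokens, not the prompt.
    captions.append(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# Aggregate the per-frame captions with a text-only prompt (this step could also
# be routed through a text-only Llama 3.1 model instead of the vision checkpoint).
summary_prompt = ("These are captions of consecutive video frames:\n"
                  + "\n".join(captions)
                  + "\nSummarize what happens in the video.")
summary_messages = [{"role": "user", "content": [{"type": "text", "text": summary_prompt}]}]
summary_text = processor.apply_chat_template(summary_messages, add_generation_prompt=True)
summary_inputs = processor(text=summary_text, add_special_tokens=False, return_tensors="pt").to(model.device)
summary = model.generate(**summary_inputs, max_new_tokens=200)
print(processor.decode(summary[0], skip_special_tokens=True))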

Same here; I would also like multi-image support within one conversation. What is the ETA on this? Will it be supported in the future?

And what about images across the conversation history?

messages = [
  {
    "role": "user", "content": [
      {"type": "image"},
      {"type": "text", "text": "please describe the image"}
    ]
  },
  {
    "role": "assistant", "content": "It shows a cat fighting with a dog"
  },
  {
    "role": "user", "content": [
      {"type": "image"},
      {"type": "text", "text": "Can you explain more? Here's another perspective"}
    ]
  },
]
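Mechanically, the processor matches one image to each {"type": "image"} placeholder in reading order across the whole conversation, so the call would look roughly like this (image1 and image2 as loaded earlier; the single-image reliability caveat from the Meta reply above still applies):

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# One image per {"type": "image"} placeholder, in the order they appear in `messages`.
inputs = processor([image1, image2], input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0], skip_special_tokens=True))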

Did we get an answer to this?
I have a set of images and a set of context passages from my retriever engine; I now need to pass these to my generation model (any vision model) to get the final response.
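One way to fit this into the single-image constraint is to put the retrieved text passages directly into the prompt and attach only the top-ranked image. A rough sketch, reusing model and processor from above (retrieved_passages, retrieved_images, and question are placeholder names for the retriever outputs and the user query):

context = "\n\n".join(retrieved_passages)   # text chunks from the retriever
top_image = retrieved_images[0]             # PIL image ranked most relevant

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": f"Context:\n{context}\n\nQuestion: {question}"}
]}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(top_image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
answer = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(answer[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))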

"Our training pipeline consists of multiple stages, starting from pretrained Llama 3.1 text models. First, we add image adapters and encoders, then pretrain on large-scale noisy (image, text) pair data. Next, we train on medium-scale high quality in-domain and knowledge-enhanced (image, text) pair data." - from LLAMA 3.2 blog by Meta

I don't think the Llama 3.2 models can handle multiple images, as they were not trained that way. I am planning to use MIVC (https://assets.amazon.science/5b/f4/131b6a25445fae6d1fec2befbb84/mivc-multiple-instance-visual-component-for-visual-language-models.pdf) on the Llama 3.2 models to aggregate the embeddings of multiple images into one embedding. If anyone is interested, you can join me.
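For anyone curious what that aggregation looks like, here is a self-contained sketch of gated attention pooling over per-image embeddings in the spirit of MIVC (the embedding dimension and the source of the embeddings are assumptions; wiring the pooled embedding into Llama 3.2's vision pathway is the hard part and is not shown):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Gated attention pooling (multiple-instance learning style): N embeddings -> 1.
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.v = nn.Linear(dim, hidden)
        self.u = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (num_images, dim)
        scores = self.w(torch.tanh(self.v(x)) * torch.sigmoid(self.u(x)))  # (N, 1)
        weights = torch.softmax(scores, dim=0)
        return (weights * x).sum(dim=0)  # (dim,)

# e.g. pool embeddings of 3 images from some vision encoder (random stand-ins here)
pooled = AttentionPool()(torch.randn(3, 1024))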

There are a few other VLMs that allow multiple images at inference time: NVLM, LLaVA, GPT-4o.
