Error during inference with image and text.

#12

by aarbelle - opened Feb 28

Feb 28

Running into the following error when trying inference with Image+Text

File "/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393

It doesn't happen for all images, just some.
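
Since only some images fail, one way to narrow it down is to run the same prompt over a batch of images and record which ones raise. The following is a minimal, generic triage sketch, not part of the model code: `find_failing_inputs` and the `run_inference` callable are hypothetical names, and the stub stands in for a real processor + `model.generate` call.

```python
def find_failing_inputs(named_inputs, run_inference):
    """Return (name, message) pairs for inputs whose inference raises AssertionError."""
    failures = []
    for name, item in named_inputs:
        try:
            run_inference(item)
        except AssertionError as exc:
            failures.append((name, str(exc)))
    return failures


# Stub standing in for the real processor + model.generate call (assumption):
# it "fails" for any image wider than 1000 px, purely for illustration.
def fake_inference(size):
    width, height = size
    assert width <= 1000, f"mismatch for {width}x{height}"


samples = [("small.jpg", (448, 448)), ("large.jpg", (1300, 900))]
failures = find_failing_inputs(samples, fake_inference)
print(failures)  # [('large.jpg', 'mismatch for 1300x900')]
```

Logging the image size next to each failure message should make it easier to see whether the mismatch depends on resolution or aspect ratio.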

EricB

Feb 28

Same error with:

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="mps",  # already places the model on MPS; no extra .to("mps") needed
    trust_remote_code=True,
    _attn_implementation='eager',
)

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

Error:

File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933

PrateekTikku

Mar 3

@aarbelle @EricB Did you find a solution to this problem yet? I am facing the same error...

nguyenbh changed discussion status to closed Mar 9

DefOs9

Mar 13

@nguyenbh, I think this needs to be re-opened. There appears to be a bug here, and it's easily reproducible.

output_imgs[-1].shape[1] is the length of a concatenation, torch.cat([sub_img, self.glb_GN, glb_img], dim=1), where:

sub_img.shape[1] is itself the length of a concatenation of (a) something the size of the useful_height * useful_width product (16 * 12 = 192) and (b) temp_sub_GN (16), i.e. 208
self.glb_GN.shape[1] is 1
glb_img.shape[1] is 16 * 16 + 16 = 272

In my case, with a small 448x448 image, these sum to 208 + 1 + 272 = 481. This looks theoretically correct to me.
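
To make that arithmetic easy to re-check for other image sizes, here is a small sketch. It only reproduces the numbers quoted above; the variable names are illustrative and are not taken from modeling_phi4mm.py.

```python
# Values observed for a 448x448 image (from the breakdown above).
useful_height, useful_width = 16, 12

# Sub-image part: the useful_height * useful_width product plus temp_sub_GN (16).
sub_img_len = useful_height * useful_width + 16   # 192 + 16 = 208

glb_gn_len = 1                                     # self.glb_GN.shape[1]
glb_img_len = 16 * 16 + 16                         # 272

concat_len = sub_img_len + glb_gn_len + glb_img_len
print(concat_len)  # 481
```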

With the image attention mask on (default), temp_len is defined as:

temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction

In my case, the numbers for these are: 320 + 17 + 16 = 353.
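
Plugging the observed values into that formula shows the size of the gap. This is only a sketch of the arithmetic: the three inputs are the numbers I measured, hard-coded here rather than computed from a real attention mask.

```python
# Observed terms of the temp_len formula for the same 448x448 image.
mask_sum = 320        # image_attention_mask[_bs, :B_+1, 0::2, 0::2].sum()
useful_height = 16    # so (useful_height + 1) == 17
rows_term = 16        # base_feat_height // base_feat_height_reduction

temp_len = mask_sum + (useful_height + 1) + rows_term
concat_len = 481      # length of torch.cat([sub_img, self.glb_GN, glb_img], dim=1)

print(temp_len, concat_len, concat_len - temp_len)  # 353 481 128
```

The assertion compares exactly these two quantities, so for this image it fails with a difference of 128.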

The whole code for Phi4MMImageEmbedding is rather impenetrable, but I don't see how those two values are supposed to be equal. The culprit seems to be the logic around the temp_len calculation.

DefOs9

Mar 13

Even when I simply comment out the assertion statement, I get an error right below when I send a large image:

  File "/Users/myuser/repos/phi4model/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 769, in forward
    image_hidden_states = self.image_embed(
  File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 448, in forward
    new_hidden_states = hidden_states.index_put(
RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be broadcast to indexing result of shape [1792, 3072]

Interestingly, neither of the two values compared in the assertion, i.e. temp_len (1332) and output_imgs[-1].shape[1] (900), matches the size expected here (1792)!
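
My reading of this RuntimeError, which I have not confirmed against every code path, is that index_put writes one image feature row per image placeholder position in input_ids, so the two counts have to match. The sketch below illustrates that consistency check in plain Python; `IMAGE_TOKEN_ID` is a made-up placeholder value and `check_image_slots` is a hypothetical helper, neither comes from the model code.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, NOT the model's real token id


def check_image_slots(input_ids, num_feature_rows, image_token_id=IMAGE_TOKEN_ID):
    """Compare the number of image placeholder tokens with the number of feature rows."""
    slots = sum(1 for token in input_ids if token == image_token_id)
    return slots, num_feature_rows, slots == num_feature_rows


# Counts taken from the error message: 1792 positions to fill, 900 rows available.
ids = [5, 6] + [IMAGE_TOKEN_ID] * 1792 + [7]
result = check_image_slots(ids, 900)
print(result)  # (1792, 900, False)
```

If that reading is right, the processor reserved 1792 slots for this image while the embedding module produced only 900 rows, which would mean the two sides disagree about how many tokens the image occupies.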

nguyenbh changed discussion status to open Mar 14

EricB

Mar 14

@nguyenbh have you been able to reproduce this?

liyunsheng13

Mar 14

Hi, we tested your code on our side and could not reproduce the issue you reported. I'm not sure whether the issue is caused by MPS, since we use CUDA and do not get this error.

DefOs9

Mar 14

When running on my MacBook using device = "cpu", I still get:

AssertionError: temp_len: 1536, output_imgs[-1].shape[1]: 1792

But if I comment out the assertion, things actually run fine.

It looks like perhaps the code is a bit too CUDA-specific.

DefOs9

Mar 14

I believe I fixed it, see PR: https://huggingface.co/microsoft/Phi-4-multimodal-instruct/discussions/45
