Error during inference with image and text.
#12 by aarbelle - opened
I'm running into the following error when running inference with image + text input:
File "/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393
It doesn't happen for all images, only for some of them.
I get the same error with the following script:
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="mps",
    trust_remote_code=True,
    _attn_implementation='eager',
).to("mps")

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Error:
File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933