About the fundus model
I encountered the following problem while using the fundus model: RuntimeError: The size of tensor a (449) must match the size of tensor b (225) at non-singleton dimension 3. I have tried various approaches, including modifying parameters and the input dimensions, but still haven't solved the problem.
Thanks for your question! Have you tried other models? Does only the fundus model fail?
Also, make sure your input image is resized to 384×384.
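As a quick sanity check, you can inspect what the image processor actually produces; a mismatch in the patch grid would explain a tensor-size error like yours. This is only a sketch (the model path and image name are placeholders, and it assumes the standard Qwen2-VL image processor):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./your_fundus_model")  # placeholder path
image = Image.open("your_image.jpg").convert("RGB").resize((384, 384))  # placeholder image

# The Qwen2-VL image processor returns a flattened patch sequence plus
# the (t, h, w) grid it was cut into; check that these look consistent.
vision_inputs = processor.image_processor(images=[image], return_tensors="pt")
print(vision_inputs["pixel_values"].shape)
print(vision_inputs["image_grid_thw"])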
I just tried the fundus model, and the input image is resized to 384×384. However, it still fails. The code is below:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"  # must be set before torch is imported

import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

MODEL_PATH = "./deepseek_fundus"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map=device,
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

def process_single_message(message):
    # Load the image referenced in the message and force it to 384x384.
    image_path = message["content"][0]["image"]
    image = Image.open(image_path).convert("RGB")
    print("image size:", image.size)
    image = image.resize((384, 384))

    # Build the chat prompt, then preprocess text and image together.
    text = processor.apply_chat_template(
        [message],
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = processor(
        text=text,
        images=[image],
        videos=None,
        padding=True,
        return_tensors="pt"
    ).to(device)

    generated_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True
    )
    # Decode only the newly generated tokens, skipping the prompt.
    output_text = processor.batch_decode(
        generated_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    return output_text[0]

message = {
    "role": "user",
    "content": [
        {
            "type": "image",
            "image": "0_left.jpg"
        },
        {
            "type": "text",
            "text": "这张眼底图有什么异常?"  # "What abnormalities does this fundus image show?"
        }
    ]
}
result = process_single_message(message)
print("模型输出:", result)  # "Model output:"
Thanks for reaching out.
I’ve verified that my code runs correctly, and I’m using the code provided in the ModelCard for inference. Please try running your code based on that structure and double-check that your input data (including image paths and JSON formatting) matches it exactly.
A possible cause is not calling process_vision_info() to preprocess the image inputs.
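For reference, the ModelCard-style pattern looks roughly like this (a sketch assuming the qwen_vl_utils helper package from the Qwen2-VL repository, with processor and device defined as in your snippet):

from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "0_left.jpg"},
        {"type": "text", "text": "这张眼底图有什么异常?"},  # "What abnormalities does this fundus image show?"
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# process_vision_info resolves the image entries (local paths, URLs, or
# PIL images) into the image/video inputs the processor expects.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(device)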
If you still encounter issues, feel free to let me know, and we can troubleshoot further together.
Thank you for your help.
I rebuilt my code on that structure and it now runs successfully.