Qwen/Qwen2.5-VL-32B-Instruct · HF Model has visibly lower performance than chat.qwen.ai

I am using this model for person appearance matching, and I noticed that the HF model "Qwen/Qwen2.5-VL-32B-Instruct", running both in my local environment and on "https://huggingface.co/spaces/Qwen/Qwen2.5-VL-32B-Instruct", performs visibly worse than when I select "Qwen2.5-VL-32B-Instruct" on chat.qwen.ai. Interestingly, my local inference and "https://huggingface.co/spaces/Qwen/Qwen2.5-VL-32B-Instruct" consistently reach the same accuracy.

The text prompt is below. Due to compliance requirements, I cannot share the images. My current suspicion is that chat.qwen.ai's backend ignores my model selection and quietly uses a different model, perhaps the bigger 72B one. Can someone from the Qwen team confirm this?

Prompt: Based on the appearance of the person in each image, are they likely the same person? You should ignore the background and only focus on the person's appearance, clothing, etc. If their clothing is visibly different, they are not the same person. Your output should be a score from 0 to 10, where 0 means definitely not the same person and 10 means definitely the same person. Please only output the score.
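
For anyone who wants to reproduce the comparison on their own image pairs, here is a minimal sketch of how the replies from each backend could be turned into an accuracy number. It is not the exact script I used: the helper names `parse_score` and `accuracy` are hypothetical, and the decision threshold of 5 (score above 5 counts as "same person") is an assumption that is not part of the prompt.

```python
import re
from typing import List, Optional


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone integer in 0..10 found in a reply, or None."""
    match = re.search(r"\b(10|[0-9])\b", reply)
    return int(match.group(1)) if match else None


def accuracy(replies: List[str], labels: List[bool], threshold: int = 5) -> float:
    """Fraction of pairs where (score > threshold) matches the ground-truth label.

    `threshold` is an assumed cut-off; replies without a parseable score
    are counted as wrong.
    """
    if not labels:
        return 0.0
    correct = 0
    for reply, same_person in zip(replies, labels):
        score = parse_score(reply)
        if score is not None and (score > threshold) == same_person:
            correct += 1
    return correct / len(labels)
```

Running `accuracy` once on the replies collected from the local model and once on the replies collected from chat.qwen.ai, with the same labels and the same image pairs, gives two numbers that can be compared directly.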