Unable to get the model to process an image on my PC
I can run zai-org/GLM-4.1V-9B-Thinking behind vLLM and chat normally.
Vision also works when I pass an HTTPS image URL. But any attempt to use a local image (file://…) either (a) returns a 400 “must be a subpath of --allowed-local-media-path” even when it is, or (b) just hangs (no response body), with no useful error in the API logs. Earlier, base64 data URLs returned 500 with a Pillow decode error. The net effect: model can “see” web images, but cannot “see” local files.
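For reference, this is roughly the request body I'm sending when it fails (the image filename is a placeholder; the model name matches my --served-model-name flag, and the file really does live under the directory passed to --allowed-local-media-path):

```python
import json

# Example chat-completion payload for the vLLM OpenAI-compatible server.
# "test.jpg" is a placeholder filename; the real file sits under
# /home/nova/ai_projects, the directory given to --allowed-local-media-path.
payload = {
    "model": "glm-4.1v-thinking",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "file:///home/nova/ai_projects/test.jpg"},
                },
            ],
        }
    ],
}

print(json.dumps(payload, indent=2))
```

Swapping the file:// URL for an HTTPS URL in the same payload works, which is why I think the problem is specific to local media handling.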
Environment
OS: Windows 11 + WSL2 (Ubuntu)
Python: 3.12.9 (conda env)
vLLM: 0.10.1 (works great for other models)
Transformers: (whatever ships in this env; can provide exact version)
CUDA / Torch: CUDA 12.8 and 2.9.0.dev20250811+cu128
GPU(s): RTX 5090 and 3090
This is what I see in the server logs whenever I try to process an image, no matter what I've tried:
(APIServer pid=1806) INFO: Started server process [1806]
(APIServer pid=1806) INFO: Waiting for application startup.
(APIServer pid=1806) INFO: Application startup complete.
(APIServer pid=1806) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(APIServer pid=1806) INFO 08-12 21:48:42 [chat_utils.py:470] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
Launch command:
nova@DESKTOP-L4B0FRV:~/ai_projects$ VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
python3 -m vllm.entrypoints.openai.api_server \
  --model zai-org/GLM-4.1V-9B-Thinking \
  --served-model-name glm-4.1v-thinking \
  --trust-remote-code \
  --quantization bitsandbytes \
  --dtype auto \
  --host 172.25.46.120 \
  --port 8000 \
  --api-key vllm \
  --max-model-len 8192 \
  --allowed-local-media-path /home/nova/ai_projects
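To rule out path-resolution issues on my side (symlinks, WSL path quirks), I also checked locally that the resolved image path really sits under the allowed root. This is my own approximation of a subpath check, not vLLM's actual code, and test.jpg is a placeholder:

```python
from pathlib import Path

# My approximation of an "is this file under the allowed root?" check.
# Both paths are normalized first so symlink/".." tricks don't confuse it.
allowed_root = Path("/home/nova/ai_projects").resolve()
image_path = Path("/home/nova/ai_projects/test.jpg").resolve()  # placeholder file

# True only if image_path is the root itself or strictly inside it.
is_subpath = image_path == allowed_root or allowed_root in image_path.parents
print(is_subpath)
```

This prints True for my setup, which is why the 400 "must be a subpath" response surprises me.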
I was able to run the model and process images with plain Transformers just fine. With vLLM, though, all I can get is normal text no matter what I try; reading images seems to be the problem.
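For completeness, the base64 attempt that returned the 500 with a Pillow decode error was built roughly like this (the path and MIME type are placeholders; if the MIME prefix doesn't match the actual bytes, a server-side decode error would be one plausible outcome):

```python
import base64

def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read an image file and wrap it as a data URL for the image_url field."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

I then put the returned string in the same image_url slot where the file:// URL went.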