How to run GGUF with vLLM? "Unknown gguf model_type: gemma3_text" error


I have the latest vLLM built from the main branch (vLLM v0.9.2) with the latest Transformers.

How do I run "medgemma-27b-text-it-UD-Q8_K_XL.gguf" with vLLM? This is what I'm using now:

vllm serve '/media/models/google/unsloth/medgemma-27b-text-it-GGUF-UD-Q8_K_XL/medgemma-27b-text-it-UD-Q8_K_XL.gguf' \
    --device cuda \
    --tensor-parallel-size 4 \
    --gpu_memory_utilization 0.85 \
    --port 5008 \
    --max-model-len 18000 \
    --block-size 128 \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --enable-chunked-prefill \
    --enable-prefix-caching
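
For what it's worth, vLLM's GGUF docs recommend passing --tokenizer pointing at the base HF model instead of letting vLLM convert the tokenizer from the GGUF file, so I assume a minimal invocation would look something like this (base repo name assumed to be google/medgemma-27b-text-it):

vllm serve '/media/models/google/unsloth/medgemma-27b-text-it-GGUF-UD-Q8_K_XL/medgemma-27b-text-it-UD-Q8_K_XL.gguf' \
    --tokenizer google/medgemma-27b-text-it \
    --tensor-parallel-size 4 \
    --port 5008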

Either way, with the full command above I'm getting this error:

(VllmWorker rank=2 pid=3369240) ERROR 06-25 16:48:30 [multiproc_executor.py:487] raise RuntimeError(f"Unknown gguf model_type: {model_type}")
(VllmWorker rank=2 pid=3369240) ERROR 06-25 16:48:30 [multiproc_executor.py:487] RuntimeError: Unknown gguf model_type: gemma3_text
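
In case it helps with diagnosis: the architecture string embedded in the file can be checked with the gguf-dump utility that ships with the gguf pip package (a minimal sketch; I assume --no-tensors just skips the per-tensor listing):

pip install gguf
# Dump the key/value metadata and pull out the architecture field vLLM maps to a model_type.
gguf-dump --no-tensors '/media/models/google/unsloth/medgemma-27b-text-it-GGUF-UD-Q8_K_XL/medgemma-27b-text-it-UD-Q8_K_XL.gguf' \
    | grep 'general.architecture'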
