How to run in vLLM
#1 by boshko - opened
Can you please update the instructions on how to run this quantized model in vLLM?
Thanks!
Try using SGLang; it has a vLLM backend:

python3 -m sglang.launch_server \
    --served-model-name tonjoo-coder \
    --model-path unsloth/Devstral-Small-2505-bnb-4bit \
    --chat-template /models/devstral.jinja \
    --port 8000 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8
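Once it's up you can hit it like any OpenAI-compatible server. A minimal sketch, assuming the defaults from the command above (port 8000, served model name tonjoo-coder); the prompt is just an illustration:

# Query the SGLang server through its OpenAI-compatible /v1 endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tonjoo-coder",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Write a Python hello world."}],
    max_tokens=128,
)
print(response.choices[0].message.content)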
Actually, I tried tp=2 but it's not working; tp=1 might work.
You should be able to do it fine with:

vllm serve unsloth/Devstral-Small-2505-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes

See https://docs.vllm.ai/en/latest/features/quantization/bnb.html
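The same flags work for offline inference through vLLM's Python API. A minimal sketch following the bitsandbytes docs linked above; the prompt and sampling settings are illustrative:

# Load the 4-bit bitsandbytes checkpoint directly with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/Devstral-Small-2505-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Write a Python hello world."], params)
print(outputs[0].outputs[0].text)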
That worked, thanks!
boshko changed discussion status to closed