How to run in vLLM
#1 by boshko - opened
Can you please update the instructions on how to run this quantized model in vLLM?
Thanks!
Try using SGLang; it has a vLLM backend:

python3 -m sglang.launch_server \
    --served-model-name tonjoo-coder \
    --model-path unsloth/Devstral-Small-2505-bnb-4bit \
    --chat-template /models/devstral.jinja \
    --port 8000 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8
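Once it's up you can hit it like any OpenAI-compatible server. A minimal sketch, assuming the defaults from the command above (port 8000, served model name tonjoo-coder); the prompt is just an illustration:

# Query the SGLang server through its OpenAI-compatible /v1 endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tonjoo-coder",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Write a Python hello world."}],
    max_tokens=128,
)
print(response.choices[0].message.content)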
Actually, I tried tp=2 but it's not working; tp=1 might work.
You should be able to do it fine with:

vllm serve unsloth/Devstral-Small-2505-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes

See https://docs.vllm.ai/en/latest/features/quantization/bnb.html
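The same flags work for offline inference through vLLM's Python API. A minimal sketch following the bitsandbytes docs linked above; the prompt and sampling settings are illustrative:

# Load the 4-bit bitsandbytes checkpoint directly with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/Devstral-Small-2505-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Write a Python hello world."], params)
print(outputs[0].outputs[0].text)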
That worked, thanks!
boshko changed discussion status to closed