Low inference throughput?

#2 opened by RonanMcGovern

Running:

python3 -m sglang.launch_server --model-path leon-se/gemma-3-27b-it-FP8-Dynamic --context-length 8192 --host 0.0.0.0 --port 8000 --chat-template gemma-it
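
A minimal way to reproduce the measurement below (a sketch, assuming the `openai` Python client is installed and that SGLang's OpenAI-compatible /v1 endpoint is being used; the prompt and max_tokens are arbitrary):

```python
# Rough single-stream throughput check (concurrency of 1) against the
# OpenAI-compatible endpoint that sglang.launch_server exposes on port 8000.
# Assumption: the served model name matches the --model-path used at launch.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="leon-se/gemma-3-27b-it-FP8-Dynamic",
    messages=[{"role": "user", "content": "Explain FP8 quantization in a few sentences."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.2f}s -> {out_tokens / elapsed:.1f} tok/s")
```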

I'm getting about 42 tokens/s at a concurrency of 1. That seems a little low, I would think, but I'm not sure.

Owner

I only test the models with vLLM. What hardware are you using?

Ah yeah, sorry: 1x H100 SXM.

@leon-se I'm getting strange outputs from the model. What vLLM version did you use?

Edit: I figured it out. I was using the nightly vLLM build; after reinstalling the latest stable version (0.8.4), it works!
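
For anyone hitting the same issue, here is a minimal sketch of checking which vLLM build is actually imported and loading the checkpoint with vLLM's offline API (pin the stable build with `pip install vllm==0.8.4`; the prompt and sampling settings are illustrative, not the owner's exact setup):

```python
# Sanity-check that the stable vLLM build (0.8.4, the version that worked in
# this thread) is the one imported, then do a quick generation to confirm
# the outputs look sane.
import vllm
from vllm import LLM, SamplingParams

print("vLLM version:", vllm.__version__)  # expect 0.8.4, not a nightly

llm = LLM(model="leon-se/gemma-3-27b-it-FP8-Dynamic", max_model_len=8192)
params = SamplingParams(max_tokens=128, temperature=0.7)

out = llm.generate(["What is FP8 dynamic quantization?"], params)
print(out[0].outputs[0].text)
```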
