Low inference throughput?
#2 · opened by RonanMcGovern
Running:
python3 -m sglang.launch_server --model-path leon-se/gemma-3-27b-it-FP8-Dynamic --context-length 8192 --host 0.0.0.0 --port 8000 --chat-template gemma-it
I'm getting about 42 tokens/s at a concurrency of 1. That seems a little low, I would think, but I'm not sure.
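For anyone who wants to reproduce the number, a rough single-stream check against the OpenAI-compatible endpoint that sglang exposes could look like the sketch below (the prompt, max_tokens, and reading the token count from the usage field are assumptions, not the exact benchmark used here):

```python
# Rough single-stream throughput check against the sglang server started above.
# Assumes the OpenAI-compatible endpoint is on localhost:8000 and that the
# response includes a "usage" block with completion_tokens.
import time
import requests

payload = {
    "model": "leon-se/gemma-3-27b-it-FP8-Dynamic",
    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
    "max_tokens": 512,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=payload, timeout=300).json()
elapsed = time.time() - start

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```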
I only test the models with vLLM. What hardware are you using?
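For comparison, a vLLM launch for this checkpoint would look something like the line below (a sketch only; the context length, host, and port flags are assumptions, adjust them to your setup):

vllm serve leon-se/gemma-3-27b-it-FP8-Dynamic --max-model-len 8192 --host 0.0.0.0 --port 8000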
Ah yeah, sorry: 1x H100 SXM.
@leon-se I'm getting strange outputs from the model. What vLLM version did you use?
Edit: I figured it out. I was using the nightly vLLM build, but I reinstalled the latest stable version (0.8.4) and it works!
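In case anyone else hits the same strange-output issue, switching from the nightly to the stable release is just a reinstall (exact command may differ depending on your environment):

pip install vllm==0.8.4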