Low inference throughput?
#2 · opened by RonanMcGovern
Running:
python3 -m sglang.launch_server --model-path leon-se/gemma-3-27b-it-FP8-Dynamic --context-length 8192 --host 0.0.0.0 --port 8000 --chat-template gemma-it
I'm getting about 42 tokens/s at a concurrency of 1. That seems a little low, I would think, but I'm not sure.
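For anyone who wants to reproduce the number, a rough single-stream check against the OpenAI-compatible endpoint that sglang exposes could look like the sketch below (the prompt, max_tokens, and reading the token count from the usage field are assumptions, not the exact benchmark used here):

```python
# Rough single-stream throughput check against the sglang server started above.
# Assumes the OpenAI-compatible endpoint is on localhost:8000 and that the
# response includes a "usage" block with completion_tokens.
import time
import requests

payload = {
    "model": "leon-se/gemma-3-27b-it-FP8-Dynamic",
    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
    "max_tokens": 512,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=payload, timeout=300).json()
elapsed = time.time() - start

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```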
I only test the models with vLLM. What hardware are you using?
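For comparison, a vLLM launch for this checkpoint would look something like the line below (a sketch only; the context length, host, and port flags are assumptions, adjust them to your setup):

vllm serve leon-se/gemma-3-27b-it-FP8-Dynamic --max-model-len 8192 --host 0.0.0.0 --port 8000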
Ah yeah, sorry: 1x H100 SXM.
@leon-se I'm getting strange outputs from the model. What vLLM version did you use?
Edit: I figured it out. I was using the nightly vLLM build, but I reinstalled the latest stable version (0.8.4) and it works!
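In case anyone else hits the same strange-output issue, switching from the nightly to the stable release is just a reinstall (exact command may differ depending on your environment):

pip install vllm==0.8.4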