vLLM on 24gb gpu
#2 opened by roadtoagi
I was able to load an 8k context with --max-num-seqs 10 and --gpu-memory-utilization 0.99 (rough launch sketch after this list), though:
- it barely fits
- the v1 engine implementation doesn't calculate attention accurately, so some quality loss is expected
- the v0 engine doesn't support flash attention with this model, so the memory needed for the context is gigantic
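For reference, a minimal sketch of roughly that setup using vLLM's Python API — the model name is a placeholder for whatever quant you're loading, and the other values are just the flags from my run:

```python
# Rough sketch of the setup above -- the model name is a placeholder,
# the values match the flags I used (8k context, --max-num-seqs 10,
# --gpu-memory-utilization 0.99 on a 24 GB card).
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-quantized-model",  # placeholder: whichever quant you load
    max_model_len=8192,                     # 8k context -- barely fits in 24 GB
    max_num_seqs=10,                        # cap concurrent sequences to keep the KV cache small
    gpu_memory_utilization=0.99,            # squeeze out almost all of the VRAM
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```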
Also, it seems like vLLM with this quant always tries to add "(English translation)" in brackets after answering requests in other languages. Maybe the vLLM implementation is incorrect, or something is wrong with the calibration.