vLLM on 24gb gpu
#2 opened by roadtoagi
I was able to load an 8k context with --max-num-seqs 10 and --gpu-memory-utilization 0.99 (rough launch sketch after this list), though:
- it barely fits
- the v1 engine implementation doesn't calculate attention accurately, so some quality loss is expected
- the v0 engine doesn't support flash attention with this model, so the memory needed for the context is gigantic
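For reference, a minimal sketch of roughly that setup using vLLM's Python API — the model name is a placeholder for whatever quant you're loading, and the other values are just the flags from my run:

```python
# Rough sketch of the setup above -- the model name is a placeholder,
# the values match the flags I used (8k context, --max-num-seqs 10,
# --gpu-memory-utilization 0.99 on a 24 GB card).
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-quantized-model",  # placeholder: whichever quant you load
    max_model_len=8192,                     # 8k context -- barely fits in 24 GB
    max_num_seqs=10,                        # cap concurrent sequences to keep the KV cache small
    gpu_memory_utilization=0.99,            # squeeze out almost all of the VRAM
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```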
Also, it seems like vLLM with this quant always tries to add "(English translation)" in brackets after answering requests in other languages. Maybe the vLLM implementation is incorrect, or something is wrong with the calibration.