Reply with 333333333....

#2 by pipilok

There is a problem with your version on the latest llama.cpp (b6144):

F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 --flash-attn -ngl 999 --n-gpu-layers 999 --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4

[screenshot attached: изображение.png]

This is most likely due to flash attention being enabled. When I tested with flash attention, the model would sometimes get caught in a coding loop, or the quality of the generated code would degrade. I also tested flash attention on the base Qwen3 Coder 30B model, and it noticeably degraded coding performance.
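If flash attention is the culprit, a starting point might be the same invocation with `--flash-attn` removed. As an assumption worth re-checking against the llama.cpp build in use, quantized V-cache generally requires flash attention in llama.cpp, so the `--cache-type-k`/`--cache-type-v q8_0` flags are dropped here as well; the sampling settings are kept as in the original report:

```
F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 -ngl 999 ^
  --ctx-size 32768 ^
  --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf ^
  --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 ^
  --n-cpu-moe 2 --no-kv-offload -t 4
```

If the looping disappears with this configuration, that would point to flash attention (or the quantized KV cache it enables) rather than the model weights themselves.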
