Reply with 333333333....

#2 by pipilok

There is a problem with your version on the latest llama.cpp (b6144):

F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 --flash-attn -ngl 999 --n-gpu-layers 999 --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 32768 --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 --n-cpu-moe 2 --no-kv-offload -t 4

[screenshot attached: изображение.png]

This is most likely due to flash attention being enabled. When I tested with flash attention, the model would sometimes get caught in a coding loop, or the quality of the generated code would degrade. I also tested flash attention on the base Qwen3 Coder 30B model, and it noticeably degraded coding performance.
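If flash attention is the culprit, a starting point might be the same invocation with `--flash-attn` removed. As an assumption worth re-checking against the llama.cpp build in use, quantized V-cache generally requires flash attention in llama.cpp, so the `--cache-type-k`/`--cache-type-v q8_0` flags are dropped here as well; the sampling settings are kept as in the original report:

```
F:\llama.cpp\llama.cpp\build\bin\Release\llama-server.exe --port 9292 -ngl 999 ^
  --ctx-size 32768 ^
  --model .\models\BasedBase\Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2\Qwen3-30B-A3B-Coder-480B-Distill-v2-Q8_0.gguf ^
  --temp 0.7 --repeat-penalty 1.0 --min-p 0.00 --top-k 20 --top-p 0.8 ^
  --n-cpu-moe 2 --no-kv-offload -t 4
```

If the looping disappears with this configuration, that would point to flash attention (or the quantized KV cache it enables) rather than the model weights themselves.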
