5.0bpw output token errors?

#1 opened by recallmenot

Hi,
I'm trying to move from llama.cpp to exllamav2 (with tabby), and this is my favorite local model. The Q5 GGUF used to be quite good, but with the 5.0bpw EXL3 version I'm noticing strange stray tokens (symbols, Cyrillic, Chinese characters, plainly wrong tokens) in the output.

Example prompt: "Write me a long poem about the development of the Klingon empire since Kahles."
Errors collected from 6 runs (not every run has them):

The United Federation of Planets, with values 생성 to Avaliar fear.
Quark's Bar,]=[ an unlikely place, for diplomacy to bloom,
For honor, for glory, forア,ア Qapla'! (Success!) we cry,
Qo'noS: The homeworld of the Klingon Empire_CONST.
With the forging of the First Sword, aFolderPath laid,
Implemented reforms, theュHome's power to resound,
Reflections on honor, in the faceKG.of scorn,

I've never seen any of this with the Q5 GGUF in llama.cpp.

I've tried re-downloading the model and get the same sha1sum:
./start.sh download turboderp/Llama-3.3-Nemotron-Super-49B-v1-exl3 --revision 5.0bpw
find . -type f \( -exec sha1sum "$PWD"/{} \; \) | awk '{print $1}' | sort | sha1sum
d8382d265b354adfb9174908a7da07fb82877b75 -

This is my tabbyAPI loader:
curl http://localhost:5000/v1/model/load \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key" \
  -d '{
    "model_name": "Llama-3.3-Nemotron-Super-49B-v1-exl3",
    "max_seq_len": 24000,
    "gpu_split_auto": true,
    "tensor_parallel": true
  }'

tensor_parallel makes no difference here.

Client is open-webui but that shouldn't matter.

The client probably does matter here. You should check if you have any sort of truncation enabled and what temperature you're sampling at. The default in Tabby is raw sampling, so if the client doesn't provide any sampling settings with the request, that's just what you get. Without truncation sampling there's always a non-zero chance of sampling a nonsense token at any given moment, which could be what you're seeing.
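One quick way to rule that out is to bypass open-webui and send a request with explicit sampling parameters straight to Tabby's OpenAI-compatible endpoint. A minimal sketch (the key is a placeholder, and the temperature/top_p values here are just an example of turning on some truncation, not a recommendation):

# Hypothetical test request: pass sampling settings explicitly so the server
# doesn't fall back to raw sampling when the client omits them.
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key" \
  -d '{
    "messages": [{"role": "user", "content": "Write me a long poem about the development of the Klingon empire since Kahles."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 1024
  }'

If the stray tokens disappear with those set, the weights are fine and it's purely a sampling-settings issue.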

NVIDIA recommends a temperature of 0.6 and top-P of 0.95 for this model. I'm not sure if llama.cpp bakes those settings into the model, or if it just always enforces some amount of truncation no matter what. ExLlama doesn't force anything, but you can set global overrides in Tabby's config to adjust settings that aren't exposed in any given client.
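For this model that could look roughly like the following. This is a sketch from memory, so check the sample override preset that ships with tabbyAPI for the exact key names; nemotron.yml is just a hypothetical filename:

# sampler_overrides/nemotron.yml (hypothetical preset file)
temperature:
  override: 0.6
  force: false   # if I recall correctly, force: false means the value is only used when the client doesn't send its own
top_p:
  override: 0.95
  force: false

# config.yml
sampling:
  override_preset: nemotron

With force left off, clients that do send their own sampling settings keep working as before; the overrides only fill in the gaps.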
