Quantization
I wanted to do some tests myself with different quantizations. First step is to just reproduce one of yours using the recipe for IQ2_KS (144.126 GiB, 2.578 BPW).
But ik_llama.cpp runs into a problem that I am not sure I need to worry about (it continues to run, so I am just waiting to see if it works):
================================ Have weights data with 497 entries
[ 1/ 747] token_embd.weight - [ 6144, 151936, 1, 1], type = bf16, Using custom type iq4_ks for tensor token_embd.weight
====== llama_model_quantize_internal: did not find weights for token_embd.weight
Exciting, yeah let me know what you find!
That warning is no problem; the imatrix does not cover token_embd.weight since it is not one of the repeating layers. I get the same thing and it is fine.
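For reference, the quantize call that produces a log like that looks roughly like the following. This is only a sketch, not the exact recipe: the --custom-q rule, imatrix file, paths, and thread count are placeholders, so take the real regex/type pairs from the actual recipe card (and double-check the --custom-q syntax against the guide linked below).

# hypothetical example -- substitute the recipe's real custom-q rules and paths
./build/bin/llama-quantize \
--imatrix /path/to/imatrix.dat \
--custom-q "token_embd\.weight=iq4_ks" \
/path/to/model-BF16.gguf \
/path/to/model-IQ2_KS.gguf \
IQ2_KS \
16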
As you make progress, keep in mind that perplexity measurements are very sensitive to things like context length, so I use simple defaults as much as possible and keep them consistent, e.g.:
wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea wiki.test.raw
# CPU-only compiled example
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 16 \
--no-mmap
You can find a full GPU offload example under Perplexity in my quant cookers guide: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
The --seed doesn't actually matter, but I leave it in my command to see who is using it haha... If you have only one GPU, you can do hybrid CPU+GPU offload with the usual -ngl 99 -ot ... method, and that is fine; see the sketch below.
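For example, a hybrid single-GPU run could look something like this. It is just a sketch: the -ot exps=CPU rule (which keeps the MoE expert tensors in system RAM while everything else goes to the GPU) is illustrative, so adjust it and the thread count to your hardware.

# hypothetical hybrid CPU+GPU offload example
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
-fa -fmoe \
--ctx-size 512 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot exps=CPU \
--threads 16 \
--no-mmap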
Finally, keep in mind that perplexity is not everything. I could also measure KLD and sometimes do, but these models behave fairly well, so using perplexity to guide my recipe decisions is sufficient for my needs.
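If you ever want KLD numbers too, the same perplexity tool can produce them in two passes, assuming your build still carries the upstream --kl-divergence options: first dump the baseline logits from the unquantized model, then score the quant against them.

# 1) save baseline logits from the full-precision model (the file gets large)
./build/bin/llama-perplexity -m model-BF16.gguf -f wiki.test.raw \
--kl-divergence-base logits-base.bin

# 2) score the quantized model against that baseline
./build/bin/llama-perplexity -m model-IQ2_KS.gguf -f wiki.test.raw \
--kl-divergence-base logits-base.bin --kl-divergence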
Cheers!