Thank you VERY MUCH for this!
I have heard about Psyonic Cetacean for a long time, and this remastering (and the one to follow, if I understood correctly), an initiative I salute with enthusiasm, finally gave me the incentive to test it at its best.
Thank you very much, as well as to Jeb Carter.
A few questions:
- Could you always give the two absolute values when you make a PPL comparison, so the numbers are unambiguous?
- Could you specify which kind of iMatrix you are using (dataset, ctx, chunks), and share both the dataset and the imatrix? We quantizers/uploaders should all adopt such a practice to allow reproduction of results!
- If I understand correctly, do you edit the quant strategies to bump up some tensor quants in your "iMatrix plus" (like attn.v.weight, attn.k.weight)?
Thank you for your feedback.
A few notes:
Perplexity was listed on purpose this way. This will be addressed in paper(s) coming out soon.
The bottom line is that the +/- does not only show "error" but RANGE.
Yet, like perplexity, it is an average, which does not even begin to show the "true change" in the model ... it is just a rough indicator.
"Average" is a terrible measure when it comes to statistics.
Rant over...
RE: Imatrix.
This is actually still undergoing testing.
I have found through running tests that the wrong Imatrix dataset can:
1 - Damage the model to the point of crashing completely (especially lower quants) and/or render it useless under specific prompt conditions.
2 - That sometimes the NON-imatrix version is superior, depending on the use case(s).
3 - That the variance between neighboring quants can be larger, an effect that is outsized at lower quants.
4 - That all these factors must be weighed and considered, in some cases on a model-by-model basis and even a quant-by-quant basis (i.e. a different dataset for lower quants vs higher quants).
The current "go to" for stable, reliable imatrix at the moment I am using is "wiki.test.raw" ; this is a smaller dataset than "wiki.train.raw" (10 times the size).
I have found that "wiki.test.raw" (at 655 chunks) is a good measure for both perplexity and imatrix.
It creates consistently stable imatrix quants with no issues I have detected yet whereas other "short cuts" DO CAUSE ISSUES.
That being said, "wiki.test.raw" MAY NOT be the optimum choice. This is still under testing.
Likewise there are a lot of options during the "imatrix" dataset generation (for quanting) yet to be fully explored.
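For reference, a rough sketch of the generation step I am describing (llama.cpp's imatrix tool; the binary may be called imatrix or llama-imatrix depending on the build, and the model file name here is a placeholder only):
# build the matrix from wiki.test.raw, 512-token chunks (roughly 655 chunks for the whole file)
imatrix -m model-f32.gguf -f wiki.test.raw -o model-wiki-test.imatrix -c 512 --chunks 655 -ngl 100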
RE: Imatrix plus.
Please note the quants here are NOT imatrix plus. This will follow in a separate repo.
The "plus" extends "embedding" and "output" tensor to F32.
This adds about 300-600 mb to each quant.
It upscale all quants - imatrix or non imatrix.
Adding/activating this should be on a case by case basis and tested.
Most of the time it is a positive upscale, sometimes it changes output just enough (especially creative) that it might not meet use case requirements.
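To make the "plus" part concrete, a sketch of the quantize step (assuming llama.cpp's quantize tool and its --token-embedding-type / --output-tensor-type options; file names are placeholders, not the actual files in this repo):
# keep embedding and output tensors at F32 while the rest goes to Q4_K_M
quantize --imatrix model-wiki-test.imatrix --token-embedding-type f32 --output-tensor-type f32 model-f32.gguf model-Q4_K_M-plus.gguf Q4_K_M
The two type options are independent of --imatrix, which is why the same treatment can be applied to imatrix and non-imatrix quants alike.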
I agree with your 4 points exactly as you formulated them, because that's what I noticed during the extensive tests I ran on iMatrixes in the first quarter of the year. Then my nerves just broke! ^^
I'd add a 5th: an iMatrix should also ideally be adapted to the use case, as demonstrated by Artefact2 and Ikawrakow regarding language. Small quants and <33b models can be vastly helped in a non-English language by an ad-hoc iMatrix (for example, I could drop PPL by 10-15% in French - a difference clearly visible in usage, even more so than in English - on sub-2bpw quants of Yi-34b just with a French iMatrix).
I often use wiki.train.raw to make the matrix, and wiki.test.raw to test all imatrixes, but Kalomaze's and Turboderp's work, as well as the "enhanced" group-merged matrixes some people make, is also interesting, though I haven't tested them.
But if I find the nerve, I'll make my own as well, with the kind of content I'm usually prompting for.
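If I do, the principle would look something like this (a sketch only; "my_french_corpus.raw" and the model file names are placeholders, not files I'm sharing):
# language-adapted matrix, built from a French corpus instead of the usual English wikitext
imatrix -m Yi-34b-f16.gguf -f my_french_corpus.raw -o Yi-34b-fr.imatrix -c 512 -ngl 100
quantize --imatrix Yi-34b-fr.imatrix Yi-34b-f16.gguf Yi-34b-IQ2_XS-fr.gguf IQ2_XS
# then compare PPL against the same quant made with an English matrix, e.g. on the French wikitext file in the bench list below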
This being said, an iMatrix at high quants (5 BPW and beyond) might do more harm than good.
As for token.embed and output.weight, I also think that it's optimal to keep them untouched, especially considering that (I believe) the embeddings remain in RAM and do not load into VRAM, while the output can also be quantized with benefit to Q8_0 if need be, rather than Q6_K (that quant gives weird results sometimes, especially on the k and v attn weights).
As for your rant, +1. LlamaCPP can run several benches that help to see the differences better. Here's one of my bench command lists:
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -bf arc-challenge-validation.bin --multiple-choice --parallel 2 -ngl 100 -b 1024 -mg 0 -ts 1,0
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -bf arc-easy-validation.bin --multiple-choice --parallel 2 -ngl 100 -b 1024 -mg 0 -ts 1,0
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -f winogrande-debiased-eval.csv --winogrande --parallel 2 -ngl 100 -b 1024 -mg 0 -ts 1,0
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -f wiki.test.raw -ngl 100 -b 1024 -mg 0 -ts 1,0 -c 512
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -f wikitext_test_tokensupershort.fr.raw -ngl 100 -b 1024 -mg 0 -ts 1,0 -c 512
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -bf mmlu-validation.bin --multiple-choice --multiple-choice-tasks 1548 --parallel 2 -ngl 100 -b 1024 -mg 0 -ts 1,0 -c 1024
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -bf truthful-qa-validation.bin --multiple-choice --parallel 2 -ngl 100 -b 1024 -mg 0 -ts 1,0
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -f hellaswag_val_full.txt --hellaswag --hellaswag-tasks 1000 --parallel 2 -ngl 150 -b 1024 -mg 0 -ts 1,0
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -f wiki.test.raw -ngl 100 -b 1024 -mg 0 -ts 1,0 -c 4096
perplexity -m N:\text-generation-webui\models\TeeZee_Kyllene-Yi_34B-v1.1-b2584-iMat-EnFr-IQ2_BLR.gguf -f wikitext_test_tokensupershort.fr.raw -ngl 100 -b 1024 -mg 0 -ts 1,0 -c 4096
Here are the benchmark files : https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/tree/main
Thanks - lots to process / test there.
RE: As for token.embed and output.weight, I also think that it's optimal to keep them untouched, especially considering that (I believe) the embeddings remain in RAM and do not load into VRAM, while the output can also be quantized with benefit to Q8_0 if need be, rather than Q6_K (that quant gives weird results sometimes, especially on the k and v attn weights).
I would say it depends on the use case here. Creative use cases (on a model-by-model basis) really benefit. It is a real-world upscale.
This will barely show in "perplex" - maybe 50-100 points - but you can see it clearly, especially on long-form output generation.
With that being said - if the model is NOT optimized first, then adding these at F16/F32 may cause more harm than good.
Like... drinking coke after brushing your teeth.
Thanks again;
A side note:
Among quants, IQ4_XS (or IQ2_S for the smallest quants) has been a near-constant winner in my tests for token.embed on L2 and L3, dropping the PPL the most.
For Mistral and Mixtral, it seems to be the opposite: Q2_K and Q4_K.
Then again, PPL is just PPL, and it would take extensive testing to know whether, on the other hand, other aspects of the model are actually harmed by the quantization of token.embed or not.
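For what it's worth, a sketch of how that choice translates into a command (I assume llama.cpp's quantize accepts these type names for --token-embedding-type; names and casing may vary with the version, and file names are placeholders):
# token embedding forced to IQ4_XS (my usual winner on L2/L3); swap in q4_K for Mistral/Mixtral
quantize --imatrix model.imatrix --token-embedding-type iq4_xs model-f16.gguf model-IQ3_XS-embd-iq4xs.gguf IQ3_XS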
Heads up:
Could you specify which kind of iMatrix you are using (dataset, ctx, chunks), and share both the dataset and the imatrix? We quantizers/uploaders should all adopt such a practice to allow reproduction of results!
This answer (which I wrote previously) will be revised, based on feedback and "lab testing", because of Imatrix "over pruning" of the 32-bit remasters and outlier effects.
RE: I'd add a 5th: an iMatrix should also ideally be adapted to the use case, as demonstrated by Artefact2 and Ikawrakow regarding language. Small quants and <33b models can be vastly helped in a non-English language by an ad-hoc iMatrix (for example, I could drop PPL by 10-15% in French - a difference clearly visible in usage, even more so than in English - on sub-2bpw quants of Yi-34b just with a French iMatrix).
OK, until early this week I would have agreed with this statement completely; now it comes with the following caveat:
If the source is an "Ultra" / "32 bit" remaster, almost all (common) imatrix datasets will have a negative impact, as the imatrix will now be "pruning" refined and optimized data instead of "dead wood" or "rounding errors". This is based on direct real-world feedback and confirmed by "lab work".
Although the thinking behind "dropping perplexity" is sound under most conditions, in the case of "Ultra/32 bit" it is actually the opposite, or close to it.
In other words, the "imatrix" function needs to be "trimmed" / "dialed down" or almost stopped when using a HQ / ULTRA / 32BIT source.
I would even go so far as to urge caution with a fine-tuned model too - same reason: too much pruning is hurting the quality.
In such cases, using an "anti-theme" dataset for the imatrix can negate the pruning issue for IQ1, IQ2 and part of IQ3 (as layer filters are blocked unless a hard jailbreak is done in the LLamacpp source), while "tensor layer filters" (exclude weights) can dial down the "pruning" from IQ3_XS and up. This tempers the pruning and likewise lowers the amount of perplexity drop, with the intent of stopping or mostly reducing over-pruning.
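As an illustration of the two levers (a rough sketch; the anti-theme matrix, tensor names and file names are placeholders, and --exclude-weights, as I understand it, simply stops the imatrix from being applied to tensors whose names match the given string):
# IQ1/IQ2: an imatrix is mandatory for these quants, so use one built from an "anti-theme" dataset
quantize --imatrix model-antitheme.imatrix model-f32.gguf model-IQ2_XS.gguf IQ2_XS
# IQ3_XS and up: dial down the pruning by excluding selected tensors from the imatrix
quantize --imatrix model-wiki-test.imatrix --exclude-weights attn_k.weight --exclude-weights attn_v.weight model-f32.gguf model-IQ3_XS.gguf IQ3_XS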