Thanks and Question about quanting...
Hi;
First, thank you for all your hard work. I have downloaded and tested over 500 models and over 100 IQ versions - many made by you.
Last week I also embarked on a quest to locate non-quanted GGUF models and, with the help of gguf-my-repo, have automated some of this task. However, the sheer number of models (sometimes partial or entire repos of a model maker) is uhh... getting a little overwhelming.
If you don't mind, could you tell me how you quant the repos so fast?
My interest is twofold:
1 - Quanting more models, especially with imatrix [IQ4_NL especially].
2 - Adding more quant versions to my own repo in the future for not only my own use, but the community too.
Thank you for your time.
David.
Well... when I started (around February), quantising was trivial: few quant formats, no imatrix quants, and the existing formats were fast to generate - basically at NVMe I/O speed. Then we got more formats, and then imatrix quants, all of which together required an order (or two) of magnitude more computing power.
So I grew into it - I probably wouldn't have started quantising stuff a month later, too daunting a task.
And just like that, I started with a simple POSIX shell script that generated a few quants, then another hack that downloaded from and uploaded to Hugging Face. Then I took over eight company servers ("2013 quad-core Xeons and other garbage") and wrote a job scheduler, all while retaining the very hacky but well-tested scripts I had written. Then I did the same thing with imatrix computations (I do those on my computer at home).
So while my computational power totally sucks, I can keep my boxes close to 100% busy, day and night.
My typical workflow looks like this: "llmjob add static nice -100 https:/..." adds one or more Hugging Face model URLs to the queue, which eventually get assigned to a server, downloaded, converted, quantized and uploaded, the README gets patched, etc. For imatrix, it's been the same since last week or so: "imatrixjob add https://" adds a quant or a source gguf to the queue; it gets downloaded, my graphics card crunches on it, and it uploads the resulting imatrix.dat to one of the servers, from where the job scheduler will usually pick it up automatically.
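A heavily simplified, hypothetical version of that loop using only public tools (huggingface-cli plus llama.cpp's convert.py and quantize - script and binary names shift between llama.cpp versions, and repo/path names here are placeholders) would look something like this; the real llmjob/imatrixjob scheduler is private, much hackier, and does far more bookkeeping:

    #!/bin/sh
    # Rough sketch only: download -> convert -> quantize -> upload for one model.
    set -e
    MODEL="$1"                      # e.g. some-author/Some-Model
    NAME=$(basename "$MODEL")

    huggingface-cli download "$MODEL" --local-dir "$NAME"
    python convert.py "$NAME" --outtype f16 --outfile "$NAME-f16.gguf"
    for Q in Q8_0 Q6_K Q4_K_M; do
        ./quantize "$NAME-f16.gguf" "$NAME-$Q.gguf" "$Q"
    done
    # assumes the target repo already exists
    huggingface-cli upload "my-user/$NAME-GGUF" . . --include "$NAME-*.gguf"

The imatrix side is the same idea, with llama.cpp's imatrix binary slotted in between conversion and quantisation.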
Oh, yeah, and in the morning it's all red because stuff went wrong.
Long story short: automation and reducing manual labor were the key for me - not so much computing power (although I can already see the time when I won't keep up with bigger models; the Mistral 8x22B ones, for example, blow up to ~600GB, which is too much disk space for most servers). I am at a point where all I have to do is queue models and clean up when something goes wrong. Keeps me busy enough.
And here, without much comment, is what I look at many times a day (hfd is a download job, noquant is gguf conversion, static/imatrix are quantisation jobs). My queue had a length of >120 last week, so I am definitely making progress, and will eventually crunch through the remaining "classic" models in my queue. Some have been waiting for two months by now.
And, yeah, this is more red than usual (disk got full and the scheduler was too dumb to notice) - usually there are 3-4 failures per day, either because the model has bugs and/or could not be converted, because I gave the wrong URL (e.g. to a LoRA), or, most commonly, because quantize crashes for one reason or another.
My biggest issues are llama.cpp stability/bugs.
That's all rather anecdotal, but if you have more questions, feel free to ask.
Ah, yes, regarding IQ4_NL - after investigating it a bit more, this format is really superseded in practically every way by IQ4_XS.
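For reference, producing such a quant is roughly a one-liner once an imatrix.dat exists (file names are placeholders, and the binary is called llama-quantize in newer llama.cpp builds):

    ./quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS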
Thank you so much for the detailed reply.
Looking at using Colab(s) (there is an imatrix one... no idea if it works yet) and/or gguf-my-repo (running 3 at a time) to automate some of the work.
Hear you on the llama.cpp issues -> some FP16s will not quant because of "dumb reasons" -> could download, edit the files, upload, then quant... for the exceptional ones.
Right now I am finding between 25-100 new models per day - many never quanted... never mind the "new ones" popping up daily.
At the moment I quant at Q6, then Q8, and then if possible Q4_K_S or Q4_K_M based on the # of params in the model.
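Roughly, that selection could be scripted like this (the size threshold and file naming are just illustrative assumptions, not my exact cutoffs):

    #!/bin/sh
    # Sketch: always make Q6_K and Q8_0, then one Q4 variant chosen by model size.
    F16="$1"; PARAMS_B="$2"         # f16 gguf and parameter count in billions
    for Q in Q6_K Q8_0; do
        ./quantize "$F16" "${F16%-f16.gguf}-$Q.gguf" "$Q"
    done
    if [ "$PARAMS_B" -ge 13 ]; then EXTRA=Q4_K_M; else EXTRA=Q4_K_S; fi
    ./quantize "$F16" "${F16%-f16.gguf}-$EXTRA.gguf" "$EXTRA"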
Going to add some GPTQ, EXL2 via a colab notebook hopefully in the next few days.
There are some really interesting MoEs in the 1B range too; have a whole bunch at the repo - but also a bunch that llama.cpp wouldn't quant either.
https://huggingface.co/DavidAU
See "TINY MOES" ...
Thanks again for your time and all you do.
David
Regarding colab - you have to start somewhere, and then refine your methods as you go forward. I am still tweaking my setup daily.
I also try to quant older models that have no modern quants, but I am concentrating on the classics (i.e. the models people likely want to try out even today, so as not to waste resources). If you find old models and think they are still relevant for current audiences, you can also submit them to me (at least for the quants I do). Although 25 models/day would kill me :) But when I added older models (on top of the newer ones), my queue increased in size rather than decreased - still, there is only a finite pool of old models, so maybe we can catch up with our goals.
I usually avoid the really tiny ones (<=3B), mostly because I think generating very low-bpp quantisations for those is wasted space. OTOH, those quants are tiny, so they don't waste that much space. Practically, though, I think for tiny models one wants f16, maybe Q8_0, possibly Q4_K_S, and nothing else. Would be a trivial modification to my process. Hmm.
Hear you on the "older" models. In another 6 months... will they be called "classics"?
When there was the switch from ggml to gguf... a lot of good models (and a lot of great work) were lost.
Yeah... 25 models per day is getting stupid here; and sometimes that is from just one model maker!
I am cutting off models older than sept 2023 at the moment, unless I can find "stats" to justify the conversion.
However - sometimes "v1" models work better than "v2" models...
Likewise when reviewing repos... older "test models" don't make the cut. Er... "teeth-cutting models".
However, if the model maker has done "ggufs", I check to see if they are Q4, Q6, etc.
If there is no Q8/Q6, I then make them.
500+ models of testing has shown that there is a difference between Q4, Q5, Q6 and Q8.
Older models especially -> Q6 to Q8 can be a significant jump in quality at high context and for nuanced output (my current use case).
BTW: imats of Q6/Q8? - Wish list. Just sayin'.
From being active in several forums on a daily basis: a lot of people will not download anything less than Q8 up to 70B size.
The tiny ones - especially 2.7-3.4B - are powerful pound for pound and excellent for some use cases, especially CPU-only.
I only quant at Q8/Q6 - too much loss EXCEPT if they are "IQed".
I have tested IQ versions as low as IQ2_XXS at 1B... and they work; one IQ level "higher" and they work reasonably well.
That being said, almost nothing below 1B works at any "Q" or "IQ" except a few MoEs (8x220M).
Some of the 1Bish MOEs are really good... and fast too.
I have tested the FP versions as well via transformers.
Interestingly... GPTQ versions of 1B models (at about Q4 size - 450 megs) work exceptionally well.
Some old... but goodies (haven't checked if they have been quanted - burned up too much time trying that to begin with!)
Note the "Config" ones - you can set the level of "jailbreak".
If you want the current full list; say so... going to add another 50-100 today... lol
TeeZee/DarkSapling-7B-v1.0
TeeZee/DarkSapling-7B-v1.1
TeeZee/DarkSapling-7B-v2.0
vicgalle/ConfigurableHermes-7B
vicgalle/Worldsim-Hermes-7B
vicgalle/SystemConfigHermes-7B
Doctor-Shotgun/TinyLlama-1.1B-32k
Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct
Errors - mostly in "vocab", or needing a flag set at quant time, or minor fix(es) to create:
Will download/fix and hopefully quant the "key" ones in the near future.
This is a partial list of headaches.
err - q6 Sao10K/Zephyrus-L1-33B
err (odd file names) - q6 Sao10K/Stheno-Mix-L2-20B
vocab - q6 WendyHoang/roleplay-model
not supported - WendyHoang/reward_models
err - q6 tdh87/StoryTeller7b
err - q6 tdh87/StoryTeller11b
vocab - q6 tom92119/llama-2-7b-bedtime-story
vocab - q6 SamirXR/NyX-Roleplay-7b
vocab - q6 IDEA-CCNL/Ziya-Writing-13B-v2
? q6 Sao10K/HACM7-Mistral-7B
error - q6 Sao10K/14B-Glacier-Stack
not supported- phanerozoic/Phi-2-Pirate-v0.1
file error - vocab - q8 yzhuang/Llama-2-7b-chat-hf_fictional_v2
file error - vocab - q6 yzhuang/Llama-2-7b-chat-hf_fictional_v1
file error - vocab - yzhuang/MetaTree [specialized decision making]
file error - vocab - q8, yzhuang/TinyLlama-1.1B_fictional_v3
file error - vocab - q8 yzhuang/TinyLlama-1.1B_fictional_v2
file error - vocab - q8 yzhuang/TinyLlama-1.1B_fictional
? q8 [too many SFT files] - yzhuang/phi-1_5_fictional
n/a - q8 princeton-nlp/QuRater-1.3B
n/a ? vocab - Blackroot/Nous-Hermes-Llama2-13b-Storywriter
? vocab - TeeZee/DarkForest-20B-v1.1
n/a ? vocab - q6 - Severian/Nexus-IKM-RolePlay-StoryWriter-Hermes-2-Pro-7B
(only Q8 avail currently)
Hmm, at least darksapling 2 already has imatrix quants from lewd (who makes a lot of good small-model imatrix quants). But yes, a v2 is not necessarily better than a v1 - then again, "better" is hard to define :)
Let me see if I have a few tips or observations: convert.py is very good at producing garbage output with no indication. A common issue is it autodetecting an spm or hfft vocabulary when the model actually uses bpe - I think the problem might be that many models have simply been uploaded with a broken hfft vocab and a working bpe one. So if llama doesn't load the quant, it is worth looking for a vocab.* file indicating a bpe vocabulary.
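For example (exact flags vary between llama.cpp versions, so treat this as a sketch; "model-dir" is a placeholder):

    # See which tokenizer/vocab files the repo actually ships...
    ls model-dir/ | grep -E 'tokenizer|vocab'
    # ...and, if a bpe vocab is present, force it instead of letting convert.py guess.
    python convert.py model-dir --vocab-type bpe --outtype f16 --outfile model-f16.gguf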
Then there is the very common error of a model with 32000 tokens but a 32001-entry vocabulary (or other off-by-a-few combinations). I tend to patch convert.py to remove the extra tokens, which can cause issues but mostly works. There are other fixes, such as using the tokenizer.model from the base model, that I haven't tried yet.
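A quick way to spot that class of mismatch before converting (tokenizer.json layouts vary, and added_tokens may already be counted in the vocab, so this is only a heuristic):

    cd model-dir
    python3 -c 'import json; c=json.load(open("config.json")); t=json.load(open("tokenizer.json")); print("config vocab_size:", c["vocab_size"], " tokenizer entries:", len(t["model"]["vocab"]) + len(t.get("added_tokens", [])))'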
convert-hf-to-gguf.py loads the whole model unnecessarily into memory (for idiotic reasons). Maybe not so much of an issue with 7Bs, but for larger models it obviously is. If that's an issue, you can use http://data.plan9.de/convert-hf-memory-patch.diff to fix it (make sure your TMPDIR is pointing to something that is not an in-memory fs).
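Applying it is roughly (patch level and paths are guesses - adjust for your checkout):

    cd llama.cpp
    wget http://data.plan9.de/convert-hf-memory-patch.diff
    patch -p1 < convert-hf-memory-patch.diff
    export TMPDIR=/mnt/scratch/tmp        # any directory on an actual disk
    python convert-hf-to-gguf.py /path/to/model --outfile model-f16.gguf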
And lastly, do you really want to make one repo per quant? The exllama repo naming from some people already hurts my eyes and pollutes the namespace :) Well, that's not a hint, just a sigh from me.
Thanks for the help and info;
Yes... EXL2 -> same issue for me too... one listing per BPW -> visual eyesore.
Don't like it either - just short on hardware horsepower at the moment, and most quants will default / stay at q6 or q8.
Looks like Colab (all quants -> one repo) will work - minus imatrix (??) at the moment.
Unsure of costs - really hard to get a read on Google's billing ...!
"really hard to get a read on Google's billing ...!" Don't worry, that is totally on purpose :) Be careful, these clouds are designed to trick you.
Got that damned llama.cpp and imatrix running on my PC finally... just a few things to go.
Imatrix -> .dat on CPU... for a 7B -> 1 hour 30 minutes.
-> GPU, when I finally got it working... 4 minutes.
7B compressions -> 65 seconds per.
Now we are talkin!
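For anyone following along, the GPU run above boils down to something like this ("-ngl 99" just offloads as many layers as will fit; calibration.txt stands in for whatever calibration text you feed it):

    ./imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99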
Just need to build a repo for imatrix raw files / download some more.
Next... customize the imatrix/quantize source files to hybridize some models.
Thanks for all your help.
I put the imatrix file into the gguf repo (as imatrix.dat). That way, it's always easy to locate, because it comes as part of the same repo that the gguf is in. While not everybody does it that way, the majority of people publishing quants and imatrix files do it like this (except everybody uses a different filename).
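If you go the same route, dropping it into the existing quant repo is a one-liner (repo name here is just an example):

    huggingface-cli upload my-user/Some-Model-GGUF imatrix.dat imatrix.dat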
It's just that you make interesting-looking models. Can't expect them all to work perfectly, can we? Since your output is... quite enormous... I usually pick a few from time to time to quantize.
As always, if you think I should provide quants or imatrix quants of a specific model, don't hesitate to ask - you know better than me what models are more interesting.
Thanks ; will do.
- Gonna see how many Colabs I can run at the same time... then it will be truly a party. (30-40 minutes to merge -> upload to Hugging Face)
Haha, hard to quantize in that time :)
Heads Up:
https://huggingface.co/DavidAU/Psyonic-Cetacean-full-precision
First proof-of-concept full 32-bit upscale - a 20B, by yours truly.
Tested by the original creator and his group.
Blew his doors off.
Real world tested - off the scale.
Discord: KoboldAI - comments are all over it.
This is a 20B upscale - all parts, merges, etc etc.
Average change in perplexity at Q4_K_M -> -976 points (data on the "gguf" page).
Average change in perplexity at Q4_K_M imatrix -> close to double that number.
Created FIVE 32-bit upscales so far... more in the pipe.
D
I actually looked at it a bit earlier and thought I wouldn't need to quantize it because ggufs are already readily available? But then, it's not my specific flavor, whether good or bad, so let's add it to the queue :)