Thank you!! (IQ quants)
I've been waiting so long to try the Nemotron Ultra model locally, watching patiently for GGUFs to go up, but I had a feeling that because of the size it might take forever (or be impossible) for people to make an imatrix file and then quant it to IQ formats. I was just about to ask around whether anyone was in a position to make them when I saw you'd already generated an imatrix and uploaded IQ1s. Thank you! You're a star.
Are you planning to upload some IQ2s? I'm hoping to be able to run an IQ2_S or possibly an IQ2_M on my humble maxed-out 96GB Mac Studio M2 Max.
I don't know if my system is up to quanting imatrix IQs with 96GB of unified RAM, but I'd be happy to put my resources to the task if it's capable. If you're using llama.cpp, how much memory are you finding you're using to quant these? Do you need enough to load the whole Q8?
- FYI: I was unable to produce the imatrix from the full f16-precision GGUF, so I used the Q4 quant, which might affect the quality. I've also called this out on the model card.
- For this I used my local server with ~100GB VRAM (GPU) and 128GB RAM; it took about 19 hours.
- AFAIK you don't have to be able to fit the full model in memory for imatrix generation: on my other machine (256GB RAM and 32GB VRAM) I'm generating the imatrix for Llama 4 Maverick. It will take forever (~35hrs) but it's getting there. I'm not sure what caused the imatrix generation to fail here on the .f16 GGUF; someone might be able to shine a light on that.
- IQ2_XXS is in the works.
- Once you have the imatrix I think you can quantize to any quant. You don't need to be able to fit the model in memory.
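For anyone wanting to reproduce this, imatrix generation with llama.cpp looks roughly like the sketch below; the filenames, the calibration text and -ngl are placeholders rather than the exact command used for this upload:

```bash
# Rough sketch of imatrix generation with llama.cpp. Filenames, calibration
# data and -ngl are placeholders; offload as many layers as your VRAM allows.
./llama-imatrix \
  -m Llama-3_1-Nemotron-Ultra-253B-v1-Q4_K_M.gguf \
  -f calibration-data.txt \
  -o nemotron-ultra-253b.imatrix \
  -ngl 40 -c 512
```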
Thanks for the quick and helpful reply!
I'm tempted to give it a whirl. I only have 96GB of unified RAM in total, but I can allocate almost all of it to the 'GPU'. I'm thinking of downloading a Q8_0 and trying to make various quants of it myself, but even that will take some time on my 80Mbps connection (I've been thinking about upgrading, but I'm on a "fixed price forever" plan, which is sweet, and I'd lose that if I upgraded...)
I think by the time I download all the files and complete my first quant, you'll likely have all the important quants done yourself anyway :D I can't imagine there are many people who will want to run a 253B quant above Q4_K_M! Only a few crazy diehards have 96GB+ of GPU memory in their home systems :D
I have 128GB VRAM + 192GB RAM (though slow RAM on a consumer motherboard and CPU, no quad channel), I'm downloading the Q4_K_M to see how it fares against Q3_K_M (from https://huggingface.co/nicoboss/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF). At Q3_K_M I can load 16K ctx with all layers on GPU with fa, ctk q8_0 and ctv q4_0. So Q4_K_M will be a good amount slower, but want to check the difference in quality.
Also, any chance of IQ2_K_XL?
Thanks for your work!
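For reference, the settings mentioned a couple of posts up (all layers offloaded, 16K context, flash attention, quantised KV cache) map to roughly the following llama.cpp invocation; the model filename is a guess and flag spellings can differ between llama.cpp versions:

```bash
# Roughly the Q3_K_M setup described above: full GPU offload, 16K context,
# flash attention, K cache at q8_0 and V cache at q4_0.
./llama-cli \
  -m Llama-3_1-Nemotron-Ultra-253B-v1-Q3_K_M.gguf \
  -ngl 999 -c 16384 \
  -fa -ctk q8_0 -ctv q4_0
```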
Okay Q4_K_M is noticeably better than Q3_K_M haha, but well, it is too slow to run on RAM + VRAM. I guess you need ~180GB VRAM or more to run it comfortably.
I don't know about an IQ2_K_XL quant type; do you mean IQ2_XS?
Oh, my bad, I wrote it wrong: it was UD-Q2_K_XL. It's one of the quants Unsloth does; I'm new to GGUF so I'm not sure whether those can be made generally. E.g. https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-Q2_K_XL
> Okay Q4_K_M is noticeably better than Q3_K_M haha, but well, it is too slow to run on RAM + VRAM. I guess you need ~180GB VRAM or more to run it comfortably.
Oh, please don't say that! I can't even run the Q3_K_M or Q2_K sizes haha. The ideal for me is the IQ2_M quant, which is smaller than Q2_K but of comparable quality. I can probably run IQ2_S a little more comfortably.
At first I thought you were referring to another technique that one of the other big quanters uses, I think it's Bartowski: the output and embedding layers are kept at full Q8_0 while the other layers are quantised normally. This gives a slightly larger file but apparently better outputs. I remember that person using an _XL suffix for quants where _L was already the largest size, and _XXL where _XL was the largest.
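If that is the technique in question, llama.cpp's quantize tool can do something similar via per-tensor type overrides. A minimal sketch, assuming placeholder filenames (exact flag support depends on the llama.cpp build):

```bash
# Keep the token-embedding and output tensors at Q8_0 while quantizing the
# rest of the model to Q3_K_M (a slightly larger file, reportedly better
# output quality). Filenames are placeholders.
./llama-quantize \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  Llama-3_1-Nemotron-Ultra-253B-v1-F16.gguf \
  Llama-3_1-Nemotron-Ultra-253B-v1-Q3_K_M-emb8.gguf \
  Q3_K_M 8
```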
> Okay Q4_K_M is noticeably better than Q3_K_M haha, but well, it is too slow to run on RAM + VRAM. I guess you need ~180GB VRAM or more to run it comfortably.
Try IQ3_M. It should be around 110GB. Performance is quite close to Q4_K_M.
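As a rough sanity check on those numbers: GGUF size is roughly parameter count times bits per weight divided by 8. A minimal sketch, assuming approximate average bpw values for each quant type:

```bash
# Rough GGUF size in (decimal) GB for a 253B model: params(B) * bpw / 8.
# bpw is given in tenths to keep this pure integer bash arithmetic,
# e.g. 27 = ~2.7 bpw; the bpw averages themselves are approximations.
echo "IQ2_M  (~2.7 bpw): ~$(( 253 * 27 / 80 )) GB"
echo "IQ3_M  (~3.7 bpw): ~$(( 253 * 37 / 80 )) GB"
echo "Q4_K_M (~4.9 bpw): ~$(( 253 * 49 / 80 )) GB"
```

That puts IQ2_M somewhere around 85 GB, which is roughly why it is the target size for a 96GB unified-memory machine once context and overhead are accounted for.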
@ymcki I didn't know about IQ3_M quants! Those are not available yet for Nemotron 253B right?
@BoshiAI Oh sorry, it's just that you can notice the difference quite fast when comparing those two. Maybe IQ2_M is comparable to Q3_K_M?
It's okay, I was only jesting! But you're right, sometimes 'lower' quants outperform quants that are slightly larger.
I'm going to start by downloading the Q4_K_M quant and using the imatrix file @DevQuasar has provided to generate IQ2_S and IQ2_M files.
If the imatrix file can be used on any size quant and there's an advantage in doing so, I might grab a Q8_0 or FP8 quant and use that with the imatrix to generate the IQ2_S and IQ2_M instead. I don't know whether it'd be worth generating a new imatrix from the Q8_0, whether a new imatrix is even needed to benefit from using the Q8_0 as a base for quanting down, or whether we're splitting hairs by that point.
If you're interested (and nobody has created any by then) I can generate an IQ3_S and IQ3_M as well afterwards, though I won't be able to test them before uploading. If my other quants work, they should be fine.
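For what it's worth, the re-quantization step described above would look roughly like this with llama.cpp's quantize tool. Filenames are placeholders, the imatrix filename is assumed, and --allow-requantize is only needed because the source GGUF is already quantized (an F16/BF16 source would give slightly better quality):

```bash
# Sketch: re-quantize a downloaded Q8_0 down to IQ2_M using the shared
# imatrix. --allow-requantize is required because the source is already
# quantized; quality will be a bit worse than quantizing from F16/BF16.
./llama-quantize \
  --allow-requantize \
  --imatrix nemotron-ultra-253b.imatrix \
  Llama-3_1-Nemotron-Ultra-253B-v1-Q8_0.gguf \
  Llama-3_1-Nemotron-Ultra-253B-v1-IQ2_M.gguf \
  IQ2_M 8
```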
Many thanks! An IQ3_M would probably be interesting, but for now I've downloaded the Q3_K_XL, which I'll try soon: https://huggingface.co/unsloth/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/tree/main/UD-Q3_K_XL
I want to compare it against the 3.6bpw exllamav3 quant as well.
Q3_K_M uploading now