Thank you!! (IQ quants)
I've been waiting so long to try the Nemotron Ultra model locally, watching patiently for GGUFs to go up, but I had a feeling that because of the size it might take forever (or be impossible) for people to make an imatrix file and then quant it to IQ formats. I was just about to ask around whether anyone was in a position to make them when I saw you'd already generated an imatrix and uploaded IQ1s. Thank you! You're a star.
Are you planning to upload some IQ2s? I'm hoping to be able to run an IQ2_S or possibly an IQ2_M on my humble maxed-out 96GB Mac Studio M2 Max.
I don't know if my system is up to quanting imatrix IQs with 96GB of unified RAM, but I'd be happy to put my resources to the task if it's capable. If you're using llama.cpp, how much memory are you finding you're using to quant these? Do you need enough to load the whole Q8?
- FYI: I was unable to produce the imatrix from the full f16-precision GGUF, so I used the Q4 quant, which might affect the quality. I've also called this out on the model card.
- For this I used my local server with ~100GB VRAM (GPU) and 128GB RAM; it took about 19 hours.
- AFAIK you don't have to be able to fit the full model in memory for imatrix generation: on my other machine (256GB RAM and 32GB VRAM) I'm generating the imatrix for Llama 4 Maverick. It will take forever (~35hrs) but it's getting there. I'm not sure what caused the imatrix generation to fail here on the .f16 GGUF; someone might be able to shine a light on that.
- IQ2_XXS is in the works.
- Once you have the imatrix I think you can quantize to any quant. You don't need to be able to fit the model in memory.
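For anyone wanting to reproduce this, imatrix generation with llama.cpp looks roughly like the sketch below; the filenames, the calibration text and -ngl are placeholders rather than the exact command used for this upload:

```bash
# Rough sketch of imatrix generation with llama.cpp. Filenames, calibration
# data and -ngl are placeholders; offload as many layers as your VRAM allows.
./llama-imatrix \
  -m Llama-3_1-Nemotron-Ultra-253B-v1-Q4_K_M.gguf \
  -f calibration-data.txt \
  -o nemotron-ultra-253b.imatrix \
  -ngl 40 -c 512
```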
Thanks for the quick and helpful reply!
I'm tempted to give it a whirl. I only have 96GB of unified RAM in total, but I can allocate almost all of it to the 'GPU'. I'm thinking of downloading a Q8_0 and trying to make various quants of it myself, but even that will take some time on my 80Mbps connection (I've been thinking about upgrading, but I'm on a "fixed price forever" plan, which is sweet, and I'd lose that if I upgraded...)
I think by the time I download all the files and complete my first quant, you'll likely have all the important quants done yourself anyway :D I can't imagine there are many people who will want to run a 253B quant above Q4_K_M! Only a few crazy diehards have 96GB+ of GPU memory in their home systems :D
I have 128GB VRAM + 192GB RAM (though slow RAM on a consumer motherboard and CPU, no quad channel), I'm downloading the Q4_K_M to see how it fares against Q3_K_M (from https://huggingface.co/nicoboss/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF). At Q3_K_M I can load 16K ctx with all layers on GPU with fa, ctk q8_0 and ctv q4_0. So Q4_K_M will be a good amount slower, but want to check the difference in quality.
Also, any chance of IQ2_K_XL?
Thanks for your work!
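For reference, the settings mentioned a couple of posts up (all layers offloaded, 16K context, flash attention, quantised KV cache) map to roughly the following llama.cpp invocation; the model filename is a guess and flag spellings can differ between llama.cpp versions:

```bash
# Roughly the Q3_K_M setup described above: full GPU offload, 16K context,
# flash attention, K cache at q8_0 and V cache at q4_0.
./llama-cli \
  -m Llama-3_1-Nemotron-Ultra-253B-v1-Q3_K_M.gguf \
  -ngl 999 -c 16384 \
  -fa -ctk q8_0 -ctv q4_0
```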
Okay Q4_K_M is noticeably better than Q3_K_M haha, but well, it is too slow to run on RAM + VRAM. I guess you need ~180GB VRAM or more to run it comfortably.
I don't know about an IQ2_K_XL quant type; do you mean IQ2_XS?
Oh, my bad, I wrote it wrong: it was UD-Q2_K_XL. It's one of the quants Unsloth does; I'm new to GGUF so I'm not sure whether those can be made generally. E.g. https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-Q2_K_XL
> Okay Q4_K_M is noticeably better than Q3_K_M haha, but well, it is too slow to run on RAM + VRAM. I guess you need ~180GB VRAM or more to run it comfortably.
Oh, please don't say that! I can't even run the Q3_K_M or Q2_K sizes haha. The ideal for me is the IQ2_M quant, which is smaller than Q2_K but of comparable quality. I can probably run IQ2_S a little more comfortably.
At first I thought you were referring to another technique that one of the other big quanters uses, I think it's Bartowski: the output and embedding layers are kept at full Q8_0 while the other layers are quantised normally. This gives a slightly larger file but apparently better outputs. I remember that person using an _XL suffix for quants where _L was already the largest size, and _XXL where _XL was the largest.
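If that is the technique in question, llama.cpp's quantize tool can do something similar via per-tensor type overrides. A minimal sketch, assuming placeholder filenames (exact flag support depends on the llama.cpp build):

```bash
# Keep the token-embedding and output tensors at Q8_0 while quantizing the
# rest of the model to Q3_K_M (a slightly larger file, reportedly better
# output quality). Filenames are placeholders.
./llama-quantize \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  Llama-3_1-Nemotron-Ultra-253B-v1-F16.gguf \
  Llama-3_1-Nemotron-Ultra-253B-v1-Q3_K_M-emb8.gguf \
  Q3_K_M 8
```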
> Okay Q4_K_M is noticeably better than Q3_K_M haha, but well, it is too slow to run on RAM + VRAM. I guess you need ~180GB VRAM or more to run it comfortably.
Try IQ3_M. It should be around 110GB. Performance is quite close to Q4_K_M.
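As a rough sanity check on those numbers: GGUF size is roughly parameter count times bits per weight divided by 8. A minimal sketch, assuming approximate average bpw values for each quant type:

```bash
# Rough GGUF size in (decimal) GB for a 253B model: params(B) * bpw / 8.
# bpw is given in tenths to keep this pure integer bash arithmetic,
# e.g. 27 = ~2.7 bpw; the bpw averages themselves are approximations.
echo "IQ2_M  (~2.7 bpw): ~$(( 253 * 27 / 80 )) GB"
echo "IQ3_M  (~3.7 bpw): ~$(( 253 * 37 / 80 )) GB"
echo "Q4_K_M (~4.9 bpw): ~$(( 253 * 49 / 80 )) GB"
```

That puts IQ2_M somewhere around 85 GB, which is roughly why it is the target size for a 96GB unified-memory machine once context and overhead are accounted for.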
@ymcki I didn't know about IQ3_M quants! Those are not available yet for Nemotron 253B right?
@BoshiAI Oh sorry, it's just that you can notice the difference quite fast when comparing those two. Maybe IQ2_M is comparable to Q3_K_M?
It's okay, I was only jesting! But you're right, sometimes 'lower' quants outperform quants that are slightly larger.
I'm going to start by downloading the Q4_K_M quant and using the imatrix file @DevQuasar has provided to generate IQ2_S and IQ2_M files.
If the imatrix file can be used on any size quant and there's an advantage in doing so, I might grab a Q8_0 or FP8 quant and use that with the imatrix to generate the IQ2_S and IQ2_M instead. I don't know whether it'd be worth generating a new imatrix from the Q8_0, whether a new imatrix is even needed to benefit from using the Q8_0 as a base for quanting down, or whether we're splitting hairs by that point.
If you're interested (and nobody has created any by then) I can generate an IQ3_S and IQ3_M as well afterwards, though I won't be able to test them before uploading. If my other quants work, they should be fine.
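For what it's worth, the re-quantization step described above would look roughly like this with llama.cpp's quantize tool. Filenames are placeholders, the imatrix filename is assumed, and --allow-requantize is only needed because the source GGUF is already quantized (an F16/BF16 source would give slightly better quality):

```bash
# Sketch: re-quantize a downloaded Q8_0 down to IQ2_M using the shared
# imatrix. --allow-requantize is required because the source is already
# quantized; quality will be a bit worse than quantizing from F16/BF16.
./llama-quantize \
  --allow-requantize \
  --imatrix nemotron-ultra-253b.imatrix \
  Llama-3_1-Nemotron-Ultra-253B-v1-Q8_0.gguf \
  Llama-3_1-Nemotron-Ultra-253B-v1-IQ2_M.gguf \
  IQ2_M 8
```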
Many thanks! An IQ3_M would probably be interesting, but for now I've downloaded the Q3_K_XL, which I'll try soon: https://huggingface.co/unsloth/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/tree/main/UD-Q3_K_XL
I want to compare it against the 3.6bpw exllamav3 quant as well.
Q3_K_M uploading now