GLM 4.5
It will be queued as soon as @mradermacher updates to the latest version of our llama.cpp fork, which I just updated. In the meantime I will manually prepare the GGUFs, as they are so large that manual handling makes sense.
They are all queued and on their way! :D
Some GLM-4.5-Air quants are already uploaded.
Due to the massive size of some of these models, it will take a few days for all quants to be done, especially because GLM 4.5 and GLM 4.5-Base, at 355B parameters each, will require RPC imatrix computation. As I'm a perfectionist, I want to compute the imatrix in full precision, which alone will probably take around 12 hours per model.
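For anyone curious what a distributed full-precision imatrix run looks like, here is a rough sketch. The file names and worker hostnames are placeholders, not our actual setup, and `--rpc` requires a llama.cpp build with the RPC backend plus `rpc-server` instances running on each worker:

```shell
# Sketch of a full-precision imatrix computation offloaded to RPC workers.
# Paths and hostnames below are placeholders.
./llama-imatrix \
  -m GLM-4.5-BF16.gguf \
  -f calibration-data.txt \
  -o GLM-4.5.imatrix \
  --rpc worker1:50052,worker2:50052
```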
You can check for progress at http://hf.tst.eu/status.html or regularly check the model summary page at the following locations for quants to appear:
@mradermacher If we don't want to skip low bits-per-weight quants for GLM 4.5, we need to set the following according to https://github.com/ggml-org/llama.cpp/pull/14939#issuecomment-3153670235:
Just FYI for anyone wanting to create i-quants: as the final layer will not get imatrix data until MTP is supported, it has to be overridden for lower quants to work, e.g. using `--tensor-type 46=iq4_xs` or `--tensor-type 92=iq4_xs`.
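Concretely, that override would be passed to `llama-quantize` along with the imatrix, roughly like this (file names are placeholders; the layer index is 46 or 92 depending on the model, per the linked comment):

```shell
# Sketch: force the final layer to iq4_xs, since it receives no imatrix
# data until MTP is supported and would otherwise break low-bit quants.
# File names below are placeholders.
./llama-quantize \
  --imatrix GLM-4.5.imatrix \
  --tensor-type 92=iq4_xs \
  GLM-4.5-BF16.gguf GLM-4.5-IQ2_XS.gguf IQ2_XS
```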
We will have to requant them anyway once Multi-Token Prediction (MTP) is implemented. I would be fine with skipping them if you don't want to change our quantization standard for these models.
Skipping the low-bit quants seems the right thing, indeed (or not providing any quants). It's nice to have the prospect that it will work some day.