llama.cpp fixes have just been merged

#5 · opened by Mushoz

It looks like a PR with the fixes for Llama.cpp has just been merged: https://github.com/ggml-org/llama.cpp/pull/13021

So working GGUFs should be possible to generate now with the latest master build :) FYI

Yup just waiting for a build release :) I like to keep it as official as possible 😂

Dude, stop teasing us lol. If you want binaries, try the ones from the fix test repo here:
https://github.com/piDack/llama.cpp/releases

This would work too, because only the converter has changed.
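
If anyone would rather regenerate the GGUF themselves instead of waiting for re-uploads, here is a rough sketch of the usual flow with the updated converter from current master (model path, output names and quant type below are just placeholders, not the exact commands used for these uploads):

# from a llama.cpp checkout at current master, with its Python requirements installed (pip install -r requirements.txt)
python convert_hf_to_gguf.py ./GLM-4-32B-0414 --outfile GLM-4-32B-0414-F16.gguf --outtype f16
# then quantize with the llama-quantize tool (built from the same tree, or taken from a release zip)
llama-quantize GLM-4-32B-0414-F16.gguf GLM-4-32B-0414-Q4_K_M.gguf Q4_K_M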

I bet this fix is still not enough for AMD cards using Vulkan, but we'll see. For now I can only use the CPU version of llama.cpp for this model, otherwise I get gibberish output.

Better to wait an extra couple hours to ensure it's not going to be another broken release

I've pulled in the release and am starting these quants up now, sorry for the delay!

It works on my AMD GPU. I tested the fixed Q8 quant of the smaller 9B model.

There's now an official build which should already include the fixes, since it's only about 33 minutes old at the time of posting this: https://github.com/ggml-org/llama.cpp/releases

good luck

Yeah I've already pulled that release and have started remaking the quants

Seeing the commits trickle in now. Some of the quants have already been updated, see: https://huggingface.co/bartowski/THUDM_GLM-4-32B-0414-GGUF/tree/main

As I expected, still not fixed for AMD, apparently... When the prompt is not too short, the output is something like this: "=arg观vat limp加盐 일 Hemp descending accessible质dots gehscal Nir Clinrone b pik sw{ geh viableInfinity 将其-section� guardingوعةords Vari".
This is with llama-b5173-bin-win-vulkan-x64, without even offloading layers to the GPU, just CPU.
So I guess I still have to use this model in CPU mode with llama-b5173-bin-win-avx2-x64.

@urtuuuu It's not an AMD-specific bug. Someone in the following issue tracker came up with a solution: https://github.com/ggml-org/llama.cpp/issues/12946

You need to set the physical and logical batch size to a low value. -ub 32 -b 32 seemed to work for him, and as you can see from my screenshot in that thread, it also fixed it for me.

llama-cli -m THUDM_GLM-4-32B-0414-Q3_K_M.gguf -c 8192 --temp 0.5 -cnv --color --multiline-input -b 32 -ub 32
Like this? I tried it and still get "ratio неот县人民政府 RHSCha一线 dabalieesarlak或少inus сут不限����的DUCT无语itech韬atorauses'anNES generating Fol XC维 持escap.........."

Update: OK, I just tried it with -b 16 -ub 16 and it works for now... :) I still sometimes get output like "GGGGGGGGGGGGGGGG" when offloading all 62 layers to the GPU. With 60/62 it seems to work again. I've never had this weirdness with any other model before.
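
For reference, combining the workarounds mentioned in this thread, the invocation that ended up giving coherent output on Vulkan looks roughly like this (the model file is the one from the post above; -b/-ub 16 and keeping two layers off the GPU with -ngl 60 are just the values that happened to work here, not official recommendations):

llama-cli -m THUDM_GLM-4-32B-0414-Q3_K_M.gguf -c 8192 --temp 0.5 -cnv --color --multiline-input -b 16 -ub 16 -ngl 60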

@bartowski what about the 9B version?

@kristaller486 oh weird, it looks like it got stuck on my end. Restarted it, it should go up shortly.

For whatever reason, I'm able to use this model on my AMD Radeon 8060S (Ryzen AI MAX+ 395) using Ollama / Ollama + Open WebUI with default settings. It's using ~100% of the GPU.
In LM Studio, I can load this model with all GPU layers using the Vulkan llama.cpp Windows runtime, but when answering a question the output is only 'GGGGGGGGGGGGGGGGGG'. However, switching to the CPU llama.cpp Windows runtime, the model works as intended, albeit at just over half the speed of Vulkan.
So whatever Ollama is doing under the hood to get this model to work on my AMD GPU, it's working.
Maybe it'll be the same for you.

@knarp there's still an ongoing discussion under my original issue, which probably has to do with some bug in the Vulkan backend. You can follow it here: https://github.com/ggml-org/llama.cpp/issues/12946

@bartowski your GGUF seems all good, but I'm seeing some weird behavior:
For a 32B Q4_K_M I can usually offload approx 16K tokens of context with 65-layer models (Qwen family). This one isn't from the Qwen family and only has 62 layers, and the GGUF is around 200 MB smaller than Qwen's ones.
BUT I can offload up to 70K tokens!
Same for the Z1 variant.

I'm running this through Ollama and deleted the GGUF before noticing it.
From what I've tested so far with a 23K-token research paper, I don't get an OOM and it still manages to recopy the abstract verbatim!

Can you or anyone reproduce this? Because that feels a bit unexpected :D

Yes, the KV cache here is very thin, so you can run huge contexts with very little RAM; it's much less demanding than Qwen or even Llama in terms of context memory.
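
(As a rough rule of thumb, not something stated in this thread: with an fp16 cache, the KV memory per token is about 2 bytes × 2 (K and V) × n_layers × n_kv_heads × head_dim, so a model with aggressive GQA, i.e. very few KV heads, needs far less memory per context token. If you want to check the relevant numbers yourself, the gguf Python package from llama.cpp's gguf-py includes a gguf-dump helper; the grep pattern below is only illustrative, and the exact key prefix depends on the architecture name stored in the file.)

pip install gguf
gguf-dump THUDM_GLM-4-32B-0414-Q4_K_M.gguf | grep -iE "block_count|head_count|key_length|value_length|embedding_length"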

OK, I'll try to better understand what the implications are, because right now it feels like magic! Confirmed the same with 67K tokens!
Thanks for your explanation @ilintar

@ilintar Are those the numbers to look at?

GLM-4: [screenshot from 2025-04-25 01-44-55]

QwQ: [screenshot from 2025-04-25 01-45-09]

I'm sorry, I have zero knowledge here.
