llama.cpp fixes have just been merged

#5 · opened by Mushoz

It looks like a PR with the fixes for Llama.cpp has just been merged: https://github.com/ggml-org/llama.cpp/pull/13021

So working GGUFs should be possible to generate now with the latest master build :) FYI

Yup just waiting for a build release :) I like to keep it as official as possible 😂

Dude, stop teasing us lol. If you want binaries, try the ones from the fix test repo here:
https://github.com/piDack/llama.cpp/releases

This would work too, because only the converter has changed.
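
If anyone would rather regenerate the GGUF themselves instead of waiting for re-uploads, here is a rough sketch of the usual flow with the updated converter from current master (model path, output names and quant type below are just placeholders, not the exact commands used for these uploads):

# from a llama.cpp checkout at current master, with its Python requirements installed (pip install -r requirements.txt)
python convert_hf_to_gguf.py ./GLM-4-32B-0414 --outfile GLM-4-32B-0414-F16.gguf --outtype f16
# then quantize with the llama-quantize tool (built from the same tree, or taken from a release zip)
llama-quantize GLM-4-32B-0414-F16.gguf GLM-4-32B-0414-Q4_K_M.gguf Q4_K_M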

I bet this fix is still not enough for AMD cards using Vulkan, but we'll see. For now I can only use the CPU version of llama.cpp for this model, otherwise I get gibberish output.

Better to wait an extra couple hours to ensure it's not going to be another broken release

I've pulled in the release and am starting these quants up now, sorry for the delay!

It works on my AMD GPU. I tested the fixed Q8 quant of the smaller 9B model.

There's now an official build which should already include the fixes, since it's only about 33 minutes old at the time of posting this: https://github.com/ggml-org/llama.cpp/releases

good luck

Yeah I've already pulled that release and have started remaking the quants

Seeing the commits trickle in now. Some of the quants have already been updated, see: https://huggingface.co/bartowski/THUDM_GLM-4-32B-0414-GGUF/tree/main

As I expected, still not fixed for AMD, apparently... When the prompt is not too short, the output is something like this: "=arg观vat limp加盐 일 Hemp descending accessible质dots gehscal Nir Clinrone b pik sw{ geh viableInfinity 将其-section� guardingوعةords Vari".
This is with llama-b5173-bin-win-vulkan-x64, without even offloading layers to the GPU, just CPU.
So I guess I still have to use this model in CPU mode with llama-b5173-bin-win-avx2-x64.

@urtuuuu It's not an AMD-specific bug. Someone in the following issue tracker came up with a solution: https://github.com/ggml-org/llama.cpp/issues/12946

You need to set the physical and logical batch size to a low value. -ub 32 -b 32 seemed to work for him, and as you can see from my screenshot in that thread, it also fixed it for me.

llama-cli -m THUDM_GLM-4-32B-0414-Q3_K_M.gguf -c 8192 --temp 0.5 -cnv --color --multiline-input -b 32 -ub 32
Like this? I tried it and still get "ratio неот县人民政府 RHSCha一线 dabalieesarlak或少inus сут不限����的DUCT无语itech韬atorauses'anNES generating Fol XC维 持escap.........."

Update: OK, I just tried it with -b 16 -ub 16 and it works for now... :) I still sometimes get output like "GGGGGGGGGGGGGGGG" when offloading all 62 layers to the GPU. With 60/62 it seems to work again. I've never had this weirdness with any other model before.
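
For reference, combining the workarounds mentioned in this thread, the invocation that ended up giving coherent output on Vulkan looks roughly like this (the model file is the one from the post above; -b/-ub 16 and keeping two layers off the GPU with -ngl 60 are just the values that happened to work here, not official recommendations):

llama-cli -m THUDM_GLM-4-32B-0414-Q3_K_M.gguf -c 8192 --temp 0.5 -cnv --color --multiline-input -b 16 -ub 16 -ngl 60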

@bartowski what about the 9B version?

@kristaller486 oh weird, it looks like it got stuck on my end. Restarted it, it should go up shortly.

For whatever reason, I'm able to use this model on my AMD Radeon 8060S (Ryzen AI MAX+ 395) using Ollama / Ollama + Open WebUI with default settings. It's using ~100% of the GPU.
In LM Studio, I can load this model with all GPU layers using the Vulkan llama.cpp Windows runtime, but when answering a question the output is only 'GGGGGGGGGGGGGGGGGG'. However, switching to the CPU llama.cpp Windows runtime, the model works as intended, albeit at just over half the speed of Vulkan.
So whatever Ollama is doing under the hood to get this model to work on my AMD GPU, it's working.
Maybe it'll be the same for you.

@knarp there's still an ongoing discussion under my original issue, which probably has to do with some bug in the Vulkan backend. You can follow it here: https://github.com/ggml-org/llama.cpp/issues/12946

@bartowski your GGUF seems all good, but I'm seeing some weird behavior:
For a 32B Q4_K_M I can usually offload approx 16K tokens of context with 65-layer models (Qwen family). This one isn't from the Qwen family and only has 62 layers, and the GGUF is around 200 MB smaller than Qwen's ones.
BUT I can offload up to 70K tokens!
Same for the Z1 variant.

I'm running this through Ollama and deleted the GGUF before noticing it.
From what I've tested so far with a 23K-token research paper, I don't get an OOM and it still manages to recopy the abstract verbatim!

Can you or anyone reproduce this? Because that feels a bit unexpected :D

Yes, the KV cache here is very thin, so you can run huge contexts with very little RAM; it's much less demanding than Qwen or even Llama in terms of context memory.
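
(As a rough rule of thumb, not something stated in this thread: with an fp16 cache, the KV memory per token is about 2 bytes × 2 (K and V) × n_layers × n_kv_heads × head_dim, so a model with aggressive GQA, i.e. very few KV heads, needs far less memory per context token. If you want to check the relevant numbers yourself, the gguf Python package from llama.cpp's gguf-py includes a gguf-dump helper; the grep pattern below is only illustrative, and the exact key prefix depends on the architecture name stored in the file.)

pip install gguf
gguf-dump THUDM_GLM-4-32B-0414-Q4_K_M.gguf | grep -iE "block_count|head_count|key_length|value_length|embedding_length"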

OK, I'll try to better understand what the implications are, because right now it feels like magic! Confirmed the same with 67K tokens!
Thanks for your explanation @ilintar

@ilintar Are those the numbers to look at?

GLM-4: [screenshot from 2025-04-25 01-44-55]

QwQ: [screenshot from 2025-04-25 01-45-09]

I'm sorry, I have zero knowledge here.
