Works like a charm on ik_llama.cpp server with PR 668
Great to hear! Glad folks can test an early version. I'm still gonna wait a bit for the dust to settle, as both the mainline PR and the one on ik are pretty busy. I hope they end up fairly equivalent in terms of tensor names and those MTP / NextN tensors too.
Then I will feel more confident doing a more general release of quants and sizes, as this model seems fairly nice for its size (and has first dense ffn/shexp layers too, which helps speed on hybrid CPU+GPU by keeping a larger proportion of active weights in VRAM).
Appreciate the testing and report on the PR!
To follow up: 7 experts work as well as 8, with a PPL about 0.005 lower than with 8 experts.
Your IQ4_KSS quant with 8 experts stands at a wikitext-512 PPL of 4.7419 +/- 0.02908.
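For reference, a minimal sketch of how a 7-vs-8-expert comparison like this could be reproduced with the perplexity tool; the file names and the `glm4moe.` prefix for the `expert_used_count` key are my assumptions, not something stated in this thread:
# Placeholder paths; drop the --override-kv line to run with the default 8 experts.
llama-perplexity \
    -m GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512 \
    -ngl 99 -fa \
    --override-kv glm4moe.expert_used_count=int:7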
In full offload, the model is quite fast, with an L3 70b feel (8 tokens/s TG at 15k context on my 2x3090 + RTX A4000 undervolted rig with low PCIe bandwidth).
As for quality, my first impression is that it tops Mistral 123b, Command A 111b, and Llama 3.3 70b / Llama 4 Scout (and the recent Cogito finetunes of both). The benchmarks might be pumped up, but it's far from being without merit.
I've started quantizing on my side with the help of your imatrix, to see if I can get something even better for my rig. But that's being picky; your quant is already neat.
Great to hear! I ran perplexity on Air for this IQ4_KSS and a Q8_0 baseline here: https://github.com/ggml-org/llama.cpp/pull/14939#issuecomment-3146551088
Also, you might need to use this for now too: `--override-kv tokenizer.ggml.eot_token_id=int:151336`
Still need some longer-context testing too, pretty sure.
Still, very promising! Just watch out for all the `ffn_down.*` tensors on Air, as they are annoying sizes and so have limited quantization options.
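For anyone rolling their own quants, a rough mainline-style sketch of handling those tensors explicitly; the file names, the q8_0 choice, and the `--tensor-type` override (only in newer `llama-quantize` builds, as far as I know) are assumptions on my part, and the tool will usually pick a compatible fallback on its own anyway:
# ffn_down rows on Air aren't a multiple of 256, so the 256-block k-/i-quants
# can't be used there; pinning them to a block-32 type like q8_0 keeps it explicit.
llama-quantize \
    --imatrix glm-4.5-air.imatrix \
    --tensor-type ffn_down=q8_0 \
    GLM-4.5-Air-BF16.gguf \
    GLM-4.5-Air-Q4_K_M.gguf \
    q4_k_m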
Unrelated: I started on that Cogito finetune of DeepSeek, but it had some odd config.json lm_head settings or something and I can't get it to make an imatrix, though I can quantize it using my old V3-0324 imatrix. Not sure what to do with it or if it's really worth the trouble at this point.
Yeah, I can confirm great results with the current IQ4_KSS.
It just generated what I can only call the most polished and fluid dynamic HTML landing page any local model I've run has produced.
Also good speeds! Around 50 tok/s on 2x3090 + 1x4090 with full GPU offload. This is only the Air variant too... I'm excited to try its bigger brother.
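If anyone wants to sanity-check a throughput number like that outside their client, a quick `llama-bench` run is one option; the model filename and split below are placeholders, not taken from this thread:
# Placeholder model path and an even 3-way split; -p/-n are the prompt and generation lengths.
llama-bench \
    -m GLM-4.5-Air.gguf \
    -ngl 99 \
    -p 512 -n 128 \
    -ts 1,1,1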
I'm running the IQ5_K version with these args -
llama-server \
-m /home/hisma/llama.cpp/models/zai-org_GLM-4.5-Air-IQ5_K/IQ5_K/GLM-4.5-Air-IQ5_K-00001-of-00002.gguf \
--alias ubergarm/GLM-4.5-Air-IQ5_K \
--chat-template chatglm4 \
--ctx-size 65536 \
-ub 4096 -b 4096 \
--split-mode layer \
--tensor-split 1,1,1,1,1,1 \
-ngl 99 \
-fa \
--host 0.0.0.0 \
--port 8888 \
--no-mmap
I have 6x3090s, so all GPU. It just thinks endlessly in loops using llama-server.
Any ideas?
edit: it was the prompt I used asking it to "fix any errors" - seems this model is very sensitive to prompting.
Okay, I've released a couple more sizes for both Air and the bigger GLM-4.5, and published my perplexity data and graphs. I have one request to fill in the gap for a middle-sized model and will try to get to that one today.
> Any ideas?
- You shouldn't need to explicitly specify `--chat-template chatglm4` anymore, but it doesn't hurt anything.
- With ik_llama.cpp you can use `-fmoe` for fused MoE, which gives a speed boost as two computations are run as one.
- When running full VRAM/GPU offload, go with a single thread `-t 1`, and there's no need for `--no-mmap` as it's all in CUDA buffers.
- If you want higher aggregate throughput and can run concurrent batches of prompts, you could go with `-c 262144 -ctk q8_0 -ctv q8_0 --parallel 4`, for example, for 4 slots each with 64k context.
- I never specify `--split-mode`, and I believe layer is the default, so there's no need to specify it.
Otherwise looks pretty good.
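To make that concrete, here is roughly the earlier command with those suggestions folded in; the model path, alias, and 6-way split are carried over from above, and this is a sketch rather than a recipe:
# Quantized KV cache + 4 parallel slots (64k context each), single thread, fused MoE.
llama-server \
    -m /home/hisma/llama.cpp/models/zai-org_GLM-4.5-Air-IQ5_K/IQ5_K/GLM-4.5-Air-IQ5_K-00001-of-00002.gguf \
    --alias ubergarm/GLM-4.5-Air-IQ5_K \
    -c 262144 -ctk q8_0 -ctv q8_0 --parallel 4 \
    -ub 4096 -b 4096 \
    -ngl 99 -fa -fmoe \
    -t 1 \
    --tensor-split 1,1,1,1,1,1 \
    --host 0.0.0.0 --port 8888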
If you want some speed benchmarks, you could replace `llama-server` with `llama-sweep-bench`, drop the context down to, say, `-c 20480`, add `--warmup-batch`, and share your fully offloaded PP/TG speeds.
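In other words, something along these lines (same model and offload flags as the sketch above, just a different binary, a smaller context, and the warmup flag):
# Sweeps PP/TG speeds across the context window instead of serving requests.
llama-sweep-bench \
    -m /home/hisma/llama.cpp/models/zai-org_GLM-4.5-Air-IQ5_K/IQ5_K/GLM-4.5-Air-IQ5_K-00001-of-00002.gguf \
    -c 20480 -ub 4096 -b 4096 \
    -ngl 99 -fa -fmoe -t 1 \
    --tensor-split 1,1,1,1,1,1 \
    --warmup-batch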
Wow! Your suggested tweaks improved speed AND the quality of the code.
I settled on these settings -
llama-server \
-m /home/hisma/llama.cpp/models/zai-org_GLM-4.5-Air-IQ5_K/IQ5_K/GLM-4.5-Air-IQ5_K-00001-of-00002.gguf \
--alias ubergarm/GLM-4.5-Air-IQ5_K \
-ub 4096 \
-b 2048 \
-c 131072 \
-ctk q8_0 \
-ctv q8_0 \
--parallel 2 \
-t 1 \
--tensor-split 1,1,1,1,1,1 \
-ngl 99 \
-fa \
-fmoe \
--temp 0.6 \
--top-p 1.0 \
--host 0.0.0.0 \
--port 8888
With these settings I start at 45 t/s (5 t/s faster than my previous settings), but interestingly, the model is noticeably more accurate too. With my previous settings, I could not one-shot my flappy-bird prompt test. With these updated settings, it not only one-shotted the prompt, it added extra "details" that went beyond what I asked for.
Prompt -
Create a Flappy Bird game in Python. You must include these things:
You must use pygame.
The background color should be randomly chosen and is a light shade. Start with a light blue color.
Pressing SPACE multiple times will accelerate the bird.
The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
Place on the bottom some land colored as dark brown or yellow chosen randomly.
Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python
Result -
Qwen 235B-Instruct could not even pull this off. Across multiple runs it'd always take 2 shots to nail every requirement, and I'd never get those extra details on the pipes or the ground.
Before the tweaks you suggested, this model was behaving in a similar manner to Qwen. So perhaps the way I was running my models was nerfing them. I'll need to re-test Qwen with similar settings and A/B them to see if my settings were degrading its quality.
But regardless, this seems to be an extremely capable model! So thank you!
OK, while it's good at few-shot jobs, it's not so great at long multi-turn sessions, which was my fear. Intelligence drops quickly over longer context.
So GLM 4.5 Air is not so good as a replacement for Claude Sonnet for agentic coding. Depending on your needs it's still a good model, but I'm going to see if full GLM 4.5 holds up better, or failing that, Qwen 235B.
Check out this discussion for some possible thoughts on longer context usage: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/discussions/4#68960ee29e427b197ddde54c
Thanks, good info. I actually switched to bartowski's quant and I'm using vanilla llama.cpp, which allows me to pass the `--jinja` parameter. To use this model effectively beyond one- or few-shot use cases you MUST get it to load the chat template. Unfortunately, and I don't know why, llama.cpp is not capable of detecting the chat template without passing that flag, even if it is technically supposed to.
With the `--jinja` flag I get much better results for agentic coding. It knows how/when to use tools and follows instructions much more closely, even over longer context. I usually try not to go past 65k, but with these coding agents you can't get much useful work done with less than 65k context.
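For reference, a minimal vanilla llama.cpp invocation along those lines; the model filename is a placeholder for whichever GGUF you grabbed, and the other flags are just the obvious ones from earlier in the thread:
# --jinja renders the Jinja chat template shipped in the GGUF instead of
# llama.cpp's built-in template matching; placeholder model path.
llama-server \
    -m GLM-4.5-Air-Q5_K_M.gguf \
    --jinja \
    -c 65536 -ngl 99 -fa \
    --host 0.0.0.0 --port 8888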
I saw your note about ik_llama.cpp now supporting the `--jinja` flag, which is great if it works. I can then try switching back to your quant and see if it performs better.
Yeah, it probably depends on whether you're using the text `/completions/` endpoint or the `/chat/completions/` endpoint with your client. There was a bug I fixed on ik's fork where the chat endpoint was originally using the wrong template and causing issues.
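If it helps anyone debug, a quick way to confirm the chat endpoint (and therefore the server-side template) is actually being hit, assuming the OpenAI-compatible route and the port from the commands above:
# The chat endpoint applies the model's chat template server-side; the plain
# completion endpoint treats the prompt as raw text.
curl http://localhost:8888/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'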
> I saw your note about ik_llama.cpp now supporting the `--jinja` flag, which is great if it works. I can then try switching back to your quant and see if it performs better.
Yes, please give that PR a try and leave a report on GH if you can! Thanks!