this looks really good on llama.cpp

#1 opened by gopi87

@ubergarm can you check this quant too? It's doing very well, tbh.

Intel org

Follow https://github.com/intel/auto-round to get the latest release notifications, and star the project if you find the tool helpful.

Some folks are asking me to compare Intel auto-round in the ~2 bit-per-weight range against ik's SOTA quant types, e.g. IQ2_KT (trellis/QTIP/EXL3-ish style quantization). I might run some perplexity numbers if I have a chance, though there were some reports of the Intel quants throwing NaNs during perplexity runs, suggesting possible numerical instability. I haven't done any research myself yet, though.

https://github.com/ikawrakow/ik_llama.cpp/discussions/657
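
For a rough idea of the kind of check involved, here is a minimal sliding-window perplexity loop in Python with transformers that also flags NaN logits. This is only a sketch of the general idea, not the llama-perplexity tool actually used for the GGUF numbers in this thread; the model name and evaluation file path are just examples.

```python
# Minimal sketch: sliding-window perplexity with a NaN check.
# Not the llama-perplexity tool used for the GGUF numbers here; just a
# transformers-based illustration of the general idea.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"   # example model from this thread
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

text = open("wiki.test.raw").read()              # assumed evaluation text file
ids = tok(text, return_tensors="pt").input_ids[0]

ctx, nll, count = 512, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.numel() - 1, ctx):
        chunk = ids[start : start + ctx + 1].unsqueeze(0)
        if chunk.numel() < 2:
            break
        logits = model(chunk[:, :-1]).logits
        if torch.isnan(logits).any():            # the instability reported above
            print(f"NaN logits in window starting at token {start}")
            continue
        logp = torch.log_softmax(logits.float(), dim=-1)
        tgt = chunk[:, 1:]
        nll -= logp.gather(-1, tgt.unsqueeze(-1)).sum().item()
        count += tgt.numel()

print("perplexity:", math.exp(nll / count))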

Also, regarding the new sglang AMX multi-NUMA work: it has definitely improved support for using system memory bandwidth across two sockets for a single generation. However, it might be slightly slower than the existing ik_llama.cpp implementation in terms of aggregate throughput (two generations), and it requires int8 dtype quants to use the AMX extensions, whereas with ik_llama.cpp you can use the SOTA quants for improved throughput at whatever quality level you like:

https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826/422

2-bit is not working very well with ik_llama.cpp, but it's working fine with llama.cpp. I would say it's slightly better than the UD quant, IMO.

@gopi87

I asked you over here instead: https://github.com/ikawrakow/ik_llama.cpp/discussions/657#discussioncomment-13934981

So far, ik_llama.cpp SOTA quants seem to be doing better in my own benchmarking. I use the same methodology and hardware for all tests to keep the results consistent relative to each other:

[figure: ppl-Qwen3-30B-A3B-Instruct-2507.png, perplexity comparison across Qwen3-30B-A3B-Instruct-2507 quants]
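
The kind of comparison in the plot can be regenerated from a table of results. Here is a small matplotlib sketch assuming a hypothetical results.csv with quant,bpw,ppl columns; the file name and layout are assumptions, not the actual data behind the figure.

```python
# Sketch: perplexity vs. bits-per-weight scatter plot from a results table.
# "results.csv" with columns quant,bpw,ppl is a hypothetical layout,
# not the actual data behind the figure above.
import csv
import matplotlib.pyplot as plt

quants, bpw, ppl = [], [], []
with open("results.csv") as f:
    for row in csv.DictReader(f):
        quants.append(row["quant"])
        bpw.append(float(row["bpw"]))
        ppl.append(float(row["ppl"]))

fig, ax = plt.subplots()
ax.scatter(bpw, ppl)
for name, x, y in zip(quants, bpw, ppl):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
ax.set_xlabel("bits per weight")
ax.set_ylabel("perplexity")
ax.set_title("Qwen3-30B-A3B-Instruct-2507")
fig.savefig("ppl-Qwen3-30B-A3B-Instruct-2507.png", dpi=150)
```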

Yeah, I agree; I'm wondering whether a q4_k_m auto-round quant would be much better.
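
For reference, producing a 4-bit auto-round quant looks roughly like this. This is a minimal sketch based on the auto-round README; the exact API surface and export formats vary between releases, so check the project docs before relying on it.

```python
# Minimal sketch of quantizing a model with Intel auto-round (pip install auto-round).
# API details vary between releases; see the project README for the current interface.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"   # example model from this thread
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights with group size 128, in the spirit of the q4_k_m question above.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# "auto_round" export is the safe default; recent releases also advertise GGUF export.
autoround.save_quantized("./Qwen3-30B-A3B-Instruct-2507-int4", format="auto_round")
```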

Take it with a grain of salt, as the imatrix was run on two different rigs (mine is CPU-only and I believe theirs has CUDA+CPU). They also used q8_0 kv-cache for the unsloth quants, which would lower it a slight bit. That said, they benchmarked mine a bit lower than I show in my own numbers in the plot:

[figure: ppl-Qwen3-30B-A3B-Instruct-2507-unsloth.png, perplexity comparison including the unsloth quants]

There is some discussion on the matter here by both authors: https://github.com/ikawrakow/ik_llama.cpp/discussions/657#discussioncomment-13947079

In general, the newer quants tend to achieve a bit better perplexity, but with some possible trade-offs in things like Vulkan backend support versus ik's older mainline and legacy quants. It's good to pick the right quant for the right backend, and Vulkan has shown some surprises, with faster TG than CUDA in a recent case shown here: https://github.com/ikawrakow/ik_llama.cpp/discussions/657#discussioncomment-13947079

Download them all and try them out for speed and quality on your own rig! lol
