EXL3 - please, Sir, or anyone else
#2 opened by sphiratrioth666
No description provided.
sphiratrioth666 changed pull request status to open
? What?
I recommend llama.cpp/GGUF for this, as EXL3 is much slower on MoE models. It does use less VRAM, but a Q5~Q6 GGUF is basically lossless anyway.
If you do want one, you can just create it yourself; it's relatively fast and easy. I forget how much VRAM it needs, though it should be less than 16 GB.
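If it helps, the usual llama.cpp flow is roughly the sketch below. The paths, the Q5_K_M target, and running the scripts through subprocess are my own placeholders, so adjust them to your setup.

```python
# Rough sketch: HF model -> Q5_K_M GGUF with llama.cpp's own scripts.
# "path/to/hf-model" and the output file names are placeholders.
import subprocess

model_dir = "path/to/hf-model"      # local download of the original weights
f16_gguf = "model-f16.gguf"         # intermediate unquantized GGUF
out_gguf = "model-Q5_K_M.gguf"      # final quant in the "basically lossless" range

# 1) Convert the HF safetensors into a GGUF file (run from the llama.cpp repo).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# 2) Quantize it down to Q5_K_M with the llama-quantize binary.
subprocess.run(
    ["./llama-quantize", f16_gguf, out_gguf, "Q5_K_M"],
    check=True,
)
```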
EXL2/EXL3 is always faster than GGUF on my RTX 5090, and it used to be the same on my 4090 and 4080. With the RTX 5000 series the gap is much bigger, but it's still faster for every single model I run, including the base Qwen this one stands on :-P
With a 5090 you can easily create the quant super fast yourself, tried it?
Nope. I used to always download them. I can try doing it, I guess...
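For reference, making the EXL3 quant locally should just be one run of exllamav3's convert.py. A rough sketch is below; the flag names are written from memory and the paths are placeholders, so double-check everything against the exllamav3 README first.

```python
# Rough sketch: building an EXL3 quant with exllamav3's convert.py.
# Flag names (-i, -o, -b) are my recollection of the script's interface and
# may differ in the current repo; verify against the exllamav3 README.
import subprocess

src = "path/to/hf-model"            # original HF weights (safetensors)
dst = "path/to/model-exl3-5.0bpw"   # where the quantized model should land

subprocess.run(
    ["python", "convert.py",        # run from the exllamav3 repo
     "-i", src,
     "-o", dst,
     "-b", "5.0"],                  # target bits per weight
    check=True,
)
```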