EXL3 - please, Sir, or anyone else

#2
No description provided.
sphiratrioth666 changed pull request status to open

? What?
I recommend llama.cpp/GGUF for this: EXL3 is much, much slower on MoE models (though it does use less VRAM), and Q5~Q6 is basically lossless anyway.
If you do want EXL3, you can create the quant yourself; it's relatively fast and easy. I forget how much VRAM it needs, but it should be less than 16 GB.
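For reference, quant creation goes through ExLlamaV3's convert.py script. Below is a minimal sketch of driving that script from Python; the flag names (-i, -o, -b), the paths, and the 4.0 bpw target are assumptions here, so check the ExLlamaV3 README for the exact arguments before running it.

```python
# Minimal sketch of kicking off an EXL3 quantization run from Python.
# Assumptions: ExLlamaV3 is cloned locally and its convert.py accepts
# -i (input HF model dir), -o (output dir) and -b (target bits per weight).
# Verify the exact flags against the ExLlamaV3 README before running.
import subprocess
from pathlib import Path

MODEL_DIR = Path("models/base-model")        # unquantized HF checkpoint (hypothetical path)
OUTPUT_DIR = Path("models/base-model-exl3")  # where the EXL3 quant would be written
BITS = "4.0"                                 # target bits per weight (assumed value)

cmd = [
    "python", "exllamav3/convert.py",
    "-i", str(MODEL_DIR),
    "-o", str(OUTPUT_DIR),
    "-b", BITS,
]
print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)  # raises CalledProcessError if the conversion fails
```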

EXL2/3 is always faster than GGUF on my RTX 5090, and it used to be the same on my 4090 and 4080. With the RTX 5000 series the difference is much bigger, but it's still faster for every single model I run, including the base Qwen this stands on :-P

With a 5090 you can easily create the quant yourself, and quickly. Have you tried it?

Nope. I used to always download them. I can try doing it, I guess...
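If downloading a prebuilt quant stays the preferred route, huggingface_hub can fetch one directly. A short sketch; the repo_id below is a placeholder, not an actual published EXL3 quant.

```python
# Download a prebuilt EXL3 quant from the Hugging Face Hub.
# The repo_id below is a placeholder; substitute a real EXL3 repo.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="someuser/some-model-exl3-4bpw",  # hypothetical repo id
    local_dir="models/some-model-exl3-4bpw",  # where to store the files
)
print("Quant downloaded to:", local_path)
```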

