Please share feedback here!
If you’ve tested any of the initial GGUFs, we’d really appreciate your feedback! Let us know if you encountered any issues, what went wrong, or how things could be improved. Also, feel free to share your inference speed results!
Is it working for you?
Q8_0, Llama.cpp:
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, 1
llama_model_load_from_file_impl: failed to load model
Could you try updating llama.cpp to the latest version?
Yes, resolved, thank you!
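For anyone else hitting the same shape error on an older build: below is a minimal sketch of updating a source checkout of llama.cpp, scripted in Python for convenience. The checkout path and cmake flags are assumptions; adjust them for your setup (e.g. add -DGGML_CUDA=ON if you build with CUDA).

```python
# Minimal sketch: pull the latest llama.cpp and rebuild it.
# LLAMA_CPP_DIR is a placeholder; point it at your own checkout.
import subprocess
from pathlib import Path

LLAMA_CPP_DIR = Path.home() / "llama.cpp"  # hypothetical checkout location

def run(cmd):
    """Run a command inside the llama.cpp checkout and fail loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=LLAMA_CPP_DIR, check=True)

run(["git", "pull"])                          # fetch the latest commits
run(["cmake", "-B", "build"])                 # (re)configure the build
run(["cmake", "--build", "build", "--config", "Release", "-j", "8"])  # rebuild
```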
The system prompt is added between the BOS token and the user role token, right? It seems to work really well!
I suggest you state where the system prompt should be inserted in the prompt template, so that it's clear for text-completion users / users not going through something with an AutoTokenizer.
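For text-completion users, one quick way to check is to render the chat template yourself and look at where the system prompt lands; a minimal sketch, assuming a transformers-compatible tokenizer is available (the model id below is just a placeholder):

```python
# Render the chat template and inspect where the system prompt sits.
# The model id is a placeholder; use the tokenizer that matches your GGUF.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")  # placeholder

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# tokenize=False returns the raw prompt string, so you can see exactly where
# the system prompt ends up relative to the BOS and user role tokens.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```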
I've tested the UD-Q3_K_XL in llama.cpp (Ubuntu), and it works great. I'm testing with a context size of around 14000.
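If anyone wants to script a similar test, here's a rough sketch using llama-cpp-python; the model path and layer count are placeholders, adjust for your hardware:

```python
# Rough sketch: load a quant with a ~14k context via llama-cpp-python.
# model_path and n_gpu_layers are placeholders for your own setup.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-UD-Q3_K_XL.gguf",  # placeholder file name
    n_ctx=14000,       # roughly the context size mentioned above
    n_gpu_layers=40,   # offload as many layers as fit on your GPU(s)
)

out = llm("Explain what a K-quant is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```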
Add a Q1 quant (i.e. 1-bit) as well.
Yo, DeepSeek-V2-Lite 16B needs to be GGUF'ed!
I meant yo.
The Q1 quants are uploading.
They're up now!
Ran the original Unsloth DeepSeek R1 quant on 2x 3090s with 128 GB of RAM and didn't get much in terms of speed, around 2-3 tokens/s. Interested to see how the new Unsloth Dynamic 2.0 GGUFs stack up with their smarter layer-wise quantization.
If you're not on the ik_llama.cpp fork, you're missing out.
Why are these sizes substantially larger than the previous ones? For example, the original UD-Q3_K_XL vs this one: 273 GB vs 350 GB.