Awesome model
Now that Reflection turned out to be a cash grab, THIS model should be the real star of the moment. Very good model all around, IMO.
Please, could you give some insight into why you feel this model is so good? I just assumed this was a 'unification' model that combined the strengths of V2-Chat and Coder-V2-Instruct without really changing anything. Some explanation and examples would be appreciated.
tl;dr
It runs inference quite fast even with most of the model offloaded to CPU RAM.
Details
I skipped over V2-Chat and Coder-V2-Instruct and just tried this V2.5 on llama.cpp with the bartowski/DeepSeek-V2.5-GGUF IQ3_XXS quant on my R9 9950X w/ 96GiB RAM and 1x 3090 Ti FE w/ 24GiB VRAM.
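For reference, I drove it through the llama.cpp CLI, but with the llama-cpp-python bindings a minimal partial-offload setup would look roughly like this (the bindings, file name, and layer count are just my illustrative assumptions; tune n_gpu_layers to whatever fits in 24GiB of VRAM):

```python
# Sketch with the llama-cpp-python bindings (I used the llama.cpp CLI itself; the
# file name and layer count here are illustrative placeholders, not my exact setup).
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-V2.5-IQ3_XXS-00001-of-00003.gguf",  # hypothetical path to the first GGUF shard
    n_ctx=1024,        # small context to avoid OOM on 96GiB RAM + 24GiB VRAM
    n_gpu_layers=12,   # partial offload: most layers stay in CPU RAM
    flash_attn=False,  # llama.cpp forces this off for DeepSeek-V2 anyway
    n_threads=16,      # 9950X has 16 cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```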
llama.cpp support doesn't seem complete yet, though; at minimum the chat template isn't handled, and it can still crash:
1. The chat template that comes with this model is not yet supported, falling back to chatml.
2. Deepseek2 does not support K-shift
3. flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
All that said, IT IS FAST! I'm getting 6-7 tok/sec, compared to Mistral-Large which gets barely 1-2 tok/sec with similar RAM usage. To be fair, I only have 1k context right now since it OOMs easily on my limited hardware, and I haven't fiddled much with KV cache quants given the flash_attn issue in 3 above.
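If you want to check the tok/sec number yourself, here's a rough way with the same (assumed) bindings, reusing the llm object from the sketch above and counting streamed chunks as tokens:

```python
# Rough throughput check: stream a generation and divide tokens by wall time.
# Counting streamed chunks as tokens is an approximation.
import time

start = time.time()
n_tokens = 0
for _chunk in llm.create_completion("Explain MoE routing in two sentences.",
                                    max_tokens=200, stream=True):
    n_tokens += 1
elapsed = time.time() - start
print(f"{n_tokens / elapsed:.1f} tok/sec")
```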
I wanted to try a different inference engine that supports offloading, but couldn't get ktransformers to build (might need an older Python, 3.11 or maybe 3.10)...
Anyone else having luck with it? What local hardware / inference engine are you using?
I tried ktransformers but couldn't get it to work on Windows. Good to know about the DeepSeek t/s speed; I might have to give it a try with my 3090 and 96GB RAM. Do you know how to solve the K-shift error? I'm using oobabooga and no matter what I try it still crashes.
@sm54 thanks for your report. Yeah, ktransformers is a bit tricky to get running, likely due to Python wheel stuff (I'm trying on Linux).
The relevant bit from the Deepseek2 K-shift GitHub issue linked above is:
num_predict must be less than or equal to num_ctx / process count.
I'm not 100% sure, but for now I'm just limiting my n-predict to a little bit lower than the ctx-size. However, it still crashes on longer generations or larger prompts...
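In case it helps, here's the same workaround sketched with the (assumed) llama-cpp-python bindings; the path and the 16-token headroom are placeholders:

```python
# Workaround sketch: keep prompt tokens plus generated tokens under n_ctx so the
# context-shift / K-shift path is never hit. Path and headroom are placeholders.
from llama_cpp import Llama

N_CTX = 1024
llm = Llama(model_path="DeepSeek-V2.5-IQ3_XXS-00001-of-00003.gguf",  # hypothetical path
            n_ctx=N_CTX, n_gpu_layers=12)

prompt = "Summarize the DeepSeek-V2 architecture in a short paragraph."
prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
max_new = N_CTX - prompt_tokens - 16   # arbitrary 16-token headroom
out = llm.create_completion(prompt, max_tokens=max_new)
print(out["choices"][0]["text"])
```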
Ahh I see, DeepSeek-V2.5 is also an MoE, so only ~21B of the 236B parameters are active per token - that's why it inferences so much faster... Now if only it wouldn't crash out with the K-shift error and actually supported flash attention so I could get more than 1k context...
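A rough back-of-envelope (my own assumed numbers - IQ3_XXS at ~3.06 bits/weight and Mistral-Large taken as the 123B dense model - and it ignores GPU offload and compute entirely): per generated token, the weight traffic scales with the parameters actually touched:

```python
# Back-of-envelope with assumed numbers: bits per weight for IQ3_XXS (~3.06) and
# Mistral-Large treated as a 123B dense model on a similar quant.
GiB = 1024**3

deepseek_active = 21e9    # DeepSeek-V2.5: 236B total, ~21B activated per token
mistral_dense   = 123e9   # dense model: every parameter is read for every token
bpw             = 3.06    # approx. bits per weight, assuming similar quants

per_tok_deepseek = deepseek_active * bpw / 8 / GiB
per_tok_mistral  = mistral_dense   * bpw / 8 / GiB
print(f"DeepSeek-V2.5: ~{per_tok_deepseek:.1f} GiB of weights per token")
print(f"Mistral-Large: ~{per_tok_mistral:.1f} GiB of weights per token")
print(f"ratio: ~{per_tok_mistral / per_tok_deepseek:.1f}x")
```

The ~6x ratio is at least in the same ballpark as 6-7 vs 1-2 tok/sec, which suggests reading the CPU-resident weights is the main bottleneck.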
> Please, could you give some insight into why you feel this model is so good? I just assumed this was a 'unification' model that combined the strengths of V2-Chat and Coder-V2-Instruct without really changing anything. Some explanation and examples would be appreciated.
DeepSeek-AI said that DeepSeek-V2.5 not only combines DeepSeek-V2-Chat and DeepSeek-Coder-V2 but also improves on DeepSeek-Coder-V2's coding abilities. It also produces output that is more in line with human preferences. You can visit DeepSeek's official website to see DeepSeek-V2.5's scores on benchmarks like HumanEval and MMLU: https://www.deepseek.com/en