https://huggingface.co/deepseek-ai/DeepSeek-V3-0324

#797
by nicoboss - opened

Another awesome 685B model for us to quant! :D

I'm currently downloading it. Once done, I will convert it to BF16 and then to GGUF under /bpool.

I successfully converted DeepSeek-V3-0324 to BF16 and it is now converting to GGUF. It is getting stored under /bpool/DeepSeek-V3-0324.gguf

The /bpool/DeepSeek-V3-0324.gguf is now ready. I softlinked it to /tmp/quant, queued DeepSeek-V3-0324 and force pushed it to nico1. Quantisation of it will start as soon as I have installed the new Intel Arc 770 GPUs.

By the way, there is nothing I can do about this imatrix task being stuck without ever starting, so I will just shut down once the quantisation tasks are done:

1400  377 Mega-Miqu-WizardLM-190B-v0.2                  run/imatrix (GPU-2d)

I know shutting down nico1 while it is in this state will make it fail, so @mradermacher how do I restart a failed imatrix task?

Edit: I just checked the log and I'm not even sure if I want to restart it - that model looks very broken.

system_info: n_threads = 1 (n_threads_batch = 1) / 36 | CUDA : ARCHS = 890 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 488.062 ms
compute_imatrix: computing over 365 chunks with batch_size 512
compute_imatrix: 323.66 seconds per pass - ETA 32 hours 48.90 minutes
[1]nan,[2]nan,[3]nan, ... [239]nan,[240]nan,[241]nan,
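As a quick sanity check, the ETA line in the log follows directly from the per-pass time and the chunk count (a minimal sketch; llama.cpp's own rounding differs slightly):

```python
# Reproduce the compute_imatrix ETA from the numbers in the log above.
seconds_per_pass = 323.66
chunks = 365

total_seconds = seconds_per_pass * chunks
hours = int(total_seconds // 3600)
minutes = (total_seconds % 3600) / 60
print(f"ETA {hours} hours {minutes:.2f} minutes")  # close to the logged 32 hours 48.90 minutes
```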

Edit: Actually Mega-Miqu-WizardLM-190B-v0.2 seems to do something despite showing no progress, as I can clearly see progress in the log and GPU resources being used for this. Hopefully nan doesn't mean garbage.

Edit: I let everything complete gracefully after all:

compute_imatrix: 323.66 seconds per pass - ETA 32 hours 48.90 minutes
[1]nan,[2]nan,[3]nan, ... [363]nan,[364]nan,[365]nan,
Unexpected negative standard deviation of log(prob)

llama_perf_context_print:        load time =  649143.67 ms
llama_perf_context_print: prompt eval time = 12690486.65 ms / 186880 tokens (   67.91 ms per token,    14.73 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 13113368.96 ms / 186881 tokens
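The perf counters at the end are at least internally consistent; a minimal cross-check of the prompt-eval line:

```python
# Cross-check llama_perf_context_print: prompt eval time vs. per-token time and token rate.
prompt_eval_ms = 12_690_486.65
tokens = 186_880

ms_per_token = prompt_eval_ms / tokens               # ~67.91 ms per token
tokens_per_second = tokens / (prompt_eval_ms / 1e3)  # ~14.73 tokens per second
```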

I successfully completed the 2x Intel Arc 770 installation and DeepSeek-V3-0324-GGUF is now quantizing: https://huggingface.co/mradermacher/DeepSeek-V3-0324-GGUF

Download Page: https://hf.tst.eu/model#DeepSeek-V3-0324-GGUF

Nice, can't wait to grab the imatrix.dat file so I can make my own quants.

Thank you.

Actually Mega-Miqu-WizardLM-190B-v0.2 seems to do something despite showing no progress

NaNs usually trigger a very slow microcode path inside CPUs (and presumably also NVIDIA GPUs), which could explain this.

And NaN means garbage; it will usually crash during quantisation of low-bpp models. Not sure if we want to attempt it or just nuke it. Somebody should try out the static quants :-)
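The all-nan log is expected floating-point behaviour: once a single NaN enters a running accumulation (as imatrix statistics involve), every later value is NaN too. A minimal illustration:

```python
import math

# One NaN poisons every subsequent result of a running sum --
# which is why every chunk from the first bad activation onward reports nan.
values = [0.5, 0.25, float("nan"), 0.125]

running_sum = 0.0
history = []
for v in values:
    running_sum += v
    history.append(running_sum)

print(history)  # the last two entries are nan
```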

I backed up the static Q8 quants of DeepSeek-V3-0324 to /bpool while it was uploading. I softlinked /tmp/DeepSeek-V3-0324.Q8_0.gguf to /bpool/DeepSeek-V3-0324.Q8_0.gguf so we can use it to compute the imatrix once we are done with static quants.

while we are at it, /bpool is full of old stuff that you might want to delete (I think it's the models you once queued and provided local copies of). I think everything but sarashina and deepseek is unused.

Dear bros, can we please have a 1.58-bit quant of the full new DeepSeek V3? I need to fit it in my MacBook with 128 GB RAM.

I would be very grateful for this. Thank you!

Hello @dreamworks2050. Yes, we will be providing all the usual static and weighted/imatrix quants, including i1-IQ1_S and i1-IQ1_M. Due to the enormous size of this model, this process will take longer than usual, as the imatrix computation alone will take almost a day due to requiring RPC. You can always take a look at https://hf.tst.eu/status.html to see our current progress.

Do we need to redo all GGUFs because of this? I hope not.

(screenshot: grafik.png)

This parameter seems to only be used in https://github.com/ggml-org/llama.cpp/blob/053b3f9aae63151732eccf6b7408c6418ba8746e/convert_hf_to_gguf.py#L1172 and in https://github.com/ggml-org/llama.cpp/blob/053b3f9aae63151732eccf6b7408c6418ba8746e/convert_hf_to_gguf.py#L1207, which only get called for OrionForCausalLM, BaichuanForCausalLM and BaiChuanForCausalLM, so we should be fine.

We are fine! I downloaded the first part of some random GGUF to check the metadata and I don't see the wrong 16384. Ironically it somehow uses 163840, which is also wrong, but nobody cares as it is larger than the actual supported context length:

.\gguf-parser-windows-amd64.exe -m .\DeepSeek-V3-0324.Q2_K.gguf
+-----------------------------------------------------------------------------------------------------------------+
| METADATA                                                                                                        |
+-------+-------------------------+-----------+--------------+---------------+------------+------------+----------+
|  TYPE |           NAME          |    ARCH   | QUANTIZATION | LITTLE ENDIAN |    SIZE    | PARAMETERS |    BPW   |
+-------+-------------------------+-----------+--------------+---------------+------------+------------+----------+
| model | DeepSeek V3 0324 Bf1... | deepseek2 |     Q2_K     |      true     | 450.94 MiB |  671.03 B  | 0.01 bpw |
+-------+-------------------------+-----------+--------------+---------------+------------+------------+----------+

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| ARCHITECTURE                                                                                                                                      |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
| MAX CONTEXT LEN | EMBEDDING LEN | EMBEDDING GQA | ATTENTION CAUSAL | ATTENTION HEAD CNT | LAYERS | FEED FORWARD LEN | EXPERT CNT | VOCABULARY LEN |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+
|      163840     |      7168     |       1       |       true       |         N/A        |   61   |       18432      |     256    |     129280     |
+-----------------+---------------+---------------+------------------+--------------------+--------+------------------+------------+----------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| TOKENIZER                                                                                                                                             |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
| MODEL | TOKENS SIZE | TOKENS LEN | ADDED TOKENS LEN | BOS TOKEN | EOS TOKEN | EOT TOKEN | EOM TOKEN | UNKNOWN TOKEN | SEPARATOR TOKEN | PADDING TOKEN |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+
|  gpt2 |   2.21 MiB  |   129280   |        N/A       |     0     |     1     |    N/A    |    N/A    |      N/A      |       N/A       |       1       |
+-------+-------------+------------+------------------+-----------+-----------+-----------+-----------+---------------+-----------------+---------------+

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ESTIMATE                                                                                                                                                                                                                                       |
+-----------+--------------+--------------------+-----------------+-------------+----------------+-------------+---------------+----------------+----------------+-----------------------------------------+-------------------------------------+
|    ARCH   | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION |  MMAP LOAD  | EMBEDDING ONLY |  RERANKING  | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED |                   RAM                   |                VRAM 0               |
|           |              |                    |                 |             |                |             |               |                |                +--------------------+--------+-----------+----------------+---------+----------+
|           |              |                    |                 |             |                |             |               |                |                | LAYERS (I + T + O) |   UMA  |   NONUMA  | LAYERS (T + O) |   UMA   |  NONUMA  |
+-----------+--------------+--------------------+-----------------+-------------+----------------+-------------+---------------+----------------+----------------+--------------------+--------+-----------+----------------+---------+----------+
| deepseek2 |    163840    |     2048 / 512     |     Disabled    | Unsupported |       No       | Unsupported |   Supported   |   62 (61 + 1)  |       Yes      |      1 + 0 + 0     | 14 GiB | 14.15 GiB |     61 + 1     | 1.06 TB | 1.01 TiB |
+-----------+--------------+--------------------+-----------------+-------------+----------------+-------------+---------------+----------------+----------------+--------------------+--------+-----------+----------------+---------+----------+
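The odd 0.01 bpw in the METADATA table is an artifact of having downloaded only the first part of the file: assuming gguf-parser simply divides the file size in bits by the parameter count (a hypothetical reconstruction, not its documented formula), 450.94 MiB over 671.03 B parameters gives:

```python
# Hypothetical reconstruction of the BPW column: file size in bits / parameter count.
# 450.94 MiB is only the first shard, not a full Q2_K of the model, hence the tiny value.
size_mib = 450.94
params = 671.03e9

bpw = size_mib * 1024 * 1024 * 8 / params
print(round(bpw, 2))  # 0.01
```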

We are fine! I downloaded the first part of some random GGUF to check the metadata and I don't see the wrong 16384. Ironically it somehow uses 163840, which is also wrong, but nobody cares as it is larger than the actual supported context length:

Thank you for solving a mystery I had of why people were assuming this was a 163k model; the 16k being multiplied by 10 makes sense.

Also, here is my performance on my dual-socket Xeon E5-2690 v3 via ik_llama.cpp, no GPU; if a GPU were local it could boost inference speed a lot.

performance_comparison_pp.png

Just realized I posted PP twice and not TG

performance_comparison_tg.png

This quant mix is quality-oriented; I think I can do better though with a different quant mix that takes speed into account.

The imatrix is available now under https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF/blob/main/imatrix.dat and imatrix quants will appear shortly.

Thanks, I will use it for my new speed-mix attempt. The first V3-0324 quant I made was with bartowski's, as it was the only one available.

Did the imatrix log look fine, and have any imatrix quants been tested? I'm only asking as I made a new quant changing the recipe and the imatrix (using this repo's imatrix and not bartowski's), and this second quant is not functional. I think it is far more likely it is a bad recipe, but I thought I'd ask anyway.

@tdh111

dual socket Xeon E5-2690 v3 via ik_llama.cpp, no GPU

Oh hey, I see you are also trying out CPU-only dual-socket Intel Xeon inference with ik_llama.cpp. I am doing the same thing lately and cooked up a quant using my own imatrix for V3-0324.

While the smaller quant size works great, I hit a snafu with a ~4-5 bit experimental quant recipe I just tried and opened an issue. Hopefully I can work through that.

Then maybe I'll find time to check out this interesting-looking dual-socket NUMA code experiment discussion and fork here.

Anyway, a lot of good discussions over on ik_llama.cpp if you are interested. Or maybe you're already over there with a different name haha.. Cheers!

@tdh111
Anyway, a lot of good discussions over on ik_llama.cpp if you are interested. Or maybe you're already over there with a different name haha.. Cheers!

Yes I am over there ;)

Also off topic, I did end up using the triton dequant method that is in your guide and it worked, but I did have to fight triton-cpu a bit to get it to build.
