https://huggingface.co/microsoft/NextCoder-14B
There are also 7B and 32B models.
It is odd that the models have received so little attention even though they were released over a month ago. Because of Microsoft?
They also released datasets:
https://huggingface.co/datasets/microsoft/NextCoderDataset
https://huggingface.co/datasets/microsoft/NextCoderDataset-Conversational
They are all queued! :D
It is odd that the models have received so little attention even though they were released over a month ago. Because of Microsoft?
Thanks a lot for recommending it. No idea why we missed it. You are the first one requesting it. Please continue to do so for other great models we missed.
You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary pages at the following locations for quants to appear:
Sorry to bother you again. The i1 quants of the 7B and 32B have been released, but not the 14B. Could you please check why?
I requeued it. Let's see why it failed.
Imatrix computation failed because the original model contains a NaN value inside blk.47.attn_q.weight, which is unfortunately an issue only the original model can fix. We had a similar case just a few days ago and back then concluded, after trying everything, that there is no way for us to work around this issue. If you are interested, please take a look at https://huggingface.co/mradermacher/model_requests/discussions/1131:
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 35/49 layers to GPU
load_tensors: CUDA0 model buffer size = 18377.32 MiB
load_tensors: CPU_Mapped model buffer size = 28173.21 MiB
............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 70.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 26.00 MiB
llama_kv_cache_unified: size = 96.00 MiB ( 512 cells, 48 layers, 1 seqs), K (f16): 48.00 MiB, V (f16): 48.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: CUDA0 compute buffer size = 1792.00 MiB
llama_context: CUDA_Host compute buffer size = 11.01 MiB
llama_context: graph nodes = 1878
llama_context: graph splits = 186 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
system_info: n_threads = 1 (n_threads_batch = 1) / 54 | CUDA : ARCHS = 890 | FORCE_MMQ = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 213.94 ms
compute_imatrix: computing over 318 chunks with batch_size 512
compute_imatrix: 1.73 seconds per pass - ETA 9.17 minutes
[1]5.4094,[2]3.7304,[3]3.7226,[4]4.3005,[5]4.2047,[6]3.8600,[7]4.2429,[8]4.2283,[9]4.6589,[10]4.4994,[11]4.3601,[12]4.7589,[13]5.3065,[14]5.5800,[15]6.1146,[16]6.4383,[17]6.6192,[18]7.0366,[19]6.8403,[20]6.9488,[21]7.0202,[22]7.0410,[23]6.8499,[24]7.0588,[25]7.2380,[26]7.1263,[27]7.1357,[28]7.1457,[29]7.3199,[30]7.2777,[31]7.0631,[32]6.7516,[33]6.5843,[34]6.4749,[35]6.4110,[36]6.4669,[37]6.6251,[38]6.6867,[39]6.7120,[40]6.8963,[41]6.9656,[42]7.1524,[43]7.3077,nan detected in blk.47.attn_q.weight
Please keep in mind that this NaN issue will also impact the static NextCoder-14B quants. For most prompts you will get lucky and the model will work as expected, but there will be prompts where the model "crashes" due to encountering a NaN and you have to reroll. This truly is a NaN and not an inf or -inf or some other non-representable/non-finite number, just like in https://huggingface.co/mradermacher/model_requests/discussions/1131, as the error message explicitly specifies nan, which it wouldn't do if it were any other non-finite value.
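In case anyone wants to verify this on the original weights themselves, here is a minimal standalone sketch (my own illustration, not llama.cpp code; count_non_finite is a made-up helper) that scans a tensor loaded as F32 for non-finite entries using the same std::isfinite() test:
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal illustration: scan a tensor's F32 values for non-finite entries.
// "weights" stands in for a tensor such as blk.47.attn_q.weight after loading
// it as F32 with whatever tooling you prefer.
static std::size_t count_non_finite(const std::vector<float> &weights) {
    std::size_t bad = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (!std::isfinite(weights[i])) {
            std::printf("non-finite value %f at index %zu\n", weights[i], i);
            ++bad;
        }
    }
    return bad;
}

int main() {
    // Toy data: one NaN hidden among otherwise normal weights.
    std::vector<float> weights = { 0.1f, -0.7f, NAN, 1.3f };
    std::printf("%zu non-finite value(s) found\n", count_non_finite(weights));
    return 0;
}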
So it must be reported to Microsoft? Or to the original model author, Qwen?
as the error message explicitly specifies nan, which it wouldn't do if it were any other non-finite value.
The error message in the previous case also explicitly specified "nan" but actually checked via isfinite. Isn't it even the same emssage?
The error message in the previous case also explicitly specified "nan" but actually checked via isfinite.
The check is for isfinite, but the message contains the actual value of the non-finite number. If the message states nan, the actual value was NaN. In the past we also saw cases where the message showed inf to indicate an infinite value:
// Any non-finite value aborts the imatrix run; %f prints the offending
// value, i.e. "nan" for a NaN or "inf"/"-inf" for an infinity.
if (!std::isfinite(e.values[j])) {
    LOG_ERR("%f detected in %s\n", e.values[j], wname.c_str());
    exit(1);
}
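As a quick sanity check of that claim (standalone demo, not llama.cpp code), %f really does render the three kinds of non-finite values differently:
#include <cmath>
#include <cstdio>

// %f prints "nan" for a NaN and "inf"/"-inf" for infinities, so the logged
// text tells you which kind of non-finite value was encountered.
int main() {
    std::printf("%f\n", (double) NAN);       // prints: nan (or -nan on some libcs)
    std::printf("%f\n", (double) INFINITY);  // prints: inf
    std::printf("%f\n", (double) -INFINITY); // prints: -inf
    return 0;
}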
Isn't it even the same message?
It is the same message and the same check. The only reason it looked slightly different in some of the logs posted in the previous case is that I slightly modified it to show some more information for debugging purposes.
So it must be reported to Microsoft? Or to the original model author, Qwen?
Microsoft, but I have the feeling they surely tested the model for NaNs before releasing it, so maybe it is indeed convert_hf_to_gguf.py introducing the NaNs, in which case the issue is on llama.cpp's side. I will likely do some experimentation with F32 instead of F16 conversion and see if the result still contains NaNs in that case.
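For what it's worth, here is a minimal standalone sketch (my own illustration, not part of convert_hf_to_gguf.py or llama.cpp) of how a narrowing F32 -> F16 conversion behaves: a finite value above the F16 range overflows to inf, while an existing NaN simply stays a NaN. It assumes a compiler/target with the _Float16 extension (recent GCC/Clang):
#include <cmath>
#include <cstdio>

// Narrowing F32 -> F16: values above the F16 maximum (65504) become inf,
// and a NaN in the source stays a NaN after conversion.
int main() {
    const float values[] = { 1.0f, 70000.0f /* > F16 max */, NAN };
    for (float v : values) {
        _Float16 h = (_Float16) v;   // narrow to half precision
        float back = (float) h;      // widen again so we can print it
        std::printf("f32 %10.1f -> f16 %10.1f (finite: %d)\n",
                    v, back, (int) std::isfinite(back));
    }
    return 0;
}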