From Ether to Syntax: A Meta-Analytic Exploration of Linguistic Algorithmic Landscapes
continued....
Here is a complete list of the newly added architectures.
The non-mm archs are picked up automatically when llama is updated (or rather, nothing checks for these archs other than the script that shows me daily models).
Nice. Will do, in case you forgot any vision/audio architecture.
In case you need it, the list/regex is currently in /llmjob/share/llmjob.pm - search for is_vision
Also, vision is mradermacher code for multi-modal from now on.
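For reference, roughly what that check amounts to, as a minimal Python sketch (the real thing is the Perl regex in /llmjob/share/llmjob.pm, and the architecture names below are just examples, not the actual set):

import re

# Illustrative sketch only: the real list is a regex in /llmjob/share/llmjob.pm
# (search for is_vision); these architecture names are examples, not the real set.
VISION_ARCH_RE = re.compile(
    r"KimiVLForConditionalGeneration|Qwen2VLForConditionalGeneration|LlavaForConditionalGeneration"
)

def is_vision(architecture: str) -> bool:
    # "vision" here means multi-modal (vision/audio), per the convention above
    return bool(VISION_ARCH_RE.search(architecture))

print(is_vision("KimiVLForConditionalGeneration"))  # True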
BERT-based architectures seem to be incredibly popular.
I might exclude them from the daily list for that reason, and because they are likely not popular with the people who consume GGUFs (and most fail because small models tend to have custom tokenizers).
Nice, I just discovered an easy way to requeue previously failed architectures:
Yup, shell-greppable logs for the win.
Update: oh, it's not even the real log file, "just" the llmc why transform of it.
@RichardErkhov vision models should not be queued to rich1 unless they are not being detected as such (and then no vision extraction should happen).
The non-vision jobs are limited to 32GB RAM, too. No clue what happened. Very troubling.
However, this morning, only besteffort models were queued on rich1. Who knows what nico queued...
Well, good to know. Usually you take like 4-8GB, but something went wrong today. Peak recorded by Proxmox was 24GB (so I assume it was even higher, but due to the total OOM it might not have recorded the full number). I added swap on root just in case this happens again, so at least other things on the server don't die haha
llmc audit besteffort
skips the besteffort models for me.
Please restart the Audio-Reasoner imatrix computation. I killed it earlier today because it ran on CPU. I'm still not sure what makes GPUs occasionally temporarily disappear, but it seems related to them being used in a different container.
llmc audit besteffort skips the besteffort models for me.
Right, arguments were not passed to llmjob audit. Should be fixed now.
Peak recorded by proxmox was 24gb
Well, given that I was officially allowed to use 64GB, 24GB seems absolutely normal. So what is the new limit? 24GB will only allow one quant, and maybe not even that.
the PLM-1.8 imatrix failure looks somewhat interesting:
/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:82: CUDA error
CUDA error: the requested functionality is not supported
current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at /llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:1939
cublasGemmStridedBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, src0_ptr, cu_data_type_a, nb01/nb00, nb02/nb00, src1_ptr, cu_data_type_b, s11, s12, beta, dst_t, cu_data_type, ne0, ne1*ne0, ne12*ne13, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
I wonder what functionality is missing. bf16 support? Maybe something went wrong compiling kernels? llama.cpp "recently" (months ago?) did something with which kernels are actually compiled for which archs, to save space, but I can't seem to find the issue for it atm.
It's this call:
if (r2 == 1 && r3 == 1 && ggml_is_contiguous_2(src0) && ggml_is_contiguous_2(src1)) {
    // there is no broadcast and src0, src1 are contiguous across dims 2, 3
    // use cublasGemmStridedBatchedEx
    CUBLAS_CHECK(
    cublasGemmStridedBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N,
            ne01, ne11, ne10,
            alpha, src0_ptr, cu_data_type_a, nb01/nb00, nb02/nb00, // strideA
                   src1_ptr, cu_data_type_b, s11,       s12,       // strideB
            beta,     dst_t, cu_data_type,   ne0,       ne1*ne0,   // strideC
            ne12*ne13,
            cu_compute_type,
            CUBLAS_GEMM_DEFAULT_TENSOR_OP));
} else {
    // use cublasGemmBatchedEx
    const int64_t ne23 = ne12*ne13;

    ggml_cuda_pool_alloc<const void *> ptrs_src(ctx.pool(), 2*ne23);
    ggml_cuda_pool_alloc<      void *> ptrs_dst(ctx.pool(), 1*ne23);
@mradermacher Please update to the latest llama.cpp version in our fork.
Our fork finally adds the --outtype source option to convert_hf_to_gguf.py. It now keeps F16, BF16 and F32 tensors in their original datatype, falls back to F16 for unknown datatypes, and keeps storing tensors that should always be F32 in F32 according to the GGUF specification. I tested this option for a few models and found no issues so far. I might even try to upstream this change as it seems really useful, so I recommend you make use of it after updating by specifying --outtype source.
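To make the rule concrete, here is a minimal Python sketch of the datatype selection described above (a hypothetical helper, not the actual convert_hf_to_gguf.py code):

# Hypothetical sketch of the --outtype source rule: keep F16/BF16/F32 as-is,
# force spec-mandated F32 tensors to F32, fall back to F16 for unknown datatypes.
def pick_output_dtype(source_dtype: str, must_be_f32: bool) -> str:
    if must_be_f32:                        # tensors the GGUF spec requires in F32
        return "F32"
    if source_dtype in ("F16", "BF16", "F32"):
        return source_dtype                # keep the original datatype
    return "F16"                           # unknown datatypes fall back to F16

assert pick_output_dtype("BF16", False) == "BF16"
assert pick_output_dtype("F64",  False) == "F16"
assert pick_output_dtype("BF16", True)  == "F32"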
Imatrix changes:
- imatrix : use GGUF by default (https://github.com/ggml-org/llama.cpp/pull/14842)
  - This adds the new argument --output-format {gguf,dat} (see the usage sketch after this list)
- imatrix : fix 3d activation handling for hybrid and recurrent models (https://github.com/ggml-org/llama.cpp/pull/14994)
  - With this, our imatrix.cpp patch to fix NaNs is no longer required - I'm so surprised git managed to remove it from imatrix.cpp automatically during merging
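For reference, a sketch of how the new argument would be passed (file names are placeholders, not our actual setup; the other flags are the usual llama-imatrix options):

import subprocess

# Illustrative invocation only: paths/names are placeholders.
subprocess.run([
    "./llama-imatrix",
    "-m", "model.gguf",            # model to compute the importance matrix for
    "-f", "calibration.txt",       # calibration text
    "-o", "imatrix.gguf",          # output file
    "--output-format", "gguf",     # new argument from PR 14842; "dat" keeps the old format
], check=True)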
Other important changes:
- HunYuanDenseV1ForCausalLM support
- Qwen3-Embedding models
- fix tokenizer for JetBrains Mellum
- KimiVLForConditionalGeneration (text only)
- Glm4MoeForCausalLM support
@mradermacher If you have time, please also mark the cogito-v2-preview-llama-405B imatrix task as imatrix RPC. If you don't want to use RPC, we could use /root/cogito-v2-preview-llama-405B.Q8_0.gguf, but I believe the model deserves RPC.
I'm so surprised git managed to remove it from imatrix.cpp automatically during merging
If true, wouldn't that be a bug? Git is not supposed to silently remove changes on conflicts. It's the whole point of such a system to make sure changes are not silently overwritten :)
--outtype source
That seems exactly what I was asking for / what I would have expected it to do by default already. The only issues I can see are either models that are too big, or issues with mixed arithmetic in kernels. Anyway, it's the deal now, once llama has been updated.
updated, and using --outtype source for everything now
cogito-v2-preview-llama-405B
marked, no quant
and we have some fat glm 4.5 models in the queue
It would be nice to have some models where we can actually provide all quants for a change. Sigh. Right now, it feels like no nontrivial model survives quant creation without hacks.
PS: I've included IQ3_XXS in the quants we skip for "nolow".
PPS: especially frustrating because these big models really deserve low-bit quants.
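For clarity, the "nolow" skip amounts to something like this (a tiny Python sketch; the skip set is an assumption, not the actual job config):

# Hypothetical sketch of a "nolow" filter; the skip set is an assumption.
NOLOW_SKIP = {"IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M", "IQ3_XXS"}

def quants_to_build(requested: list[str], nolow: bool) -> list[str]:
    return [q for q in requested if not (nolow and q in NOLOW_SKIP)]

print(quants_to_build(["Q8_0", "Q4_K_M", "IQ3_XXS"], nolow=True))  # ['Q8_0', 'Q4_K_M']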
@mradermacher Please update llama.cpp to the latest version of our fork so we can do https://huggingface.co/openai/gpt-oss-120b and https://huggingface.co/openai/gpt-oss-20b