From Ether to Syntax: A Meta-Analytic Exploration of Linguistic Algorithmic Landscapes

#6
by mradermacher - opened

continued....

mradermacher changed discussion status to closed

Here is a complete list of the newly added architectures.

The non-mm-archs are picked up automatically when llama is updated (rather, nothing checks for these archs, other than the script that shows me daily models).

Nice. Will do in case you forgot any vision/audio architecture.

In case you need it, the list/regex is currently in /llmjob/share/llmjob.pm - search for is_vision.
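
Roughly, it boils down to matching the architecture name against that regex; here is an illustrative Python sketch of the idea (the real check is Perl in llmjob.pm, and the arch names below are only examples, not the actual list):

    import re

    # Illustrative only: the real list lives as a Perl regex in
    # /llmjob/share/llmjob.pm (search for is_vision); the arch names
    # below are examples, not the actual list.
    VISION_ARCH_RE = re.compile(
        r"KimiVLForConditionalGeneration"
        r"|Qwen2VLForConditionalGeneration"
        r"|LlavaForConditionalGeneration"
    )

    def is_vision(arch: str) -> bool:
        """True if the architecture string counts as multi-modal ("vision")."""
        return VISION_ARCH_RE.search(arch) is not None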

Also, vision is mradermacher code for multi-modal from now on.

BERT-based architectures seem to be incredible

I might exclude them from the daily list for that reason, and because they are likely not popular with the people who consume GGUFs (and most fail because small models tend to have custom tokenizers).

Nice, I just discovered an easy way to requeue previously failed architectures:

Yup, shell-greppable logs for the win.

Update: oh, it's not even the real log file, "just" the llmc why transform of it.
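
For illustration only, the idea is just to grep the failure output for an architecture name and feed the matching models back into the queue; in practice it is the shell-greppable output of llmc why that gets grepped. A hypothetical Python sketch (log path, line format and requeue step are all made up):

    import re
    import sys

    # Hypothetical sketch: the log format and requeue step are made up;
    # the real workflow greps the output of `llmc why` instead.
    def failed_models(log_path: str, arch: str):
        pat = re.compile(rf"failed.*arch={re.escape(arch)}.*model=(\S+)")
        with open(log_path) as log:
            for line in log:
                m = pat.search(line)
                if m:
                    yield m.group(1)

    if __name__ == "__main__":
        for model in failed_models(sys.argv[1], sys.argv[2]):
            print(model)  # pipe this into whatever requeues the model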

@RichardErkhov vision models should not be queued to rich1 unless they are not being detected as such (and then no vision extraction should happen).

The non-vision jobs are limited to 32GB ram, too. No clue what happened. Very troubling.

However, this morning, only besteffort models were queued on rich1. Who knows what nico queued...

Well, good to know. Usually you take like 4-8GB, but something went wrong today. The peak recorded by Proxmox was 24GB (so I assume it was even higher, but due to the total OOM it might not have recorded the full number). I added swap on root just in case this happens again, so at least other things on the server don't die haha

llmc audit besteffort skips the besteffort models for me.

Please restart the Audio-Reasoner imatrix computation. I killed it earlier today because it ran on the CPU. I'm still not sure what makes GPUs occasionally temporarily disappear, but it seems related to them being used in a different container.

llmc audit besteffort skips the besteffort models for me.

Right, arguments were not passed to llmjob audit. Should be fixed now.

@RichardErkhov

Peak recorded by proxmox was 24gb

Well, given that I was officially allowed to use 64GB, 24GB seems absolutely normal. So what is the new limit? 24GB will only allow one quant, and maybe not even that.

The PLM-1.8 imatrix failure looks somewhat interesting:

/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:82: CUDA error
CUDA error: the requested functionality is not supported
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at /llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:1939
  cublasGemmStridedBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, src0_ptr, cu_data_type_a, nb01/nb00, nb02/nb00, src1_ptr, cu_data_type_b, s11, s12, beta, dst_t, cu_data_type, ne0, ne1*ne0, ne12*ne13, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

I wonder what functionality is missing. bf16 support? Maybe something went wrong compiling the kernels? llama.cpp "recently" (months ago?) changed which kernels actually get compiled for which archs, to save space, but I can't seem to find the issue for it atm.

It's this call:

    if (r2 == 1 && r3 == 1 && ggml_is_contiguous_2(src0) && ggml_is_contiguous_2(src1)) {
        // there is no broadcast and src0, src1 are contiguous across dims 2, 3
        // use cublasGemmStridedBatchedEx
        CUBLAS_CHECK(
        cublasGemmStridedBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N,
                ne01, ne11, ne10,
                alpha, src0_ptr, cu_data_type_a, nb01/nb00, nb02/nb00, // strideA
                       src1_ptr, cu_data_type_b, s11,       s12,       // strideB
                beta,     dst_t, cu_data_type,   ne0,       ne1*ne0,   // strideC
                ne12*ne13,
                cu_compute_type,
                CUBLAS_GEMM_DEFAULT_TENSOR_OP));
    } else {
        // use cublasGemmBatchedEx
        const int64_t ne23 = ne12*ne13;

        ggml_cuda_pool_alloc<const void *> ptrs_src(ctx.pool(), 2*ne23);
        ggml_cuda_pool_alloc<      void *> ptrs_dst(ctx.pool(), 1*ne23);
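
One thing I'd check first (a quick sketch, assuming PyTorch with CUDA happens to be installed on that host): whether the card even reports bf16 support, since the cuBLAS bf16 GEMM paths generally want compute capability 8.0 (Ampere) or newer, and an older card would explain "the requested functionality is not supported".

    import torch

    # Quick sanity check; assumes PyTorch with CUDA is available on the host.
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            print(f"device {i}: {torch.cuda.get_device_name(i)}, "
                  f"compute capability {major}.{minor}")
        # is_bf16_supported() only looks at the current device
        print("bf16 supported on current device:", torch.cuda.is_bf16_supported())
    else:
        print("no CUDA device visible")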

@mradermacher Please update to the latest llama.cpp version in our fork.

Our fork finally adds the --outtype source option to convert_hf_to_gguf.py. It now keeps F16, BF16 and F32 tensors in their original datatype, falls back to F16 for unknown datatypes, and keeps storing tensors that should always be F32 in F32 according to the GGUF specification. I tested this option for a few models and found no issues so far. I might even try to upstream this change as it seems really useful, so I recommend you make use of it after updating by specifying --outtype source.
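
In Python-ish pseudocode, the dtype selection behaves roughly like this (an illustrative sketch of the described behavior, not the actual convert_hf_to_gguf.py code; must_be_f32 is a made-up stand-in for the GGUF rule that certain tensors are always stored as F32):

    # Illustrative sketch of how --outtype source picks the stored dtype;
    # not the real convert_hf_to_gguf.py code.
    def output_dtype(tensor_dtype: str, must_be_f32: bool) -> str:
        if must_be_f32:
            return "F32"         # tensors the GGUF spec always stores as F32
        if tensor_dtype in ("F16", "BF16", "F32"):
            return tensor_dtype  # keep the source datatype as-is
        return "F16"             # unknown datatypes fall back to F16

    # Usage stays the same as before, just with the new outtype, e.g.:
    #   python convert_hf_to_gguf.py <model_dir> --outtype source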

Imatrix changes:

Other important changes:

  • HunYuanDenseV1ForCausalLM support
  • Qwen3-Embedding models
  • fix tokenizer for JetBrains Mellum
  • KimiVLForConditionalGeneration (text only)
  • Glm4MoeForCausalLM support

@mradermacher If you have time, please also mark the cogito-v2-preview-llama-405B imatrix task as imatrix RPC. If you don't want to use RPC we could use /root/cogito-v2-preview-llama-405B.Q8_0.gguf but I believe the model deserves RPC.

I'm so surprised git managed to remove it from imatrix.cpp automatically during merging.

If true, wouldn't that be a bug? Git is not supposed to silently remove changes on conflicts. It's the whole point of such a system to make sure changes are not silently overwritten :)

--outtype source

That seems exactly like what I was asking for / what I would have expected it to do by default already. The only issues I can see are either too-big models, or issues with mixed arithmetic in kernels. Anyway, it's the default now, once llama has been updated.

updated, and using --outtype source for everything now

cogito-v2-preview-llama-405B

marked, no quant

and we have some fat glm 4.5 models in the queue

It would be nice to have some models where we can actually provide all quants for a change. Sigh. Right now, it feels like no nontrivial model survives quant creation without hacks.

PS: I've included IQ3_XXS in the quants we skip for "nolow".

PPS: especially frustrating because these big models really deserve low-bit quants.

@mradermacher Please update llama.cpp to the latest version of our fork so we can do https://huggingface.co/openai/gpt-oss-120b and https://huggingface.co/openai/gpt-oss-20b
