Whimsical Waffle: The Curious Case of LLMs and Their Linguistic Shenanigans

#4
by mradermacher - opened

yay

mradermacher changed discussion status to closed

Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:

mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000

Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null). Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.
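
Roughly what I have in mind, as a sketch (parse_header_size() and NeedMoreData are hypothetical placeholders for an incremental header parser; the dummy-gguf reconstruction is left out):

import requests

class NeedMoreData(Exception):
    pass

def fetch_gguf_header(url: str, chunk: int = 4 << 20) -> bytes:
    # Grow the buffer in 4MB range requests until the (hypothetical)
    # incremental parser can tell us the full header length.
    buf = b""
    while True:
        r = requests.get(url, headers={"Range": f"bytes={len(buf)}-{len(buf) + chunk - 1}"})
        r.raise_for_status()
        buf += r.content
        try:
            return buf[:parse_header_size(buf)]
        except NeedMoreData:
            pass  # header is longer than what we have so far - fetch more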

(using fuse to mount a file via https is cheating)

btw., in the case of blacksheep, i take the lists of quants done from the "quantize" script and patch the job like this:

"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S",

and for the jais models for example, I removed the *0, *1, IQ4_NL quants, essentially:

"squants": "x-f16 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS",
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q2_K_S IQ1_M Q3_K_S IQ2_XXS Q3_K_L IQ2_XS Q5_K_S IQ2_S IQ1_S Q5_K_M IQ3_XS IQ3_S",

it's in theory possible to do this when adding the job (not via llmc, because reasons), but that requires us to predict with some accuracy that this will happen, so it's rarely useful

Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:
mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000

I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model so what would be the point of it?

Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null).

What do you mean by touching memory? No additional RAM or GPU memory should get allocated when loading a model. Obviously llama.cpp requires some memory to function, like any application, but that is so little it can be ignored.

Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.

I don't think the header can be that big so you can likely just download enough for the full header to always be present.

btw., in the case of blacksheep, i take the lists of quants done from the "quantize" script and patch the job like this
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S"

I assume you are setting this inside llmjob edit.

Wouldn't the scripts synchronize when it is available again?

Altogether it's 3GB, not just scripts, but also, of course, llama.cpp. I added a hack so when removing the disable flag it will sync automatically, but I also update llama.cpp from home, and every node has a different combination of llama.cpp variants (probably the easiest way around is to change that).

But, yeah, that's not effectively automatable.

Yes even for me it would now be inconvenient to switch as I memorized the path so well.

embrace the difference :)

Oh, let's hope for the best. No imatrix failure so far but a lot of imatrix tasks will only be started at 22:00 due to most of them currently being timeofday blocked.

I am pretty sure the dryrun test works - the only way it could fail is if it somehow succeeds despite the model being broken. Likely there are some tests in llama.cpp that are only done at inference time; the question is how many, and are they important :) We will find out.

Just so you know DRYRUN is supposed to work with every llama.cpp executable that loads a model so you are not limited to llama-cli.

To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.

Then just don't use llama-cli but any other one that doesn't do this.

Haha, "just". Love it :) Anyway, are there any? There is the server, but the server seems to do the same thing.

Nice. No idea why everyone keeps renaming their models, but us having a different name makes our models hard to find, so automated renames would be quite useful.

They rename it because they want to be able to erase it and create a different one without having to come up with a new final name, in case it sucks. Models are also regularly moved, and sometimes even apparently cloned, to other users.

It does make them harder to find, but at least I stopped using the search function by hf and started to use the quantisations link.

That would be amazing! There are quite a lot of factors that influence vram usage but maybe you can find a pattern by playing around with dryrun.

I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.
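
The search itself would be trivial - a sketch, where fits_in_vram() stands for a DRYRUN invocation with the given -ngl whose reported allocations are checked against the user-specified VRAM (both the helper and its name are made up here):

def max_ngl(n_layers: int, fits_in_vram) -> int:
    # Classic binary search over the number of offloaded layers.
    lo, hi, best = 0, n_layers, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits_in_vram(mid):   # dry-run load with -ngl mid fits the VRAM budget
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best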

models always show the date when they were last updated

You'll have to check quant file dates anyway if you need some kind of date. And then, it's pretty useless.

I guess we can at least try to update them in chronological order, so the order stays the same. Or can we?!?

The updates would almost certainly go from newest to oldest, even (or rather, reverse order in how hf lists them for me), with some randomness.

GIT_COMMITTER_DATE and GIT_AUTHOR_DATE environment variables before committing using git

If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster. Besides, will the server-side git really just accept any client-side garbage date when pushed?

as this will hopefully be the last time we ever edit all of them.

The other a-ha moment I had last week was when I realised that this is the problem and must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.

I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.

I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model so what would be the point of it?

I was talking about an alternative way to achieve just the validity testing without changing llama.cpp. It's entirely hypothetical.

I don't think the header can be that big so you can likely just download enough for the full header to always be present.

The header is pretty massive - tiny if you look at the whole file, but many megabytes in size, enough to warrant an optimisation. My first computer had ~100 octets usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.

Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string. Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.

And to be honest, what worries me most is that aws typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parse on the hf page exists, I am sure it doesn't matter :)

To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.

It should work for the majority of them. Almost all that load a model are using the same code to do so. I just tested llama-imatrix, llama-perplexity, llama-simple, llama-simple-chat and llama-run, all of which were fully compatible with DRYRUN despite me never testing them before. It's not just that they work - they also tell you how much memory would be required to load the model in a way that fulfills their purpose, as they essentially just load the model with the exact parameters they require.

Haha, "just". Love it :) Anyway, are there any?

No idea. Try the ones I mentioned above and if they all do it then this is likely something in the model loading code, in which case I can take a look at the code and see if we can change this.

I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.

That would be so awesome. This is actually exactly what I'm currently using DRYRUN for myself.

Keep in mind that DRYRUN only tells you the memory required to load the model and allocate enough memory for its context. Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.

If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster.

Totally understandable.

will the server-side git really just accept any client-side garbage date when pushed?

All git servers seem to. Git servers kind of trust client-side garbage by design. I had to spoof dates/names/emails for author/committer so many times in the past and not once did a git server refuse the commit. The only thing I'm not sure about is whether HuggingFace uses the time in the git commit like GitHub/GitLab do or the server time of the push. Now I'm a bit curious so the next time I upload a model I might try it.
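
For reference, spoofing the dates only takes the two environment variables (a minimal sketch with a made-up date and commit message; our actual uploads go through the HF API, not plain git):

import os
import subprocess

# git takes the author/committer timestamps from these variables,
# and servers generally store whatever the client sends.
env = dict(os.environ,
           GIT_AUTHOR_DATE="2024-01-01T12:00:00",
           GIT_COMMITTER_DATE="2024-01-01T12:00:00")
subprocess.run(["git", "commit", "-m", "backdated card update"], env=env, check=True)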

The other a-ha moment I had last week was when I realized that this is the problem and must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.
I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.

Yes it should be fine unless we hit some kind of rate limit.

The header is pretty massive - tiny if you look at the whole file, but many megabytes in size, enough to warrant an optimization. My first computer had ~100 octets usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.

My first "Gameboy" which in fact was a Voyage 200 calculator for school had 188 kB RAM and 2,7 MB ROM and it was enough to play all kind of games. I even had something like Maro Maker on there. I actually had that Voyage 200 calculator 5 years before I had my first mobile phone and used it from everything from reading, writing, programming and gaming.

In case you wonder, my first PC ran Windows 2000 with 13 GB of HDD storage and I think 128 MB of RAM. My first programming language was BlitzBasic to write PC games, followed by Compact-C, which I used to program C-Control Pro microcontrollers with 2 KB of usable RAM, 10 KB of usable flash storage, 1 KB EEPROM and a 14.7456 MHz CPU, so I know the feeling.

Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string.

That is indeed terribly wasteful. 1 byte would have been enough.

Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.

Which should be fine as llama.cpp was so nice to put stupid limits everywhere so most edge cases likely already failed when we tried converting them into GGUF.

And to be honest, what worries me most is that aws typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parse on the hf page exists, I am sure it doesn't matter :)

S3 only charges for the actually used bandwidth as far as I'm aware. So if you only download the first 10 MB, HuggingFace should only be charged for 10 MB. They do charge a very low amount per 10K API calls, but this doesn't matter at all as we only have around 500K quants. I'm mostly worried that HuggingFace might be using intelligent tiering, in which case us accessing all the quants might cause them to be copied into hot storage, which would then cost them the transfer fee plus 30 days of hot storage. But in any case, there is not much we can do about any of this unless we find a storage usage pattern and can, based on one quant, tell how much all the others require, which I think might be possible.
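
The pattern I have in mind is something like this (a sketch under the assumption that file size scales roughly linearly with bits-per-weight; the BPW numbers are ballpark values, not measurements):

APPROX_BPW = {"Q2_K": 2.7, "Q4_K_S": 4.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "f16": 16.0}

def estimate_size(known_quant: str, known_bytes: int, target_quant: str) -> int:
    # Estimate a quant's file size from one known quant of the same model.
    return int(known_bytes * APPROX_BPW[target_quant] / APPROX_BPW[known_quant])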

Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.

That's a bummer then... So how would you easily estimate it? And what do you mean by more memory being required to "load" a model - after loading, flash attention surely uses less memory.

Yes it should be fine unless we hit some kind of rate limit.

That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes is a bad idea. But, I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.

Voyage 200 calculator for school

I got the first HP 48SX in Germany (or so I was actually told by HP). Sigh. HP calculators... were so nice...

Windows 2000

Wow. That is so long after I had switched to GNU/Linux. (I switched from DOS to Linux just before win 3 became ubiquitous (in 1994, with 1.0.2 or something - I was even late to the game, or so it felt))

That is indeed terribly wasteful. 1 byte would have been enough.

Yeah, or 4 octet (or even 8 octet) header length + json/msgpack/cbor/... and yes, one octet would be enough if you limit strings to 127 octets, but to be fair, that's a limit of the encoder, not a limit of the format.

I'd say whoever designed it (well, gerganov) was probably paranoid about running into arbitrary 4GB limits anywhere. Puzzlingly enough, though, the primitive type numbers (there are 13) are stored in 32 bit ints. And no, everything is just octet-aligned, so it's nothing to do with that.

To its defence, the gguf decoder I wrote in Perl is just 80 lines of code. So in that sense, it lends itself to a very simple implementation. But using an existing JSON decoder with that header would just be 3 lines or so...
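
For illustration, here is roughly what such a decoder boils down to - a Python sketch of the header layout as I understand it (magic, version, counts, then typed key/value pairs with 8-byte string lengths and 32-bit type ids), not the Perl original:

import struct

def read(buf, pos, fmt):
    size = struct.calcsize(fmt)
    return struct.unpack_from(fmt, buf, pos)[0], pos + size

def read_string(buf, pos):
    n, pos = read(buf, pos, "<Q")          # 8-byte length for every string
    return buf[pos:pos + n].decode("utf-8"), pos + n

SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}

def read_value(buf, pos, vtype):
    if vtype == 8:                          # string
        return read_string(buf, pos)
    if vtype == 9:                          # array: element type + uint64 count
        etype, pos = read(buf, pos, "<I")
        count, pos = read(buf, pos, "<Q")
        vals = []
        for _ in range(count):
            v, pos = read_value(buf, pos, etype)
            vals.append(v)
        return vals, pos
    return read(buf, pos, SCALARS[vtype])

def read_header(buf):
    assert buf[:4] == b"GGUF"
    version, pos = read(buf, 4, "<I")
    n_tensors, pos = read(buf, pos, "<Q")
    n_kv, pos = read(buf, pos, "<Q")
    kv = {}
    for _ in range(n_kv):
        key, pos = read_string(buf, pos)
        vtype, pos = read(buf, pos, "<I")   # value type id stored as a 32-bit int
        kv[key], pos = read_value(buf, pos, vtype)
    return version, n_tensors, kv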

I think ggerganov has a major fear of external dependencies - even more than me, and I thought I was a bit on the extreme side.

S3 only charges for the actually used bandwidth as far as I'm aware.

I admit I am no expert, but it seems to be a well known attack to request only part of a large file and get billed with much larger transfer costs because aws does not bill octets downloaded but octets prepared for download, regardless of how much actually was used (or even requested). So yes, only actually used bandwidth, but it's their internal fantasy made up bandwidth, not the external customer-measurable bandwidth. It is possible that it only affects some S3 storage products, but it's a concern. Well, it's not a concern, because huggingface does it themselves, and I am happy to cache things...

S3

And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?

Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:

load_tensors: loading model tensors, this can take a while... (mmap = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected  5120, 152064, got  5120, 151665,     1,     1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'Methuselah-PSR-B1620-26b-14B-Exp.gguf'
main: Dryrun compleated!

changed the test to this:

      if DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -n 0 -t 1 -no-cnv -st </dev/null 2>&1 | tee -a /dev/fd/2 | grep -q ": failed to load model"; then

That's a bummer then... So how would you easily estimate it? And what do you mean by more memory being required to "load" a model - after loading, flash attention surely uses less memory.

DRYRUN tells you how much memory you need to load a model and reserve the memory required for its context. So if you have as much memory as DRYRUN tells you, you will be able to load the model. However, depending on context and prompt, you might still OOM during inference as some memory is allocated during inference for algorithms like attention. The memory required for attention should more or less be the same for a given context with a given attention method. So you can likely measure it once and add it on top of what DRYRUN tells you is required to load the model. Flash attention needs more memory during the initial load, but the attention algorithm itself uses linear instead of quadratic memory for a given context, which for large contexts should be more memory efficient.
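
As a sketch of that estimate (the scaling assumptions mirror what I wrote above - roughly linear in context with flash attention, roughly quadratic without - and the per-token constants are placeholders you would calibrate from one measured run):

def estimate_total_bytes(dryrun_bytes: int, n_ctx: int, flash_attn: bool,
                         per_token: float, per_token_sq: float) -> int:
    # dryrun_bytes: what DRYRUN reports for loading the model + context buffers.
    # The inference-time attention overhead is added on top.
    if flash_attn:
        overhead = per_token * n_ctx          # roughly linear in context
    else:
        overhead = per_token_sq * n_ctx ** 2  # roughly quadratic in context
    return int(dryrun_bytes + overhead)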

That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes is a bad idea. But, I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.

The limit can't be so bad that it will take years. We should try to update them in a reasonable timeframe as the current model card isn’t that good in my opinion.

And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?

They do, but it is $0.0004 per 1,000 requests, so if we need 500K of them that is $0.2, which is so low it's almost not worth mentioning.

HuggingFace will be fine:
"There are no retrieval charges in S3 Intelligent-Tiering. If an object in the infrequent access tier is accessed later, it is automatically moved back to the frequent access tier. No additional tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class."

So if they use Intelligent-Tiering they are not getting charged for AWS being stupid, besides paying slightly more for files being in less cold storage for 30 days, which is almost nothing compared to what retrieval charges would be.

In case you wonder, egress from S3 to Europe (Zurich) is $0.02 per GB and nothing if it only goes to Amazon CloudFront (which has its own billing for bandwidth). Based on their website they really seem to only count data that actually gets sent to the internet, and intelligent tiering has no retrieval fee, so they really shouldn't bill for the data we don't download unless they found some kind of loophole to trick their customers.

But in any case, there is nothing we can do about any of this.

Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0

That's so stupid. Sorry for this mistake. I forgot about that. I will be fixing it today evening.

changed the test to this

This will work in the meantime.

DRYRUN tells you how much memory you need

I realise what you mean. I guess it can also be handled by telling the user to reduce ngl a bit when in doubt. It will still be far more useful than the manual trial runs I have to do now.

The limit can't be so bad that it will take years.

I meant to say "even if it takes a few years..." and I didn't expect the repo create limit to be as bad as it is. Or erratic(?) - it still feels weird to get rate limited sometimes, even when we don't crunch through lots of models.

S3

Thanks, helps a lot :)

This will work in the meantime.

We are not in a hurry - assuming that we always get "failed to load model". Eh, even if it would not, it'd still be a great improvement :)

model page

Well, my plan is to get rid of graphs and everything but the download table and the links, and also more or less fully generate the page and move all metadata to yaml. The only hindrance is that it is a lot of work, and even a single misplaced space or fixed typo will cause havoc :) Just not so much fun. But I am slowly working towards making it doable (and gaining motivation by not forcing myself to work on it :)

If you have any concrete input (text fragments, layout) on the model page, I am happy to collect it. The general trend, though, should be to move as much of the info to the external model page, so there is only one place to improve. Unfortunately, the model download page already needs revamping, too, and already goes too much in the direction of web development for my taste :)

Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:

This should now be fixed in the latest version. I kind of forgot about llama.cpp sometimes using exceptions to jump out of heavily nested functions skipping all the code that would otherwise get executed by following the normal return path. I personally don't really like throwing exceptions somewhere and handling them on a completely different location - it feels like a modern version of goto but without labeling where it jumps to.

I fixed this by adding a dedicated exit point for dry-run inside common.cpp to no longer mess with llama.cpp's exception handling and removing all modifications from main.cpp. This now ensures exceptions skip past the dry-run dedicated exit point and are instead properly handled by main.cpp.

I also updated the mradermacher branch to latest llama.cpp so we now have Gemma 3 and experimental Gemma 3 vision support.

You guys might find this interesting: https://arxiv.org/abs/2503.03592

Quote from conclusion:

Further, the usage of importance matrices written in non-English does not significantly improve performance on non-English datasets and might in fact slightly harm it. However, this reduction in performance is not statistically significant.

You guys might find this interesting: https://arxiv.org/abs/2503.03592

Thanks a lot for sharing! I looked at the paper and am really surprised by the result. Their testing methodology looks clean and the results tell quite a clear story. This means our primary English imatrix dataset is much better for non-English models than we thought. I now regret having non-English models only queued for static quants.

@nicoboss I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)

Here's all errors (deduplicated), and they do all seem legit (and therefore I have nuked them):

/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
/llmjob/llama.cpp-cuda512/src/llama.cpp:8666: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed

I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?

regarding the paper, it's one of the results I expected (either it's no big deal, because a lot about imatrix training data seems irrelevant, or it has a big effect). But finally I can choose between these extremes!

I also feel much better about my training data now, which is pretty incoherent. But given that random tokens seem to work relatively fine, it would actually be surprising if it were so detrimental.

The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?

@tdh111 anyway, very much appreciated, I would have never seen this paper without you

I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)

I queued quite a lot of trending models, some of which turned out to be bad. Those errors are all legit and can be nuked.

I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?

Yes exactly. I will improve my trending model discovery scripts to filter out embeddings in the next version. I will also check if there is a way dry-run can detect this. The main issue is that this is a check that occurs during inference time inside llama_decode_impl and not while loading the model.

The last 2 failures you can nuke if you want.

https://huggingface.co/cl-nagoya/ruri-large-v2 likely requires manual GGUF conversion due to ModuleNotFoundError: No module named 'fugashi'

No idea why https://huggingface.co/google/flan-t5-xxl fails to download, but if the just-started redown fails I guess I will provide the GGUF manually there as well.

Edit: Nevermind cl-nagoya/ruri-large-v2 likely is an embedding as well so I nuked it as we don't care about them.
Edit2: I think redown fixed flan-t5-xxl so must have just been some random HuggingFace download error.
Edit3: No, flan-t5-xxl failed again: ValueError: Missing or incomplete model files: ['model-00001-of-00005.safetensors']

anyway, very much appreciated, I would have never seen this paper without you

Thanks a lot for sharing!

Glad you both liked it.

The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?

Both of those are separate from the paper I linked, but this paper is relevant to your first question: https://arxiv.org/abs/2503.05613 .

Your second question about IQ quants is best answered by ikawrakow, who would most likely answer if asked in a discussion post in ik_llama.cpp. I feel like I know the answer, but I'm not confident enough to give it because I would rather not spread potentially wrong information. Now that you ask, though, I'm curious if the same holds true for his new quant types (IQ_K), which at low bpw offer better performance than I-quants and at higher bpw offer better performance and quality compared to K-quants.

I will also check if there is a way dry-run can detect this.

Be careful - the more checks you add, or rather, move, the more you will diverge from future llama.cpp versions that might do things differently. There is a trade-off here, between catching more things and maybe blocking future roads.

some random HuggingFace download error.

Possible, but unlikely, as hfd retries pretty aggressively. When you open a (s) in audit, the download is printed (it's in MODEL/log, too). If it's a new model, a much more common failure mode is actually not yet uploaded files. For example, YOYO-AI loves to make elaborate model cards before actually uploading all files :/

I'm unexpectedly busy (and probably rather tired) for the next weeks, probably. I'll try to take care, but don't be alarmed if things get a bit more erratic.

also not caught:

llama_model_quantize: failed to quantize: key not found in model: llama.context_length

this is actually caught by quantize, so causes extra human work, but not extra computational work (it's caught during static jobs).

interesting that quantize even bothers...

and clearly, nice level 1200 is the junk class

How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or new imatrix? I unfortunately nuked the existing imatrix repo before hash-comparing them. I checked for Samantha-1.1-70b, which is basically the same case, and they were different, so I'm almost certain the imatrix for Samantha-1.11-70b got recomputed as well. It seems like cases where, after a nuke, the existing imatrix gets copied only happen if it was somewhat recently generated, but not for these 1-year-old cases of repositories where static quants never even existed. In the future I will obviously use nukeall so none of this will be an issue.

and clearly, nice level 1200 is the junk class

I noticed this as well. I nuked so many errors this morning when I woke up. We had almost entire hosts filled with errors.

also not caught:
llama_model_quantize: failed to quantize: key not found in model: llama.context_length

This does get caught using dry-run. Not sure why you think it does not. I even tested one of the models that had this error today to confirm:

llama_model_loader: mmap is not supported for dry-run so it is now disabled
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 12.55 GiB (16.00 BPW) 
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model

I'm unexpectedly busy (and probably rather tired) for the next weeks, probably. I'll try to take care, but don't be alarmed if things get a bit more erratic.

No problem. Now that you gave me all these amazing tools and I got familiar using them, I should be able to solve most of the issues myself, hopefully letting you focus as much on your job as possible. Just ignore things and only respond to what is important to save time. I'm doing the same when I'm busy. Feel free to ignore user requests and audits as I can handle them myself.

Interesting, so nico2 has still not turned itself off despite the repository creation issue being long gone and it being 3 hours after 17:00:

nico2    nice size (static/imatrix) -- jobs 3/4-12 maxm 300 free 1479 budget 769 uploads 0 hfd 0 32c
         1200    1 s  Qwen2.5-0.5B-SFT-merge                       blocked/frozen/timeofday repo create (interrupting)
         1200   15 s  mistral-openorca-openplatypus-alpacagpt4     blocked/admin/paused
         1200   17 s  WEPO-llama-3-8b                              blocked/admin/paused

Regarding Snowflake Arctic Instruct the source GGUF is under /apool/snowflake-arctic-instruct.gguf. We only want to regenerate the imatrix and the imatrix quants but keep the static quants. Before you add it you need to nukerepo https://huggingface.co/mradermacher/snowflake-arctic-instruct-i1-GGUF but only this one and NOT the static quants! You also need to archive the current imatrix quants WITHOUT using nukeall as we want to keep the static quants.

How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or new imatrix?

If you are lucky, from the mtime. Otherwise, if we have a repo for it, we have it cached. If we don't have a repo for it, we shouldn't have it cached. In this case, it's old, though:

-rw------- 1 root root 4.6M Dec 31 15:16 imatrix-remote/Samantha-1.11-7b.imatrix
-rw------- 1 root root 7.2M Dec 14 13:59 imatrix-remote/Samantha-1.11-13b.imatrix
-rw------- 1 root root 25M Mar 10 08:47 imatrix-remote/Samantha-1.11-70b.imatrix

I assume I should remove it? (done)

If you are lucky, from the mtime. Otherwise, if we have a repo for it, we have it cached. If we don't have a repo for it, we shouldn't have it cached.

Thanks a lot! Can you in this case please check for Samantha-1.1-70b as well and delete it if it is older than a few weeks? I have the feeling it has generated a new one despite https://huggingface.co/mradermacher/Samantha-1.1-70b-i1-GGUF existing, as the sha256 hash of the imatrix file I have locally doesn't match the one inside the HuggingFace repository.

@mradermacher The status page and telnet 10.28.1.1 16732 have already been frozen for an hour, but llmc audit still works without any issue, which is strange - shouldn't it break as well if someone doesn't release the lock? I plan on executing llmc killall9 should the issue still persist in half an hour.

Interesting, so nico2 has still not turned itself off despite the repository creation issue being long gone and it being 3 hours after 17:00:

repo creation is currently not interruptible, and no, the status shows it's frozen, so still ongoing.

there are two issues here: a) jobs should no longer be frozen on nico1 - if a job somehow isn't finished before 17:00, it should probably continue to run and b) if a job is still active, it will not shut down

i removed the cron rules causing job freezes. but repo creates would still keep it on indefinitely atm.

i also don't think snowflake will help us. I think we should push a few of the big 1400 jobs to prio 1200 or so per day. maybe. will have to see.

I couldn't queue this morning, this is probably the root cause.

-rw------- 1 root root 25M Mar 10 05:55 imatrix-remote/Samantha-1.1-70b.imatrix

moved away

if there already is a job i need to manually add the imatrix job

repo creation is currently not interruptible, and no, the status shows it's frozen, so still ongoing.
there are two issues here: a) jobs should no longer be frozen on nico1 - if a job somehow isn't finished before 17:00, it should probably continue to run and b) if a job is still active, it will not shut down
i removed the cron rules causing job freezes. but repo creates would still keep it on indefinitely atm.

Thanks a lot. No worries about the repo creation. It will eventually create it and shut down. The main reason it didn't was likely because of timeofday. The rate limit usually doesn't block a specific task for more than around an hour unless you get very unlucky and always lose the race to create a repo once a slot gets free.

I couldn't queue this morning, this is probably the root cause.

We did over 300 models today: 19 hours ago we had 1779 models in the queue and now there are 1492, not even considering all the ones we queued. It was crazy. I had to llmc audit sometimes even multiple times per hour as we went through so many models. We need to do a somewhat healthy mix of different-sized models so we don't end up having days where we only do small ones, or we will get rate limited. Next time I will queue some myself earlier.

I think we should push a few of the big 1400 jobs to prio 1200 or so per day. maybe. will have to see.

That sounds like a great idea for days where there are not many new big great models.

i also don't think snowflake will help us.

It will at least keep nico1 busy and it's one of the massive models we had to do anyways. I'm currently also closely following llama.cpp's decision on which MLA algorithm they settle on. Depending on which one they choose we may or may not need to requantize all the DeepSeek V2/V3/R1 models.

-rw------- 1 root root 25M Mar 10 05:55 imatrix-remote/Samantha-1.1-70b.imatrix
moved away

Thanks a lot!

if there already is a job i need to manually add the imatrix job

Now that I know the current imatrix was outdated, I will secure the source GGUF, use nukeall and queue them again. That should be the cleanest option and not require you to do anything.

I tried fixing the frozen status page using llmc killall9 but it timed out...

nico1 ~# llmc killall9
nico1
back
leia
nico2
rich1
kaos
marco
rain
Killed llmjob(2720126) with signal 9
Killed llmjob(1136699) with signal 9
Killed llmjob(1136722) with signal 9
Killed llmjob(1137378) with signal 9
Killed llmjob(296290) with signal 9
Killed llmjob(514440) with signal 9
Killed llmjob(3434878) with signal 9
Killed llmjob(661385) with signal 9
llmjob: no process found
Killed llmjob(2256273) with signal 9
nico2: Connection timed out

At least I now know which node is to blame. Guess time to turn on nico2 again to fix this.

With nico2 turned on, llmc killall9 terminated successfully within a second, but the status page is still frozen. This issue really seems quite different from how the frozen status page normally behaves. I turned off nico2 again as turning it on didn't help solve the issue.

Oh, maybe I shouldn't have used llmc killall9. When I now check llmc audit I see many entries like this - not entirely sure if related, but they are not something I've seen before:

ionice -c3 chrt -b 0 systemd-run --scope --unit=llmjob-wrap-omega1.3-static-2719966 -G -p MemoryMax=32G
/llmjob/share/bin/quantize: line 295: 2720126 Killed                  llmjob hf-ensure-repo "$DST"
job finished, status 137
job-done<0 omega1.3 static 137>
https://huggingface.co/nicogptai/omega1.3

back from working...

yes, i just had the same experience. i have no clue what causes these deadlocks (basically, the master takes a global lock before connecting to workers, and each worker takes its own local lock, in random order. one would think the relatively new "upcalls" (via llmc) might be an issue, but i don't see a path where llmjob does a blocking llmc call - the only llmc call llmjob does is "push", which does not block if the lock is held. shucks).

killall -9 llmjob is no longer a crude but effective method, because llmjob has become a toolbox for lots of things, rather than only the scheduler itself, so killing it kills lots of other stuff and fails jobs. it's relatively simple to clean up for me, so if it means some other job will start instead, do it. the fix is to fix the deadlock problem...

@nicoboss so, i thought, great opportunity to do the snowflake imatrix quant. i can mlock the Q8_0 (509G) without issue, but llama-imatrix is "Killed" (so probably the oom killer), even with literally nothing else running.

@nicoboss so, i thought, great opportunity to do the snowflake imatrix quant. i can mlock the Q8_0 (509G) without issue, but llama-imatrix is "Killed" (so probably the oom killer), even with literally nothing else running.

It indeed got OOM killed despite nothing else running. I was aware you wanted to run the Snowflake Arctic imatrix so I turned off all services on StormPeak. The only thing I forgot was reducing the ZFS ARC cache from 24 GB to less, but the last time we did snowflake arctic base this wasn't required. Here's the kernel log of the OOM event:

Mar 14 02:28:22 StormPeak kernel: llama-imatrix invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=800
Mem-Info:
active_anon:221810 inactive_anon:462979 isolated_anon:0
active_file:1412 inactive_file:3913 isolated_file:0
unevictable:124205714 dirty:195 writeback:108
slab_reclaimable:389941 slab_unreclaimable:342434
mapped:124210748 shmem:28017 pagetables:385960
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:873272 free_pcp:132 free_cma:0
Node 0 active_anon:1733588kB inactive_anon:1005568kB active_file:5240kB inactive_file:14472kB unevictable:496822856kB isolated(anon):0kB>
Node 0 DMA free:11264kB boost:0kB min:0kB low:12kB high:24kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB i>
lowmem_reserve[]: 0 1432 515181 515181 515181
Node 0 DMA32 free:1520228kB boost:0kB min:252kB low:1716kB high:3180kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_>
lowmem_reserve[]: 0 0 513749 513749 513749
Node 0 Normal free:1958780kB boost:0kB min:91612kB low:617688kB high:1143764kB reserved_highatomic:1867776KB active_anon:1873616kB inact>
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Node 0 DMA32: 9*4kB (UM) 12*8kB (UM) 10*16kB (UM) 8*32kB (UM) 9*64kB (UM) 8*128kB (UM) 8*256kB (UM) 11*512kB (UM) 11*1024kB (UM) 12*2048>
Node 0 Normal: 1292*4kB (UME) 11246*8kB (UME) 14994*16kB (ME) 5833*32kB (ME) 2351*64kB (UME) 264*128kB (UME) 114*256kB (UM) 80*512kB (UM>
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
124203420 total pagecache pages
0 pages in swap cache
Free swap  = 0kB
Total swap = 0kB
134086427 pages RAM
0 pages HighMem/MovableOnly
2179266 pages reserved
0 pages hwpoisoned
Mar 14 02:28:26 StormPeak kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=ns,mems_allowed=0,global_oom,task_memcg=/lxc/108/ns/user.slice/user-0.slice/session-3129.scope,task=llama-imatrix,pid=2313118,uid=100000
Mar 14 02:28:26 StormPeak kernel: Out of memory: Killed process 2313118 (llama-imatrix) total-vm:502778504kB, anon-rss:77292kB, file-rss:280131584kB, shmem-rss:8192kB, UID:100000 pgtables:548784kB oom_score_adj:800
Mar 14 02:28:30 StormPeak kernel: oom_reaper: reaped process 2313118 (llama-imatrix), now anon-rss:0kB, file-rss:0kB, shmem-rss:72kB

I now reduced the ZFS ARC Cache from 24 GB to 1 GB. If this is still not enough, please offload layers to both RTX 4090 GPUs and it will fit for sure. StormPeak is now ready for you to be used for Snowflake Arctic imatrix computation.

I now joined the waitlist for HuggingFace Xet. Xet is their next generation storage solution replacing S3/Git LFS. If my personal account gets accepted I will let you know if it is any good. You could join using https://huggingface.co/join/xet but I recommend to wait. Xet probably lifts the 50 GB limit so no more splitting/merging required. For our dry-run-all-GGUFs project Xet would be far superior to S3, as unlike S3, Xet is a block storage, so you likely only need to download a single block per model.

grafik.png

@mradermacher I tried for the first time to manually run a massive imatrix job and everything seemed well, but something is blocking it. Maybe because I paused the host to prevent other tasks from running, as I had no clue how else to put a host into that mode - llmc help had no command for it.

Edit: No, it also got stuck at this exact location when nico1 wasn't paused at all.

grafik.png

Also, all those commands seem to be broken despite the host no longer being paused:

nico1 /tmp# llmc disable llmjob.nico1
disable.llmjob.nico1+: fail
nico1 /tmp# llmc disable imatrix.nico1
disable.imatrix.nico1+: fail
nico1 /tmp# llmc pause llmjob.nico1
pause.llmjob.nico1+: fail
nico1 /tmp# llmc pause imatrix.nico1
pause.imatrix.nico1+: fail

And resuming the GPUs I paused while playing around also seems to no longer be possible, despite the host having been unpaused for a while:

nico1 ~# llmc resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc
pause.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-: fail
nico1 ~# llmc resume GPU-188a5143-db69-7058-63b5-f2f1d2354f91
pause.GPU-188a5143-db69-7058-63b5-f2f1d2354f91-: fail
nico1 /tmp# llmc enable GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc
disable.GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc-: fail
nico1 /tmp# llmc enable GPU-188a5143-db69-7058-63b5-f2f1d2354f91
disable.GPU-188a5143-db69-7058-63b5-f2f1d2354f91-: fail

I noticed that llmc help is missing the imatrix FLAG files
/tmp/pause to pause the imatrix tasks (which is still used despite the ability to pause GPUs because of legacy scripts and it being super reliable)
/tmp/imatrix.force to ignore "larger than 480GB" imatrix limit
/tmp/max-ngl to set the maximum number of layers allowed to be offloaded to the GPU

I now returned everything to as close to normal as I could. Quantization jobs are running again on nico1 and one of the 70B imatrix jobs is running despite both GPUs being paused, as I used the /tmp/pause flag to pause it before pausing the GPUs. The other imatrix jobs will unfortunately be blocked as I had no idea llmc enable GPU-* would be broken.

It would be great if you could tell me what I did wrong and/or start the snowflake-arctic-instruct imatrix computation on your own once you are available. How did you make sure only one imatrix task is running? The only thing I could think of would be to pause llmjob.nico1 and one of the GPUs, which should guarantee only one imatrix task running. Don't worry about the imatrix queue being so long. This is mainly because nico1 somehow decided to eat all the priority 40 tasks due to me overriding everything as llmc pause llmjob.nico1 was broken.

Wow, strange - it seems to still be doing imatrix jobs, just with one GPU, despite both being blocked. Cool I guess, as I wanted to unpause them, but super confusing that it does this.

How is this an error now?

400   17 s  ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b error/255 repo create

If I llmc audit I see this:

HfHubHTTPError("500 Server Error: Internal Server Error for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67d49fb3-377305ff77f775b842cdcecc;fd588126-eff5-4fcf-8792-e655e5a2affc)\n\nInternal Error - We're working hard to fix this as soon as possible!") at /llmjob/share/bin/llmjob line 2715.
        ...propagated at /llmjob/share/bin/llmjob line 2718.
job finished, status 255
job-done<0 ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b static 255>

https://huggingface.co/shisa-ai/ablation-65-a55.simpo.armorm-shisa-v2-llama-3.1-8b

This must be different from the repository creation rate limit I assume, as the rate limit usually never errors. I selected "retry" for now.

if there already is a job i need to manually add the imatrix job

You will unfortunately need to do so for Samantha-1.11-70b and Samantha-1.1-70b, or tell me how to manually trigger an imatrix job if the scheduler thinks an imatrix already exists and so doesn't do one by itself, as by the time the model was queued we had not yet archived the old imatrix.

Wow, strange - it seems to still be doing imatrix jobs, just with one GPU, despite both being blocked.

You specified the wrong gpu uuid, so only one is blocked. You should be able to block all using llmc pause imatrix.nico2.

I did that now, unpause with llmc resume imatrix.nico2

500 Server Error: Internal Server Error for url

yes, repo create just endlessly retries on being rate limited. this is simply hf suckiness, it happens on any request, and not all of them are retryable.

I noticed that llmc help is missing the imatrix FLAG files

There aren't any that you can access, they are all on the host that runs the imatrix jobs (kaos). And they are: .imatrix-hfd (gguf is valid) and .soverride (block this job). everything else would be in the json job description (e.g. which quant, where to download, and "force", which special-cases quite a few things, even the quantiser).

Xet probably lifts the 50 GB

I hope this will be transparent when using the hub api?

(flags)

Hmm, I reworked the flags stuff a few days ago, probably something is broken. One issue is that at least one of your uuids was not a uuid, but that's not checked by anything - it would simply mean you disabled a card that doesn't even exist, explaining those problems.

I am rather busy atm., but I will look at it later. Skimming through this, your intention was to enable everything again, so I will do that.

gpu resuming was broken, the other flags should work

snowflake instruct didn't exhibit partially covered tensors either. peculiar:


[30]4.5118,[31]4.4902,[32]4.2987,[33]4.1352,[34]4.0640,[35]4.1097,[36]4.0953,[37]3.9571,[38]3.8514,[39]3.8286,
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '              blk.0.ffn_gate_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry '                blk.0.ffn_up_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[40]3.8027,[41]3.7899,[42]3.7907,[43]3.7528,[44]3.7568,[45]3.7423,[46]3.7421,[47]3.7649,[48]3.7736,[49]3.8393,
save_imatrix: entry '              blk.0.ffn_down_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
save_imatrix: entry '              blk.0.ffn_gate_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
save_imatrix: entry '                blk.0.ffn_up_exps.weight' has partial data (97.66%)
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware
[50]3.8619,[51]3.7652,[52]3.6756,[53]3.5995,[54]3.5232,[55]3.4544,[56]3.4564,[57]3.4428,[58]3.4881,[59]3.5413,[60]3.6089,[61]3.5819,[62]3.6202,[63]3.6591,[64]3.6948,[65]3.7287,[66]3.7580,[67]3.8092,[68]3.8528,[69]3.8791,[70]3.9078,[71]3.9304,[72]3.9267,[73]3.9117,[74]3.8934,[75]3.9132,[76]3.9207,[77]3.9402,[78]3.9272,[79]3.9366,[80]3.9516,[81]3.9433,[82]3.9516,[83]3.9424,[84]3.9542,[85]3.9600,[86]3.9625,[87]3.9679,[88]3.9827,[89]3.9785,[90]3.9810,[91]3.9932,[92]3.9877,[93]3.9786,[94]3.9769,[95]3.9490,[96]3.9652,[97]3.9638,[98]3.9659,[99]3.9510,[100]3.9496,[101]3.9704,[102]3.9540,[103]3.9469,[104]3.9432,[105]3.9589,[106]3.9717,[107]3.9966,[108]4.0193,[109]4.0061,[110]3.9952,[111]3.9861,[112]3.9732,[113]3.9614,[114]3.9478,[115]3.9381,[116]3.9272,[117]3.9226,[118]3.9405,[119]3.9552,[120]3.9894,[121]4.0224,[122]4.0629,[123]4.0951,[124]4.1483,[125]4.1925,[126]4.2084,[127]4.2208,[128]4.1951,[129]4.2042,[130]4.2001,[131]4.1913,[132]4.1599,[133]4.1229,[134]4.1424,[135]4.1528,[136]4.1617,[137]4.1609,[138]4.1752,[139]4.1914,[140]4.2076,[141]4.2148,[142]4.2255,[143]4.2306,[144]4.2246,[145]4.2281,[146]4.1897,[147]4.1507,[148]4.1279,[149]4.0930,[150]4.0629,[151]4.0319,[152]4.0548,[153]4.0668,[154]4.1009,[155]4.1359,[156]4.1768,[157]4.2180,[158]4.2545,[159]4.2916,[160]4.3233,[161]4.3644,[162]4.4004,[163]4.4289,[164]4.4621,[165]4.4962,[166]4.5275,[167]4.5578,[168]4.5864,[169]4.6174,[170]4.6465,[171]4.6776,[172]4.7161,[173]4.7472,[174]4.7761,[175]4.8207,[176]4.8486,[177]4.8822,[178]4.9031,[179]4.9323,[180]4.9580,[181]4.9898,[182]5.0146,[183]5.0482,[184]5.0830,[185]5.1043,[186]5.1348,[187]5.1531,[188]5.1795,[189]5.2056,[190]5.2293,[191]5.2568,[192]5.2935,[193]5.3223,[194]5.3406,[195]5.3595,[196]5.3979,[197]5.4154,[198]5.4360,[199]5.4551,[200]5.4766,[201]5.5009,[202]5.5214,[203]5.5368,[204]5.5569,[205]5.5791,[206]5.6068,[207]5.6288,[208]5.6491,[209]5.6769,[210]5.7026,[211]5.7270,[212]5.7459,[213]5.7706,[214]5.7825,[215]5.8032,[216]5.8271,[217]5.8449,[218]5.8689,[219]5.8854,[220]5.9095,[221]5.9244,[222]5.9341,[223]5.9554,[224]5.9779,[225]5.9978,[226]6.0189,[227]6.0359,[228]6.0500,[229]6.0720,[230]6.0962,[231]6.1168,[232]6.1403,[233]6.1576,[234]6.1815,[235]6.2036,[236]6.2261,[237]6.2379,[238]6.2595

Could it be that the patch simply makes tensors valid, so as soon as they are "stored", they no longer count as partial from then on? I haven't looked at the patch, but maybe it would fill partial weights with dummy weights, so on the next round, they would no longer count as partial? Might not be a disastrous thing, but probably the patch shouldn't permanently change weights, because that would slightly change the results for the next rounds - maybe it should modify and save a copy.

save_imatrix: 14 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[20]4.3242,[21]4.4543,[22]4.3112,[23]4.1885,[24]4.1851,[25]4.1865,[26]4.1966,[27]4.1751,[28]4.2599,[29]4.3854,
save_imatrix: 7 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[30]4.5118,[31]4.4902,[32]4.2987,[33]4.1352,[34]4.0640,[35]4.1097,[36]4.0953,[37]3.9571,[38]3.8514,[39]3.8286,
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[40]3.8027,[41]3.7899,[42]3.7907,[43]3.7528,[44]3.7568,[45]3.7423,[46]3.7421,[47]3.7649,[48]3.7736,[49]3.8393,
save_imatrix: 3 out of 128 experts are missing data
save_imatrix: 3 out of 128 experts are missing data - storing but be aware

Thinking about this, we should definitely investigate, as this will probably affect most MoEs and has a good chance of negatively affecting them, unless the patched weight values are essentially being ignored (I have no clue how the weights are combined between chunks).

We reached the repository creation limit again, so I started to reprioritize and statically assigned some of the, in my opinion, important 1400 priority medium-sized models to rain and leia and some big ones to rich1. I'm now strictly controlling what models nico1 and nico2 are working on to ensure they only do big ones. I kept back, kaos and marco untouched as they are not currently working on background models. Don't be confused by me abusing the pause host functionality on nico1 and nico2. I realized that I can pause a host and then delete the interrupt files for it to work on specific models without scheduling any new ones. The reason I have to pause the entire host is because the following commands still do NOT work:

nico1 ~# llmc pause imatrix.nico1
pause.imatrix.nico1+: fail
nico1 ~# llmc pause llmjob.nico1
pause.llmjob.nico1+: fail

so I started to reprioritize and statically assigned

That sucks, because I also did this this morning, so us having to do it twice is not a good sign :)

work on specific models without scheduling

If that works for you, that's great. You can also manually start models by setting the .force flag (e.g. via llmc shell) and push'ing. They will immediately be interrupted by ready models higher in the queue, but those can be overridden. I envisage that's useful if some models are stuck in repo create.

On the other hand, how does that work at all? In my experience, the creation limit is something that hits during the day, then stays with us until 2-3am, with some limited softness (i.e. one can get a request through sometimes).

following commands still do NOT work:

Eh, fascinating how many mistakes you can put into a few regexes. And I thought I tested those.

following commands still do NOT work:

I kind of refactored it once more. It's even better than before. Looks fine from my point of view, but I didn't have the time to test everything.

so, gemma3 also has a vision part, and llama already has an extractor for it?

since i am super swamped, wanna give it a try at helping me? the qwen2vl extractor probably is a good base, and it's in "quantize" (you can search for Qwen2VLForConditionalGeneration)

as can be seen from the code, it's not exactly straightforward, mostly due to hard to predict output naming. the way I solved it was by creating an empty temporary directory, and assuming only a single file will be created, which will then be renamed to a known name.

I feel the newest dryrun has issues, or maybe we just ran into this class of problem:

load_tensors: loading model tensors, this can take a while... (mmap = false)
[DRYRUN][CPU]: 6425850112
alloc_tensor_range: failed to allocate CPU buffer of size 6425850112
load_tensors: pretend allocating CPU buffer was successful due to dry-run being enabled
...
[DRYRUN][CPU]: 513024
output_reserve: failed to allocate output buffer of size 0.49 MiB
llama_init_from_model: failed to initialize the context: failed to reserve initial output buffer
common_init_from_params: Dryrun compleated!
dryrun failed

I went back to the grep method, but it seems dryrun testing is completely broken at the moment.

I feel the newest dryrun has issues, or maybe we just ran into this class of problem:

What issues are you experiencing and with what model do they occur? The output you posted looks all great except for you wrongly detecting it as failed. Is it not returning status code 0 even if it is successful, or why is your code still detecting this as a dryrun failure? I don't see how it is possible that it doesn't exit with code 0 if you see the "Dryrun compleated!" message above, as the code in common.cpp is the following - shouldn't exit(0) immediately terminate the application with exit code 0? Are you sure your exit code check is implemented correctly?

if(getenv("DRYRUN")) {
    LOG_ERR("%s: Dryrun compleated!\n", __func__);
    exit(0);
}

Just to make sure, I tested the latest version myself and as expected the exit codes printed using echo $? are correct. Now I'm even more confused about what issues you are talking about. Maybe you just got confused by the expected errors such as failed to allocate and failed to initialize, which are intentional. The only issue I found was this embarrassing typo in the "Dryrun compleated!" message.

Working model:

[DRYRUN][PINNED]: 122088
output_reserve: failed to allocate output buffer of size 0.12 MiB
llama_init_from_model: failed to initialize the context: failed to reserve initial output buffer
common_init_from_params: Dryrun compleated!
0

Broken model:

llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model
1

so, gemma3 also has a vision part, and llama already has an extractor for it?

The gemma3 vision extraction is really simple. You just execute examples/llava/gemma3_convert_encoder_to_gguf.py to get the mmproj file as usual. By default the mmproj will be stored under mmproj.gguf, but the --outfile command line argument can be used to specify whatever name you like. Using --outtype you can specify if you want the mmproj as f32, f16, bf16 or q8_0. If you don't specify anything the mmproj will be in f16. Then you just specify the path of your gemma3 model and done. Should you encounter any issues you can use --verbose to see exactly what it is doing.

since i am super swamped, wanna give it a try at helping me? the qwen2vl extractor probably is a good base, and it's in "quantize" (you can search for Qwen2VLForConditionalGeneration)

I just looked at the quantize script and the only thing you have to change is likely:

python3 "/llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py" --data_type fp16 -- ../"$SRC" || exit 33

to

python3 "/llmjob/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py" --outtype f16 -- ../"$SRC" || exit 33

as can be seen from the code, it's not exactly straightforward, mostly due to hard to predict output naming. the way I solved it was by creating an empty temporary directory, and assuming only a single file will be created, which will then be renamed to a known name.

You could heavily simplify it for gemma3 by just using --outfile to specify whatever output file you want. This unfortunately doesn't seem to be possible for qwen2vl, so you either use the auto outfile detection for both of them or use dedicated code for gemma3, in which case you can remove it and instead just use --outfile.
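
For reference, a rough Python sketch of that temp-directory-and-rename approach, assuming the surgery script writes exactly one .gguf into its working directory (the script path and flags are taken from the quantize snippet above):

import os
import shutil
import subprocess
import tempfile

def extract_mmproj(src_dir: str, out_path: str) -> None:
    # Run the extractor in an empty temporary directory so the single file it
    # creates can be found without knowing its name in advance.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ["python3", "/llmjob/llama.cpp/examples/llava/qwen2_vl_surgery.py",
             "--data_type", "fp16", "--", src_dir],
            cwd=tmp, check=True,
        )
        produced = os.listdir(tmp)
        assert len(produced) == 1, f"expected exactly one output file, got {produced}"
        shutil.move(os.path.join(tmp, produced[0]), out_path)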

i'll have a look at dryrun later - most likely I made a mistake in a hurry.

What's your take on f32 and Q8_0 quants of vision parts? Q8_0 seems attractive to have, and I made sure our naming convention supports that. f32, not so much.

What's your take on f32 and Q8_0 quants of vision parts? Q8_0 seems attractive to have, and I made sure our naming convention supports that. f32, not so much.

It would be awesome if we could offer different mmproj quants as well. qwen2vl supports fp32 and fp16 as --data_type argument while gemma3 supports f32, f16, bf16 and q8_0 as --outtype argument. We should at least offer f16 and q8_0 and maybe even f32 for gemma3.
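
A hedged sketch of how the gemma3 variants could be produced, using only the --outtype and --outfile flags mentioned above; the output naming is just an example, not our actual convention:

import subprocess

def make_gemma3_mmproj_variants(src_dir: str, base: str, outtypes=("f16", "q8_0")):
    # One converter run per requested precision; add "f32" to outtypes if wanted.
    for outtype in outtypes:
        subprocess.run(
            ["python3",
             "/llmjob/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py",
             "--outtype", outtype,
             "--outfile", f"{base}.mmproj-{outtype}.gguf",
             src_dir],
            check=True,
        )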

ok, not sure when i can go about that (gemma3). in the meantime, here is a diff I have been using for many months now, for use in the mradermacher branch

diff --git a/gguf-py/gguf/gguf_writer.py b/gguf-py/gguf/gguf_writer.py
index 080d2b9d..d3cbe44f 100644
--- a/gguf-py/gguf/gguf_writer.py
+++ b/gguf-py/gguf/gguf_writer.py
@@ -237,6 +237,10 @@ class GGUFWriter:
             kv_bytes = bytearray()
 
             for key, val in kv_data.items():
+                if val.type != GGUFValueType.ARRAY or len (val.value) < 50:
+                    print("gguf serialising key ", key, "value", val)
+                else:
+                    print("gguf serialising key ", key, "value-suppressed")
                 kv_bytes += self._pack_val(key, GGUFValueType.STRING, add_vtype=False)
                 kv_bytes += self._pack_val(val.value, val.type, add_vtype=True)
 
@@ -269,8 +273,8 @@ class GGUFWriter:
         self.state = WriterState.TI_DATA
 
     def add_key_value(self, key: str, val: Any, vtype: GGUFValueType) -> None:
-        if any(key in kv_data for kv_data in self.kv_data):
-            raise ValueError(f'Duplicated key name {key!r}')
+        #if any(key in kv_data for kv_data in self.kv_data):
+        #    raise ValueError(f'Duplicated key name {key!r}')
 
         self.kv_data[0][key] = GGUFValue(value=val, type=vtype)
 

in the meantime, here is a diff I have been using for many months now, for use in the mradermacher branch

Thanks for sharing. I fixed the typo in the "Dryrun completed!" message, applied your diff and updated to the latest llama.cpp despite there not being any changes relevant to us. There is no reason for you to update again unless you did not manually apply your diff the last time you updated.

i made a typo when restoring the old dryrun code. it seems to work now :-)

i also removed the -ofreq 10, assuming this deals with any problems with wrong imatrix weights. that means no feedback for us, but it's rarely useful anyway.

why, oh why, did i follow llama naming conventions and call it fp16 instead of f16 (qwen2vl)

I noticed that for my latest medical models I have quite a few imatrix errors like this:

nico1 /tmp# grep -r "GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed" *.log
Gemma2-2b-IT-FT-medical_qa.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-2b-it-finetuned-medical-qa.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-ad-2b.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-ja.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
gemma-medical_qa-Finetune-v2.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
medical_jargons_simplifier2.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical-mT5-large.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical-mT5-xl-multitask.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
medical_q_a_model.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical_Report_Summarization.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
Medical_Summarization.log:/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed

They don't appear in llmc audit. How to deal with them? nuke or nukeall?

neither, the imatrix ones i have to deal with. queue fewer junk models? :-)
(do these actually work with transformers?)

well, actually nukeall does work in this case

What should I do about this one? nuke and force requeue explicitly to nico1? I think this should work as it should auto-skip already existing quants.

Running scope as unit: llmjob-wrap-gemma-3-4b-persian-v0-noquant-6698.scope
{[[PROGRESS:preparing...]]}
{[[PROGRESS:mmproj extraction]]}
mmproj extraction attempted on unsupported host
job finished, status 72
job-done<0 gemma-3-4b-persian-v0 noquant 72>

https://huggingface.co/mshojaei77/gemma-3-4b-persian-v0

neither, the imatrix ones i have to deal with. queue fewer junk models? :-)

Most of them are not junk but I unfortunately don't have time to test every single one of them before queueing. Many medical finetunes lack a proper model card, which makes judging their quality without actually testing the model almost impossible. We could say no model card means trash, but this doesn't always seem to be true, as some authors are just lazy and I have already seen multiple good models without a model card.

do these actually work with transformers?

I just tested Gemma2-2b-IT-FT-medical_qa using transformers and it worked. But no worries, the model kind of sucks as it wants you to ask questions formatted exactly like in the medical QA dataset and is so heavily censored that it refuses to answer the majority of them. It seems so stupid to create a medical finetune that refuses to answer medical questions. But it also seems stupid to not write a model card.
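
For context, the test was nothing fancy - roughly this kind of minimal transformers smoke test; the local path and prompt are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/bpool/Gemma2-2b-IT-FT-medical_qa"  # placeholder: local copy of the finetune
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")

prompt = "What are the symptoms of iron deficiency?"  # placeholder question
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))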

well, actually nukeall does work in this case

Great, I will nukeall them myself in the future. I will also try to find a way to recognize and filter such failures before even queueing them. With the latest changes to my script the failure rate has already been reduced a lot compared to earlier versions.

What does (worker +cork) mean? I noticed that you queued all of today’s lownice models using that flag.

Edit: Ah interesting that flag is gone now.

I merged the latest llama.cpp into the mradermacher branch adding support for the RWKV v7 architecture and fixing the tensor shape issue of OLMo-2-0325-32B-Instruct (tensor 'blk.0.attn_k_norm.weight' has wrong shape; expected 5120, got 1024)

I highly recommend updating as otherwise all RWKV v7/RWKV v7 Distilled based and many OLMo-2 based models will fail. Once you have updated, please queue the following models:

RWKV v7 Base models (RWKV7ForCausalLM):

RWKV v7 Distilled models (RwkvHybridForCausalLM):

Force requant failed OLMo-2 models (Olmo2ForCausalLM):

(worker +cork)

Sorry, just experimenting - I wanted to queue everything first, so I set an impossible worker name to be changed when I am happy with the queue.

llama.cpp is updated, could you do me a favour and queue the models, maybe a test model first?

llama.cpp is updated, could you do me a favor and queue the models, maybe a test model first?

Thanks a lot! Will do.

Sorry, just experimenting - I wanted to queue everything first, so I set an impossible worker name to be changed when I am happy with the queue.

Ah I see. Now it makes sense. No problem I was just a bit confused at first.

@mradermacher Half an hour ago llama.cpp added support for Mistral3ForConditionalGeneration. Luckily it is a convert_hf_to_gguf.py change only so I was able to manually provide the GGUF and use our existing llama.cpp version for imatrix computation and quantization. I recommend you again upgrade to the latest version of the mradermacher branch, so this no longer requires manual intervention. We could also hold back Mistral3ForConditionalGeneration based models until the vision extraction for it is implemented but I would expect this to take days if not weeks for them to implement so waiting is likely not a feasible option.

updated - but please keep a list of the models you queued so far, so we can re-run these models. new "add"s should automatically log these ("Mistral3ForConditionalGeneration, logging.")

i tried some of the rwkv 7 models that showed up in my list today (e.g. RWKV7-Goose-Pile-168M-HF), but... any idea?

  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5384, in <module>
    main()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 5378, in main
    model_instance.write()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 440, in write
    self.prepare_metadata(vocab_only=False)
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 433, in prepare_metadata
    self.set_vocab()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 3598, in set_vocab
    self._set_vocab_rwkv_world()
  File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 915, in _set_vocab_rwkv_world
    assert (self.dir_model / "rwkv_vocab_v20230424.txt").is_file()
AssertionError

updated

Thanks a lot for the quick update! :D

please keep a list of the models you queued so far, so we can re-run these models. new "add"s should automatically log these ("Mistral3ForConditionalGeneration, logging.")

The only one I manually converted so far was https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503

i tried some of the rwkv 7 models that showed up in my list today (e.g. RWKV7-Goose-Pile-168M-HF), but... any idea?

All RWKV v7 based models are supposed to have a file named rwkv_vocab_v20230424.txt, as can be seen under any RWKV v7 base model, like https://huggingface.co/fla-hub/rwkv7-191M-world/raw/main/rwkv_vocab_v20230424.txt in the case of fla-hub/rwkv7-191M-world. Your RWKV7-Goose-Pile-168M-HF model is missing this file, likely because it got converted from the RWKV v7 format into a HuggingFace transformers compatible model, as can be seen from the model’s name. We could try just copying that file into the same folder as the model, but I'm not sure if this would work. By the way, fun fact: that file used to allow arbitrary code execution in an earlier, luckily rejected, convert_hf_to_gguf.py implementation by parsing the file using eval(line[line.index(' '):line.rindex(' ')]). ARWKV-7B-Preview-0.1 using RwkvHybridForCausalLM, which you queued, worked fine.
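
If we want to try that copy experiment, a minimal sketch could look like this, with the target directory being a placeholder for wherever the downloaded model lives:

import shutil
from huggingface_hub import hf_hub_download

# Fetch the vocab file from a base repo that is known to ship it...
vocab = hf_hub_download("fla-hub/rwkv7-191M-world", "rwkv_vocab_v20230424.txt")
# ...and drop it next to the -HF converted model before rerunning convert_hf_to_gguf.py.
shutil.copy(vocab, "/bpool/RWKV7-Goose-Pile-168M-HF/rwkv_vocab_v20230424.txt")  # placeholder path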

By the way, fun fact: that file used to allow arbitrary code execution in an earlier, luckily rejected

I was under the impression that convert...py always allows arbitrary code execution - for example, in the glm case, I regularly have to patch .py files inside the repo to make it work, which proves that the files get executed. One way is enough...

That is what prompted me to introduce safe-exec btw., because I was also under the impression that it would not execute files from the repo by default. We did have a little chat about that, too, I think.

All RWKV v7 based models are supposed

I guess we can then just skip those, as they are likely (in theory at least) identical to the non-hf version. Problems will arise if these become more popular (as they are by "RWKV")

I was under the impression that convert...py always allows arbitrary code execution - for example, in the glm case, I regularly have to patch .py files inside the repo to make it work, which proves that the files get executed. One way is enough...

It does for some models that are using a custom loader, but there it is quite obvious that the custom loader gets executed to load the model so someone that doesn't mass convert thousands of models would likely take a short look at it before converting to GGUF. Allowing arbitrary code execution to parse a massive text file, on the other hand, is definitely not something any user could ever expect. It is also about the dumbest way to implement a text file parser.

As long as convert_hf_to_gguf.py supports loading any models that are not in safetensors you can easily make it execute arbitrary code anyways. Someone with malicious intent would likely choose to infect the actual model and not the python file that loads it, as that one is easily replaceable, and actually doing so in a stealthy way would be genius, as the automated malware scanner only scans models as far as I'm aware. I'm positively surprised malicious AI models are not a common issue. As far as I'm aware, not a single AI model has tried to infect our infrastructure so far.

That is what prompted me to introduce safe-exec btw., because I was also under the impression that it would not execute files from the repo by default. We did have a little chat about that, too, I think.

We did. Enabling that for sure was a great decision. It would be really annoying to have our infrastructure infected by some random malware. We are at like the highest risk possible of this happening to us as we process thousands of models from often untrustworthy sources shortly after their release, and so before HuggingFace could take them down based on their malware scanner's results. But no worries, as long as nobody burns a Linux kernel exploit or, more likely, a Nvidia driver exploit on me, nothing will get out of my LXC container. I’m always closely monitoring the LXC container so I would probably almost immediately spot any malicious process running inside of it.

I guess we can then just skip those, as they are likely (in theory at least) identical to the non-hf version. Problems will arise if these become more popular (as they are by "RWKV")

No need to do them, but it could indeed become an issue if users start finetuning them instead of the ones in the original RWKV v7 format. But don't worry, if it becomes an issue, we can for sure do something to convert them.

It does for some models that are using a custom loader

If it does it for some, it does it for all - the model type is parsed from the files as well.

it is quite obvious that the custom loader gets executed to load the model so someone that doesn't mass convert thousands of models would likely take a short look at it before converting to GGUF.

I think the opposite is the case. You assume everybody using transformers (or llama.cpp) somehow is an expert. I would assume most people would blindly trust it.

As long as convert_hf_to_gguf.py supports loading any models that are not in safetensors you can easily make it execute arbitrary code anyways.

How so? The only alternative would be pytorch, and I don't think that executes code anymore.

automated malware scanner only scans models

As far as I am aware, automated malware scanners don't really exist. They either check outdated signatures, or pretend to check for behaviour and completely fail. Case in point, the hf malware repo scanner... :)

Anyway, I think the deeper issue is that transformers code is written by people who don't understand basic security or even safety practice, so running everything in a jail is the way to go :)

We are at like the highest risk possible of this happening to us as we process thousands of models from often untrustworthy sources

We are also one of the biggest targets for attacks, especially if something can be done with the generated files.

I’m always closely monitoring the LXC container so I would probably almost immediately spot any malicious process running inside of it.

Pride goes before the fall.

[rwkv]

No need to do them, but it could indeed become an issue if users start finetuning them instead of the ones in the original RWKV v7 format. But don't worry, if it becomes an issue

There were also two fla-hub non-"hf" "-pile" models with the same issue.

How so? The only alternative would be pytorch, and I don't think that executes code anymore.

What makes you think that PyTorch would no longer be vulnerable to arbitrary code execution? As long as you unpickle you allow arbitrary code to run. The legacy file formats are insecure by design. I don't see any way one could ever load them in a secure way. Even the poor machine that converts legacy models to safetensors, and so loads them, will inevitably execute whatever arbitrary code is in the non-safe model, but at least the resulting SafeTensor will not contain any arbitrary code.

I found this nice article from December 2024 that shows how to backdoor AI models - doing so is surprisingly simple and it really is kind of a miracle no bad actors seem to make use of it: https://snyk.io/articles/python-pickle-poisoning-and-backdooring-pth-files/
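
To illustrate the point for anyone reading along, this toy example is all it takes - any pickled object can smuggle a callable via __reduce__, so unpickling an untrusted .pt/.pkl file is arbitrary code execution by design:

import pickle

class Payload:
    def __reduce__(self):
        # On unpickling, the "model file" runs os.system("echo pwned") on the loading machine.
        import os
        return (os.system, ("echo pwned",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # prints "pwned" - no model weights needed

Newer torch versions at least offer torch.load(..., weights_only=True) with a restricted unpickler, but anything loading legacy checkpoints the normal way runs exactly this kind of payload.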

We are also one of the biggest targets for attacks, especially if something can be done with the generated files.

Well, similar to SafeTensors, GGUFs are secure unless they exploit a security vulnerability inside llama.cpp, of which every few months another one gets responsibly disclosed under https://github.com/ggml-org/llama.cpp/security

I'm mainly concerned about someone stealing our HuggingFace token to nuke our models or using it to push some garbage. Hopefully HuggingFace has a repository delete rate limit. Maybe we should also rate limit the nukerepo/nukeall commands just in case. Having a malicious insider should also be a concern given how much access all our nodes currently have.
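
A rate limit on the destructive commands could be as simple as a rolling-window counter. This is just a hypothetical sketch; the class name and the budget of 20 deletions per hour are made up:

import time
from collections import deque

class NukeRateLimiter:
    def __init__(self, max_per_hour: int = 20):
        self.max_per_hour = max_per_hour
        self.events = deque()  # timestamps of recent deletions

    def allow(self) -> bool:
        now = time.time()
        while self.events and now - self.events[0] > 3600:
            self.events.popleft()  # drop deletions older than an hour
        if len(self.events) >= self.max_per_hour:
            return False
        self.events.append(now)
        return True

limiter = NukeRateLimiter()
if not limiter.allow():
    raise SystemExit("refusing to nuke: hourly deletion budget exhausted")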

Pride goes before the fall.

True, but I have rotating air-gapped offline backups for the worst case. Even if someone could somehow escape the LXC container they would be user 100000 on the host, which can't do anything without a kernel exploit that gives them root. Given how often I update the kernel, especially if there is news about any security vulnerability, it seems quite unlikely someone who gains access to your container could escape it, unless NVidia lets me down with their terrible driver security. The worst someone could do inside the LXC container is ruining/sabotaging our operations and bothering my ISP with malicious activity. I have quite tight firewall rules set and use separate subnets for our cluster, our internet network and my home network, and generally don't have insecure devices in my network, so it is unlikely they could traverse to any other device from within the LXC container besides other LXC containers used for our operation, which have equal security constraints.

The wait queue has not been this low since early November. That was almost half a year ago!

964 additional job(s) in wait queue (total estimated size 24.760TB, imatrix 7.501TB, lownice 0):

To celebrate I reconstructed the following historic wait queue size based on manual status page backups I made:

Nov 09 01:05: 1107
Nov 09 10:21: 1456
Nov 10 02:47: 1489
Nov 10 10:14: 1537
Nov 10 11:50: 1589
Nov 10 18:20: 1611
Nov 10 20:58: 1636
Nov 11 14:14: 1637
Nov 11 17:22: 1633
Nov 12 00:10: 1678
Nov 13 11:29: 1738
Nov 13 13:49: 1796
Nov 14 10:32: 2020
Nov 18 19:43: 2077
Nov 19 10:22: 2962
Nov 20 02:22: 3073
Nov 20 02:25: 3107
Nov 20 09:45: 3319
Nov 21 10:26: 3324
Nov 22 16:27: 3329
Nov 22 17:42: 3330
Nov 27 00:00: 3466
Nov 27 02:20: 3468
Nov 27 16:37: 3459
Nov 28 23:09: 3441
Nov 29 10:28: 3440
Dec 01 07:21: 3534
Dec 01 17:17: 3613
Dec 01 20:22: 3729
Dec 02 14:36: 3720
Dec 03 01:15: 3698
Dec 03 17:19: 3848
Dec 04 01:53: 3816
Dec 11 10:49: 3800
Dec 12 13:03: 3830
Dec 13 09:58: 3919
Dec 13 16:46: 3959
Dec 14 23:37: 3977
Dec 15 02:25: 4001
Dec 15 02:51: 4000
Dec 15 06:59: 4052
Dec 15 12:31: 4051
Dec 15 18:25: 4056
Dec 16 15:21: 3987
Dec 16 19:59: 3969
Dec 17 11:30: 3907
Dec 17 13:47: 3881
Dec 17 15:39: 3831
Dec 17 21:57: 3754
Dec 18 05:15: 3731
Dec 18 17:35: 3636
Dec 19 04:42: 3620
Dec 19 11:11: 3556
Dec 20 00:49: 3465
Dec 20 16:06: 3386
Dec 21 05:12: 3379
Dec 21 15:09: 3325
Dec 21 21:43: 3295
Dec 22 16:02: 3183
Dec 23 16:50: 2982
Dec 24 04:15: 2898
Dec 24 15:15: 2769
Dec 25 01:49: 2612
Dec 25 16:08: 2599
Dec 26 03:54: 2598
Jan 03 02:57: 1450
Jan 03 03:06: 1447
Jan 03 03:09: 1446
Jan 03 04:02: 1464
Jan 03 14:47: 1393
Jan 03 23:56: 1299
Jan 04 01:20: 1283
Jan 04 13:34: 1160
Jan 04 19:05: 1094
Jan 05 02:01: 1022
Jan 05 04:43: 973
Jan 20 01:54: 1205
Jan 21 11:33: 1192
Jan 24 12:49: 1097
Feb 02 00:52: 1061
Feb 04 09:48: 1074
Feb 16 02:43: 2145
Feb 18 02:18: 2377
Mar 12 20:32: 1799
Mar 13 04:15: 1779
Mar 13 19:24: 1512
Mar 13 19:58: 1501
Mar 13 20:14: 1496
Mar 13 21:55: 1492
Mar 14 09:24: 1472
Mar 14 17:16: 1380
Mar 14 17:52: 1367
Mar 15 00:22: 1370
Mar 15 14:28: 1242
Mar 15 17:28: 1206
Mar 15 19:54: 1204
Mar 18 17:52: 1132
Mar 18 20:06: 1125
Mar 19 09:16: 1068
Mar 19 14:32: 1007
Mar 19 14:53: 1000
Mar 19 15:11: 995
Mar 19 17:50: 967
Mar 19 19:21: 964

The maximum it ever reached according to my measurements was 4056 on 15th of December 2024! :D

What makes you think that PyTorch would no longer be vulnerable to arbitrary code execution? As long as you unpickle you allow arbitrary code to run.

I thought I had read that transformers had switched to a restricted unpickle library, but... indeed, that seems not the case.

However, my point was a different one: I think it does not unpickle by default, so pickle isn't the problem for unsuspecting users (it is for us, since we enable it). The problem is that asking for untrusted code execution wrongly implies that it won't execute untrusted code when not told to do so. It would be better to always execute untrusted code instead of giving a false sense of security.

Having a malicious insider should also be a concern given how much access all our nodes currently have.

Like, richard, me, you, marco and....? I mean, specifically the mradermacher nodes themselves, or the host, not other containers.

I am not very concerned about that, but maybe I don't value the repositories as highly, so vandalism is pretty low on my list of fears. But I wouldn't want to be part of a large scale attack on people downloading models :) Other than llama.cpp insecurities, which by necessity I don't care much about.

The wait queue has not been this low since early November. That was almost half a year ago!

Time to queue more older models, I suspect. Once I find the time again.

To celebrate I reconstructed the following historic wait queue size based on manual status page backups I made:

I infrequently wished we had such historical backups :_)

Like, richard, me, you, marco and....? I mean, specifically the mradermacher nodes themselves, or the host, not other containers.

I got betrayed by too many friends when I hosted a Minecraft server back in high school, so I take security against trusted insiders likely a bit too seriously. The good thing here is that all of the persons currently involved have invested a ton of money and resources into this, so nobody will do anything malicious for sure. This is mainly a concern should we ever plan on onboarding someone new as giving someone we barely know this level of access is a major risk.

I am not very concerned about that, but maybe I don't value the repositories as highly, so vandalism is pretty low on my list of fears.

We put so much effort, work and resources into them so I value them quite a lot.

But I wouldn't want to be part of a large scale attack on people downloading models :) Other than llama.cpp insecurities, which by necessity I don't care much about.

It is very unlikely someone could use GGUFs to distribute malware. The format is relatively secure.

I infrequently wished we had such historical backups :_)

I will upload mine soon. They are tiny. To make your own you could create a cron task that downloads the status page. I just save the status page with Ctrl+S from time to time so I can better see how well things progress. I'm way too obsessed with the status page.
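
The cron task really could be that simple - a sketch, with the status page URL and backup directory being placeholders:

import datetime
import urllib.request

URL = "https://example.org/status"  # placeholder for the real status page URL
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
with urllib.request.urlopen(URL, timeout=30) as resp:
    data = resp.read()
with open(f"/var/backups/status-{stamp}.html", "wb") as f:  # placeholder backup directory
    f.write(data)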

My list of fears:

  1. HuggingFace imposing a storage limit
  2. HuggingFace bugging us for using too many resources
  3. HuggingFace banning us for some really stupid reason like too many DMCA notices and not taking into account the number of models we have, or some abuse report spam or other trash like this
  4. HuggingFace running out of money
  5. llama.cpp deciding to merge a change breaking support for all quants ever created because their maintainers don't value support for legacy quants.
  6. GGUF getting replaced by a far superior format, as happened to GGML which then got replaced by GGUF
  7. Stupid regulation from the USA messing with open AI models, especially with the current president behaving so unpredictably.
  8. The EU being stupid as always and wanting to geoblock some "dangerous" AI models. I'm so glad Switzerland is not part of this organization.
  9. My ISP kicking me out due to using too much bandwidth.
  10. Someone or something sabotaging our operations
  11. HuggingFace Xet storage disrupting our operations. I think they already push 7% of traffic through Xet so we might already use it.
  12. HuggingFace doing stupid rate limits. Richard just got rate limited today:
    grafik.png

While we are at it, a thing Richard doesn't like is him only having 2 tasks on rich1. He sent me this picture 2 hours ago - luckily there are more now. Also he would like the ability to set the number of parallel tasks:

grafik.png

In any case let's just enjoy the moment. There never was a better time to enjoy all these amazing openly available AI models!

I'm enjoying locally running AI models so much I ordered 2x Intel Arc 770 last weekend. They perform better for LLMs than they should, and you get 4x better performance/value than with NVidia according to the specifications, and even better performance/value in the unlikely case those unrealistic benchmarks are true: https://www.tweaktown.com/news/97705/metas-next-gen-llama3-llm-is-here-and-the-intel-arc-a770-outperforms-geforce-rtx-4060/index.html and https://www.plainconcepts.com/maximizing-ai-performance-intel-arc-a770-gpu/

97705_03_ai-llm-performance-on-intel-arc-a770-16gb-gpu-outperforms-geforce-rtx-4060-8gb-using-cuda_full.jpg

perfomance-comparison-chart.png

This is mainly a concern should we ever plan on onboarding someone new as giving someone we barely know this level of access is a major risk.

Agreed. Maybe we can move repo creation more centrally (dryrun was a major step towards that btw.) and maybe have finer-grained tokens for, say, only uploads. At some point.

The format is relatively secure.

The format is completely secure, I think. But I don't trust the gguf parsers one bit.

list of fears

yeah, these are all definitely on the realistic part of the scale. In fact, I am surprised this got as far as it got, and I expect total breakdown daily. Enshittification is a thing, and it already happens with hf as well, although at a surprisingly low level so far.

HuggingFace Xet storage disrupting our operations.

my immediate short term concern, yes :)

Richard just got rate limited today:

Holy shit, what did they call the 5MB/s bandwidth before? unlimited? :-)

While we are at it, a thing Richard doesn't like is him only having 2 tasks on rich1.

Well, nice level 1400 is very far down the list, so the scheduler does reserve resources for higher priority things. Some tuning might always be required (the logic is a pure mess, always changing :)

But the real problem with richard will be once we are through the low pri models, which will be soon. Richard and me will have to find a new mode of operations.

Also he would like the ability to set the number of parallel tasks:

As in less than two, or more than two? I suspect once we are through the models, it would make most sense to limit it to two, so he always has guaranteed resources available for himself, since we likely won't need him full time anymore (likewise nico2). rich1 is the fastest box we have that is always available (marco is hampered by disk speed mostly. he was thinking of buying an nvme for just this).

I'm enjoying locally running AI models so much I ordered 2x Intel Arc 770 last weekend.

Yeah, I wondered about intel arc, too, in the beginning, and then they cancelled their promised 64GB model and generally fucked up their drivers again, so I was dissuaded. But things seem to be improving, that is good. If anybody needs more competition, it's nvidia, and if anybody needs better, more competitive products, it's intel at the moment. We saw how long it took AMD to mirror the shitty nvidia price hikes (i.e. instant), and I don't doubt intels death will immediately cause amd to become the new evil intel. Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.

Although, I must admit, I was dreaming about intels death ever since I had a dual opteron.

In any case let's just enjoy the moment. There never was a better time to enjoy all these amazing openly available AI models!

Yes, very depressing. Oh, you meant this to be uplifting??

containing all relevant files for a GPTNeoXTokenizerFast tokenizer.

do you know a way of doing something about this? happens with https://huggingface.co/zaas12/pythia-1.4-orpo for example. if it is as simple as installing it, there might be a whole bunch of models with this or similar issues (missing python packages)

btw., so amusing: https://emygervais.github.io/2025/03/15/bytecraft.html (saw the model this morning)

sigh. the amusing bytecraft model caused 100% endless loop on rich, blocking everything.

CPU based prompt processing of Q4_K will soon be so much faster on AVX2 compatible CPUs: https://github.com/ggml-org/llama.cpp/pull/12332

Well, nice level 1400 is very far down the list, so the scheduler does reserve resources for higher priority things. Some tuning might always be required (the logic is a pure mess, always changing :)

It would be nice if there are always some models waiting around idle, ready to take over. Especially for Richard, who cares way too much about his server getting fully utilized all the time.

But the real problem with richard will be once we are through the low pri models, which will be soon. Richard and me will have to find a new mode of operations.

You will have to keep rich1 busy or he will start using it for his own quants of that list of trash models you gave him. By the way, he is doing an awesome job abusing Google Colab and soon his home internet for Native/AWQ/MLX quants.

As in less than two, or more than two? I suspect once we are through the models, it would make most sense to limit it to two, so he always has guaranteed resources available for himself, since we likely won't need him full time anymore (likewise nico2). rich1 is the fastest box we have that is always available (marco is hampered by disk speed mostly. he was thinking of buying an nvme for just this).

What do you expect from Richard? Obviously he wants to run 3. Here is a quote from him:

why rich1 so smol ?
I paid today, I want full load 💀
we need to find something to process on rich1
a third queue or something just to run to hit this 100% cpu
I pay for whole server, I use whole server
can I have a button to switch 2 models concurrently to 3/4 models concurrently?

Usually he would run his own quants on rich1 as well to make sure it is maxed out, but HuggingFace's repository creation rate limit hit him today so he cannot really max it out.

Yeah, I wondered about intel arc, too, in the beginning, and then they cancelled their promised 64GB model and generally fucked up their drivers again, so I was dissuaded. But things seem to be improving, that is good. If anybody needs more competition, it's nvidia, and if anybody needs better, more competitive products, it's intel at the moment. We saw how long it took AMD to mirror the shitty nvidia price hikes (i.e. instant), and I don't doubt intels death will immediately cause amd to become the new evil intel. Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.

Luckily StormPeak has 128 PCIe 5.0 lanes and 16 PCIe 4 lanes. AMD is quite generous with latest gen Threadripper Pro as even the cheapest 12 core model for $1400 comes with the full 128 PCIe lanes offering an absolutely awesome price/pcie lane ratio.

All manufacturers' latest gen GPUs are shit. Nvidia's 50-series got backported to TSMC 5nm, is really inefficient and is basically just a larger 40-series for an insane price. AMD costs way too much for just 16 GB of memory, and ROCm is the biggest pain ever to use for AI and basically anything else. Intel Arc's latest gen is decent, but with only a 192-bit bus it is worse than their previous generation, with only 12 GB of memory and so less bandwidth and far fewer AI cores, but there is hope for an awesome 24 GB clamshell model later this year.

Intel Arc 770 is truly awesome. This is not the latest generation they released this year but what they released 2.5 years ago. They now offer 16 GB of GDDR6 on a 256-bit bus with 560 GB/s bandwidth and 512 AI cores for $280, while NVidia offers a 4060 Ti 16 GB with a 128-bit bus using clamshell for $600. For the price of an RTX 5090 I could buy over 8x Intel ARC 770, which combined would be 128 GB GDDR6 at 4480 GB/s on a 2048-bit bus, totaling 4096 AI cores. Really, the price/performance you currently get with last gen Intel Arc 770 is insane. It is also worth considering that despite its age it is using TSMC 5 nm like the NVidia 40-series of GPUs. And now, so many years after the initial Intel Arc launch, they finally got the software side of things perfect. PyTorch, llama.cpp, axolotl, vLLM all work without any issues on Intel Arc, both on Windows and Linux. I just hope it doesn't have the audio stuttering or PCIe device reset issues I'm currently experiencing on the RTX 4090, or the random Linux kernel crashes I'm experiencing using Sparkle Intel ARC A310 4GB GPUs at my job. I will for sure let you know how it goes once they arrive. They will probably both go into StormPeak so it then has 2x RTX 4090 + 2x Intel Arc 770, and we can keep the RTX 3080 in CastlePeak and the RTX 2070s in Threadripper for the RPC setup.

Although, I must admit, I was dreaming about intels death ever since I had a dual opteron.

Regarding CPUs, Intel is dead for me since they removed AVX-512. I'm just not buying any CPU without AVX-512. Doing so would be stupid. I want my fast memset. Jokes aside, there are applications like llama.cpp on CPU and AV1 encoding where AVX-512 makes quite a massive difference. But I'm generally not that happy with AMD64. I really wish we could soon move on and use RISC-V based CPUs for PCs. I'm already using RISC-V based CPUs for use-cases where security matters. I also really miss transactional memory, which Intel promised many times, then messed up and now has just abandoned. With the latest security vulnerability AMD CPUs got a whole lot more interesting anyways. You can now jailbreak them and write your own microcode: https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking

Yes, very depressing. Oh, you meant this to be uplifting??

It was meant to be uplifting, but I guess it can also be seen as quite depressing depending on whether you value the now or the future. Things will likely never be as good as they are now, at least for me, as I have basically reached the peak of joy and happiness: I have an awesome job I truly enjoy and to which I’m looking forward every day, and during my spare time I can have fun with all these awesome openly available AI models. There just is no way things stay anywhere close to as good as they currently are. I recommend just enjoying the moment as long as it lasts. I made sure to back up all the base models and the models I like the most just in case.

do you know a way of doing something about this?

The entire error is:
Can't load tokenizer for '/bpool/pythia-1.4-orpo'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/bpool/pythia-1.4-orpo' is the correct path to a directory containing all relevant files for a GPTNeoXTokenizerFast tokenizer.

https://huggingface.co/zaas12/pythia-1.4-orpo/tree/main does not contain a tokenizer.json or tokenizer.model so the model simply has no tokenizer. To fix this error, just copy the GPTNeoXTokenizerFast tokenizer from a different model into the folder containing the downloaded model.

For this specific model you know, based on the "_name_or_path" inside the config.json, that it was trained based off "EleutherAI/pythia-1.4b".
So you could download the missing tokenizer files from there.

After which the model successfully converts into a GGUF. I stored the resulting GGUF under /tmp/quant/pythia-1.4-orpo.gguf.
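
A minimal sketch of that fix, assuming the base model's tokenizer really matches the finetune (transformers will write the GPTNeoXTokenizerFast files straight into the model directory):

from transformers import AutoTokenizer

# Pull the tokenizer from the base model named in config.json's _name_or_path...
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
# ...and save its files next to the finetune so convert_hf_to_gguf.py finds them.
tok.save_pretrained("/bpool/pythia-1.4-orpo")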

missing python packages

I noticed a ton of models with missing python packages and I was wondering why we keep nuking them instead of installing the proper dependencies. It seems quite stupid that we don't support models where the HF to GGUF conversion depends on a specific python package. I guess now that we maintain our own llama.cpp fork we could add them all to the requirements.txt

btw., so amusing: https://emygervais.github.io/2025/03/15/bytecraft.html (saw the model this morning)

Who would have guessed: "Working in the byte world is extremely challenging because a single wrong byte can break the whole functioning of the file."
This idea is so insane. You basically teach an LLM to write bytes instead of ASM. It might work but damn is it an insane idea. They should have at least used a tokenizer that made somewhat sense for this use case or even better just train a model from scratch because this is so far different from the common use-case of an LLM that starting fresh would likely be justified. What's next? An AI that creates a ROP chain so I can run a game inside a banking application by abusing a buffer overflow?

Not to speak of the shady practice of artificially reducing PCIe lanes to sell more server hardware (which amd also copied from intel). Enshittification everywhere.

Intel has done plenty of shady shit, but this situation is far more nuanced, as PCIe lanes take up valuable die space, and PCIe switches becoming too expensive for consumer usage, etc.

"PLX was acquired by Avago in 2014 in a deal that valued the company at $300m, and seemingly overnight the cost of these switches increased three-fold according to my sources at the time, making them unpalatable for consumer use." source: https://www.anandtech.com/show/15821/microchips-new-pcie-40-pcie-switches-100-lanes-174-gbps
second source that mentions this increase: https://www.servethehome.com/business-side-plx-acquisition-impediment-nvme-everywhere/

My go to example for intel screwing consumers over is ECC memory, there was no technical limitation on that at all, and consumer platforms having less stability because of it is a legacy we still deal with to this day.

Intel has done plenty of shady shit, but this situation is far more nuanced, as PCIe lanes take up valuable die space

That is why AMD puts a dedicated IO die in their CPUs. But even on a monolithic design PCIe lanes and memory lanes are always worth the die space they use. More PCIe lanes means more GPUs and more SSDs and more memory lanes means faster CPU inference performance. I'm so happy StormPeak has octa-channel memory.

PCIe switches becoming too expensive for consumer usage

I never really saw the appeal of PCIe switches. What is the advantage of using a PCIe switch compared to just using PCIe bifurcation to split PCIe bandwidth between multiple devices? When I want to plug 4 GPUs into one PCIe x16 slot I'm just using a x4x4x4x4 bifurcation card for $30 and I have something reliable that equally distributes the bandwidth between all the GPUs. But cheap PCIe redrivers would be super useful. My mainboard luckily has some integrated but having them after the PCIe riser cable would likely make way more sense.

My go to example for intel screwing consumers over is ECC memory, there was no technical limitation on that at all, and consumer platforms having less stability because of it is a legacy we still deal with to this day.

ECC memory is huge. Having ECC memory is an absolute must. I need to be able to trust my PC. Without ECC memory I would have to basically do all computations twice and compare results, which would be insanely wasteful. This is by the way exactly what I did before I had ECC memory. All my PCs since 2017 have ECC memory. Threadripper has 128 GB ECC UDIMM DDR4 memory, CastlePeak has 256 GB ECC UDIMM DDR4 memory and StormPeak has 512 GB ECC RDIMM DDR5 memory. For anyone telling me that ECC errors are unlikely: No they are not. My ECC RAM puts a kernel log entry every time one happens and they indeed do happen and ECC always manages to correct them. Same thing as bit rot on TLC SSDs is something that happens.

That is why AMD puts a dedicated IO die in their CPUs.

Yes, but that is a more recent thing compared to losing out on cheap PCIe switches.

But even on a monolithic design PCIe lanes and memory lanes are always worth the die space they use.

Again it really isn't that simple, it has to be beachfront silicon, as it is I/O, and if you want more of that you have to use a bigger chip. It's not like you can just shrink the cores or use fewer cores, as that won't give you any more beachfront silicon.

I never really saw the appeal of PCIe switches.

They are incredibly useful, and do far more than what bifurcation can. Even with the dedicated I/O die on zen CPUs the chipsets still are PCIe switches (with extra functionality), look up the block diagrams of the X570 and B650 chipsets and you'll see they are PCIe switches (although again they do offer a bit more than a standard PCIe switch, but the switching functionality is still core and important).

I agree with you on ECC, even if I haven't been fortunate enough to use exclusively ECC computers, my desktop is still not ECC but my NAS and my servers are.

You will have to keep rich1 busy or he will start using it for his own quants of that list of trash models you gave him.

I don't know what that means. Will he take my ability away to use it? That sucks. Will he use the idle time for other purposes? Sounds great to me - why keep it artificially busy. In any case, if for some reason things get too bad (there is no indication of that at the moment :) I'd rather not have rich1 then.

I can add a few more models to the queue, though.

What do you expect from Richard? Obviously he wants to run 3. Here is a quote from him:

We've been there before. There is no way to help technically illiterate people. We can run three models once his server has the disk and memory to do so. Right now, it clearly doesn't have the memory, nor the disk, nor the network, for more.

My ECC RAM puts a kernel log entry every time one happens and they indeed do happen and ECC always manages to correct them.

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.

Same thing as bit rot on TLC SSDs is something that happens.

I thought so, too, but all my data happens to be checksummed, and even for ssds that have been stored for years, I've never had a bit error (admittedly, I only have crucial ssds that are that old). But I might simply have been lucky.

When I want to plug 4 GPUs into one PCIe x16 slot I'm just using a x4x4x4x4 bifurcation

and enjoy 25% speed of prompt processing and many other tasks?

Intel has done plenty of shady shit, but this situation is far more nuanced

I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many many years now.

Actually there must be a bug with the queue on rich1.

Actually, there isn't. There simply isn't enough budget to add more big jobs. The only way out is to reduce the size of the jobs it can accept, greatly reducing its usefulness.

Anyway, the queue will become empty at some point, and other than idiotically wasting cpu cycles, there is no way we can avoid becoming idle at some point.

I've reduced max model size for rich to 100B.

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.

I don't have great information on the prevalence of errors so I'm not going to argue one way or the other, but ECC offers far more than just a "fuzzy feeling of extra security". Error detection and monitoring is a huge benefit, such as helping you find and deal with faulty hardware before it causes actual problems, and telling you if that is or is not the cause of the issue you are experiencing.

I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many many years now.

Huh? I agree with you that it is Intel intentionally doing this for market segmentation, but my point was that for consumers motherboard PCIe lanes went down far more because of the lack of cheap PCIe switches, as most lanes before were provided by PCIe switches, and that is still a thing, with modern chipsets still stepping in to offer more lanes than the CPU provides. Memory channels and PCIe lanes are far more costly than cores and market segmentation based on that isn't entirely unreasonable. The scummy shit is them doing stuff like this: "When Intel introduced Haswell-E, it experimented with a new type of product separation: it also varied the number of CPU-host PCIe lanes among the SKUs. This practice continues in Broadwell-E, in an almost identical fashion. The lowest end CPU has 28 PCIe 3.0 lanes, capable of three-way GPU setups (and no 2x16 setups), while the other processors have a full 40 PCIe 3.0 lanes" source https://www.anandtech.com/show/10337/the-intel-broadwell-e-review-core-i7-6950x-6900k-6850k-and-6800k-tested-up-to-10-cores

If you wanted more memory channels or PCIe lanes you went onto the HEDT platforms. They also didn't really do HEDT for a while, but again that is long after the period we are talking about with PCIe lanes going down, and a whole different story.

I'm not trying to change your mind as I'm not really even sure where we disagree. I just think ECC is a far simpler and clearer example of Intel being scummy and locking consumers out of things, and even the mainstream platform being kept at quad cores for far longer than it should have been, and the way they segmented hyper-threading, both have less nuance than the PCIe/memory class segmentation between HEDT and consumer.

such as helping you find and deal with faulty hardware before it causes actual problem

Helping is a relative term. There are many places where you can get bit errors, such as inside your CPU. I've had more cases of faulty cpus and cpu memory controllers than I ever had memory errors.

Point being, ECC is a relatively minor thing. When I hear that without ECC memory, one does all the calculations twice, this is just cargo cult.

Also, it's really fun to pull nico's leg sometimes.

I'm not trying to change your mind as I'm not really even sure where we disagree.

I don't think we are in any significant disagreement :)

Again it really isn't that simple, it has to be beachfront silicon, as it is I/O, and if you want more of that you have to use a bigger chip. It's not like you can just shrink the cores or use fewer cores, as that won't give you any more beachfront silicon.

Ah that explains why the I/O die on StormPeak is so physically massive compared to the dies that contain the actual cores. I always wrongly assumed this is the case because they use cheaper older wafers for I/O dies.

They are incredibly useful, and do far more than what bifurcation can. Even with the dedicated I/O die on zen CPUs the chipsets still are PCIe switches (with extra functionality), look up the block diagrams of the X570 and B650 chipsets and you'll see they are PCIe switches (although again they do offer a bit more than a standard PCIe switch, but the switching functionality is still core and important).

You are right. On my WRX90E-SAGE SE mainboard the chipset also acts as a PCIe switch. It serves the two SlimSAS ports, each running at PCIe 4.0 x4, and the 4 SATA ports.

amd-wrx90.png

I agree with you on ECC, even if I haven't been fortunate enough to use exclusively ECC computers, my desktop is still not ECC but my NAS and my servers are.

ECC is awesome! I love every bit of it.

I don't know what that means. Will he take my ability away to use it? That sucks. Will he use the idle time for other purposes? Sounds great to me - why keep it artificially busy. In any case, if for some reason things get too bad (there is no indication of that at the moment :) I'd rather not have rich1 then.

Sorry for being unclear. No, he will obviously not take rich1 away when we don’t use it, but will make use of any idle resources to do models for his own account. He even does this now during the short downtimes we sometimes have due to the repo creation rate limit.

I can add a few more models to the queue, though.

I personally would rather be happy to see the queue empty up and be like it was before November than keep things as crazy as they currently are. But if you find great models, please queue them; we don’t need to queue garbage just to satisfy Richard. He can do his own quants if he is unsatisfied with rich1 utilization, which he actually already does every time we don’t max out his CPU.

We've been there before. There is no way to help technically illiterate people. We can run three models once his server has the disk and memory to do so. Right now, it clearly doesn't have the memory, nor the disk, nor the network, for more.

Yes exactly. I fully agree. He figured this out the hard way today after insisting on me telling him how to increase the default OpenWrt 65536 active connection limit. He increased it to 655390 just to figure out that the limit existed for a reason: above 80K concurrent connections things started to get unstable and he lost connection to the entire server. Sometimes he just has to see things for himself to learn. He is still young and so has to get his experience from somewhere. It’s quite funny how he keeps complaining about why everything keeps breaking for him without ever wondering if he might be the reason why. There is a 150-Watt peak power limit for the GPU in his laptop. He thought it was a great idea to remove that and run it 24/7 at 200-Watt normal power. Let's see how long that lasts. He just does all the stupid things a teenager would do, but with computer hardware.

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

Let’s check the logs on Threadripper and see who is right. I investigated a ton of ECC errors in late 2024, so I for sure wouldn’t consider them “exceedingly rare”. They are surprisingly common for me.

So, yeah, they do happen, but many other faults are more likely. It is certainly nice to have this fuzzy feeling of extra security, though, but the performance drain is objectively not worth it for most applications.

It really depends on the type of computations you run. For me correctness is way more important than anything else, especially back when I did scientific computing; it matters less now that I mostly do AI. This is also why I have not enabled ECC on the GPUs you use for imatrix computation, as doing so would lead to an over 10% performance decrease for very little benefit for our use case. ECC doesn’t matter for my current use cases as much as it did in the past, but it still is an important nice-to-have and worth the hardware and performance cost for sure.

I thought so, too, but all my data happens to be checksummed, and even for ssds that have been stored for years, I've never had a bit error (admittedly, I only have crucial ssds that are that old). But I might simply have been lucky.

It seems to really depend on the SSD controller, storage chips and ECC algorithm. I think so far it was only late PCIe 3 and early PCIe 4 SSDs from Samsung and Kingston on which I experienced bit rot issues. The SSDs you currently use for nico1 are notorious for uncorrectable bit rot, so if you ever store a large file and don't read it for half a year, and the host didn't run monthly scrubs, it would have a quite high likelihood of being corrupted after that half a year. This is one of the main reasons I gave you those specific SSDs. Their bit rot was a massive pain for me and I wasted dozens of hours on it. Corrupted rotten blocks kept breaking my backups, as every time one got encountered the backup got canceled, resulting in me having to spend hours searching for whatever file contains the faulty block, restoring it from backup, trimming all empty space and hoping the issue is gone. I had Windows Server on those SSDs, so no fancy file system like ZFS or BTRFS to tell me which files are broken, so I just had to dd the entire thing, see where it fails and then figure out what file is at that position.

and enjoy 25% speed of prompt processing and many other tasks?

I really should reduce the 4090 GPUs to x4 and see if there is a performance difference for imatrix computation. You are of the opinion that all that matters for imatrix performance is PCIe bandwidth, but I'm almost certain imatrix is RAM bottlenecked, as the RAM is what gets hot while it is running. Last weekend I even installed a piece of plastic inside StormPeak to direct airflow towards the RAM, because before, every time we did imatrix computation, everyone in the entire house heard the fans and joked that I have a hay drier in the basement, and it was actually so loud that sitting next to it for a long period of time made my ears hurt. Since I did that modification imatrix computation is almost quiet.

I disagree with the "far". Intel has reduced the number of pcie lanes in desktop cpus over the years. So, yeah, some nuance, but intel did this for segmentation reasons. change my mind, but that is what intel has been doing for many many years now.

And this is why I don't buy Intel or normal AMD CPUs. I mainly care about PCIe lanes and memory channels when buying a CPU, as this is what ends up bottlenecking me. I really hope AMD keeps Threadripper around because EPYC server mainboards suck.

Actually there must be a bug with the queue on rich1.
Actually, there isn't. There simply isn't enough budget to add more big jobs. The only way out is to reduce the size of the jobs it can accept, greatly reducing its usefulness.

Ah yes, that explains why only 2 got put there. The current models somehow got massive because we reached the big-model part of the priority 1400 models. I still don't get why we sort models by priority. Doing so seems a bit dumb because then, once we are done with them, we have to rush manually adding more big ones to not get HuggingFace repo creation limited.

Anyway, the queue will become empty at some point, and other than idiotically wasting cpu cycles, there is no way we can avoid becoming idle at some point.

Which is a good thing as then Richard can use it for his own purposes while we are not using it.

I've reduced max model size for rich to 100B.

Great. That should ensure that there are always some models waiting idle on rich1.

I don't have great information on the prevalence of errors so I'm not going to argue one way or the other

That's why I will now actually check my kernel logs on Threadripper, because I should have the data there: for some reason journalctl keeps all the kernel logs since the 31st of May.
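
Roughly what I mean by checking (the date and the filter are just what I plan to use, adjust as needed):

# corrected-error reports from the MCE decoder and the EDAC driver
journalctl -k --since "2024-05-31" | grep -E "EDAC|Hardware Error"
# per-DIMM corrected/uncorrected error counters, if edac-utils is installed
edac-util -v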

ECC offers far more than just a "fuzzy feeling of extra security". Error detection and monitoring is a huge benefit, such as helping you find and deal with faulty hardware before it causes actual problems, and telling you if that is or is not the cause of the issue you are experiencing.

I absolutely awesome. It already helped me detect many issues, mainly while building a new PC.

Memory channels and PCIe lanes are far more costly than cores and market segmentation based on that isn't entirely unreasonable.

I never realized that they would be so expensive, given that even the cheapest latest-gen Threadripper Pro has 128 PCIe 5.0 lanes and 8 memory channels despite only having 12 cores.

Helping is a relative term. There are many places where you can get bit errors, such as inside your CPU. I've had more cases of faulty cpus and cpu memory controllers than I ever had memory errors.

ECC doesn't just correct memory errors that happen due to faulty bits but also errors that happen during the transfer from memory to CPU. Unless the CPU really has some bug, data integrity should be guaranteed. DDR5 has some in-memory ECC but that is not checking the transfer so even for DDR5 I’m using proper ECC memory. But honestly if you don’t use your PC for anything important DDR5 without ECC will likely be fine as the internal ECC in all DDR5 memory is quite decent at preventing random memory errors due to things like cosmic rays.

Point being, ECC is a relatively minor thing. When I hear that without ECC memory, one does all the calculations twice, this is just cargo cult.

I actually did so back in university for all scientific calculations because I couldn't risk them being wrong.

I don't think we are in any significant disagreement :)

We are not.

Also, it's really fun to pull nicos legs sometimes.

Or more like make me spend 2 hours reading, researching and replying to the massive wall of text you all wrote today. Joking aside, it actually was a very interesting discussion and this is the first time I closely looked at the ECC error log as a whole instead of investigating specific ECC events.

This is what a typical ECC event looks like:

Aug 30 19:47:21 Threadripper kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Corrected error, no action required.
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Error Addr: 0x000000019a9b9040
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000fd010a400302
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Aug 30 19:47:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Aug 30 19:47:21 Threadripper kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Dec 24 02:54:42 Threadripper kernel: mce: [Hardware Error]: Machine check events logged
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Corrected error, no action required.
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Error Addr: 0x000000019a9b9040
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0000fd010a400302
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Dec 24 02:54:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 24 02:54:42 Threadripper kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

And here are all the ECC events that happened on Threadripper with 128 GB of DDR4 ECC memory since the 31st of May:

Aug 30 19:47:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01
Nov 28 14:45:22 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 28 21:24:02 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 29 05:30:06 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Nov 30 07:42:58 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 01 04:28:09 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 03 02:09:44 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 02:55:13 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 03:00:41 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 04 14:56:07 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 05 14:30:36 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 02:04:11 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 03:53:25 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 03:58:52 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 04:04:20 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 04:09:48 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 06 08:04:38 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 10:33:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:01:16 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:06:44 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 12:12:11 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 07 20:12:47 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 17:56:29 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 17:58:03 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:03:30 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:08:58 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:14:26 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:19:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:25:21 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:30:49 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 18:36:16 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 08 20:41:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:21:00 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:26:28 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:31:56 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:37:24 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:42:51 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:48:19 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:53:47 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 09:59:14 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:04:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:10:10 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:15:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 10:21:05 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 12:37:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:21:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:26:46 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:32:14 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:37:41 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:43:09 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:48:37 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:54:04 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 13:59:32 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:05:00 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:10:28 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:15:55 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:21:23 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:26:51 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 14:32:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 09 18:27:08 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 10 21:40:05 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 11 03:02:18 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 11 18:08:53 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)
Dec 24 02:54:42 Threadripper kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x6ea6e4 offset:0x40 grain:64 syndrome:0xfd01)

This is a total of 64 corrected ECC errors in less than 10 months. I wouldn't consider this rare. It's also quite surprising that the issue seems to always happen at the same address, so maybe there is some sort of hardware defect that makes this specific address much more likely to have issues.

Your results suggest that there is something seriously wrong with your hardware. I wouldn't trust it if it generates this massive amount of bit errors. insert funny theory about the swiss alps being nearer to space

The only times I ever got ECC errors in the last 20 years (I don't think I had ecc detection hardware before the 2000s) were a hardware errata in an intel cpu (to be ignored), and actually faulty ram sticks. I am seriously distrusting your hardware now. I mean, the ram -> cpu path is not the only thing that can go wrong, and the failures you have are massive. Once every 5 years would be more acceptable, IMnsHO.

Since it seems to be always the same address (if I read that correctly, which I probably don't), this also indicates that your ram is indeed faulty. So, yeah, ECC found it, but so would a burn in with a memory checker.

Ah that explains why the I/O die on StormPeak is so physically massive compared to the dies that contain the actual cores. I always wrongly assumed this is the case because they use cheaper older wafers for I/O dies.

I hate to sound repetitive, but there is more nuance again. They do use older nodes for the I/O die, and that does result in it being larger, but not by that much, because I/O is one of those things that does not scale well with process nodes. That adds to the problem we were talking about before of it taking up valuable die area, as process node shrinks reduce the die area of cores far more than I/O.

You are right. On my WRX90E-SAGE SE mainboard the chipset also acts as a PCIe switch. It serves the two SlimSAS ports, each running at PCIe 4.0 x4, and the 4 SATA ports.

Yep, that is typical. But also, it's not just ubiquity; like I said before, PCIe switches are capable of things that bifurcation can't do.

ECC is awesome! I love every bit of it.

I do like it. Memory instability sucks, speaking as someone who has dealt with it in recent times (solved by RMA'ing the RAM).

Yes, exactly. I fully agree. He found that out the hard way today after insisting that I tell him how to increase OpenWrt's default limit of 65536 active connections. He increased it to 655390 just to find out that the limit existed for a reason: above 80K concurrent connections things started to get unstable and he lost connection to the entire server. Sometimes he just has to see things for himself to learn. He is still young and so has to get his experience from somewhere. It's quite funny how he keeps complaining about everything breaking for him without ever wondering if he might be the reason why. There is a 150-watt peak power limit for the GPU in his laptop. He thought it was a great idea to remove that and run it 24/7 at 200 watts normal power. Let's see how long that lasts. He just does all the stupid things a teenager would do, but with computer hardware.

His experimenting sounds fun (might have something to do with the fact that I'm not at all impacted by it). You can learn by doing, but you don't always find out what you did wrong. I still don't know why I couldn't get jumbo frames working on a point-to-point link (so very few things involved, and all of them should support it) a few years ago.

It seems to really depend on the SSD controller, storage chip and ECC algorithm. I think so far it was only late PCIe 3 and early PCIe 4 SSDs from Samsung and Kingston on which I experienced bit rot issues. The SSDs you currently use for nico1 are notorious for uncorrectable bit rot, so if you store a large file, don't read it for half a year and the host doesn't run monthly scrubs, it has a quite high likelihood of being corrupted by then. This is one of the main reasons I gave you those specific SSDs. Their bit rot was a massive pain for me and I wasted dozens of hours on it. Corrupted rotten blocks kept breaking my backups: every time one got encountered the backup got canceled, leaving me to spend hours searching for whichever file contained the faulty block, restoring it from backup, trimming all empty space and hoping the issue was gone. I had Windows Server on those SSDs, so no fancy file system like ZFS or BTRFS to tell me which files are broken; I just had to dd the entire thing, see where it fails and then figure out what file sits at that position.

Thank you for this story (and I would love to know more if you don't mind). I'm very picky about buying SSDs (when I can afford to be); quality like you saw varies, but what bothers me a lot is that it is easier to count the number of companies that don't do the scummy thing of changing internal components without changing the SKU, as it is so commonplace, which makes actually evaluating quality MUCH harder.

I don't have great information on the prevalence of errors so I'm not going to argue one way or the other

That's why I will now actually check my kernel logs on Threadripper, because I should have the data there: for some reason journalctl keeps all the kernel logs since the 31st of May.

[...]

Sorry, but ECC errors and bit errors are exceedingly rare. I've had dozens of busy servers over the decades, and the only case where I had ECC errors was a CPU errata.

Let's check the logs on Threadripper and see who is right. I investigated a ton of ECC errors in late 2024, so I for sure wouldn't consider them "exceedingly rare". They are surprisingly common for me.

I've seen this conversation happen so many times which is why I bowed out early, but like always it will be fun for me to hear it happen again.

I absolutely awesome.

??? Lol.

It already helped me detect many issues, mainly while building a new PC.

If you build a new PC you should do thorough testing, which includes memory testing; that would find those issues regardless of ECC. (Also, on that note, from what I found there is literally only one memory checker that handles ECC intelligently, and it is sadly paywalled for the useful version, as I recently found out when dealing with a server and testing it.)

I never realized that they would be so expensive, given that even the cheapest latest-gen Threadripper Pro has 128 PCIe 5.0 lanes and 8 memory channels despite only having 12 cores.

The MSRP of the AMD Threadripper PRO 7945WX is $1399; that is well outside of consumer CPU pricing, and it requires a motherboard and RAM that are also much more expensive than consumer parts (especially if you want to make use of the octa-channel memory). I'm not making a value judgement here, but it is objectively in a different price segment than consumer stuff, as most consumers wouldn't even spend half of that CPU price on an entire system.

ECC doesn't just correct memory errors that happen due to faulty bits but also errors that happen during the transfer from memory to CPU. Unless the CPU really has some bug, data integrity should be guaranteed. DDR5 has some in-memory ECC but that is not checking the transfer so even for DDR5 I’m using proper ECC memory. But honestly if you don’t use your PC for anything important DDR5 without ECC will likely be fine as the internal ECC in all DDR5 memory is quite decent at preventing random memory errors due to things like cosmic rays.

You are correct about the difference in ECC, but your last sentence is very odd to me. If I'm not using a PC for anything important, anything is fine, but even then I would trust a DDR4 system over a DDR5 one: the in-memory ECC is there because of the extremely high data transfer rates inherent to the standard, memory controllers are generally less mature, and DDR5 is still inherently more challenging to run.

Even the PCIe standard had to add error correction (but they also switched to PAM4 while DDR5 is using NRZ like previous PCIe revisions):

"because of the additional signal states a PAM4 signal itself is more fragile than a NRZ signal. And this means that along with PAM4, for the first time in PCIe’s history the standard is also getting Forward Error Correction (FEC). Living up to its name, Forward Error Correction is a means of correcting signal errors in a link by supplying a constant stream of error correction data, and it’s already commonly used in situations where data integrity is critical and there’s no time for a retransmission (such as DisplayPort 1.4 w/DSC). While FEC hasn’t been necessary for PCIe until now, PAM4’s fragility is going to change that. The inclusion of FEC shouldn’t make a noticeable difference to end-users, but for the PCI-SIG it’s another design requirement to contend with. In particular, the group needs to make sure that their FEC implementation is low-latency while still being appropriately robust, as PCIe users won’t want a significant increase in PCIe’s latency.

The upshot of the switch to PAM4 then is that by increasing the amount of data transmitted without increasing the frequency, the signal loss requirements won’t go up. PCIe 6.0 will have the same 36dB loss as PCIe 5.0, meaning that while trace lengths aren’t officially defined by the standard, a PCIe 6.0 link should be able to reach just as far as a PCIe 5.0 link. Which, coming from PCIe 5.0, is no doubt a relief to vendors and engineers alike." Source: https://www.anandtech.com/show/14559/pci-express-bandwidth-to-be-doubled-again-pcie-60-announced-spec-to-land-in-2021

Joking aside, it actually was a very interesting discussion

Same for me.

I've avoided samsung for other reasons (ignoring fua), but hearing that is a bit shocking. I have been keeping checksums of most of my files for decades now, so even before filesystems had data checksums (well, just btrfs out of the linux ones, I think), I knew bitrot was a thing. I haven't caught an SSD doing that, but I have caught ext3 and xfs bugs that way in the past, and of course lots of hardware issues.

In any case, I can hardly believe that samsungs would actually bitrot just after a few months when the disk is actually on (even if off, it's hard to believe). Surely this would be well known if that was really the case in general, rather than some faulty specimens? I mean, I believe you, but, sheesh, it can't be, can it?

In any case, I hope you run a monthly scrub on my disks then? :)
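
For reference, a monthly scrub is nothing more than something like this (pool name is a placeholder):

zpool scrub tank        # kick off the scrub
zpool status -v tank    # afterwards lists any files with unrecoverable errors
# e.g. as a system crontab entry (note the user field), 03:00 on the 1st of every month:
0 3 1 * * root /sbin/zpool scrub tank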

Sorry to be a bother but will you be doing this model https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 (the new deepseek checkpoint), and if so when do you think you'd have the imatrix done?

Edit: Sorry, I see it's already in progress here

Sorry to be a bother but will you be doing this model https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 (the new deepseek checkpoint)

Yes for sure. You know very well that I love huge models. We even did FatLlama 1.7T in the past. I recommend you follow https://huggingface.co/mradermacher/model_requests/discussions/797 to get news about our progress on DeepSeek-V3-0324.

if so when do you think you'd have the imatrix done?

I want to do the imatrix using Q8 as I value quality, and I won't consider anything below Q8 high quality. Unfortunately, doing it in Q8 requires the RPC setup because it needs significantly more than 512 GB of RAM. The good thing is that the RPC setup is ready to use, but the bad thing is that using it will mean a total outage of the nico1 and nico2 workers for almost a day, with nico1 being the only worker able to do DeepSeek-V3-0324 static quants. I assume we will do imatrix quants as soon as static quants are done. By then we will hopefully also know if StormPeak is still stable with the two new Intel Arc 770 GPUs I installed during this evening's maintenance window. I built two servers with Intel Arc GPUs at work and they both crash almost once every day due to Intel drivers, so I'm quite skeptical. A crash during RPC imatrix computation would be a disaster, as it could mean 20 hours of lost work. Thinking about it, we might be able to use the Intel Arc 770s for the RPC setup; I'm not sure if you can mix the NVidia backend with the Vulkan or SYCL backend, but if we can, it would mean 30 GB more RAM, which would allow nico1 to continue working during RPC imatrix computation.
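
For context, the whole RPC dance boils down to something like the following; host name, port and -ngl are made up here, and the exact flags may differ between llama.cpp builds:

# on the remote worker (e.g. CastlePeak): expose its RAM/GPUs over the network
./rpc-server -H 0.0.0.0 -p 50052
# on StormPeak: run imatrix against the local backend plus the remote RPC worker
./llama-imatrix -m DeepSeek-V3-0324.Q8_0.gguf -f calibration.txt -o DeepSeek-V3-0324.imatrix --rpc castlepeak:50052 -ngl 99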

Edit: Sorry, I see it's already in queue.

Yes it indeed is as you can see under https://hf.tst.eu/status.html

-7777  689 si DeepSeek-V3-0324                             run/static 3/12,Q2_K [133/1025] (hfu Q4_K_S)

Q4_K_S is done and already uploading, while Q2_K is currently being computed: https://huggingface.co/mradermacher/DeepSeek-V3-0324-GGUF
Download Page: https://hf.tst.eu/model#DeepSeek-V3-0324-GGUF

I won't consider anything below Q8 high quality.

Sure. But without data, we won't know if Q8_0 vs. Q2_K even makes a measurable difference. Maybe it does, but imatrices are such crude tools, I would be surprised if we didn't find that Q4_K_M or some other quant gives practically indistinguishable results from f16.

I'm moving from Kernel 6.8.12-9-pve to 6.11.11-2-pve. The i915 driver sucks for Intel Arc GPUs. I'm switching to Xe, which on Kernel 6.8 is too immature.

Sure. But without data, we won't know if Q8_0 vs. Q2_K even makes a measurable difference. Maybe it does, but imatrices are such crude tools, I would be surprised if we didn't find that Q4_K_M or some other quant gives practically indistinguishable results from f16.

Someday I will measure it. Maybe if the queue ever runs dry. Until then we should at least use Q8 when possible, as for Q8 we know quite certainly that the difference will be so small it will be impossible to ever tell. I even did my 405B quant quality measurements on quants made using a Q8 imatrix and I have not seen anything that would indicate it performing any worse.
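
When I get around to it, the comparison itself is simple enough; a sketch, with made-up file names:

# same quant type, built once from a Q8 imatrix and once from an f16 imatrix, compared on the same text
./llama-perplexity -m model.Q4_K_M.q8-imatrix.gguf -f wiki.test.raw -ngl 99
./llama-perplexity -m model.Q4_K_M.f16-imatrix.gguf -f wiki.test.raw -ngl 99

If the two numbers agree within noise, the cheaper imatrix source is good enough.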

@nicoboss btw., nico2 is enabled every morning via root's crontab on nico1 - if it pings, it will be enabled, so if we want it down (e.g. when doing rpc imatrix computations), we should comment that out (or you can just keep the nico2 vm down for example, as the enabling is done via that)

@mradermacher How to restart an imatrix task I killed? I forgot to initialize the NVidia GPUs before starting your container, so I had to kill them. llmc audit doesn't work for imatrix tasks. Sorry for this. After every host reboot, I need to remember to execute nvidia-smi on the host before starting your LXC container and if I'm really busy with other things I sometimes forget.

@nicoboss btw., nico2 is enabled every morning via root's crontab on nico1 - if it pings, it will be enabled,

Ah, that's why it didn't start today. I just happened to reboot StormPeak at that time to fix some issues with the new Intel Arc GPUs. I restarted StormPeak way too many times today before asking a Vulkan developer why Vulkan on my Intel Arc GPUs isn't working while everything else, like SYCL, is. In case anyone wonders: unlike NVidia, the Intel userland drivers don't come with Vulkan, so Mesa must be installed.

we should comment that out (or you can just keep the nico2 vm down for example, as the enabling is done via that)

We can just turn off the nico2 LXC container so nothing can turn off the CastlePeak host. We need the RAM for the CastlePeak RPC server anyways.

How to restart an imatrix task I killed?

By telling me, the only way currently. It's no problem :)

[keep the nico2 vm down] We can just turn off the nico2 LXC container

So.. turning it off but not keeping it down(?)

We need the RAM for the CastlePeak RPC server anyways.

Just say the word - I assume same set-up as before?

I'm now preparing the RPC setup. It will be the same as always. It will be using latest llama.cpp. I already updated the mradermacher branch.

@mradermacher The RPC imatrix computation setup is ready! :D

I configured it to maximize available memory on StormPeak so let’s hope that works without OOM but I think it will as this should be even tighter for 405B 16 bit than R1 8 bit and there it just barely worked.

I already updated the mradermacher branch.

Updating...

The RPC imatrix computation setup is ready! :D

... or not :) Also, reminds me that I will have to think about how to update llama on nico2 when it's down, also. Probably from nico1 when it enables it.

Updating...

Great! I just rebooted StormPeak into low-ARC mode so we get an additional 24 GB of RAM.

... or not :)

Everything is ready!

Also, reminds me that I will have to think about how to update llama on nico2 when it's down, also. Probably from nico1 when it enables it.

For now this is not needed. nico2 will stay turned off during RPC computation, as CastlePeak is hosting the RPC server using a different LXC container, but yes, updating it on wake would make sense.

nico1 is currently idle and all remaining lownice tasks seem to not require any imatrix, so now seems like the perfect time to start RPC. Also, timing-wise, if we start now we should both be awake when it finishes, which is really nice.

Starting it now also has the advantage that I might still be awake in case we OOM while loading the model and could adjust RPC servers accordingly.

For now this is not needed.

It is needed, because when nico2 comes up, it should not run on outdated llama.cpp till I remember to update it maybe in a few weeks :)

Case in point, in my euphoria I started the imatrix job before the update was finished, because I did run the update earlier and forgot that it had failed. Probably would have worked, but would have been a mistake nevertheless.

Thanks a lot for starting the imatrix computation.

Case in point, in my euphoria I started the imatrix job before the update was finished, because I did run the update earlier and forgot that it had failed. Probably would have worked, but would have been a mistake nevertheless.

It probably would have worked, but nice that you caught it. Sorry that I just happened to reboot at the exact time you made the update. I only checked that everything on the status page was idle but forgot about llama.cpp updates. I should have rebooted way earlier, when I set up the entire RPC setup, but I forgot that changing the ZFS ARC cache size requires a reboot: usually it never needs one, but if I want to make it quite low I have to put the value into modprobe, rebuild the initramfs and reboot, or it will be ignored.

It is needed, because when nico2 comes up, it should not run on outdated llama.cpp till I remember to update it maybe in a few weeks :)

No worries I will remind you if you forget.

kaos now has all the llama variants and should be able to update nico2 whenever it comes up again. in theory, of course.

Sorry that I just happened to reboot at the exact time you made the update.

You couldn't know, it's not a big deal. It did remind me to change things around, so whenever nodes become enabled, they will be auto-updated now. Other than the rpc link.

No worries I will remind you if you forget.

Right - and now that I thankfully have my own llama upstream maintainer, do you think you can add the current build number or git revision to the ggufs in quantize? A simple string in mradermacher.llama_quantize_build or so would suffice. That doesn't tell us what version of convert*.py did the thing, but we often wondered which version of llama.cpp did a certain quant, exactly, and that at least gives us the version at quantize time.

PS: forgot if I asked already, if yes, and it was too annoying, just ignore me. this is not a repeat nudge :)

nite. things should continue as normal after deepseek is done

haha, as we talk about it, the net delivers: https://retr0.blog/blog/llama-rpc-rce (just the rpc server though, apparently)

You couldn't know, it's not a big deal. It did remind me to change things around, so whenever nodes become enabled, they will be auto-updated now. Other than the rpc link.

llmc enable nico2 indeed performed the llama.cpp update but nico2 still shows as disabled no matter how many times I try to enable it.

I'll have a look soon :) In any case, the safe choice should be /root/nico2-resume on nico1, that's the daily cronjob (but it also only does llmc enable)

yup, it was broken (by the rsync, even)

StormPeak crashed again earlier this evening thanks to the Intel graphics drivers, which caused the status page to freeze. Luckily the timing couldn't have been better: DeepSeek-V3-0324 finished just 2 minutes before the kernel gave up after over an hour of struggling. I didn't do killall9 yet so you can investigate why it froze; the local scheduler still seems to be doing an amazing job at keeping nico1 busy.

It did remind me to change things around, so whenever nodes become enabled, they will be auto-updated now. Other than the rpc link.

Which is really cool. I loved watching nico2 update on enable.

nite. things should continue as normal after deepseek is done

It did, which was really cool. I was busy with work when suddenly the fans ramped up and I immediately knew the DeepSeek imatrix computation must be done. The only measurement that matters for fan speed is RAM temperature, so it always ramps up when we do non-RPC imatrix, but since I installed a DVD cover to redirect airflow over the RAM last weekend, it does so much less than before.

haha, as we talk about it, the net delivers: https://retr0.blog/blog/llama-rpc-rce (just the rpc server though, apparently)

Thanks for linking. What an interesting article. I really like reading these vulnerability writeups. I was quite active in the Nintendo Switch hacking scene.

yup, it was broken (by the rsync, even)

Thanks for fixing it. Today nico2 started as intended.

Right - and now that I thankfully have my own llama upstream maintainer, do you think you can add the current build number or git revision to the ggufs in quantize? A simple string in mradermacher.llama_quantize_build or so would suffice. That doesn't tell us what version of convert*.py did the thing, but we often wondered which version of llama.cpp did a certain quant, exactly, and that at least gives us the version at quantize time.

I don't really get why you would need a llama.cpp change for that. Couldn't you just add the version number like all other custom metadata, providing it as a command line argument to llama.cpp? But sure, I can add it if you prefer. Doing so would be really easy.

If you don't know the version, you can get it using .\llama-cli.exe --version, but keep in mind that on our fork the build numbers and git commits will differ from official llama.cpp, so maybe the timestamp of the most recent commit would make more sense.

PS: forgot if I asked already, if yes, and it was too annoying, just ignore me. this is not a repeat nudge :)

You already did, and I was just too busy to answer and then forgot, so good that you reminded me again. Please continue to do so for important things I forget. I'm sometimes really busy with my job and usually try to get back to you once things calm down, but it can happen that I forget about something you asked, so keep reminding me if I don't get back to you within a few days.

I didn't do killall9 yet so you can investigate why it froze

Annoyingly enough, because there was a hung rsh that actually needed a kill -9. The global scheduler was still running, as it was only the rsh process that the status daemon was waiting for.

I really like reading these vulnerability writeups. I was quite active in the Nintendo Switch hacking scene.

Me too (writeups) and cool (hacking scene :)

I don't really get why you would need a llama.cpp change for that. Couldn't you just add the version number like all other custom metadata, providing it as a command line argument to llama.cpp?

I can't get the version in a race-free way. I would either have to do versioned updates and then somehow track when I can get rid of the old versions, or stop-the-world-then-update, or llama.cpp adds it itself. Having the version be wrong on some quants would IMHO be worse than not having it at all. In any case, doing it without a llama change is likely orders of magnitude more complicated and buggy than the likely one- or two-line change inside llama.cpp. The only issue is that we don't track the convert version, which is probably even more important, but that requires both the script to set it and quantize to preserve it. (Well, in theory I could extract it and set it manually.)

maybe the timestamp of the most recent commit would make more sense.

Yeah, build numbers are likely irrelevant. The git commit id should be the way to go, imho. Maybe that and the most recent commit timestamp. I think llama.cpp already has the former somewhere.
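
Both are trivial to grab from the checkout at quantize time, e.g. (path is a placeholder):

git -C /path/to/llama.cpp rev-parse --short HEAD   # commit id
git -C /path/to/llama.cpp log -1 --format=%cI      # committer timestamp of that commit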

Better quants coming to llama.cpp/ik_llama.cpp soon: https://github.com/ggml-org/llama.cpp/pull/12557

@nicoboss since you have done a lot of work testing quant quality, this may make some of your findings outdated.

Of particular note to you both is this may finally put to rest issues with Q3_K being unusable.

Also on the topic of bitrot: it can also be caused by a failing HDD; thankfully, on ZFS that can be handled gracefully.

My ZFS dashboard shows me statistics.
Read Errors: 0
Write Errors: 0
Self Healed: ~400k (this most recent scrub was clean, but sometimes more get caught in scrubs)

Only one disk; I've been anticipating a drive failure and I'm somewhat prepared, but I still don't feel like proactively replacing it.

@nicoboss yeah, if unpickling is always unsafe, why does transformers pretend otherwise, e.g.:

_pickle.UnpicklingError: Weights only load failed. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution.Do it only if you get the file from a trusted source. WeightsUnpickler error: Unsupported operand 149

ikawrakow:

Oh, I used ik_llama.cpp to compare. It is possible that has become much faster than mainline (I haven't used mainline for quite some time). I started testing with DeepSeek-Lite, and almost gave up (your IQ4_NL quantization took 302.5 seconds with imatrix). ik_llama.cpp does it in 54.5 seconds.

intriguing...

in any case, my impression (primed by it being said explicitly) that llama.cpp does not care about imatrix quants is reinforced. and something must have happened between ikawrakow and llama.cpp if he even avoids commenting on the llama.cpp issues (https://github.com/ikawrakow/ik_llama.cpp/discussions/288)

and something must have happened between ikawrakow and llama.cpp if he even avoids commenting on the llama.cpp issues (https://github.com/ikawrakow/ik_llama.cpp/discussions/288)

This is leaning to the drama side of things, which I normally stay away from as it isn't productive, but I feel the desire to correct this, as it isn't true: https://github.com/ggml-org/llama.cpp/issues/10011 - here he is, well after he forked, helping mainline out.

Someone did point blank ask ikawrakow what happened, and he answered https://github.com/ikawrakow/ik_llama.cpp/discussions/256

From my perspective it does just seem like there is a difference in visions that became incompatible. Especially now with how I see them, and what the roadmaps of both look like I'm glad both exist.

@tdh111 thanks for your links.

note that ikawrakow disagrees with the compilade patch actually improving Q3_K quants

Someone did point blank ask ikawrakow what happened, and he answered https://github.com/ikawrakow/ik_llama.cpp/discussions/256

He did not actually answer; in fact, he completely avoided answering that (at least in the issue you provided), which actually supports my suspicion - if everybody were happy, why not say so. I might be biased, having had a bad history with llama.cpp devs as well, but his behaviour, I think, objectively matches my theory better than the "no bad blood at all" theory (neither of which is likely the true story).

It's clear, though, that ikawrakow is not interested in talking about it, and I think he shouldn't be prodded.

@nicoboss the backlog is essentially dead - "only" 10TB of static quants left, which will mostly run into the repo creation limit, causing big delays, but no other grief. And 180 70Bs, mostly caused by the sce craze, that I didn't imatrix quant (or quant at all).

Update: thinking about it, I'll stop submitting to nico2 and stop using it once it's done with its models. No point force-hitting the limit.

It's clear, though, that ikawrakow is not interested in talking about it, and I think he shouldn't be prodded.

I agree completely, I've been curious about it but never asked for that reason.

note that ikawrakow disagrees with the compilade patch actually improving Q3_K quants

his data says otherwise: https://github.com/ikawrakow/ik_llama.cpp/pull/295 - in that table he shows improvements for Q3_K, both for his version and for compilade's.

his data says otherwise,

No, it doesn't; his data is for --pure, not the standard Q3_K quant formats that llama uses and everybody generates. --pure formats are not relevant in the real world and perform worse than much smaller mixed quants. The same is true for compilade - he didn't test with real-world quants, apparently assuming the results would be the same. According to ikawrakow, they aren't. Besides, even compilade agrees that the improvements are only for a single model family, so even if the table were for Q3_K quants, it would not show otherwise.

@nicoboss I've re-queued most of the static-only quants that in the meantime got a couple hundred downloads, which was surprisingly many... but still the minority.

@mradermacher Could you please restart thePathos-Delta1-LLaMa-70B imatrix task. ArsParadox accidentally crashed it by starting his finetuning too early. I unfortunately still have no way to restart imatrix tasks by my own.

https://github.com/ggml-org/llama.cpp/pull/12634 just got merged. I'm so happy. I actually already used this model to create my own static Q4_K_M quants and ran inference on them for over 12 hours. Let's update my mradermacher branch and then do all the BailingMoeForCausalLM models.

I updated the mradermacher llama.cpp fork. Please update all the local llama.cpp installations. I will be preparing the SOURCE GGUFs of the 290B models manually in the meantime.
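
For reference, "manually" just means running the convert script by hand, roughly like this (the local model path and output type are assumptions):

python convert_hf_to_gguf.py /path/to/Ling-plus --outfile /tmp/quant/Ling-plus.gguf --outtype bf16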

updated llama.cpp. it also enables Qwen2_5_VLForConditionalGeneration and PLMForCausalLM. that'll be lots of models.

@nicoboss Qwen2_5_VLForConditionalGeneration is a vision model, yes? do you happen to know if it works with the same extraction process as Qwen2VLForConditionalGeneration ?

I will try to hold those models back.

llama.cpp quality is so frustrating. is it too much to ask to actually try out PLM model support on the actual PLM model before pushing it out?

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 291, got 290

I will be preparing the SOURCE GGUFs of the 290B models manually in the meantime.

Do they require a special process?

I've set the override file for Ling-plus-base on nico1, remove it once it is ready.

And the bailing support also hasn't apparently been tested with the actual bailing models. Ling-lite-base fails:

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 3072 (input tensor's size at dimension -2), but got split_sizes=[0, 0, 0]

I feel we should delay new model support by a week or so - it's like this every single time.

Try the large one. I put it under /tmp/quant/Ling-plus.gguf - sorry no softlink this time because bpool is full and I had no time to clean it up.

Actually, there would now be enough storage on bpool if you want to move it over, but I see you already started it. Also, for the Ling-plus.gguf imatrix you will have to either use Q8 or RPC due to it being 545 GiB. If you need the RPC setup just let me know, but Q8 is likely good enough.
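
The Q8 route would just mean making an intermediate quant of the source GGUF first to run imatrix on, something like:

./llama-quantize /tmp/quant/Ling-plus.gguf /tmp/quant/Ling-plus.Q8_0.gguf Q8_0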

Qwen2_5_VLForConditionalGeneration

After reading https://github.com/ggml-org/llama.cpp/issues/11483 I gave it a try, but indeed, qwen2_vl_surgery fails (ValueError: Trying to set a tensor of shape torch.Size([2048]) in "bias" (which has shape torch.Size([1280])), this looks incorrect.)

Not sure what a good way to proceed is.

Also, for the Ling-plus.gguf imatrix you will have to either use Q8 or RPC due to it being 545 GiB. If you need the RPC setup just let me know, but Q8 is likely good enough.

Well, didn't we say for "foundation" models we do full precision, if possible? But, yeah, I am fine with Q8_0 :=) But I always also want the base model to get the same treatment.

Try the large one.

If that refers to the bailing models, the large one was already converted by you. You mean I should convert it on my own? I can try that with the base model, unless you are already working on that. I don't think it's something in my set-up, though - I guess it's just another case of llama.cpp not even bothering to test with the original models.

his data says otherwise,

No, it doesn't; his data is for --pure, not the standard Q3_K quant formats that llama uses and everybody generates. --pure formats are not relevant in the real world and perform worse than much smaller mixed quants. The same is true for compilade - he didn't test with real-world quants, apparently assuming the results would be the same. According to ikawrakow, they aren't. Besides, even compilade agrees that the improvements are only for a single model family, so even if the table were for Q3_K quants, it would not show otherwise.

He merged in his PR which said this:

"In PR 12557 in mainline llama.cpp @compilade uses a (nearly) exhaustive search for optimality, whith correspondingly very long quantization times. One can arrive at about the same result much quicker as follows[...]

[see PR for actual math ]

Because of that, this kind of "first order" approximation is much faster than exhaustive search, as can be seen in the above table by comparing quantization run times between this PR and @compilade 's PR 12557, while achieving effectively the same quantization accuracy as measured by PPL.

Extending the above algorithm to the non-linear quants IQ4_XS and IQ4_NL is trivial.

". He considers it an improvement, and otherwise he would not have merged it in, when testing you use pure to test each quant at all tensors ( except token embeddings and output tensor which are always set to Q8_0 to prevent PPL collapse at low bpw). It is what you want to do if you are fundamentally altering the underlying math behind the quantization process.

https://github.com/ggml-org/llama.cpp/pull/12634 just got merged. I'm so happy. I actually already used this model to create my own static Q4_K_M quants and ran inference on them for over 12 hours. Let's update my mradermacher branch and then do all the BailingMoeForCausalLM models.

@nicoboss

I see your comments on the PR; how would you rate Ling as a model (in comparison to others you've liked)? I may want to run it myself.

@tdh111 I am just the messenger - your guessing at what he might or might not think simply cannot trump what he explicitly wrote:

So, when using --pure, it may appear that one gets an improvement because the new method being tested happens to do better on exactly these tensors, but worse on many others.

Yes, you might test on pure quants, but these quants are useless. What counts is the actual Q3_K* quants, which do not get improved by compilade's patch according to him, because it also affects other quant types.

Please, if you want to continue to argue this, argue it with him. Maybe he lied when he wrote that, maybe he changed his mind without saying so - I find it moot to discuss this - point being that I was right in what I reported, even if you clearly don't like it - but it's not my opinion, and I correctly reported what he wrote.

I would also like to split these discussions into another discussion topic, so we can keep this topic strictly related to the quanting business. I've created a new topic for this:

https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/5

Well, didn't we say for "foundation" models we do full precision, if possible? But, yeah, I am fine with Q8_0 :=) But I always also want the base model to get the same treatment.

True. Let's stick to that rule. Honestly, I really prefer doing it using F16, but I felt bad bothering you with another complicated RPC setup. Let's do it in F16 using RPC, as the model is awesome and absolutely deserves it. It is also relatively tiny and so it is no problem to quickly run it over RPC. The experts are relatively small, so doing the imatrix in Q8 could have some negative impact.

If that refers to the bailing models, the large one was already converted by you. You mean I should convert it on my own?

Sorry for the confusion. I meant trying to quantize and imatrix it to see if the source GGUF I provided works or has the same issue as the base model that failed.

I can try that with the base model, unless you are already working on that. I don't think it's something in my set-up, though

I think the same. I will provide the large base model later today, if it even converts, which it probably will not.

I guess it's just another case of llama.cpp not even bothering to test with the original models.

I'm in fact the only reason Ling-plus worked, as I tested it and gave feedback that it was broken before the merge request got merged. I had no idea he also didn't test the base models, as he only mentioned Ling-plus being too large for him to test. If I had known, I would have tested all of them.

Bailing is one of the foundational models I really enjoy. I ran inference on it for quite a while, generating over 1 MB of text.

@mradermacher The RPC imatrix setup is ready to be used for Ling-plus. I have freed up a lot of memory and you will be able to keep quantization tasks running for sure, as the model only barely exceeds the single-node memory limit. I disabled the CastlePeak shutdown trigger so you don't accidentally turn it off while imatrix computations are running.

No idea why Ling-lite and Ling-Coder-lite are tokenizing the input for such a long time during imatrix computation - I hope they aren't stuck.

both ling models were stuck at tokenizing input, of all things. inspires confidence. I am sorry, I only saw it now. I don't think I will be able to do the required conversions etc. before I get up again :/ I will see though.

I'm in fact the only reason Ling-plus worked, as I tested it

I saw that, yes:

I had no idea he also didn't test the base models

As usual, it looks as if they didn't test any of the actual models.

Ah, wait, yes, I don't need to convert, it's already there. I need the base model. If you manage to get it ready till early morning, and if I manage to get ling-plus going soon, it would work out perfectly.

Hmm, maybe all I need to do is update the job and everything is automatic... I'll update scheduling parameters so three quant jobs will run on nico1 (deepseek as bg. and two normal ones), as I hope there will be enough memory. I moved it to nice 2, so it should automatically start doing its thing after most of the imatrix queue is done. Good that everything is automated...

Hmm, I don't use irun for imatrix jobs. Sucks; which one is the right one to kill (another ling-lite model clogged the queue)?

Note to self: fix that tomorrow!

Uuh, and today is the first day I can watch nico2 do a scheduled shutdown.

Ah, right, good that I saw that, it's 17:00... hmmm...

Bailing is one of the foundational models I really enjoy.

I count that as a manual request for the model :)

@mradermacher The RPC imatrix setup is ready to be used for Ling-plus. I have freed up a lot of memory and you will be able to keep quantization tasks running for sure, as the model only barely exceeds the single-node memory limit. I disabled the CastlePeak shutdown trigger so you don't accidentally turn it off while imatrix computations are running.

Uuh, and today is the first day I can watch nico2 do a scheduled shutdown.
Ah, right, good that I saw that, it's 17:00... hmmm...

Oh no, I feel so bad that you stayed up so long just to see it happen, likely having forgotten about my previous message.

I count that as a manual request for the model :)

I really want to do them. All the Bailing models are quite fantastic. They are massive and intelligent while being fast to run on the CPU due to being MoE. They have a very nice balance of being intelligent while still running at a reasonable speed despite not fitting in GPU memory. 405B and R1 are just too slow for me to use unless I absolutely have to, and then I prepare all my prompts and run them overnight, while Ling-plus with its 6 tokens per second generates text faster than I read, making it just fast enough for me to be willing to use it in real time, and I could even speed it up further by using speculative token generation and GPU offloading.
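
For a rough idea of how I run it (quant, layer and thread counts are just what I happen to use, not a recommendation):

# partial offload: as many layers as fit on the GPUs, the rest stays on the CPU
./llama-cli -m Ling-plus.Q4_K_M.gguf -ngl 20 -t 32 -c 8192 -p "Summarize the trade-offs of MoE models."
# speculative decoding would additionally need a small draft model (see the llama-speculative example)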

Hmm, I don't use irun for imatrix jobs. Sucks; which one is the right one to kill (another ling-lite model clogged the queue)?

You can always just check nvtop to see the PID to kill.

Ah, wait, yes, I don't need to convert, it's already there. I need the base model. If you manage to get it ready till early morning, and if I manage to get ling-plus going soon, it would work out perfectly.

I will make it ready. It should be ready in around 2 hours.

both ling models were stuck at tokenizing input, of all things. inspires confidence. I am sorry, I only saw it now. I don't think I will be able to do the required conversions etc. before I get up again :/ I will see though.

They are aware of the issue and are currently working on a fix.

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 3072 (input tensor's size at dimension -2), but got split_sizes=[0, 0, 0]

They are aware of this and the errors reported by the users as well. Please follow https://github.com/ggml-org/llama.cpp/pull/12634 as things are moving along quite fast.

Oh no, I feel so bad that you stayed up so long just to see it happen, likely having forgotten about my previous message.

I didn't stay up for that, no :)

All the Bailing models are quite fantastic.

Unfortunately, even the Ling-plus model entered an endless loop during tokenization. We had two reports of another ling* model crashing with "invalid regex", so probably some regex is broken, fails to match, and you get a 0-length token or so. rinse, repeat (https://huggingface.co/mradermacher/Ling-lite-GGUF/discussions/2#67eaeffe7382053ae1241ee5)

On the other hand, you ran inference with the .gguf, and that seems to have worked, correct?

You can always just check nvtop to see the PID to kill.

Hmm.

nvtop: ./src/extract_gpuinfo_intel.c:228: parse_drm_fdinfo_intel: Assertion `!cache_entry_check && "We should not be processing a client id twice per update"' failed.

(that's at home). Nah. I just did it the old fashioned way with ps and grep...

They are aware of this

No doubt they will eventually fix it. But they always release it broken. Anyway, everything is still queued. And sorry, it's just not how I release software - I am just exasperated.

Any idea what we should do with qwen 2.5 vision models? supposedly, vision extraction should work with most sizes, but not for me.

I guess log and only do the text for the time being?

@nicoboss I changed nico1 to single-quant job, because it will primarily do 70B's, and they should fit nicely into the arc cache.

Indeed, it basically does zero I/O, other than occasional writing of tensor data. Still, the CPU is often 80% idle when it shouldn't be waiting for I/O - is quantize this inefficient? I was under the impression that it has no trouble keeping cores busy as long as tensor data is in memory - do you have any idea? In the meantime I'll increase to two jobs again.

@nicoboss I changed nico1 to single-quant job, because it will primarily do 70B's, and they should fit nicely into the arc cache.

I assume you mean nico2. nico1 doesn't even use ZFS and so no ARC cache.

Indeed, it basically does zero I/O, other than occasional writing of tensor data. Still, the CPU is often 80% idle when it shouldn't be waiting for I/O - is quantize this inefficient? I was under the impression that it has no trouble keeping cores busy as long as tensor data is in memory - do you have any idea? In the meantime I'll increase to two jobs again.

Sorry, I yet again forgot to manually run my boot.sh this morning. I have no clue why cron doesn't execute it on boot, but without this script running we have almost zero ARC cache, explaining why you barely see any CPU utilisation.

/etc/crontab

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
@reboot /root/boot.sh

/root/boot.sh:

#!/bin/bash
cpupower frequency-info
cpupower frequency-set -u 3200000
cpupower frequency-info
echo 193273528320 > /sys/module/zfs/parameters/zfs_arc_max
numfmt --to iec --format "Set ZFS ARC to %3.2f" $(cat /sys/module/zfs/parameters/zfs_arc_max)
root@CastlePeak:~# stat boot.sh 
  File: boot.sh
  Size: 247             Blocks: 9          IO Block: 512    regular file
Device: 0,27    Inode: 290397      Links: 1
Access: (0777/-rwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-03-10 13:07:34.281358058 +0100
Modify: 2025-03-13 16:20:35.299606772 +0100
Change: 2025-03-13 16:20:46.101826699 +0100
 Birth: 2025-03-10 13:07:34.281358058 +0100

Not only does cron fail me but so does modprobe. Maybe I forgot to rebuild the initramfs, but I'm quite sure I did. It is so ridiculous that I have to manually execute boot.sh every day just because all automation fails me.

/etc/modprobe.d/zfs.conf

options zfs zfs_arc_max="2147483648"

Ah wait, I see. /etc/modprobe.d/zfs.conf has the wrong value. That explains why modprobe doesn't work. So actually the issue is just cron in that case.
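A minimal sketch of the fix, assuming the 180 GiB value from boot.sh (193273528320 bytes) is what the module option should carry and that this is the usual Debian/Proxmox initramfs setup:

# write the boot.sh value into the modprobe config and rebuild the initramfs
echo 'options zfs zfs_arc_max=193273528320' > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all    # so the option is already there when zfs loads from the initramfs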

Interesting, so cron thinks something is wrong with the syntax of my crontab file:

root@CastlePeak:~# cat /var/log/syslog | grep -w 'cron'
2025-04-01T07:01:05.326155+02:00 CastlePeak systemd[1]: Started cron.service - Regular background program processing daemon.
2025-04-01T07:01:05.328109+02:00 CastlePeak cron[3195]: (CRON) INFO (pidfile fd = 3)
2025-04-01T07:01:05.329151+02:00 CastlePeak cron[3195]: Error: bad command; while reading /etc/crontab
2025-04-01T07:01:05.329181+02:00 CastlePeak cron[3195]: (*system*) ERROR (Syntax error, this crontab file will be ignored)
2025-04-01T07:01:05.330370+02:00 CastlePeak cron[3195]: (CRON) INFO (Running @reboot jobs)

Maybe I'm stupid. I don't see a syntax error. Here is the entire file:

root@CastlePeak:~# cat /etc/crontab
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; }
47 6    * * 7   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.weekly; }
52 6    1 * *   root    test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.monthly; }
#
@reboot /root/boot.sh

Nice how it gives you a line number. Not.

As for your problem, not sure, but a) does your crond support @reboot? and b) I don't think you can leave out the user

And a free admin pro-tip: don't patch /etc/crontab, use your own file in /etc/cron.d (or root's crontab) - that way, you won't have conflicts on upgrades, and you can name your files descriptively, e.g. /etc/cron.d/mradermacher

PPS: /etc/rc.local, on debian, is likely the safer way to do that, altogether
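For illustration, a hypothetical /etc/cron.d/mradermacher along those lines (if you go the cron.d route rather than rc.local) - note the user field, which system crontabs and cron.d files require:

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# m h dom mon dow user command
@reboot root /root/boot.sh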

assume you mean nico2.

Yup.

we heve almost zero ARC cache explaining why you barely see any CPU utilisation.

Wait, what? If you don't have a big arc cache, ZFS simply idles, with no I/O and no CPU? In what world does that make sense? Or is it just that I can't see it.

I'll reduce the job number again and see if anything has changed.

a) does your crond support @undefined

Yes it does

b) I don't think you can leave out the user

Ah you are right:

Jobs in /etc/cron.d/

       The jobs in cron.d and /etc/crontab are system jobs, which are
       used usually for more than one user, thus, additionally the
       username is needed.  MAILTO on the first line is optional.

Somehow, with two jobs, it's now 90% idle for 10+ second stretches.

(Maybe because it's two static + imatrix jobs now.)

@undefined is funny

Wait, what? If you don't have a big arc cache, ZFS simply idles, with no I/O and no CPU? In what world does that make sense? Or is it just that I can't see it.

I think you just can't see it with whatever tool you use. I can see it on the Proxmox web interface under IO delay. For some reason IO delay (the time the CPU spends waiting for IO) increases but the IO bandwidth stays almost the same.

Somehow, with two jobs, it's now 90% idle for 10+ second stretches.

It takes some time for ARC cache to fill up with the data you need.

No, there doesn't seem to be a change. It's currently uploading and doing more I/O, but it seems llama.cpp really can't keep the CPU busy even when I/O is essentially free. I'll use two jobs. That sucks, because otherwise, one job would be ideal at the moment.

Anyway, thanks for your input again, signing off

I just realized that I did manually run boot.sh late morning today. So it should have worked. So the zero IO you were seeing was real. It really seems like a llama.cpp issue of still being inefficient even if all the data it needs is in RAM.

Wow I got mentioned XD

It really seems like a llama.cpp issue of still being inefficient even if all the data it needs is in RAM.

Yeah, so my mental model of "it loads the tensor, runs x threads in parallel on uniform data with perfect sharing, cleaning up + saving" is wrong. But maybe there is a good reason for that, so no criticism here :)

@undefined sorry, I thought I had looked and determined that you don't exist. I was wrong.

I've replaced the standard rsh-client with rsh-redone-client (which is differently buggy, and I normally avoid it, but I don't use it interactively on kaos, so I don't care). In the long run, we'll have to have some kind of llmjob server. But maybe that one doesn't hang so often. (the statusd was hanging waiting for it once more)

@mradermacher Please update to latest llama.cpp on our fork and restart all failed Bailing imatrix tasks. https://github.com/ggml-org/llama.cpp/pull/12677 got merged which finally fixes all the remaining Bailing issues.

Please make sure to use RPC when restarting Ling-plus. All RPC servers are updated to latest llama.cpp and ready. We maybe should also unblock Ling-plus-base and setup its imatrix task to use RPC as well.

I also moved Ling-lite-base from marco to nico1 and manually provided the GGUF as https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2 is not yet merged.

HuggingFace implemented a really cool feature that shows you exactly which quants are compatible with your hardware, for every single quant we uploaded as a single file:

(screenshots: grafik.png)

I see that Ling-lite-base imatrix computation is currently blocked, which is a good thing as you have not yet updated to our latest llama.cpp version, but please make sure to unblock it once you have done so.

I think we should change the timeofday end time to 18:00, as due to the switch to summer time last weekend it is still daytime and sunny outside despite it being 17:30.

I'll take care of llama/ling when I am mentally fit enough again.

Ling-lite-base

Very cool of you!

HuggingFace implemented a really cool feature that shows you exactly which quants are compatible with your hardware, for every single quant we uploaded as a single file:

All quants are "compatible". And if the question is "will fit", then the feature is useless, imho, because it doesn't take any parameters into account. If all it does is compare quant sizes with some fixed vram limit, I don't see the point.

I think we should change timeofday end time to 18:00

I will (and hopefully not forget nico2 once it is available again. Any ETA on it? I was kind of depending on it...)

I will (and hopefully not forget nico2 once it is available again. Any ETA on it? I was kind of depending on it...)

nico2 was running almost the entire day until you turned it off at 17:00. It just started a bit later than usual as I tried enabling kernel crash dumps which caused it to get stuck during boot and it took a few hours for me to notice. But it was running late morning and the entire afternoon.

Actually just set timeofday end-time to 19:00 as it in fact is still sunny outside - summer time is so strange.

I turned CastlePeak back on and restarted the RPC server there, as it went down together with CastlePeak at 17:00, but we need it for Ling-plus.

nico2 was running almost the entire day until you turned it off at 17:00.

That is good to hear (and no, I don't turn it off myself :)

as it went down together with CastlePeak at 17:00, but we need it for Ling-plus

It was a valiant attempt :)

Ling-plus

Failed due to NaNs showing up during imatrix training. That happened a lot in the past, until llama.cpp fixed things; now it is a rare thing. According to the llama.cpp devs, this makes ling-plus either a completely useless model incapable of inferencing, or us liars and deceivers. In case you are now going "what the fuck" - that was before your time, when I dared to report these cases to llama.cpp. I was then publicly accused of tinkering with the model, the code, or the results, or of having "wrong" imatrix training data, or not enough (which obviously is bollocks, because shorter material increases the chances of success) - and was ordered(!!) in no unclear terms not to distribute such quants. It caused a major ruckus in the social sphere (e.g. reddit, 4chan), where mradermacher was accused of somehow producing broken quants on purpose, backed by the derogatory llama.cpp developer comments. I never received an apology when it turned out to be llama.cpp bugs.

That's the source of my bitterness towards llama.cpp.

And it just all came back, showing how much more it affected me than I thought. At least I didn't throw the towel right then and there, but it was close.

This all sucks.

Don't feel like you have to comment on the above.

Anyway, in the dark past, I used a reduced imatrix training set, because the problems only appeared after 94 chunks (that's the reason I had half and quarter training sets), but I guess the thing to do is to not have imatrix quants of ling-plus at the moment, because clearly something is suboptimal in the source model.

Actually just set timeofday end-time to 19:00 as it in fact is still sunny outside - summer time is so strange.

Well, how about doing something like sunrise + x to sunset - y? It can't be much more complicated than a sine function or so. Maybe with some hard limits in winter. If you think it's an OK idea, I will look into it, because it sounds interesting for a change.
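To gauge the effort, here is a rough sketch of the usual sunrise-equation approximation (not a proposal for the actual implementation; LAT is a placeholder, and timezone/DST handling is ignored):

LAT=47.4                      # latitude in degrees, placeholder
DOY=$(date +%j)               # day of year
awk -v lat="$LAT" -v doy="$DOY" 'BEGIN {
  pi   = 3.14159265358979
  decl = -23.44 * cos(2*pi/365 * (doy + 10)) * pi/180   # solar declination, radians
  phi  = lat * pi/180
  x    = -sin(phi)*sin(decl) / (cos(phi)*cos(decl))     # cos of the sunset hour angle
  if (x < -1) x = -1; if (x > 1) x = 1                  # clamp for polar day/night
  ha   = atan2(sqrt(1 - x*x), x)                        # acos(x)
  printf "daylight ~ %.1f h, sunset ~ %.1f h after solar noon\n", ha*24/pi, ha*12/pi
}'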

that's the reason I had half and quarter training sets

Right, that was also the reason behind adding an abort() when ppl becomes NaN, and having regular autosaves, which I recently disabled. That's how rare that case had become.

I have started the rpc imatrix process for ling-plus-base. It is not uncommon that the NaN problem appears only for the instruct model. Hope I didn't do anything wrong - but I assumed if rpc is working it can be used.

And sorry in general, have too little time for detail work.

I have started the rpc imatrix process for ling-plus-base. It is not uncommon that the NaN problem appears only for the instruct model. Hope I didn't do anything wrong - but I assumed if rpc is working it can be used.

Yes I have not yet increased the ARC cache on CastlePeak and all RPC servers are still running so everything is still ready for RPC.

The big question is why it is so slow (12h). Shouldn't it be way faster? (with luck it will crash after 94 chunks anyway :)

Ling-plus-base failed at the same block, so probably it's actually related to the specific imatrix chunk. I suggest we do static only then :(

We could consider re-adding the static IQ3* quants? Maybe they are fixed...

I will try some other imatrix datasets or make llama.cpp skip NaN chunks. It seems like terrible design that it refuses to compute the imatrix just because of some random NaN.

As a start we should try applying https://github.com/ggml-org/llama.cpp/pull/11773 - if it works, I consider merging this into our fork, as it will fix many remaining imatrix NaN cases. NaNs are always an issue of llama.cpp's side not properly handling certain edge cases, and never us or our imatrix dataset. In the same PR ikawrakow explained how NaNs are caused by llama.cpp's hacky implementation.

I will try some other imatrix datasets or make llama.cpp skip NaN chunks.

That can't work - once NaN, forever NaN. The only way is to not run into NaNs in the first place. Again, this is why I did an autosave every 10 chunks, but I don't think we can do that with the patched version, we'd have to fix it first.

But since this is almost certainly a model defect, I think it's probably not worth it.

llama.cpp's side not properly handling certain edge cases, and never us or our imatrix dataset.

Or, very commonly, broken weights in models. Which likely is the case here.

"I am not interested in maintaining this code."

We know. (slaren)

https://github.com/ggml-org/llama.cpp/pull/11773

I wonder why it isn't applied. But that code isn't executed during imatrix training, so I give a zero^W1e-5 chance of helping?

We can probably search for models that actually failed this way with llmc why, though. I am sure we have a few (but not many).

I am also glad you are queuing models again, although there seems to be a perceived 100% failure rate (I only really see the failures :-)

What would be interesting, and I wonder if you are already doing this effectively, is to see if we have any non-imatrix models that are popular enough (by some metric) to warrant an imatrix?

@RichardErkhov @nicoboss

We are through the queue. Most of the remaining models in the queue are held back qwen 2.5 vl models where a transformers release is likely needed.

That means rich1 will essentially become idle regularly, and we will have to find a different mode of use - we still depend on rich1, but we need a better way of sharing than "mradermacher takes over".

For example, once done, I could reduce the number of jobs to one, and thus guarantee at least half the box's memory will be available. Or something else. Discussion appreciated.

PS: yes, I still want to redo classical pre-2024 models, and I have an idea on how to find broken models and eradicate/replace them, but neither of these will create the massive queue we had for the last half a year.

PPS: and the next thing will be moving nico2 to some other mode, such as permanently off, or on only on (manual) demand

PPPS: wow, this feels good. I wish the queue wasn't clogged with the 2.5 models and was completely empty, just for the psychological effect - couldn't have done it without you :)

nice, well I guess we should prioritise quanting on rich1 whenever we get new models, and then if the queue becomes too big to handle, spread to other servers. Because nico is paying for the electricity, and I am paying a fixed price, so might as well save nico some money and utilize my already-spent money haha. Well, I will continue with my queue, but it obviously can't do 24/7 CPU just because of my amazing code and because my queue is sorted by size lol, well it just needs a bit of fixing and everything will be fine. Well, that was a nice half a year of 110% cpu load, I guess we should queue 1990-2024 models lmao

or we can imatrix all of the ggufs lmao, just literally abuse nico gpus for a month to imat the whole 20k+ models and store them, and just make rich1 process all of them

We are through the queue.

What a moment to celebrate! Thank you everyone for your massive effort! We all did such an amazing job!

Most of the remaining models in the queue are held back qwen 2.5 vl models where a transformers release is likely needed.

Qwen2.5 VL support for llama.cpp is basically ready and just waiting for review: https://github.com/ggml-org/llama.cpp/pull/12402 - hopefully the llama.cpp team didn't miss that it is ready for review because I did.

PS: yes, I still want to redo classical pre-2024 models, and I have an idea on how to find broken models and eradicate/replace them, but neither of these will create the massive queue we had for the last half a year.

We could dry-run them and see if they still work. But I would say if they are of big historical significance, requantizing might make sense even if they are not broken, just for the sake of having better quality quants.

PPS: and the next thing will be moving nico2 to some other model, such as permanently off, or on only on (manual) demand

I think we could just make it automatically turn on whenever we have high-priority models to work on.

PPPS: wow, this feels good.

It indeed does. It is such a relief that after over 5 months the queue is finally empty again! Having such a massive backlog was quite overwhelming, and I thought many times we might never get them all done.

couldn't have done it without you :)

We couldn't have achieved this without you who put so much work and effort into finding all these awesome models and maintaining this surprisingly complex infrastructure.

well said, @nicoboss , thanks to everyone guys !

the queue is going down right in front of me. half an hour ago it was 119
81 additional job(s) in wait queue (total estimated size 7.022TB, imatrix 6.809TB, lownice 79)

I guess we are going to cook everything by tomorrow morning

@mradermacher fun question: what is the size of mradermacher?

We could dry-run them and see if they still work.

That is exactly what I plan to do - write the header-downloader to recreate "stub" models, then dryrun them. The downloader I want anyway for other reasons. Just when is the question :)
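Roughly what I have in mind, as an untested sketch - REPO/FILE are placeholders, and the real script would still have to walk the GGUF metadata to find where the header actually ends:

REPO=some-org/some-model      # placeholder
FILE=model.gguf               # placeholder
URL="https://huggingface.co/$REPO/resolve/main/$FILE"
curl -sfL -r 0-33554431 -o "$FILE" "$URL"     # Range request: first 32 MiB, hopefully enough for the header
SIZE=$(curl -sfIL "$URL" | tr -d '\r' | awk 'tolower($1)=="content-length:" {n=$2} END {print n}')
truncate -s "$SIZE" "$FILE"                   # sparse-extend to the original size; tensor data reads back as zeros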

1990-

Yeah :)

I think we could just make it automatically turn on whenever we have high-priority models to work on.

Right, except the "have high-priority models to work on" is a bit difficult to formulate. In any case, we'll try with manual queueing for a while, and see how much we will need it.

the queue is going in front of me. half an hours ago it was 119

Quite a few of the queued models are (invisibly) locked due to being qwen 2.5, and the 7Bs specifically need the next transformers release, apparently.

fun question: what is the size of mradermacher?

Measured in TB, it's 5077.057 (our count, can't find where hf shows it anymore).

As for saving power, essentially we already have a preference system - rain is always preferred over nico1 for models it can support for example. I suggest we change things so that nico1 only does jobs with nice level < -1400 during the night (right now, it's <= 50). That would exclude daily models. Not sure rich1 can do the work, though, it's considerably slower than nico1, nico2 or marco (in practise, it is probably faster than marco because it has a faster disk). But we can surely try.

Just pushing high priority models to other nodes first will not do, as we will always have bursts with more models than nodes, so nico1 would always get some.

But yeah, my point is, we can surely reduce the load on nico1, e.g. at night, and probably not use nico2 almost all of the time.

This can also be modified, e.g. I could make two "daily" groups of models, let's call them daily junk and daily interesting, e.g. at -1000 and 0. I'll play around with it over the next few weeks and see what works.

I mean, from now on we won't get as many models with high priority, so we don't need ASAP quants, so the normal queue can just stay on rich1, and anything high priority can go to another server. That way we keep rich1 loaded at all times and nico with more money =)

For some reason I expected more than 5077 TB lol

@mradermacher Text-only Llama 4 support got merged: https://github.com/ggml-org/llama.cpp/pull/12791. I updated our llama.cpp fork. Please update the workers. I already have the models locally and plan on queuing them tomorrow to nico1 using manually provided GGUFs.

Llama 4 is affected by https://github.com/ggml-org/llama.cpp/pull/12727 as well. I wonder if we, like bartowski, should risk using this change despite it not being merged yet. It results in much better quants for all MoE models with an expert count other than 8. I think we should, as we will have to redo Llama 4 anyway once its missing features are implemented.

I will now start preparing the Llama 4 source GGUFs. Latest llama.cpp now requires libcurl4-openssl-dev to build so in case it fails to build you know what to install.
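If the build complains about curl, something like this should cover it (I'm assuming -DLLAMA_CURL is still the flag the current tree uses to toggle it):

apt-get install -y libcurl4-gnutls-dev   # or libcurl4-openssl-dev; any libcurl dev package works
# or, to build without curl support entirely:
cmake -B build -DLLAMA_CURL=OFF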

I mean from now we wont get as many models with high priority,

Why would we get fewer high priority models?

I wonder if we like bartowski should risk using this change despite not being merged yet.

I think we should wait till it's merged and the kinks are gone in a "released" version. Why? Because every single time in the past, we had to redo the quants anyway. Different this time? Sure, maybe.

We have to redo them anyway for sure no matter what we do. The current state of Llama 4 support is quite bad. The idea was to go for slightly better quants than what is currently on master so we at least don't have to redo them again until things are stable but I'm fine with using master.

Different this time? Sure, maybe.

No, and initial support has maybe never been as bad as for Llama 4. So many features are missing in the current implementation. The thing is that this model is insanely popular and everyone wants to try it, so just waiting a few months for everything to be implemented is not really an option. I also really hate that every time they implement one of the many missing Llama 4 features, users will demand a requant.

Let's wait at least a day, till hopefully somebody had time to review 12727

Sounds good?

So many features are missing in the current implementation.

I am not talking about features, I am talking about the model being outright broken every single time, and being fixed within a week.

Thing is that this model is insanely popular and everyone wants to try it

Well, here we disagree :) I want to deliver good work, based on official llama.cpp releases. I don't want to chase popular models and provide the experimental testbed for them, don't want to invent experimental quants etc. etc.

I am not concerned about missing features. I am not terribly concerned with requants. I am concerned about the quants being outright broken, just as it was the case with llama 3, 3.1, ... and practically every new non-llama architecture.

I also feel bad about not having an obvious improvement like 12727, but I feel terrible about it not even having been reviewed.

I'll let you decide then, to take the risk or not - say the word. But we should at least wait on a resolution of 12727, IMHO.

I just checked and nobody has uploaded a GGUF for Llama-4-Maverick-17B-128E-Instruct so far despite the really high demand, so let's just do it without 12727 using the latest version of our fork. I'm fine with waiting a day, but I see it as unlikely that 12727 gets reviewed in the next day, as the last change was 3 days ago and nobody is assigned as reviewer yet. Llama-4-Maverick-17B-128E-Instruct.gguf will be ready in less than 2 hours.

Slightly related only: since RPC parameters are somewhat standardized by now, I could provide a llmc command to "reconfigure" an imatrix job into a "big rpc imatrix job", and maybe also add some more imatrix management commands. Of course, the timing is a bit bad, but we can discuss - I love making you more independent.

I just checked and nobody has uploaded a GGUF for Llama-4-Maverick-17B-128E-Instruct so far despite the really high demand

Maybe some people are more reasonable... Anyway, preparing the llama build now. If you want to do it without 12727 there is no point in waiting for it. Although the likely result might be publishing the first and worst quant...

I will now start preparing the Llama 4 source GGUFs. Latest llama.cpp now requires libcurl4-openssl-dev to build so in case it fails to build you know what to install.

It's not installed on my build machine, and the build succeeded - are you sure it needs openssl specifically? debian defaults to gnutls.

A command to restart failed imatrix tasks would be super useful. Reconfiguring an imatrix job to use RPC would be awesome as well. Maybe a way for me to trigger an update to the latest llama.cpp, or have it auto-update, would be great too, as currently I always have to ask for it, and with all the Llama 4 changes this will likely require us to update way more often than usual.

I have not forgotten about adding the llama.cpp version metadata. I'm just really busy at the moment and so had no time for it so far, but it seems like we can use the git tags for consistent versioning, as they are synced over from llama.cpp upstream.

It's not installed on my build machine, and the build succeeded - are you sure it needs openssl specifically? debian defaults to gnutls.

No, every libcurl dev package will work. I just went with openssl as I was not aware that gnutls is the default.

llama.cpp has been updated, but I didn't test it, for lack of time. best see if new jobs work, then feel free to queue.

Llama-4-Maverick-17B-128E-Instruct is now being quantized. We will need to use the imatrix setup for Llama-4-Maverick-17B-128E-Instruct. Please restart the MM-Thinker-72B imatrix task if you have time. Llama-3_1-Nemotron-Ultra-253B-v1 has quite a strange failure we will need to look into.

Let's summarize important open llama.cpp pull requests so we can better keep track of them:

Support Qwen3 and Qwen3MoE : https://github.com/ggml-org/llama.cpp/pull/12828
DeepSeek V2/V3 MLA implementation: https://github.com/ggml-org/llama.cpp/pull/12801
Update llama-quant.cpp llama_tensor_get_type with DeepSeek friendly modifications: https://github.com/ggml-org/llama.cpp/pull/12727
Add Qwen2.5VL support: https://github.com/ggml-org/llama.cpp/pull/12402

18634 Segmentation fault

[MN-Thinker-72B] Hmm.. what would cause a segfault... other than, say, starting the imatrix process in the split second where rsync does the renames (or a bug in llama.cpp).

hmm, there has been a transformers release, I will test if that fixes the qwen2.5vl problems (text-only).

Llama-3_1-Nemotron-Ultra-253B-v1

Well, I suspect it's a slightly tweaked architecture and/or simply broken or missing support for it in llama.cpp. Cheap to retry, maybe the transformers update helped. Or will help, once I've verified it happened.

wait, why does pip3 give me transformers 4.46.3, which is really old.

aha... so llama.cpp is fine with newer transformers versions, kind of:

/llmjob/llama.cpp/requirements/requirements-convert_legacy_llama.txt:transformers>=4.45.1,<5.0.0

However, when I pip3 install -U -r /llmjob/llama.cpp/requirements.txt, then it downgrades to transformers 4.46 (and uninstalls 4.51), so maybe some other dependency force-downgrades it.

But I don't understand how pip3 works - do dependencies only matter at the time of install, and are they afterwards ignored? Because when I pip3 install -U transformers afterwards, it happily upgrades to 4.51 again. It also looks weird when I -r requirements.txt, as if it starts with 4.51 and then searches down in versions till it finds 4.46 (total python noob here btw., don't assume I understand what I am doing):

Requirement already satisfied: transformers<5.0.0,>=4.45.1 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3)) (4.46.3)
Collecting transformers<5.0.0,>=4.45.1
  Using cached transformers-4.51.1-py3-none-any.whl (10.4 MB)
Requirement already satisfied: gguf>=0.1.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 4)) (0.14.0)
Requirement already satisfied: protobuf<5.0.0,>=4.21.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 5)) (4.25.6)
Requirement already satisfied: torch~=2.2.1 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-convert_hf_to_gguf.txt (line 3)) (2.2.2+cpu)
Requirement already satisfied: aiohttp~=3.9.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 1)) (3.9.5)
Requirement already satisfied: pytest~=8.3.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 2)) (8.3.5)
Requirement already satisfied: huggingface_hub~=0.23.2 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 3)) (0.23.5)
Requirement already satisfied: matplotlib~=3.10.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 4)) (3.10.1)
Requirement already satisfied: openai~=1.55.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 6)) (1.55.3)
Requirement already satisfied: pandas~=2.2.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 7)) (2.2.3)
Requirement already satisfied: prometheus-client~=0.20.0 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 8)) (0.20.0)
Requirement already satisfied: requests~=2.32.3 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 9)) (2.32.3)
Requirement already satisfied: wget~=3.2 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 10)) (3.2)
Requirement already satisfied: typer~=0.15.1 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 11)) (0.15.2)
Requirement already satisfied: seaborn~=0.13.2 in /llmjob/share/python/lib/python3.11/site-packages (from -r /llmjob/llama.cpp/./requirements/requirements-tool_bench.txt (line 12)) (0.13.2)
Requirement already satisfied: filelock in /llmjob/share/python/lib/python3.11/site-packages (from transformers<5.0.0,>=4.45.1->-r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3)) (3.17.0)
  Using cached transformers-4.51.0-py3-none-any.whl (10.4 MB)
  Using cached transformers-4.50.3-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.50.2-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.50.1-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.50.0-py3-none-any.whl (10.2 MB)
  Using cached transformers-4.49.0-py3-none-any.whl (10.0 MB)
  Using cached transformers-4.48.3-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.48.2-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.48.1-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.48.0-py3-none-any.whl (9.7 MB)
  Using cached transformers-4.47.1-py3-none-any.whl (10.1 MB)
  Using cached transformers-4.47.0-py3-none-any.whl (10.1 MB)
Requirement already satisfied: packaging>=20.0 in /llmjob/share/python/lib/python3.11/site-packages (from transformers<5.0.0,>=4.45.1->-r /llmjob/llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3)) (24.2)

We will need to use the imatrix setup for Llama-4-Maverick-17B-128E-Instruct.

If you mean rpc setup, I configured the imatrix job accordingly (with max. one hfd and one quant job on nico1), but it's currently in override.

PS: untested, but you might be able to use llmc shell kaos and rm /tmp/Llama-4-Maverick-17B-128E-Instruct.soverride to remove the override, followed by llmc push, to enable it. Likewise this can be used to override imatrix jobs. That works because on kaos, the global /tmp is shared with llmc shell (there are no secrets in /tmp there, presumably :)

PPS: false alarm, /tmp of course is +t, so it won't work. will have to move it elsewhere. but that should be easy, just not right now.

Please start the Llama-4-Maverick-17B-128E-Instruct imatrix task. The RPC imatrix setup is ready!

PS: untested, but you might be able to use llmc shell kaos and rm /tmp/Llama-4-Maverick-17B-128E-Instruct.soverride to remove the override, followed by llmc push, to enable it. Likewise this can be used to override imatrix jobs. That works because on kaos, the global /tmp is shared with llmc shell (there are no secrets in /tmp there, presumably :)
PPS: false alarm, /tmp of course is +t, so it won't work. will have to move it elsewhere. but that should be easy, just not right now.

As you predicted, it doesn't work due to wrong permissions and the sticky bit. It currently has 600 permissions, so I can't do anything with it - not even read or write. With others that have 666 permissions I had at least some fun, but even those I can't delete due to the sticky bit. The 666 ones I was able to hardlink, just to figure out that I can't delete the hardlinked copy due to the sticky bit either - please delete addtxt_.txt, which I created while messing around. Your jail terminal shows all files owned by 65534 no matter who created them, so file permissions got a bit confusing. In any case, it looks secure from the few minutes I messed around with it. Whoever created that jail had the smart idea to read-only mount almost everything, heavily limiting any potential attack surface.

Some quite awesome llama.cpp pull requests for convert_hf_to_gguf.py got created today:
convert : ability to lazy-load safetensors remotely without downloading to disk: https://github.com/ggml-org/llama.cpp/pull/12820
convert : write tensors in parallel: https://github.com/ggml-org/llama.cpp/pull/12837

#12820 allows us to convert SafeTensors to source GGUF without even having to store the actual files, and without having to download models that immediately fail the conversion.
#12837 allows us to use multiple threads in convert_hf_to_gguf.py, making it much faster

Yesterday night I finished the perf measurement project, freeing up 4 TB of SSD storage. That storage I have now mounted into your container under /dpool. I'm currently creating the Llama-4-Maverick-17B-128E source gguf on it. It should be done in around 3 hours. Please whitelist dpool so we can softlink to it.

#12820 allows us to convert SafeTensors to source GGUF without even having to store the actual files, and without having to download models that immediately fail the conversion.

That is potentially useful, yet... with the amount of download errors and retries, I wonder if it would be a net-win.

#12837 allows us to use multiple threads in convert_hf_to_gguf.py, making it much faster

You mean likely much slower? Why would multiple threads make it faster, when it's already I/O bound everywhere? It would likely make it slower almost everywhere if it used multiple threads.

The obvious optimisation for convert_hf_to_gguf.py would be to not make an extra copy of all tensors on disk.

Please whitelist dpool so we can softlink to it.

done! What are the usage guidelines for that, and... how fast is reading from there? :)

In any case, it looks secure from the few minutes I messed around with it.

Makes me very happy to hear :)

I started with umask 0 for all hf-related things, because I have very few "foreign" users on my systems, and thought it might come in handy eventually. It kind of does, but using /tmp kind of destroyed that. In any case, I will likely provide imatrix-related llmc commands, that seems easier and safer. The llmc shell can also get some obvious improvements (I work with it every day during "audit"), but right now, I have very little time. Hopefully that changes soon-ish.

Also, since, at least between us two, you are the god of python, I am really at a loss as to why pip always downgrades transformers, and that's my most pressing problem. We really need a more up-to-date version for a lot of architectures. Not even pip3 -vvv tells anything about why it chooses or downloads certain versions, only that it does.

I could force it by reinstalling transformers again after everything is installed, but there must be a reason why it downgrades (but then, why does it upgrade it when installed alone)

Hmm, I was offline for an hour and now lots of imatrix jobs are missing, including the llama 4 maverick one. What horrible error cascade/explosion did I miss?

Hmm, I was offline for an hour and now lots of imatrix jobs are missing, including the llama 4 maverick one. What horrible error cascade/explosion did I miss?

No idea. It already was like this when I woke up over an hour ago. I was so surprised to see llama 4 maverick gone without doing its RPC imatrix task - it in fact never even connected to the RPC servers. The imatrix RPC setup is still ready, so please run it once things are fixed and the imatrix tasks for other models are done.

I turned on the nico2 LXC container but left it disabled for now in case you want to start the RPC imatrix task. If you don't want to do RPC now you can let nico2 work quantization tasks in the meantime. I disabled the nico2 shutdown handler so it will not get turned off by accident while we do imatrix RPC computation.

Edit: I enabled nico2 again so we make use of it in the meantime as it is great weather.

Oh great, so llama.cpp fixed Llama-3_1-Nemotron-Ultra-253B-v1 support in https://github.com/ggml-org/llama.cpp/pull/12843 - cool that we still kept the model downloaded. I will give it a try and if it works, we either wait until this PR is merged or merge it into our own branch.

Edit: We will have to wait for the merge as the model will not load using mainline llama.cpp and so it would be unusable for normal users if we do it now.

done! What are the usage guidelines for that, and... how fast is reading from there? :)

Usage guidance for ZFS-based file volumes is always that you can fill them as much as you want, as I can easily set a storage limit. I recommend always keeping maybe 1 GB free, or performance will get terrible.
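For reference, capping a dataset is a one-liner on my side (the dataset name below is just a placeholder):

zfs set quota=3900G dpool/subvol-llm      # hypothetical dataset name
zfs get quota,available dpool/subvol-llm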

dpool uses a 4000G Kingston FURY Renegade PCIe 4.0 NVMe M.2 SSD. It should have 7300MB/s read and 7000MB/s write speed and 1000000 IOPS. It however currently uses ZFS, which will likely be the limiting factor. I might format it as BTRFS in the future. Getting this done just before Llama 4 llama.cpp support was great timing.

back and started.

the weird thing is that not only the jobs are gone (stored on kaos), but for two jobs which still existed (gemma), the .gguf file was not in /tmp, despite the transfer having been successful.

as for logs, the last log entry for the maverick job was yesterday 14:55:41, when the scheduler found there is not enough budget, meaning the job has not been touched since then.

very strange.

/dpool

Wow, that sounds fast. But indeed, it's hardly faster than /bpool at the moment.

And indeed, a lot of imatrix jobs have either been lost, or somehow their gguf file in /tmp was lost. Two hard to connect failure modes.

Oh great so llama.cpp fixed Llama-3_1-Nemotron-Ultra-253B-v1 support in

Looking at that issue, I must say your use of the word "fixed" is refreshingly optimistic. But at least somebody cares, that's very good :)

@mradermacher The imatrix RPC task failed with OOM. So computing the imatrix of this model must be really tight even with RPC. It tried to allocate 32 GB of GPU memory on a 24 GB GPU, so it wasn't even close. I now switched to the max-memory imatrix RPC setup, during which StormPeak is almost unusable. Please restart the Llama-4-Maverick-17B-128E-Instruct imatrix task.

Well, on the positive side, despite numerous attempts to oom-crash your nodes, it seems to be quite resilient by now. I've set temporary hfd and quant limits to 0 on nico1.

on a related note, recently, imatrix jobs have started to crash like this:

CUDA error: an illegal memory access was encountered

it's very noticeable, because this has never happened before, and now it affects about a third of the models. I suspect a bug in current llama, or some config change - maybe we should go back to an older llama version?

It could also be the models, but it's too common and appeared too suddenly for that to be likely.

compute_imatrix: computing over 314 chunks with batch_size 512
/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu:2480
  cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 3765900]
[New LWP 3765901]
[New LWP 3765902]
[New LWP 3765903]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007c1c69012c17 in __GI___wait4 (pid=3766092, stat_loc=0x7fff5c24162c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007c1c69012c17 in __GI___wait4 (pid=3766092, stat_loc=0x7fff5c24162c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007c1c695ef0b4 in ggml_print_backtrace () at /llmjob/llama.cpp-cuda512/ggml/src/ggml.c:156
156             waitpid(pid, &wstatus, 0);
#2  ggml_abort (file=0x7c1c620e0f40 "/llmjob/llama.cpp-cuda512/ggml/src/ggml-cuda/ggml-cuda.cu", line=75, fmt=0x7c1c62106e88 "CUDA error") at /llmjob/llama.cpp-cuda512/ggml/src/ggml.c:183
183         ggml_print_backtrace();
#3  0x00007c1c61e8d033 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /llmjob/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007c1c61e8e59a in ggml_backend_cuda_synchronize(ggml_backend*) () from /llmjob/llama.cpp/build/bin/libggml-cuda.so
#5  0x00007c1c6960434c in ggml_backend_sched_compute_splits (sched=0x60e771a7ca00) at /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp:1427
1427    /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp: No such file or directory.
#6  ggml_backend_sched_graph_compute_async (sched=0x60e771a7ca00, graph=<optimized out>) at /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp:1590
1590    in /llmjob/llama.cpp-cuda512/ggml/src/ggml-backend.cpp
#7  0x00007c1c697226d9 in llama_context::graph_compute (this=this@entry=0x60e771b9a500, gf=gf@entry=0x7c1c226fb030, batched=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:191
191           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#8  0x00007c1c69725522 in llama_context::decode (this=0x60e771b9a500, inp_batch=...) at /llmjob/llama.cpp-cuda512/src/llama-context.cpp:1329
1329    /llmjob/llama.cpp-cuda512/src/llama-context.cpp: No such file or directory.
#9  0x00007c1c697267ab in llama_decode (ctx=<optimized out>, batch=...) at /llmjob/llama.cpp-cuda512/src/llama-context.cpp:2792
2792    in /llmjob/llama.cpp-cuda512/src/llama-context.cpp
#10 0x000060e7644af309 in compute_imatrix (params=..., ctx=0x60e771b9a500) at /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:554
554     /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp: No such file or directory.
#11 main (argc=<optimized out>, argv=<optimized out>) at /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:686
686     in /llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp
[Inferior 1 (process 3765898) detached]

And now leia can't create repositories anymore (403 Forbidden). If I try to refresh the token via the web interface, I also get a 403. Hopefully that's not a sign of things to come.

nope, same on nico1 now.

And now leia can't create repositories anymore (403 Forbidden). If I try to refresh the token via the web interface, I also get a 403. Hopefully that's not a sign of things to come.

Same for nico1

And now leia can't create repositories anymore (403 Forbidden). If I try to refresh the token via the web interface, I also get a 403. Hopefully that's not a sign of things to come.

It's fixed now.

Typical hf day then.

I configured Llama-4-Maverick-17B-128E for rpc as well. nico1 is paused (not a good moment), so I can't easily see if it is configured properly, but it likely is.

I configured Llama-4-Maverick-17B-128E for rpc as well. nico1 is paused (not a good moment), so I can't easily see if it is configured properly, but it likely is.

I paused nico1 and nico2 because quantisation tasks were still running when you started the RPC imatrix task and I didn't want it to OOM again. Memory will be very tight, so it is unclear if there will be enough memory left to have any quantisation tasks running while imatrix RPC is running.

I resumed nico1 and instead only paused llmjob.nico1. I should have done so from the beginning, but using host-pause and canceling it before it disables the host is more convenient.

Imatrix RPC setup looks great so far.

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: RPC[192.168.200.201:7201] model buffer size = 221778.05 MiB
load_tensors: RPC[192.168.200.202:7202] model buffer size = 443556.09 MiB
load_tensors: RPC[192.168.200.203:7203] model buffer size =   600.04 MiB
load_tensors: RPC[192.168.200.204:7204] model buffer size = 96420.84 MiB
load_tensors:   CPU_Mapped model buffer size =  1973.12 MiB
...........................

Llama-3_1-Nemotron-Ultra-253B-v1

Well, I suspect it's a slightly tweaked architecture and/or simply broken or missing support for it in llama.cpp. Cheap to retry, maybe the transformers update helped. Or will help, once I've verified it happened.

The issue is already fixed in #12843! We can even keep our current source GGUF. I'm so excited for this. We could already quant it, but then nobody could run it using official llama.cpp releases. I really hope this gets merged soon. The model is quite amazing for its size, beating Llama-4-Maverick and DeepSeek R1 in some benchmarks despite being much smaller. I guess I will just upload my Q4_K_M quant I made for testing to my own HuggingFace account for now so the community has something to test the model with in the meantime.

on a related note, recently, imatrix jobs have started to crash like this:
CUDA error: an illegal memory access was encountered
it's very noticeable, because this has never happened before, and now it affects about a third of the models. I suspect a bug in current llama, or some config change - maybe we should go back to an older llama version?
It could also be the models, but it's too common and appeared too suddenly for that to be likely.

Guess what crashed the RPC imatrix setup this time. Not OOM but this garbage:

root@RPC-GPU:~# GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./run.sh

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('192.168.200.201') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
Starting RPC server
  endpoint       : 192.168.200.201:7201
  local cache    : n/a
  backend memory : 35 MB
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
Client connection closed
Accepted client connection, free_mem=36700160, total_mem=36700160
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /root/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2465
  cudaStreamSynchronize(cuda_ctx->stream())
/root/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
./run.sh: line 8:  5960 Aborted                 ./rpc-server -H 192.168.200.201 -p 7201 -m 35.5

This is likely related to https://github.com/ggml-org/llama.cpp/issues/12798
I updated our fork so we have the now-merged support for Qwen3 and Qwen3MoE (https://github.com/ggml-org/llama.cpp/pull/12828).
Let's get the latest and rebuild with GGML_CUDA_GRAPHS=OFF.
In the case of the RPC servers this is: cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_GRAPHS=OFF

CUDA error: an illegal memory access was encountered

Yup, seems a common error. All failed models failed with that. But it's not universal. Also not related to llama-4. But seems mostly to affect llama.

rebuilding all with -DGGML_CUDA_GRAPHS=off - if possible, drop me a note when this is fixed so I can remove it (or alternatively, we can investigate if this is even useful for us)

The imatrix setup is now ready again. I pulled the latest llama.cpp and rebuilt with -DGGML_CUDA_GRAPHS=OFF. Please restart the imatrix RPC task.

rebuilding all with -DGGML_CUDA_GRAPHS=off - if possible, drop me a note when this is fixed so I can remove it (or alternatively, we can investigate if this is even useful for us)

I will monitor that issue. Let's hope the temporary workaround of setting GGML_CUDA_GRAPHS=OFF works. Please restart the RPC imatrix task so it starts after the current imatrix tasks, as I really want to have the RPC imatrix task running within the next few hours - otherwise the timing is quite bad for me. If that doesn't work for you, feel free to even kill the currently running imatrix tasks in favour of RPC imatrix computation.

I started some of the small imatrix tasks that failed before and they work. Unfortunately, a bigger one sneaked in, but after that, the first rpc task should start automatically (~40min).

Nice, this time the RPC imatrix computation started successfully. After the 2.5 hours it took to load the model, RPC imatrix computation seems to be much faster than expected.

RAM usage:
CastlePeak: 90.79% (228.36 GiB of 251.53 GiB) (hosts nico2)
StormPeak: 88.67% (446.15 GiB of 503.18 GiB) (hosts nico1)
Threadripper: 88.61% (111.32 GiB of 125.63 GiB) (hosts OpenWrt router)

I think maybe a single quantisation task on nico1 could work if you want to risk it.

It's going so fast you could RPC imatrix compute Llama-4-Maverick-17B-128E after the current instruct one. Not sure if you already configured it to use RPC, but if you did, it's next in the queue and should automatically start once the current one is done.

a good day to wake up to :)

I think we might have run out of nico1 spool storage for a brief period, as both Llama-4-Maverick-17B-128E-Instruct and Llama-4-Maverick-17B-128E randomly failed simultaneously. We are currently at 326 GB free storage as of the time I checked.

The The-Omega-Concession-M-24B-v1.0 imatrix task might have broken due to the storage situation as well. It failed without any error; the log file just stops, likely when things ran out of storage, so please restart it as well.

I wonder if we have to restart some HuggingFace uploads due to the storage situation, as a HuggingFace upload failed due to no space left in 'Llama-4-Maverick-17B-128E-Instruct-i1-GGUF-Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf*.log':

Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf.part1of4:  83%|████████▎ | 36.6G/44.0G [13:19<02:02, 60.9MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: 7fb0487d-b688-43c2-8d71-dd2642d519a7)')' thrown while requesting PUT 

No, it actually restarted itself after the storage situation was fixed, but then failed with this error:

Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf.part1of4: 100%|██████████| 44.0G/44.0G [16:03<00:00, 45.7MB/s]
BadRequestError(' (Request ID: Root=1-67f79659-5ebf7bfd6737a9b26474a95c;a3f6c7cb-2d0c-4828-aed8-182ed473112a)\n\nBad request for commit endpoint:\nYour push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf.part1of4') at /llmjob/share/bin/llmjob line 2804.

Let's hope that fixed it:

nico1 ~# ils 
hfu 81284 212230 212235 212289 213285 213289 213293 237369 237370 3788005 3850513
hfu-Llama-4-Maverick-17B-128E-Instruct-i1-GGUF 3788005
hfu-Llama-4-Maverick-17B-128E-Instruct.i1-IQ3_M.gguf 3788005
nico1 ~# ikil 3788005

This killed the faulty upload task but now it’s not trying to upload it again.

The i1-IQ3_M for Llama-4-Maverick-17B-128E-Instruct never got uploaded to https://huggingface.co/mradermacher/Llama-4-Maverick-17B-128E-Instruct-i1-GGUF/tree/main as it no longer seems to exist on nico1.

Edit: Never mind, I somehow missed them. They did get uploaded! Confirmation bias is so real. So everything is perfect. Strange that they did, despite me having to kill the upload. I should put more trust in your system automatically doing the right thing. I'm so used to systems not working the way they should.

"Your push was rejected because an LFS pointer pointed to a file that does not exist."

How could this even happen with the hub api?

Edit: Never mind, I somehow missed them. They did get uploaded!

It's actually quite a common failure mode (for us) to have it uploaded, but not deleted.

In any case, the logic is quite accidental - here is a high-level description:

  1. When a job starts, it first enumerates all running hfu process groups, then downloads the remote file listing and merges them.
  2. The quantizer then skips creating quants that already exist on disk, or are in the previous list.
  3. Whether it was created or not, when it exists, the script will then upload it, and, if successful, delete it.

If a job is interrupted, it still goes through all quants again, so as long as it restarts, uploads will be restarted as needed. That's all both by luck, and of course because I try to design processes to be idempotent, if convenient.
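In shell pseudo-code the flow is roughly this; all helper names are illustrative, not the actual quantize script:

existing=$( { list_running_uploads; list_remote_quants "$REPO"; } | sort -u )   # step 1 (hypothetical helpers)
for q in $WANTED_QUANTS; do
  file="$MODEL.$q.gguf"
  if ! [ -e "$file" ] && ! grep -qxF "$file" <<<"$existing"; then
    make_quant "$q" "$file"                        # step 2: only create what is missing everywhere
  fi
  if [ -e "$file" ]; then
    upload_quant "$REPO" "$file" && rm -- "$file"  # step 3: upload whatever exists, delete only on success
  fi
done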

The only hole is if the job is done (i.e. 9999 blocked/nonempty, or imatrix phase after static phase) and the uploads get interrupted (e.g. after a crash); then uploads will not be reattempted. The symptom is that the job will stay forever in blocked/nonempty, until the quants are deleted (preferably after being uploaded). This is kind of a design bug - the "quantize" script does so much work on its own. If all the various subtasks (like uploads) were jobs, they could be scheduled more efficiently. However, I would have never invested the effort into making something so complicated...

[the rest]

Yeah, I noticed. No clue why it was full, but it was most certainly full at some point. Fortunately after the rpc imatrix was done...

regarding my python troubles, this must be a bug. First of all, there seems to be no way whatsoever to get a useful debug log out of pip3 - you can get download logging to no end, but not a single line explaining why pip downgrades a package.

But it makes no sense. llama.cpp's top-level requirements.txt looks like this:

-r ./requirements/requirements-convert_legacy_llama.txt
-r ./requirements/requirements-convert_hf_to_gguf.txt
-r ./requirements/requirements-convert_hf_to_gguf_update.txt
-r ./requirements/requirements-convert_llama_ggml_to_gguf.txt
-r ./requirements/requirements-convert_lora_to_gguf.txt
-r ./requirements/requirements-tool_bench.txt

I can pip install -U -r every single one of these files individually and I get transformers 4.51. If I install them in one go via the top-level requirements.txt, it downgrades to 4.46.

Ok, apparently, this dependency causes it: huggingface_hub~=0.23.2

And yes, that looks to be a design bug, as these dependencies are not checked anymore once a package is installed, so they are absolutely pointless.
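
(A small diagnostic sketch, not part of our tooling: the requirements files you can simply grep, but this lists which installed distributions also declare a requirement on huggingface_hub, which is how I'd hunt down where such a pin bites at runtime. Assumes Python 3.8+ for importlib.metadata.)

from importlib.metadata import distributions

# Print every installed distribution that declares a dependency on
# huggingface_hub, together with the exact version specifier it uses.
for dist in distributions():
    for req in dist.requires or []:
        if req.replace("-", "_").startswith("huggingface_hub"):
            print(f"{dist.metadata['Name']}: {req}")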

@nicoboss both Llama 4 Maverick quant jobs stopped with the same problem:

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.1.ffn_gate_exps.weight in a very low-bit quantization

Normally, that's an imatrix coverage problem, but in this case, I think it's simply a bug in llama 4 support. We could provide "the other" imatrix quants (basically non-IQ1/2/3 and IQ3_M or so, I have a list).

But that might mean that the other imatrix quants will have lower quality, depending on the nature of the problem.

Suggestions?

In other news, I started the ling-plus* imatrix quants with whatever I found in /tmp on nico1. I have no clue why there were imatrix files there, and I have no clue what is in them :) The truth will come when we hit the first IQ2 or so quant.

I am not sure that's a good idea, obviously, but the only other option would be to declare the Ling-plus models as broken and ignore them (ok, the third option would be to resurrect imatrix-data-quarter, but I also don't feel like using the rpc setup for that).

pip always tries to satisfy the dependencies of a newly installed package, even if that requires upgrading/downgrading dependencies of previously installed packages. This is because in a Python environment you can only have a single version of a specific package installed. If the situation is totally unresolvable, pip shows a dependency conflict. I personally find pip being unable to resolve dependencies way more annoying than it breaking previously installed packages, as there is usually no way to resolve such conflicts, while if pip does something and tries its best, there is usually a decent chance of everything working. Package maintainers and developers are mostly to blame for this mess. Some just specify that they require a very specific version of a package for no reason other than that's what they tested their software with, instead of specifying that everything of a specific major/minor version, or anything newer than a specific version, is supported. Most do that because they don't want to be blamed if whoever maintains that dependency doesn't follow sane versioning guidelines and introduces breaking API changes or catastrophic bugs in a bugfix release. In any case, this behavior just results in nobody taking version requirements seriously anymore, as they are just annoying if they are used more as recommendations than as true requirements.
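
(As a small illustration of what such a pin actually means - this only uses the packaging library, which pip itself builds on, and is not specific to our setup:)

from packaging.specifiers import SpecifierSet

# A compatible-release pin like huggingface_hub~=0.23.2 is shorthand for
# ">=0.23.2, ==0.23.*": bugfix releases stay allowed, anything newer is not.
spec = SpecifierSet("~=0.23.2")
print("0.23.5" in spec)   # True
print("0.24.0" in spec)   # False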

I wouldn't be surprised if we indeed didn't cover all 128 experts of Llama 4 Maverick, but given that our algorithm stores such tensors anyway as long as we cover most experts, this should have worked. It would be interesting to check the log and see what happened, if you still have it somewhere. It could be a bug in the relatively new Llama 4 implementation, but that's hard to say without looking at the logs.

Regarding the Ling-plus models: there we had the NaN crash. The imatrix files in /tmp are probably the last autosave before the NaN crash. The clean solution would be llama.cpp fixing the NaN issue on their side, you editing your imatrix training data and deleting the chunk that caused the NaN crash for that specific model, or using different imatrix training data for this model, like the imatrix dataset under /root. I planned on trying some other imatrix training datasets on a smaller quant and, if they work, doing them using the RPC imatrix setup. Then, unfortunately, Llama 4, Nemotron Ultra 253B and a ton of llama.cpp PRs to monitor (like the MLA implementation and Qwen 3 support) came in, combined with a busy time at work, leaving me with no time to reinvestigate Ling-plus. I guess providing imatrix quants with the imatrix we have is fine. Just keep in mind that it was only trained on about a quarter of the data we usually use for imatrix training.

Until the 25th of April I will be limited in my ability to respond to unimportant HuggingFace comments like model requests and user questions, queue models, execute llmc audit, investigate issues with certain models, manually provide GGUFs of large models, or start the RPC imatrix setup, so you might have to do a bit more on your own than usual during this time. It's not that I can't do those things, just not as much as usual. On the bright side, I will now finally have time to work on our llama.cpp fork and add the long-requested versioning metadata.

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.1.ffn_gate_exps.weight in a very low-bit quantization

Normally, that's an imatrix coverage problem, but in this case, I think it's simply a bug in llama 4 support.

Unsloth also reports similar

[image.png - Unsloth's screenshot of the per-block imatrix coverage]

They also state how much calibration data they used:

We tried adding more uncommon languages to our calibration dataset, and tried using more tokens (1 million) vs Scout's 250K tokens for calibration, but we still found issues.

They got around it by:

We decided to leave these MoE layers as 3bit and 4bit.

I think there is an issue with the tensors in block 45, as its coverage is below 50%. Block 3 is above 95% and thus gets saved even though it is partial; block 1 is not above 95%, so it does not get saved, which is why you hit the error (it is above 90%, though, so it's not that strange that it doesn't save). Block 45 being below 50% is the strange part. But I think you can force all three through anyway and test, by adjusting the cutoff or adding conditions to the changes you used here: https://github.com/nicoboss/llama.cpp/pull/1

Since if the calibration doesn't hit those experts, inference has a good chance of not hitting them either, which is why it might be worth the test if you guys want to. Or you can do like Unsloth and just adjust those tensors in block 45 (and block 1, if you don't want to set the cutoff to 90%) to be saved in higher bit.

Also on pip I tend to use the dry run option documented here: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-dry-run which makes pip a little more bearable to use.

hi @mradermacher , how are you? a small oops happened on rich1 and we deadlocked the status page and potentially the scheduler. I need a tiny bit of help to recover without making a huge mess =) sorry for the inconvenience and thank you for the help =)

And I guess it would be a great idea to make a command to force-release a lock without killall9

llmc audit still works, so just the status page is dead

And it's very interesting what broke, so if you fix it, please tell me what happened =)

pip always tries to satisfy the dependencies of a newly installed package, even if that requires upgrading/downgrading dependencies of previously installed packages.

Right, but it seems very broken to me to have dependencies, only to have them ignored. Sure, if a package creator says transformers<4.46 needlessly, the creator is to blame (which is probably the case here). But if that were the only reason for these dependencies to exist, they should not exist. And in the cases where these dependencies are actually required for a package to work, it makes no sense for pip to break them. Puzzling.

This is because in a Python environment you can only have a single version of a specific package installed.

That's actually very much sane :)

I wouldn't be surprised if we indeed didn't cover all 128 experts of Llama 4 Maverick, but given that our algorithm stores such tensors anyway as long as we cover most experts, this should have worked.

This is a case of a tensor not stored at all. Either our patch ignores 0% covered tensors, or there is another issue. Since llama-4 has some duplicated tensor issues, maybe this is it.

It would be interesting to check the log and see what happened, if you still have it somewhere.

Nope :)

Regarding the Ling-plus models: there we had the NaN crash. The imatrix files in /tmp are probably the last autosave before the NaN crash.

It looks like that, but autosaves are off, because of the potential problem I reported with the weight patch. Since we also don't get messages anymore when it's off, I thought it would really be off.

If it is on, then this is worrying, because I then think our imatrix weight patch might slightly corrupt the imatrix data.

The clean solution would be llama.cpp fixing the NaN issue on their side,

Assuming (without evidence) that this is not a model issue.

Until the 25th of April I will be limited in my ability to respond to unimportant HuggingFace comments like model requests and user questions, queue models, execute llmc audit, investigate issues with certain models,

Good, that means we are both busy :) Maybe we need to press somebody else into work :) @RichardErkhov what do you do in your ample free time? :)

@tdh111

Unsloth also reports similar

Yes, similar, but I think it's different - we still save partially covered tensors (or should), and this is a tensor completely missing, which is why I think it's not the usual "some expert not fully covered" issue.

to be saved in higher bit.

I don't think this can easily be done with llama.cpp. I think we should either wait a bit longer for something magical to happen (such as an upstream fix), or skip those quants. Or both, in that order.

Somebody has helpfully reported the llama 4 issue btw.: https://github.com/ggml-org/llama.cpp/issues/12913

@RichardErkhov what do you do in your ample free time? :)

Who said I have free time haha? I'm in my last year of school, I have like 20 exams and then preparation to move to another country for uni lol. And everything is mixed with writing code, of course

What do you want though? Maybe I will have some time to work in parallel haha

@RichardErkhov

What do you want though?

Well, a great help would be to go through the mradermacher model requests, queue them, and report back to the requester. Typically, it looks like this:

https://huggingface.co/mradermacher/model_requests/discussions/828#67f577bfba17fca922c920b9

It would already be a great help if you just queued some normal/simple requests and simply ignored everything too complex.

You can queue models like this from rich1 (e.g. for the example above):

llmc add -2000 si https://huggingface.co/deepcogito/cogito-v1-preview-qwen-32B

It would be a great help because I can sometimes only check once per day or even less often, so that would reduce latency a lot.

I'm in my last year of school, I have like 20 exams and then preparation to move to another country for uni lol.

Haha, as if anybody cares for school grades :)

(Well, yeah, depending on what you want to do, they are important indeed :)

And it's very interesting what broke, so if you fix it, please tell me what happened =)

I don't know, but typically it results in rsh hanging and still keeping the socket open that the status daemon wants to see closed.

Don't worry about things breaking though, at least as long as there was a reason for it.

killall9

It could be made less crude, but there is little alternative to killing everything that holds a lock when there is a real deadlock (but there hasn't been one for a long while). Maybe I can make the llmstatusd separately restartable, because that is always safe to restart.

I will try to queue them, and will ping you if something is wrong. I hope I will not break anything, as I tend towards "if anything can go wrong, it will go wrong", but I will try

Haha, as if anybody cares for school grades :)
(Well, yeah, depending on what you want to do, they are important indeed :)
With my amazing school I need to grind to get something above a D on a level exam, even with my 98.6 average...

And yes, please make a command to release a lock haha, or else you will wake up every day from a deadlock just because I failed to properly add a model

@tdh111

Unsloth also reports similar

Yes, similar, but I think it's different - we still save partially covered tensors (or should), and this is a tensor completely missing, which is why I think it's not the usual "some expert not fully covered" issue.

You only save partial tensors when they are above 95%; as I showed in Unsloth's image, block 1 is at 93%, so it is not covered by your patch. Your patch does catch block 3, though, which Unsloth does not save but yours should. Block 45 is the problematic one, at under 50%.

@nicoboss nice audit action :)

+                if (bad_experts.size() < round(kv.second.n_as * 0.05)) {

@tdh111 Hmm, I was misinformed about what that patch does then :( Thanks for pointing it out.

@RichardErkhov

I will try to queue them. I hope I will not break anything, as I tend towards "if anything can go wrong, it will go wrong", but I will try

Don't worry, I know you enough by now. But it should be relatively foolproof (you are just queuing a very few models per day at most :). Also, maybe I was too sloppy earlier: of course school goes first, but I assume you know that, just covering my base :)

Anyway, thanks for your help, once more.

will ping you if something is wrong.

I normaly go through all model requests (because I can't see if any went wrong anyway, or have been successfully acted upon). If not sure, just ignore it, you don't have to specifically ping me. Likewise, is you see that a model failed, feel free to report back to the original submitter in the issue, but don't feel you need to alert me.

+                if (bad_experts.size() < round(kv.second.n_as * 0.05)) {

@tdh111 Hmm, I was misinformed about what that patch does then :( Thanks for pointing it out.

I'm sorry for the poor communication in the past, but yes, that is what the patch does. If it allowed anything through, it would hide situations where you do not have enough coverage for the model (which, with modern MoEs, is actually difficult to achieve).
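
For reference, a rough Python paraphrase of the quoted condition (hypothetical names, just to make the 95% threshold explicit - the real check lives in the llama.cpp patch):

def keep_partial_tensor(n_experts: int, n_bad_experts: int) -> bool:
    # A partially covered expert tensor is only stored when fewer than 5%
    # of its experts lack imatrix data, i.e. coverage stays above ~95%.
    return n_bad_experts < round(n_experts * 0.05)

# With Llama 4 Maverick's 128 experts:
print(keep_partial_tensor(128, 5))   # True  - ~96% coverage, tensor is kept
print(keep_partial_tensor(128, 9))   # False - ~93% coverage (like block 1), dropped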

Well, the open question is why we get "auto-saves" despite -ofreq not being specified. My fear is that the patch changes the imatrix data in-place, which then goes into the next computation. When I dropped -ofreq, I also didn't get any messages about tensor coverage anymore, except at the end. Very strange.

@nicoboss I think you queued a bunch of glm4 models - they are not supported by our llama.cpp version - fortunately, because current upstream does support them, but apparently silently corrupts the data when quantizing (which is weird, but hey).

@x0wllaar helpfully pointed this out in https://huggingface.co/mradermacher/model_requests/discussions/844#67fe8fb6046aebf7c8dcdcf5

update: i hate this broken @ completion thingy. gets me every single time

@nicoboss and while I have you, this is also interesting: https://huggingface.co/mradermacher/model_requests/discussions/847#67ffc38637b9f0d1152b9d45

basically, the rather common llama_vocab_get_add_eos crash might actually be a fixed llama.cpp bug

@nicoboss the GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed issue seems to be a bug in our fork only - quants that fail like this seem to work fine with normal llama.cpp (and e.g. koboldcpp). (I have not tested the exact same versions, though.)

it was almost certainly introduced with the dryrun changes

yup, verified. that seems to affect 161 models :) and is the most common killer for things that survive DRYRUN

@nicoboss also, it seems our build numbers are unrelated to llama.cpp build numbers. that is unfortunate.

for testing, i'll switch back to llama.cpp

nope, happens with upstream llama as well, so it must be our compile options. but it happens with all variants (cuda, nocuda, noavx...)

sucks that we both don't have time :()

i debugged a bit: add_eos is true for me, and it seems that's because the model has: tokenizer.ggml.add_eos_token bool = true

and this happens in main:

if (!llama_model_has_encoder(model)) {
    GGML_ASSERT(!llama_vocab_get_add_eos(vocab));
}

has_encoder is false because the arch is llama, and therefore the assertion fails.

Good thing we waited with MLA:

Well... It's always like this. And the only way to avoid it is to let it cook for a while and let people test it - it's normal to have bugs, especially with llama.cpp's essentially zero testing/quality approach.

I think we might have glm4 support upstream now: https://github.com/ggml-org/llama.cpp/pull/12957 https://github.com/ggml-org/llama.cpp/pull/13021

I updated our fork with the latest llama.cpp, so you can update and try it out. Please keep in mind that GLM4-0414 is currently probably still broken and will only be fixed once https://github.com/ggml-org/llama.cpp/pull/13099 gets merged, but all other GLM-4 models should be fine.

Well... It's always like this. And the only way to avoid it is to let it cook for a while and let people test it - it's normal to have bugs, especially with llama.cpp's essentially zero testing/quality approach.

Yes, I know, and my code is also not always perfect, but I usually do a lot of automated and manual testing to at least ensure a certain quality standard. llama.cpp even mentions in https://github.com/ggml-org/llama.cpp/blob/master/docs/development/HOWTO-add-model.md that main, imatrix, quantize and server should be tested when adding a new architecture. I would expect the same when making significant changes to the implementation of an existing architecture. But in the end all llama.cpp contributors do so in their spare time, so I guess we can be happy that they contribute at all, and things breaking from time to time is not that bad, as we can always wait a bit for things to mature, unless a model is urgent.
