Whimsical Waffle: The Curious Case of LLMs and Their Linguistic Shenanigans
yay
Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:
mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000
Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null). Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.
(using fuse to mount a file via https is cheating)
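Roughly the kind of script I mean, as an untested Python sketch: fetch the start of the file with HTTP Range requests, keep parsing the GGUF metadata and tensor infos until they fit in the buffer (that's the "decode incrementally" part), then write a sparse dummy file of the original size. The field layout follows the published GGUF spec, but the chunk size, helper names, the sparse-file trick and whether DRYRUN accepts an all-zero tensor section are my own assumptions.

```python
# Untested sketch: fetch only the GGUF header from huggingface via HTTP Range
# requests and recreate a sparse "dummy" gguf of the same size (the tensor data
# region reads back as zeros). Chunk size and helper names are arbitrary choices.
import struct
import requests

CHUNK = 4 << 20  # grow the buffer 4 MiB at a time until the header parses

# GGUF scalar value types -> struct format codes (type 8 = string, 9 = array)
SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
           6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}

class NeedMoreData(Exception):
    pass

class Reader:
    def __init__(self, buf):
        self.buf, self.pos = buf, 0
    def take(self, n):
        if self.pos + n > len(self.buf):
            raise NeedMoreData
        b, self.pos = self.buf[self.pos:self.pos + n], self.pos + n
        return b
    def u32(self): return struct.unpack("<I", self.take(4))[0]
    def u64(self): return struct.unpack("<Q", self.take(8))[0]
    def string(self): return self.take(self.u64())
    def skip_value(self, t):
        if t in SCALARS:
            self.take(struct.calcsize(SCALARS[t]))
        elif t == 8:                              # string
            self.string()
        elif t == 9:                              # array: element type, count, elements
            et, n = self.u32(), self.u64()
            for _ in range(n):
                self.skip_value(et)
        else:
            raise ValueError(f"unknown GGUF value type {t}")

def header_size(buf):
    """Bytes taken by magic + metadata + tensor infos, or raise NeedMoreData."""
    r = Reader(buf)
    assert r.take(4) == b"GGUF"
    r.u32()                                       # version
    n_tensors, n_kv = r.u64(), r.u64()
    for _ in range(n_kv):                         # metadata key/value pairs
        r.string()
        r.skip_value(r.u32())
    for _ in range(n_tensors):                    # tensor infos
        r.string()                                # tensor name
        ndims = r.u32()
        r.take(8 * ndims + 4 + 8)                 # dims, ggml type, data offset
    return r.pos

def make_dummy(url, out_path):
    total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    buf = b""
    while True:
        rng = {"Range": f"bytes={len(buf)}-{len(buf) + CHUNK - 1}"}
        buf += requests.get(url, headers=rng).content
        try:
            size = header_size(buf)
            break
        except NeedMoreData:
            pass                                  # header is longer than what we have so far
    with open(out_path, "wb") as f:
        f.write(buf[:size])
        f.truncate(total)                         # sparse file: tensor data is all zeros
```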
btw., in the case of blacksheep, i take the lists of quants done from the "quantize" script and patch the job like this:
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S",
and for the jais models, for example, I removed the *0, *1, IQ4_NL quants, essentially:
"squants": "x-f16 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS",
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q2_K_S IQ1_M Q3_K_S IQ2_XXS Q3_K_L IQ2_XS Q5_K_S IQ2_S IQ1_S Q5_K_M IQ3_XS IQ3_S",
it's in theory possible to do this when adding the job (not via llmc, because reasons), but that requires us to predict with some accuracy that this will happen, so it's rarely useful
Actually, to get the DRYRUN test, all we would have to do is to get rid of the MAP_POPULATE in:
mmap(NULL, 31937041504, PROT_READ, MAP_SHARED|MAP_POPULATE, 4, 0) = 0x7ff79c600000
I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model, so what would be the point of it?
Because I think with the right switches, we can otherwise avoid touching the memory (alternatively, map /dev/null).
What do you mean by touching memory? No additional RAM or GPU memory should get allocated when loading a model. Obviously llama.cpp requires some memory to function, like any application, but that is so little it can be ignored.
Of course, the measurements allowed by DRYRUN are much more worthwhile. Basically, it's the killer feature if we could make it available and it turns out to be feasible. That's the really interesting (to me) todo point: create a script that downloads the gguf header only from huggingface and recreates a dummy gguf. Too bad the gguf file format is so badly designed - you have to decode the whole header incrementally to know how long it is.
I don't think the header can be that big so you can likely just download enough for the full header to always be present.
btw., in the case of blacksheep, i take the lists of quants done from the "quantize" script and patch the job like this
"iquants": "Q2_K IQ3_M Q4_K_S IQ3_XXS Q3_K_M small-IQ4_NL Q4_K_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS Q4_1 IQ3_S"
I assume you are setting this inside llmjob edit.
Wouldn't the scripts synchronize when it is available again?
Altogether it's 3GB, not just scripts, but also, of course, llama.cpp. I added a hack so when removing the disable flag it will sync automatically, but I also update llama.cpp from home, and every node has a different combination of llama.cpp variants (probably the easiest way around is to change that).
But, yeah, that's not effectively automatable.
Yes even for me it would now be inconvenient to switch as I memorized the path so well.
embrace the difference :)
Oh, let's hope for the best. No imatrix failure so far but a lot of imatrix tasks will only be started at 22:00 due to most of them currently being timeofday blocked.
I am pretty sure the dryrun test works - the only way it could fail is if it somehow succeeds despite the model being broken. Likely there are some tests in llama.cpp that are only done at inference time; the question is how many, and are they important :) We will find out.
Just so you know, DRYRUN is supposed to work with every llama.cpp executable that loads a model, so you are not limited to llama-cli.
To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.
Then just don't use llama-cli but any other one that doesn't do this.
Haha, "just". Love it :) Anyway, are there any? There is the server, but the server seems to do the same thing.
Nice. No idea why everyone keeps renaming their models, but us having a different name makes our models hard to find, so automated renames would be quite useful.
They rename it because they want to be able to erase it and create a different one without having to come up with a new final name, in case it sucks. Models are also regularly moved, and sometimes even apparently cloned, to other users.
It does make them harder to find, but at least I stopped using the search function by hf and started to use the quantisations link.
That would be amazing! There are quite a lot of factors that influence vram usage but maybe you can find a pattern by playing around with dryrun.
I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.
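Roughly what I have in mind, as an untested Python sketch: binary-search over -ngl, doing a dry-run load at each step and comparing the reported requirement against the available VRAM. How the required VRAM is extracted from the dry-run output (the regex below) is a placeholder, not the actual DRYRUN output format.

```python
# Sketch: binary search for the largest -ngl that still fits into the given VRAM.
# The parser for the dry-run output is a placeholder, not DRYRUN's real format.
import os
import re
import subprocess

def vram_needed_mb(gguf_path, ngl, extra_args=()):
    """Dry-run a model load with -ngl layers and return the VRAM it would need."""
    proc = subprocess.run(
        ["llama-cli", "-m", gguf_path, "-ngl", str(ngl), "--no-warmup", *extra_args],
        env={**os.environ, "DRYRUN": ""}, capture_output=True, text=True,
    )
    out = proc.stdout + proc.stderr     # llama.cpp logs buffer sizes on stderr
    # Placeholder parser: sum whatever "CUDAn ... NNN MiB" buffer lines the load prints.
    return sum(float(m) for m in re.findall(r"CUDA\d+.*?([\d.]+) MiB", out))

def max_ngl(gguf_path, vram_mb, n_layers, extra_args=()):
    lo, hi, best = 0, n_layers + 1, 0   # +1 so the non-repeating layers can offload too
    while lo <= hi:
        mid = (lo + hi) // 2
        if vram_needed_mb(gguf_path, mid, extra_args) <= vram_mb:
            best, lo = mid, mid + 1     # fits: try offloading more layers
        else:
            hi = mid - 1                # does not fit: offload fewer layers
    return best
```

Tensor split and flash attention could just be passed through extra_args; with two GPUs one would presumably compare per-device buffer sizes against each card's VRAM rather than the sum.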
models always show the date when they were last updated
You'll have to check quant file dates anyway if you need some kind of date. And then, it's pretty useless.
I guess we can at least try to update them in chronological order, so the order stays the same. Or can we?!?
The updates would almost certainly go from newest to oldest, even (or rather, reverse order in how hf lists them for me), with some randomness.
GIT_COMMITTER_DATE and GIT_AUTHOR_DATE environment variables before committing using git
If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster. Besides, will the server-side git really just accept any client-side garbage date when pushed?
as this will hopefully be the last time we ever edit all of them.
The other a-ha moment I had last week was when I realised that this is the problem and must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.
I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.
I'm a bit confused. Dryrun doesn't even use mmap. I explicitly disable it and even print "mmap is not supported for dry-run so it is now disabled" as a warning if you don't specify --no-mmap. Why would you even want mmap for dry-run? You are not allocating any memory when loading the model, so what would be the point of it?
I was talking about an alternative way to achieve just the validity testing without changing llama.cpp. It's entirely hypothetical.
I don't think the header can be that big so you can likely just download enough for the full header to always be present.
The header is pretty massive - tiny if you look at the whole file, but many megabytes in size, enough to warrant an optimisation. My first computer had ~100 octets usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.
Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string. Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.
And to be honest, what worries me most is that aws typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parse on the hf page exists, I am sure it doesn't matter :)
To... some extent (i.e. tracking allocations)? Surely you have not found a generic way to exit all of these at just the right time.
It should work for the majority of them. Almost all that load a model are using the same code to do so. I just tested llama-imatrix, llama-perplexity, llama-simple, llama-simple-chat and llama-run, all of which were fully compatible with DRYRUN despite me never testing them before. Not only do they just work, they also tell you how much memory would be required to load the model in a way that fulfills their purpose, as they essentially just load the model with the exact parameters they require.
Haha, "just". Love it :) Anyway, are there any?
No idea. Try the ones I mentioned above and if they all do it then this is likely something in the model loading code, in which case I can take a look at the code and see if we can change this.
I would allow the user to specify VRAM for 0, 1 or 2 gpus, tensor split, some flags like flash attention, and then probably do a binary search to find the maximum -ngl value.
That would be so awesome. This is actually exactly what I'm currently using DRYRUN for myself.
Keep in mind that DRYRUN only tells you the memory required to load the model and allocate enough memory for its context. Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.
If I can't do it via the api it will not happen. Messing in scripts with git will be a disaster.
Totally understandable.
will the server-side git really just accept any client-side garbage date when pushed?
All git servers seem to. Git servers kind of trust client-side garbage by design. I had to spoof dates/names/emails for author/committer so many times in the past and not once had a git server refuse the commit. The only thing I'm not sure about is whether HuggingFace uses the time in the git commit like GitHub/GitLab do, or if it uses the server time of the push. Now I'm a bit curious, so the next time I upload a model I might try it.
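For reference, the spoofing is just a matter of overriding two environment variables before committing - a minimal sketch (plain git, nothing HuggingFace-specific; whether the Hub then displays that date or the push time is exactly what I'm not sure about):

```python
# Minimal sketch: commit with a spoofed author/committer date by overriding
# GIT_AUTHOR_DATE / GIT_COMMITTER_DATE for the git process.
import os
import subprocess

date = "2023-05-01T12:00:00+00:00"   # arbitrary example timestamp
env = {**os.environ, "GIT_AUTHOR_DATE": date, "GIT_COMMITTER_DATE": date}
subprocess.run(["git", "commit", "-am", "update model card"], env=env, check=True)
```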
The other a-ha moment I had last week was when I realized that this is the problem and must give. I have versioned the model cards now, so we can keep any number of different compatible card formats and update at our own pace.
I don't think with us publishing 100+ repos a day anybody would care about 20000 updates even per day.
Yes it should be fine unless we hit some kind of rate limit.
The header is pretty massive - tiny if you look at the whole file, but many megabytes in size to warrant an optimization. My first computer had ~100 octets usable memory. I saw amazing software written in 20k of memory. When I see a bash process using 2MB of RAM I regularly get dizzy.
My first "Gameboy" which in fact was a Voyage 200 calculator for school had 188 kB RAM and 2,7 MB ROM and it was enough to play all kind of games. I even had something like Maro Maker on there. I actually had that Voyage 200 calculator 5 years before I had my first mobile phone and used it from everything from reading, writing, programming and gaming.
In case you wonder, my first PC was a Windows 2000 machine with 13 GB of HDD storage and I think 128 MB of RAM. My first programming language was BlitzBasic to write PC games, followed by Compact-C, which I used to program C-Control Pro microcontrollers with 2 KB of usable RAM, 10 KB of usable flash storage, 1 KB EEPROM and a 14.7456 MHz CPU, so I know the feeling.
Anyway, gguf is very wasteful, for example, every vocabulary entry is 8 bytes string length + string.
That is indeed terribly wasteful. 1 byte would have been enough.
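For a sense of scale, a quick back-of-the-envelope, assuming a ~152k-entry vocabulary (the Qwen-style embedding size that shows up elsewhere in this thread):

```python
# Back-of-the-envelope: cost of 8-byte length prefixes for a ~152k-entry vocabulary.
n_vocab = 152_064
print(n_vocab * 8 / 1024**2)   # ~1.16 MiB spent on length fields alone
print(n_vocab * 1 / 1024**2)   # ~0.15 MiB with a 1-byte length prefix instead
```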
Also, "likely enough" means you still have to be prepared for it to not be enough in edge cases.
Which should be fine, as llama.cpp was so nice as to put stupid limits everywhere, so most edge cases likely already failed when we tried converting them into GGUF.
And to be honest, what worries me most is that aws typically charges for the full file even if only a few bytes of it are being downloaded. But since the gguf parse on the hf page exists, I am sure it doesn't matter :)
S3 only charges for the actually used bandwidth as far as I'm aware. So if you only download the first 10 MB, HuggingFace should only be charged for 10 MB. They do charge a very low amount per 10K API calls, but this doesn't matter at all as we only have around 500K quants. I'm mostly worried that HuggingFace might be using intelligent tiering, in which case us accessing all the quants might cause them to be copied into hot storage, which then would cost them the transfer fee plus 30 days of hot storage. But in any case, there is not much we can do about any of this unless we find a storage usage pattern and can, based on one quant, tell how much all the others require, which I think might be possible.
Memory used during inference for things like attention is not considered but is easy to estimate. In fact, more memory is required to load a model if flash attention is enabled due to additional overheads associated with its implementation.
That's a bummer then... So how would you easily estimate it? And what do you mean by more memory being required to "load" a model - after loading, flash attention surely uses less memory.
Yes it should be fine unless we hit some kind of rate limit.
That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes is a bad idea. But, I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.
Voyage 200 calculator for school
I got the first HP 48SX in Germany (or so I was actually told by HP). Sigh. HP calculators... were so nice...
Windows 2000
Wow. That is so long after I had switched to GNU/Linux. (I switched from DOS to Linux just before win 3 became ubiquitous (in 1994, with 1.0.2 or something - I was even late to the game, or so it felt))
That is indeed terribly wasteful. 1 byte would have been enough.
Yeah, or 4 octet (or even 8 octet) header length + json/msgpack/cbor/... and yes, one octet would be enough if you limit strings to 127 octets, but to be fair, that's a limit of the encoder, not a limit of the format.
I'd say whoever designed it (well, gerganov) was probably paranoid about running into arbitrary 4GB limits anywhere. Puzzlingly enough, though, the primitive type numbers (there are 13) are stored in 32-bit ints. And no, everything is just octet-aligned, so it's nothing to do with that.
To its defence, the gguf decoder I wrote in Perl is just 80 lines of code. So in that sense, it lends itself to a very simple implementation. But using an existing JSON decoder with that header would just be 3 lines or so...
I think ggerganov has a major fear of external dependencies - even more than me, and I thought I was a bit on the extreme side.
S3 only charges for the actually used bandwidth as far as I'm aware.
I admit I am no expert, but it seems to be a well-known attack to request only part of a large file and get billed with much larger transfer costs, because aws does not bill octets downloaded but octets prepared for download, regardless of how much actually was used (or even requested). So yes, only actually used bandwidth, but it's their internal, made-up fantasy bandwidth, not the external customer-measurable bandwidth. It is possible that it only affects some S3 storage products, but it's a concern. Well, it's not a concern, because huggingface does it themselves, and I am happy to cache things...
S3
And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?
Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:
load_tensors: loading model tensors, this can take a while... (mmap = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected 5120, 152064, got 5120, 151665, 1, 1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'Methuselah-PSR-B1620-26b-14B-Exp.gguf'
main: Dryrun compleated!
changed the test to this:
if DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -n 0 -t 1 -no-cnv -st </dev/null 2>&1 | tee -a /dev/fd/2 | grep -q ": failed to load model"; then
That's a bummer then... So how would you easily estimate it? And what do you mean by more memory being required to "load" a model - after loading, flash attention surely uses less memory.
DRYRUN tells you how much memory you need to load a model and reserve the memory required for its context. So if you have as much memory as DRYRUN tells you, you will be able to load the model. However, depending on context and prompt you might still OOM during inference, as some memory is allocated during inference for algorithms like attention. The memory required for attention should more or less be the same for a given context with a given attention method. So you can likely measure it once and add it onto what DRYRUN tells you is required to load the model. Flash attention needs more memory during the initial load, but the attention algorithm itself uses linear instead of quadratic memory for a given context, which for large contexts should be more memory efficient.
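If you want a very crude analytic ballpark instead of measuring it once - everything below (the formula, the batch size, the example numbers) is a rough assumption of mine, not something DRYRUN or llama.cpp reports:

```python
# Crude, illustrative ballpark of the attention scratch memory a dry run does not
# include: non-flash attention materializes a scores tensor of roughly
# n_head x n_batch x n_ctx floats for the layer currently being computed.
def attn_scratch_mib(n_head, n_ctx, n_batch=512, bytes_per=4):
    return n_head * n_batch * n_ctx * bytes_per / 1024**2

print(attn_scratch_mib(n_head=40, n_ctx=32768))   # ~2560 MiB for a 13B-ish model at 32k context
```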
That doesn't worry me either - I envisaged some kind of bulk update because I thought versioning the readmes is a bad idea. But, I changed my mind. If we hit a rate limit, it will take a few years to update old repos - so what.
The limit can't be so bad that it will take years. We should try to update them in a reasonable timeframe as the current model card isn’t that good in my opinion.
And don't they also bill GET requests? So there must be some optimal transfer size - probably in the megabyte range?
They do, but it is $0.0004 per 1,000 requests, so if we need 500K of them that is $0.20, which is so low it is almost not worth mentioning.
HuggingFace will be fine:
"There are no retrieval charges in S3 Intelligent-Tiering. If an object in the infrequent access tier is accessed later, it is automatically moved back to the frequent access tier. No additional tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class."
So if they use Intelligent-Tiering they are not getting charged for AWS being stupid, beside paying slightly more for files being in less cold storage for 30 days, which is almost nothing compared to what retrieval charges would be.
In case you wonder, data transfer from S3 to Europe (Zurich) is $0.02 per GB, and nothing if it only goes to Amazon CloudFront (which has its own billing for bandwidth). Based on their website they really seem to only count data that is actually getting sent to the internet, and intelligent tiering has no retrieval fee, so they really shouldn't bill for the data we don't download unless they found some kind of loophole to trick their customers.
But in any case, there is nothing we can do about any of this.
Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0
That's so stupid. Sorry for this mistake. I forgot about that. I will fix it this evening.
changed the test to this
This will work in the meantime.
DRYRUN tells you how much memory you need
I realise what you mean. I guess it can also be handled by telling the user to reduce ngl a bit when in doubt. It will still be far more useful than the manual trial runs I have to do now.
The limit can't be so bad that it will take years.
I meant to say "even if it takes a few years..." and I didn't expect the repo create limit to be as bad as it is. Or erratic(?) - it still feels weird to get rate limited sometimes, even when we don't crunch through lots of models.
S3
Thanks, helps a lot :)
This will work in the meantime.
We are not in a hurry - assuming that we always get "failed to load model". Eh, even if it would not, it'd still be a great improvement :)
model page
Well, my plan is to get rid of graphs and everything but the download table and the links, and also more or less fully generate the page and move all metadata to yaml. The only hindrance is that it is a lot of work, and even a single misplaced space or fixed typo will cause havoc :) Just not so much fun. But I am slowly working towards making it doable (and gaining motivation by not forcing myself to work on it :)
If you have any concrete input (text fragments, layout) on the model page, I am happy to collect it. The general trend, though, should be to move as much of the info to the external model page, so there is only one place to improve. Unfortunately, the model download page already needs revamping, too, and already goes too much in the direction of web development for my taste :)
Sooooo, DRYRUN gives me an error message for a failed model, but exit status is 0:
This should now be fixed in the latest version. I kind of forgot about llama.cpp sometimes using exceptions to jump out of heavily nested functions, skipping all the code that would otherwise get executed by following the normal return path. I personally don't really like throwing exceptions somewhere and handling them in a completely different location - it feels like a modern version of goto but without labeling where it jumps to.
I fixed this by adding a dedicated exit point for dry-run inside common.cpp to no longer mess with llama.cpp's exception handling, and removing all modifications from main.cpp. This now ensures exceptions skip past the dry-run dedicated exit point and are instead properly handled by main.cpp.
I also updated the mradermacher branch to latest llama.cpp so we now have Gemma 3 and experimental Gemma 3 vision support.
You guys might find this interesting: https://arxiv.org/abs/2503.03592
Quote from conclusion:
Further, the usage of importance matrices written in non-English does not significantly improve performance on non-English datasets and might in fact slightly harm it. However, this reduction in performance is not statistically significant.
You guys might find this interesting: https://arxiv.org/abs/2503.03592
Thanks a lot for sharing! I looked at the paper and am really surprised by the result. Their testing methodology looks clean and the results tell quite a clear story. This means our primary English imatrix dataset is much better for non-English models than we thought. I now regret having non-English models only queued for static quants.
@nicoboss I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)
Here are all the errors (deduplicated), and they do all seem legit (and therefore I have nuked them):
/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
/llmjob/llama.cpp-cuda512/src/llama.cpp:8666: GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor") failed
I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?
regarding the paper, it's one of the results I expected (either it's no big deal, because a lot about imatrix training data seems irrelevant, or it has a big effect). But finally I can choose between these extremes!
I also feel much better about my training data now, which is pretty incoherent. But given that random tokens seem to work relatively fine, it would actually be surprising if it were so detrimental.
The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?
@tdh111 anyway, very much appreciated, I would have never seen this paper without you
I assume you queued all/most of the nice lint check models that all fell through the llama loading code check? :)
I queued quite a lot of trending models, some of which turned out to be bad. Those errors are all legit and can be nuked.
I suspect these are all pure embeddings and therefore can't be used with llama-imatrix?
Yes exactly. I will improve my trending model discovery scripts to filter out embeddings in the next version. I will also check if there is a way dry-run can detect this. The main issue is that this is a check that occurs at inference time inside llama_decode_impl and not while loading the model.
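Something like this heuristic might do for the discovery script (a sketch using huggingface_hub; the pipeline_tag and architecture-name checks are just my assumptions and won't catch every embedding model):

```python
# Sketch: heuristically skip embedding models when collecting trending models.
# The pipeline_tag / architecture heuristics are assumptions, not a complete check.
from huggingface_hub import HfApi

EMBEDDING_TAGS = {"sentence-similarity", "feature-extraction"}

def looks_like_embedding(model_id, api=HfApi()):
    info = api.model_info(model_id)
    if info.pipeline_tag in EMBEDDING_TAGS:
        return True
    archs = (info.config or {}).get("architectures") or []
    # bare ...Model architectures (BertModel, XLMRobertaModel, ...) are usually encoders/embedders
    return any(a.endswith("Model") for a in archs)

print(looks_like_embedding("cl-nagoya/ruri-large-v2"))   # the example from this thread
```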
The last 2 failures you can nuke if you want.
https://huggingface.co/cl-nagoya/ruri-large-v2 likely requires manual GGUF conversion due to ModuleNotFoundError: No module named 'fugashi'
No idea why https://huggingface.co/google/flan-t5-xxl fails to download, but if the just-started Redown fails I guess I will provide the GGUF manually there as well.
Edit: Nevermind, cl-nagoya/ruri-large-v2 likely is an embedding as well, so I nuked it as we don't care about them.
Edit2: I think redown fixed flan-t5-xxl, so it must have just been some random HuggingFace download error.
Edit3: No, flan-t5-xxl failed again: ValueError: Missing or incomplete model files: ['model-00001-of-00005.safetensors']
anyway, very much appreciated, I would have never seen this paper without you
Thanks a lot for sharing!
Glad you both liked it.
The question is, what does that tell us about how llms store knowledge? and how about IQ quants, which are far, far more sensitive to imatrix weights?
Both of those are separate from the paper I linked, but this paper is relevant to your first question: https://arxiv.org/abs/2503.05613 .
Your second question about IQ quants is best answered by ikawrakow, who would most likely answer if asked in a discussion post in ik_llama.cpp. I feel like I know the answer but I'm not confident enough to give it, because I would rather not spread potentially wrong information. But now that you ask, I'm curious whether the same holds true for his new quant types (IQ_K), which at low bpw offer better performance than I-quants and at higher bpw offer better performance and quality compared to K-quants.
I will also check if there is a way dry-run can detect this.
Be careful - the more checks you add, or rather, move, the more you will diverge from future llama.cpp versions that might do things differently. There is a trade-off here, between catching more things and maybe blocking future roads.
some random HuggingFace download error.
Possible, but unlikely, as hfd retries pretty aggressively. When you open a (s) in audit, the download is printed (it's in MODEL/log, too). If it's a new model, a much more common failure mode is actually not-yet-uploaded files. For example, YOYO-AI loves to make elaborate model cards before actually uploading all files :/
I'm unexpectedly busy (and probably rather tired) for the next few weeks. I'll try to take care of things, but don't be alarmed if things get a bit more erratic.
also not caught:
llama_model_quantize: failed to quantize: key not found in model: llama.context_length
this is actually caught by quantize, so causes extra human work, but not extra computational work (it's caught during static jobs).
interesting that quantize even bothers...
and clearly, nice level 1200 is the junk class
How do I know if /tmp/quant/Samantha-1.11-70b-i1-GGUF/imatrix.dat is an old or a new imatrix? I unfortunately nuked the existing imatrix repo before hash-comparing them. I checked for Samantha-1.1-70b, which is basically the same case, and they were different, so I'm almost certain the imatrix for Samantha-1.11-70b got recomputed as well. It seems like the case where, after a nuke, the existing imatrix gets copied only happens if it was somewhat recently generated, but not for these 1-year-old repositories where static quants never even existed. In the future I will obviously use nukeall so none of this will be an issue.
and clearly, nice level 1200 is the junk class
I noticed this as well. I nuked so many errors this morning when I woke up. We had almost entire hosts filled with errors.
also not caught:
llama_model_quantize: failed to quantize: key not found in model: llama.context_length
This does get caught using dry-run. Not sure why you think it does not. I even tested one of the models that had this error today to confirm:
llama_model_loader: mmap is not supported for dry-run so it is now disabled
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 12.55 GiB (16.00 BPW)
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: llama.context_length
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/nico/law-LLM.gguf'
main: error: unable to load model
I'm unexpectedly busy (and probably rather tired) for the next few weeks. I'll try to take care of things, but don't be alarmed if things get a bit more erratic.
No problem. Now that you gave me all these amazing tools and I got familiar with using them, I should be able to solve most of the issues myself, hopefully letting you focus as much as possible on your job. Just ignore things and only respond to what is important, to save time. I'm doing the same when I'm busy. Feel free to ignore user requests and audits as I can handle them myself.