Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I assume that is in GB and not GiB, in which case 474 GiB might fit, as we have 503 GiB of RAM (after subtracting RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
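(Just the unit conversion as a sanity check, assuming the 509G above is decimal gigabytes:)
print(509e9 / 2**30)   # -> ~474 GiB, to be compared against the 503 GiB of usable RAM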
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
Q6_K is fine for me. Q8_0 might not fit without offloading and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus there will be enough RAM left to let quantization tasks running if we do Q6_K. If you already have Q8_0 locally you should give it a try and see if it fits but if not Q6_K is fine for me.
I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf
so please give it a try to see if it fits. I believe it should fit if nothing else is running as the model has such a small number of layers. If it doesn't fit use Q6_K instead.
474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I'll try an offload of 1 and 0, then Q6. hopefully it does not crash.
I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
457.4g after warming up.
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)
llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1 and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:
Yes it is clearly streaming from SSD now:
Once the quantisation tasks are interrupted it should work without SSD streaming again.
This is somewhat worrying:
[1]2.9360,[2]2.3937,[3]2.4731,[4]2.5391,[5]2.8621,[6]2.8125,[7]2.6349,[8]2.9891,[9]2.8659,
save_imatrix: entry ' blk.34.ffn_up_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.34.ffn_down_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (83.59%) - skipping
save_imatrix: entry ' blk.34.ffn_gate_exps.weight' has partial data (98.44%) - skipping
save_imatrix: entry ' blk.1.ffn_up_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry ' blk.1.ffn_down_exps.weight' has partial data (94.53%) - skipping
save_imatrix: entry ' blk.1.ffn_gate_exps.weight' has partial data (94.53%) - skipping
save_imatrix: storing only 373 out of 385 entries
Yes, one iteration after both quant tasks finished it stopped streaming. But these are big tasks.
Nope, started again.
As for the quantize tasks, I don't know what is going on. I was also able to see this, but now I am unable to see any processes.
I think it stopped streaming for good. It is possible that it also takes a few iterations for everything to stay in memory.
Top now at 461.3g (495GB). So it isn't tight. Let's see what happens.
This is somewhat worrying:
It should be fine and maybe expected for a MoE model with 128 experts. According to the llama.cpp source code (https://github.com/ggerganov/llama.cpp/blob/d9c3ba2b7749c00df477599aa141a98b4521aa2c/examples/imatrix/imatrix.cpp#L218-L219 ) this warning is part of the code to avoid writing imatrix entries that do not have full data which can happen with MoE models where some of the experts end up not being exercised by the provided training data.
Storing 373 out of 385 entries seems to be good enough.
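(Incidentally, the reported percentages line up exactly with experts-covered out of 128; a quick check in Python:)
# "partial data" percentage = experts covered / 128 total experts
for covered in (107, 121, 126, 127):
    print(f"{covered}/128 = {covered/128:.2%}")   # 83.59%, 94.53%, 98.44%, 99.22%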
It's reducing. These look like useful new messages, actually.
[10]3.1400,[11]3.2586,[12]3.0453,[13]3.0821,[14]3.3073,[15]3.5876,[16]3.7071,[17]3.9026,[18]4.0482,[19]4.1979,
save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (89.06%) - skipping
save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (89.06%) - skipping
save_imatrix: storing only 379 out of 385 entries
It's reducing. These look like useful new messages, actually.
This is expected as the longer we train the more likely experts are to be included during imatrix training. I'm wondering if MoE models need longer imatrix training compared to monolithic models. This one has 128 experts while only 2 are active for a given token so we only use 1/64th of the model for every token.
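(A rough back-of-the-envelope, under the simplifying assumption of uniform routing with 2 of 128 experts active per token, suggests training length alone is not the limiting factor:)
# probability that one specific expert is never among the 2 active experts for any of n tokens
p_skip = 1 - 2/128          # chance a single token does not touch that expert
for n in (1_000, 10_000, 100_000):
    print(n, p_skip**n)     # vanishingly small already at a few thousand tokens
So a tensor that still misses an expert after hundreds of 512-token chunks points at the routing (or the content of the data), not at the amount of text.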
If it stays that way, there's a good chance that the imatrix quantization will fail (if the message means what I think it does). If true, it intuitively makes sense - it's harder to tickle all experts in such a massive MoE model. Well, we have another 330 chunks.
I'm wondering if MoE models need longer imatrix training
Longer is unlikely to help - the right training data is more likely. The top two (with 99.22%) have not reduced in the last iterations. And good that I save every 10 iterations - I knew someday it would be useful for something :)
Pretty exciting. Anyway, over and out for a while.
What is interesting is that it doesn't show a message for every tensor it skips. And it really is quite fast - obvious in hindsight. But I don't think the remaining chunks will do anything. Let's see if it quants. My prediction would be that it will likely fail with low bit quants.
I think starting a second imatrix computation task while snowflake is still running might not have been the best idea as it caused snowflake to run out of RAM and SSD streaming again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.
-2000 488 snowflake-arctic-instruct run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
42+ 13 Gugugo-koen-7B-V1.1 run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758
Unfortunately some quantisation tasks have now started to run as well:
1 66 I huihui-ai-abliterated-Qwen2.5-32B-Inst-BaseMerge-TIES run/imatrix 9/25,IQ4_XS [705/771] (hfu i1-Q6_K)
Not sure what I should do to pause the quantization tasks. I could pause the entire host but seems a bit overkill and might cause other issues.
If it stays that way, there's a good chance that the imatrix quantization will fail
I don't think it will fail. It will hopefully just statically quant blk.0.ffn_down_exps.weight, blk.0.ffn_gate_exps.weight and blk.0.ffn_up_exps.weight which should be fine as then the vast majority of the model will have the imatrix applied and it seems unlikely there would be any meaningful real world quality difference. The question is more if llama.cpp is capable of quantizing with a partial imatrix. I don’t think this was ever tested.
The top two (with 99.22%) have not reduced in the last iterations.
(100/128)*127 = 99.21875% => 99.22%
They are just missing a single expert on a single layer. For some reason none of our training data seems to get routed to this specific expert in the first layer. All other layers already reached full coverage.
Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.
As a sidenote, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.
I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.
Either the expert is very specific, or maybe it's just a model bug. That would also explain why we use less memory than expected - that expert is never paged in.
I would assume a very specific expert. I couldn't even come up with 128 different types of experts, so I expect some of them to have really specific areas of activation.
As a sidenote, pausing without giving any explanation is very disruptive when we are doing something exceptional like generating this imatrix. I have no clue what is going on, and I can't return the system to its normal state again.
We would ideally prevent the scheduler from starting any tasks while the imatrix of such massive models is being computed. It is not that bad if this happens while running them normally, as it will just start streaming from SSD, essentially pausing it until there is enough RAM, but with RPC running out of RAM will result in a total system crash. I likely should have just let it stream from SSD until you had time to fix it, but I know that the /tmp/pause flag only makes new imatrix tasks wait in an endless loop, which unlike pausing the entire host should be safe.
While we are on the topic of pausing: the performance measurement project is coming along extremely well, so soon I will have to pause the entire nico1 host for multiple nights if we want to do the performance measurements on StormPeak. I hope this is not too disruptive or I might not do it. I'm currently doing performance measurements on Threadripper, CastlePeak, Raspberry Pi 4 and the 7840S laptop, and all of them should be done within the next few days. I will try to keep the StormPeak measurement to an absolute minimum and only measure with 32 threads, which based on my current results should be the setting that gives the best performance on a 32-core/64-thread CPU.
I've manually finished the snowflake imatrix so I can at least go back to normal operating mode.
Awesome, and I see that the snowflake imatrix quantizing seems to work! Thanks a lot for doing imatrix quants of this amazing model. If the imatrix quants turn out well we can do the snowflake base model too. I will give them a try tomorrow.
I likely should have just let it stream from
The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag (which helped nothing in this situation as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem).
I will have to pause the entire nico1 host for multiple nights
Right now would probably be a bad time for that, as I will soon have to switch off dbX/backup1, and I currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already did run dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.
It's up to you, though, and I can try to cope, but it would add a level of manual managing that I could avoid at this point :)
Normally it is not an issue to pause, especially for a night. It is always an issue when the system is in an exceptional state though, e.g. when doing big models (which requires some intervention due to dependencies the system cannot see) or shortly before switching off nodes.
The problem is the constant meddling without feedback. If you'd communicate this I could fix it and remove the pause flag
What do you mean? I did communicate everything a few messages ago, as you can see under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6754b97f1bc6b93608c48774 or in the following quote:
I think starting a second imatrix computation task while snowflake is still running might not have been the best idea as it caused snowflake to run out of RAM and SSD streaming again. I now set the /tmp/pause flag to stop any further imatrix computation tasks from running.
-2000 488 snowflake-arctic-instruct run/imatrix (GPU-2d) / 196.40s/c 162.9/1194.8m(219.4) [271/365] 6.7983
42+ 13 Gugugo-koen-7B-V1.1 run/imatrix (GPU-18) 53/32 1.00s/c 2.7/6.1m(5.1) [194/367] 26.2758
I did describe exactly what I did and why I did so.
which helped nothing in this situation as the scheduler did not start new imatrix tasks anyway and /tmp/pause does not affect quants, which were the problem
No, it did start imatrix computation for Gugugo-koen-7B-V1.1 while the snowflake-arctic-instruct imatrix computation was still running (as can be seen in the status page snippet posted above) and later even tried to start another one, but that got luckily paused by the /tmp/pause flag. Please check your logs why this happened.
Yes, the quantization tasks were an issue as well, but they are not as bad as parallel imatrix tasks. Quantization tasks will eventually finish and free up enough RAM for imatrix tasks to no longer stream from SSD, while if two imatrix tasks start streaming from SSD none of them will ever finish. We were lucky it was only a 7B model and so fully offloaded to GPU. What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started and it just happened to allocate memory on the GPU not used by snowflake-arctic-instruct. If a model uses multiple GPUs for imatrix computation there never should be a case where another imatrix task starts or GPU memory conflicts might occur.
Right now would probably be a bad time for that, as I will soon have to switch off dbX/backup1, and I currently rely on being able to reduce the queue size so I can let them work till the end. Some of them already did run dry tonight because the pause flag was set for relatively long and I reduced the queue size a few days ago.
No hurry then, I will wait for dbX/backup1 to be gone. I already have really good performance measurements, so I can already start analyzing them even without waiting for data from StormPeak, or use this time to measure some other devices like my phone.
I did describe exactly what I did and why I did so.
You are right, somehow I didn't see that message, and you acted well. Sorry for doubting you.
Please check your logs why this happened.
It happened because I wanted to see the effect of it - since that model would fit completely into the vram, it should have worked, after a small disruption due to loading the model. Either that, or I would have gained understanding. I was still there when it happened, and even if it weren't planned, I would have cleaned up. The same happened when the quant jobs were started at 22:00, which was not planned :)
There was plenty of RAM available - it might still have started streaming due to bad memory management in linux, but that is another story.
I also don't think (and don't see) how it would have started a third imatrix job, as so far it has never tried to start three jobs, simply because it would not have the budget and gpu available. It did start a second one after snowflake was done, though.
We were lucky it was only a 7B model
It wasn't luck, there simply was no budget for (much) more.
What was even scarier is that despite snowflake-arctic-instruct running on both GPUs another imatrix task was started
It was only running on one gpu - I changed the job to reflect that (the status display never reflected that because it was originally overwritten).
If a model uses multiple GPUs for imatrix computation there never should be a case where another imatrix task starts or GPU memory conflicts might occur
Right, and as far as I can see, that rule was never violated.
I will wait for dbX/backup1 to be gone.
Thanks, that helps a lot.
I don't think it will fail. [missing imatrix data for a tensor]
Llama.cpp commonly fails to quantize moe's for this reason (I have lots of models where I don't have imatrix quants for that reason). I do not know if this message correlates perfectly with that (the message is new), but llama.cpp does not quantize tensors it has no imatrix data for - it's the same message you get when trying to do low-bpw quants without an imatrix. It predominantly happens on "less important" models, so I usually do not make a fuss of it and simply skip the model, or in some cases the imatrix quants.
llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
There goes my IQ1 of snowflake :/
Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?
I'm generating the remaining quants. I see only two options: a) find training data that exercises that expert b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).
There goes my IQ1 of snowflake :/
It indeed fails, but only for very low bit per weight quants. This is because, as expected, it statically quants the layers containing missing experts, which in this case is layer 0. There is a check in llama.cpp that stops the quantization process if one tries to statically quant with a too low bit per weight, as this usually results in an unusable model. You are right: if there is still partial data at the end of imatrix training, imatrix quantization will fail for all low bit per weight quants. All other imatrix quants will work without any issues and without any real-world quality impact, as only one out of 35 layers is quantized statically, so 97.1% of the model is quantized using the imatrix. Here is the full llama.cpp error:
============================================================
Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
The result will be garbage, so bailing out
============================================================
Also... the 99.22% is almost certainly not accidental ((100/128)*127), but if we just skip one expert, shouldn't there be more tensors affected?
I think in this specific architecture things can get rerouted to different experts for each layer so bad training data would only affect the first layer. But honestly the snowflake architecture is extremely complicated and poorly documented so I do not yet fully understand it.
I'm generating the remaining quants.
Awesome!
a) find training data that exercises that expert
I will try bartowski's imatrix training data on some smaller quant on the RTX 3080 GPU to check if it will activate all the experts.
b) patch llama.cpp to write out the data and try to use it - even if it generates garbage for one expert, it doesn't seem to be easily triggered. Patching might be trivial (just force it to write it out) or hard (quantize might crash if we don't synthesize "acceptable" data).
Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage. The main issue is that it will not just affect this expert but the entire first layer. Despite there only being one expert missing the imatrix training skips storing it in its entirety. The first and last layers are usually quite important, so there will be some negative impact on quality, but it will be far from garbage. A better option would be to force imatrix training to store the partial data of the first layer, but I have the feeling that if that were easy the llama.cpp developers would have long done so.
Just had some worrying experience: the huggingface-cli silently failed to download all files, but also did not report a failure - when I tried to redo https://huggingface.co/xxx777xxxASD/L3-SnowStorm-v1.15-4x8B-B it skipped over 3 model files that are nevertheless in the repo.
I wonder how much silent corruption that would cause.
Patching out this check for quantization should be easy. It was only added to avoid users generating quants that are essentially garbage.
That's not what I proposed - b) proposes to use the data, not skipping it in quantize.
Also, llama.cpp tends to crash during quantisation; it did not actually generate garbage quants that often, although that was one outcome.
With your proposal I would expect very bad results, because we force low bpw quantisation without any data on a tensor that seems vital, while the b) proposal would hopefully only leave it partially trash. The problem I see is that just writing e.g. 0 might make llama.cpp crash, so we might even have to synthesize data. The latter problem could be tackled when it happens, though.
None of this seems trivial to me.
I really don't want to implement your proposal in any case, I think it would be better to just leave out those quants in that case. Which also destroys my chance of getting an _IQ1_S :)
Despite there only being one expert missing the imatrix training skips storing it in its entirety.
You think all the experts are in that one tensor? (or those three, actually)
You think all the experts are in that one tensor? (or those three, actually)
The dimension of blk.0.ffn_down_exps.weight is [ 4864, 7168, 128, 1], which indicates it contains data of all 128 experts. If you look at https://github.com/ggerganov/llama.cpp/blob/43ed389a3f102517e6f7d5620d8e451e88afbf27/gguf-py/gguf/gguf_writer.py#L138 you see that all tensors with "_exps." in the name are supposed to contain data for all experts.
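(For anyone wanting to double-check the shapes, a minimal sketch using the gguf-py reader linked above, assuming its GGUFReader API; the file path is a placeholder:)
from gguf import GGUFReader

reader = GGUFReader("snowflake-arctic-instruct.Q8_0.gguf")
for t in reader.tensors:
    if "_exps." in t.name:             # per-expert tensors
        print(t.name, list(t.shape))   # one dimension should be 128, the number of experts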
That is exactly what I mean - that means your suggestion that one expert is missing does not match the data. We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.
So the naive explanation, that our training data fails to activate an expert, must be wrong.
BTW, I don't understand any details of what the imatrix actually measures, specifically, what it means to lack data for part of a tensor.
I would not be terribly surprised if this was just a model defect (and maybe not even a defect). My reasoning is that we have models that generate NaNs, and according to the llama devs, this means the model is completely unusable, yet they still work fine, so there must be a way for parts of a model to be "unused". Of course, that reasoning is weak because f16 and below can't even represent NaNs, afaicr.
And for something completely different, in the downloader/interactive model summary page, I have changed the quality score calculation to be strictly monotonic - before, Q8_0 would unstably sort after Q6_K because they'd end up with the same integer score of 99. Now Q8, i-Q6 and Q6 get 99, 98, 97, respectively. I think that's a reasonable trade-off between being a simple ranking and assigning meaning to absolute quality differences. It also allows sharing imatrix and static quants in one table.
I don't think I can improve it much in the short term (especially since I didn't do client work in the last days at all), but once I find a nice way to make the link, I will put the link on the model page and update all models. On the other hand, when I am more actively working on my day job, I also tend to be more active on my hobby side. Strange how these things work - if there is little for-money work, I lack the impetus for doing side projects, too.
(Example: https://hf.tst.eu/model#testrepo-i1-GGUF)
We might lack a part of one expert, but we still have data for that expert, and we still activated it at some point.
blk.0.ffn_down_exps.weight contains data for all 128 experts, but we only imatrix-measure 99.22% of it, so we miss exactly one expert for that specific tensor. We do get data for all experts on all tensors not associated with layer 0. We miss one expert in one layer, which causes llama.cpp to not save any imatrix data for this specific layer. We do have data for all experts for every other layer.
In any case I will soon try different imatrix training data to see if I can somehow manage to cover this specific expert in layer 0.
I would not be terribly surprised if this was just a model defect
There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.
There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate.
I agree fully with your explanation (which matches my much more fuzzy understanding), but clearly this expert must somehow be activated if the other tensors for this expert are. Clearly my understanding is flawed/missing, because I am surprised you can activate only part of an expert. I would assume all weights to be used. But I don't know how the imatrix measurement decides what was active and what not - my understanding is that using a tensor, or an expert "slice" of it, is essentially just a matrix multiplication, which should "use" all of it.
In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)
And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.
seems Alsebay has deleted almost all of his models. haven't been fast enough in quantizing them.
seems Alsebay has deleted almost all of his models. haven't been fast enough in quantizing them.
How sad. Turns out he deleted them all for nothing. Today we finally got an official document explaining the new HuggingFace storage quota: https://huggingface.co/docs/hub/storage-limits and discussed in https://huggingface.co/posts/julien-c/388331843225875
*We aim to continue providing the AI community with free storage space for public repositories, please don’t abuse and upload dozens of TBs of generated anime 😁. If possible, we still ask that you consider upgrading to PRO and/or Enterprise Hub whenever possible.
Maybe we should consider upgrading the mradermacher account to PRO as it is just $9/month, which is nothing compared to our operation cost, but it is not required for us or anyone else to do so.
I think if hf restricts the mradermacher account after saying "unlimited repositories" they are shooting themselves in the foot. They already did, though. Not sure what to do, but I am not sure it should be supported. Anyway, let's see what happens. I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too. I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.
In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less. And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks like I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.
You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...
My account looks the same btw., i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".
Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)
Still feeling a bit woozy after so much relieving news today. On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...
I am fine with best-effort if it means we can continue. And if hf would contact me and ask to tune it down, I would be open to that, too.
This is exactly what it means. Even for free accounts, storage for public repositories is unlimited as long as it is not getting abused. They are mostly just begging for PRO. As with every tech company, for every PRO subscriber they have they can get a much larger sum of money from investors. This is also why the price of a PRO subscription is way lower than it should be given what you get.
I am not expecting problems though, as clearly they must be aware of the account with most repositories on hf.
They for sure are aware of us and appreciate our work.
In other news, my parents are out of the hospital and very likely will recuperate soon. Lots of stress less.
Awesome to hear!
And while my main file server is still read-only, there are only four files so far that are unreadable (backup is running, but it looks like I can get essentially a 100% backup - the missing files are a single log file and some partial torrent downloads). So less stress there, too. We could do some useful work as well in the last days, so less stress there, as well. win win win.
Great to hear that you didn't lose any important files.
You might be shocked to hear, but I am contemplating nuking the log-tree on my file server filesystem, mounting the resulting fs read-write, deleting the damaged files and if a scrub says it's fine, will continue to use the filesystem without reformatting. Can't cope with another week or two for the restore. And hey, I have backups...
I would likely do the same if I had a file system as massive as yours.
My account looks the same, i.e. no longer is there a public repo quota. I interpret "best effort" as "unlimited, till we simply can't sustain us".
That's exactly what they mean.
Maybe instead of paying, we should ask them for free gpu time, or a free pro upgrade :)
Amazon S3 frequent access storage for 500TB+ is $21/month/TB, so they already pay around 100k/month in storage cost for us, but that's still almost nothing compared to what the bandwidth cost must be. Let's appreciate what they give us and not ask for more. If there are no models on HuggingFace there is no point in it even existing, so our and other users' time and resource investment is HuggingFace's biggest value and what keeps HuggingFace alive as a platform. We are essentially donating the resources of 11 servers and a massive amount of time to HuggingFace and the open source AI community, so I'm sure they see and appreciate what we do.
Here is a screenshot of their community post which clarifies things:
Still feeling a bit woozy after so much relieving news today.
Today was awesome. I'm especially relieved about HuggingFace removing the storage quota for public repositories, as the storage limit worried me way more than it should have.
On the other hand, my §"%/"§ing parents rang me out of the bed after only a few hours sleep, after I explicitly told them to sit in the cafeteria for an hour or two before I fetch them. So this day is mostly lost due to me being tired. And I can't even complain...
Similar things have happened to me so many times as well. It always seems to happen when I explicitly tell them to let me sleep.
And as for quantize using the gpu, I can try to make another llama build profile (avx512 nocuda). It's interesting, because I never had such a profile, nico1 always used either my cuda or cuda512 profiles (cuda + avx512). And if each quant is using 384MB, that's quite a lot for not needing anything.
I wonder if removing the GPU from quantisation tasks would have any performance impact. Those 400 MB don't really matter as we never really use the full GPU memory for imatrix anyways. But if it serves no purpose for quantisation we can just use llama.cpp without CUDA or set CUDA_VISIBLE_DEVICES to nothing.
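(A minimal sketch of the CUDA_VISIBLE_DEVICES variant; the binary name and arguments are placeholders:)
import os, subprocess

# an empty CUDA_VISIBLE_DEVICES hides all GPUs from the child process
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
subprocess.run(["./llama-quantize", "model.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
               env=env, check=True)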
In any case, good luck with the training data. Clearly the best solution, if you can pull it off. I can see imatrix-training-full-4 rolling in already :)
Datasets I tried so far:
- c4_en_ja_imatrix
- calibration_datav3
- imatrix-with-rp-format-data
- 4chan pol_062016-112019_labeled
- Tech-Awesome-Hub/mix-data
- GitHub Readme
- MMLU
- Merges between above datasets
The only ones that got 127 out of 128 experts, other than yours, were "calibration_datav3" from bartowski and "imatrix-with-rp-format-data". Many datasets got way fewer experts than that. It clearly is the quality of training data and not the amount that matters. 4chan pol_062016-112019_labeled is massive, but when I aborted it, it only had 122 out of 128 experts on layer 0. MMLU, which I thought is really diverse, only managed to trigger 121 out of 128 experts on layer 0. "Tech-Awesome-Hub/mix-data" was, with just 120 out of 128 experts on layer 0, even worse than that.
In conclusion you have really awesome imatrix training data and much of the training data I tried was significantly worse. So "imatrix-training-full-3" is likely better than you think. I will continue trying to find datasets that activate all experts. If you have any idea what datasets to try please let me know. I'm really interested in this topic.
A somewhat urgent request for your input, deepseek imatrix just failed:
common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'
so, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift in all llama-imatrix calls?
I wonder if removing the GPU from quantisation tasks would have any performance impact.
I am, as usual, unburdened by actual knowledge, but I always thought it's cpu-only. And I suspect the 384MB is some kind of, well, not leak, but probably some dummy workspace allocation. In any case the gpu is completely idle when quantizing.
or set CUDA_VISIBLE_DEVICES to nothing.
I'll do that and see what happens.
In conclusion you have really awesome imatrix training data
No wonder, as the first part is bartowski's training data :)
common_init_from_params: KV cache shifting is not supported for this model (--no-context-shift to disable)'
so, imatrix does context shifting? I am surprised. Do you think it would be an issue to specify --no-context-shift in all llama-imatrix calls?
Support for the --no-context-shift option was added to imatrix computation yesterday by bartowski in https://github.com/ggerganov/llama.cpp/pull/10766 so make sure to use the latest llama.cpp or it will not have any effect.
According to https://github.com/ggerganov/llama.cpp/issues/9390 if disabled:
- Requests bigger than context window will result in an error.
- n_predict for each sequence will be capped to n_ctx - n_tokens_prompt
I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
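(So the change would just be adding the flag to the existing invocation, e.g. in a wrapper; paths and other options are placeholders:)
import subprocess

subprocess.run([
    "./llama-imatrix",
    "-m", "model.gguf",
    "-f", "imatrix-training-data.txt",
    "--no-context-shift",   # needs a llama.cpp build that already includes PR #10766
    # ...plus whatever other options the job already uses (output file, -ngl, chunks, ...)
], check=True)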
Online repacking got merged which removes llama.cpp support for all Q4_0_N_M quants: https://github.com/ggerganov/llama.cpp/pull/10446
I highly recommend no longer generating them as they no longer run in the latest llama.cpp. Even bartowski will no longer upload the now deprecated and unsupported ARM/RISC-V quants: https://huggingface.co/posts/bartowski/807894839859408
I'm quite happy about this llama.cpp change as ARM/RISC-V quants were kind of stupid, as they used the same data just aligned differently to be optimized for a specific architecture.
I'm quite happy about this llama.cpp change as ARM/RISC-V
I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.
I don't think any of this should be needed for imatrix computation and so it should be safe to disable it for all imatrix computation tasks.
Hmm.. why have the option in the first place then (for imatrix computations). Weird.
Anyway, thanks a lot for your updates/feedback. I'll try it out on deepseek asap, and then probably hardcode it.
[snowflake] If you have any idea what datasets to try please let me know.
I don't, but maybe something was published on the training material, or its area of expertise. For example, if it lists support for 22 languages, maybe we need some of these languages.
Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of moe's which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.
I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.
I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.
I've overriden deepseek for the time being.
PS: I haven't watched top, so I don't know if memory usage for deepseek (or the new llama-imatrix) is considerably larger than for other models.
PPS: you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?
PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.
I started nico1 again. You can do DeepSeek-V2.5-1210 now as nothing beside nico1 is currently running. I recommend you interrupt any other quantisation and imatrix task before starting it as RAM will be relatively tight.
Sorry, was sleeping. I'll have a look. I'll investigate why rich1 can no longer reach nico1.
interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.
KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.
Hmm https://huggingface.co/WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum number of models I can queue (and I guess it will end up being 200 more, before I add a few more months).
That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility)
and more unimportant FYI: i have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). Hopefully this hack will fix the stuck uploads.
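(Not the actual wrapper, which is an LD_PRELOAD shim in C, but the same pattern illustrated in Python:)
import signal, socket

class ReadStalled(Exception):
    pass

def _alarm_handler(signum, frame):
    raise ReadStalled("no progress for 90s")

signal.signal(signal.SIGALRM, _alarm_handler)   # main thread only

def guarded_recv(sock: socket.socket, n: int = 65536) -> bytes:
    signal.alarm(90)       # corresponds to alarm(90) before the read()
    try:
        return sock.recv(n)
    finally:
        signal.alarm(0)    # corresponds to alarm(0) after the read()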
And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (february to end of august). The last months in that range were indeed pretty much empty. Very weird.
I plan to look at the post-august months, and at the time before february. I expect the former range to yield few models, and I plan to be much more selective with the pre february range, so I think this is the likely maximum queue extent we will ever see.
Phew.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.
@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.
- Only the files in /dev/shm (model.status, model.log) keep the model in error state. Once removed, the scheduler will try again the next time it runs.
- You can edit things, fix things, and then delete the error status files, followed by pushing (echo push nico1 >/dev/tcp/10.28.1.1/16713); a small sketch of this follows below the list.
- You could move the original download away and replace it by the model subdirectory, in which case the scheduler would try to pick it up from there.
- I will eventually provide you with better tools, though... Long term, it could make sense to move everything to nico1 (and either have containers everywhere, or simply give you a user account - I planned for these eventualities many months ago by making the default umask 0 :)
- If things go wrong, you can do the next step manually, e.g. you could somehow provide the .gguf file, and when the scheduler runs and the error state is cleared, it would simply pick up from there.
- there is very little state that is not externalised, e.g. the scheduler distinguishes a partial download from a successful download by the existence of the model.hfd-success file. There are also .override, .interrupt, .nobudget and .force files. You can stop a model by creating a model.override, make it ignore the budget, or simply force-start it.
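(Putting the first two points together, a minimal sketch of the "clear error state, then push" flow; the model name is a placeholder and anything not mentioned in the list above is an assumption:)
import os, socket

model = "SomeBrokenModel"   # placeholder

# removing the status/log files in /dev/shm clears the error state
for suffix in ("status", "log"):
    try:
        os.remove(f"/dev/shm/{model}.{suffix}")
    except FileNotFoundError:
        pass

# equivalent of: echo push nico1 >/dev/tcp/10.28.1.1/16713
with socket.create_connection(("10.28.1.1", 16713)) as s:
    s.sendall(b"push nico1\n")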
I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.
It seems you are quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible. And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)
I was waiting for this, too. But even more stupid is then to remove support for these quants. If the plan was to desupport it, it should not have been added in the first place. Sigh.
Adding them as separate quants was a mistake. In hindsight, online conversion should have been the way to implement this from the beginning. What started with a few ARM quants got out of hand quickly, and soon we would likely have had dozens of Q4_N_M quants optimized for different architectures, so switching to online conversion was the only reasonable thing for them to do. Now that there is online conversion, supporting existing Q4_N_M quants is useless, as llama.cpp can now just write data to memory in an optimized way while loading the model.
Hmm.. why have the option in the first place then (for imatrix computations). Weird.
It's probably because imatrix computation reuses the same code as other llama.cpp components and so offers similar configurations even if some of them don't really make sense for imatrix computation.
I don't, but maybe something was published on the training material, or its area of expertise.
According to their advertisement everything should be public, but I'm having trouble locating anything useful. They have everything spread across random blog articles and papers, and this massive fragmentation makes finding anything too time consuming.
For example, if it lists support for 22 languages, maybe we need some of these languages.
I already tried multiple multilingual imatrix datasets without any success.
Also, the more datasets you try, the more I am convinced that it might indeed be unused, in some way, and the way to go would be to force writing out the incomplete measurements. In fact, I think that might be the way to go for lots of moe's which have this problem. There must be some neutral values that will cause the quantisation error to be not worse than without an imatrix. And even if this destroys part of these tensors, we might not even care, as our own imatrix training data will already destroy or degrade parts of tensors that are not exercised.
I already tried around 10 MB worth of datasets, so yes, it might indeed be unlikely that any reasonable prompt will activate that expert. It is likely something super niche, like an enterprise programming language such as COBOL or Erlang, as it is an enterprise-focused model.
I'd have looked at patching it out, I just absolutely hate to deviate from upstream sources :) But sooner or later, I feel I will have to.
Maybe that really is the way to go and it would also solve this issue with other MoE models. What I already tried is forcing the router to use all experts using --override-kv llama.expert_used_count=int:128 but it unfortunately had no effect for imatrix computation.
I think I crashed nico1 with DeepSeek. It's 471GB, which is way below the 480GB limit that's currently in place (maybe 10GB were offloaded). It did survive a few iterations, during which I managed to stop the quantisations that were running and/or frozen. There was another imatrix calculation running, but that one finished fine. I don't think this should have happened - probably the 480GB limit is too high for quantisation without you being aware.
Instead of a hardcoded limit, check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit. There was a VM running using 24 GB of memory at the time, and maybe some other things.
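(A minimal sketch of such a check; the path is the one mentioned above, the model size and headroom numbers are assumptions:)
def mem_available_gib(meminfo="/host/proc/meminfo"):
    # MemAvailable is the kernel's estimate of memory usable without swapping
    with open(meminfo) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2   # value is reported in kB
    return None

model_gib = 471      # size of the GGUF that needs to be locked into RAM
headroom_gib = 20    # rough allowance for context and other processes (assumption)

avail = mem_available_gib()
if avail is not None and avail < model_gib + headroom_gib:
    print(f"only {avail:.0f} GiB available - the {model_gib} GiB model will not fit, alerting instead of starting")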
you should really consider some swap to recover from light oom conditions. maybe. possibly.... worth a try?
Yes, I will likely look into it, but it's quite a pain with ZFS. I really hate swap, but the current behavior of just rebooting on OOM also isn't ideal. I wonder what happened to the OOM reaper that always prevented OOM crashes in the past.
PPPS: turns out there was a 400GB-limited mmap/mlock active on the gguf file, although it should have been successfully locked before llama-imatrix was started.
mlock would explain the crash.
interesting, wireguard config says destination port 7103, but it used another port (51832). maybe it switched because rich1 sent packets from that port. but why would it not recover... a mystery.
No idea why this happened as well.
KaraKaraWitch is now publishing all repos without initial testing (so they are not private). Also an unintended consequence of the policy change.
100 GB should be enough to test models privately. He likely got way more than 100 GB, as you got CurrentPrivateStorageUsed + 100 GB when they introduced this limit. Besides going Pro he could also email them to request more private storage for testing, which they will most likely accept as a valid reason under their new private storage grant program. I wonder why he is not testing them before uploading. It seems quite wasteful to upload models you have not even tested. The machine you use to train a model should also be able to run it as far as I'm aware, unless it is a merge.
I like the new policy, as closed models are almost always meant for commercial use and so used by operations that really should pay for HuggingFace. They have to make money somehow and enterprise customers make the most sense in my opinion.
By the way, when I researched HuggingFace's finances it seems like the vast majority of their earnings comes from consulting services. I luckily work for a company where we don't waste money hiring consultants.
In other statistics, of the 30000 models I queued for looking at for my second walkthrough, 2000 are left, so that's the maximum number of models I can queue (and I guess it will end up being 200 more, before I add a few more months).
Awesome to hear!
That is very surprising, because I am only in June. So there was an explosion of models at the beginning of the year, and a serious slowdown now. (Or somehow my script is buggy, always a possibility)
Your observation is likely correct. There was a lot more activity back then. For example take a look at https://huggingface.co/cognitivecomputations which created a Dolphin version of every good AI base model. Most of them are from early 2024.
and more unimportant FYI: i have added an LD_PRELOAD wrapper around uploads that simply wraps each read in an alarm(90); read(); alarm(0). Hopefully this hack will fix the stuck uploads.
Nice. This seems like a good workaround. Let's hope this fixes this issue.
And in news that will doubtlessly fill you with contented happiness, I am through with my second queuing run (february to end of august). The last months in that range were indeed pretty much empty. Very weird.
Today we reached a queue size of over 4000 so I'm really happy it will now finally go down from here. Especially now that we lose 4 hosts in one day.
I plan to look at the post-august months, and at the time before february. I expect the former range to yield few models, and I plan to be much more selective with the pre february range, so I think this is the likely maximum queue extent we will ever see.
Post-august you already had nico1 so there should be way less and as observed there generally are way less models recently. Before February would likely be insane but we can be way more selective.
Hmm https://huggingface.co/WeMake/VX-Unholy-13B says it is gated and I have requested access (probably many moons ago), but my gated repo request page has no entry for it.
In not that good news, I seem to have lost the ability to ungate repositories completely. When I try a click-through gated repo, I simply don't get access, and the list of gated repos in my account settings is empty except for one collection.
Sounds like a strange HuggingFace bug. Maybe they never anticipated someone ungating so many models. For easy models you can always ask me or Richard to ungate them, and for hard ones we always have Guilherme34.
@nicoboss some, uh, hints on what you can do in case of a broken model you have access to.
Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future, as less coordination will be required.
I'm debating whether to make some kind of web interface, which would also allow other people to do things, but... I'm a command line person.
No worries. Using the command line is perfectly fine for me as I'm mainly a command line person as well. In a fraction of the time required to create a webpage we could likely create a nice command line application/shell script to automate all common manual tasks.
It seems you are quite willing to help, and I would be very grateful. Just queuing models while I am not available would be a great help, and I will gladly work on making all this possible.
I would love to help with this - mainly with queuing models requested by users, so they get their requests fulfilled faster if you are unavailable and you don't have to care about this when you are busy. In that case it should also not matter if I'm ever too busy to help, as any time I can spend on this will be an improvement over the current situation.
Should the queue ever get empty I will queue some historical models I feel are important and then maybe do some model authors I like, but I would likely run out of ideas at some point. I don't think I would have the dedication to go through and judge 30000 models to select the best ones. Your work on selecting models is highly appreciated.
And if you don't find the time to help more than occasionally, that's fine, too. Not wanting to pressure you :)
No worries, I like to help getting interesting models to work. My time is always limited, so I can't look into every single model that failed; I focus on interesting models and the ones requested by users.
Now that there is online conversion, supporting existing Q4_N_M quants is useless
Well, not for those already downloaded... In any case, yes, I agree that, if it's a maintenance burden, it should indeed just go.
Instead of a hardcoded limit, check how much memory is used by the host using /host/proc/meminfo and inform me if it won't fit.
That's... I guess you'll have to tell me what you would want me to look for and/or calculate. It is, however, notoriously difficult to check this beforehand, so likely this would just make the imatrix job fail (i.e. the imatrix job would check). That's not super-bad, as that is already happening for special models.
KaraKaraWitch is now publishing all repos
Well, it's another psychological effect. The alternative would be to gate the models, I guess, and keep them public, until they are tested.
Thank you so much for the useful information. I highly appreciate it. This should make it much easier for me to fix models in the future, as less coordination will be required.
Yes, and don't be shy, even if I was a bit, ehe, cranky recently. You are interfering a great deal, and it's predominantly very useful :)
No worries. Using the command line is perfectly fine for me as I’m mainly a command line person as well
I was thinking less of you, and more of others. But, yeah, command line is the way to go at first.
You have been rate-limited; you can retry this action in about 24 hours. If you're a new user, your limits will raise progressively over time. Get in touch with us at [email protected] if you need access now.
Small models are too much for huggingface. I'll mail them.
Small models are too much for huggingface. I'll mail them.
Oh no, maybe it was not so intelligent after all to do all the large models first. We are using many nodes, and such limits are usually either token or IP based but rarely user based, so this should not be an issue. If it's an upload limit, try giving each host a separate upload token. If it's a download limit, then maybe we are always using Guilherme34's token and so exceed that token's rate limit, in which case either download anonymously or use a dedicated token by default. If this issue only occurs on rich1 then maybe it is because we just started some face picture data collection project there. In the end there really could be a per-user limit, in which case we have to email them or work around the limitation.
Have you realized that mradermacher is now part of the https://huggingface.co/TopContributors-ModelDownloads organization? That is so cool!
-2000 14 Falcon3-Mamba-7B-Instruct run/imatrix (GPU-2d) 101/64 22.58s/c 42.5/126.0m(127.0) [112/335] 6.9410
Something tells me that a 7b should not take 2h for imatrix quantization.
Oh no, maybe it was not so intelligent after all to do all the large models first.
We did not do all the large models first, we only preferentially did them. rain/kaos/back etc, all did small models the whole time. So if we did it more "intelligently", we would just have hit it earlier when rich/marco/nico would hit small models randomly.
The limit is on repository creation btw. I can try to optimize it, but I will ask for an exception first. The issue started on rich1, but has now affected everything. I suspect it was simply the speed of quantizing small static models.
Have you realized that mradermacher is now part of the https://huggingface.co/TopContributors-ModelDownloads organization? That is so cool!
Hmm, I thought I had it mentioned already, but what happened is that whenever I clicked on "Quantizations" on a model page, I got a blocking invitation page asking me to either join or refuse to join that organisation. Normally I ignore those things until I have made up my mind (not sure I want to be part of any such organisation :) but since I was forced, as my access to the webpage was limited, I hit accept.
BTW, it also made all uploads fail, which I have to clean up manually. At least it didn't hammer their servers. And since I am rather limited w.r.t. email (my private mail server is the one currently being restored), I had to ask marco to contact the website team :)
Actually, what's the company mail server queuing... only 3130 mails it can't deliver to me. Sigh.
And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.
And since as far as I can see, all uploads are failing, I will pause nico1, rich1 and marco until this is sorted out.
Try using different upload tokens on each host. Even those limits are probably applied on a per upload token level according to @RichardErkhov . It’s at least worth a try.
@RichardErkhov already reached upload limits, api limits, inference limits, file list limits, repo creation limits, request limits, gated request limits, file size limits, file count limits and space size limits, so he should be an expert when it comes to limits. Almost all limits he encountered were on a per-token basis. He has since been using 3 different upload tokens to no longer hit a single limit.
I've changed the upload to retry after 10 minutes when it happens, so only newly started jobs will fail (which are far easier to clean up). I'll wait for a while to see if hf increases the limit - they did it before (when it was so low that simply me clicking around on the webserver triggered it regularly). Depending on the situation I will try different hf tokens per host. However, I am pretty sure a single host can already trigger it, so I might need more tokens per host.
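Not the actual uploader code, of course - just a rough sketch of the retry-after-10-minutes idea, wrapped around whatever command performs the upload (repo and file names are placeholders, huggingface-cli only stands in for the real upload step):
REPO="mradermacher/SomeModel-GGUF"   # placeholder
FILE="SomeModel.Q4_K_M.gguf"         # placeholder
# keep retrying every 10 minutes until the upload goes through
until huggingface-cli upload "$REPO" "$FILE" "$FILE"; do
    echo "upload failed (probably rate limited), retrying in 10 minutes" >&2
    sleep 600
done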
However, I am pretty sure a single host can already trigger it, so I might need more tokens per host.
That's exactly what @RichardErkhov is doing. 1 token per host is usually enough for him, but if not he uses a separate token for every python instance.
For now I would just do different tokens for each host as this should be simple to implement and a single host triggering the limit is relatively unlikely.
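A minimal sketch of the per-host token idea, assuming each host stores its own write token in a file and the tooling picks it up via the HF_TOKEN environment variable (the file path and token names are made up):
# one write token per host, e.g. created as "rich1-upload" on huggingface.co
export HF_TOKEN="$(cat /etc/hf-token-$(hostname))"
# huggingface-cli and huggingface_hub then use this token instead of the shared one
huggingface-cli whoami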
Nope, doesn't work, leia just triggered it, and leia already uses a separate hf token.
I will try to reduce the load on the api and check in another way for successful repo creation, later tonight when I have time. Thanks, hf.
requests.exceptions.HTTPError: Invalid user token. If you didn't pass a user token, make sure you are properly logged in by executing huggingface-cli login, and if you did pass a user token, double-check it's correct.
I am now completely blocked, it seems. I can't do anything whatsoever.
Creating new tokens does not have any effect, but from time to time, an upload goes through. It seems I am severely rate limited, and I feel this is not due to the increased upload frequency. It feels like some new limit. Also, huggingface-cli upload uses the limited create_repo API, so me reducing calls to it will have very little effect.
I guess only hf can do something about it, and if they don't in a few days, we have to shut down.
Things are slowly starting to upload again. I guess we'll find out more this evening.
Nope, the rate limit is still there, and severe. I'll try some tricks and hope there won't be a disaster tonight.
Reducing the amount of API calls to the bare minimum seems to be the only solution for now, so try every trick possible. As far as I'm aware every commit is an API call, so maybe we should batch together some files for small models. Also make sure downloads don't use any mradermacher token.
The rate limit doesn't seem that severe. All uploads seem to eventually make it through. Commits to 15 models were successfully made in the past hour and I see the same outgoing network traffic on nico1 as on any normal day:
The rate limit doesn't seem that severe.
I have already almost halved the number of api calls yesterday and implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it, but I think it is pretty severe, especially as we had similar rates earlier, when we did the first batch of small models (the nice 800 ones), so this definitely looks like something that has been changed since then.
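I don't know how the batching is implemented internally; as an illustration only, the effect is roughly what you get by pushing a whole quant directory as a single commit instead of one upload call (with its create_repo preflight) per file - repo name and flags here are assumptions:
# one commit for all quants of a small model instead of one per file
huggingface-cli upload mradermacher/SomeSmallModel-GGUF ./SomeSmallModel-GGUF . \
  --commit-message "batched upload of all static quants"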
Ok, the first batched uploads are through, and nothing seems to have imploded. I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.
I have already almost halved the number of api calls yesterday
Oh wow wasn't aware of that. It's quite insane we are still hitting the limit despite those changes and decommissioning db1, db2, db3 and backup1.
implemented batching of uploads of small (<=10B) models. The latter hasn't taken effect yet, and I assume that will fix it
I think and hope so as well.
so this definitely looks like something that has been changed since then.
Yes, this definitely seems like a new limit or @RichardErkhov would have known about it. He already managed to exceed almost every rate limit on HuggingFace possible.
Ok, the first batched uploads are through, and nothing seems to have imploded.
Awesome to hear. Let's hope everything continues to go well.
I hate making hurried changes like this, especially when I am not there to watch things potentially implode. But I have to sleep now. Let's hope for the best.
Definitely not an ideal situation but better than hitting the rate limit. Everything looks good to me for the repositories I checked. I will be here and watch things but I'm quite confident nothing bad will happen. Have a great night!
Maybe things are not going so great after all. "auto-patch README.md" is going a bit crazy and is removing references to existing static quants on some long-completed models:
- https://huggingface.co/mradermacher/Nemotron-4-340B-Instruct-hf-i1-GGUF/commit/f4bc99d59dcd92f65c681dfc50bd6a757435f300
- https://huggingface.co/mradermacher/Hermes-3-Llama-3.1-70B-lorablated-i1-GGUF/commit/baab2edf5a54d43d775b368ff065be2d063c1da4
- https://huggingface.co/mradermacher/SILMA-9B-Instruct-v1.0-i1-GGUF/commit/d8fbfa1fe718e78034b552dbe4318482a9ace9e7
It does the same to static quants, where it removes references to imatrix quants:
I assume this is caused by poor error handling inside the "auto-patch README.md" job where it assumes an API rate limit status code means the model doesn't exist. Also scanning every model ever uploaded is not so great of an idea if we are concerned about API rate limits.
Interestingly, it now started to fix things it previously broke:
@RichardErkhov would have known about it. He already managed to exceed almost every rate limit on HuggingFace possible.
Haha, he often reminds me of younger me. The real test is when we hit other large blocks of static-only jobs again (today it mostly did much slower imatrix jobs).
I assume the amount of create_repo calls has gone down by a factor of about 5.
I assume this is caused by poor error handling inside the "auto-patch README.md" job where it assumes an API rate limit status code means the model doesn't exist.
Good idea, but unfortunately, it checks for that either by downloading the README.md without the API (original model) or by using the list of all mradermacher models (for finding other quant repos). I'll have to look at it. As long as the original url is still intact, it will be fixable.
Also scanning every model ever uploaded is not so great of an idea if we are concerned about API rate limits.
I'm not doing that on every change, fortunately, that's a background job that has essentially a fixed rate limit (more models == fewer iterations per time). The API affected seems to be only repo creation (which is called once or twice per job, and was called twice per upload).
I'll have a look into the problem, thanks for catching those, which is a job well done :)
Interestingly, it now started to fix things it previously broke:
Fascinating, so, obviously intermittent errors of some kind. It runs on each repo after each job finishes, and tries to scan through all repos separately every 2 days at the moment. What you see is likely the background job that fixes the links to original urls and so on.
Hmm, not good, all those wrongly updated model pages are not in the llmjob log, so it must have been done by the background job. Unfortunately, that one really uses the list_models api call to get a list of all repos once, and then just checks if the static/imatrix repo exists, while the foreground job doesn't use the api but does a GET on the actual model (html) page to see if the model exists.
Unfortunately, I indeed key the latter on status 200, because you get all kinds of status codes when the model doesn't exist (404, 401...), so it's quite hard to know when it temporarily failed. I guess we'll have to live with this at the moment, unless I want to add more api calls for this.
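For what it's worth, a hedged sketch of how the existence check could distinguish "really gone" from "temporarily failed" without extra API calls, just by looking at the status code of the same GET (the status handling below is my assumption, not what the patcher currently does):
MODEL="nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored"   # placeholder
status=$(curl -s -o /dev/null -w '%{http_code}' "https://huggingface.co/$MODEL")
case "$status" in
  200)     echo "model page exists" ;;
  401|404) echo "model gone or gated - safe to drop the link" ;;
  429)     echo "rate limited - leave the README untouched this round" ;;
  *)       echo "unexpected status $status - treat as temporary" ;;
esac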
I think repo creation has an especially low(*) api limit, and whoever did that was probably not aware of every upload calling this endpoint (in fact, I was not aware - maybe it is a recent addition - because I manually create the repo on upload, as hf-upload would otherwise fail).
*: comparatively speaking :)
Unfortunately, that one really uses the list_models api call to get a list of all repos once
That is where I was wrong: it should have done it, but due to heavy refactoring, it failed, so this explains it. The foreground job can still fail to correctly patch it, but the background job should get that part right. And if the original model page is missing occasionally, that shouldn't cause a diff. Famous last words.
The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).
Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.
Even with the changes we still run into rate limits. This is brutal. Clearly a sign from hf that they don't appreciate us.
Just because some random employee set a limit too tight doesn't mean they don't appreciate us. Someone likely just thought that limiting repository creation to 100 per hour makes sense, as nobody could reasonably exceed that, not realizing that the same API call is made for every commit.
The only thing that disappointed me was the complete non-reaction of huggingface - I wrote a polite mail, and they didn't even bother with a negative reply. As far as I am concerned, hf is now the enemy (meaning, just another big corporation).
They are notorious for being slow. @RichardErkhov successfully contacted them in the past regarding an API issue by creating an issue on their GitHub but they have not yet fixed it after almost 3 months despite confirming the issue: https://github.com/huggingface/huggingface_hub/issues/2581
Especially now that most of them are likely already on Christmas holiday, I'm really not surprised information like this is not reaching the right persons. Even in my much smaller company, bug reports often get lost somewhere in middle management. I recommend you create an issue on their huggingface_hub GitHub instead, where you are much more likely to reach someone capable of fixing this issue.
But honestly things don't seem that bad. Despite all these API rate limits it does not seem to affect our throughput, so maybe we can just live with it. It seems unlikely that us having such a massive queue of only small models will ever happen again. Speaking of queue size, I'm currently very satisfied with the progress and we already got it down from over 4K to below 3.5K in just a few days.
Just because some random employee set a limit too tight doesn't mean they don't appreciate us.
No, but I contacted them three days ago, and they didn't even bother to reply. I judge by actions.
They are notorious for being slow. @RichardErkhov successfully contacted
Your example shows a reaction time of less than a day, though, so clearly they can if they want to.
I recommend you create an issue on their huggingface_hub
I am not going to create potential drama somewhere - they asked me to use e-mail, and I used e-mail. If somebody wants to do that, that is fine, but, again, I went through the official channels for this, I don't want any special service.
But honestly things don't seem that bad. Despite all these API rate limits it does not seem to affect our throughput, so maybe we can just live with it.
I can of course live with this, but it obviously affects our throughput. An hour ago, no quanting was done, and right now, four nodes are still not doing anything much.
Nico, I feel you are panicking a bit because I sound so negative - don't worry, I am not trying to declare war on hf or giving up, I am merely adjusting their way too good reputation in my head. And I have learned to judge companies by their actions, not by the goodwill of fans. Or should have learned :) This is an attitude correction for me, not a disaster.
Addendum: you can probably tell by now that I am a staunch anti-neoliberalist and work for a tiny, very personal company for a reason :) Don't worry, I am also a realist :)
@mradermacher The status page (http://hf.tst.eu/status.html) has been frozen since 2024-12-20 16:05:00+0100 and both nico1 and rich1 are idle. There no longer seem to be any models being uploaded, so I assume something critical broke and I don't think there is anything I can do to fix it.
I checked the kernel log on StormPeak and the time it broke seems to roughly align with the time my RTX 3080 GPU crashed, but that GPU is not used by nico1 as only the RTX 4090 GPUs are assigned to your LXC container, so it should not be related:
Dec 20 15:55:19 StormPeak kernel: NVRM: GPU at PCI:0000:c1:00: GPU-c8fe94f9-541b-e16b-da0f-b8d38ea5283e
Dec 20 15:55:19 StormPeak kernel: NVRM: Xid (PCI:0000:c1:00): 62, pid='<unknown>', name=<unknown>, 2027f626 2027f426 2027fcf4 20288f2a 20288e30 2021b5b8>
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x62:0x55:2477)
Dec 20 15:55:24 StormPeak kernel: NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 0
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nv_open_q:2903 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nv_open_q state:D stack:0 pid:2903 tgid:2903 ppid:2 flags:0x00004000
(...)
Dec 20 15:58:48 StormPeak kernel: INFO: task nvidia-smi:2356875 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:nvidia-smi state:D stack:0 pid:2356875 tgid:2356875 ppid:2341557 flags:0x00004006
(...)
Dec 20 16:00:50 StormPeak kernel: INFO: task nv_queue:2901 blocked for more than 245 seconds.
Dec 20 16:00:50 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 16:00:50 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 16:00:50 StormPeak kernel: task:nv_queue state:D stack:0 pid:2901 tgid:2901 ppid:2 flags:0x0000400
After more carefully reviewing the kernel log it indeed seems that nico1 somehow got affected by the issue with the RTX 3080 GPU:
Dec 20 15:58:48 StormPeak kernel: INFO: task llama-quantize:2364235 blocked for more than 122 seconds.
Dec 20 15:58:48 StormPeak kernel: Tainted: P O 6.8.12-5-pve #1
Dec 20 15:58:48 StormPeak kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 15:58:48 StormPeak kernel: task:llama-quantize state:D stack:0 pid:2364235 tgid:2364235 ppid:1469293 flags:0x0000000
llama-quantize should not use any GPU and the faulty GPU is not even attached to your LXC container, so it's really strange this happened. There are tasks running, so I'm not sure if the system is in a state where it can tolerate a reboot of nico1, but it currently is not working at all so it likely can't get any worse. It would be really interesting to know how a stuck quantize task on nico1 brought the entire system to a halt.
I disconnected nico1 from the internet but still kept it running. Let's see if that is enough for the system to fix itself. All other hosts should now detect nico1 as offline and hopefully manage to recover.
It didn't help. I will reboot StormPeak now but unlikely that fixes anything as even without nico1 the system didn't recover.
I rebooted StormPeak which fixed the RTX 3080 issue and started nico1 again but as expected this unfortunately didn't fix whatever issue brought the entire system to a halt.
Good morning. I don't know what happened. A llama-quantize should hang the job only, but maybe something else also went wrong. The connection timeout (once established) is currently 3600 seconds, but that either didn't trigger or it somehow spanned multiple runs of the scheduler. rich1 is also gone at the moment, which might play a role as well.
I also disabled the local scheduler a week or so ago because there is some weird bug where static jobs finish successfully within 10 seconds without doing anything, meaning static quants are not generated at all, so that didn't help either.
Obviously, there is a bug somewhere.
Since I am still not in such great shape, I opted to kill all processes holding locks and this got it going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue ...
Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)
In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant, I practically couldn't do anything.
And if we don't talk to each other much, merry christmas and a happy new year :)
I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.
It seems to work. That means we will soon be down to exactly one create repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we re-gain the ability to upload quants separately, and without being rate-limited by those calls.
(D'oh, forgot the README patcher)
Not a single rate limit hit tonight, seems it worked, and it was the total calls to create repo, whether successful or not, but fortunately not the rate-limited calls.
On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.
Maybe finally I can find some time to link the download page before it becomes obsolete.
gee, found another locking bug that kept jobs from being started all night.
Since I am still not in such great shape, I opted to kill all processes holding locks and this got it going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue ...
gee, found another locking bug that kept jobs from being started all night.
Awesome to hear that you were able to find and fix another locking bug. I can only imagine how complex maintaining this entire system must be. I wrote a distributed system for the satellite project I'm doing together with Richard, where we have around 30 concurrent workers often only staying for a few hours, and there were so many edge cases to consider.
Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)
Richard would for sure appreciate it if you can keep fully utilizing his server and don't run out of models for him to quant. If the queue gets too small you can maybe make it so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first, which to my knowledge are the only servers where someone has to pay for electricity.
Just so you know, we are currently also using the same server that hosts rich1 for a satellite project worker, so when we had that rich1 LXC outage we just scaled up satellite to use all resources and downscaled it again once your LXC container was fixed. I'm sure Richard will always find some other temporary use for this server should the queue ever run dry. I also have quite close contact with him so don't worry about it.
In other news, my main home server (that I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant, I practically couldn't do anything.
I'm so glad to hear that. This for sure must have been a really bad time for you.
I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.
That sounds like a great idea.
It seems to work. That means we will soon be down to exactly one create repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we re-gain the ability to upload quants separately, and without being rate-limited by those calls.
Not a single rate limit hit tonight, seems it worked, and it was the total calls to create repo, whether successful or not, but fortunately not the rate-limited calls.
On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.
Wow thanks a lot! This is awesome. I'm so happy we managed to find a workaround to avoid the rate limit.
Maybe finally I can find some time to link the download page before it becomes obsolete.
It would be really cool if you could do so. I love your download page! It would be great if you can show me an example before you do all of them, as this might be the last time we change all the model cards, so it needs to be as good as possible. Something else I noticed is that sometimes our quants appear as "Finetunes" instead of "Quantizations" in the parent model, as can be seen in https://huggingface.co/models?other=base_model:finetune:nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored - maybe this can be fixed as well when we have to update all model cards anyway.
And if we don't talk to each other much, merry christmas and a happy new year :)
I wish you a happy new year as well!
I can only imagine how complex maintaining this entire system must be.
The problem is that code is constantly added and algorithms changed while the system is running :-)
[download page] It would be great if you can show me an example before
I hope I can do it incrementally, e.g. for new models only at first. But yeah, I'll try to ask for your opinion. If you wish, you can even make a suggestion - I want to use some custom css to make a small box with the link only, and some very short explanation, such as "Compare static/imatrix quants, download and search on our... [[[Model summary page for this model]]]" or so. Suggestions or even outright examples are welcome :*)
so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first
The problem is in the specifics. Unless requested, models get queued in batches, and then we have two choices: leave a model in the queue, or queue it somewhere. In the latter case, we can choose where to queue.
If rich1 simply has priority, it would simply accept all models till the budget is full or the queue size limit is reached, neither of which is likely for daily batches, and also not desirable. At the moment, it is kind of distributed by speed, as nodes do static quants first, so faster nodes gobble up more jobs.
We need some kind of back pressure/weighting. And something like this is implemented (differently for models with nice <= 50), but it wouldn't be able to avoid scheduling on nico1 or marco. The current scheduling restrictions on nico1 are nice, because they mostly answer the question at night, and I will put a different scheduling restriction on marco (basically take it out completely once our queue is usually empty).
The only solution, I am afraid, is to essentially block nico1 completely (other than imatrix generation). And that might be doable, after all, we did this for many months. Or we only manually schedule jobs on nico1. Or only bigger jobs, which would be delayed on the much slower rich1 (which also might conceivably be busy with other jobs, as it is effectively a shared server). Something like that. Share your thoughts :)
gpus@nico1
As an unrelated side note, for a few days now, I was using only one graphics card on purpose, except when I was in trouble (because of scheduling or downtime issues unrelated to the actual model workload), and at the moment, one gfx card is sufficient.
I really do plan to queue a few more months' worth before the queue runs dry, though.
Update: Yeah, I think that's it - disable automatic quanting on nico1 except maybe for requested models (<= -1000), hand-queued models and very big models.
peculiar: we have been rate-limited again. pretty sure our repo creation rate was very average (especially as nico is paused).
more peculiar: even though our rate is way lower, the wait time (once rate limited) is much higher.
i hope they didn't further restrict uploads, or repo creations :/
I saw that yesterday, rich1 was pretty idle, we even decided to finish off satellite by doubling the processing power because rich1 otherwise was completely idle ... What is going on ? Did huggingface answer anything in the email ??
hf completely ignored my mail, afaics. it's quite strange, every time i reduced repo creation rate api calls, it worked for a few days, then -> new rate limit. or, alternatively, the rate limit is weirdly implemented. right now, I think we are at the theoretical minimum rate (one repo creation request per actually created repo).
it's also possible that the rate limit is not strictly implemented as a per-account rate limit. maybe it's just not reliable, just like anything else they implemented :)
I should try contacting them lol. What should I write haha? Im not the best at email writing, so would appreciate if you could draft it =)
or I can try contacting them elsewhere, where I have contact with them
i hope they didn't further restrict uploads, or repo creations :/
I don't think it changed since it got introduced. They for sure wouldn't introduce such changes during the Christmas/new year holiday period where most of their developers are on holiday.
especially as nico is paused
When I paused nico1 today for the performance measurement project I got the following error, but it all seems to work despite this:
./nico1-pause: line 19: /root/s2/llmjob: No such file or directory
I checked and was able to confirm that the entire "s2" folder is missing. The only thing that didn't work was unfreezing and completing the frozen task, but that's not important as I don't intend to reboot this time. Let's just hope they don't automatically start as long as nico1 is paused.
140+ 14 CosmicNoodle-7B blocked/imatrix/gpu
Any idea what this means? I saw similar blocked statuses for the entire day before I paused nico1.
I checked and was able to confirm that the entire "s2" folder is missing.
Right, everything is now in /llmjob, rather than splattered over the system. I forgot to update the script(s). Will update them.
All you missed out on was resuming the frozen/Stopped quantize jobs, so they didn't interrupt and didn't exit.
140+ 14 CosmicNoodle-7B blocked/imatrix/gpu
The status of jobs does not update when paused, so this is simply the last status update. I think :) If it does not clear up when resumed, I will have to have a look.
It might also be that the job has failed somehow, but didn't have an exit status. In that case, the job scheduler doesn't know what to do and just ignores it. (Well, it might actually block a gpu in that case, but that isn't the case here).
nico1 is now unpaused.
-2000 360 si falcon-180B
-2000 236 si goliath-120b
Nice, I see you queued falcon-180B and goliath-120b. I hope you are not just adding the missing static quants but will also requantize the already existing imatrix quants. I definitely want to give falcon-180B another try. I remember how excited I was when it released, as it was the biggest openly released LLM at that time, but then the model turned out to be quite underwhelming. Maybe with modern CoT prompting techniques and better system prompts this almost forgotten base model can be of use. While finetunes are nice, in the end base models contain the knowledge I seek to extract and so are of much greater value.
Edit: Seems like it is requesting the existing imatrix quants. How awesome!
-999 205 I Llama-3-Motif-102B error/134 12/24,IQ1_M [691/867]
What a strange error - not something I've ever seen before, but you might be familiar with it. So strange that all the other quants worked so far.
[ 691/ 867] blk.76.attn_q.weight - [ 9216, 9216, 1, 1], type = f16, converting to iq1_m .. /root/cvs/llama.cpp-cuda512/ggml/src/ggml-quants.c:4453: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
Nice, I see you queued falcon-180B and goliath-120b. I hope you are not just adding the missing static quants but will also requantize the already existing imatrix
I was missing the static quants only (and incidentally, any missing imatrix ones). I was also so disappointed in falcon-180b. Anyway, I'll redo the imatrix ones too, then.
error/134
That is the process exit code, in this case ABRT (134 = 128 + 6, SIGABRT): ggml-quants.c:4453: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
I will just put it here in case you didn't notice it =) @mradermacher
I should try contacting them lol. What should I write haha? Im not the best at email writing, so would appreciate if you could draft it =)
or I can try contacting them elsewhere, where I have contact with them
I didn't notice, indeed. Hmm.... I'm in a bit of a cinch here - I normally don't want to be in a position to ask for special treatment, but obviously, I am very specially treated by hf already. And sometimes it might be better not to wake up sleeping tigers.
So... I mailed them, nicely, and they didn't consider it. At the moment, it is fine most of the time, and annoying some of the time. And we are creating repos faster than normal, due to going through all the small ones. So maybe it's best to not push them further and delay mailing them until we really run into a painful rate limit.
It might be an issue for you, if you start quickly quantizing all the, uhm, remaining models (yes, I haven't forgotten about the list :)
alright then. when it bothers you too much, I guess just send me a text for a message, I will try to do something with it. I guess it will be when we start quanting "remaining models" haha
Yeah, but that will hopefully be your problem :)
when will I start quanting lol ? In 2026 haha ? When will it be my problem ? Maybe I should just send a message now to see with them about it? Or should I pursue other projects while waiting for your part to be done ?
I wanted to provide it much earlier, but too much other stuff came in between that I... couldn't preempt. Turns out the problem is a bit harder than I thought, too, but I have most of the filtering stuff in place.
well I guess I will eventually get it haha, well good luck with anything you have =)
Thanks for your understanding :) I'll try to provide it before rich1 runs dry(er)
@mradermacher
The RPC setup is ready for DeepSeek-V3, DeepSeek-V3-Base, Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha. We should do DeepSeek-V3/DeepSeek-V3-Base in 8-bit and Hermes-3-Llama-3.1-405B-Uncensored/Llama-3.1-405B-Samantha in 16-bit.
The servers are not primed yet and I have no idea if this is still required on the latest llama.cpp. To prime, just call llama-cli -m /tmp/model.gguf -p "Hi" -n 2 and the RPC arguments. Should priming still be required, we would ideally automate it.
Here are the RPC arguments to use: --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 -ngl 10000
Please make absolutely sure no imatrix or quantization tasks are triggered while an RPC task is running, or the entire host will crash due to OOM while GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set. Especially for the 405B models RAM will be extremely tight.
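If priming turns out to still be needed, the automated call would presumably just combine the command and RPC arguments quoted above (model path as in the example):
llama-cli -m /tmp/model.gguf -p "Hi" -n 2 \
  --rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204 \
  -ngl 10000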
To move the GPU to CastlePeak I had to reboot StormPeak. I stopped nico1, waited for the script to terminate, and then shut down the LXC container and host. Somehow this ungracefully killed the Gemma-2-Ataraxy-Gemmasutra-9B-slerp and Gemma-2-Ataraxy-v2a-9B imatrix computations, so please restart those.
I'll have a look when I get up again. I'll try without priming (and then complaining). As for automating it, I can basically just run llama-cli with the model, the rpc arguments and -n 2? That would be easily automatable.
Somehow this ungracefully killed
Yeah, the scheduler can't know what goes wrong when the connection fails. More concerning is the rsync that was transferring falcon-180b-chat was also hanging, and that one had a --timeout 600, which should have triggered :/
But it's easy enough to clean up, or rather, I usually have to clean up a failed job per day anyway. At least currently when we are in the most-junk phase of the archive queue.
We should do DeepSeek-V3/DeepSeek-V3-Base in 8-bit
Might be a good time to test the hf quant download code that exists, but has not been tested yet (it had to be rewritten for nico1). Did we ever get zero-copy concatenation to work on nico1? We'll probably find out...
Hmm, or maybe not.
Did we ever get zero-copy concatenation to work on nico1? We'll probably find out...
I used it yesterday to concatenate the parts of Hermes-3-Llama-3.1-405B-Samantha that at the time was uploading to HuggingFace because I forgot to hardlink. cat concatenation worked instantaneously. I was so impressed that I didn't have to wait 10 minutes for the data to copy like on ZFS. That was almost like magic. I assume the file system somehow created a new file based on all the blocks of the old file without copying anything.
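For reference, the whole trick is just plain cat; with a recent coreutils on btrfs/xfs it can go through copy_file_range, so the extents are shared instead of copied (file names here are illustrative):
# instantaneous on btrfs/xfs with new enough coreutils, a real copy otherwise
cat model.gguf.part1of2 model.gguf.part2of2 > model.gguf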
As for automating it, I can basically just run llama-cli with the model, the rpc arguments and -n 2? That would be easily automatable.
Yes, it just needs to do prompt processing for a token and generate 1 token, if they still have not fixed that issue. Awesome that it is not that hard to automate, because manual priming always requires so much coordination.
I'll have a look when I get up again.
Any idea when that will approximately be? I'm asking because I obviously need to have all my services, and the ones I provide to @RichardErkhov and @Guilherme34, turned off before we start with RPC imatrix computation. I already have everything turned off, but I might turn some services on again in the meantime if I know when you will start.
I have a minor request for the download page, could you show the raw perplexity values for a model. The other raw values (besides raw eval, which I don't see a use for) can be derived algebraically, but raw perplexity can only be derived from your numbers after running two perplexity calculations. It would be helpful to compare perplexity values with values I generate with a compressed KV cache, or a non-standard quant recipe using your imatrix.dat files (which I am very grateful for you providing).
I have a minor request for the download page, could you show the raw perplexity values for a model. The other raw values (besides raw eval, which I don't see a use for) can be derived algebraically, but raw perplexity can only be derived from your numbers after running two perplexity calculations. It would be helpful to compare perplexity values with values I generate with a compressed KV cache, or a non-standard quant recipe using your imatrix.dat files (which I am very grateful for you providing).
The quality values currently shown on the new download page (https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF) are meant to provide the user with the average quality of a specific quant and do not depend on the model shown. They are based on my measurements of 616 quants from the Qwen2.5 series of models. You can download the raw data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
We don't measure perplexity of every single model besides the perplexity value llama.cpp computes during imatrix computation. I'm not sure how useful providing that would be given that we use a proprietary imatrix training dataset. The model you download is never leaked to the server hosting the new download page. All dynamic content on it is generated using client-side JavaScript for privacy reasons, so I don't think it's the right place to provide any model-specific data. If there is a valid use-case for it we could consider adding the perplexity value computed during imatrix computation to the model card, or maybe upload the imatrix training log as a dedicated file for future models.
Regarding the llama.cpp version, I installed b4435 017cc5f on all RPC servers, which was and still is the latest release. I recommend you use the exact same version. I recommend against using latest develop 53ff6b9 as it majorly refactors the llama.cpp backends and I don't feel confident that this version is stable. I would prefer not spending another week redoing all the RPC imatrix quants because their refactoring turns out flawed. Latest develop currently seems so bad that even their automated release pipeline failed, which is why b4435 017cc5f is still the latest release at the time of writing.
Don't forget to compile llama.cpp without CUDA and with RPC support for the RPC setup to work.
Script to install it:
#!/bin/bash
# Build llama.cpp release b4435 with RPC support (CUDA stays off, which is the default)
rm -rf llama.cpp/
git clone --recursive https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
git checkout b4435
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release -j
Yeah, the scheduler can't know what goes wrong when the connection fails.
To my knowledge nico1-pause informs both the imatrix and quantizing scheduler, as they are then both marked as paused on the website, and it shouldn't be surprising for a paused host to lose connection, because preparation for a reboot is a very common reason for me to pause nico1.
More concerning is the rsync that was transferring falcon-180b-chat was also hanging, and that one had a --timeout 600, which should have triggered :/
That's strange. rclone with a timeout survived all the internet issues I had back in my coaxial days, and now a restart caused it to hang. That's indeed quite surprising.
But it's easy enough to clean up, or rather, I usually have to clean up a failed job per day anyway. At least currently when we are in the most-junk phase of the archive queue.
I know but it would be nice if it would be possible to reboot a paused host without causing unnecessary work for you.
They are based on my measurements of 616 quants from the Qwen2.5 series of models. You can download the raw data from http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
I see, it's based on that data (I've been meaning to augment it with custom quants and KV compression, haven't had a chance to do that yet).
The quality values currently shown on the new download page (https://hf.tst.eu/model#Qwen2.5-3B-i1-GGUF) are meant to provide the user with the average quality of a specific quant and do not depend on the model shown.
I don't think that is possible, since different model families behave very differently when it comes to quantization. It's also not really clear that these are estimates independent of the model, as it doesn't mention that and the numbers are very specific; the only way to tell is that if you look at multiple models you will notice that the numbers are the same.
A specific example of why this matters is that your metrics make static Q5_1 seem strictly worse than static Q5_0, except in ppl where static Q5_0 is marginally better but still not worth the increase in size, and evaluation which is very noisy. This is not the case for all models. For Gemma-2 and, to a lesser degree, Llama-3 and Mistral Nemo, static Q5_1 should perform better than static Q5_0. As you noted in Qwen 2.5, they both perform very similarly, and for Phi-3.5, static Q5_1 should perform worse than static Q5_0.
For weight quantization, estimating the behavior requires a lot of knowledge. You have to understand the model's architecture and llama.cpp quant strategies for each size. Additionally, you must consider how "dense" the information in the model is because the more tokens a model is trained with, the higher the quantization error for a given bpw quantization budget (this is very apparent when you compare Llama-2 with Llama-3).
Comparatively it is a lot more trivial to estimate the effects of KV cache quantization, you can get a decent estimate based on whether it uses MHA, MQA, GQA, and if GQA, the ratio, but there might be more to it, as I've seen people report much worse performance than I'd expect for certain models.
I don't think that is possible, since different model families behave very differently when it comes to quantization.
The quality values shown on the download page are meant to help the user choose the highest quality quant that can be run under given hardware constraints. I'm aware that there are quality differences between different families, and especially for static IQ3 quants those differences can be quite significant, but measuring every single quant of every single model is not feasible, so this is the best we can do with reasonable compute cost.
For weight quantization, estimating the behavior requires a lot of knowledge. You have to understand the model's architecture and llama.cpp quant strategies for each size. Additionally, you must consider how "dense" the information in the model is because the more tokens a model is trained with, the higher the quantization error for a given bpw quantization budget (this is very apparent when you compare Llama-2 with Llama-3).
Comparatively it is a lot more trivial to estimate the effects of KV cache quantization, you can get a decent estimate based on whether it uses MHA, MQA, GQA, and if GQA, the ratio, but there might be more to it, as I've seen people report much worse performance than I'd expect for certain models.
As you already figured out it all gets extremely complex, which is why our quality numbers are based on measurements instead of theory. It would be awesome if one could, based on the model architecture, tell for each quant how good it will be. I'm quite skeptical that something like this will ever be possible in a way that would allow us to provide accurate individualized quality numbers for the approximately half a million quants we have uploaded so far. In any case this might be a really interesting problem to solve for some very intelligent PhD students.
It's also not really clear that these are estimates independent of the model, as it doesn't mention that and the numbers are very specific; the only way to tell is that if you look at multiple models you will notice that the numbers are the same.
I guess it should be better labeled.
A specific example of why this matters is that your metrics make static Q5_1 seem strictly worse than static Q5_0, except in ppl where static Q5_0 is marginally better but still not worth the increase in size, and evaluation which is very noisy.
Our quality scale is based on mean correct token probability and not perplexity. We determined that correct token probability better matches with what a casual user perceives as quality than perplexity. I personally really don't like perplexity numbers. At least use KL-divergence numbers instead.
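For readers comparing the two metrics: under the standard definitions (and assuming the correct-token metric is a plain arithmetic mean, which is just my reading), the difference is roughly
\[ \text{mean correct-token probability} = \frac{1}{N}\sum_{i=1}^{N} p_i, \qquad \text{PPL} = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\ln p_i\Big) \]
where p_i is the probability the model assigns to the reference token at position i. Perplexity is dominated by the rare positions where p_i is tiny, while the arithmetic mean weights every position equally, which is one way to see why the two can rank quants differently.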
This is not the case for all models. For Gemma-2 and, to a lesser degree, Llama-3 and Mistral Nemo, static Q5_1 should perform better than static Q5_0. As you noted in Qwen 2.5, they both perform very similarly, and for Phi-3.5, static Q5_1 should perform worse than static Q5_0.
This might be the case. I measured many of them in the past. Some inaccuracies based on different architectures are expected.
In my opinion the greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement, as we have all the data in the table under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. It also can be implemented in a way that keeps our privacy focused download page design.
@mradermacher I have some great news regarding the RPC based imatrix computation setup. It seems as if priming is no longer required in latest llama.cpp. At least I was able to compute an imatrix over RPC without priming first.
I used it yesterday to concatenate the parts of Hermes-3-Llama-3.1-405B-Samantha that at the time was uploading to HuggingFace because I forgot to hardlink.
I know it works with new enough coreutils and xfs/btrfs. The question is whether we ever solved the permissions problem, because it requires either a new syscall or an ioctl, both of which are blocked in the container. In any case, it's not really relevant, especially as I think quantizing from the source gguf is better.
To my knowledge nico1-pause informs both the imatrix and quantizing scheduler, as they are then both marked as paused on the website, and it shouldn't be surprising for a paused host to lose connection, because preparation for a reboot is a very common reason for me to pause nico1.
It is surprising that a host goes down during imatrix computation - pause only stops the scheduler itself, not the jobs running. And there is really no other way; for the scheduler, the job simply times out, with no information on why (it might still be running, the host might be rebooted etc). At the very least I would have to check uptime monotonicity before every job start. Unless we increase reboot frequency, I'd rather clean up occasionally than implement and test that in a running system :)
On the other hand, if it reboots when idle, the scheduler should indeed be able to cope with it, although in the past there have been issues with ssh not timing out etc.
I guess it should be better labeled.
The page does say more documentation is coming, and also, somebody we know said he would write a lot more info for users of that page. I don't have the slightest doubt that the page will improve over time :)
In my opinion the greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement, as we have all the data in the table under
Except we don't have the model size available via the API. We would have to download either metadata (as for the search, which is not synchronized to the repos), or (partially) download a quant and parse it. Or use some heuristic on the file size. I don't think quant sizes make much of a difference to warrant that, though - model families would make a bigger difference.
Also, I wonder about hf doing that (partial quant downloading), because I heard that one hidden cost of aws is that partial few-byte downloads cause the whole file to be copied internally and they would pay for that. At least there seem to have been some cases where such money-amplification attacks were done. I wonder if they are aware of that (if true). In any case, that was a tangent :)
I have some great news regarding the RPC
Indeed!
sleep
Well, that didn't work out. In any case, I am here now, and we can do stuff today. Since you haven't mentioned actually starting anything, I assume I can use rpc.
I know it works with new enough coreutils and xfs/btrfs. The question is whether we ever solved the permissions problem, because it requires either a new syscall or an ioctl, both of which are blocked in the container. In any case, it's not really relevant, especially as I think quantizing from the source gguf is better.
Just using cat doesn't work inside your container to instantaneously concatenate them? I thought I did it yesterday inside your container and it worked but maybe I was on the host instead.
It is surprising that a host goes down during imatrix computation - pause only stops the scheduler itself, not the jobs running.
The pause script waits for all jobs to be completed and uploaded - at least for quantization jobs. It apparently doesn't wait for running imatrix jobs to finish before terminating. We could for sure make the pause script wait until no more imatrix processes are running. In any case now that I know I will just make sure they are all done before I reboot so this won't happen again.
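Something along these lines might be all the pause script needs; the process name is my guess at how the imatrix jobs show up in the process table:
# wait until no imatrix process is running any more (process name is an assumption)
while pgrep -f llama-imatrix > /dev/null; do
    echo "imatrix still running, waiting..." >&2
    sleep 60
done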
The page does say more documentation is coming, and also, somebody we know said he would write a lot more info for users of that page. I don't have the slightest doubt that the page will improve over time :)
Yes, don't worry. I intend to improve it a lot. Once it is on every model card I will for sure have way more motivation to do so.
Except we don't have the model size available via the API.
It's on the HuggingFace model page so you can likely just scrape it or figure out how the webpage obtains this information. But honestly just going on the model size should be good enough, as it is just to give the user a rough quality estimation.
Well, that didn't work out. In any case, I am here now, and we can do stuff today. Since you haven't mentioned actually starting anything, I assume I can use rpc.
Awesome! I have not started anything and all hosts are ready to be used for RPC. I unfortunately won’t be able to help you much as I have to sleep now as I have work tomorrow (or I guess technically today because it is already past midnight).
Should something with the RPC servers go wrong, you can always SSH into them from nico1 using [email protected] and then enter tmux attach to access the RPC server console.
The quality values shown on the download page are meant to help the user choose the highest quality quant that can be run under given hardware constraints. I'm aware that there are quality differences between different families, and especially for static IQ3 quants those differences can be quite significant, but measuring every single quant of every single model is not feasible, so this is the best we can do with reasonable compute cost.
As you already figured out it all gets extremely complex, which is why our quality numbers are based on measurements instead of theory.
Our quality scale is based on mean correct token probability and not perplexity. We determined that correct token probability better matches with what a casual user perceives as quality than perplexity. I personally really don't like perplexity numbers. At least use KL-divergence numbers instead.
I'm sorry if I'm coming across as demanding. I really do appreciate the work team mradermacher does. I understand that measuring every quant would require a herculean amount of compute and is not feasible and I was not suggesting that.
My point was that the numbers on the download page can easily lead to confusion and misunderstandings, and I was just highlighting one such example, since for 4 of the 6 metrics on that page, including KL-divergence, it will show Q5_1 static being worse than Q5_0 static, and the other 2 metrics are either extremely noisy or extremely close. I've seen data (not going based on theory) that shows the other models I mentioned do not behave the same in that regard (and even then I probably should have been more specific on the exact models, as even within a model family that isn't always true; gemma-2 27b is erratic with the legacy quants but the 9B is not). This issue doesn't exist for the imatrix version of Q5_0 and Q5_1, both in your data and the other data I've seen.
The only other anomaly I've seen data of where a larger quant performs worse or the same is mistral nemo instruct 2407 and static k-quants around 3-4bpw.
Personally, I didn't see a point to the legacy quants anymore as they are legacy for a reason, but I found out from this account's discussion page that for some platforms and some users they are worth it for the lower energy consumption. I also like KLD data, which is why I'm so grateful you gave me a lot of it. It's hard to find, and resource intensive to create.
It would be awesome if one could, based on the model architecture, tell for each quant how good it will be. I'm quite skeptical that something like this will ever be possible in a way that would allow us to provide accurate individualized quality numbers for the approximately half a million quants we have uploaded so far. In any case this might be a really interesting problem to solve for some very intelligent PhD students.
That is impossible, like I mentioned the training data and ordering matters, and at that point even if it is possible to estimate, I don't see how that would be easier than just testing the resultant LLM.
In my opinion the greatest issue with the current quality numbers is that they are the same for all model sizes, while there are massive quant quality differences between 0.5B and 70B. This is something that should not take much effort to implement, as we have all the data in the table under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. It also can be implemented in a way that keeps our privacy focused download page design.
I have much less data on this, as there aren't that many great sources for metrics of larger (>15B) models, but are you sure that the closest Qwen 2.5 model size is representative of the quant quality of a non Qwen 2.5 model? I don't think so but like I said I don't have enough data to be completely confident about this, but as it stands I still don't believe that is true.
I think the notes section on the current model cards and download page, and the quality metric, which is derived from correct token but only shows integers and has no ties besides source/fp16, is helpful (maybe adding a comment to Q5_1/Q4_1 static explaining that whether it is better than Q5_0/Q4_0 is extremely hit or miss depending on the model). I think the other 5 categories (KLD, Correct token, Same Token, PPL, and Evaluation) are not helpful, as they have nothing to do with the model you are currently viewing, and are suggestive that they do.
For example with a Llama-2 based model the KLD of smaller quant's should be much better than what the table indicates as Llama-2 is nowhere near as dense as Qwen-2.5. Llama-2 was trained with 2 trillion tokens vs 18 trillion for Qwen-2.5, and the data I've seen also reflects that. I think that issue will persist even if you compare to the closest Qwen 2.5 rather than the overall Qwen 2.5.
The pause script waits for all jobs to be completed and uploaded
It's hard because the jobs technically run on kaos.
Just using cat doesn't work inside your container to instantaneously concatenate them?
I have no idea, I thought so. But maybe you enabled that and I forgot? Anyway, if it does, scripts will use it; if it doesn't, they will still work, so no worries here :)
It's on the HuggingFace model page so you can likely just scrape it
Except it's generally not on the hf page, yeah :) And for those models where it is, it is quite unreliable.
Anyway, Q8 quantisation for v3-base is running (at ~400MBps). Did you move the models to a slower disk? Probably not, which means it wasn't so wise to statically quantize two models at the same time before. Maybe it will work out if staggered with the imatrix quants.
I will boldly attempt to do more q8-quantisations while deepseek is imatrix'ing, as it shouldn't be that tight. Now is your chance to intervene :) Well, only the other deepseek model, actually.
It's hard because the jobs technically run on kaos.
It's fine. Now that I know, I will just check the status page and GPU utilization and see if there are any imatrix processes running before I reboot.
Except it's generally not on the hf page, yeah :) And for those models where it is, it is quite unreliable.
It's not on all SafeTensors models? The parameter count should also be super accurate as it originates from the ModelTensorsParams object. Just search the HTML source code for <div class="SVELTE_HYDRATER contents" data-target="ModelTensorsParams" data-props="{ and you will find the raw ModelTensorsParams object containing a ton of important model metadata including the parameter count. We can also use it to check if a model is llama.cpp compatible before even downloading, as ModelTensorsParams contains the tokenizer_config, which must contain LlamaForCausalLM or another architecture supported by llama.cpp.
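A rough sketch of that scrape, assuming the page keeps embedding the object as HTML-escaped JSON in the data-props attribute (the unescaping below only handles the common entities and may need more care):
MODEL="nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored"   # placeholder
curl -s "https://huggingface.co/$MODEL" \
  | grep -o 'data-target="ModelTensorsParams" data-props="[^"]*"' \
  | sed -e 's/.*data-props="//' -e 's/"$//' -e 's/&quot;/"/g' -e 's/&amp;/\&/g' \
  | jq .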
Anyway, Q8 quantisation for v3-base is running (at ~400MBps). Did you move the models to a slower disk? Probably not, which means it wasn't so wise to statically quantize two models at the same time before. Maybe it will work out if staggered with the imatrix quants.
No still the same BTRFS 2x SAMSUNG MZVL22T0HBLB-00B00 SSD pool as always. Each of them should have a 7 GB/s read and 5.2 GB/s write speed if empty and trimmed. 4KB read IOPS is 1000000 and 4KB write IOPS is 850000. Because we are using RAID 0 it should even be twice as fast under optimal conditions. Make sure to trim your SSDs and fill them so little that they run in SLC instead of TLC mode when possible.
I will boldly attempt to do more q8-quantisations while deepseek is imatrix'ing, as it shouldn't be that tight. Now is your chance to intervene :) Well, only the other deepseek model, actually.
Just check the host's memory first using /host/proc/meminfo
to make sure enough is available and adapt the cgroup limit accordingly. Please also leave a few GB as buffer, just in case. Keep in mind that while the host has 512 GiB of RAM, only 503 GiB of it are usable, and a few GB are also needed for the host and the containers hosting the StormPeak RPC servers.
DeepSeek should be slightly less tight than 405B, but both will be quite tight.
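For example, something along these lines (the cgroup path is an assumption, adjust it to wherever the quantisation tasks actually run):
# available memory on the host, in GiB
awk '/MemAvailable/ {printf "%.1f GiB\n", $2/1048576}' /host/proc/meminfo
# cap the quantisation cgroup accordingly, leaving a few GB of headroom (path is hypothetical)
echo 200G > /sys/fs/cgroup/quantize/memory.max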
It's fine. Now that I know, I will just check the status page and GPU utilization and see if there are any imatrix processes running before I reboot.
I can probably register the jobs on nico1 as well somehow, so it would be easy to wait. But not today :)
I can probably register the jobs on nico1 as well somehow, so it would be easy to wait.
Not so easily, but I can make network rpc calls in bash. Yay. (I've updated the pause script, it might wait for imatrix jobs now, pretty much untested).
Note to self, that's how you configure rpc imatrix for big models, also, set DONTRUN*.
"extra_args" : "--rpc 192.168.200.201:7201,192.168.200.202:7202,192.168.200.203:7203,192.168.200.204:7204",
"force" : 1,
"llama" : "/root/cvs/llama.cpp-nocuda",
"ngl" : "10000",
"quant" : "Q8_0",
I had secretly hoped Deepseek would be faster...
It's done but stuck at hfu and I can't find the imatrix or a log. Where does it even upload it to? There is not yet a DeepSeek-V3-Base-i1-GGUF repository on HuggingFace. I guess to kaos. Hopefully nothing broke, because after the imatrix task was done things went quite crazy and even somehow managed to crash one of my RPC servers, but I already restarted them all and everything is ready for DeepSeek-V3 as the next massive RPC imatrix computation task.
-3000 713 DeepSeek-V3-Base run/hfu
Edit: after about half an hour the DeepSeek-V3-Base hfu task is now completed/gone.
I had secretly hoped Deepseek would be faster...
It was for sure faster than expected. It only took around 10 hours, while 405B takes like 20 hours and FatLlama was like 40 hours. Keep in mind that half the performance is lost due to using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 while allocating memory far larger than the available GPU memory, instead of -ngl 0, due to the latter not being supported for RPC servers. The RPC overhead must be almost negligible.
nico1 completed the entire imatrix job backlog and is currently idle. Let's start RPC imatrix computation for DeepSeek-V3
if you have time. Any idea what happened to DeepSeek-V3-Base
was everything successful?
Oh nice, I see you just started it! Thanks. It is using all the RPC servers as expected.
The deepseek-v3-base imatrix is safe.
It's done but stuck at hfu and I can't find the imatrix or a log. Where does it even upload it to?
To kaos, which has all imatrix files in a directory, and serves them to the other hosts. Actually multiple directories, for various reasons. Currently ~70GB.
The hfu failed because the rsync had a rare failure where it failed after it had successfully transferred and deleted the file (at least, I hope so - I rely on rsync doing the right thing, and rsync rarely fails me completely, it is one of the few tools I trust a lot :), which causes the next attempt to fail as well.
(It ends with the path of the imatrix training data, so I assume it is complete)
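For context, the pattern is roughly the following (paths are made up), where rsync only removes the local file once it believes the transfer succeeded:
# push the imatrix to kaos and delete the local copy on success (hypothetical paths)
rsync -a --partial --remove-source-files /tmp/DeepSeek-V3-Base.imatrix kaos:/imatrix/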
Hopefully nothing broke, because after the imatrix task was done things went quite crazy and even somehow managed to crash one of my RPC servers, but I already restarted them all and everything is ready for DeepSeek-V3 as the next massive RPC imatrix computation task.
The imatrix-training-remote script had a little refactoring while deepseek was imatrixing. And an unclosed $()... and unfortunately, this was not counted as an error, so all following imatrix quants failed in a way that made the scheduler think an imatrix was created when it wasn't. Quite messy to watch, but nothing is lost.
I can't imagine the rpc server crashed because of anything that happened after deepseek, because the imatrix jobs following it would use the syntactically broken script, which was not capable of running any commands (basically it failed to compile after the first few lines), so no llama calls were made. I would assume the rpc server crashed during/at the end of deepseek-v3-base imatrix generation, so we should look out for this when deepseek finishes tomorrow noon or so. I will try to queue some other models before then going to the next big model, just like today.
It was for sure faster than expected
Your expectation was based on better understanding :)
nico1 completed the entire imatrix job backlog and is currently idle.
Yeah, it's all manually managed, unfortunately, and I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)
The deepseek-v3-base imatrix is safe.
The hfu failed because the rsync had a rare failure where it failed after it had successfully transferred and deleted the file
I'm glad and relieved to hear that.
it is one of the few tools I trust a lot
Great to know. I will use it more often in this case.
Quite messy to watch, but nothing is lost.
Great that nothing got lost. Refactoring scripts directly in production must be stressful.
I would assume the rpc server crashed during/at the end of deepseek-v3-base imatrix generation
That is probably exactly what happened. I remember that we had the RPC server crashing after imatrix computation back when we did FatLlama as well. It even was the same RPC server that had all its layers in GPU memory instead of using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1. But it is all fine, as I want to manually restart the RPC servers after every imatrix computation anyways, since I don't trust the llama.cpp RPC code to properly handle the transition from one model to another.
I will try to queue some other models before then going to the next big model, just like today.
There is very high demand for DeepSeek-V3
imatrix quants as nobody so far was able to compute an imatrix for it, so let's do them first. I'm personally really interested to try the imatrix quants of this model as well, and we even have oobabooga asking for it to do some Q2 quality measurements. DeepSeek-V3 should now also be prioritized higher than DeepSeek-V3-Base.
Hermes-3-Llama-3.1-405B-Samantha
and Hermes-3-Llama-3.1-405B-Uncensored
will take around 20 hours each for imatrix computation and be extremely tight to fit into RAM so let’s complete imatrix quants for DeepSeek-V3
first to not further delay that.
Yeah, it's all manually managed, unfortunately, and I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)
I wouldn't even know how to teach the scheduler the dependencies between all these jobs when we imatrix big models. And all that during a 6 hour phone call :)
That is quite impressive. I have trouble focusing on two things at once. Listening works, but as soon as I have to talk I can no longer do anything else. A 6 hour phone call is quite insane. I don't think I ever had such a long one.
it's all manually managed
I highly appreciate all the effort you put into all of this.
Great to know. I will use it more often in this case.
In that case, there are two main things that impressed me: whenever I needed an option, it was either already there or already in the queue. And when rsync fails to recreate the same file (by checksum) it will delay the update and try once more, and only then fail with an error, i.e. even if the algorithm has a bug, it would detect and report that. It's a lot of small things like that that increased my trust - it's not only trying to be a swiss army knife w.r.t. features, but also cares about correctness a lot.
Great that nothing got lost. Refactoring scripts directly in production must be stressful.
Mostly only if you don't have the time to deal with the fallout at that very moment. Otherwise it's mostly a challenge. You should try it more often in your company :-)
But, seriously, it was an absolutely trivial change... Aren't they all :(
There is very high demand for DeepSeek-V3
OK. I will probably be gone when it finishes, I can try to pause nico1 instead of essentially switching it off, so whoever sees the "I" first can unpause.
That is quite impressive. I have trouble focusing on two things at once. Listening works, but as soon as I have to talk I can no longer do anything else. A 6 hour phone call is quite insane. I don't think I ever had such a long one.
I probably have very similar problems focusing, but these phone calls are very relaxed, and I can almost always disengage for a while when I need to. It's not like a tense customer conference call or anything, so don't get the wrong impression. It mostly means I will watch the queue status from time to time, so if something goes wrong... too bad.
I think falcon-180b-chat is as disappointing as always, especially at lower quants, but I'd be happy to hear your assessment (we didn't have the -chat before btw.)
I successfully resumed nico1 10 minutes after it finished. DeepSeek-V3 hfu is stuck again, but it doesn't matter as it happened when it was already uploaded. DeepSeek-V3 and DeepSeek-V3-Base are now both quantizing. Thanks a lot!
I must say, you were quick :)
If it got stuck again, there is either another issue, or there is some generic issue with larger imatrix files (and indeed, in the past, it only happened with larger files). I'll have a look, but indeed, if the transfer is successful, it will distribute it, even if the imatrix scheduler feels drunk.
How fast is /bpool? I assume slower than your cpu would like. I will try to copy the models to my local disk to get faster quanting.
Or maybe not, I get weird speeds. Will experiment.
@nicoboss speed is weird. I can read the source gguf at almost 500MBps with llama-quantize, pv or rsync. Very consistently. But if I start two processes, I get 900MBps. Do you have some kind of per-process I/O rate limit? I ask because nico1's CPU is very bored waiting for data :)
@nicoboss speed is weird. I can read the source gguf at almost 500MBps with llama-quantize, pv or rsync. Very consistently. But if I start two processes, I get 900MBps. Do you have some kind of per-process I/O rate limit? I ask because nico1's CPU is very bored waiting for data :)
It's likely because of the relatively high compression level I used to make all these massive models fit on bpool. I used zstd-6 in case you wonder.
Oh also I reduced ZFS ARC cache to 1 GB during RPC computation and forgot to increase it to something more reasonable. I now increased it to 50 GB. Not sure if that will have any impact as this highly depends on how llama.cpp is reading the data.
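For reference, a minimal sketch of how that adjustment can be done at runtime on Linux:
# cap the ZFS ARC at 50 GiB (value is in bytes)
echo $((50 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max
# verify the current limit
cat /sys/module/zfs/parameters/zfs_arc_max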
Possibly, if only one core does the decompression, this could be the speed (otoh, zstd decompression speed does not depend much on the compression level, and usually it's not the process reading that does the decompression).
Anyway, I have barely space for one source model. I copied it over to my disk and the two quant jobs, which were at the same tensor when I noticed, are now widely apart (they have different cpu priorities, but that had no effect before).
And yeah, it's just plain reading. The strange thing is that if three processes read, I get 1.3GB/s, so it's not a fundamental issue.
Well, it's much faster now - I was worried that copying the source over would reduce I/O capacity so much that it wouldn't be a win, but it is.
The CPU is now mostly busy, but it's also doing IQ quants instead of the Q2_K quant earlier. However, since both jobs are doing the same quants, I guess separating the I/O still has an effect.
Another reason I completely forgot to mention is that back when I started, I realized I wanted things to go faster, so I increased the number of cores assigned to your LXC container from 48 to 60. Because the first quantization task had already started at that time, it likely created fewer threads than optimal, resulting in lower CPU utilization than usual.
How fast is /bpool?
Because I was curious, I checked what disk bpool is using. bpool consists of a single Samsung SSD 990 PRO 4TB. It has 7,450 MB/s read and 6,900 MB/s write speed when in SLC mode, but it currently is in TLC mode as it is almost full. It has 1,600,000 4K read IOPS and 1,550,000 4K write IOPS.
Possibly, if only one core does the decompression, this could be the speed (otoh, zstd decompression speed does not depend much on the compression level, and usually it's not the process reading that does the decompression).
And yeah, it's just plain reading. The strange thing is that if three processes read, I get 1.3GB/s, so it's not a fundamental issue.
That 500 MB/s per-process limit is likely related to decompression speed. It is not a per-process limit but a per-read limit. If you access the file using many concurrent read operations, ZFS will hand them to separate threads, resulting in much better read performance.
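A quick way to see the effect (the file name is hypothetical): one sequential reader versus several readers at different offsets of the same file.
# single stream
dd if=/bpool/DeepSeek-V3.SOURCE.gguf of=/dev/null bs=1M count=20000 status=progress
# three concurrent streams at different offsets
for off in 0 200 400; do
  dd if=/bpool/DeepSeek-V3.SOURCE.gguf of=/dev/null bs=1M count=20000 skip=$((off * 1000)) &
done
wait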
Well, it's much faster now - I was worried that copying the source over would reduce I/O capacity so much that it wouldn't be a win, but it is.
Performance is awesome now. Copying it for sure was the right decision. Once DeepSeek-V3 is done we continue with RPC imatrix computation without waiting for the now slower DeepSeek-V3-Base
Because the first quantization task had already started at that time, it likely created fewer threads than optimal, resulting in lower CPU utilization than usual.
Well, 99% idle means it didn't matter how many threads were created :) If you look at the disk I/O and cpu stats, you can clearly see the pattern (or could) - about 25s of disk I/O, followed by 6 seconds of CPU. Now the disk I/O phase takes about 7.5s (for V3, and the same old 25s for V3-Base).
when in SLC mode
Shouldn't matter, as TLC mode should have the exact same reading speed.
The problem is clearly not the hardware. I can easily get >1GBps when reading with multiple threads. But somehow, a single thread (such as llama-quantize or cp) tops out at around 450MBps.
It is not a per process limit but a per read limit.
I'm so glad I don't have to suffer this horribly badly designed filesystem on my side of things then. What happened to readahead, interleaving? I'm not asking for concurrent decompression, just, like, basic filesystem advancements we had since the early 90ies... (This is only half-joking :)
I also don't buy any such decompression limit. A single 4.3GHz efficiency(!) core of my desktop CPU decompresses a zstd -14 compressed f16 gguf at 1.3GiBps, while piping it into another program.
Alas, I had hoped it would have been some container I/O rate limit of sorts - not only would I then want to know how that works, but it would also be fixable :)
I'm so glad I don't have to suffer this horribly badly designed filesystem on my side of things then.
I'm starting to get quite convinced to switch to BTRFS. ZFS performance is bad, and its lack of zero-copy support for quickly concatenating files makes downloading quants over the command line annoying. I plan on switching all my AI related storage pools to BTRFS. This would also make all the temporary storage attached to your LXC container BTRFS.
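(For what it's worth, the zero-copy part on BTRFS would look roughly like this; the file names are made up, and stitching downloaded parts together without copying data additionally needs FICLONERANGE and block-aligned offsets, e.g. via xfs_io's reflink command.)
# instant, zero-copy clone of a file on the same BTRFS filesystem (hypothetical names)
cp --reflink=always model.SOURCE.gguf model.SOURCE.copy.gguf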
I'm mainly concerned about BTRFS RAID 5 support, which does not seem to be considered stable. I will soon build a 4x18 TiB RAID 5 pool replacing my current hpool. Using BTRFS for that would make a lot of sense, as it is not possible to defragment a ZFS file system, making HDD performance quite terrible after a few years. Will BTRFS RAID 5 read performance increase as well when I do RAID 5? ZFS RAID 5 with 4 disks gives you up to a 3x read speed increase compared to a single disk.
I'm mainly concerned about BTRFS RAID 5 support, which does not seem to be considered stable.
I would definitely not use that, although it seems stable in the sense that you need a full scrub after power outages (but that is pretty much the situation for most linux software raid5s as well, as well as for hardware raid that doesn't have extra backup for this). I still wouldn't use it, because practically nobody uses it in production, afaik.
(I once asked on the xfs list whether realtime subvolumes are finally stable in xfs on linux, after having used them on irix to good effect, and I was essentially told, "nobody knows, please start using them and then tell us. if nobody uses them, nobody will ever know" - I decided not to use them, but it was an honest response).
Personally, I use hardware raid5 (which has its own perils, although my experience has been pretty good) for main storage, and multi-device btrfs filesystems for cold(er) storage (with 4 times redundancy for metadata...). And I have a backup for the important stuff, although restoring my latest 140TB disaster took slightly over one month :(
ZFS is probably more reliable, in some sense. But maybe that's just as with OS X - I thought it had a well-thought out user interface until I actually used it myself for a bit, and was appalled how much worse than even windows it is, in terms of UI consistency. I was similarly appalled with ZFS, although I think it is better than the OS X UI :)
But, yes, for "just" some ai-related storage pools, I think you won't look back, even if you don't gain much. I still use ext4 for database backup store, a software raid5
with nvme-cache for gguf storage, xfs for my non-SMR backup disks and so on. The right filesystem for the job.
I'm starting to get quite convinced to switch to BTRFS.
I pray that will work out fine, otherwise you can rightly complain to me :) Although, zero copy support, while maybe a killer feature, is only one in a long series of features. I'd say if your management requirements are not that high, switching for certain things such as storage pools is probably something we will not regret - and you can still do a lot of management that most filesystems can't, such as shrinking devices, or adding/removing/replacing devices.
And btrfs is about as sensitive as zfs to hardware issues (as well as its own corruption issues, if any).
Anyway, the reason why I wrote the above is that I am usually a bit more careful with shittalking what other people use, because when my shit-talking convinces them to switch, and they are disappointed, it will be on me :) Therefore, feel free to use ZFS for all kinds of things where it works for you. zero-copy and speed are not that important (and btrfs is certainly not a fast filesystem. But it can recover, performance-wise, from disk full situations, where XFS/ext4 cannot for example).
I know you already know to use the right tool for the job, but I had to say it as insurance :)
Will BTRFS RAID 5 read performance increase as well when I do RAID 5?
I can't tell. Generally though, raid5 read performance will not be (much) faster than the equivalent raid0 volume, and with respect to btrfs, I don't think they have any optimisations, i.e. it will be ever so slightly slower than an equivalent raid0 because it won't use the redundancy. But that's just conjecture, not knowledge.
If you need redundancy, and mirroring is too expensive, I would recommend not to use btrfs raid5, but linux software raid. Or zfs... And with software raid, you can then choose whether writes are slow but safe, or fast and very slightly unsafe.
Or you have some kind of backup, and feel bold enough to accept potential problems. Then I'd be happy to hear about your long term btrfs raid5 experiences.
Was working on the model link button, only to find out that huggingface's markdown dialect seems completely undocumented, thwarting my plans. Sigh.
@nicoboss
since my favourite eye-catching button (http://data.plan9.de/hfbut.html) fails due to lack of support on hf's side, why not go oldschool
and simply link a nice gif^Wwebp animation as the button. that way, we can replace its contents on every page without changing the markdown at all.
@nicoboss I'll be asleep soon. If you wish and you see when deepseek-v3 is done, you can delete the SOURCE gguf in /tmp and copy over the V3-Base, and then e.g. symlink it over /tmp/quant/DeepSeek-V3-Base.gguf or so. Should be safe to use ln -sf at any time.
I would suggest continuing with the hermes models in the evening or later, assuming they take 20 hours, so that the box doesn't idle because they finish at a bad time. Or maybe whenever v3-base is done, because it shouldn't take that much longer. I will not try to do rpc imatrixing without your "go" signal.
I'm on b4457 btw.
soon b4458
I would suggest continuing with the hermes models in the evening or later, assuming they take 20 hours, so that the box doesn't idle because they finish at a bad time. Or maybe whenever v3-base is done, because it shouldn't take that much longer. I will not try to do rpc imatrixing without your "go" signal.
We would ideally start imatrixing as soon as DeepSeek-V3 is done uploading, because doing RPC on Monday from 08:00 to 18:00 would not fit well as I then need my infrastructure for work, and the only way to finish both of them before that would be by starting imatrix as soon as possible and no later than Saturday morning.
b4458
OK, I will make sure to update the RPC servers now because I know for a fact that the latest llama.cpp isn't compatible with the current ones. I figured this out the hard way when I tried measuring the network bandwidth.
I updated all RPC servers to b4458 and they are ready to be used.
The DeepSeek-V3 Q4_1 hfu task has already been stuck for 5 hours, with outgoing traffic averaging around 60 bytes/second. I checked DeepSeek-V3-i1-GGUF-DeepSeek-V3.i1-Q4_1.gguf*.log
:
DeepSeek-V3.i1-Q4_1.gguf.part6of9: 92%|█████████▏| 43.4G/47.2G [11:34<06:57, 9.32MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: aed16604-8c79-4cd0-abbc-054f32cd128f)')' thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/5d/08/
I killed the hfu process and hope it will retry the upload. I have copied the DeepSeek-V3-Base.SOURCE.gguf to /tmp storage, but due to this unexpected upload issue storage is getting somewhat tight, with 700 GB free.
Edit: The llmjob parent process doesn't seem to care and instead of retrying is just waiting for a no longer existing hfu process. Moving the log to /root also didn't help.
Edit2: Started another one using /usr/bin/perl /llmjob/share/bin/llmjob hf-upload-folder DeepSeek-V3-i1-GGUF DeepSeek-V3.i1-Q4_1.gguf*
- feel free to kill 3804514
if you want as it is not doing anything.
Edit3: Yes, just starting another one seemed to work, which is good as only 485 GB of storage is left.
Edit4: It uploaded it and even continued where it stopped! https://huggingface.co/mradermacher/DeepSeek-V3-i1-GGUF/commit/d6c0da4b6cde336b2da5c767a00cbeaf6ffc7e25
Edit5: Killed 3804514
as it is now useless.
Edit6: Manually deleted all the DeepSeek-V3.i1-Q4_1.gguf.part* files because they were not auto-deleted, probably because I only started a single task of a much bigger process, but everything is fine as it detected that this task is now done and finally started with the last DeepSeek-V3 quant, the IQ3_S one.
Good morning ;)
Let me sort through this.
The disk was full practically minutes after I left. Great. The quantize scheduler does not take imatrix jobs into account (and vice versa), but it normally works because about half of the disk is available to imatrix. But not at the moment, due to the massive deepseek gguf. Well, it still probably paid off.
The disk was full because we had a few 70b's too much. I think the python exception chain is a bit confusing - I don't think there was a protocol error anywhere and the OSError was simply local. I also don't see why disk full would affect uploading - why would huggingface_hub have to write anything to disk? But... yeah, it probably did and failed.
Now we also know what llama-imatrix does when the disk is full - it keeps running till the very end, despite diagnosing the disk full almost at the beginning (it saves the imatrix every 10 chunks), and then fails. Lovely. That must have cost some extra programming over the boring "crash when write failed" approach of us lower coders :)
The hfu process is the parent of the llmjob upload. It starts llmjob upload, so killing hfu does nothing much. The llmjob upload runs python as child, using the huggingface_hub library, and communicates with it via a pipe. killing the python3 child will kill the upload and retry. Killing the llmjob parent of the python3 process will also retry, but might keep python running, trying to upload as well.
The whole thing works like this:
quantize is done with a quant and creates a child process (bash forks) for uploading => the child runs hfu (also bash) and deletes the files after success => runs llmjob upload* (perl) in a loop => python that does the work.
quantize => quantize-fork => hfu => llmjob => python3
Killing hfu will keep the upload running, but will also then keep the quant files. If the quantize job is restarted, it would wait for the upload to finish and then try to upload it again, causing it to be deleted. If the quantize job finishes, you will eventually get a pink line in the status display because the job is done, but the GGUF directory is not empty.
You can quickly get a list of relevant processes using the "ils" command:
hfu 141233 141234 707273 712736
hfu-New-Dawn-Midnight-34b-GGUF 141233 141234 707273 712736
hfu-New-Dawn-Midnight-34b.Q5_K_M.gguf 141233 712736
hfu-New-Dawn-Midnight-34b.Q5_K_S.gguf 141234 707273
llmjob-Llama-3.1-8b-ITA-imatrix 136909 61139 61140 61141
llmjob-New-Dawn-Midnight-34b-static 141233 141234 707271 707273 712734 712736
"hfu" is all upload jobs, hfu-MODEL-GGUF all quantize-related ones (there is also an -i1-GGUF), and the Q5_K_M.gguf ones are uploading that one. The hfu processes are the ones from hfu downards, that is, hfu, llmjob upload, python (or other processes, such as sleep when it waits for a retry) and does not include the quantize child that waits for it.
The llmjob ones are the ones doing the work, for example by running the quantize script, which is responsible for the noquant and quantize phases.
It's exactly how I started, with a for loop that iterates through the quant types, then I added conversion to it, and then uploads. And now I am loath to touch it except for small changes :)
There is also an ikil command ("ikil -9 hfu-New-Dawn-Midnight-34b.Q5_K_M.gguf" would kill the upload and leave the files on disk). There are a few others, such as "iwait NAME" which waits for all processes with that name to exit (e.g. "iwait hfu" in the pause script waits for all uploads).
The quantize child that deletes files after a successful hfu should not be part of any named group, but I do not know if I fucked this up or not :)
Now you know maybe more than you ever wanted to know about this.
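For example, to force a retry of a stuck upload, one way (just a sketch on top of the description above, not part of the official tooling) is to walk the tree down to the python3 child and kill only that:
pstree -p 141233       # pid is just an example from the ils output above; follow hfu -> llmjob -> python3
kill -9 <python3_pid>  # killing only the python3 child makes llmjob retry the upload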
It uploaded it and even continued where it stopped!
the hub library will hash files before upload, and if a single file already was uploaded before, it will not upload it again, only the missing files. But it does not resume individual files. I assume that is what you saw.
However, for a month or so, the huggingface_hub lib has had a way to actually resume files, but it is a bit clumsy for our purposes (requires one dir per upload), requires cleanup, and I haven't looked deeper into it yet. It would be beneficial, though, as it only hashes the files once (but that also means extra trouble if the files change).
BTW., this is a very peculiar issue:
DeepSeek-V3.i1-Q4_1.gguf.part6of9: 92%|█████████▏| 43.4G/47.2G [11:34<06:57, 9.32MB/s]'(ProtocolError('Connection aborted.', OSError(28, 'No space left on device')), '(Request ID: aed16604-8c79-4cd0-abbc-054f32cd128f)')
The wrapper I use calls the upload method in a try block and reports any exception back to perl (which would possibly die or retry, but not report it in this format). So that means huggingface_hub failed internally, and simply printed the exception and then... chose to hang or so?
And something fishy is going on, 2 TB in use (du /tmp) but 2.8TB in use (df).
Ah right, the huggingface uploader was still hanging and keeping the deleted file around. Now we have plenty of free space again. Sigh.
DeepSeek-V3-Base failed:
/llmjob/share/bin/quantize: line 230: 685578 Bus error $QRUN "$QUANTIZE" --allow-requantize "${OVERRIDE_KV[@]}" $IMATRIX "$srcgguf" ./"$OUT.$HOSTNAME~" "$qmethod" $threads
A Bus Error... often means that the underlying file of an mmap is gone. What the heck (the source gguf is still there, afaics). I am also currently copying the SOURCE for Base, which didn't run into issues, other than getting relatively slow (normally, I get a very steady >400MBps, now it's more like 300MBps). I will resume once the file is copied.
Can we start with RPC imatrix computation now? All the RPC servers are on version b4458 and ready.
If we throw away the last 4+ hours of computation for deepseek-v3-base, yes, but what about the plan I laid out?
Also, unrelated, I wonder if it is normal for DeepSeek-V3-Base to be so slow. It's basically crunching on IQ2_XS for the whole morning till now, and is only half-way through. That strikes me as a bit extreme - hopefully the new llama doesn't have a slowdown, and IQ2 is really just so much slower.
The other issue is that we have such a backlog that I can't even force-push models anymore - some breathing space would be good (although we are not really critical yet, but the effective absence of nico1 for so many days is felt), but my plan of continuing tonight (or whenever deepseek-v3-base can be interrupted) does not include any.
If we throw away the last 4+ hours of computation for deepseek-v3-base, yes, but what about the plan I laid out?
We should obviously wait for things to finish. I see you already started the process by interrupting it once done.
I wonder if it is normal for DeepSeek-V3-Base to be so slow.
IQ2 took insanely long for DeepSeek-V3 as well. I'm not really sure why but wouldn't blame it on latest llama.cpp.
we have such a backlog that I can't even force-push models anymore - some breathing space would be good (although we are not really critical yet
The imatrix computation queue ran basically dry. You must mean the quant backlog. We can always create nico2 on Threadripper and nico3 on CastlePeak if we temporarily need more quant resources, or have nico1 delegate tasks to other nodes accessing the same disk using network storage. If we go this route, just keep in mind that CastlePeak is only turned on on demand (but that could be automated via wake-on-LAN), while Threadripper is always running but less energy efficient. All nodes will be unusable during RPC computation. With CastlePeak + Threadripper we could double the quant throughput of the nico nodes.
The imatrix computation queue ran basically dry.
for imatrix I need storage space, and I ran out of storage space elsewhere, indeed because of the quant power shortage, and the increased latency of imatrix calculations - and simply the sheer number of days :)
nico[23] would probably not help, because the shortage is only felt when nico1 is doing very large models for a long time (usually when it is tied down doing rpc), and only for multiple days. I don't care for the low-priority jobs, they can get stuck for weeks, it's more the daily ones, and mostly the requested ones.
In any case, I had a suggested plan, but haven't seen a reply to it, so I assume you missed it - since the 405b models take slightly less than a day, my plan was to start them in the evening, so we are both awake when they finish. The whole system needs manual adjustments when the jobs finish. I would have hoped I can get one more deepseek quant through, but I had to interrupt it to not risk delaying it further in case you insist on starting early :)
Anyway, you are the boss, so I will start asap.
for imatrix I need storage space, and I ran out of storage space elsewhere, indeed because of the quant power shortage, and the increased latency of imatrix calculations - and simply the sheer number of days :)
You could have deleted the DeepSeek-V3-Base source GGUF I copied to /tmp for faster quants, as we can always just copy it again if it's even worth it for the few remaining quants.
I'm generally wondering whether we need to increase storage on nico1. It would be nice not having to ever worry about it, but it really is only an issue if we are doing these massive models, which are rare. If we are doing normal models, even the 4 TB seems somewhat underutilized.
nico[23] would probably not help, because the shortage is only felt when nico1 is doing very large models for a long time (usually when it is tied down doing rpc), and only for multiple days. I don't care for the low-priority jobs, they can get stuck for weeks, it's more the daily ones, and mostly the requested ones.
That should luckily be rare. Having 4 such massive models at once was really unfortunate timing. We should never have more than 2 of them unless multiple large models happen to release at the exact same time as it was the case here.
In any case, I had a suggested plan, but haven't seen a reply to it, so I assume you missed it - since the 405b models take slightly less than a day, my plan was to start them in the evening, so we are both awake when they finish. The whole system needs manual adjustments when the jobs finish. I would have hoped I can get one more deepseek quant through, but I had to interrupt it to not risk delaying it further in case you insist on starting early :)
Sorry for not responding to it. I thought that plan got somewhat obsolete due to the delays we encountered. As long as we don't start it in the early morning the timing should be fine. Starting it now means it should complete sometime tomorrow morning when both of us are awake.
As mentioned before, the reason I pressed so hard on starting with the RPC imatrix tasks is that I hoped to get all remaining RPC imatrix tasks done before Monday working hours, when I usually need my infrastructure for work, but now that we started so late this probably isn't going to happen anyways. Having all the hardware configured for RPC is somewhat disruptive because the RTX 3080 GPU currently inside CastlePeak is the GPU I would otherwise use as display output on StormPeak, which I use as my main PC.
While RPC tasks are running I turn off every service on every one of my nodes to make sure enough memory is available. This includes the LXC container I use to host the development environment for my job. Luckily doing RPC on the upcoming Monday will be fine as I'm spending the entire day doing server hardware maintenance (installing GPUs into servers) and attending meetings, so I really don't need my development environment. Honestly just a lot of drama for nothing, because I'm too careful that nothing I do in my spare time could ever affect my job.
Anyway, you are the boss, so I will start asap.
I'm not. I always suggest what I believe is most beneficial for this project, but in the end, you can always overrule me and do whatever you want. If you for example know you will be sleeping on Sunday morning, it wouldn't have made sense to start it now.
Some good news regarding the 405B RPC tasks. Thanks to us using the same GPU offloading setup as for the even larger FatLlama 1.7T
memory is not as tight as I feared.
CastlePeak: 87.49% (220.07 GiB of 251.53 GiB)
StormPeak: 92.54% (465.65 GiB of 503.19 GiB)
Threadripper: 92.20% (115.82 GiB of 125.63 GiB)
NVIDIA GeForce RTX 3080: 9909MiB / 10240MiB
NVIDIA GeForce RTX 4090 01:00.0: 19147MiB / 24564MiB
NVIDIA GeForce RTX 4090 C1:00.0: 24115MiB / 24564MiB
NVIDIA GeForce RTX 2070 Super: 7787MiB / 8192MiB
It is still tight but if there is any task that can fit into the remaining memory feel free to run it at your own risk.
Yeah, my own estimate after studying /host/proc/meminfo for a while would say about 15GB should be very safe, more would take some experimenting. Unfortunately, that rules out most everything but rsync (I allow one rsync job at the moment). Quantizing does not strictly need much memory, but might cause thrashing.
I have added a second estimate that has higher spread but should converge faster (from below). According to that it should take 15 hours. I will also condition the queue so that, hopefully, it will do other imatrices once it is finished, and then continue with hermes...uncensored.
You could have deleted the DeepSeek-V3-Base source GGUF I copied to /tmp
You did that? I definitely did it (too) then. In any case, the DeepSeek-V3+DeepSeek-V3-Base would never have fit at the same time.
But the storage space that is getting low is the storage space on the other boxes. To get an imatrix job queued, another box must have converted it, and when more and more high priority models get queued in front of the existing ones, the space eventually gets tight, I can't queue more models and so on, especially if some of them are bigger.
If we are doing normal models, even the 4 TB seems somewhat underutilized.
I don't fully agree with this, as you have seen how quickly it can get full - it is a semi-regular activity. It just takes a medium-sized model and bad network (and lying to the scheduler, as is required for big models). But I don't suffer much from storage problems - during normal operations it is totally adequate (2TB would be too small though).
And big models always need handholding, both from swapping hardware, preparing boxes, shutting down services on your side, and configuration changes on my side.
That should luckily be rare. Having 4 such massive models at once was really unfortunate timing.
Oh, we also had an uncommon amount of 50B/70B models, too. But even lots of big models like this are not an issue if space can be saved by temporarily putting stuff on other storage pools (as with /bpool now) and there are some pauses for other things in between.
I thought that plan got somewhat obsolete due to the delays we encountered.
Well, I expected it to finish in the morning, and then the whole day till the evening would be a buffer day. But the whole timing was based on an expected 20 hour time, and it seems to be more like 15 hours + ~1h setup or so.
but now that we started so late this probably isn't going to happen anyways.
It might actually happen... We could even try immediate back-to-back.
And yeah, there is a tension between our uses and your uses of your hardware. So far, we managed pretty well, IMHO, to satisfy everybody.
Anyway, you are the boss, so I will start asap.
Well, you are, over your hardware. Don't worry, you didn't give me the feeling that you'd brutally overruled me.
If you for example know you will be sleeping on Sunday morning it wouldn't have made sense to start it now.
It might certainly be very tight, and I might not be there when it stops. And we don't do it often enough for bugs to be ironed out quickly :)
I would definitely not use that, although it seems stable in the sense that you need a full scrub after power outages
After every system crash as well and a scrub of 72 TB must take at least one day.
but that is pretty much the situation for most linux software raid5s as well, as well as for hardware raid that doesn't have extra backup for this
I'm glad ZFS doesn't have this issue.
I still wouldn't use it, because practically nobody uses it in production, afaik.
Seems unfortunately too risky for now, so I will likely have to go for ZFS again for the next generation of hpool.
ZFS is probably more reliable, in some sense.
It likely is, but it is also slow and misses many cool features that are in BTRFS, like defragmentation, zero copy and file/directory specific compression. In return, ZFS has some features it implemented better than BTRFS, like easily seeing the compressed size of a file, setting a size limit for a subvolume, or efficient virtual file systems for VMs, and in my opinion, with ARC, L2ARC and metadata caching, a better caching system. Like always there are many tradeoffs and there is no clear winner. One just has to decide on a case by case basis.
But maybe that's just as with OS X - I thought it had a well-thought out user interface until I actually used it myself for a bit, and was appalled how much worse than even windows it is, in terms of UI consistency. I was similarly appalled with ZFS, although I think it is better than the OS X UI :)
OS X is terrible in every way and so is Xcode which in my opinion is the worst popular IDE ever created. Every OS is better than OS X. I would even prefer ReactOS over OS X despite being an unstable mess with its own NT-like kernel.
But, yes, for "just" some ai-related storage pools, I think you won't look back, even if you don't gain much. I still use ext4 for database backup store, a software raid5 with nvme-cache for gguf storage, xfs for my non-SMR backup disks and so on. The right filesystem for the job.
I think so as well. For all AI related workloads, the advantages of BTRFS clearly beat ZFS. I will switch bpool to BTRFS once we are done with Hermes-3-Llama-3.1-405B-Samantha
and Hermes-3-Llama-3.1-405B-Uncensored
.
I pray that will work out fine, otherwise you can rightly complain to me :)
No worries I would never do that. It is not your fault if you convince me about something and I don't do enough research/testing myself to be sure it actually fits my purpose and is stable enough for my use-case. I would be fully to blame if I let that happen and I really appreciate your honest opinion about BTRFS.
Although, zero copy support, while maybe a killer feature, is only one in a long series of features.
The ability to defragment is a quite massive killer feature for any HDD based storage pool because having to copy all data to some temporary storage and back to a newly created pool just to defragment must be one of the worst designs ever. I don’t even want to think about how I will find 54 TB of temporary storage to rebuild it should new hpool ever get too fragmented. This is the main reason I would have liked going BTRFS over ZFS for hpool.
I'd say if your management requirements are not that high, switching for certain things such as storage pools is probably something we will not regret - and you can still do a lot of management that most filesystems can't, such as shrinking devices, or adding/removing/replacing devices.
The main thing regarding management I will lose is the ability to limit the size of a subvolume without ruining performance, but I rarely have the need to limit storage and instead prefer if everyone can use as much as they need until the storage pool is full, which then forces me to clean up or move things to different storage pools. If limiting the size is required, I can always create the storage over the Proxmox UI, which will then create a size-limited EXT4 loopback device on top of BTRFS. It is a bit annoying that there is no way to create BTRFS native storage pools using the UI, but I can implement that myself by editing the Proxmox web interface if I ever feel the need for it.
And btrfs is about as sensitive as zfs to hardware issues (as well as its own corruption issues, if any).
Like the bitrot on the SSDs you are using. I probably should run scheduled scrubs on them like I do on my ZFS pools, because as far as I'm aware that doesn't automatically happen for BTRFS by default.
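Something like a monthly cron entry would probably do (the mount point is just an example):
# /etc/cron.d/btrfs-scrub (example): scrub the pool on the 1st of each month at 03:00
0 3 1 * * root /usr/bin/btrfs scrub start -B /mnt/pool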
Anyway, the reason why I wrote the above is that I am usually a bit more careful with shittalking what other people use, because when my shit-talking convinces them to switch, and they are disappointed, it will be on me :)
As mentioned before, it is my responsibility to do my own research before doing something and not to randomly trust someone's personal opinion. And the same goes for everyone else. Nobody has the right to be upset with you for providing them with free advice.
I would say even for paid experts one would be stupid to blindly trust them as they often seem to have some kind of personal agenda like selling you certain types of products from which they get a commission.
Therefore, feel free to use ZFS for all kinds of things where it works for you. zero-copy and speed are not that important (and btrfs is certainly not a fast filesystem
Don’t worry I will always use whatever I feel best fits my use-case.
But it can recover, performance-wise, from disk full situations, where XFS/ext4 cannot for example).
ZFS cannot either, because someone thought that creating a file system without defragmentation capabilities was a good idea, despite releasing it in 2005 when HDDs were the norm.
I know you already know to use the right tool for the job, but I had to say it as insurance :)
No worries I will not and never would blame you for my own decisions no matter how much your input influenced them as they are my own responsibility. But I understand that you need to cover your ass as there are so many entitled idiots blindly trusting your advice and then blaming you for their mistakes. I likely should start adding disclaimers to all my recommendations as well just in case.
After every system crash as well and a scrub of 72 TB must take at least one day.
With 8 disks I usually have no issue saturating 12Gbit/s, but yes, "about a day" sounds right. But the disk is usable during that time.
Still wouldn't use btrfs raid5, too few people use it :)
I would even prefer ReactOS over OS X
That is very hardcore :)
I probably should run scheduled scrubs on them like I do on my ZFS
It's probably not worth doing it (for my pool), though - if it's raid1 metadata, then the chances of having a second corruption in the same block are low, and it will then likely be found during normal usage, or not be important. For data (in single profile) it would only detect, not correct, anything anyways, and we scrub all data we write, pretty much :)
For archives, sure.
ZFS cannot either, because someone thought that creating a file system without defragmentation capabilities
For decades, I copied my storage volumes once every 1-2 years (usually because of a disk upgrade), and that was the only way to recover performance. For a while. At least on the busy main raid volumes.
@nicoboss there is currently only 69G free on / - df shows 4023 GB used, but du only 3.4T (uncompressed size, even). lsof also doesn't show any deleted files that could account for that. (In fact, I just had a disk full condition, but managed to delete a model before imatrix would fail - but it is only a matter of time until it is full again).
Any idea what is going on? I don't currently see where these extra 600G could be.
For the time being, I've disabled automatic file uploads, so unless more space goes missing, at least the current imatrix should be safe.
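(For the record, this is the kind of check I mean; it lists open files whose link count is already 0, i.e. deleted but still held open by some process.)
lsof +L1 /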
Yeah, my own estimate after studying /host/proc/meminfo for a while would say about 15GB should be very safe, more would take some experimenting. Unfortunately, that rules out most everything but rsync (I allow one rsync job at the moment). Quantizing does not strictly need much memory, but might cause thrashing.
I think technically using 30 GB might be safe. RPC doesn't use mmap so the cached memory might not be needed. That should be enough for quantization tasks if you cgroup limit them to 25 GB.
The NVIDIA GeForce RTX 4090 on PCIe 01:00.0, unlike the other RTX 4090, doesn't use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 and instead runs all the layers fully in GPU memory, so that 30 GB of RAM together with the 5 GB of remaining GPU memory might be enough for -ngl 0 imatrix computation on small models.
Personally I don't think doing anything during RPC would be worth the time, effort and risk but feel free to go for it if you want.
I have added a second estimate that has higher spread but should converge faster (from below). According to that it should take 15 hours.
That's awesome. This must be because we are using the FatLlama 1.7T RPC configuration, which better distributes the layers across nodes, makes more intelligent use of the faster GPU memory and ensures the two RPC servers on StormPeak don't interfere with each other. Didn't expect that to save 4 hours, going from 19+1 hours to 15+1 hours.
I will also condition the queue so that, hopefully, it will do other imatrices once it is finished, and then continue with hermes...uncensored.
Well, I expected it to finish in the morning, and then the whole day till the evening would be a buffer day. But the whole timing was based on an expected 20 hour time, and it seems to be more like 15 hours + ~1h setup or so.
It might certainly be very tight, and I might not be there when it stops. And we don't do it often enough for bugs to be ironed out quickly :)
Great. It only taking 15 hours kind of messes with my plan as well, as it now might finish so early on Sunday morning that I will still be asleep.
It might actually happen... We could even try immediate back-to-back.
If we start Hermes-3-Llama-3.1-405B-Uncensored at the same time as or slightly earlier than Hermes-3-Llama-3.1-405B-Samantha today, we should be able to get it done before working time, and I can start my development environment at 08:17 before leaving for work in the unlikely case I would need it.
And yeah, there is a tension between our uses and your uses of your hardware. So far, we managed pretty well, IMHO, to satisfy everybody.
I'm really happy with how well my use and our use can coexist without impacting each other. Thanks a lot for how well you are handling this. I'm extremely happy with the current setup. It really couldn't be any better. I don't even feel any slowdowns when working on StormPeak while we are using it for imatrix and quants. It is just RPC where things are getting a bit difficult, but even there it is just a matter of planning RPC tasks in a way that they have the least impact. It is absolutely worth it to do RPC imatrix computations even if they require some effort and sacrifices, as those are the best openly available LLMs and the ones I end up using the most. The slight inconvenience of the RPC setup is nothing in comparison to what I went through to create the Hermes-3-Llama-3.1-405B-Uncensored and Hermes-3-Llama-3.1-405B-Samantha finetunes.
You did that? I definitely did it (too) then. In any case, the DeepSeek-V3+DeepSeek-V3-Base would never have fit at the same time.
Yes, you told me to do so:
@nicoboss I'll be asleep soon. If you wish and you see when deepseek-v3 is done, you can delete the SOURCE gguf in /tmp and copy over the V3-Base, and then e.g. symlink it over /tmp/quant/DeepSeek-V3-Base.gguf or so. Should be safe to use ln -sf at any time.
When I saw your message and saw that DeepSeek-V3 was doing hfu while showing 24/24, I symlinked it back to /bpool and deleted it from /tmp, then started copying DeepSeek-V3-Base to /tmp, which I then symlinked once the copy was done. I wasn't aware that after hfu 24/24 there was still a DeepSeek-V3 quant left, nor did it matter, as it just ended up doing that one from the slow storage pool. The only unfortunate thing is that it somehow managed to run out of storage. Maybe because you copied the same file despite telling me I should copy it?
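In other words, roughly this sequence (the exact paths are from memory and partly guessed):
# point the job back at the copy on the slow pool and free /tmp
ln -sf /bpool/DeepSeek-V3.SOURCE.gguf /tmp/quant/DeepSeek-V3.gguf
rm /tmp/DeepSeek-V3.SOURCE.gguf
# copy the next model to the fast pool, then switch its symlink over
cp /bpool/DeepSeek-V3-Base.SOURCE.gguf /tmp/
ln -sf /tmp/DeepSeek-V3-Base.SOURCE.gguf /tmp/quant/DeepSeek-V3-Base.gguf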
But the storage space that is getting low is the storage space on the other boxes. To get an imatrix job queued, another box must have converted it, and when more and more high priority models get queued in front of the existing ones, the space eventually gets tight, I can't queue more models and so on, especially if some of them are bigger.
That is indeed quite unfortunate. I don't think there is much we can do about that. Maybe we could run some small imatrix tasks while doing RPC but large ones will always have to wait. Best mitigating factor for sure is always completing the imatrix queue between RPC tasks as we currently do.
I don't fully agree with this, as you have seen how quickly it can get full - it is a semi-regular activity. It just takes a medium-sized model and bad network (and lying to the scheduler, as is required for big models). But I don't suffer much from storage problems - during normal operations it is totally adequate (2TB would be too small though).
Should it at some point no longer be enough just let me know and we could consider adding a third SSD to spool.
And big models always need handholding, both from swapping hardware, preparing boxes, shutting down services on your side, and configuration changes on my side.
It's currently not possible to automate RPC, mainly because I have to physically move the RTX 3080 GPU from StormPeak to CastlePeak - at least until I ever buy another GPU. I could automate shutting down services, and the configuration part on your side could maybe be automated as well. Luckily models requiring RPC are so rare that automating them is not a big concern, and doing them manually allows us to carefully plan when we do them to minimize the impact they have.
Oh, we also had an uncommon amount of 50B/70B models, too. But even lots of big models like this are not an issue if space can be saved by temporarily putting stuff on other storage pools (as with /bpool now) and there are some pauses for other things in between.
I like the strategy of moving some things to temporary storage as that way I can use the storage for other projects if we are not currently doing big models. That way we can make optimal use of storage resources at the cost of some additional work. I will switch bpool soon to btrfs increasing its performance and making sure it will always be reserved for AI workloads.
Any idea what is going on? I don't currently see where these extra 600G could be.
I will investigate this and let you know once I figured it out.
Personally I don't think doing anything during RPC would be worth the time
The only thing worth it would be running hfdprep or quantisations, unless somebody eagerly waits for an 8b imatrix - doing small imatrix ones between big rpc jobs is fine - when I only look for models once per day, we already have 24h maximum latency...
Didn't expect that to save 4
Well, we are not there yet, but it sure looks like it. I hope my formula isn't wrong...
Maybe because you copied the same file despite telling me I should copy it?
Maybe. I am wholly confused now.
I don't think there is much we can do about that. Maybe we could run some small imatrix task
Well, doing some imatrix between rpc ones is already helping, and is usually good for a few days. But queueing theory says that arrival times will be clumpy, so it's just unfortunate that we had such an overload :)
In hindsight, the solution would have been to not quantize both deepseek models at the same time. Will remember that.
The next unknown is whether the imatrix scheduler will wait for both gpus to be empty before it starts the next 405b job, as it should, but that has never been tested. But with some luck I'll be awake watching. If you can't find anything about the missing 600G (or if they are not really missing for some reason, but the disk really is full) I'll delete one of the hermes ggufs tomorrow.
Well, we are not there yet, but it sure looks like it. I hope my formula isn't wrong...
It will be right, as both of the DeepSeek-V3 RPC imatrix jobs were faster than expected as well. I first thought maybe MoEs are just faster, but now it's clear that it's the setup, maybe in combination with some llama.cpp improvements.
In hindsight, the solution would have been to not quantize both deepseek models at the same time. Will remember that.
Doing them all together is unfortunately quite convenient from an RPC hardware setup perspective. Having to move the GPU back and forth between StormPeak and CastlePeak for every model we want to do over RPC would be quite time consuming. The GPU is too heavy for CastlePeak and so requires single-use cable ties to prevent it from bending so much that the GPU fan hits the cables below, while on the StormPeak side the power cable is a bit too short, so it takes a while to get them in and out, but an additional GPU would solve these issues.
In fact, I just had a disk full condition, but managed to delete a model before imatrix would fail - but it is only a matter of time until it is full again
That is so scary. I'm glad you were able to prevent it from failing just in time.
Any idea what is going on? I don't currently see where these extra 600G could be.
If you can't find anything about the missing 600G (or if they are not really missing for some reason, but the disk really is full) I'll delete one of the hermes ggufs tomorrow.
Turns out the culprit was the deleted 800 GiB EXT4 image I used on the 26th of December to convert the DeepSeek models into the BF16 base model. It was still using around 750 GB of storage despite being empty and deleted. I did delete it over the Proxmox UI and the image was gone, but the storage wasn't freed because there was still a terminal open somewhere that had that folder as its working directory, which apparently is enough to prevent it and its contents from being deleted.
lsof | grep spool
bash root cwd DIR 0,0 16 256 /spool/images/107/vm-107-disk-0 (deleted)
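A more general way to hunt for space pinned like this (just a sketch; +L1 lists open files whose on-disk link count has dropped to zero):
lsof -nP +L1              # open-but-unlinked files still pinning disk space
lsof -nP | grep deleted   # broader sweep that also catches handles like the cwd above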
The next unknown is whether the imatrix scheduler will wait for both gpus to be empty before it starts the next 405b job, as it should, but this has never been tested. But with some luck I'll be awake watching.
Let's hope that works out. I'm also hoping the RPC servers can do this without a restart, but they probably can. Should they crash, I made it so they immediately restart, and in the worst case you can even SSH into them or wait for me to be awake. Even if we start it on Sunday noon it will still easily finish before Monday 08:17, assuming it only takes 16 hours.
We unfortunately experienced an OOM event on StormPeak which ended up killing the llama-imatrix process but ironically none of the RPC workers:
-2000 811 Hermes-3-Llama-3.1-405B-Samantha error/1 (GPU-2d) / 240.52s/c 588.1/1258.7m(938.7-1009.1) [183/314] 6.6381 (status: failure)
[Sun Jan 12 00:27:39 2025] Out of memory: Killed process 2080792 (llama-imatrix) total-vm:14656992kB, anon-rss:615044kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:10008kB oom_score_adj:800
ZFS is to blame for this. I forgot that it was 00:17 on the second Sunday of the month. By default, ZFS does all its scrubs then. Because ZFS developers lack some common sense, they decided it is a good idea to scrub all the storage pools at the exact same time, which leads to a massive resource peak. Because it is all at once, it managed to eat up enough memory to OOM-kill the llama-imatrix process. I'm quite surprised the kernel didn't OOM crash, because with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 it really should have.
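If it ever becomes annoying enough, the scrubs could be staggered per pool instead of firing all at once - a sketch only (pool names and times are assumptions; \% is how a literal % is written in a crontab):
# /etc/cron.d/zfs-scrub-staggered - hypothetical replacement for the all-at-once monthly scrub
0 1 8-14 * * root [ "$(date +\%w)" = 0 ] && zpool scrub rpool
0 4 8-14 * * root [ "$(date +\%w)" = 0 ] && zpool scrub spool
0 7 8-14 * * root [ "$(date +\%w)" = 0 ] && zpool scrub bpool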
Most of the scrub tasks finished by themselves after a few minutes and the other ones I canceled. Thanks to your awesome preparation, nico1 is not idle until you wake up but instead started working on the Hermes-3-Llama-3.1-405B-Uncensored RPC imatrix. I'm a bit surprised it hasn't done the higher priority imatrix tasks in-between first but it makes sense due to GPU-18 already being pre-allocated to it. New plan is to finish Hermes-3-Llama-3.1-405B-Uncensored, let the other imatrix quants run and then immediately retry Hermes-3-Llama-3.1-405B-Samantha.
49 811 Hermes-3-Llama-3.1-405B-Uncensored run/imatrix (GPU-18) / 232.73s/c 208.9/1218.0m(1042.7-10932.4) [6/314] 3.2876
I'm a bit surprised it hasn't done the higher priority imatrix tasks in-between first but it makes sense due to GPU-18 already being pre-allocated to it.
I am even more surprised - the "pre-allocation" is because it just shows the object member which isn't cleared, it would be ignored when it is not running.
I would assume the failed job might still allocate resources (because the scheduler does not know in which state it is), and the other job has the force flag set to ignore the budget. Sucks.
Update: yeah, since it was force'd, it would simply ignore resource allocation, because I would need a distinct scheduling class ("rpc") to model separate resources. So the whole setup wouldn't have worked either way. Worse, if the scheduler had run for whatever reason, it would have immediately started the next rpc quant. I think I wanted to rely on the fact that the GPU allocation still does its job and reduced the number of gpus to 1, but then accidentally commented out that line again. Very unstable.
Doing them all together is unfortunately quite convenient from an RPC hardware setup perspective.
I meant quantization - it would have been easy to only quantize deepseek-v3 and some smaller models in parallel. The reason why I did both together was so that I could give ...-base a higher nice level, so deepseek-v3 had priority. For smaller jobs I would have to code it into the scheduler instead of manually renicing.
Turns out the culprit was the deleted 800 GiB
I am so relieved :)
compute_imatrix: 76.42 seconds per pass - ETA 7 hours 44.85 minutes
That is for a 20B. That kind of thwarted my plan for quickly doing some imatrix calculations (the time has updated to 100-120min, but that's still remarkable for a 20B).
Must have been some weird nvidia thing - after 260 chunks it kind of normalised. But boy are we behind the schedule.
And unfortunately, I'll be gone for two hours. Will try to start the next model before I come back though.
Must have been some weird nvidia thing - after 260 chunks it kind of normalised.
No, it was your scheduler starting the Hermes-3-Llama-3.1-405B-Samantha RPC imatrix computation while doing the other imatrix computations and quantisation tasks.
[Sun Jan 12 15:44:52 2025] Out of memory: Killed process 3298227 (llama-imatrix) total-vm:801635412kB, anon-rss:586492kB, file-rss:9728kB, shmem-rss:0kB, UID:100000 pgtables:1553044kB oom_score_adj:800
It also crashed the GPU-only RPC server due to running out of GPU memory. We can call ourselves lucky this didn't crash the host because it really should have.
Guess we are now doing quantisations while doing imatrix RPC - I hope this was intended:
49 811 Hermes-3-Llama-3.1-405B-Samantha run/imatrix (GPU-2d) / 236.79s/c 68.6/1239.2m(59811.9-2394.0) [9/314] 3.3868
-9001 689 I DeepSeek-V3-Base run/imatrix 17/24,Q5_K_S [89/1025]
It seems to work based on the available RAM, so everything will be fine; just make sure to stick with one quantisation task while RPC imatrix is running:
It also crashed the GPU-only RPC server due to running out of GPU memory. We can call ourselves lucky this didn't crash the host because it really should have.
Holy shit! We can also be lucky the rpc servers didn't accept both processes.
Update: ok, I see, not both processes, it was after the other 405b was finished.
Guess we are now doing quantisations while doing imatrix RPC - I hope this was intended:
Yes, unlike the double imatrix one, this is intended. I had some trouble understanding how nested systemd-run calls work w.r.t. resource limits - apparently, new scope == new independent limits, which is a bit annoying, because I wanted to run the quantize shell script and all uploads in the same scope, but quantize runs llama-quantize in its own scope, with again new resource limits.
It's because you were kind of ... sounding ... in an experimental mood yesterday, and I thought, now or never (the imatrix just having been started).
In any case, right now, there is still 26G of cache, so I guess we are not that tight. And deepseek has pretty tiny tensors (~8GB max unless I missed one).
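Coming back to the systemd-run thing, the pitfall looks roughly like this (a sketch; the limit value and the quantize invocation are just placeholders):
# the inner scope becomes its own unit, so it is NOT bounded by the outer MemoryMax
systemd-run --scope -p MemoryMax=64G bash -c '
  # anything run directly in here is capped at 64G ...
  systemd-run --scope -- llama-quantize in.gguf out.gguf Q4_K_M  # ... but this gets a fresh scope with its own limits
'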
Holy shit!
Seems the rule of "start if job is forced and current ram_usage is 0" somehow triggered despite ram usage obviously not being 0. I have no idea how that happened.
Just a heads-up: /bpool is no longer in use by me.
Why are so many imatrix tasks currently marked as blocked/imatrix/gpu?
Maybe because I paused them for a few hours yesterday, even though I unpaused them again long ago? While we are at pausing: would it be possible to have a separate /tmp/pause trigger for each GPU? I always end up having to pause both of them even if I only need one. Maybe we could get rid of /tmp/pause and implement pausing/unpausing imatrix tasks similarly to nico1-pause and nico1-resume, so the scheduler is aware which GPUs are available. I'm currently using /root/handlePause.sh to pause/unpause, so if you have time feel free to edit this script accordingly by adding arguments to specify the action and GPU, and making it blocking so it waits for the specified GPU to finish its current imatrix tasks when paused.
Why are so many imatrix tasks currently marked as blocked/imatrix/gpu?
There were empty ".slog" files for each of those on kaos. Basically the screen/job output. But no .status file (with the exit code). As a result, the scheduler had no idea what state they were in and left them alone.
This is usually the result of a job either still running, or being killed without having a chance to write the exit code. For example, when I press ^C in screen, it would be like that. But of course I did not.
Now, as to why it was like that... I don't know. They are all from yesterday afternoon 15:20-15:30 CET.
The touch file method of pausing them should be absolutely harmless - it's just the shell script looping, i.e. for the scheduler, it should just be a longer job.
The log file does not show anything of interest (e.g. for Anubis, it downloaded the gguf, detected its size, then didn't start it because other jobs were running), it did continue to queue others, so it wasn't immediately obvious. Maybe I did something at the time, but I don't remember.
I suspect it's the problem where screen (apparently) recreates a zero-byte log file long after the job is finished, i.e. job sets exit status, scheduler cleans up all files, screen recreates the log file, scheduler is stumped. Possibly because kaos was so busy at the time. It is somewhere on my todo list to either change how log files are written or get rid of screen, which did its job during development. But you know, everything will subtly break when I do that, so ... :)
pausing/unpausing imatrix tasks similarly to nico1-pause
Actually, there is, but not per-gpu. It would have been exposed fully whenever I get around to letting you take control of the queue etc., alas, life. I'll think about it.
A /tmp/pause.gpuid or so would be a quick fix, but that will just block a job. I once suggested a config file that would be fetched before each job is scheduled. But I'll try to do something more intelligent on the server side.
echo pause GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
echo resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
The other gpu "uuid" is GPU-188a5143-db69-7058-63b5-f2f1d2354f91
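If it helps, a tiny wrapper in the spirit of your handlePause.sh could be as simple as this (just a sketch; script name and argument handling are assumptions, and it does not wait for the GPU to drain):
#!/bin/bash
# usage: gpu-pause.sh pause|resume <gpu-uuid>  (hypothetical helper around the echo commands above)
action=$1 gpu=$2
case $action in
  pause|resume) echo "$action $gpu" >/dev/tcp/10.28.1.1/16713 ;;
  *) echo "usage: $0 pause|resume <gpu-uuid>" >&2; exit 1 ;;
esac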
I'm testing it right now.
Works for me.
I should mention that there is no feedback for this pause on the status screen. I'll probably change how that is reported, too.
All pause flags are shown in the status header now:
last updated: 2025-01-19 13:42:01+0100 (1s) (imatrix.GPU-188a5143-db69-7058-63b5-f2f1d2354f91)
echo pause GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
echo resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
Thanks a lot for implementing this so quickly. This is awesome as I can now use one of the RTX 4090 GPUs without pausing the entire imatrix queue.
All pause flags are shown in the status header
That's perfect.
Thanks a lot for implementing this so quickly. This is awesome as I can now use one of the RTX 4090 GPUs without pausing the entire imatrix queue.
It's indeed great for the future, but so far, that wasn't holding us back. What causes stress to the queue right now is the sheer amount of models and big models that have been released in the last two weeks, limiting even progress of the low-priority models. But we are getting there :)
@nicoboss Tell me that you paused a gpu on nico1, because I am confused and don't know if I did it and forgot to resume ;)
In other news, I'm finished queuing everything I'd ever wanted to queue from february to december last year. On to richard's list.
Tell me that you paused a gpu on nico1, because I am confused and don't know if I did it and forgot to resume ;)
I did pause the second GPU intentionally around an hour ago to give Guilherme34 the opportunity to test his new models. Guilherme34 needing some GPU resources today is the reason why I asked for the single-GPU pause feature to be implemented, and I'm really glad to have it. I would usually give him the RTX 3080 but I'm currently using it myself.
What causes stress to the queue right now is the sheer amount of models and big models that have been released in the last two weeks, limiting even progress of the low-priority models. But we are getting there :)
Having many new exciting great models is awesome, so don't worry about them delaying our progress on the low-priority ones. We will eventually get to them. The model backlog has already shrunk massively compared to our peak of over 4000 models.
In other news, I'm finished queuing everything I'd ever wanted to queue from february to december last year. On to richard's list.
That's awesome to hear! We are making such great progress.
I did pause the second GPU intentionally
That's a relief :) I forgot about the timing and my command history from testing was a bit jumbled, so I really wasn't sure.
That's awesome to hear! We are making such great progress.
Yeah, and on to january and 2023 g
Venting: sometimes, it is the little things. I am trying to automate (some) llava mmproj extraction.
fname_out = f"{model_name.replace('/', '-').lower()}-vision.gguf"
Of course, the output filename is not configurable. Sigh. Why would anyone go to these lengths to make the output filename hard to guess.
@nicoboss rich1 seems to hang again (ssh does not greet, but wireguard pings still work)
Wow that was fast. After waiting 15 minutes, I decided to notify you (cry for help), and seconds later, the problem seems solved :)
OK, I think that extracting vision data from models takes enormous amounts of memory, multiple times the size of the whole model data, apparently (32GB is not enough to extract a 7B), and this caused the hang.
Sigh. This does not work out.
That practically means nico1 is the only box that can do vision model extraction.
You have any cool ideas around this? Because that means I have to schedule certain model architectures on certain hosts now.
@mradermacher my server was hanging from mmproj for some reason, so I guess please don't generate it there. I guess it's because it doesn't have enough ram
Wow that was fast. After waiting 15 minutes, I decided to notify you (cry for help), and seconds later, the problem seems solved :)
I was sitting for quite a while trying to find the source of the DDoS attack lmao
That practically means nico1 is the only box that can do vision model extraction.
I'm fine with having all the vision models on nico1.
You have any cool ideas around this?
Have you tried to just cgroup limit the mmproj extraction and see what happens? Unfortunately I'm quite certain it will crash with an out-of-memory error, as I had similar issues back when I did the mmproj extraction for https://huggingface.co/mradermacher/model_requests/discussions/415.
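To be concrete, the kind of cgroup cap I mean is something like this (just a sketch; the limit value and the extraction command are placeholders):
# run the extraction in its own transient scope with a hard memory cap
systemd-run --scope -p MemoryMax=32G -p MemorySwapMax=0 -- \
  python3 extract_mmproj.py /path/to/model   # placeholder for whatever surgery/conversion script is actually used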
@mradermacher When would work best for you to start with the DeepSeek-R1 imatrix computation? I would need to reboot StormPeak and move the RTX 3080 GPU back to CastlePeak before we can start, which requires me to pause nico1. We could start tomorrow late morning/early afternoon as by then the DeepSeek-R1-GGUF static quants should be done and there is enough time for you to prepare a Q8 model to be used for imatrix computation. It also gives enough time for the morning imatrix queue to be completed.
Have you tried to just cgroup limit the mmproj extraction and see what happens?
Yup, it gets killed.
We could start tomorrow late morning/early afternoon
Sounds like a good tentative plan, modulo disaster. And I can probably even quant during the imatrix computation. Non-vision models, that is. I hope I can make some inroads with the queue, but it's close to being normal finally - only two 70Bs left on nico (much worse elsewhere, but we are getting there).
(Well, I'll probably only have a bit of time during noon to prepare, that's the only issue with the plan)
I'm fine with having all the vision models on nico1.
I've already changed job adding so it does that now. It does add horrible dependencies though, such as not having defined memory requirements for quanting. And it's probably a one-line change to fix (use_temp_files=True or so). I really don't understand why the llama.cpp developers think memory is free.
I'm actually scared to look at the code, because I can't fathom why a 15GB model resulting in 1.4GB vision tensor output would need more than 64GB of RAM to produce. Do they expand it to double or simply load the model twice? I mean, what else could it be?!
I will queue new jobs before I go to bed, and then probably before noon, then force as many jobs as reasonable on other nodes so they can get some imatrix computations in.
It gets better and better. Apparently some qwen2vl vision models insist on cuda.
Yeah, it seems two 70B models vision extraction triggers the oom killer on nico1. This is troubling.
Update: yeah, 270g peak for a single 70B model. And all it outputs is 1.4GB. It must load and convert all tensors to f32.
Well, I'll probably only have a bit of time during noon to prepare, that's the only issue with the plan
I will try to have everything ready by noon.
And I can probably even quant during the imatrix computation.
Yes, you can quant non-vision models while RPC imatrix is running, but maybe only with one concurrent task.
Yeah, it seems two 70B models vision extraction triggers the oom killer on nico1. This is troubling.
Can you somehow have the mmproj task check if another mmproj task is currently running and if so, wait for it to finish? That way we should never OOM unless there is an absolutely massive model. It makes me happy to finally see the OOM reaper do its job instead of letting the kernel crash. I'm currently using slightly over 100 GB myself so that likely contributed to the OOM situation as well.
It gets better and better. Apparently some qwen2vl vision models insist on cuda.
Good thing we are doing them on nico1, but let's hope they don't need as much GPU memory as they need RAM.
It does add horrible dependencies though, such as not having defined memory requirements for quanting.
It will probably just steal mmap RAM from the imatrix tasks and then free it again once it's done, so it shouldn't be an issue as long as you don't run multiple of them at once.
I can't fathom why a 15GB model resulting in 1.4GB vision tensor output would need more than 64GB of RAM to produce
I'm now somewhat intrigued what they are doing as well. Seems quite ridiculous.
Can you somehow have the mmproj task check if another mmproj task is currently running and if so, wait for it to finish?
The mmproj task is the "noquant" task, and the default is to only run one. It was only a problem because I did maybe 15 models tonight, and let up to 6 run concurrently.
The bigger issue is a) interference with other big tasks such as imatrix and b) cuda.
[cuda] Good thing we are doing them on nico1
Actually I currently have to skip them because I compile all quant-related stuff without cuda, and some libraries like to pick up cuda if it's available, and I don't want to install cuda libraries on all hosts. So far, it affected maybe 4 models, and the problem is bitsandbytes.
It will probably just steal mmap RAM from the imatrix tasks and then free it again
Yeah. I'll have a look and see if something obvious can be done about it. But I think you noticed how much I like maintaining forks :)
I'm now somewhat intrigued what they are doing as well. Seems quite ridiculous.
Oh, my, I would never try to stop you from having a look yourself :-)
Currently I only support qwen2vl, btw.
@mradermacher The RPC servers are now ready to be used for DeepSeek-R1-GGUF in Q8 (F16 obviously won't fit). I updated them to latest llama.cpp.
I slightly changed the weight distribution to put slightly more layers on CastlePeak, so if it fits with that configuration, we might have slightly more RAM available on StormPeak while the RPC imatrix computation is running.
Morning. Haha, that brutally didn't work out. I don't even know why imatrix calculations stopped. Sigh. I'll try to find out.
Ah, OK, most did get through, but again kaos was apparently too busy for some. Hmmhmm.
OK, things are not that bad, nico1 is pretty empty. I see how far I get with noromaid, but probably by the time everything is ready it will be through as well.
Ok, not perfect, but we are all set to go. Unfortunately, I will have to remove the override manually once the imatrix jobs have cleared, and I will probably be a bit busy when it happens, but I will give my best :)
@nicoboss actually, I have touched /tmp/pause on nico1. The job should start but pause when one gpu is free, so whoever sees both gpus free first can rm that file.
Actually, sorry for the noise, there actually is code that should only start it once all gpus are unused, so I unpaused and will hope for the best.
@nicoboss
Also, regarding the DeepSeek-R1-Zero, you think we can have a Q8 in time? If you manage to convert it, you can rm -rf tmp/quant/DeepSeek-R1-Zero to free some space, and maybe make a quantize from it to /tmp/DeepSeek-R1-Zero.Q8_0.gguf, and I can set up the job so it will start once the previous job is done, or so.
Update: I also wish space would exit the @name autocompleter.
Haha. Everything configured correctly (a first!), but I managed to put the quant into / not /tmp. And then I moved it to ~ instead of /tmp. Smooth operator :)
regarding the DeepSeek-R1-Zero, you think we can have a Q8 in time?
Yes, I can by juggling things around. I can BF16 the model to some SSD NFS network storage, then delete the HF model on spool and put the source GGUF on spool. Then I can move the source GGUF back to the NFS network storage and Q8 quantize to spool. Possible but harder than usual. I will put it to /tmp/DeepSeek-R1-Zero.Q8_0.gguf once done.
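For reference, the final step would then be something along these lines (a sketch; paths and thread count are assumptions):
# BF16 source GGUF -> Q8_0 destination (paths are placeholders)
llama-quantize /spool/DeepSeek-R1-Zero.BF16.gguf /tmp/DeepSeek-R1-Zero.Q8_0.gguf Q8_0 32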
Everything configured correctly (a first!),
That's awesome to hear. Everything is looking great so far.
but I managed to put the quant into / not /tmp. And then I moved it to ~ instead of /tmp. Smooth operator :)
No problem. Luckily that is a relatively quick error to fix. Stupid mistakes like this happen to me as well if I'm distracted.
Doing the BF16 conversion with such limited resources was much harder than I thought. NFS can only be used within privileged containers, so I had to create a new one and mount spool into it. Then I had to set up and mount the NFS share and copy over all the BF16 scripts. Once all of this was set up, I tried to run fp8_cast_bf16.py just to realize it requires CUDA because it uses Triton. I then had to copy over and install the NVidia drivers and figure out how to give a privileged container GPU access, which is different than for an unprivileged one. I then tried running it with 12 GiB of RAM and immediately OOM-crashed the entire container, and thanks to the NFS share the container got stuck in kernel mode and didn't even want to stop/start anymore. Now I gave it 25 GiB RAM and it seems to be happy and luckily also doesn't make the GPU run out of GPU memory. I also underestimated how many resources an NFS server needs and only gave it 8 cores and 4 GiB RAM, so it now spends 75% of the CPU running system code, making things take a bit longer than expected. In any case, most importantly it works and it will eventually be done.
I will let the BF16 conversion finish overnight and then run the much simpler and hopefully faster conversions during tomorrow morning, so we can start DeepSeek-R1-Zero at lunch time if the conversion finishes by then. That way we again have a morning for all the imatrix computation tasks to be completed.
Besides that, I also somehow managed to get a now unkillable process stuck busy-waiting on /sys/bus/pci/drivers/nvidia/unbind due to forgetting that moving the RTX 3080 from StormPeak to CastlePeak caused the PCIe IDs to change. So should one of the RPC servers crash for whatever reason, one of the RTX 4090 GPUs might permanently disappear until I fix it.
wow, i feel your pain. and even more luckily, the imatrix calculation survived so far. and nice juggling!
but to be honest, for me, solving these kind of problems under resource constraints is the most fun. it's like a puzzle. hacking computers can be the same kind of fun, or was, before these pesky buffer overflows became the norm :-)
/tmp/DeepSeek-R1-Zero.Q8_0.gguf
r1 will probably be finished while I am still fast asleep, or close. then there will be a bit of time where some other imatrix quants can be done, and if the quant is there, i will relatively quickly be able to start r1-zero.
also, the bf16 => q8_0 conversion is likely going to be limited by I/O speed (maybe 40 minutes or so :)
/tmp/DeepSeek-R1-Zero.Q8_0.gguf
and all the RPC servers are updated to the latest llama.cpp and ready. nico1 is currently paused as it would otherwise run out of storage. I will resume it in around 15 minutes once the DeepSeek-R1-Zero source GGUF is done being moved to the network disk.
Morning :) Ok :)
Good morning! I resumed nico1 and everything is now ready for the DeepSeek-R1-Zero RPC imatrix computation.
DeepSeek-R1.not-override ?
(haha, cute, hf offers a translation for this post :)
DeepSeek-R1.not-override ?
nico1 was idle as it completed all important models. I then checked and there was enough RAM and storage available. I wanted DeepSeek-R1 quants to be computed so I can try them out tomorrow. Unfortunately the task was in a blocked/override state. I saw that there is a DeepSeek-R1.override file, so I thought that this file might be what causes it to be in this state and renamed it to DeepSeek-R1.not-override. Not sure if it worked or if you ended up unpausing it, as nothing happened for quite a while, but just when I wanted to write you about it, it started quantizing. The main reason I renamed instead of deleted it is so I can easily rename it back and you are not confused about why DeepSeek-R1 is no longer blocked.
The only issue with doing DeepSeek-R1 quants is that we cannot really tolerate upload failures so maybe we have to pause it again when I go to bed as then there is nobody to monitor and interfere should it run low on storage. Ideally we would have it always wait for the upload to be completed before starting with the new quant but that is likely too much effort to implement. Regarding what to do if an upload fails, I thought about "ctrl-a ctrl-s" to pause and "ctrl-a ctrl-q" to unpause to make it wait for the upload, but last time I tried this it didn't really work for me. I would obviously try that first and if it doesn't work heavily limit the CPU so it quantizes around 10 times slower.
What would be the correct way to make it stop after the current quant? Creating DeepSeek-R1.override or DeepSeek-R1.interrupt? Because I should probably do that before going to bed as I really don't want it to run out of storage when left unattended.
(haha, cute, hf offers a translation for this post :)
Haha nice. It tried using facebook/nllb-200-distilled-600M.
Well, it wasn't my plan, but good to see you learn the ropes :) Yes, the file is what puts the job into override mode, but the scheduler always has to run. Which happens from time to time when other jobs finish.
You can force it, until I provide the (as of yet mythical) llmc command, using echo push >/dev/tcp/10.28.1.1/16713. This would have more or less immediately started the job. You can also telnet 10.28.1.1 16732 to get the status daemon and press return to ask it for an update so you don't have to wait for the web page to update (q + return quits).
The only issue with doing DeepSeek-R1 quants is that we cannot really tolerate upload failures
The quantize script itself should pause when df reports less space than 1x the gguf or so, which should keep it from doing bad things if only one job is running that eats up space. But, yeah, who knows what will happen, and right now, the configured budget for nico is ~1.5TB more than normal, so it's good to keep an eye on it.
Update: QUANTDISKUSAGE=$(( $(stat -c%s -- "$SRCGGUF") * 60 / 100 ))
that should be 700G minimum free. The problem is that for very big jobs, I sometimes disable this check via touch /tmp/ignoredf, so that should be removed (I removed it, it was actually on nico1, and I am prone to forget about it).
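In case the intent isn't obvious, the guard boils down to roughly this (a sketch, not the actual quantize code; the mount point is an assumption):
QUANTDISKUSAGE=$(( $(stat -c%s -- "$SRCGGUF") * 60 / 100 ))
# before starting the next quant, wait until enough space is free (unless the override flag exists)
while [ "$(df -B1 --output=avail /tmp | tail -n1)" -lt "$QUANTDISKUSAGE" ] && [ ! -e /tmp/ignoredf ]; do
  sleep 60   # running uploads will eventually free space
done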
Ideally we would have it always wait for the upload to be completed before starting with the new quant but that is likely too much effort to implement.
It is implemented, but right now, the limit is configured to be 16 on nico1. We did run deepseek quantize before, with even less diskspace, and it worked, so I would not worry.
Regarding what to do if an upload fails, I thought about "ctrl-a ctrl-s" to pause and "ctrl-a ctrl-q" to unpause to make it
This will take effect (in screen) when the script tries to output something. As long as it is quiet, it will not hang. It will pause a running quantize, though.
In worst case, ctrl-c it.
And, in case you want to know, ctrl-c means the job will still "run" because it couldn't write an exit status. If that happens and you are sure the job doesn't run (use ils), or if the job failed and you want to restart it, delete /dev/shm/JOBNAME.log (logfile) and ...status (exit code). And then "push" the scheduler, and it will retry. You could practise some time in the future (preferably not on deepseek :-)
What would be the correct way to make it stop after the current quant?
Yes, create an .override file and an .interrupt file, it will check after every quant and the scheduler will remove the .interrupt file.
Thanks a lot for taking the time to provide me all this valuable information.
Well, it wasn't my plan
If you want to run another model feel free to do so but DeepSeek-R1 seems to have the highest priority according to your own metric.
but good to see you learn the ropes :)
I still plan to some day help you manage the queue, and for that I better get familiar with the system. I'm slowly starting to understand it.
Yes, the file is what puts the job into override mode, but the scheduler always has to run. Which happens from time to time when other jobs finish.
Great to know. I expected that this is why it was delayed because I remembered this mechanism. I think back then it triggered when something happened or at 07:00 in the morning, but if I remember correctly, we changed it to be like once an hour or if something happens back when you implemented the advanced nico1 electricity cost optimization.
You can force it, until I provide the (as of yet mythical) llmc command, using echo push >/dev/tcp/10.28.1.1/16713 This would have more or less immediately started the job.
Thanks. That will for sure turn out to be really useful.
You can also telnet 10.28.1.1 16732 to get the status daemon and press return to ask it for an update so you don't have to wait for the web page to update (q + return quits).
I still remember the telnet version of the webpage. The webpage updates relatively often, but it might still be useful to have even faster status updates.
q + return quits
That was one of the main reasons I barely used telnet status page. I didn't understand how to get out of it without closing the entire terminal.
QUANTDISKUSAGE=$(( $(stat -c%s -- "$SRCGGUF") * 60 / 100 ))
That is awesome. In that case we can just let it run overnight. Realistically, even without any limit there would have to be such massive upload failures that it is quite unlikely to happen, but this protection should make an out-of-space event almost impossible.
/tmp/ignoredf
I was wondering about that file earlier today. Thanks for explaining it.
It is implemented, but right now, the limit is configured to be 16 on nico1. We did run deepseek quantize before, with even less diskspace, and it worked, so I would not worry.
The upload limit is quite cool and I remember it from the past when I had terrible internet. No need to change that now thanks to the much better auto-pause during low-disk situations.
This will take effect (in screen) when the script tries to output something. As long as it is quiet, it will not hang. It will pause a running quantize, though.
Last time I attached to the quantization screen session and it seems to have just ignored the shortcut and kept outputting things. Maybe I tried using screen inside tmux or had some other strange setup that made it not work. I'll try again on a not-so-important model in the future.
In worst case, ctrl-c it.
That would be quite sad, but yes, it can be done in a worst-case scenario to prevent RPC imatrix from running out of space. But before that I can just put all cores but 2 offline to make the entire LXC container almost pause except networking, which would still be fast due to being handled by the kernel. Not sure if you ever realized, but I completely switched to changing the number of CPU cores to load-balance nico1 with any other CPU resources I might need on StormPeak. It seems to work much better than adjusting CPU limit or CPU units.
Yes, create an .override file and an .interrupt file, it will check after every quant and the scheduler will remove the .interrupt file.
I get it, so after every quant it blocks if there is a .override and is interrupted via .interrupt, so putting both will have the desired effect.
If you want to run another model feel free to do so but DeepSeek-R1 seems to have the highest priority according to your own metric.
I normally manually manage things when we force-schedule big models, but you made a right decision, according to the data you had. Even if it wasn't my decision I would be happy if you continue being more active like this, and I want to provide more tools so you can do so.
but if I remember correctly, we changed it to be like once an hour
At the moment (and for many months) it is purely event driven again, i.e. without anything "push"ing it, nothing will happen.
Unrelated: in recent days, I sometimes found jobs to be "idle", which can practically only happen when a push gets lost. The push is at the moment literally the echo I gave you - it was a quick hack, without error checking or retrying. But it worked fine, I wonder what happened recently.
That was one of the main reasons I barely used telnet status page. I didn't understand how to get out of it without closing the entire terminal.
q+exit is a relatively recent addition. Also, very few people remember the telnet escape (ctrl-altgr-] on german keyboards, then "close"+return). Can't say I ever used (unix) telnet for actual login, only as a simple tool to connect to a tcp port. You could use socat stdio: tcp:10.28.1.1:16713 or so I guess. Or netcat. But I always found telnet to be most convenient for such testing. (I used ncsa telnet on dos extensively, though..., which shows my age :)
The upload limit is quite cool and I remember it from the past when I had terrible internet.
It is pretty recent - I had some hack in the quantize script.
If I never explained that to you, the architecture is like this: llmjob is a perl script that copies itself to all hosts and manages the jobs. I don't think perl is your language of choice, so I won't recommend looking at it. Also, it's full of ad-hoc code. noquant and quantize phases are done by a bash script called "quantize" - I think you are quite good with posix sh. Not sure why I think all that, but it's the impression I got. I think all sources are on all machines, too, in case you ever need to look at it. There is also "imatrixjob-remote", which runs the imatrix jobs - logically, all imatrix jobs run on kaos/10.28.1.1, so you don't see much of it. It's basically a hacked copy of a hacked copy of llmjob.
I wouldn't design it like this if I would write it again, but the basic design is ok. And I think over the last year, it evolved quite a bit, and the oldest/most stable parts have been refactored into something nice. The scheduling and queuing algorithms are the most hacky atm.
Maybe I tried using screen inside tmux
You'd need to make sure all the keys are indeed sent through all layers. It's quite annoying. For screen-in-screen (a relatively common case for me), it's just ctrl-a ctrl-a s. Or killing an ssh in a screen in ssh would be "return ~ ~ ."
But it should be possible. You can practise by either setting it up yourself or waiting for some output after XOFF, and then seeing if a lot of output appears after XON. Because when done right, you should see the output continue, as if it were frozen, not continue as if it was buffered and continued in the background without you seeing it.
Well, that was not a good description...
You can also send stop signals. e.g. with "ikil -STOP ". We know it works because that's what the cronjobs on nico1 do at 17:00/22:00/07:00
I get it, so after every quant it blocks if there is a .override and is interrupted via .interrupt, so putting both will have the desired effect.
Uhm, functionally that's correct, but let me clarify this: quantize (the script that runs llama-quantize or convert-hf-to-gguf) will check for an .interrupt file and exit with a special exit code. It does not care about the .override file.
But the scheduler (llmjob, triggered from kaos) does care and will ignore jobs with .override when starting new ones (but continue managing running ones).
So if you'd set .interrupt alone, quantize would likely exit, then (if its the top job in the queue) immediately start again, and then it would have to wait for all uploads to finish first - which is incidentally on my list of things to optimize.
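So, as a sketch (this is the idea, not the actual script; the exit code number is made up):
# between quants: quantize only looks at .interrupt, the scheduler only looks at .override
for qt in $QUANT_TYPES; do
  if [ -e "$BASE/$MODEL.interrupt" ]; then
    exit 99   # special "interrupted" exit code; the scheduler removes the .interrupt file afterwards
  fi
  llama-quantize "$SRCGGUF" "$MODEL.$qt.gguf" "$qt"
done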
--
Anyway, what I actually came here to write was that I am going to sleep now, and very experimentally, when deepseek is done, almost everything should automatically return to normal, i.e. nico1 should start quantizing two jobs (eventually, the "push" event for this does not exist, so it will have to wait until something pushes it).
Or maybe everything will implode in various ways. But at least I am pretty sure the imatrix.dat will be safe :)
@mradermacher I think DeepSeek-R1 i1-Q2_K_S huggingface upload got stuck.
DeepSeek-R1 has been in run/imatrix 12/24,IQ1_M,waiting for prev hfu, df (hfu i1-Q2_K_S) since I woke up 3 hours ago.
stat 'DeepSeek-R1-i1-GGUF-DeepSeek-R1.i1-Q2_K_S.gguf*.log'
Access: 2025-01-24 11:45:01.247003144 +0100
Modify: 2025-01-24 06:08:14.583974792 +0100
Birth: 2025-01-24 05:49:14.126076567 +0100
The log also contains an interesting error:
(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 3aa3b504-5be6-4274-b3bf-f7837edcc6fb)')' thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/72/60/7260819ed2f1619e5a91bb148b0eff76fca17b11fa4502382724c6eb4ebc5bcd/df9da4c9f10d5956db5e2f928c411f7b847de87fd9c4722438b631d35438b32d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=<CENSORED>&X-Amz-Date=20250124T045302Z&X-Amz-Expires=86400&X-Amz-Signature=<CENSORED>&X-Amz-SignedHeaders=host&partNumber=62&uploadId=<CENSORED>&x-id=UploadPart
I see no upload bandwidth utilization even though nothing but i1-Q2_K_S is uploading, indicating it is likely not doing anything.
Now we know why ignoredf was set. My explanation was incomplete. Yes, it waits for all uploads when diskspace is < QUANTDISKUSAGE, but it also waits for the previous upload to finish when disk space is < QUANTDISKUSAGE * 4.
Meh.
Well, that overallocation saved my ass many times.
Now for the upload, I think there is a bug somewhere, such as not closing the other end of the pipe or so: the python upload process does not exist anymore but the parent is waiting for something (likely the python process). Shouldn't be the case, as it's a >>30 year old well tested library I am using for that, so it's probably something else.
I am pleased enough that nico1 correctly switched back to the normal job limits on its own.
Ah, no, python is still running. Right, I forgot that children of llmjob are not being tagged, so they don't show up in ils. I'll have to rectify this. Then it's probably that bug that sometimes happens where the huggingface libs simply print an error and then hang instead of reporting it to the caller. I'll investigate some more, but likely there is nothing I can do about it, I have to rely on python throwing an exception or returning from the call.
some python threads are waiting for these:
python3 1934007 root 7u IPv4 10146651 0t0 TCP mradermacher.nico.re:40512->server-13-224-102-227.zrh50.r.cloudfront.net:https (ESTABLISHED)
python3 1934007 root 9u IPv4 10628179 0t0 TCP mradermacher.nico.re:40530->server-13-224-102-227.zrh50.r.cloudfront.net:https (ESTABLISHED)
python3 1934007 root 11u IPv4 10472890 0t0 TCP mradermacher.nico.re:40520->server-13-224-102-227.zrh50.r.cloudfront.net:https (ESTABLISHED)
And I suspect that the remote end does not know about these connections anymore. Unfortunately, my little timeout wrapper is loaded:
python3 1934007 root mem REG 0,40 1842788 /llmjob/share/hfu-preload.so (path dev=0,101)
So let's see why that one doesn't trigger.
Ah right, I didn't realise python would use multiple threads, so my solution with alarm() is obviously broken. That, uhm, complicates things a "bit".
Anyway, imagine for some reason you wanted to just kill this upload and retry (which you wisely didn't so I can look at it), then you have options.
The code that waits in quantize is this:
iwait $PIDS || true
(quantize is one of those very few shell scripts I wrote that use set -e). But the || true means it is safe to kill without immediately causing havoc, and in this case, it is safe to kill the iwait child of quantize, and then quantize will simply continue as if the upload had finished, which is what I did. That would preserve the upload processes for inspection.
Or, what I will do now that I think I understand why my wrapper didn't help (ok, I'll still have to attach gdb to see if I am likely right :) is I will kill the python subprocess that should be part of ils output but isn't yet, or the "llmjob hf-upload-folder" caller. That will cause the upload to fail, but the parent process (the hfu wrapper that started as a single line....) will retry.
PS: If you don't particularly enjoy reading through these thoughts, I will not be sad if you say so and I will be shorter next time. I suspect it does help me to document these things to somebody involved, though :)
Alternatively, maybe I could just enable tcp keepalive on connect(). That would be a much more sexy solution than calling poll() before every read... Hmm....
Update: better yet, let's do it at socket() time, then I don't even have to check that it is a tcp socket.
Update 2: Exciting. I've never configured keepalive parameters programmatically.
Update 3: even more interesting, keepalive enabling is a generic socket option, not a tcp-layer specific one. Only the actual parameters are tcp-specific. Are there any other protocols that even implement this?
let's see if this works better:
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [5], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPCNT, [20], 4) = 0
So, the above method works in the sense that it shuts down connections (whether due to keepalive or not, I can't tell), but python is still hanging, because the other side does not close the connection. It's actually quite interesting. Clearly, the other side is not interested in replying. Could be a very misconfigured firewall on the cloudflare side (cloudflare needs to die urgently) - the remote host is pingable and connectable on port 443, so I suspect it's typical cloudflare brokenness and shit all over the internet-ness.
I must admit I am not sure why keepalive doesn't completely kill the connection(s) here - either it's not enabled (but it seems to get enabled when I strace python3), or keepalive only shuts down the sending side, which doesn't make much sense either.
23:08:59.266571 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247150262 ecr 1588015898], length 0
23:08:59.267570 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247150263 ecr 1924325188], length 0
23:08:59.474572 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247150470 ecr 1588015898], length 0
23:08:59.475572 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247150471 ecr 1924325188], length 0
23:08:59.883573 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247150879 ecr 1924325188], length 0
23:08:59.891577 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247150887 ecr 1588015898], length 0
23:09:00.707580 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247151703 ecr 1924325188], length 0
23:09:00.771573 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247151767 ecr 1588015898], length 0
23:09:02.371575 eth0 Out IP 192.168.2.108.52586 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 60, options [nop,nop,TS val 2247153367 ecr 1924325188], length 0
23:09:02.435573 eth0 Out IP 192.168.2.108.60772 > 18.165.181.142.443: Flags [F.], seq 0, ack 1, win 25, options [nop,nop,TS val 2247153431 ecr 1588015898], length 0
I think tcp keepalive wasn't enabled - python creates some sockets with a proprietary linux extension: socket(..., SOCK_STREAM | SOCK_CLOEXEC, ...) or other flags, and what's worse, there is no portable way to detect this, as the mask for the actual type is not exposed to userspace (being non-posix). Grm.
This turns out way more complicated than initially expected.
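For what it's worth, whether keepalive actually got enabled on the hanging connections can be checked from the outside (a sketch; ss is from iproute2, the address is the one from the tcpdump above):
# keepalive-enabled sockets show a timer:(keepalive,...) entry in the -o output
ss -tnpo dst 18.165.181.142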
As a sidenote, it seems uploads have to wait after every "chunk" for an ack from the other side (probably what aws requires, never looked into that). Very interesting. At least it is a learning opportunity.
trying to pause rich1
rich1 ~# ./rich1-pause
./rich1-pause: connect: Connection timed out
./rich1-pause: line 7: /dev/tcp/10.28.1.1/16713: Connection timed out
[Exit 1]
Interesting. Well, it seems that https://hf.tst.eu/status.html is showing it as unreachable. Can I reboot the server? I need to do a few things before it can run again
@richarderkhov A working network is required for the whole thing to work, yeah (neither wireguard nor ssh work) (I wonder why it keeps failing). In emergencies you can reboot any time, of course, I just have to clean up, so make it count please :)
@nicoboss you can copy the r1-zero gguf to /tmp/quant and, if you want, remove the override file and push.
Oh, it's already being copied :)
Quick status update regarding rich1. Richard decided to install Mail-in-a-Box on the host but missed that "Technically, Mail-in-a-Box turns a fresh cloud computer into a working mail server" was meant literally: instead of installing a mail server on top of the existing OS, it replaces the OS to turn the machine into a mail server, in a process that cannot be undone. We spent hours trying to save the host but it is beyond saving, so we will have to reformat it. This time with Debian 12 with Proxmox. rich1 should be available again tomorrow once the host is properly set up again.
The rich1 LXC container is on a separate disk and so, aside from the extended downtime, should be unaffected. It seems all quants queued to rich1 were completed and uploaded before he stopped it, as the tmp folder seems empty. We made a remote backup of the rich1 container just in case.
Oh, it's already being copied :)
I started the copy as the first thing in the morning, but it took 4 hours, so it only finished in the early afternoon. NFS only reached 600 Mbit/s, and this despite source and destination disks being SSDs on the same server. The only reason I had to use NFS is because the SSD was assigned to a VM.
I already regret having dismantled the RPC setup, as llama.cpp support for more massive awesome models will likely come soon: https://github.com/ggerganov/llama.cpp/issues/11290
wow, lots of, eh, mixed news :)
an empty /tmp folder on rich1 would be surprising, but we'll see what's going on when it's back up. shit happens :)
wow, never heard of minimax. but let's face it, if 4xxB models become commonplace, it might be prudent to use Q8_0 for imatrix. I don't have an issue with that.
Regarding rich1, we successfully installed Proxmox on it today. I unfortunately caused an IP conflict while setting up OpenWrt minutes after he went to bed, so I currently have to wait for him to use iKVM to fix this. I'm confident we can get rich1 working again tomorrow.
Regarding the reason why nico1 is currently offline: My ISP decided to do maintenance today from 01:00 to 06:00 and on the 3rd of February from 05:00 to 06:00. I wasn't aware of it and spent quite a while diagnosing the issue because they had not put it on their website, but I then figured it out on the website of their upstream ISP. They usually inform me weeks in advance, but it could be that I missed that.
nico1 is currently reasoning finetuning DeepSeek-R1-Distill-Llama-70B-Uncensored. This is scheduled to take almost a day, but I will probably interrupt it at 0.5 epochs to not block imatrix quants for too long. I wanted to test auto_resume_from_checkpoints for the first time anyway. It also happened to be such good timing with the internet outage.
wow, never heard of minimax. but let's face it, if 4xxB models become commonplace, it might be prudent to use Q8_0 for imatrix. I don't have an issue with that.
minimax is a completely new base model and so probably warrants the effort of doing it in 16-bit, even if it realistically will barely make a difference. The minimax model is extremely good, getting close to the much larger DeepSeek-V3. Likely because, while smaller in the sense of total parameters, it has more active parameters.
It suddenly felt so lonely... :)
That's a long maintenance interval, but it happens.
minimax is a completely new base model
So... you do kind of agree :) I don't expect minimax to suddenly become popular for fine-tunes, though, and I don't expect many finetunes of llama-405b either.
nico1 is currently reasoning finetuning DeepSeek-R1-Distill-Llama-70B-Uncensored.
btw., you could, if you wanted, let it quantize (if it doesn't do that already, most likely it will work on r1-zero) - if it stops, you could edit /llmjob/share/bin/llmjob, find this line:
} elsif ($cmd eq "slave-scheduler") {
and replace the rich1 a few lines below that with nico1:
if ($HOSTNAME eq "rich1") {
Then "llmjob slave-scheduler" will run the scheduler locally, which is currently disabled everywhere except on rich1.
I tell you not so much because I really want you to do that, but more to trickle knowledge about the internal workings to you. llmjob slave-scheduler is invoked at the end of every job, and because of some bug I am hunting, it only tries to locally schedule jobs on rich1, not anywhere else. And oh my, it still uses bash to actually send a push to the scheduler afterwards - why did I look at that code.
The file will be overwritten automatically the next time kaos contacts rich1 (it replaces itself, so that only works if it's actually compiling, though).
In other news, I have a good lead on the weird job scheduling failures I have seen in the last month.
rich1 is alive again! I recommend to check if everything with it is fine and no work got lost. I forwarded TCP port 2222 for SSH and UDP port 7103 for WireGuard. rich1 now uses a similar Proxmox with OpenWrt router setup as nico1.
Since rich1 is online I see a lot of error/12 errors:
ram budget 490 use 0
0 ? Reasoning-Llama-3.1-CoT-RE1 error/255 (from rain)
0 ? Llama-3-Yollisa-SCE error/12 (from rich1)
0 ? SauerHuatuoSkywork-o1-Llama-3.1-8B error/12 (from rich1)
0 ? Janus-1.3B-LM error/12 (from rich1)
0 ? SJT-2.1B error/12 (from rich1)
0 ? Qwen2.5-7B-Instruct-1M-abliterated error/12 (from rich1)
0 ? Taurus-Opus-7B error/12 (from rich1)
0 ? DeepSeek-R1-Distill-Qwen-7B-RRP-Ex error/12 (from rich1)
0 ? SJT-990M error/12 (from rich1)
0 ? Qwen2.5-7B-RRP-ID error/12 (from rich1)
rich1 also has quite a few likely model-related errors:
rich1 nice size (static/imatrix) -- free 1219 budget 1057 uploads 0 hfd 1
0 17 si Llama-3-Yollisa-SCE-TopK_0.45 error/2 converting...
0 2 si ChainBlind-HadithIsnadParser-AraT5 error/1 missing spiece.model
0 2 si ChainAware-HadithIsnadParser-AraT5 error/1 missing spiece.model
0 2 si ChainBlind-HadithIsnadParser-withPrefix-AraT5 error/1 missing spiece.model
0 16 s Zurich-7b-GCv2-5m error/2 converting...
The reason for these errors is that most files from /tmp are gone. Also, I can't log in to rich1 normally (connection refused) - did the IP address or port change?
Something semi-catastrophic must have happened on rich1.
I wasn't there when it came back, so I am not 100% sure what the state was, but it is a bit fishy that all big jobs are missing. I wonder if the job queue was deleted as well. That means an unknown number of jobs have been lost, a lot of 70Bs as well.
-rw-rw-rw- 1 root root 5.2k Jan 29 02:09 backup_rich1_meta.cbor
-rw------- 1 root root 141k Dec 28 00:14 backup_rich1_meta.cbor.x
almost certainly. the .x file is a copy i made while rich1 was down. sigh, now I have to somehow extract the jobs from there.
Anyway, until the network is fixed, nothing much can be done about all this. I suspect the port forwardings are missing.
Indeed, around midnight on the 28th, files in /etc/ were deleted, causing /tmp to be wiped on the next boot.
Any idea what else in the vm might have been changed? I'd rather start from debian than work with a partially corrupted vm with such surprises.
There are other changed files in etc, most benign (network/interfaces, hosts). But what would delete /etc/tmpfiles.d/tmp.conf, and why?
Just in case it ever comes up, I chattr +i'd tmp.conf, because if that file is removed, that's quite disastrous, as I don't normally have a backup.
If I can trust the mtime, then the only obvious change outside of /usr is the tmpfiles.d/tmp.conf deletion.
resolv.conf also changed weirdly:
search example.com
nameserver 1.1.1.1
Is this the intended resolv.conf?
Also, it seems I have 500GB less space - du says 694GB, df says 1192GB in use.
@nicoboss to summarize, so you don't have to read through my debug stuff:
- ssh (2222 => 22) and (more importantly) wireguard (7103 => 7103) forwardings are missing. The latter is required for nico1 to get a reliable connection to rich1.
- on the 28th at 00:02 (likely), /etc/tmpfiles.d/tmp.conf was removed, causing /tmp to be deleted, which caused the loss of all models and jobs. I was able to restore most jobs with some work from a backup, but I don't always have a backup. It is important to find out what happened so it can be prevented in the future.
- about 500GB of disk space seems to be missing, causing jobs to fail.
Update: I tried a quick hack to regain connectivity, but somehow failed, so I think I now need ssh to be able to fix it.
ssh (2222 => 22) and (more importantly) wireguard (7103 => 7103) forwardings are missing. The latter is required for nico1 to get a reliable connection to rich1.
This is fixed now. Sorry for the networking issues. ifupdown wasn't installed as it wasn't required with the old networking setup, so the /etc/network/interfaces set by the Proxmox host got ignored. It instead used systemd-networkd, which resulted in it getting a random IP over DHCP, breaking the port forwarding rules pointing to 192.168.1.101. I have now installed ifupdown and enabled the networking service, so this shouldn't happen again.
on the 28th 00:02 (likely), /etc/tmpfiles.d/tmp.conf was removed, causing /tmp to be deleted, which causes a loss of all models and jobs. i was able to restore most jobs with some work from a backup, but I don't always have a backup. it is important to find out what happened so it can be prevented in the future.
No idea who or what deleted this config. /tmp was empty after Richard stopped the container on the 26th of January. I don't think it will happen again as we are now using Proxmox to manage the container instead of LXC. Very unfortunate that we lost the entirety of /tmp.
resolv.conf also changed weirdly
That makes sense as Proxmox is injecting its own network configuration into LXC containers so nothing to worry about.
If I can trust the mtime, then the only obvious change outside of /usr is the tmpfiles.d/tmp.conf deletion.
You should be able to trust it as the container is still pointing to the same rootfs folder on the same disk. We didn't copy or move the container at all.
I'd rather start from debian than work with a partially corrupted vm with such surprises.
If you want to start fresh just let me know and I can easily give you a new container. It takes 1 minute for me to create a new one and doing so would be cleaner.
Also, it seems I have 500GB less space - du says 694GB, df says 1192GB in use.
This is because the same disk contains a backup of Richard's website and rich1, just in case. The rich1 backup was unfortunately made when /tmp was already gone. I will delete the backups as soon as Richard confirms I can do so.
It all looks fine from my side, thanks for your work. The chattr +i should prevent accidental deletion in the future, but it is very weird. I could chalk it up to my script messing up and forgetting about it, but then it would have happened on previous reboots, and the directory had an mtime of when it was down. Very strange.
Also, it seems I have 500GB less space - du says 694GB, df says 1192GB in use.
I deleted the backups half an hour ago so all the storage should now be available for you to use again.
It all looks fine from my side, thanks for your work.
Thanks. Great to finally see rich1 working again.
I could chalk it up to my script messing up and forgetting about it, but then it would have happened on previous reboots
I don't think we ever rebooted rich1 after we had to reinstall it after the LXC corruption incident.
I don't think we ever rebooted rich1 after we had to reinstall it after the LXC corruption incident.
No, but rich1 and the vm rebooted multiple times before, and once after, and the only times that file was created were when I initially ran my script to configure wireguard and other stuff (i.e. twice only). I can only imagine some script went around and either deleted all 0-size files or any file starting with tmp.* - just very weird. But who knows, maybe whatever script was run to essentially destroy rich1 also ran a find over the whole disk.
The only evidence is that something mucked with that directory on jan 28th, so it's unlikely to have been something that happened before. I was lucky that I made a copy of the queue just in case when it went down, otherwise restoring the jobs would be... difficult.
Thanks. Great to finally see rich1 working again.
Yeah, I was getting a bit desperate - nico1 much less than 50% usable for weeks, rich1 gone, and an unprecedented number of models, and big ones, too (I mean 70B..130B, not deepseek), made for very tense moments. All in all, it's making good progress despite everything, and we even made a tiny bit of progress on the nice 1000+ models.
Why does it say:
0 66 si Virtuoso-Medium-v2 error/255 repo create
The repository clearly exists under https://huggingface.co/mradermacher/Virtuoso-Medium-v2-GGUF - it is supposed to do static quants to that repo, as the status shows si.
Edit: Now that the imatrix is done it shows sI as status but is still stuck at error/255 repo create. Luckily it just skips this task and works on other tasks in the meantime.
Edit: Ah nice, it either fixed itself or you manually fixed it. In any case the model is now getting quantized.
Last night and also this morning HF had enormous timeout problems. Everything was affected, including web page loading. It's not fully fixed yet, but much better. I need to manually retry when it fails at this step.
Ah, and yes, if "s" is in the flags, it will never try imatrix quanting first.
Oh, and btw., hetzner sometimes has good offers, which might or might not be something to consider for richard, if he actually pays €250/month. Can't see an obvious candidate, but didn't look long, and the offers change considerably over time, e.g.
https://www.hetzner.com/sb/#price_from=180&price_to=250&cpuType=AMD&search=threadripper
All of these are a bit faster than his box, and cheaper, afaics.
@mradermacher
For a few days I can no longer pause GPUs using echo pause GPU-188a5143-db69-7058-63b5-f2f1d2354f91 >/dev/tcp/10.28.1.1/16713 or use the nico1-pause script. There is no error, but nothing happens on the scheduler when I do so. I just ran the nico1-pause script and it correctly interrupted all the tasks, but the status page doesn't show that nico1 is paused and tasks still run. The same goes for GPU pausing: the pauses don't appear on the status page and just get ignored, but there I can at least still use the /tmp/pause flag, whereas for nico1-pause there is no alternative for me to use.
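For reference, here is the same idea as a tiny Python sketch (purely illustrative - it assumes the scheduler on 10.28.1.1:16713 still accepts newline-terminated plain-text commands, which is what the echo one-liner above sends):
import socket

# send a one-line command to the scheduler port, like the /dev/tcp echo does
def send_scheduler_command(cmd: str, host: str = "10.28.1.1", port: int = 16713) -> None:
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall((cmd + "\n").encode())

send_scheduler_command("pause GPU-188a5143-db69-7058-63b5-f2f1d2354f91")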
I'll look into it - I replaced the other side of that echo with something else so we can use it to queue and manage jobs, so obviously, it must be subtly broken.
both should work again - I also paused nico1 and removed the override files (which I hope was your intent).
boy am I relieved that we are finally, finally, through all the big model backlog. in fact, the queue is free from all the original nice 1000 models.
Nice. Queue every other model on huggingface so we know how much is left lol
@RichardErkhov about 40000 of them or so. you can see for yourself soon, for real.
and totally off-topic, i just looked at the safetensors format, and i quite like it (other than being somewhat sloppily defined). but there seems to be no concern for alignment of any kind - according to the spec, tensors cannot be aligned to their data size.
In case you wonder what caused the 9-hour internet outage this morning: Threadripper is the node hosting the OpenWrt router. The PSU of Threadripper caught fire while I was asleep and burned until it shorted the high-voltage part, which tripped the breakers. This morning the entire house was filled with the smoke of burned electronics and plastic, and it still smells absolutely terrible. That wasn't a cheap PSU at all - it was a be quiet! Dark Power Pro which back then cost around $500, so definitely a high-end PSU, and it lasted 8 years of 24/7 use before it catastrophically failed.
Regarding the damages: the PSU which started the fire is completely dead as multiple components inside melted together. The RAM and RTX 2070 GPU might be dead as Threadripper no longer boots if any of them are plugged in, but further evaluation is required. The mainboard might be partially damaged as it only works when the PC is lying on the floor, constantly gets stuck either detecting RAM or during hardware initialization, and only works in safe mode. When I checked journalctl during the PSU fire event I saw multiple HDDs reporting that they overheated despite being idle, so I will have to check if they are fine as well.
Threadripper has now been online again for 10 hours and looks somewhat stable. I temporarily gave it the PSU from StormPeak, some older RAM (2x16 GB) and an old GTX 770 GPU, and have it lying flat on the floor, which somehow got it to boot, but no idea how long that will last. Tomorrow evening I will start investigating what hardware is still alive. I will likely do so by putting the parts in a different PC, so disruptions to nico1 should be minimal. I also enabled replication of the OpenWrt router VM to all the other nodes, so should it break completely I could use CastlePeak as the new router.
uhm, oh, wow. I assumed you wanted to switch gpus or so (because nico1 was paused), but the duration indeed made me wonder.
be quiet! are not cheap, but very low quality, IMnsHO. Shouldn't catch fire, though. But it is just china crap. That's mostly my unadulterated opinion - I lost all four be quiet! PSUs I have had in my life with (small) fireworks, but no fire. It all made sense when I saw reviews opening them up and finding boards marked 150W in 300W PSUs. Not that this is any use to you... it can happen with any PSU.
However, I can understand your situation - I once lost my apartment because an SGI Indy caught fire due to a faulty PSU - a few months before SGI published an advisory on that issue. Despite being off! With the knocker switch! It destroyed my whole library (hundreds of books), left mm-thick soot on all walls, and of course my beloved HP workstation, my monitor, and PC. It was quite traumatic. Not as traumatic as thieves then stealing most of what survived from the open apartment, and the police refusing to watch the security cameras to find out who did it. That was absolutely traumatic. Had to move out for months. Fortunately, my cat wasn't there at the time... Oh, and my data survived. But I wasn't back online as quickly as you were...
All considered, while this sucks, you should consider yourself super lucky - much worse things have happened.
As for HDDs, I have more than a dozen SMART-failing hard disks that had considerable overtemperature at one time. The ones that fail, however, never announce it before it happens. But maybe they have been cooked at >>100°C... they will not like that. I'd be more concerned about any soot, for non-helium drives. That can cause problems to develop over weeks or months, from my own experience. But there is little you can do other than replace or back up.
Also, your priorities... nico1 was back up in record time, and I trust you have your priorities set correctly. Still, quanting isn't that important compared to, you know, your home...

@RichardErkhov I finally have a list for you.
One reason it took so long is that I lost confidence in the meaning of the list and the selection criteria - it sounded so good when I proposed it, now I am not so sure it is of much use. And it was rather more work to extract than I thought.
But it is on rich1:/202402-202501.txt
It's a list of urls, sorted by, uhm, importance (more "important" creators are listed first).
It is essentially the list of models I have been shown, minus the ones I tried to quantize. Many of them do not exist anymore, some have been renamed and/or recreated, in which case both URLs are listed.
It is highly filtered, according to way too many criteria to list, but the important ones are:
- models should have a chance of being complete (e.g. they should have a config.json file and some tensor files)
- they are not by "obvious" datahoarder uploaders, or known assholes
- they are (on paper) supported by convert_hf_to_gguf.py
- they do not look like obvious quantized models (but still many are)
The list contains the 11 months from feb 2024 till end of 2024 and contains 22235 urls.
This should give you a list with a high concentration of models that are not obvious crap, but that were skipped by me, mostly for not having a nice enough name.
Feedback appreciated.
Oh wow, thank you @mradermacher ! I will check it later, I have school right now. Scary and sorry to hear about your library.
@RichardErkhov my library burned 20 years ago. nicos house almost burned yesterday!
I assumed you wanted to switch gpus or so (because nico1 was paused)
I only paused the GPUs because I ran reasoning finetunes overnight and did not pause the nico1 host. Shortly after losing Threadripper the quant tasks unfortunately stopped due to a lack of internet despite there being enough storage - I assume because local scheduling is currently disabled. But maybe that was for the better: even with this small backlog, once internet was restored it started using so many connections that it maxed out the 10 Gbit/s connection, making the internet unusably slow for everyone else for about 10 minutes.
be quiet are not cheap, but very low quality
It all made sense when I saw reviews opening them up and finding boards marked 150W in 300W psus.
What shady business practices. Do you have any recommendations for a good PSU that is at least 1200 Watt? I now obviously need a new one.
All considered, while this sucks, you should consider yourself super lucky - much worse things have happened.
I for sure was extremely lucky. The smoke was quite dangerous, and the entire house could have burned down.
Despite being off! With the knocker switch!
How is that even possible? I thought the switch physically disconnects the power.
Also, your priorities... nico1 was back up in record time, and I trust you have your priorities set correctly. Still, quanting isn't that important compared to, you know, your home...
Getting Threadripper working again was not just a priority because of nico1. It hosts my router, and having no internet is quite a big issue considering that I'm working from home most days. But I agree it probably wasn't the smartest idea to carry it out of a smoke-filled basement and spend the entire day working on it outdoors. Then again, realistically there was not much else I could have done in the meantime while waiting for the smoke to get blown out of the basement windows.
started using so many connections maxing out the 10 Gbit/s connection
You could easily limit the download and upload speed to 5 Gbit/s or so. I don't think that would be a limit in 99.9% of cases.
I only paused the GPUs because I ran reasoning finetunes overnight and did not pause the nico1 host.
Ah, right. It just happened to have no jobs running. And yes, just pausing job starting is not the same as pausing the whole host, I noticed that. It was quite unusual for you.
Do you have any recommendations for a good PSU that is at least 1200 Watt? I now obviously need a new one.
Not really. I have had good experiences with Zippy EMACS - but can't say anything about recent builds, because my almost 20-year-old ones still work. They are industrial though, so not generally great efficiency and certainly not quiet. And very, very big, causing issues for many cases.
The ones I use (for efficiency) are these: https://seasonic.com/prime-tx/
Can't really say they have good (or bad) quality, but they have some good attention to detail (such as printing codes on their cables and telling you which cables can be interchanged between power supplies). And they can be relatively quiet.
If you have too much time on your hands, you can always try to consult Tom's Hardware. At least in the past they made relatively good tests, checking protection circuits, opening the case and seeing what's actually inside in many cases. But it varies between reviews, and I haven't checked them in the last few years. Example: https://www.tomshardware.com/reviews/seasonic-prime-850w-titanium-psu,4761.html And you need to check the individual reviews, not their guide, I think.
Also: https://www.tomshardware.com/news/how-we-test
How is that even possible? I thought the switch physically disconnects the power.
I think (if I remember correctly, it's been decades) it was some kind of isolation issue. These things are also not PCs (https://en.wikipedia.org/wiki/SGI_Indy), so they had custom power supplies that were wired god knows how. And since it only had a big hole where the power supply formerly was, there was no chance to find out.
while waiting for the smoke to get blown out the basement windows.
Quite the existential shock I imagine.
there is now a command llmc in the path (or directly as /llmjob/share/bin/llmc). It has a help subcommand (llmc help) that lists what is possible. It doesn't allow anything exciting, it's basically a frontend to the port 16713 hacks. But as soon as I can, I will allow submitting/removing/etc. jobs. It should work on both nico2 and rich1.
imatrix tasks no longer seem to use GPUs, with error "ggml_cuda_init: failed to initialize CUDA: OS call failed or operation not supported on this OS", despite nvidia-smi still showing them.
Yes, I was experimenting with putting various steps into safer containers, since I realised that convert-hf-to-gguf.py will happily run code that comes with the repo. somehow the "you need to enable code execution" message I sometimes got made me think it wouldn't do so by default. The environment was a bit restricted before, but not in any way secure, which hopefully has improved now.
That didn't work so well with cuda, and in the end, I decided not to put llama-imatrix into a container, trusting llama.cpp developers to... ah well, one step at a time.
As a side effect, llmc now has a shell command that allows you to look at models on any machine:
llmc shell rain
Hopefully I didn't misconfigure anything - feel free to probe its security boundaries and notify me when I forgot something important. It should allow you to basically use normal shell commands, vi and so on.
I also exposed another interactive command, "llmc audit", which asks what to do with failed jobs. You can have a look by running "llmc audit" when there are red llmjobs (not imatrix). pressing enter at every prompt is the safe/no-op choice. Only caveat is that while llmc audit runs and waits for your input, the scheduler hangs globally, so don't let it idle and forget about it :)
I decided that instead of doing the sane "use ssh" (and deal with making that secure, which is not easy), I wanted to see how easy it would be to roll my own telnet-like interface. Total overdesign, and some programs misbehave as I simply switch into raw tty mode blindly, but it was fun and is quite usable, for a modest amount of code (<100 loc). Always wanted to see how much work that actually is.
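For flavour, a minimal sketch of such a raw-mode byte pump in Python (purely illustrative - this is not the actual llmc code, and host/port are placeholders):
import os, select, socket, sys, termios, tty

def raw_remote_shell(host, port):
    # connect, switch the local terminal to raw mode, and shovel bytes both ways
    sock = socket.create_connection((host, port))
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    tty.setraw(fd)  # "blindly" switch to raw mode, as described above
    try:
        while True:
            readable, _, _ = select.select([sock, fd], [], [])
            if sock in readable:
                data = sock.recv(4096)
                if not data:
                    break  # remote side closed the connection
                os.write(sys.stdout.fileno(), data)
            if fd in readable:
                sock.sendall(os.read(fd, 4096))
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)  # always restore the tty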
Most jobs with nicelevel 11..15, btw., will fail, but I know how to fix most of them - those are re-queued older models with a specific tokenizer problem.
Adding models should now be possible, in theory. Try it out :) llmc help shows this:
llmc add [force] <nice> <type> <url> <token>...
queue a job with the given numerical nice level, type (s, si or i)
and hf-url. jobs are only queued, and only get submitted at the next push.
as long as a job is merely queued, it can be modified by re-running add.
refuses if the model was submitted before, in which case "force" can be
prepended.
nice levels are typically: -2000 for user-requested models, 0 for
daily/new models and 1000 for low-priority background models.
extra tokens can be:
worker <host>,<host>,...
limit the model to run on the given comma-separated list of hosts
examples:
llmc add -2000 si https://huggingface.co/TheDrummer/Anubis-Pro-105B-v1 worker nico1,rich1
Hope that this is clear enough. If not, just ask. Feel free to queue whatever you like (well, start slow). If it's a user request, just drop a note in the discussion and feel free to close it (and feel free to log in as mradermacher).
You need to llmc push or wait for the scheduler to pick up new models after adding.
Another new command: push-model - forces a push regardless of host limits:
llmc push-model nico1 chinese-alpaca-33b-merged
llmc push-model nico1,rich1 Anubis-Pro-105B-v1 Zurich-7B-GCv2-50k ...
Also, it seems my filtering excluded most models from even being shown (I filtered on library = transformers), which affects both daily models as well as historical ones. 86409 new models found - one for every second of the day, and then some.
Sorry for the late response. This was such a crazy week. My PSU going up in flames was just the start. You surely noticed how on Monday evening nico1 unexpectedly turned off and then was down from the entire evening until Tuesday 03:00am. On Monday evening the water pump case of my outdoor water-cooling loop exploded, draining out a lot of cooling fluid and making the pump run dry so it could no longer pump. This caused steam to build up in the cooling system, which then destroyed the CPU water block of StormPeak, causing cooling fluid to leak over my RAM and mainboard until StormPeak overheated and turned itself off.
Getting StormPeak working again after this incident was really difficult, which is why it unfortunately took me around 8 hours to get nico1 running again. The main difficulty was that without a pump I have no water-cooling for my CPU and my GPUs, and it is not possible to air-cool the RTX 4090 GPUs. To get CPU cooling working again I was able to use the AiO cooler from CastlePeak, as that node was inoperative anyway due to donating its PSU to Threadripper. For the RTX 4090 GPUs I inevitably had to get the water pump tight and working again. I used a lot of waterproof tape and plumbing hemp fibers to get it somewhat tight. During the next few days I constantly had to top up cooling fluid as it was still leaking quite a lot, but at least I was able to keep nico1 running. I also reduced the clock speed to 3.5 GHz as the AiO cooler cannot otherwise handle the load. On Wednesday I finally managed to obtain watertight superglue to glue together all the acrylic shards of my pump, making it hopefully watertight until the new pump arrives.
I already ordered a new PSU, pump and CPU water block. I will likely install them on Monday evening. In case you wonder, I haven't touched Threadripper since it finally booted last weekend and it is still lying on the floor, somehow working.
The ones I use (for efficiency) are these: https://seasonic.com/prime-tx/
Thanks a lot for your recommendation. I ordered a Seasonic PRIME PX-2200 (2200W, ATX 3.1) as it looks like a very high-quality PSU to me and could handle my CastlePeak including 3 high-end GPUs.
there is now a command llmc in the path (or directly as /llmjob/share/bin/llmc). It has a help subcommand (llmc help) that lists what is possible. It doesn't allow anything exciting, it's basically a frontend to the port 16713 hacks. But as soon as I can, I will allow submitting/removing/etc. jobs. It should work on both nico2 and rich1.
Thanks a lot! This is awesome.
I was experimenting with putting various steps into safer containers
I realised that convert-hf-to-gguf.py will happily run code that comes with the repo. somehow the "you need to enable code execution" message I sometimes got made me think it wouldn't do so by default. The environment was a bit restricted before, but not in any way secure, which hopefully has improved now.
Great to hear. More security is always nice.
That didn't work so well with cuda, and in the end, I decided not to put llama-imatrix into a container, trusting llama.cpp developers to... ah well, one step at a time.
Should be fine. As soon as we are dealing with GGUFs there should not really be any meaningful security risk. One would need to craft some crazy ROP-chain buffer-overflow exploit bypassing all the modern security features like stack cookies, and do so in a way where it survives the HF-to-GGUF conversion. That is just not going to happen, and even if it did, nico1 is an unprivileged, highly monitored LXC container.
As a side effect, llmc now has a shell command that allows you to look at models on any machine: llmc shell rain
Amazing. Thanks a lot for that. I will soon give it a try.
Hopefully I didn't misconfigure anything - feel free to probe its security boundaries and notify me when I forgot something important. It should allow you to basically use normal shell commands, vi and so on.
I will look into that if I ever have time for some fun searching for sandbox escapes. I did similar things in the past and have already escaped misconfigured docker containers.
I also exposed another interactive command, "llmc audit", which asks what to do with failed jobs. You can have a look by running "llmc audit" when there are red llmjobs (not imatrix). pressing enter at every prompt is the safe/no-op choice. Only caveat is that while llmc audit runs and waits for your input, the scheduler hangs globally, so don't let it idle and forget about it :)
Great to know. That will be really useful. I will for sure give it a try soon.
I decided that instead of doing the sane "use ssh" (and deal with making that secure, which is not easy), I wanted to see how easy it would be to roll my own telnet-like interface. Total overdesign, and some programs misbehave as I simply switch into raw tty mode blindly, but it was fun and is quite usable, for a modest amount of code (<100 loc). Always wanted to see how much work that actually is.
Nice choice. Making SSH secure is difficult unless you basically just let it call a specific script without arguments depending on the keys used, so I like your creative alternative.
Most jobs with nicelevel 11..15, btw., will fail, but I know how to fix most of them - those are re-queued older models with a specific tokenizer problem.
Good to know
Adding models should now be possible, in theory. Try it out :)
I will try tomorrow and let you know if it worked.
Hope that this is clear enough. If not, just ask. Feel free to queue whatever you like (well, start slow). If it's a user request, just drop a note in the discussion and feel free to close it (and feel free to log in as mradermacher).
You need to llmc push or wait for the scheduler to pick up new models after adding.
Seems all clear to me.
Another new command: push-model - forces a push regardless of host limits:
Ah nice, so that's how I can push them to a specific node. I might use that for my own models, as there I can already put the source GGUF on nico1, so pushing the model to it would make sense.
Also, it seems my filtering excluded most models from even being shown (I filtered on library = transformers), which affects both daily models as well as historical ones. 86409 new models found - one for every second of the day, and then some.
Oh wow that’s insane.
Sorry for the late response. This was such a crazy week.
No problem, I was quite expecting there to be an aftermath. Don't feel compelled to report when you are busy. I noticed the downtimes, but knew there was a good reason for it, and didn't want to impose myself on you with more stress.
On Monday evening the water pump case of my outdoor water-cooling loop exploded
exploded!
This caused steam to build up in the cooling system, which then destroyed the CPU water block of StormPeak, causing cooling fluid to leak over my RAM and mainboard until StormPeak overheated and turned itself off.
That's indeed a crazy chain of events. I once had a non-metal CPU cooling block, and it cracked and leaked (for months, apparently) onto my gfx card. Never again will I use acrylic blocks. (My gfx card became unstable, but after cleaning was fine again.)
plumbing hemp fibers
sounds professional!
is still laying on the floor somehow working
Your dedication will, unfortunately, go mostly unnoticed in the world, as the mradermacher team seems to pump out quants almost without interruption, owing to your efforts...
As soon we are dealing with GGUFs there should not really be any meaningful security risk.
Well, "meaningful" is a nice weasel word (https://www.wired.com/story/malware-dna-hack/). But yeah, it's less easy than just making it run python code.
One would need to craft some crazy ROP chain buffer overflow exploit
One would hope so. In many cases, there are much simpler exploits though, especially in code that was never written with safety in mind (as I suspect llama.cpp is).
I did similar things in the past and already escaped misconfigured docker containers.
Yes, please. It's awfully hard to get this right.
Making SSH secure is difficult unless you basically just let it call a specific script without arguments
Exactly, and even then I don't trust the whole machinery. Of course, llmc shell internally does ssh just as well...
Ah nice so that’s how I can push them to a specific node. That I might use for my own models as there I can already put the source GGUF
Indeed - the safe way is to add with worker nico1, then push to nico1. And if the gguf has the right name, it will pick it up. In theory, there are a lot of other (non-exposed) switches such as "nohfd" and "hf_gui" (a variant of hf_token :*). If you ever miss something, drop me a note, maybe it's already there in some form or another.
86409 new models found
96k, as it turns out after fixing things. That sounds overwhelming at first, but it turns out most of them are junk models - I already went through 20k of them, and it only resulted in queueing 200 more models. That's very good, because my first walk through all of the models was only 35k models, and it took me a month. Not sure I would have the energy to go through 96k more...
2025-02-07T04:46:03 10.28.1.6 add <-2025 si https://huggingface.co/cognitivecomputations/Dolphin3.0-R1-Mistral-24B>
2025-02-07T04:46:07 10.28.1.7 push <rich1 1 zefiro-7b-dpo-qlora-ITA-v0.7 noquant 0>
2025-02-07T04:46:32 10.28.1.6 add <-2020 si https://huggingface.co/cognitivecomputations/Dolphin3.0-Mistral-24B>
Nifty, using nice levels as marker, just like me. (I was just staring at the status and wondered, -2025, what did I do wrong this time :).
I've seen plenty of duplicated safetensor sets, but this is a new one (5 of 4):
-rw-rw-rw- 1 root root 4976698672 Feb 7 06:46 model-00001-of-00004.safetensors
-rw-rw-rw- 1 root root 4999802720 Feb 7 06:46 model-00002-of-00004.safetensors
-rw-rw-rw- 1 root root 4915916176 Feb 7 06:46 model-00003-of-00004.safetensors
-rw-rw-rw- 1 root root 1168138808 Feb 7 06:41 model-00004-of-00004.safetensors
-rw-rw-rw- 1 root root 1168138808 Feb 7 06:41 model-00005-of-00004.safetensors
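(A small sketch of how one could flag such sets automatically - the current-directory walk and the naming pattern are assumptions based on the listing above:)
import os, re
from collections import defaultdict

# collect model-XXXXX-of-YYYYY.safetensors shards and compare the actual count
# against the declared "-of-" total
pat = re.compile(r"^(.*)-(\d+)-of-(\d+)\.safetensors$")
present = defaultdict(set)
declared = {}
for name in os.listdir("."):
    m = pat.match(name)
    if m:
        prefix, idx, total = m.group(1), int(m.group(2)), int(m.group(3))
        present[prefix].add(idx)
        declared[prefix] = total
for prefix, idxs in present.items():
    if len(idxs) != declared[prefix] or max(idxs) > declared[prefix]:
        print(f"{prefix}: {len(idxs)} shards present, {declared[prefix]} declared")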
I wonder if it is possible to replace the mradermacher user by an organisation of the same name without too much disruption?
probably ask huggingface staff
that's actually a good idea! (except that in the past, they never ever reacted to anything I asked. still worth a try).
anyway, the idea would be to e.g. allow nico, who does take care of a lot of model requests, to close or otherwise manage this better, and also to represent the actual structure of the "mradermacher team" better (for all of us).
do you know if an organisation actually would allow us to do all that? thoughts?
I think in an org everyone with the right permission can add/edit models (if I am thinking right, it's just an assumption), so I guess it would be better and a nice idea. If you could draft a simple message I could send, I will try to reach out to them myself.
you can try creating an org just to test how it feels, but I think it would be a good idea to switch. The maximum that happens is probably that access keys might be reset, but I am assuming it's a 5-10 minute job to replace them.
Thanks, I will take you up on that offer once I hear some input from nicoboss. Also, I will give some thought to how to best do this. I assume we won't be able to schedule this in a meaningful way. Hopefully not all hell will break loose.
The alternative would be to try doing it ourselves - in theory, it is a simple matter of renaming orgs and automating the movement of repos. It might be less work, more of a hassle, but would allow us to do it in a controlled way, without stopping the whole quantisation machinery for days, maybe.
In good news, I finally, finally, am pretty confident I nailed down the "jobs get skipped sometimes" bug that has haunted me for months. Unfortunately, I think that means we might have skipped some imatrix jobs unnoticed.
after seeing that CALM-405B was essentially imatrix'ing from disk, I mlocked it into memory (after checking /host/proc/meminfo), and this seems to have helped.
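For illustration, a minimal sketch of the idea (not the actual tooling - the path is made up, the meminfo check mirrors what is described above, and locking this much memory needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK):
import ctypes, mmap, os

path = "/tmp/CALM-405B.Q8_0.gguf"  # placeholder path
size = os.path.getsize(path)

# check available memory first (the real check used /host/proc/meminfo)
with open("/proc/meminfo") as f:
    avail_kb = next(int(line.split()[1]) for line in f if line.startswith("MemAvailable"))
assert size < avail_kb * 1024, "not enough free RAM to lock the model"

fd = os.open(path, os.O_RDONLY)
mm = mmap.mmap(fd, size, access=mmap.ACCESS_COPY)  # private mapping, so ctypes can take its address
buf = (ctypes.c_char * size).from_buffer(mm)

libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.mlock(ctypes.c_void_p(ctypes.addressof(buf)), ctypes.c_size_t(size)) != 0:
    raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))
print(f"locked {size / 2**30:.1f} GiB into RAM")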
@nicoboss some llmc audit tips: if a model stops during noquant, because the architecture is not supported, tokenizer failed to load, bpe pretokenizer not supported or anything like that, and it can't be fixed (by e.g. editing the tokenizer config with the (s)hell or so), then "n"uke is the right approach - it will remove all jobs and files.
if it fails during static or imatrix creation, e.g. because of a hf network problem (repo create failure or sth. like that, i.e. not quantize failure), then "r"edo is the right thing to do - it will simply restart the job again.
if it fails because of another error, then likely the repo already exists, and "n"uking is not enough, because the repos must be deleted as well. i'll provide a separate command for that.
now that the "quantize job gets silently skipped" bug is found (i am pretty certain i found it), unfortunately there seems to be a new or another bug - sometimes jobs simply die without an exit status, as if killed. in that case, "r"edo will do the right thing as well. but i have an inkling of what the indirect cause is (or was, hopefully it's fixed).
I accidentally added llm-jp-3-13b-instruct3 as type si instead of s, as I first overlooked that it is a primarily Japanese model. How can I now stop it from making imatrix quants? I figured out that to change the priority I can always just resubmit with the new priority, but this approach doesn't seem to work with the type. There also is no llmc remove to get rid of it so I could then correctly re-add it.
I added some great models, mainly in the priority range 1 to 9, yesterday and today. I used the following script on https://huggingface.co/models?pipeline_tag=text-generation&library=safetensors&sort=trending to discover some great models we missed and then manually reviewed each model card, download count and config.json to decide which to add. Here is the code in case you want to use it as well:
(async () => {
  // Grab the model names shown on the current listing page (30 articles per page).
  const baseXPath = '/html/body/div/main/div/div/section[2]/div[3]/div/article';
  const texts = [];
  for (let i = 1; i <= 30; i++) {
    const xpath = `${baseXPath}[${i}]/a/div/header/h4/text()`;
    const result = document.evaluate(xpath, document, null, XPathResult.STRING_TYPE, null);
    if (result.stringValue) {
      texts.push(result.stringValue.trim());
    }
  }
  // HEAD request to see whether a repo exists (404 means it does not).
  const checkUrlExists = async (url) => {
    try {
      const response = await fetch(url, { method: 'HEAD' });
      return response.status !== 404;
    } catch (error) {
      return false;
    }
  };
  // For each listed model, check whether mradermacher already has an -i1-GGUF repo.
  const notFoundTexts = [];
  for (const text of texts) {
    const parts = text.split('/');
    const extractedText = parts[parts.length - 1];
    const url = `https://huggingface.co/mradermacher/${encodeURIComponent(extractedText)}-i1-GGUF`;
    const exists = await checkUrlExists(url);
    if (!exists) {
      notFoundTexts.push(text);
    }
  }
  // Print the models that still lack imatrix quants.
  console.log(notFoundTexts.join('\n'));
})();
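If you prefer doing the same outside the browser, here is a rough huggingface_hub sketch of the same idea (untested, and it assumes a reasonably recent huggingface_hub - the web UI's "trending" sort isn't exposed the same way, so this sorts by downloads instead):
from huggingface_hub import HfApi

api = HfApi()
missing = []
# walk the top text-generation safetensors models and keep the ones
# that don't have a mradermacher <name>-i1-GGUF repo yet
for m in api.list_models(task="text-generation", library="safetensors",
                         sort="downloads", direction=-1, limit=30):
    name = m.id.split("/")[-1]
    if not api.repo_exists(f"mradermacher/{name}-i1-GGUF"):
        missing.append(m.id)
print("\n".join(missing))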
I wonder if it is possible to replace the mradermacher user by an organisation of the same name without too much disruption?
Future storage limitations might hit organizations but not users. Not sure how much disruption switching to an organization would cause, but it could be a lot. I don't think converting an existing user to an organization is technically possible, as a lot of user activity like writing a message cannot be done by an organization as far as I'm aware. I'm also not so sure I trust the HuggingFace team to do such a migration without losing metadata such as model creation date, model modification date, git commits, likes, downloads, followers and discussions of models. If they are even willing to do this likely quite time-consuming manual process, I assume they would have to create a new organization and clone over all the models, which would remove such metadata. We also cannot do this on our own due to the same clone-model restriction and, more importantly, the repo creation limit.
I clearly see the advantages of an organization, especially when it comes to permission management and moderating discussions. I agree that the current situation of it being a regular user account is not ideal.
I can access the mradermacher account in emergency situations thanks to your mail, but I never tried it in case there are security checks that would be triggered if the same account gets accessed from different locations at the same time. Instead of an organization you could give me a VM or a VPN through which I could access the account from the same IP, or we just risk it and see what happens if we access it from multiple locations simultaneously. I'm sure I already accidentally accessed my own account from Italy and Switzerland at the same time using remote desktop and nothing ever happened. But we need to be very careful not to violate their Terms of Service, because the last thing we want is your account with all those models getting banned. So maybe this is not an option. But I also don't think I really need access besides emergency situations.
How can I now stop it from making imatrix quants? I figured out that to change the priority I can always just resubmit with the new priority, but this approach doesn't seem to work with the type. There also is no llmc remove to get rid of it so I could then correctly re-add it.
You identified it: the "nuke" command that would remove a job is still missing. I'll add it once I find the time :)
I used the following script on
Yeah, I used to filter on library_name transformers, which is the reason I ended up with (now) 30k more models. I now don't filter by library_name anymore, so eventually it should hit all the overlooked models.
I also started with a small script that got a bit larger. Mainly, it queues new models and then allows me to digest them in small portions - a killer feature :)
It looks like this for new models:
the red models are model names that already exist.
my main concern at the moment is that we don't keep up with new models - it's a combination of many effects, but mainly, it's because there are just so many potentially interesting big models coming out. If somebody could assassinate Tarek07, that would help (new 70b's every day :=)
Not sure how much disruption switching to an organization would cause but it could be a lot. I don't
Not encouraging.
Not sure how much disruption switching to an organization would cause but it could be a lot.
I thought we could just rename mradermacher to sth. else, and then rename the organisation. And repos can simply be "renamed" between users and orgs, which might or might not be rate-limited. But it's rather a big and uncertain process.
I can access the mradermacher account in emergency situations but never tried it in case there are security checks that would be triggered if the same account gets accessed from different locations at the same time.
Why would that be a thing - everybody accesses their accounts from multiple locations (home, phone, work...), and lots of residential homes get new ip addresses every day.
Do not worry about it, and if, we should be able to deal with it. The only drawbacks I see with account sharing is that we will use the same name, and you'd have to switch, which is a hassle. You can do most everything without logging in, fortunately.
But we need to be very careful not to violate their Terms of Service
Hmm... why would it be a ToS violation to access your account from different places? Let me put it differently: did you see anything in the ToS that would make this a possible problem?
It sounds weird, and technically impossible to implement such a restriction (how would they even know your "location"? ip addresses don't work for that, and they have nothing else). Hard to implement, causing lots of issues with their customers, for no gain.
As for the organisation, you have currently discouraged me from pursuing this soon :-)
@nicoboss additions for llmc: add now supports hf_token (and hf_gui as a shortcut for guilherme).
it also has a new "nuke" command that tries to kill/remove any jobs and queue entries for the named model.
both untested, of course.
(In the past, I usually edited the queue file or job queue directly... that's why there was no command other than one that killed jobs.)
In good news, I finally, finally, am pretty confident I nailed down the "jobs get skipped sometimes" bug that haunts me for months. Unfortunately, I think that means we might have skipped some imatrix jobs unnoticed.
Awesome to hear. I noticed some imatrix quants missing from relatively popular repos so I queued them yesterday.
after seeing that CALM-405B was essentially imatrix'ing from disk, I mlocked it into memory (after checking /host/proc/meminfo), and this seems to have helped.
Great it worked out. I wasn't really ready for a 405B model and still used around 50 GB of RAM myself, but cool that you were able to make it quant from RAM in Q8 despite this. I also liked how you made it so small imatrix quants still get computed while the 405B was being computed.
When we are at 405B models we really should do https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B as well soon.
some llmc audit tips: if a model stops during noquant, because the architecture is not supported, tokenizer failed to load, bpe pretokenizer not supported or anything like that, and it can't be fixed (by e.g. editing the tokenizer config with the (s)hell or so), then "n"uke is the right approach - it will remove all jobs and files.
Makes sense. I really wish we would at least store the type of error somewhere, so when you try to queue the same model that failed before you get the reason why it failed. It is frustrating having to force-queue them just to see what the error was.
if it fails during static or imatrix creation, e.g. because of a hf network problem (repo create failure or sth. like that, i.e. not quantize failure), then "r"edo is the right thing to do - it will simply restart the job again.
Makes perfect sense and will do.
if it fails because of another error, then likely the repo already exists, and "n"uking is not enough, because the repos must be deleted as well. i'll provide a separate command for that.
OK I'm looking forward to that. Thanks for all the great audit hints. Your guidance is very appreciated.
unfortunately there seems to be a new or another bug - sometimes jobs simply die without an exit status, as if killed. in that case, "r"edo will do the right thing as well. but i have an inkling of what the indirect cause is (or was, hopefully it's fixed).
Maybe it got OOM killed? Does this happen on every host or only a specific one?
You identified it: the "nuke" command that would remove a job is still missing. Ill add it once I find the time :()
Thanks a lot. I'm really looking forward to that one as mistakes can happen.
It looks like this for new models
Wow, you wrote like an entire application for this. Well, it makes sense given that you are likely spending hours every day searching for new high-quality models. Your effort is highly appreciated!
my main concern at the moment is that we don't keep up with new models - it's a combination of many effects, but mainly, it's because there are just so many potentially interesting big models coming out. If somebody could assassinate Tarek07, that would help (new 70b's every day :=)
I see the issue as well. There are too many new amazing models for us to keep up with our current compute resources. I might be able to contribute more compute using Threadripper and CastlePeak once my datacenter has recovered from the fire and the cooling loop leakage. If we do so, we should maybe move the cheap-to-compute quants to the other nodes so I can increase compute without increasing my total internet data usage. I think we did so in the past, back when nico1 just did the expensive quants due to my upload limitation. I technically do have unlimited internet without any fair-use clause in the contract, but I don't want to risk getting kicked out by my ISP due to my insane internet bandwidth usage. @RichardErkhov now you see how much we rely on rich1.
I'm also really scared of what we will do should https://github.com/ggerganov/llama.cpp/pull/11446 get merged. Requanting all the DeepSeek v3 and DeepSeek v2 based models is obviously infeasible, but we would at least have to redo the popular ones, as requanting them offers a 2x inference performance increase.
Why would that be a thing - everybody accesses their accounts from multiple locations (home, phone, work...), and lots of residential homes get new ip addresses every day.
Hmm... why would it be a ToS violation to access your account from different places? Let me put it differently: did you see anything in the ToS that would make this a possibly problem?
It sounds weird, and technically impossible to implement such a restriction (how would they even know your "location"? ip addresses don't work for that, and they have nothing else). Hard to implement, causing lots of issues with their customers, for no gain.
I have just had bad experiences with other services that used this technique to detect account sharing and automatically ban for it. The way it usually works is that an account needs to be accessed by two different IP addresses not assigned to the same country at the exact same time. I never did account sharing, but I am constantly remoting into my PC from everywhere, even on holiday. I agree that it is an absolutely terrible method to detect account sharing, leading to many false positives and unjust account bans. But you are right that there is no way HuggingFace would use that technique, as there is no real incentive for them to go against account sharing of free accounts. I don't even think account sharing is explicitly prohibited in their ToS.
Do not worry about it, and if, we should be able to deal with it. The only drawbacks I see with account sharing is that we will use the same name, and you'd have to switch, which is a hassle. You can do most everything without logging in, fortunately.
I obviously would use a dedicated VM just to access the mradermacher account, as I do for every security-critical account. Thanks to this I will not have to ever switch accounts. There is a reason all my PCs are running Proxmox: I'm using virtualization as an additional layer of security. I have hundreds of VMs, each dedicated to a specific purpose. That way, if one of them gets compromised, I know what is affected and the damage is very limited. I usually even use full disk encryption so that the VMs are fully encrypted when not in use, to limit the damage of a host compromise. Important VMs and LXC containers I also replicate to all hosts and add to the backup server, so I can continue to use them even if the server hosting them fails, and I have daily incremental backups.
additions for llmc: add now supports hf_token (and hf_gui as a shortcut for guilherme).
Thanks a lot, that is awesome. I have already encountered so many gated repositories I was unable to queue. They are such a pain to deal with.
it also has a new "nuke" command that tries to kill/remove any jobs and queue entries for the named model.
Thanks a lot for this! :D
both untested, of course.
Sounds promising, and it is likely how I would do it as well, since nothing bad would happen if it doesn't work. Well, at least until it nukes everything or does some other crazy thing. I'm often overconfident in my code as well and sometimes regret it.
(In the past, I usually edited the queue file or job queue directly... thats why there was no command other than one that killed jobs)
Sounds almost as bad as having to edit the production DB using SQL, which to this day is still required for certain processes in my job.
Makes sense. I really wish we would at least store the type of error somewhere, so when you try to queue the same model that failed before you get the reason why it failed. It is frustrating having to force-queue them just to see what the error was.
it actually does store the whole log, but it's not yet accessible. i'll have to implement some grep-type command. what isn't logged is imatrix generation failures where we end up having no static repo (because the model was broken, but did convert and quantize statically). repos where i kept the static ones but they have no imatrix repo usually have a "no_imatrix" metadata key with the reason (used by patchreadme).
Maybe it got OOM killed? Does this happen on every host or only a specific one?
it started on rich1, and dmesg showed some oom kills, but it showed up on rain and marco once, too, and there were no oom kills there. in any case, I think I know what might have caused it (namely "ID" leakage - the id that the ils/iwait/ikil etc. commands use).
it hasn't happened since I fixed that, so keep your fingers crossed.
this was, btw., part of a larger rewrite of how jobs are started - no longer using screen (but I found out why I used screen in the first place - the tty causes programs to go line-buffered of course, which causes some slight fallout, but fortunately, both quantize and imatrix flush manually it seems), and local scheduling also is enabled everywhere now.
Wow you wrote like an entire application for this.
let's not get carried away, it's a simple script that writes an html page. the value of it is that it has grown over a year to get more and more filtering tweaks, as hf filtering is imho useless. and also to keep track of my manual decisions as well as it can, because that is hard work and i'd like to preserve it if something goes wrong :)
currently i use this for filtering. i think there is value in you creating your own approach, though. to give some creative input, I currently use this as the raw filter, mostly checking whether certain files exist, and weeding out model names (-AWQ etc.) that are likely uninteresting. and one thing that comes in handy is that I keep a list of author names that tend to have interesting models - they simply get listed first, so they are less likely to drown in the sea of mediocrity. oh, and uppercase-initial model names are also listed first. that's the single most important indicator of whether somebody wants the model noticed or not :)
grep +(grep $supported_models{$_}, @{ $_->{config}{architectures} }),
grep { grep $_->{rfilename} =~ /(?:^|\/)(?:config\.json)\z/, @{ $_->{siblings} } }
grep { grep $_->{rfilename} =~ /(?:^|\/)(?:model.*safetensors|pytorch_model.*bin)\z/, @{ $_->{siblings} } }
#grep $_->{library_name} ne "transformers", #d#
grep $_->{id} !~ /\/[0-9-.]+\z/,
grep $_->{id} !~ /\dbit-quantized\z/,
grep $_->{id} !~ /(?:[-_]exl2|[-_]?(?:[2468][-_]?bits?|[0-9.]+bpw)|-_?(?:GPTQ|GGUF|AQLM|Int[48]|NF[48]|FP8|MLX))\z/,
grep !exists $BLOCK_AUTHOR{$_->{author}},
IP addresses not assigned to the same country at the exact same time
(my reasoning is your reasoning) I would totally risk it :)
If we do so, we should maybe move the cheap-to-compute quants to the other nodes so I can increase compute without increasing my total internet data usage. I think we did so in the past, back when nico1 just did the expensive quants due to my upload limitation.
Well, there really aren't any cheap-to-compute quants if you think about it - they all cost roughly the same per byte. There are differences in I/O usage (and unfortunately, nico1 is also the best node for I/O-intensive tasks) - basically, static-only models generate more I/O per byte. All we can tweak is the latency with which a model finishes, and right now, since we are overloaded, I simply manually queue big models to rich1, nico1, marco (and kaos to some extent). That does a good job of keeping them busy while the others can chomp on smaller models.
But if you meditate on it for a while: as long as we are overloaded, there is not a zilch of difference in where a model is quanted - I can let rain chomp on calm-405b for a few weeks (or more?), and all it does is let nico1 do more of the smaller ones. The only difference is when we run out of models; then nico1 might be idle earlier than rain (latency!), but when we are idle, we are no longer overloaded.
I might be able to contribute more compute
I was afraid to ask. I am torn on the issue - a few days ago, there was a one-day lull in 70Bs, and that allowed us to clear the queue, actually, so it's not as horrible as it looks. I am also experimenting a bit with different scheduling, just to clear the queue more aggressively (more big models in the queue makes the queue shorter...).
But I was close to bring it up. Get yourself a tasmota power plug and let me power down the other node in the afternoon, or so :)
However, it might be a yearly thing - last year, around new year, things also exploded, so maybe it will quiet down a bit? On the other hand, models have tended to become larger (405b..., most noticeably the 70Bs have increased in prevalence).
I’m also really scared what we will do should https://github.com/ggerganov/llama.cpp/pull/11446 get merged
Yeah, I had the same thought when I saw it a week ago. But requanting lots of deepseek models is feasible. JUST NOT NOW :-)
@RichardErkhov now you see how much we rely on rich1.
Oh, that was in question? To be honest, rich1 was a very good host to have to eat through the high nice-level queue and keep latency down for people, but with the many recent outages of nico1 (e.g. when imatrixing big models - really a function of too many too-big models), without rich1 we would have exploded (which means I would have had to take drastic measures and quantize fewer of the new models).
Well at least until it nukes everything or does some other crazy thing.
Yup. I did run a quick test after I wrote my message, though. And the nuke command did exist before and was used many times, it just didn't delete from the queue. Which is a totally safe thing. Oh, and a security bug, let me try to fix that.
Another model that can't be downloaded: S1-Reasoning-32B
huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: None.
Cannot access content at: https://cdn-lfs-us-1.hf.co/repos/81/eb/81eb1575218917424d0e546240cbe0050eb5b52fe5d7d3cb90406c8e6c573654/7ad4145becf2b51f520acb9e5a9c63bf979961cfd62d6bcff7cab110593b8b57?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model-00023-of-00029.safetensors%3B+filename%3D%22model-00023-of-00029.safetensors%22%3B&Expires=1739067371&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczOTA2NzM3MX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzgxL2ViLzgxZWIxNTc1MjE4OTE3NDI0ZDBlNTQ2MjQwY2JlMDA1MGViNWI1MmZlNWQ3ZDNjYjkwNDA2YzhlNmM1NzM2NTQvN2FkNDE0NWJlY2YyYjUxZjUyMGFjYjllNWE5YzYzYmY5Nzk5NjFjZmQ2MmQ2YmNmZjdjYWIxMTA1OTNiOGI1Nz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=ARvqLEDzF-YB2bT44~8CdRR4fPaZupPc2YKIhWiCVrMKm3YbMfXUtPfWFuHpukxWm0U29YIJnLNbKv4YO9i0joN4WXkBV2~YsiYx9m9Oz1Jw1E2SiflPD2yaeEVt5XxogBEr4Z-UnljI9lxEJTnc2NA93fqrvzbinan79~eTgfB69Viw6KWRweAJL3O3jRRKdUQnXvt~rE~9CTTmt8Wi0qmtdqxf7mjlOQKBqecm2girKMApYJ0fL~EWcHm99pEZPCdh-WMccTpclBMT-WDPhfh~VQDQEOksevyfOa-6qjBi888y6JFtRCGb3-eJepj3RMsDOMK2Ij7-92lLOyNmaw__&Key-Pair-Id=K24J24Z295AEI9
lol I remember trying to upload the fatllama ...
The way it usually works is that an account needs to be accessed by two different IP addresses not assigned to the same country at the exact same time.
Why are multiple IPs at the same time even a question? Huggingface itself supports uploading to the account from Amazon, Google and Huggingface themselves, and it supports doing so from multiple servers at the same time. is there even a clause in the TOS about account sharing??
nope, new mystery problem still there. jobs simply exit with exit status 1, right in the middle of quantize. i am stumped.
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type f16: 198 tensors
[ 1/ 339] output.weight - [ 3584, 152064, 1, 1], type = f16, size = 1039.500 MB
[ 2/ 339] output_norm.weight - [ 3584, 1, 1, 1], type = f32, size = 0.014 MB
Update: that's because the job is still running fully intact. What the hell. But it's a lead...
Update 2: pretty sure it's due to a workaround I have in place for the empty logfile problem I had earlier. unfortunately, now that the timing is pretty much tight to the millisecond, it's kind of hard to avoid. hmm... in any case, I think it is benign, the job will go green again once finished.
added two new llmc commands, nukerepo and nukeall. nukerepo can be used to easily delete mradermacher repos for a model, and nukeall does what nuke + nukerepo does, and additionally tries to remove the imatrix.dat - in other words, it makes the model basically nonexistent to us.
nukerepo is not that commonly useful, but nukeall is something i regularly use when e.g. the imatrix failed because of a broken model, in which case any static quants are also doomed.
Requanting all the DeepSeek v3 and DeepSeek v2 based models is obviously unfeasible but we would at least have to redo the popular ones as requanting them offers a 2x inference performance increase.
It is more than 2x; the difference grows with context length. It also has the benefit of dramatically lower KV sizes. I can't even launch at 128K; it OOMs even though I have 384 GB of RAM. 64K and 128K results below to show KV sizes.
n_ctx = 128000
MLA:
CPU KV buffer size = 16203.13 MiB
CPU compute buffer size = 64468.01 MiB
llama.cpp:
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 656621568032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
n_ctx = 64000
MLA:
CPU KV buffer size = 8101.57 MiB
CPU compute buffer size = 32343.01 MiB
llama.cpp:
CPU KV buffer size = 313101.56 MiB
CPU compute buffer size = 16279.01 MiB
Edit: I used your imatrix.dat. With my newer quants, I set the two missing tensors to Q8_0.
It is more than 2x; the difference grows with context length. It also has the benefit of dramatically lower KV sizes. I can't even launch at 128K; the KV alone takes up more than 384 GB of RAM. 64K and 128K results below to show KV sizes.
The large ones (DeepSeek-V3, DeepSeek-R1-Zero and DeepSeek-R1) we have to redo anyway. While massive, they are also not that much effort, as nobody has finetuned them so far - I tried and failed both using 6x H200 and 6x MI300X, so it really doesn't currently seem feasible to finetune DeepSeek R1 using axolotl. The biggest pain would be recomputing their imatrix, which takes like 18 hours each and requires the RPC setup. Due to this model being so popular we should likely try to keep the old quants online while requanting.
@tdh111 Do you know if and to what degree this llama.cpp change will impact the DeepSeek-R1-Distill models? We have around a hundred of them and requanting all of them would be a massive project.
I tried and failed both using 6x H200 and 6x MI300X, so it really doesn't currently seem feasible to finetune DeepSeek R1 using axolotl.
https://github.com/hiyouga/LLaMA-Factory seems to have support for DeepSeek 671B.
Do you know if and to what degree this llama.cpp change will impact the DeepSeek-R1-Distill models? We have around a hundred of them and requanting all of them would be a massive project.
There is zero impact. The changes are because of the Deepseek architecture, which only the 671B has. All the "distills" are just SFT on top of other models; they are no different than community finetunes in that regard, just with a better dataset. Quote from the paper: "For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community."
The biggest pain would be recomputing their imatrix
Is this even required? Do the changes actually affect output as well?
Due to this model being so popular we should likely try to keep the old quants online while requanting.
I don't think I'd want that - too much hassle. But it would be easy to clone them temporarily.
Is this even required? Do the changes actually affect output as well?
As it stands it takes one tensor and decomposes it into two. (The initial tensor is still in the GGUF but is unused, and this is a pending change). Those two do impact output, and like I said, I mitigated the fact that your imatrix does not include them by making them Q8_0. There is actually a bug in llama.cpp (discovered here: https://github.com/ggerganov/llama.cpp/pull/11446#issuecomment-2644442963) where truncation is happening, losing noticeable quality, which means setting one of them to BF16/F32 is actually optimal.
Due to this model being so popular we should likely try to keep the old quants online while requanting.
I don't think I'd want that - too much hassle. But it would be easy to clone them temporarily.
They are still discussing how to handle the decision of two different implementations, the original one has better PP speeds, the new one has better TG and much lower KV size. In theory a single GGUF could handle both cases with it being toggled at launch, and that is how my preferred fork of llama.cpp handles it.
Hmm, I just encountered a fascinating problem - Llasa-1B-Multilingual-German makes convert-hf-to-gguf.py hang: http://ue.tst.eu/1748ccf580e05e130dae56c7a70e5cd3.txt
Definitely a new, formerly unseen, failure mode.
as some audit guidance, when I see error/1 bpe-pt missing (0e538944…) (WARNING:hf-to-gguf:** chkhsh: 0e538944d67) I instantly "n"uke in audit.
likewise, another very common instant nuke is error/1 ValueError Can not map tensor ' (ValueError: Can not map tensor 'embed_tokens.weight')
wrong shape (during noquant) is another instant "n"uke
duplicated tensor, very often, is a duplicated set of tensor files (either two safetensor sets or pickle/pytorch + safetensors). i usually look at the index.json file to see which one is the likely newest upload (or, in the case of pickle, i usually choose safetensors). that's what "s"hell is for.
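For illustration, here is a small standalone sketch of that index.json check (not part of llmc; the script and its output wording are made up). It prints which shard files model.safetensors.index.json actually references, so the stray duplicate set can be spotted quickly:

import json, os, sys

model_dir = sys.argv[1]
with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
    index = json.load(f)

# weight_map maps tensor names to the shard files that are actually used
referenced = set(index["weight_map"].values())
present = [n for n in sorted(os.listdir(model_dir))
           if n.endswith((".safetensors", ".bin"))]

print("referenced by index.json:")
for name in sorted(referenced):
    print("  ", name)
print("present but not referenced (likely the stale duplicate set):")
for name in present:
    if name not in referenced:
        print("  ", name)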
so if you want to give llmc audit more of a try, you can just instantly nuke the first three without a second thought and e.g. skip the rest (just pressing enter).
also, two undownloadable files in two days. soon i will have to write special code for this hf bug.
@mradermacher I installed a 20 TB BTRFS-formatted Seagate Exos X20 and mounted it to /gpool inside your LXC container. Please make use of it as much as you want to deal with those massive models. I will use some of the storage myself as well but will make sure there are always a few TB available for you to use. On some other great news, during this maintenance period I was also able to get the full 128 GiB of RAM working again on Threadripper.
I will use some of the storage myself as well but will make sure there are always a few TB available for you to use.
Nice - I don't think I will make any automatic use of it (at least at the moment I wouldn't know how), so I can announce when I need it in advance. The only thing I currently can come up with will be converting Llama-3.1-Tulu-3-405B.
But yeah, it might be handy to have a few TB of temp space for stubborn models with a lot of extra baggage, or maybe for moving stuff out of the way when the next deepseek lands.
I was also able to get the full 128 GiB of RAM working again on Threadripper.
Good to hear - so, was it just the power supply you lost? Quite lucky!
There goes my IQ1 of snowflake :/
The maker of imatrix has made a PR https://github.com/ikawrakow/ik_llama.cpp/pull/202 in his fork that solves your problem, and would allow you to IQ1 arctic, and also imatrix with partial data which helps all the other quants.
The maker of imatrix has made a PR https://github.com/ikawrakow/ik_llama.cpp/pull/202 in his fork that solves your problem, and would allow you to IQ1 arctic, and also imatrix with partial data which helps all the other quants.
That is absolutely amazing to hear. Thanks a lot for informing us about this PR getting merged. I completely missed it. Snowflake Arctic is one of the models I really like. It is such an underappreciated model in my opinion. Thanks to using 256 tiny experts it is so fast and efficient to run on CPU, and like DBRX and Nemotron-4-340B it is a very unique model, providing different opinions/answers/data compared to the vast majority of other models, making it perfect for synthetic data generation or getting a second opinion.
@mradermacher Let's redo the imatrix quants of https://huggingface.co/Snowflake/snowflake-arctic-instruct and finally do https://huggingface.co/Snowflake/snowflake-arctic-base. Even after all this time I still have snowflake-arctic-instruct stored on hpool in the hope I find time to fix this issue myself. I can source GGUF it to spool or gpool at any time.
I see why I missed this pull request. They merged it into ik_llama.cpp and not official llama.cpp
@mradermacher We probably should port this fix to our own llama.cpp fork. I have no idea where we are even hosting our own llama.cpp fork. If you tell me, I could port this change for you.
We don't have a llama.cpp fork. Well, strictly speaking, we have one, but that's just my private git repo with a few added lines of debugging.
I am wary of maintaining a fork, and I am also wary of switching to a potentially incompatible fork of unknown maintenance. Not that I am particularly happy with llama.cpp, but it is the de-facto standard, no matter how cool other forks are. I feel I need to be very conservative with mradermacher, and therefore strongly resist too much change.
I am open to arguments, but "this cool fork fixes things and also makes super-sota quants" is something I hear a lot. But I feel it is something for a more experimental group than mradermacher.
I am also open to one-time hacks, such as possibly one time using ikawrakow's fork to do the imatrix for snowflake, but it is a lot of business, and I would need to be convinced that it is completely compatible with llama.cpp.
If the imatrix files are compatible (and I don't see why not), maybe a solution would be for somebody else (hi, nico) to provide the correct llama install. It should be easy to use a different llama-imatrix for one job, and that would keep everything else safe. But I am busy enough with day-to-day business, so I don't think I want to stress myself with this :)
This was a lesson in how we take technology for granted.
Before going to sleep, I ran (the equivalent of) llmc audit. Remember how I told you it keeps the scheduler lock? Well, my internet failed, and so did mobile. And me, being extremely tired, had to go to sleep.
There is a new command in llmc now, called "killalll9". It tries to log in on all nodes and, well, kill -9's all llmjob processes. You can do that when the scheduler seems hung for an hour or so. It should generally be harmless, or at least the damage should be limited to race conditions between, say, cleaning up a job and writing that fact to disk.
Local scheduling can't do anything if the scheduler is already running on the node. Maybe a client-side timeout would be good to have. Would have been nice if CALM-405B or so had been quanting, but alas, only smaller models were running at the time.
I sandwiched the question prompt in llmc audit with an alarm 90/alarm 0. Might be good enough.
ERROR:hf-to-gguf:Error: Llama-3.1-Tulu-3-405B is not a directory
convert-hf-to-gguf.py no longer supports symlinks to directories??
since some people here are so ultra-competitive (except nico and me), let the world know that rain has won the race to the first nice-400 job.
We probably should port this fix to our own llama.cpp fork
I would recommend this over directly using the fork for all your needs (it does not support all the models, RPC support is a bit behind, etc.). Ideally you could create a PR with the original llama.cpp. ikawrakow is the person who brought basically all of the quants to llama.cpp (except for the very early, very basic ones that are legacy nowadays), the imatrix, etc. He no longer contributes to llama.cpp, preferring to maintain his own fork with sota quants and a lot of CPU performance improvements, amongst other things.
"this cool fork fixes things and also makes super-sota quants" is something I hear a lot.
I'm curious what other forks you've heard this about. ik_llama.cpp is my preferred fork, but I understand that forks have downsides, and I've ported things over to ik_llama.cpp where it did not support something. Since your group loves to support all models, it is probably best for you to stay on llama.cpp, which almost always gets model support faster, if at all (the MLA version of Deepseek is an exception, as ik_llama.cpp has the most mature implementation that is merged; llama.cpp is still in PR).
I am wary of maintaining a fork, and I am also wary of switching to a potentially incompatible fork of unknown maintenance.
It is just a patch to the imatrix that allows for MoE models that hit the issue you hit with snowflake to use partial expert data.
If the imatrix files are compatible (and I don't see why not)
They should be to my understanding, I use the fork and generally use your imatrix.dat files.
I'm curious what other forks you've heard this about.
None, it's about other quant methods producing ggufs. I am not paying much attention to it.
It is just a patch to the imatrix that allows for MoE models that hit the issue you hit with snowflake to use partial expert data.
I would expect that - we discussed such a patch ourselves after snowflake. So it could be, could not be. This doesn't instill confidence.
I would expect that - we discussed such a patch ourselves after snowflake. So it could be, could not be. This doesn't instill confidence.
I say should be, as I read and understood the code, but I have not tested it. Ikawrakow did a full write-up of this issue in the PR: https://github.com/ikawrakow/ik_llama.cpp/pull/202; he definitely tested and took this issue seriously when what happened to the Arctic imatrix run was mentioned to him. The code is good to use; I have never made an imatrix myself, I've only ever downloaded imatrix.dat files, and said "should" for that reason.
Edit: Like I said above, you can make a PR for this in llama.cpp if you want it there; you'll have data showing its usefulness.
You have to understand my situation, too - we are quanting dozens of models each day, without interruption. The goal is not to be at the forefront of the technology and experiment with things, but to provide a reliable repository of gguf quants. Every single hiccup can be disastrous, and llama.cpp is a very unstable and unreliable base. Experiments, regardless of how safe they seem, are not something I see me doing very much, and certainly not on a whim.
As I said, I am willing to try things, but by far my main goal is to provide a stable source of ggufs above all else.
@mradermacher I created the following patch based on the changes made to that llama.cpp fork to fix issues with MoE models where we encounter experts we can't activate with our training data, as is the case for snowflake. From a first impression I can say that the original llama.cpp has higher code quality compared to the fork, and even in this relatively small change I had to fix compilation warnings.
Either clone my fork:
git clone -b imatrix-allow-partial-data https://github.com/nicoboss/llama.cpp.git
cd llama.cpp/
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
Or apply the following diff to the original llama.cpp:
@@ -32,6 +32,7 @@ struct Stats {
std::vector<float> values;
std::vector<int> counts;
int ncall = 0;
+ int n_as = 1;
};
class IMatrixCollector {
@@ -124,11 +125,15 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
if (e.values.empty()) {
e.values.resize(src1->ne[0]*n_as, 0);
e.counts.resize(src1->ne[0]*n_as, 0);
+ e.n_as = n_as;
}
else if (e.values.size() != (size_t)src1->ne[0]*n_as) {
LOG_ERR("%s: inconsistent size for %s (%d vs %d)\n", __func__, wname.c_str(), (int)e.values.size(), (int)src1->ne[0]*n_as);
exit(1); //GGML_ABORT("fatal error");
}
+ else if (e.n_as != n_as) {
+ LOG_ERR("%s: inconsistent n_as for %s (%d vs %d)\n", __func__, wname.c_str(), e.n_as, n_as);
+ }
LOG_DBGV(2, "%s[%d]: %32s, %s, %5d x %5d, %d\n", __func__, m_last_call, wname.c_str(), ggml_op_name(t->op), (int)src1->ne[0], (int)src1->ne[2], (int)src1->type);
// loop over all possible experts, regardless if they are used or not in the batch
for (int ex = 0; ex < n_as; ++ex) {
@@ -247,8 +252,38 @@ void IMatrixCollector::save_imatrix(int ncall) const {
}
if (n_zeros > 0) {
- LOG_WRN("%s: entry '%40s' has partial data (%.2f%%) - skipping\n", __func__, kv.first.c_str(), 100.0f * (n_all - n_zeros) / n_all);
- continue;
+ LOG_WRN("%s: entry '%40s' has partial data (%.2f%%)\n", __func__, kv.first.c_str(), 100.0f * (n_all - n_zeros) / n_all);
+ bool store_it = false;
+ if (kv.second.n_as > 1) {
+ int n_per_expert = n_all / kv.second.n_as;
+ std::vector<int> bad_experts;
+ bad_experts.reserve(kv.second.n_as);
+ for (int i = 0; i < kv.second.n_as; ++i) {
+ auto counts = kv.second.counts.data() + i*n_per_expert;
+ int nz_i = 0;
+ for (int j = 0; j < n_per_expert; ++j) {
+ if (counts[j] == 0) ++nz_i;
+ }
+ if (nz_i > 0) bad_experts.push_back(i);
+ }
+ LOG_WRN("%s: %d out of %d experts are missing data\n", __func__, int(bad_experts.size()), kv.second.n_as);
+ if (bad_experts.size() < round(kv.second.n_as * 0.05)) {
+ LOG_WRN("%s: %d out of %d experts are missing data - storing but be aware\n", __func__, int(bad_experts.size()), kv.second.n_as);
+ store_it = true;
+ for (auto i : bad_experts) {
+ auto counts = const_cast<int*>(kv.second.counts.data()) + i * n_per_expert;
+ auto values = const_cast<float*>(kv.second.values.data()) + i * n_per_expert;
+ for (int j = 0; j < n_per_expert; ++j) {
+ counts[j] = 1;
+ values[j] = 1;
+ }
+ }
+ }
+ }
+ if (!store_it) {
+ LOG_WRN("%s: Skipping expert with missing data!\n", __func__);
+ continue;
+ }
}
n_entries++;
I successfully tested my changes on snowflake arctic using:
cd build/bin/
./llama-imatrix -m /hpool/GGUF/snowflake-arctic-instruct.i1-IQ4_XS.HARDLINK.gguf -f /root/imatrix-with-rp-format-data.txt -ngl 0
hi @mradermacher, I hope you are fine! wanted to know if you have any stats for rich1 and nico1? Like the total size (in TB) of contribution, amount of models produced etc. Just very interesting to see that lol, so can you please send me whatever you have on hand?
@mradermacher The status page including the telnet one froze 5.5 hours ago. The scheduler and everything still works just the status page does not.
wanted to know if you have any stats for the rich1 and nico1?
Somewhere, somewhat, some could be generated, when I write code for it. I will keep it in mind and see if I can come up with something - most likely, one would have to paste repository file listings though for actual resulting counts. There is, to quote a certain politician, not a snowball's chance in hell that rich1 can compete with nico1, even with nico1's hands tied behind her back.
One of my todo's is to combine logs from all nodes (currently, nico1 and rich1 keep their own logs, making summarising harder).
(That is my punishment for shitposting)
The status page including the telnet one froze 5.5 hours ago.
That's a new one, but I prefer this failure mode over the opposite.
Hmm, llmstatusd has no children, but waits for one. I'm not sure that should even be possible without a kernel bug.
patch
So if I get this correctly, the "weights" are literally some kind of weight - if they are 0, quantisation likely crashes, so they are set to 1. Hmmhmm... It seems a simple enough patch with hopefully low maintenance required. I'll meditate over it.
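To restate the logic of the diff above in plain Python (purely illustrative; the real code is the C++ patch, and the 5% threshold and the neutral weight of 1 are taken from it):

def fill_partial_experts(values, counts, n_as, threshold=0.05):
    # values/counts are flat lists of length n_as * n_per_expert, as in an imatrix entry
    n_per_expert = len(counts) // n_as
    bad = [i for i in range(n_as)
           if any(c == 0 for c in counts[i * n_per_expert:(i + 1) * n_per_expert])]
    if len(bad) >= round(n_as * threshold):
        return False  # too many experts lack data - skip this entry entirely
    for i in bad:
        # give unseen experts a neutral weight of 1 so quantization does not choke on zeros
        for j in range(i * n_per_expert, (i + 1) * n_per_expert):
            counts[j] = 1
            values[j] = 1.0
    return True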
Could you put the diff somewhere where it's binary-safe to download? hf mangles whitespace.
Could you put the diff somewhere where it's binary-safe to download? hf mangles whitespace.
Here the binary-safe diff: https://www.nicobosshard.ch/imatrix-allow-partial-data.patch
Alternatively just clone git clone -b imatrix-allow-partial-data https://github.com/nicoboss/llama.cpp.git
and git diff master
Seems we have lost rich1 once more. Sigh.
Seems we have lost rich1 once more. Sigh.
@mradermacher The host of rich1 crashed. I started the rich1 LXC container again. Everything is looking fine. Manual cleanup of tasks that were running while it crashed might be required.
@mradermacher Qwen2-VL-7B-Latex-OCR got wrongly scheduled to rich1. Not sure if that is why it crashed but in any case, it won't work there due to CUDA being required for this model and rich1 not having a GPU. I recommend we reschedule it to nico1. The only reason I haven't done so myself is because this is a potential cause of today’s rich1 crash and I don't want to tamper with potential evidence.
Aha, so it crashed :/ I do not think that model would have anything to do with it, as the converter simply didn't load due to cuda not being available, so it basically crashed immediately. I probably forced this model to rich1 manually.
Since it happened directly after I pushed some models there, I suspect it was, once more, many concurrent downloads. I have reduced it by 2, but I don't think this can eliminate it, just reduce the chance of it happening. And while most of the time, 4 downloads is just fine, sometimes it's actually a bottleneck. I wonder if I should do some rate limiting, and if that could fix it.
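For what it's worth, a minimal sketch of what per-download rate limiting could look like (this is not how the actual downloader works; the function name and the 50 MB/s cap are just placeholders):

import time
import requests

def rate_limited_download(url, dest, max_bytes_per_s=50 * 1024 * 1024, chunk_size=1 << 20):
    start = time.monotonic()
    done = 0
    with requests.get(url, stream=True, timeout=60) as r, open(dest, "wb") as f:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            done += len(chunk)
            # sleep whenever we are ahead of the allowed byte budget
            allowed_elapsed = done / max_bytes_per_s
            elapsed = time.monotonic() - start
            if allowed_elapsed > elapsed:
                time.sleep(allowed_elapsed - elapsed)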
@nicoboss I had, uhm, "unusual" problems with the tulu imatrix yesterday. After it ran for a few hours (probably; when I got up it was at chunk 5) and had a projected time of about a week, I mlock'ed the source gguf (which is on /gpool). But three chunks later it was not considerably faster, so I decided to kill it for the time being (at chunk 10). Maybe you have an insight into what went wrong?
Update: giving it another try, with less other activity.
Update 2: seems to work way better. Weird. nothing should have changed.
@nicoboss there is a new llmc why command that greps the log file for a model and dumps what it finds. in other news, manual imatrix removals will hopefully now also be logged to the logfile.
@nicoboss could you help me with an issue maybe?
https://huggingface.co/mradermacher/oh-dcft-v3.1-claude-3-5-sonnet-20241022-GGUF/discussions/1
Basically, it seems ollama pulls a chat template from somewhere (not the gguf) that keeps it from outputting anything but "safe". The discussion above has some analysis, but I don't know enough about ollama. It seems ollama uses some huggingface api to get the wrong chat template somehow? No clue what is going on.
@nicoboss I plan to generate imatrices for the 405B's currently queued on nico1 in the coming days, just so we have them when we need them, not because they are urgently needed.
But I don't want to do them when only one gpu is available, because obviously it blocks it for many hours. Do you have an idea of when you want to use it, so I can schedule it at a time when probably both are available?
@nicoboss I plan to generate imatrices for the 405B's currently queued on nico1 in the coming days, just so we have them when we need them, not because they are urgently needed.
Yes sure, do them all in 8 bit, but please wait with Nature-Reason-1-AGI until we have the RPC setup so we can do it in 16 bit.
But I don't want to do them when only one gpu is available, because obviously it blocks it for many hours. Do you have an idea of when you want to use it, so I can schedule it at a time when probably both are available?
Richard was using it to train his face swapping AI for the past 2 days. This wasn't an issue as one RTX 4090 was able to keep up with the imatrix tasks. There is no reason Richard needs the RTX 4090 GPUs for his tasks, so I gave you back the GPU and told him to switch to the RTX 3080. The RTX 4090 GPUs are not needed in the following days so you can now run imatrix computation on both of them, allowing you to compute the imatrix for 405B models without disrupting the other imatrix tasks. Make sure to limit imatrix tasks on the GPU not doing 405B to small models or you will run out of RAM and start streaming from SSD. I will try to keep my own RAM usage as small as possible while you are doing 405B.
wow, if i mlock it early it's only 3 hours for a 405B. that's unexpected. not sure if it's true though :)
Make sure to limit imatrix tasks on the GPU not doing 405B
well, last time I had to mlock, despite ample ram being available, and i mlock'ed this time as well. in any case, the current rule is to use 490GB of ram total, which should be way below the total amount (512+2*24)
wow, if i mlock it early it's only 3 hours for a 405B. that's unexpected. not sure if it's true though :)
Wow and this despite me accidentally using like 24 GB myself because I forgot to turn off some things. Just always use mlock in the future.
well, last time I had to mlock, despite ample ram being available, and i mlock'ed this time as well. in any case, the current rule is to use 490GB of ram total, which should be way below the total amount (512+2*24)
I think the issue last time was that it usually takes a while for the OS to cache the right data and if you do the model from HDD this is a slow process. There for sure was more RAM available last time than this time.
Just always use mlock in the future.
It's a manual process, and doing it automatically, I think, is dangerous. But for 405B's + Q8_0, it's the way to go. Especially from /gpool.
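As an aside, one way to pin a source GGUF into RAM by hand looks roughly like this (an illustrative sketch only, not how the actual tooling does it; it needs CAP_IPC_LOCK or a high memlock ulimit, and the lock only lasts as long as the process lives):

import ctypes, ctypes.util, mmap, os, signal, sys

MCL_CURRENT = 1  # from <sys/mman.h> on Linux

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
fd = os.open(sys.argv[1], os.O_RDONLY)
size = os.fstat(fd).st_size
mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)

mm.madvise(mmap.MADV_WILLNEED)            # hint sequential readahead
for off in range(0, size, mmap.PAGESIZE):
    mm[off]                               # fault every page into the page cache

if libc.mlockall(MCL_CURRENT) != 0:       # lock everything currently mapped
    raise OSError(ctypes.get_errno(), "mlockall failed")

print(f"locked ~{size / 2**30:.1f} GiB, keeping the process alive")
signal.pause()                            # keep the mapping (and the lock) until killed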
157 minutes for a 405B imatrix in Q8_0. just imagine how fast quantize would be if it used cuda. since it apparently just does brute force, it should lend itself to a cuda implementation. just... effort... needed...
@nicoboss that was pleasant, all 405B models now have an imatrix, sans "the AGI". thought that'd take a lot longer. now I only need less pressure on the queue. the big models are piling up despite me not even i-quanting lots of models i normally would i-quant.
Today finally the Seasonic PRIME PX-2200 2200W PSU arrived. CastlePeak and with it all nodes in my Castle cluster will be fully operational again later today. This means we can soon use RPC to compute the imatrix of Nature-Reason-1-AGI in 16-bit precision.
2200W power supplies give me awe.
All my nodes are online again and the RPC setup has been updated to the latest llama.cpp and is ready to be used for Nature-Reason-1-AGI. Please start imatrix RPC computation as soon as you can as it is currently preventing Richard from using the 2070s and 3080 GPUs.
2200W power supplies give me awe.
Yes, especially when considering that I now have StormPeak with 3000W and CastlePeak with 2200W going to the same power outlet rated for 2300W. But no worries, I live in a modern house, so the 3000 Watt breaker is the bigger issue, and I will take care of it if this problem ever materializes, which will only happen if I buy some more GPUs.
What was concerning is that the PSU arrived with something loose moving around inside. I disassembled the entire thing, didn't find anything, reassembled it, and the noise was miraculously gone. No idea what it was, but it doesn't matter as everything is working perfectly fine with the new PSU. It is in fact the most amazing PSU I have bought so far, so thanks a lot for your recommendation to go for Seasonic.
0 93 BiMediX-Bi run/imatrix (GPU-2d) 6/32 5.40s/c 112.5/31.6m(205.8-205.7) [192/351] 6.7716
0 65 DeepSeekR1-QwQ-SkyT1-32B-Fusion-715 run/imatrix (GPU-18) 21/64 5.16s/c 109.3/27.3m(218.5-213.3) [163/318] 9.5223
After I got CastlePeak online again Threadripper ran out of storage due to the massive backlog of replication jobs. This then led to it crashing and because Threadripper is hosting the OpenWrt VM this meant a short internet interruption breaking imatrix tasks. How to restart them? There is no log file and the source GGUF already got deleted as well. I tried deleting the imatrix file but this didn't reset them.
Please start imatrix RPC computation as soon as you can as it is currently preventing Richard from using the 2070s and 3080 GPUs.
... it would have been nice if this had somehow been coordinated or at least communicated to me before pointing the gun at my chest. you know, so I can prepare, or voice that I am not ready. sigh.
something loose moving around inside.
maybe a cable scraping against the case.
There is no log file and the source GGUF already got deleted as well. I tried deleting the imatrix file but this didn't reset them.
if the ggufs were already deleted (and/or there was an imatrix file) then the job finished successfully. when you then delete the result, the job is in permanent failure, because that's not a valid situation (the states are either: gguf has been created and job is not done, or gguf does not exist and job was done). it would also stay in error state if the connection fails, but the status would update and it could be recovered.
Cool that you started two quantization tasks beside the RPC imatrix task. I saw that one didn't max out the CPU and there was enough RAM available for a second one, and I was just about to point it out when you realized this yourself and started a second one.
... it would have been nice if this had somehow been coordinated or at least communicated to me before pointing the gun at my chest. you know, so I can prepare, or voice that I am not ready. sigh.
Thanks a lot for quickly starting the RPC imatrix computation. This is really appreciated. I didn't plan to immediately start with RPC computation the moment all my nodes were working again. I also didn't plan for Threadripper to run out of storage and crash, which caused all of Richard's tasks to be interrupted minutes after he went to bed, which led to a great opportunity to do RPC as GPUs were idle and ready to be used. I'm really sorry how terribly I worded my request. I should have nicely mentioned that spontaneously a great opportunity for RPC imatrix computation arose and asked you if it was possible for you to do the RPC setup instead of just telling you to do it. I was really tired after spending an evening working on getting everything running again and barely managed to get the RPC setup started and write to you before falling asleep. Feel free to always ignore such requests if you don't have time or it makes more sense to schedule other tasks first. I can always find another time window that works for everyone. I will try to better plan and organize RPC tasks in the future and inform everyone affected well in advance.
if the ggufs were already deleted (and/or there was an imatrix file) then the job finished successfully.
There was no screen window to attach to and no imatrix log files for me to check so I couldn't tell for sure, but there should have been no way for the job to finish successfully. There was an internet outage caused by Threadripper crashing when they both were around half-way through. imatrix jobs are reading their training data from remote storage so they really should have failed.
when you then delete the result, the job is in permanent failure, because that's not a valid situation (the states are either: gguf has been created and job is not done, or gguf does not exist and job was done). it would also stay in error state if the connection fails, but the status would update and it could be recovered.
I left the result files there for quite a while before deleting them and they didn't get submitted and the status stayed at half-way done. Sorry next time I will leave them.
how terrible I worded my request.
Well, with your explanation, it makes a lot more sense, regardless of the wording. And I guess you got a fair share of my grumpiness in the past, so let's call it even. Ehem.
On the positive side, we had an unprecedented lull today (that means not another 5-10 70b's), so it is, after all, an ideal time. I did hold back about 30 70b imatrix jobs so far, though.
There was no screen window
Yeah, I mentioned it a while ago in passing, there is no screen anymore (and with imatrix jobs, the screen was running on kaos anyway).
imatrix jobs are reading their training data from remote storage so they really should have failed.
Once, when they start. imatrix jobs normally finish successfully (meaning they produce an imatrix file) even if the connection dies at chunk #1, fortunately. What does happen is that they go into a failure state, because the scheduler does not know what really happened, but once started, they run independently.
I left the result files there
Just to be clear, the MODEL.imatrix~ is only an intermediate result. If that is what you mean by result, then the job wasn't finished, and deleting it wouldn't cause any issue, as the job would have to run again anyway. If there was a MODEL.imatrix, then the whole job did finish successfully.
I assume you deleted the MODEL.imatrix~ file, so no harm done. If there is a MODEL.imatrix file, though, the next time the job would start it would skip the call to llama-imatrix.
new kind of hf bug, hf claiming a repo to be gated when it isn't: Cannot access gated repo for url https://huggingface.co/wanlige/li-14b-v0.4/resolve/28003038d56fc3a65f3d807e8c4a527b437075dc/.gitattributes.
new kind of hf bug, hf claiming a repo to be gated when it isn't: Cannot access gated repo for url https://huggingface.co/wanlige/li-14b-v0.4/resolve/28003038d56fc3a65f3d807e8c4a527b437075dc/.gitattributes.
The model is not gated but private. Just strange how we can even see a private repository. You could ask the author why it is private and if he intends to make it public.
right, that explains it. sounds like a security bug to me, i shouldn't be able to see private repos' README.md.
and no, i had no indication that it is private, so i only wanted to s-quant it because it might potentially look interesting. not a big deal.
I feel that almost 10 times the time for an f16 405B imatrix over Q8_0 is likely not worth it.
right, that explains it. sounds like a security bug to me, i shouldn't be able to see private repos' README.md.
Yes, this is clearly a security-critical HuggingFace bug. No files of a private repository, including the README.md, should be publicly accessible. Quite shocking that they messed this up for this specific model.
I feel that almost 10 times the time for an f16 405B imatrix over Q8_0 is likely not worth it.
I agree. Doing it over RPC is only worth it if it is a base model, an extremely popular model, or a model one of our team members really likes and personally uses, like in this case Guilherme and me. While I never measured it, I’m almost certain Q8 will not impact the imatrix quality in any measurable way, as in the model quality measurements I was unable to see any real-world difference between BF16 and Q8.
We however should always use RPC if Q8 doesn’t fit and it is a somewhat popular model because Q6 will likely lead to some minor quality degradation. Maybe I will one day measure this to make sure that this is the case.
No idea why this time it takes a few hours longer than usual for a 405B model. Maybe llama.cpp made some changes that make RPC slower, or maybe it is because of the way I distributed the quants to make sure StormPeak has as much free RAM as possible, putting more layers on slower nodes, or maybe it is just because we max out the CPU on nico1 with quantization tasks.
In the end, as long as the RPC imatrix computation completes before any of the workers run out of work, everything should be fine, as using 2x4090 nico1 should easily be able to catch up with the imatrix queue. Especially now that nico1 can still work at full power during imatrix computation, the impact on our operations is much less severe than in the past when we had nico1 completely idle. Even for R1 RPC imatrix, where RAM is even tighter, one quantization task on nico1 might still be possible.
I wrote a nice python script to make finding great models easier. It goes through the first 10 pages of trending models, checks for each of them that we are missing whether its architecture is supported by llama.cpp, and applies some further filters to then give you a final list of trending models to manually review and queue.
What is cool is that it obtains all this information dynamically. It gets all the supported architectures by parsing convert_hf_to_gguf.py on llama.cpp master, it gets the models we already have by parsing repolist.gz, and it automatically detects each model's architecture by fetching and caching its config.json
import requests
from lxml import html
import os, json, gzip, io, re

def extract_text_from_page(page_content):
    tree = html.fromstring(page_content)
    texts = []
    for i in range(1, 31):
        #xpath = f"/html/body/div/main/div/div/section[2]/div[2]/div/article[{i}]/a/div/header/h4/text()"
        xpath = f"/html/body/div/main/div/div/section[2]/div[3]/div/article[{i}]/a/div/header/h4/text()"
        result = tree.xpath(xpath)
        if result:
            texts.append(result[0].strip())
    return texts

def download_and_cache(url, cache_dir="cache"):
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    filename = "_".join(url.split("/")[3:])
    cache_path = os.path.join(cache_dir, filename)
    if os.path.exists(cache_path):
        print(f"Loading {filename} from cache.")
        with open(cache_path, 'r', encoding='utf-8') as f:
            data = f.read()
    else:
        try:
            response = requests.get(url)
            response.raise_for_status()
            data = response.text
            with open(cache_path, 'w', encoding='utf-8') as f:
                f.write(data)
            print(f"Downloaded and cached {filename}.")
        except requests.exceptions.RequestException as e:
            print(f"Error downloading {url}: {e}")
            with open(cache_path, 'w', encoding='utf-8') as f:
                f.write('{}')
            return None
    try:
        json_data = json.loads(data)
        return json_data
    except ValueError as e:
        print(f"Error parsing JSON from {url}: {e}")
        return None

url = "https://hf.tst.eu/repolist.gz"
response = requests.get(url)
if response.status_code == 200:
    with gzip.GzipFile(fileobj=io.BytesIO(response.content)) as f:
        file_content = f.read().decode('utf-8')
    existingModels = {line.split('/')[-1] for line in file_content.splitlines()}
else:
    print(f"Failed to download the repolist. Status code: {response.status_code}")
    exit(1)

url = "https://raw.githubusercontent.com/ggml-org/llama.cpp/refs/heads/master/convert_hf_to_gguf.py"
response = requests.get(url)
if response.status_code == 200:
    script_content = response.text
    # Regular expression to find all registered models
    pattern = r'@Model\.register\("([^"]+)"(?:, "([^"]+)")*(?:, "([^"]+)")*\)'
    matches = re.findall(pattern, script_content)
    supportedArchitectures = set()
    for match in matches:
        for model in match:
            if model:
                supportedArchitectures.add(model)
else:
    print(f"Failed to download the convert_hf_to_gguf.py. Status code: {response.status_code}")
    exit(1)

for supportedArchitecture in supportedArchitectures:
    print(supportedArchitecture)

modelsDict = {}
for page in range(11):
    response = requests.get(f"https://huggingface.co/models?pipeline_tag=text-generation&library=safetensors&p={page}&sort=trending")
    models = extract_text_from_page(response.content)
    for model in models:
        if model.split('/')[-1] in existingModels:
            print(f'Skipping {model}...')
            continue
        url = f'https://huggingface.co/{model}/raw/main/config.json'
        data = download_and_cache(url)
        if data and "architectures" in data and len(data["architectures"]) > 0:
            modelsDict[model] = data["architectures"][0]

for entry in modelsDict.items():
    print(entry)

toDo = list()
for model, architecture in modelsDict.items():
    modelLower = model.lower()
    if architecture in supportedArchitectures:
        if "-gguf" in modelLower:
            print("Skipping GGUF:", model)
            continue
        if "-awq" in modelLower:
            print("Skipping AWQ:", model)
            continue
        if "-4bit" in modelLower:
            print("Skipping 4bit:", model)
            continue
        if "-8bit" in modelLower:
            print("Skipping 8bit:", model)
            continue
        if "-mlx" in modelLower:
            print("Skipping mlx:", model)
            continue
        if "unsloth/" in model:
            print("Skipping unsloth:", model)
            continue
        if "mlx-community/" in model:
            print("Skipping mlx-community:", model)
            continue
        print("ToDo:", model)
        toDo.append(f'https://huggingface.co/{model}')
    else:
        print("Unsupported:", model, "due to", architecture)

print("===================================================")
for item in toDo:
    print(item)
I wonder if there is a command to transfer tasks to different nodes. There doesn't seem to be one listed inside llmc help, but on the status page it states:
.../hfd download from huggingface or transfer of source model between machines
Is this only used for imatrix computation, which all must happen on nico1, or is there a possibility to move some of the tasks waiting for imatrix from rain to nico1, so that rain has enough budget to take on some other tasks until the required imatrix is ready? I have the feeling that if we don't do something within the next few hours, rain might run out of tasks due to its budget being filled with blocked tasks.
I already wrote that it isn't the extra quantisations, because I stopped them for almost an hour to see if it has an influence. Even without them the time would be ~1500min.
In the end, as long as the RPC imatrix computation completes before any of the workers run out of work, everything should be fine, as using 2x4090 nico1 should easily be able to catch up with the imatrix queue.
imatrix generation is not the bottleneck, quantising is.
My definition of fine is that we are on the path to normalcy. And this sets us back by a day, probably even more (a day of doing only static jobs leaves 3+ days of imatrix jobs in the queue). The nodes only don't run out of work because they accumulate more and more imatrix jobs that they have to work on later. We were just at the point where we could work on some of the non-ultra-urgent models (for example, lumikabra on nico1 is in the queue for 20 days now).
I wonder if there is a command to transfer tasks to different nodes.
No, because the task of moving all the related (and big) files over the network is a lot of business to implement, and it's practically never needed. I would have had some need for it in the past because some nodes are faster but have a smaller disk, to allow them to quantize bigger models. Keep in mind that moving models incurs a lot of I/O for the source node, especially for a node like rain that has rotating disks and is somewhat limited by them.
Dumb question: are you responsible for all these imatrix models on rain? Why would you do that when you know we are in trouble due to being blocked for a day. The queue as it was should have been fine as rain would pick predominantly static-only models. Remember that queuing a single day of imatrix models causes three or more days of queue to work on after nature-AGI-1 is through.
As a side note, you should really stick to the 2000/0/1000 categories (or at least broad categories such as "10" if you really want a lower priority than 0) - by assigning these distinct nice levels you essentially force a total order, leaving little freedom for the scheduler to shuffle things around. We only have so many nice levels right now because we are basically not keeping up, to make things slightly more efficient - the 400 (now 40) models are essentially delayed 0-level models.
I really start to hate nature-agi-1.
(script)
I think there is great value in having a different script and selecting things differently, which is why I deliberately didn't expose you to mine. My script also marks what was already queued for me, so that shouldn't cause interference between us.
As a minor sidenote, the library=safetensors filter skips about half of the models, not sure about pipeline_tag. The hf tags are not useful for filtering, IMHO. Also, convert_hf_to_gguf.py can actually print the list of supported architectures, but my script also parses the source (well, not the filter script, the script that installs a new llama.cpp version), because that option seems to be new.
And regarding rain, if for some reason a node is in this specific state (queue is full) it will not automatically accept models, but you can still push-model stuff to it (e.g. static-only ones). And if you know what you are doing, you could force running them with .nobudget and .force, although it is questionable to go beyond the budget limits I set on nodes not under your control. But that works for nodes where you know what is going on (e.g. rich1 and nico1). You can see the actual disk-free value (free X), which is generally the real limit where things start to fail. If free space goes dangerously low, you can nuke and requeue. Or you could delete the gguf file via the shell command. That will recreate them when the next job starts (i.e. it works while "blocked" and obviously will error the job in other states, but audit can be used to get out of that).
You can also "move" models by nuke'ing them and add'ing them with a specific worker, when blocked, potentially push-model'ing them. Of course, that creates a lot of I/O, too, but at least not on the source node. The main reason why this is bad is because quite a few models need manual fixing, so there is no way to do what imatrix does with hfdprep fully automatically, without an rsync.
Also, lastly, some thoughts on more memory vs. less on nico1 while rpc jobs run. It is really hard to say. Right now, nico1 fortunately has some very big models to crunch on, so being able to do that is good for overall queue efficiency when tight. It's not necessarily the models I want it to work on, but at least it does good work.
If we ever reach our normal steady-state again, the time where we can't do imatrix tasks is probably more important though.
However, given that one 4090 has 5GB free memory, I was close to trying to imatrix some smaller (say, 30B and smaller) on it while the rpc job was running. (the other is chock-full). Would have been a lot of hassle to try out - I would have to lie creatively about the available budget among other things (it's really hard to model the rpc job), so I did not do that, but that might be even better, i.e. leave as much space available on nico1 so some imatrix jobs can be done concurrently. I even considered fixing my older imatrix scheduler so I can calculate imatrices at home again, but that would be similarly limited (96GB RAM).
In the end, however, I suspect that the solution is completely different: realise that mradermacher is doing factory-line quantisation for the broad masses, not necessarily for fringe models or models that need a lot of manual tweaking - we could leave this to other people.
Not sure who these other people will be, but it seems bartowski has acquired a lot more powerful hardware than us, for example, so he would be ideally suited for these big models, and that would be fine.
I'm not saying it isn't exciting to try, but at the end of the day, there is no point in letting pride get in the way of the mission :-)
And personally I really need to see some progress on the queue - we didn't have any for a month now. I really dread every time we use the rpc set-up at the moment.
My definition of fine is that we are on the path to normalcy. And this sets us back by a day, probably even more (a day of doing only static jobs leaves 3+ days of imatrix jobs in the queue).
We would have to do those static quants at some point anyway, so it doesn't really set us back. I guess it depends on how you define normality. For me, normality is when I see zero lownice quants in the queue. Currently there are 489, and all the work the workers performed today went towards that goal. But I also see your perspective of normality where all the workers have a low task backlog so you can queue your daily 0-priority quants without hitting budget.
As a side note, you should really stick to the 2000/0/1000 categories (or at least broad categories) - by assigning these distinct nice levels you essentially force a total order, leaving little freedom for the scheduler to shuffle things around.
Good point. I'm mostly queueing trending models. I don't really want them to impact your highly desired 0 priority models but basically want them to be done as soon as we run out of 0 priority models. The reason I don't put them as a 40 model is because if it takes weeks/months for them to be processed, then by the time we get to them the models will no longer be trending and our work is kind of useless, as everyone has already moved on to the latest trending models. Maybe the right approach would be to queue them as 0-priority anyway, but I trust your model selection more than mine. What I will for sure do in the future is queue them all at the same priority to give the scheduler at least some flexibility.
Dumb question: are you responsible for all these imatrix models on rain? Why would you do that when you know we are in trouble due to being blocked for a day. The queue as it was should have been fine as rain would pick predominantly static-only models.
Oh shit, I haven't even thought about that. I indeed queued a few trending models today, most of them "si". Back when I queued them we still had some of the priority 8, 9 and 10 models, so I didn't expect them to get assigned to workers anytime soon, as it usually takes days if I put something above 1. But now that I think about it, it makes sense that in the current situation new tasks are getting picked up much quicker than usual, as everyone is starving for work.
I really start to hate nature-agi-1.
RPC imatrix computation is a massive pain and it makes me really worried about having to requant the DeepSeek V3/R1 models, for which we will likely have to recompute all the imatrices over RPC, as the new GGUF files now contain a kv_b_proj layer for MLA, but we should at least try whether we can apply the old imatrix files to the new source GGUF. For DeepSeek V3/R1 we are not only much tighter on RAM, it also takes around 30 hours per imatrix. We could do imatrix computation on Q5_K_M at 475.5 GB, which should fit on nico1, but it seems like the wrong thing to do for what is the most popular and best openly released model.
I think there is great value in having a different script and selecting things differently, which is why I deliberately didn't expose you to mine. My script also marks what was already queued for me, so that shouldn't cause interference between us.
I feel the same. I’m quite happy that we use different scripts and factors to decide which models to queue. We both value slightly different kinds of models. While I’m mainly chasing down trending models, models good for single-turn Q&A, uncensored models and medical models, you seem to mainly focus on popular models and models for roleplay and story writing.
As a minor sidenote, library=safetensors filters skips about half of the models, not sure about pipeline_tag. The hf tags are not useful for filtering, IMHO.
I noticed that as well. You for sure noticed the commented out xpath. That is exactly for when I run it without a filter. Currently I’m only queuing models that pass the filter but soon my filtering will have improved enough that I will no longer need it.
Also, convert_hf_to_gguf.py can actually print the list of supported architectures, but my script also parses the source (well, not the filter script, the script that installs a new llama.cpp version), because that option seems to be new.
Nice to know. I wasn’t aware of this. But in any case, just parsing the information out of the source file seems easiest for my use case.
And regarding rain, if for some reason a node is in this specific state (queue is full) it will not automatically accept models, but you can still push-model stuff to it (e.g. static-only ones). And if you know what you are doing, you could force running them with .nobudget and .force, although it is questionable to go beyond the budget limits I set on nodes not under your control. But that works for nodes where you know what is going on (e.g. rich1 and nico1). You can see the actual disk-free value (free X), which is generally the real limit where things start to fail. If free space goes dangerously low, you can nuke and requeue. Or you could delete the gguf file via the shell command. That will recreate them when the next job starts (i.e. it works while "blocked" and obviously will error the job in other states, but audit can be used to get out of that).
You can also "move" models by nuke'ing them and add'ing them with a specific worker, when blocked, potentially push-model'ing them. Of course, that creates a lot of I/O, too, but at least not on the source node. The main reason why this is bad is because quite a few models need manual fixing, so there is no way to do what imatrix does with hfdprep fully automatically, without an rsync.
Thanks a lot for the valuable hints! I will keep them in mind. For rich1 and nico1 I can probably almost always temporarily move things to some external disk and softlink them to free up some storage, as I can control what resources to allocate to those workers. I don’t think I would be comfortable bypassing the budget on other nodes, but the nuking method, while not perfect, might be a great choice in this case, as the alternative would be a node just sitting around idle.
Also, lastly, some thoughts on more memory vs. less on nico1 while rpc jobs run. It is really hard to say. Right now, nico1 fortunately has some very big models to crunch on, so being able to do that is good for overall queue efficiency when tight. It's not necessarily the models I want it to work on, but at least it does good work.
If we ever reach our normal steady-state again, the time where we can't do imatrix tasks is probably more important though.
I also felt that currently working on quants was more important than having slightly faster RPC imatrix computation because at some point all this work would have needed to be done anyways. It’s also nice to finally get those 405B models done. I don’t like having so many massive models simultaneously but luckily the 20 TB HDD is doing great work to avoid a total storage nightmare.
While we are at that 20 TB HDD: I saw a lot of btrfs-endio-write btrfs_work_helper processes hitting hung_task_timeout_secs after 125 seconds when I downloaded a model to it today at 14:22. No idea why it happened, but the issue fixed itself and the HDD has been working without any issues or S.M.A.R.T. indications since, so likely just some random BTRFS issue I encountered. On some good news, the 3x 18 TB HDDs I ordered in early January might arrive in 2 to 3 weeks if they don’t change the delivery date again, which is the soonest they have ever shown.
However, given that one 4090 has 5GB free memory, I was close to trying to imatrix some smaller (say, 30B and smaller) on it while the rpc job was running. (the other is chock-full). Would have been a lot of hassle to try out - I would have to lie creatively about the available budget among other things (it's really hard to model the rpc job), so I did not do that, but that might be even better, i.e. leave as much space available on nico1 so some imatrix jobs can be done concurrently. I even considered fixing my older imatrix scheduler so I can calculate imatrices at home again, but that would be similarly limited (96GB RAM).
It would have worked. One of the 4090 GPUs just has a specific number of layers offloaded and so there will always be enough GPU memory for you to run imatrix without GPU offloading. So doing small models while RPC is running would for sure be possible. We would just need to be really careful not to run out of RAM.
In the end, however, I suspect that the solution is completely different: realise that mradermacher is doing factory-line quantisation for the broad masses, not necessarily for fringe models or models that need a lot of manual tweaking - we could leave this to other people.
Thing is that those massive models are kind of what I’m personally most interested in using myself, and I feel like if I put all these resources into quantizing thousands of models, the ones I use the most had better be quantized by our team and at the highest quality possible. Currently I’m indeed 90% of the time using Nature-Reason-1-AGI, mainly due to it being the largest uncensored reasoning model I can run at reasonable speed, as without MLA DeepSeek R1 on llama.cpp is painfully slow.
And personally I really need to see some progress on the queue - we didn't have any for a month now. I really dread every time we use the rpc set-up at the moment.
Now that everything infrastructure-wise is working again, we could maybe soon set up nico2 and nico3. I'm quite scared about internet bandwidth usage unless we make them CPU-intensive IQ-quants only, but we could also risk it and hope my ISP doesn't kick me out if we exceed 500 TB/month of bandwidth.
I'm mostly queueing trending models. I don't really want them to impact your highly desired 0 priority models, but basically want them to be done as soon as we run out of 0 priority models.
The trending models are probably much more desired than the daily ones. In the end, I would assume there should be considerable overlap, but since our methods are distinct, they should complement each other.
The reason I don't put them as a 40 model is because if it takes weeks/months for them to be processed...
I agree. Right now, the 40 priority models are mostly models with high download counts that I had missed earlier (due to filtering by library=transformers), plus models I have skipped over the last month because... we just couldn't. I recently started to only statically quantize some bigger models, intending to queue imatrix jobs at a later date.
RPC imatrix computation is a massive pain, and it makes me really worried about having to requant the DeepSeek V3/R1 models, for which we will likely have to recompute all the imatrices over RPC because the new GGUF files now contain a kv_b_proj tensor for MLA. But we should at least try whether we can apply the old imatrix files to the new source GGUFs.
Well, if the output changes, we should make new imatrices. If the imatrix has missing entries for tensors that are quantized to few bits, then llama-quantize will likely crash or abort.
You for sure noticed the commented out xpath.
Nope. Xpath, cool. Why would you even need... ah, you are essentially parsing the html page. Right... the trending info is not available via the api? That totally escaped me.
For rich1 and nico1 I can probably almost always temporarily move things to some external disk and softlink them to free up some storage, as I can control what resources to allocate to those workers.
I need to add those external locations to the llmjob runner though, so they are visible inside the sandbox. Maybe it's slowly starting to get time for a config file...
It’s also nice to finally get those 405B models done
Love your optimism, but they are far from done, and any time we could get more :) I plan to have one running at all times, maybe in addition to the two normal jobs, leaving the source on /gpool, which essentially gives it a background priority.
hung_task_timeout_secs
That in itself is just a friendly notice by your kernel, not an indication of a bug. It's not uncommon to increase it on a busy server. When a lot of complicated transactions have been made, or there is simply a lot of bulk I/O, transactions can indeed hang for a long time. And it is being used quite a bit at the moment.
While we are at it, could you mount my / without discard and have a daily or so fstrim cronjob instead? Deleting large ggufs regularly causes long hangs. It's really just a minor optimisation, but it bugs me a lot for some reason :) I even moved some deletes into the background :)
Currently I’m indeed 90% of the time using Nature-Reason-1-AGI
I totally understand, and I am as willing to do large quants in the future as I was in the past. Just trying to put things into perspective (and getting priorities straight). I don't see the queue growing due to overlooked models much in the future.
better be quantized by our team and at the highest quality possible
Yes yes, aim high, not low.
I'm quite scared about internet bandwidth usage unless we make them CPU-intensive IQ-quants only, but we could also risk it and hope my ISP doesn't kick me out if we exceed 500 TB/month of bandwidth.
Unfortunately, nico1 is also the best host for static quants. rain/back/kaos/marco are often I/O limited, for example. rich1, too. leia less, because I queue smaller jobs on it.
For example, I was even thinking about splitting jobs by quants, so that fast quants (Q*, essentially) are done by nico1 and slow ones (IQ* essentially) are done by other hosts. To get more throughput. At the expense of disk wear and increased I/O. Didn't seem appealing enough for me so far, but I did think of it :)
nico2 and nico3. I'm quite scared about internet bandwidth usage unless we make them CPU-intensive IQ-quants only, but we could also risk it and hope my ISP doesn't kick me out if we exceed 500 TB/month of bandwidth. I really can't know how your ISP will react.
If we only modestly increase quanting throughput we should also be able to stay under 500 TB/month.
I would be ok with continuing as it is. I hope the current fad of kicking out 20 70b's per day will die down when everybody has distilled and sce-merged enough. Maybe they will start doing that to 405Bs instead, but hopefully not.
In any case, it would hopefully be a one-time thing (the current queue). If I ever get to queue 2023 models, it will be far, far fewer of them, because it will only be models I will personally recognize somehow :)
Currently I'm indeed 90% of the time using Nature-Reason-1-AGI, mainly due to it being the largest uncensored reasoning model I can run at reasonable speed, as without MLA, DeepSeek R1 on llama.cpp is painfully slow.
If you want to use MLA now, this post compares two inference engines with MLA support: https://www.reddit.com/r/LocalLLaMA/comments/1iq6ngx/ktransformers_21_and_llamacpp_comparison_with/ . If you can't use a GPU you can only use ik_llama.cpp, and you'd have to convert the model (hopefully to the same format llama.cpp lands on; it does have a small "redundant" tensor that allows you to run either configuration with just a flag). If you can use CUDA, ktransformers is definitely faster (I wish I could, but I can't, as my GPU isn't in my server). MLA is definitely nice though.
ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.weight_scale_inv'
Is there something special I need to do to convert a deepseek model, or does the above just indicate that I am trying to convert a broken model? (it's Z1-Zero in /gpool).
Is there something special I need to do to convert a deepseek model, or does the above just indicate that I am trying to convert a broken model? (it's Z1-Zero in /gpool).
All the DeepSeek V3/R1 based models are a pain to convert to a GGUF. You first need to convert them from FP8 to BF16, which makes them use 1.3 TB of storage, and then convert them to a 1.3 TB GGUF. It is not possible to directly convert the source model to GGUF without going over BF16. But keep in mind that the source GGUF will be useless once https://github.com/ggml-org/llama.cpp/pull/11446 gets merged, so I'm really not sure if we should spend many resources on quantizing them before this is merged, but I guess static quants might be justified. I recommend keeping the FP8 model so we can properly GGUF them once it is merged, as long as we have enough HDD storage to do so. Not having any idea when this will be merged makes planning incredibly difficult.
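Roughly, the two steps look like this (just a sketch; the paths are placeholders, the first script is DeepSeek's fp8_cast_bf16.py and the second is llama.cpp's convert_hf_to_gguf.py):
# step 1: FP8 -> BF16 safetensors (produces roughly 1.3 TB of output)
python fp8_cast_bf16.py --input-fp8-hf-path /HDD/DeepSeek-R1 --output-bf16-hf-path /HDD/DeepSeek-R1-BF16
# step 2: BF16 safetensors -> source GGUF (another roughly 1.3 TB)
python convert_hf_to_gguf.py /HDD/DeepSeek-R1-BF16 --outfile /HDD/DeepSeek-R1.gguf --outtype bf16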
Thanks! Takes longer than I thought to get this merged. But I guess I'll hold back then and keep the downloaded model on /gpool for the time being.
Although the error (can not map tensor) is not normally indicative of a format problem - the converter simply doesn't recognize that tensor. Anyway, I think I will crap out and kindly ask you to convert it for me when the patch is merged :)
@mradermacher
I created a new node named nico2. You can access it over SSH using [email protected]:2111
as I copied the authorized_keys from nico1. For WireGuard I opened UDP port 7104, as port 7103 was already in use by nico1.
nico2 currently has the following specifications:
CPU: AMD Ryzen Threadripper 3970X (32 cores 64 threads)
RAM: 256 GB DDR4 ECC
Storage: ADATA SX8200PNP 2 TB, ZFS formatted, 88% empty.
PSU: Seasonic PRIME PX-2200 2200W ATX 3.1
Cooler: be quiet! Silent Loop 2
LAN 0: 10 Gbit internet access and intranet access to nico1 over 10 Gbit internet switch
LAN 1: 10 Gbit intranet access to nico1 over 10 Gbit intranet switch (recommended to use for transfer between nico1 and nico2)
GPU: RTX 3080 (currently not attached to your container)
OS: Debian 12
Thanks! Takes longer than I thought to get this merged.
And what is even worse: the pull request is still in draft stage with a lot of fundamental discussions still going on, so it doesn't look like it is getting merged anytime soon. And this despite there not being a single code change for the past 3 weeks.
But I guess I'll hold back then and keep the downloaded model on /gpool for the time being.
Let's at least try whether it works with BF16, because FP8 not working is absolutely expected. I'm now using the following command to convert it. I expect this to take a long time, and it will make use of one of the GPUs, but not enough that it would conflict with imatrix computation:
CUDA_VISIBLE_DEVICES=1 venv/bin/python fp8_cast_bf16.py --input-fp8-hf-path /HDD/Z1-Zero --output-bf16-hf-path /HDD/Z1-Zero-BF16
Although the error (can not map tensor) is not normally indicative of a format problem - the converter simply doesn't recognize that tensor. Anyway, I think I will crap out and kindly ask you to convert it for me when the patch is merged :)
Well unless fp8_cast_bf16.py
processes this exact tensor...
for weight_name, weight in current_state_dict.items():
    if weight_name.endswith("_scale_inv"):
        continue
    elif weight.element_size() == 1:  # FP8 weight
        scale_inv_name = f"{weight_name}_scale_inv"
        try:
            # Get scale_inv from the correct file
lucky that i sneaked in a "normally" there anyway. the fp8 to bf16 step is what i was missing
but how did you even get the idea of checking the safetensor against the base model? you looked at the commit history?
somewhat unrelated, I've cleaned up llama.cpp usage, and it should now be possible to use any custom llama.cpp variant per-job. i'd even support if somebody else (cough) would take over maintaining any llama.cpp forks we might want to use. all that is required is to have some llama.cpp source directory with a build directory under it where it was built (I use cmake).
Is there something special I need to do to convert a deepseek model, or does the above just indicate that I am trying to convert a broken model? (it's Z1-Zero in /gpool).
All the DeepSeek V3/R1 based models are a pain to convert to a GGUF. You first need to convert them from FP8 to BF16 which makes them use 1.3 TB of storage and then convert them to a 1.3 TB GGUF. It is not possible to directly convert the source model to GGUF without going over BF16.
https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6 This is another method that saves you a step.
@mradermacher
I just got wake on LAN working for nico2, so you can execute /root/wakeNico2.sh on nico1 to turn on nico2 should it be off. To shut down the nico2 host, execute /root/shutdownHost.sh.
You can access nico2 from nico1 using ssh [email protected]. You can access nico1 from nico2 using ssh [email protected]. Like on nico1, there are /host/proc/meminfo, /host/proc/stat, /host/proc/uptime and /host/proc/vmstat to check stats of the host. But given that currently your LXC container is usually the only thing running on CastlePeak, it is unlikely you need that.
memlock is set to unlimited and nproc to 4000 like on nico1
I changed the sysctl.conf on CastlePeak as follows to match what we set on StormPeak:
# allow TCP with buffers up to 128 MiB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# set default buffer size during socket creation to 2 MiB
net.core.rmem_default = 2097152
net.core.wmem_default = 2097152
# increase TCP autotuning buffer limits
net.ipv4.tcp_rmem = 4096 4194304 67108864
net.ipv4.tcp_wmem = 4096 4194304 67108864
# Sets maximum size of the network interface's receive queue
net.core.netdev_max_backlog = 30000
# Use improved TCP congestion control algorithm
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
I now know the probable cause of why the RPC setup was so slow. When I checked the CastlePeak BIOS settings to enable wake on LAN, I realized that the PCIe port currently hosting the RTX 3080 GPU was set to x4x4x4x4 mode instead of x16, because I previously plugged 3x Intel ARC GPUs into that port using PCIe bifurcation a few weeks ago and forgot to change it back.
but how did you even get the idea of checking the safetensor against the base model? you looked at the commit history?
I for sure did not believe him that he was able to finetune DeepSeek R1, as none of the public frameworks currently support it on a setup with less than 1500 GB of GPU memory, which is only possible to get on AMD in a single node, and AMD is not DeepSeek R1 compatible, so I obviously wanted to know if he did finetune it or is just a fraud. The commit message "Duplicate from deepseek-ai/DeepSeek-R1" visible in the model card and the commit history instantly gave away that he just cloned DeepSeek R1. But even without it I obviously would have compared hashes with all other publicly released V3/R1 models.
What is much more interesting is what he uploaded now. Did he really accidentally claim a DeepSeek R1 clone to be his DeepSeek R1 Zero finetune and now upload a real model he created using his own custom finetuning code, or is it yet another lie? At first glance it looks real, so I downloaded it to gpool, but I have only spent about a minute investigating it so far, as I was busy setting up nico2.
The trending models are probably much more desired than the daily ones. In the end, I would assume there should be considerable overlap, but since our methods are distinct, they should complement each other.
Great to know so I will queue them using priority 0 as well.
I recently started to only statically quantize some bigger models, intending to queue imatrix jobs at a later date.
Just make sure not to forget about them
Well, if the output changes, we should make new imatrices. If the imatrix has missing entries for tensors that are quantized to few bits, then llama-quantize will likely crash or abort.
Yes, we unfortunately likely have to compute the imatrices of all the DeepSeek V2/V3/R1 models again. For V3/R1 we will need RPC to do so in Q8.
Nope. Xpath, cool. Why would you even need... ah, you are essentially parsing the html page. Right... the trending info is not available via the api? That totally escaped me.
There is https://huggingface.co/models-json?pipeline_tag=text-generation&library=safetensors&p=1&sort=trending&withCount=false which should contain the same information and to which I might switch soon. Keep in mind that before my Python script, I copied some JavaScript code into the Firefox developer console, so getting the data out of the HTML using XPath was easier. I often prefer getting data out of HTML: it usually has fewer rate-limit issues, and XPath is a well-defined, well-established standard, while JSON Path got standardized not even a year ago.
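For example, a quick way to look at that endpoint before switching the script over (jq just pretty-prints the response; I'm not relying on any particular field names here):
# fetch the trending list as JSON instead of scraping the HTML page
curl -s 'https://huggingface.co/models-json?pipeline_tag=text-generation&library=safetensors&p=1&sort=trending&withCount=false' | jq .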
I need to add those external locations to the llmjob runner though, so they are visible inside the sandbox. Maybe it's slowly starting to get time for a config file...
That would be cool especially should we decide to add external storage to rich1 as well.
Love your optimism, but they are far from done, and any time we could get more :) I plan to have one running at all times, maybe in addition to the two normal jobs, leaving the source on /gpool, which essentially gives it a background priority.
Having one with gpool as the source running all the time, in addition to the two normal jobs, makes a lot of sense, as HDDs are much slower than SSDs.
That in itself is just a friendly notice by your kernel, not an indication of a bug. It's not uncommon to increase it on a busy server. When a lot of complicated transactions have been made, or there is simply a lot of bulk I/O, transactions can indeed hang for a long time. And it is being used quite a bit at the moment.
That explains why it happened when I was hfd-ing massive models to it with 8 threads... I was quite worried about it and am relieved to know that this is normal for BTRFS HDDs.
While we are at it, could you mount my / without discard and have a daily or so fstrim cronjob instead? Deleting large ggufs regularly causes long hangs. It's really just a minor optimisation, but it bugs me a lot for some reason :) I even moved some deletes into the background :)
I will do so the next time we have to reboot nico1 and rich1. I read there is discard=async in BTRFS, which makes even more sense, because only trimming once a day might mean slower write speed due to trying to write on non-trimmed blocks.
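Roughly, the two options would look like this (just a sketch; the schedule and cron file name are placeholders):
# keep discard, but make it asynchronous
mount -o remount,discard=async /
# or drop discard entirely and trim once a day instead
echo '0 3 * * * root /sbin/fstrim -av' > /etc/cron.d/fstrim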
I totally understand, and I am as willing to do large quants in the future as I was in the past. Just trying to put things into perspective (and getting priorities straight). I don't see the queue growing due to overlooked models much in the future.
I believe and hope this is a one-time thing, and once we are finally done with the massive backlog things should relax. It was a crazy past 4 months.
Unfortunately, nico1 is also the best host for static quants. rain/back/kaos/marco are often I/O limited, for example. rich1, too. leia less, because I queue smaller jobs on it.
I wonder if we could do something about rich1 being I/O limited. Currently we are using the 2 TB NVMe SSD, but there is another 1 TB SATA SSD we could use. I already thought about RAID 0-ing them together, but RAID 0-ing a fast NVMe SSD with a slower SATA SSD seems like a bad idea. We could also disable BTRFS compression and see if that helps. I likely have to disable discard on rich1 as well anyway.
For example, I was even thinking about splitting jobs by quants, so that fast quants (Q*, essentially) are done by nico1 and slow ones (IQ* essentially) are done by other hosts. To get more throughput. At the expense of disk wear and increased I/O. Didn't seem appealing enough for me so far, but I did think of it :)
I really couldn't care less about disk wear. If we continue at the current rate, they will last another 3 years, and I wouldn't mind if they break earlier, as then I have a reason to replace them with high-quality 4 TB SSDs. I currently filled all 8 NVMe slots of StormPeak, so I can't add any more of them without replacing an existing one. Currently they are at 25% and 21% wear.
But only doing non-IQ quants would be an internet bandwidth concern. While there is no fair use clause in my contract, testing out how much my ISP is willing to tolerate before kicking me out is not the smartest idea, given that all other ISPs use the unreliable fiber network maintained by Swisscom instead of the stable, high-quality fiber network maintained by Quickline. But I guess nico2 is worth risking pushing the limits at least a bit.
If we only modestly increase quanting throughput we should also be able to stay under 500 TB/month.
I think so as well. This is not really a hard limit anyway. It's just what their competitor Init7 put as fair use into their contract, so if they complain and I'm below 500 TB/month, I could tell them that their competitor would be fine with me using as much traffic as I do.
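Just to put that number into perspective, 500 TB/month works out to roughly 190 MB/s (about 1.5 Gbit/s) of sustained traffic, assuming a 30-day month:
echo $((500 * 10**12 / (30 * 24 * 3600) / 10**6)) # ~192 MB/s sustained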
I would be ok with continuing as it is. I hope the current fad of kicking out 20 70b's per day will die down when everybody has distilled and sce-merged enough. Maybe they will start doing that to 405Bs instead, but hopefully not.
I wouldn't count on it so we better slightly increase our throughput so we can finally catch up with the latest models and work on our massive backlog.
In any case, it would hopefully be a one-time thing (the current queue). If I ever get to queue 2023 models, it will be far, far fewer of them, because it will only be models I will personally recognize somehow :)
That's what I'm thinking as well. We do have the resources required to scale up, so let's do it.
somewhat unrelated, I've cleaned up llama.cpp usage, and it should now be possible to use any custom llama.cpp variant per-job.
That's cool so we can finally redo the snowflake arctic models because we all miss massive models.
i'd even support if somebody else (cough) would take over maintaining any llama.cpp forks we might want to use. all that is required is to have some llama.cpp source directory with a build directory under it where it was built (I use cmake).
Sure, if you create a pull request with your own llama.cpp patches to https://github.com/nicoboss/llama.cpp with "mradermacher" as the target branch, I can do so. Then we also finally have a place to merge my imatrix-allow-partial-data branch into.
https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6 This is another method that saves you a step.
Awesome! Thanks a lot for your recommendation. Do you have any idea if the GGUF produced that way will be equal to what you get from converting the downloaded DeepSeek V3/R1 model to BF16 and then the BF16 model to GGUF? Will a source GGUF produced by this code be compatible with the official llama.cpp?
Did he really accidentally claimed a DeepSeek R1
I was also a bit suspicious after he uploaded again. Too bad we don't have the original repository (with the un-downloadable file), but he was awfully quick in re-uploading a new repo.
I just got wake on LAN working for nico2
Uh, ah, oh, wow - I'll try to set it up tomorrow.
To shut down the nico2 host execute /root/shutdownHost.sh
Does that mean I should shut it down automatically before nightfall or so? I can probably devise a strategy (like, start at 7, interrupt/stop at 1600 and shut down). Of course, it will be a challenge :)
I now know the probable cause why the RPC
Sounds like a probable cause indeed. We'll find out soon enough :)
"intending to queue imatrix jobs at a later date." Just make sure not to forget about them
Some have already gone away :-)
JSON Path got standardized not even a year ago.
But web pages not intended for scraping are not standardized at all, unlike an API. In any case, I don't care - whatever made it easiest for you to come up with the script wins.
I will do so the next time we have to reboot nico1 and rich1.
Fine with me, although it surely is remountable.
Having one with gpool as source running all the time additionally the two normal jobs make a lo
I have to hand-designate such jobs, but there is now some hacky code to do exactly that, in use for Tulu.
Because only trimming once a day might mean slower write speed due to trying to write on non-trimmed blocks.
fstrim should be more efficient than online trimming and, for some reason, less blocking. But I have no real experience with your specific NVMe disks, and it's very disk-dependent. I just noticed that deletes are surprisingly slow, and overwriting tends to be more efficient overall, even if individual writes may be a bit slower. As long as there is either some trimming or enough spare.
RAID 0-ing a fast NVMe SSD with a slower SATA SSD seems like a bad idea.
I agree. Especially if it's a non-enterprise SATA SSD, we might end up at <<200 MBps write speed, maybe much less.
In any case, rich1 is not always I/O-limited, only when it is doing lots of static quants, or when it is converting to gguf. It's by far not as much of a problem as on rain/back/kaos/marco.
I could tell them that their competitor would be fine with me using as much traffic as I use.
Between you and a big corporation, you usually end up at the losing end of any argument.
Sure if you create a pull request with your own llama.cpp patches
I have no functional changes. But if somebody would maintain a fork, I might be bothered to add e.g. timestamps to imatrix output. I brought it up mainly because I didn't want to maintain (and regularly merge) the imatrix patches. I'd be happy to use you as upstream.
If I ever get to queue 2023 models
I just got a taste of it by looking at bluemoonrp-13b - the repo consists of 4-bit safetensors, some ggml files and the original pytorch files in a subdirectory. And bluemoonrp-30b seems to be lost other than a 4- and 5-bit version. I was surprised it did convert fine, though - I have the feeling we might have to resort to some early 2024 version of convert.py.
https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6
wrong thread?
@nicoboss did you do something special on nico1 to let me set btrfs properties (such as compression)? it's because my rsync's currently fail on nico2 because they can't set extended attributes (had to play around with it before going to sleep :)
Uh, ah, oh, wow - I'll try to set it up tomorrow.
Awesome! I'm looking forward to nico2.
Does that mean I should shut it down automatically before nightfall or so? I can probably devise a strategy (like, start at 7, interrupt/stop at 1600 and shut down). Of course, it will be a challenge :)
Shut it down if there is no work left for it, but this will likely not be the case anytime soon. For now you could keep it running, and if you ever find time to implement it, maybe turn it off from 17:00 to 22:00. CastlePeak seems to be the node with the lowest idle CPU power consumption of only around 44 watts, and I have no idea why, as cpupower frequency-info shows the maximum CPU frequency as the current frequency. StormPeak uses around double that in idle power consumption. But maybe it is a case of turbostat wrongly estimating the power usage. I really should measure and compare the actual power draw at some point.
Sounds like a probable cause indeed. We'll find out soon enough :)
The next time we do RPC imatrix computation we will figure it out as I now set the PCIe slot to run in x16 mode again.
But web pages not intended for scraping are not standardized at all, unlike an api. I
True, but if things break you can just copy the XPath from your browser into your script to fix it. I agree, though, that using JSON would be a nicer long-term solution, and I might switch soon.
Fine with me, although it surely is remountable.
You mean remounting it while your container is using it? This seems like a terrible idea, but maybe there is a way to gracefully remount a file system while it is in use that I'm not aware of; it really sounds like something that shouldn't be possible. Once the container is turned off, I can likely remount it without rebooting the host, as nothing is using the file system anymore.
I have to hand-designate such jobs, but there is now some hacky code to do exactly that, in use for Tulu.
Nice!
fstrim should be more efficient than online trimming and, for some reason, less blocking. But I have no real experience with your specific NVMe disks, and it's very disk-dependent. I just noticed that deletes are surprisingly slow, and overwriting tends to be more efficient overall, even if individual writes may be a bit slower. As long as there is either some trimming or enough spare.
Interesting, so I will change it and we'll see how things perform.
I agree. Especially if it's a non-enterprise SATA SSD, we might end up at <<200 MBps write speed, maybe much less.
In any case, rich1 is not always I/O-limited, only when it is doing lots of static quants, or when it is converting to gguf. It's by far not as much of a problem as on rain/back/kaos/marco.
In that case let's use it as an additionally mounted disk should we ever run low on storage or have a massive model to process there.
Between you and a big corporation, you usually end up at the losing end of any argument.
They are a local ISP and so not what I would consider a big corporation. They are usually willing to find a solution both parties agree on instead of letting go of customers. They are the ISP with the best customer service rating in Switzerland and the first one to get rid of any fair-use traffic limitations in their "unlimited" contract over 10 years ago. Back when I had issues with my coaxial internet they tried everything to help me, and later they worked together with me to install a fiber cable to my house 3 times faster than they usually do. I really couldn't be any happier with the service they are currently providing.
I have no functional changes. But if somebody would maintain a fork, I might be bothered to add e.g. timestamps to imatrix output. I brought it up mainly because I didn't want to maintain (and regularly merge) the imatrix patches. I'd be happy to use you as upstream.
Just send me or create a pull request with your non-functional changes and I will add them.
I just got a taste of it by looking at bluemoonrp-13b - the repo consists of 4-bit safetensors, some ggml files and the original pytorch files in a subdirectory. And bluemoonrp-30b seems to be lost other than a 4- and 5-bit version. I was surprised it did convert fine, though - I have the feeling we might have to resort to some early 2024 version of convert.py.
Wow, impressive that this even converted. Such old models will for sure be a challenge. I tried the original Llama 65B model a few weeks ago and was really impressed by how it was able to answer some questions much better than any modern model could, so there are for sure some hidden gems in those old models.
wrong thread?
The code posted in that thread looks correct, but I have not tested it yet. If it works, it should be able to directly convert the DeepSeek R1 model to GGUF without the intermediate step over BF16.
@nicoboss did you do something special on nico1 to let me set btrfs properties (such as compression)? it's because my rsync's currently fail on nico2 because they can't set extended attributes (had to play around with it before going to sleep :)
nico2 unfortunately uses ZFS instead of BTRFS, as it currently only contains a single SSD, and changing the boot SSD to BTRFS would require me to reinstall the entire host and would make using Proxmox on this host quite inconvenient. This is not really something I can change without installing another SSD. I know this will be terrible for splitting models, so maybe I will soon move one of the SSDs from Threadripper over or buy another SSD.
ZFS
Right, that explains it (and is fine, don't bother), I'll have to look into making rsync behave and ignore those extended attributes.
You mean remounting it while your container is using it? This seems like a terrible idea, but maybe there is a way to gracefully remount a file system while it is in use that I'm not aware of; it really sounds like something that shouldn't be possible.
Is there a way to un-gracefully remount a fs? I've never heard of remounting an fs causing an issue, it should be transparent to anything using it. The only problem is that it is sometimes not clear how to disable certain features. In any case, "mount -oremount,nodiscard /mountpoint" would do the trick for discard.
Wow, impressive that this even converted.
Less impressive is that it even converted without doing anything first, resulting in a nonfunctional model, until I deleted the safetensors file and moved the pytorch files around. This brings up an issue I shall discuss with you another time (better model verification).
castlepeak 44W
For the time being, I'll implement similar time-based restrictions as on nico1 then. Switching on/off.... seems a bit more daunting.
(And 44W for the whole system would be impressively low. 44W for an idle CPU still sounds 44W too high. Oh my, deja vu)
Anyway, night. I queued some test jobs and an actual user request on nico2. Somehow I have misplaced my install instructions, but mostly doing the automated stuff (/llmjob/share and /llmjob/llama.cpp*) did seem to get it working nowadays.
Getting wireguard to work properly is an issue for another time. And maybe ssh access from outside might be useful in case I can't reach nico1 (how about port +1 - for wireguard, I also use port+1=7104 for nico2, although it kind of works after a while without port forwarding).
You might be awake before (or after) me, so you need to keep watch over nico2. Maybe I should have installed it tomorrow...
Getting wireguard to work properly is an issue for another time. And maybe ssh access from outside might be useful in case I can't reach nico1 (how about port +1 - for wireguard, I also use port+1=7104 for nico2, although it kind of works after a while without port forwarding).
As mentioned in https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#67bb2c56981d135cc3229a6c you can use [email protected]:2111 to SSH into nico2 from the internet, and use port 7104 for WireGuard.
https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#67bb2c56981d135cc3229a6c
Judging from the link not working until I manually show more messages, that's probably also why I overlooked it. I wonder how much else I didn't see because of hf hiding new messages.
btw. llmc should work just the same as on nico1, and I will create a new pause script as well. and maybe an autoshutdown script.
and maybe to save some power till then, I simply freeze > -1700 jobs from 1700-0700. I think with even a modest increase in quant power we will make significant progress. Or maybe stop starting them at 16:00 or so.
And I will change the <= -1700 rule to < -1400, so we have a nice round -1500 "category".
Update: won't work well, given that the 70b I started at midnight is not finished 10 hours later. wow, does a 70b really take that long? nico2 should be 3 times as fast as rich1. but clearly is so. i need to recalibrate my intuition drastically.
update 2: ah, maybe pausing the remaining running jobs from 17:00 to 22:00 might be a good compromise.
I will provide /llmjob/share/bin/host-pause and -resume scripts everywhere, replacing nico1-pause in /root etc. pretty much untested, as usual, but they are shell scripts.
Update: won't work well, given that the 70b I started at midnight is not finished 10 hours later. wow, does a 70b really take that long? nico2 should be 3 times as fast as rich1. but clearly is so. i need to recalibrate my intuition drastically.
I'm not really sure why but somehow CPU utilisation on nico2 is really low:
CPU and IO wait percentage in past hour:
CPU and IO wait percentage in past day:
Edit: CPU and IO wait percentage in the hour after posting this message:
Great. Gone for a few hours, and then the scheduler decided to act up and fill nico2.
As for the nico2 utilisation, I also don't understand it. At least for the first half of the day, there were essentially just two quant jobs running in parallel. I was sleeping most of the time.
Right now, nico2 is I/O limited ~50% of the time. Apparently, the CPU is too fast for IQ4_XS + Q3_K_S. Maybe the solution is to run more jobs. Or rewrite llama.cpp for better parallelisation. Or accept it, limit the cpu frequency to 2/3rd of the max and appreciate the power saving that brings.
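for the record, capping the frequency would just be one cpupower call (the value below is only an example, not something I measured):
cpupower frequency-set -u 2.5GHz   # cap the maximum frequency at roughly 2/3 of stock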
I also wondered about more or less randomising nice levels a bit, so one job preferably runs while the other is not using the cpu.
But it doesn't seem to be an issue on nico1, which has faster disks, but also ~30% faster cpu.
I don't know if we have a problem at all, btw. maybe it's all working fine?
But no, on nico1, it seems we have little difficulty achieving 100% cpu usage most of the time, and I don't see much more I/O.
Anyway, much is wrong with the overall job queuing at the moment, I'll try to fix that first :)
Maybe it's splitting... just now, nico2 had ~0% cpu, 1 GBps read, 0.5 GBps write for 5 minutes while bigsplit was running. Not sure why the I/O was so asymmetric though.
But afterwards, it was ~1 GBps read, ~150 MBps write at ~12% cpu.
If splitting is the issue, we could avoid nico2 having to split by only queuing 45B and smaller models to it. I'm also wondering if we can somehow make use of the 256 GB of RAM nico2 has to save on I/O operations. Currently ZFS ARC is limited to 2 GB. Maybe I should set it to 150 GB and see if it is intelligent enough to cache the source GGUF.
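If I try that, it would be something like this (a sketch, assuming OpenZFS on Linux; the value is just the 150 GB from above, and the modprobe file overwrites any existing zfs.conf):
# raise the ARC size limit to 150 GiB at runtime (value is in bytes)
echo $((150 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max
# make it persistent across reboots
echo "options zfs zfs_arc_max=$((150 * 1024**3))" > /etc/modprobe.d/zfs.conf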
Here is the current CPU plot:
loop-mounting a btrfs/xfs/ext4 image might be an option as well (if bigsplit is the only issue, which it might not be). a 250gb "cache" would be somewhat inconvenient, since the big thing about nico2 is that it can do 70b and 123b's, both of which would only fit if we run a single job. queueing only 45bs would be a minor disaster.
also, i see very little iowait time - so little that i would normally assume i just don't see any in my container, but sometimes i do see it, so maybe it's real. mostly I see idle times I can't explain, but, again, I probably don't understand how being in a container affects that.
also, as an experiment, if I md5sum Behemoth*gguf, I get more read I/O, so it doesn't look as if I am I/O limited, at least not to the extent I am. And I can't believe zfs could be an issue. Can't be that bad, really.
in any case, it might just be the reality that I see on other nodes - rain has a way slower cpu, but still often is I/O-limited due to an equally slow disk. nico1 is just very well matched...
and lastly, feel free to experiment with ram - the quantize jobs are limited to 32gb each...
but something weird might go on. i feel the I/O read is much higher than needed. even during bigsplit, where quantize was essentially frozen, I had double the read I/O vs. write I/O. something feels very fishy. nico1 seems to do a lot more with less I/O.
https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6 This is another method that saves you a step.
Awesome! Thanks a lot for your recommendation. Do you have any idea if the GGUF produced that way will be equal to what you get from converting the downloaded DeepSeek V3/R1 model to BF16 and then the BF16 model to GGUF? Will a source GGUF produced by this code be compatible with the official llama.cpp?
The upcasting should be equal, as upcasting is deterministic. The GGUF produced will have the extra MLA tensors, as the llama.cpp it pulls includes PR #11446. You can either adapt the triton dequant (which is what lets you skip that step) to work with normal llama.cpp, or wait to use this method once PR #11446 is merged, assuming there is no further change to the format (I hope there isn't one, as the "benefit" of removing the redundant tensor prevents you from being able to toggle MLA based on runtime arguments, all to save a very small amount of space). This method has worked for others I recommended it to, but they all used it to test the MLA implementation in the PR.
There are now also these cool resource usage peaks.
Pure techno poetry :)
Wow, I checked zfs-stats and somehow the ARC cache is super intelligent. It started to use a mixture of the most-recently-used and most-frequently-used cache lists. It uses the most-recently-used list to cache the output quants, so we don't have to read them again when we upload them, and it caches the source quants using the most-frequently-used list.
There is also ZFS ARC prefetching, which seems to play a huge role. ZFS prefetch attempts to read more blocks than initially requested into the ARC in case they are needed in the near future. This somehow seems to compensate for llama.cpp's terrible I/O code.
ARC efficiency:
hits.value 97.55
misses.value 2.44
actual_hits.value 97.55
data_demand_efficiency.value 99.75
data_prefetch_efficiency.value 4.05
Cache hit by cache list:
cache_list_anon.value -.11
cache_list_most_rec.value 43.81
cache_list_most_freq.value 56.18
cache_list_most_rec_ghost.value .05
cache_list_most_freq_ghost.value .05
yes, that's pretty good - for example, when we had a big model and a small model (today), the small model was cached but the big model wasn't, pretty much halving the I/O requirements for reading. Didn't see much of an effect on read-ahead, though (which I would expect from any filesystem, the question is, how much).
@mradermacher
Today, for the first time, I used one of my own read-only tokens to queue some gated models using llmc add force -777 si <url> hf_token <token>. However, first I had to force it because the metadata checker failed due to the model being gated, and much worse, the imatrix node failed to download the model because it did not have the token. Can you maybe just add the token as a shortcut one so it works? I put it into /root/NicoGlobalReadToken2025.txt
imatrix queue:
-777 ? Bio-Medical-Llama-3-8B-CoT-012025 error/11 hfd/12=0%
marco:
-777 17 si Bio-Medical-Llama-3-8B-CoT-012025 run/static 2/12,Q4_K_S (hfu f16)
Ah, that's a difficult problem. Unfortunately, I already removed those models - you should probably communicate such experiments to me, because I had no idea what was going on, saw that the model was gated, and wrongly assumed it had only now been gated (which is not uncommon).
In the meantime, it's easiest to queue such models on a worker that does not require hdfprep, i.e. !marco and !rich1, then this can't happen.
Bio-Medical-Llama-3-2-1B-CoT-012025
/root/cvs/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:437: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
(side note, why is there still /root/ in the path, weird)
That brings me to something I wanted to discuss with you. Right now, a certain number of models fail during imatrix generation: tensor shape mismatches, vocabulary problems, etc.
It would be nice if we could check loading via llama.cpp after conversion, before we even start to quant. But running inference is prohibitive on most nodes, so I have not done so. My, uhm, knee-jerk reaction to this problem would be to create a sparse copy of the gguf with the tensors being holes, and see if that loads. But that's a pretty brutal method :) Maybe you have a good idea on how to validate ggufs without actually reading them fully?
And also, since you seem to currently queue a lot of models that will fail, remember to use llmc audit to nuke those once they fail :)
side note, why is there still /root/ in the path, weird
Turns out the cuda512 variant is outdated and was never updated. Dangit.
Update: typo in script. Good that we caught that :)
It would be nice if we could check loading via llama.cpp after conversion, before we even start to quant. But running inference is prohibitive on most nodes, so I have not done so. My, uhm, knee-jerk reaction to this problem would be to create a sparse copy of the gguf with the tensors being holes, and see if that loads. But that's a pretty brutal method :) Maybe you have a good idea on how to validate ggufs without actually reading them fully?
The easiest way to test whether a model is llama.cpp compatible without loading its layers is to build llama.cpp with RPC support and then RPC-offload all layers to a non-existent RPC server, so it crashes after having validated the entire model but before loading any layers. The only check I can think of that would happen after that is the NaN/Inf check, which is a non-issue for us as it is checked during quantization - I know because I disabled NaN/Inf checks during quantization in the past, and the performance difference turned out to be negligible.
Example of a working model: ./llama-cli -m /root/OS-Atlas-Base-7B.IQ4_XS.gguf --rpc 127.0.0.1:1234 -ngl 999
(...)
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 120.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 104.00 MiB
llama_init_from_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.58 MiB
Failed to connect to 127.0.0.1:1234
/root/llama.cpp/ggml/src/ggml-backend.cpp:1488: GGML_ASSERT(ggml_backend_supports_buft(backends[b], sched->bufts[b])) failed
Example of a broken model: ./llama-cli -m /apool/Dequant/SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16.gguf --rpc 127.0.0.1:1234 -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 4784 (b95c8af3) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device RPC[127.0.0.1:1234] (RPC[127.0.0.1:1234]) - 0 MiB free
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 2797 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 2800 MiB free
llama_model_load: error loading model: tensor 'token_embd.weight' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/apool/Dequant/SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16.gguf'
main: error: unable to load model
Example of a model with wrong tensor shapes: ./llama-cli -m /root/llama.cpp/SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16_full.gguf --rpc 127.0.0.1:1234 -ngl 999
(...)
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 107 '<end_of_turn>'
print_info: max token length = 39
load_tensors: loading model tensors, this can take a while... (mmap = true)
Failed to connect to 127.0.0.1:1234
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected 3584, 4096, got 1, 7340032, 1, 1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/llama.cpp/SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16_full.gguf'
main: error: unable to load model
It does so in almost no time: time ./llama-cli -m /root/llama.cpp/SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16_full.gguf --rpc 127.0.0.1:1234 -ngl 999
real 0m0.792s
user 0m0.237s
sys 0m0.524s
And also, since you seem to currently queue a lot of models that will fail, remember to use llmc audit to nuke those once they fail :)
I queued quite a lot (around 250) of relatively small medical models, so everyone can run them on their phone. They are all really small, so they hopefully shouldn't take that long. No worries, I do and will continue to run audit and proactively nuke models I realize are not supported. Maybe not as fast as you, as the time from an error appearing to you clearing it is insanely short. I might just check once per hour and audit if I see some failures. It would be really useful to have the original model's HuggingFace page in the status page right-click menu, to continue reevaluating them while already queued.
Turns out the cuda512 variant is outdated and was never updated. Dangit.
Update: typo in script. Good that we caught that :)
Oh wow good you caught that.
The easiest way to test whether a model is llama.cpp compatible without loading its layers is to build llama.cpp with RPC support and then RPC-offload all layers to a non-existent RPC server, so it crashes after having validated the entire model but before loading any layers.
That's even more depraved than my method. I was under the impression it would load the whole model before that step. Will have to test.
(test)
Hmm, for some reason, I always get this when I try rpc:
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/gguf/mistral-7b-instruct-v0.1.Q4_0.gguf'
main: error: unable to load model
relatively small medical models so everyone can run them on their phone.
I said that because many of the models you queued are adapters and will immediately fail (and probably shouldn't have been queued). And small models tend to have customized tokenizers that are not supported by llama.cpp - this is why I queue small (in my case, that means <1B) models with a higher priority, because I know 80% of them will fail and I can clean up while I still remember what I queued :) It's less common with 1.5B+, but the tendency is there.
Also, you passed the test - the Homer* model was failed because tensor files were missing in the original repo, so I left it in the queue till they would likely show up. I think you restarted it (once the tensor files were there).
Didn't leave it as a test, of course, but you passed nevertheless :-)
llama_model_load: error loading model: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)
Ah, this happens long after it failed to connect, so I guess this is the "successful" output?
@nicoboss would you want to become the "official" mradermacher "upstream" for llama.cpp? I could build from the mradermacher branch at https://github.com/nicoboss/llama.cpp and you even get to decide when I should make a new build.
@nicoboss would you want to become the "official" mradermacher "upstream" for llama.cpp? I could build from the mradermacher branch at https://github.com/nicoboss/llama.cpp and you even get to decide when I should make a new build.
That would be great. I just synched the branch with upstream and merged "imatrix : Allow partial data in imatrix" into it. Please create a merge request with your own llama.cpp changes as well.
Ah, this happens long after it failed to connect, so I guess this is the "successful" output?
Yes, and that is the issue. llama.cpp doesn't fatally crash for you if the RPC connection fails, causing the model to load into RAM instead. But no worries, I will add a better way to test models in our llama.cpp in the evening.
Also, you passed the test - the Homer* model was failed because tensor files were missing in the original repo, so I left it in the queue till they would likely show up. I think you restarted it (once the tensor files were there).
It wasn't the only model where I had to use the "Redown" option yesterday. There was at least one other model where some tensors just didn't download, and "Redown" fixed it. I hadn't thought about us queueing models before they are fully uploaded, as I personally always upload models in a single commit, but now that you mention it, it makes sense.
Didn't leave it as a test, of course, but you passed nevertheless :-)
Thanks! :D
I just synched the branch with upstream
nicoboss/mradermacher is the new upstream from now on; I am currently rebuilding just to be in sync.
Please create a merge request with your own llama.cpp changes as well.
I'll try to live without any changes, as I didn't need my changes for months (my only changes atm. are to print gguf keys before writing, so I can see which key is wrong, and to add another bpe hash for olmo).
That's something that bugs me a lot about python backtraces: you don't see the arguments, and somehow python libs don't seem to want to include anything useful in exception messages either (what I mean is "json parse error in line 5" - yeah, but in what file, damnit. Or "type mismatch for key" - yeah, what key, what value).
There was at least one other model where some tensors just didn't download and "Redown" fixed it.
That should never happen (and I haven't seen such a case) - either the download succeeds and you have all files, or it should loop, or fail. But it is the right thing to do, indeed, when files get added to the repo.
I personally always upload models in a single commit
I don't know why it happens (sometimes - it's not a very common problem), but with the state hf is in, uploading non-trivial amounts of data in a single commit is suicide. But huggingface-cli now has some kind of incremental upload option that probably allows you to retry and still get a single commit only, but it's not the default. I am very forgiving when it happens :)
But no worries, I will add a better way to test models in our llama.cpp in the evening.
That sounds very exciting.
Update: sorry, I meant "That sounds very exciting1111!" :)
@nicoboss if you are looking for a minor masterwork, an option to display memory usage (ram, vram) without fully loading the tensors (or ideally not even using a backend) might be a killer feature :) i suspect it will be horrors, though.
@nicoboss if you are looking for a minor masterwork, an option to display memory usage (ram, vram) without fully loading the tensors (or ideally not even using a backend) might be a killer feature :) i suspect it will be horrors, though.
@mradermacher
I implemented a way for llama.cpp to go through all the steps of loading a model without actually loading it, in order to validate the model and compute the memory required to load it. Everything is working and the only thing left is some cleanup. If you want to give it a try, just check out the load-model-dry-run branch or take a look at https://github.com/nicoboss/llama.cpp/pull/3, and make sure to specify --no-mmap and the exact context for which the required memory should be computed. Implementing this was surprisingly complicated, but I'm really happy I achieved our goals and was able to determine within seconds how much memory would be required to run the 1.4 TB Z1-Zero source GGUF with 10000 context.
./build/bin/llama-cli -m /HDD/Z1-Zero.gguf --no-mmap -c 100000
ggml_backend_cpu_buffer_type_alloc_buffer: Skip allocateing buffer of size 1342267742208
load_tensors: offloaded 0/62 layers to GPU
ggml_backend_cpu_buffer_type_alloc_buffer: Skip allocateing buffer of size 499712000000
llama_init_from_model: KV self size = 476562.50 MiB, K (f16): 285937.50 MiB, V (f16): 190625.00 MiB
ggml_backend_cpu_buffer_type_alloc_buffer: Skip allocateing buffer of size 517120
./build/bin/llama-cli -m /HDD/Z1-Zero.gguf --no-mmap -c 10000 -ngl 5
ggml_backend_cpu_buffer_type_alloc_buffer: Skip allocateing buffer of size 1227176363008
ggml_backend_cuda_buffer_type_alloc_buffer: Skipping allocating 69054827520 bytes on device 0
ggml_backend_cuda_buffer_type_alloc_buffer: Skipping allocating 46036551680 bytes on device 1
load_tensors: offloaded 5/62 layers to GPU
ggml_backend_cuda_buffer_type_alloc_buffer: Skipping allocating 2461532160 bytes on device 0
llama_init_from_model: KV self size = 47732.50 MiB, K (f16): 28639.50 MiB, V (f16): 19093.00 MiB
ggml_backend_cpu_buffer_type_alloc_buffer: Skip allocateing buffer of size 517120
Implementing this was surprisingly complicated
Wow, that's amazing, you accomplished far more than "just" a lint. I assume you would still need cuda (and a graphics card), but it wouldn't use big amounts of vram or ram, but one could basically add up the buffer allocations and get a very accurate picture. I wonder if this can somehow be incorporated into the model page. We could download the gguf header, create a mock gguf and measure usage. We could even get fancy with some caching system where we lump similar models (e.g. architecture + tensor sizes) together.
load-model-dry-run branch
Unless you are unsure whether you broke something important, you should merge this into the "official" mradermacher branch, so I can build and try it out on the actual nodes without issues. It's not a big deal if you make some cleanups later and I have to rebuild.
Oh, wait, you haven't implemented a switch yet, I assume :)
Wow, that's amazing, you accomplished far more than "just" a lint.
Thanks. Skipping all the memory allocation and disabling all the code that would use that memory was quite challenging, but the result is really rewarding. In case you wonder, I coded this all in the terminal using nano/grep without any debugger, so finding all the segfaults caused by accessing unallocated memory was such a pain.
I assume you would still need cuda (and a graphics card)
Not really. If you just want to know how much RAM loading the model without a GPU would require, you don't need one. Currently you do need one if you want to know how much GPU memory would be required, but I don't yet know if this is a hard requirement, as without any GPU llama.cpp currently doesn't even try to use them. The CUDA backend is written in CUDA, but I'm not sure if we ever execute any GPU code during a dry run, so maybe llama.cpp could be tricked into thinking a GPU exists.
but it wouldn't use big amounts of vram or ram
It should use none as no memory is ever allocated. It just uses whatever you would use when starting llama.cpp without loading any model.
but one could basically add up the buffer allocations and get a very accurate picture
That is exactly how it works. You add up all the RAM and GPU memory allocations and then know exactly how much RAM and memory on each GPU you need for a specific setup. All arguments like context size and offloaded layers are correctly accounted for, as it simulates loading a model without actually loading it.
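For example, summing up the total could be done with something as simple as this (a sketch that assumes the log lines keep the exact format shown above):
./build/bin/llama-cli -m /HDD/Z1-Zero.gguf --no-mmap -c 10000 2>&1 \
  | grep -E 'Skip(ping)? allocat' \
  | grep -oE '[0-9]{4,}' \
  | awk '{ total += $1 } END { printf "total skipped allocations: %.1f GiB\n", total / 2^30 }'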
I wonder if this can somehow be incorporated into the model page.
We easily could just run a dry run for every new quant we generate. Doing so takes barely any resources and only a few seconds. Not only does it tell you how many resources would be required to load the model, it also verifies that the model will load in llama.cpp. There are quite a lot of variables, so we need to always measure the same setup using a well-defined context length.
We could download the gguf header, create a mock gguf and measure usage. We could even get fancy with some caching system where we lump similar models (e.g. architecture + tensor sizes) together.
We have a bit too many models to do this for all the ones we ever created. I'm not sure it will be worth doing for past models, but if you really want to, a mock would likely work; it needs to be a good one though, as llama.cpp verifies a lot of things.
We could even get fancy with some caching system where we lump similar models (e.g. architecture + tensor sizes) together.
The dry run is so fast that caching is likely not even required. We could investigate how llama.cpp calculates how much memory needs to be allocated, but the code for that is relatively complicated, which is why I decided to fake the allocations and make it not crash with the fewest changes possible instead.
Oh, wait, you haven't implemented a switch yet, I assume :)
The branch still needs some clean-up and indeed a switch, so the code is only active when intended. I will probably do so using an environment variable, as I had to modify too many places for a command line argument to be feasible without heavily modifying llama.cpp, which I need to avoid so we don't run into merge conflicts in the future.
We easily could just run a dry run for every new quant we generate.
I think the number of useful combinations is very high (0, 1, 2 gpus for example, offloading variable number of layers, context length, other parameters such as flash attention...). And ideally, one wants to know how many layers to offload.
We could investigate how llama.cpp calculates how much memory needs to be allocated, but the code for that is relatively complicated
That would be ideal, but that has been attempted a few times already, with not so good results - and the code is ever changing. Using llama.cpp itself is the only reasonable thing to do, imho.
I will probably do so using an environment variable
Suits me, no polishing required. Although you'd have to have access to getenv in every file where you use it.
I assume you wanted to move Tulu to /bpool?
@mradermacher I'm quite confused about this error. All I did was set the override and interrupt flag to stop after a quant, moved the source GGUF from HDD to SSD, changed the soft link and let it continue. There is no reason it wouldn't find the source GGUF.
gguf_init_from_file: failed to open GGUF file './Llama-3.1-Tulu-3-405B-DPO.gguf'
llama_model_quantize: failed to quantize: llama_model_loader: failed to load model from ./Llama-3.1-Tulu-3-405B-DPO.gguf
Another issue is dbrx-tiny. We need to give it a token with access to databricks/dbrx-instruct. You should have one as we quantized DBRX in the past, but I'm not sure how we could specify one now that the task is already started.
I assume you wanted to move Tulu to /bpool?
Yes, exactly, because I will be deleting gpool tomorrow morning. I lost patience with my HDD supplier after waiting for 2 months and the delivery date again getting pushed 1 month into the future, so today I express ordered 3x 20 TB from a different supplier using same-day delivery and am now creating a 4x 20 TB ZRAID5 pool. bpool is a ZRAID0 pool of nvme-Samsung_SSD_990_PRO_4TB_S7DPNJ0X121874D and nvme-KINGSTON_SFYRD4000G_50026B7686F7CD2D and so much faster as well.
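For reference, a minimal sketch of what creating such pools with zpool could look like (the pool names mirror the ones above, but the HDD device IDs and the ashift option are placeholders/assumptions, not the actual commands used):

# 4-disk raidz1 ("RAID5-like") pool replacing gpool; the disk IDs are placeholders.
zpool create -o ashift=12 gpool raidz1 \
  /dev/disk/by-id/ata-20TB_DISK1 \
  /dev/disk/by-id/ata-20TB_DISK2 \
  /dev/disk/by-id/ata-20TB_DISK3 \
  /dev/disk/by-id/ata-20TB_DISK4

# Striped ("RAID0") NVMe pool like bpool: maximum speed, no redundancy.
zpool create -o ashift=12 bpool \
  /dev/disk/by-id/nvme-Samsung_SSD_990_PRO_4TB_S7DPNJ0X121874D \
  /dev/disk/by-id/nvme-KINGSTON_SFYRD4000G_50026B7686F7CD2D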
I'm quite confused about this error.
Nothing bad happened, other than me being puzzled for quite a while until I closely looked at the symlink - but you should inform me about these things. The job had no access to /bpool (it has now).
Another issue is dbrx-tiny.
Indeed, the job has neither access to the token, nor the network. Again, for security reasons. Could the files somehow be provided locally? It's bad enough that llama.cpp runs foreign code without asking, but downloading stuff from random locations feels weird... :)
Also, FYI, the rpc trick does not work for me. llama-cli simply crashes the same way for me (vector::_M_range_check) regardless of whether the model is normally loadable or not.
(dbrx-tiny) Alternatively, for the very few models that need network access, the download and conversion could be done manually.
Nothing bad happened, other than me being puzzled for quite a while until I closely looked at the symlink - but you should inform me about these things. The job had no access to /bpool (it has now).
Usually, moving source models between different pools and changing the symlink was never an issue. Did this change, or is it just that every storage pool needs to be whitelisted and I just never happened to move a model to a newly added storage pool before you whitelisted it?
It was relatively unplanned. I ordered the new HDDs today in the early afternoon and got them in the evening, and now need to move all data away from gpool as quickly as possible, as all data still there when I create the new storage pool will be lost. The first thing I did was pause your task after the next quant so I can copy data away faster. I then moved Llama-3.1-Tulu-3-405B-DPO.gguf to bpool so I can continue running it. We really didn't lose much time. You fixed the issue only minutes after I encountered it. I agree and will try to better communicate such operations in the future. I saw no reason this wouldn't just work and so saw no reason to inform you before doing it. To be fair, I started informing you the moment the issue occurred and I was unable to fix it myself.
Indeed, the job has neither access to the token, nor the network. Again, for security reasons. Could the files somehow be provided locally? It's bad enough that llama.cpp runs foreign code without asking, but downloading stuff from random locations feels weird... :)
Yes, totally understandable, but it would be sad if we can't do dbrx-tiny for this reason.
(dbrx-tiny) Alternatively, for the very few models that need network access, the download and conversion could be done manually.
Ah, so it is only about the HF to GGUF conversion. No worries, in that case I will just do it manually.
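For what it's worth, a minimal sketch of such a manual run (paths, the output type and the token handling are assumptions, and how the conversion script picks up the dbrx-instruct tokenizer files is model-specific and may need extra steps):

# Download the gated dependency up front, outside the sandboxed job.
# HF_TOKEN must be a token with access to databricks/dbrx-instruct.
HF_TOKEN=hf_placeholder huggingface-cli download databricks/dbrx-instruct --local-dir /tmp/dbrx-instruct

# Convert the actual model to GGUF locally; --outfile/--outtype are the usual
# convert_hf_to_gguf.py options, the paths are illustrative.
python convert_hf_to_gguf.py /tmp/dbrx-tiny --outfile /tmp/dbrx-tiny.gguf --outtype f16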
Also, FYI, the rpc trick does not work for me. llama-cli simply crashes the same way for me (vector::_M_range_check) regardless of whether the model is normally loadable or not.
It no longer works in the latest llama.cpp for me either. They now fall back to using the CPU if the RPC server is not reachable instead of crashing. Just use my load-model-dry-run branch for now. As long as you use a dedicated llama.cpp build, doing so should be fine. I will finish cleaning up and merging this branch tomorrow, so if it isn't urgent for you, just wait a day.
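In case it helps, a minimal sketch of setting up such a dedicated build (the fork URL is a placeholder; the cmake flags are the standard llama.cpp CUDA build flags):

# Build the branch in its own directory so the production build stays untouched.
git clone --branch load-model-dry-run https://github.com/YOUR_FORK/llama.cpp.git llama.cpp-dryrun
cd llama.cpp-dryrun
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j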
Turning off nico2 led to http://hf.tst.eu/status.html getting stuck.
Edit: Nice, it seems to be fixed.
I fixed it using the equivalent of llmc killall9. The pause flag only tells the scheduler to not schedule jobs, it does not make it ignore the host. Which would also not be the best thing, as then you couldn't see it anymore to judge whether it's safe to switch off. Also, the local scheduler doesn't know about this, either, and will happily start jobs. Have to think about that and make it a transferable flag. But the pause flag is not meant to make things safe for removal, only to make it safe to use your host for something else.
In theory, there are timeouts everywhere that should prevent stalls, but it seems rsh and ssh also sometimes get stuck. And timeouts are in the hours range. Not sure there is a good solution, other than actually making the scheduler ignore the host (which actually is also implemented, but it's not the pause flag). And, arguably, a timeout is a bad thing, because we just don't know what happened in that case, especially if the connection dies in the middle of some operation.
Did this change or is it just that every storage pool needs to be whitelisted
Every pool needs to be whitelisted, for some time now, as the jobs are run inside a container (pretty much what you get when using llmc shell).
You did nothing wrong with moving, other than not knowing the job would fail. But no time was lost, in the sense that no work was wasted.
Ah so it is only about the HF to GGUF conversion.
Hmm, I don't know, but if not that would mean that using the gguf with llama.cpp would also require network access. Can't be.
They now fall back to using the CPU if the RPC server is not reachable instead of crashing.
How is that ever a good idea.
I will finish cleaning up and merging this branch tomorrow, so if it isn't urgent for you, just wait a day.
It doesn't. It is pretty rare, except for all the small models you queued, of which maybe 15 or so required manual cleanup :)
I want this mostly from a quality standpoint (not generating obviously broken static ggufs) and work waste (generating static models before imatrix catches it).
Oh, and you give me way more to play with than I can easily handle anyway (ram measurements).
@mradermacher
What is the opposite of llmc pause llmjob.nico1? I tried llmc resume llmjob.nico1 but that just returns fail.
@mradermacher
Please unpause llmjob for nico1. I tried everything and nothing I do seems to work.
That's the right command, I'll have a look.
Should be fixed and therefore work in the future. Also, it's unpaused.
wonderful. all model card uploads fail. And I do not think it has anything to do with us:
BadRequestError('Bad request for commit endpoint:\n[31m------------------------------------------------------------------------- Unexpected internal error hook: yaml. (Request ID: Root=1-67c58a3d-55c4e98b47b8948c61143a8f;a7008891-9698-4001-83c0-06b1f54a85da) ------------------------------------------------------------------------- [0m\n\x1b[31m-------------------------------------------------------------------------\nUnexpected internal error hook: yaml. (Request ID: Root=1-67c58a3d-55c4e98b47b8948c61143a8f;a7008891-9698-4001-83c0-06b1f54a85da)\n-------------------------------------------------------------------------\x1b[0m')
@mradermacher
Why is almost every worker idle and stuck at run/static README.md upload? Does this mean we reached the repository creation rate limit? But wouldn't it then be stuck at creating repositories instead?
No, it means huggingface fucked up, I think uploads were globally down. They seem to have fixed it.
The pause and llmjob. flags should now be communicated to the host, so once set (and the scheduler has contacted it successfully, which should normally be almost immediate), it should reliably prevent hosts from starting new jobs. This does not solve the problem of the scheduler trying to contact hosts, but that has to be a separate thing.
the upshot is that host-pause should now reliably be able to stop activity on a host. the next step would be to stop activity, and then set another (already existing) flag that keeps the scheduler from contacting that host. but that's for another time.
Just for your information I manually downloaded, GGUF converted and forcefully queued to nico1 all the previously failed jais-family and jais-adapted type of models.
Is there any way to check why some random quant of jais-family-6p7b failed and a llmc audit redo fixed it? The audit just showed llama.cpp starting normally and llmc why jais-family-6p7b doesn't show anything.
The pause and llmjob. flags should now be communicated to the host, so once set (and the scheduler has contacted it successfully, which should normally be almost immediate), it should reliably prevent hosts from starting new jobs. This does not solve the problem of the scheduler trying to contact hosts, but that has to be a separate thing.
Awesome that this is now fixed. pause being broken was annoying as it required manually interrupting and overriding all the tasks.
the next step would be to stop activity, and then set another (already existing) flag that keeps the scheduler from contacting that host. but that's for another time.
Now that I know that llmc killall9 solves this issue, fixing it properly is luckily not that urgent.
How can I make it upload the SOURCE GGUF to HuggingFace for models where generating the GGUF is a massive pain? I'm currently requanting the following models, where obtaining the source requires multiple convert_hf_to_gguf.py modifications and transformers==4.44.2:
- https://huggingface.co/THUDM/chatglm3-6b
- https://huggingface.co/THUDM/chatglm3-6b-128k
- https://huggingface.co/Walterjin/ChatGLM3-6B-Chat-ChatMed
I added them like this:
llmc add force 8 si https://huggingface.co/THUDM/chatglm3-6b worker nico1
llmc add force 8 si https://huggingface.co/THUDM/chatglm3-6b-128k worker nico1
llmc add force 8 si https://huggingface.co/Walterjin/ChatGLM3-6B-Chat-ChatMed worker nico1
NVidia finally released another stable driver: "570.124.04". We are currently on the unstable "555.42.02 BETA" driver from 2024-5-21, as the previous stable 550 series of drivers were not compilable against the Proxmox kernel, which is too new for such an old driver. If everything on your side supports the latest stable driver, I recommend we upgrade soon.
How can I make it upload the SOURCE GGUF to HuggingFace for models where generating the GGUF is a massive pain?
make a MODEL-GGUF repo, move the MODEL.SOURCE.gguf file inside, maybe split it (bigsplit --rm ...-GGUF/...SOURCE.gguf), then run hfu MODEL.SOURCE.gguf --include \*.SOURCE.\*, then delete the file/dir.
You could, with some care, also script this. It's an exceedingly rare thing for me to do so. You can upload at any time, too, you don't have to wait for the job to be in a specific state.
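A rough sketch of what scripting it could look like, mirroring the steps above literally (bigsplit and hfu are internal tools, so their exact argument handling may need adjusting, and the MODEL-GGUF repo/directory is assumed to already exist):

#!/bin/bash
set -e
MODEL="$1"                                       # e.g. chatglm3-6b

mv "$MODEL.SOURCE.gguf" "$MODEL-GGUF/"           # move the source gguf into the repo dir
bigsplit --rm "$MODEL-GGUF/$MODEL.SOURCE.gguf"   # maybe split it (only needed for big files)
hfu "$MODEL.SOURCE.gguf" --include '*.SOURCE.*'  # upload, as described above
rm -rf "$MODEL-GGUF"                             # then delete the file/dir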
I recommend we upgrade soon.
Are there any tangible benefits, other than the likely major hassle everywhere?
Update: you can probably see where I am going with this. Problem is, beginning with 565 or so, nvidia started a major rework of their whole architecture and packaging, and their packages are in a major state of brokenness. Maybe just the cuda runtime is fine, though, but I am loath to try for likely zero benefits other than, maybe, a slowdown on 4xxx cards because they improve stuff for the 5xxx cards, as usual.
No clue what's wrong with these, but if it's the glm tokenizer, i.e.
TypeError: ChatGLMTokenizer._pad() got an unexpected keyword argument 'padding_side'
then usually all you have to do is add a single line to the failing .py file:
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
+ padding_side: Optional[bool] = None,
+ **kwargs
) -> dict:
worked for me for every failing model with glm in the name, but I don't think I have tried those, specifically.
When queuing A LOT of models of a potentially problematic family (such as jais), you should test a few models first. Likely, we'll have to patch all the jais jobs to avoid Qx_0/1 quants, in which case there is a way to queue these right in the first place. Ah, and queuing them on nico1, as you did, is also a great idea, at least it gives you options. btw., don't bother doing anything about the jais jobs, when they fail, I can semi-automatically patch them. I think I can even give you a command to do that on nico1.
Also, I again ask you not to queue many models with "non-standard" nice levels unless you really need them to be done in a specific order. The scheduler has absolutely no chance to do anything useful if it is forced to do them in order (and it gets worse if the nice levels are in the high priority range and are off by one, as lower nice levels are an absolute override).
Also, unrelated, I will be so relieved to not see the lumikabra model anymore on nico1. It's been there for a month :)
Are there any tangible benefits, other than the likely major hassle everywhere?
I have some hope NVidia might have fixed their random hardware reset bugs, where you sometimes need to reboot the entire host if the device fails to reset when passing between host and VM, but nothing was mentioned about it in the changelogs, so likely it is still not fixed. Other than that, VRR or all 3 monitors sounds awesome, but I'm currently only using VMs for display output, so it doesn't matter for me.
I only suggested updating as I thought it might be as easy as an apt upgrade to the latest official package. For me it's not really any additional work to switch to the latest version the next time I have to reboot the host, as every time I update the host and a new kernel is available, I have to rebuild, secure-boot sign and reinstall the NVidia driver again anyways.
worked for me for every failing model with glm in the name, but I don't think I have tried those, specifically.
Those seemed a bit more different. All of the following changes, installing the latest tiktoken and downgrading to exactly transformers==4.44.2, were required for me to get it working. The reason this was so complicated is that they are not HF models and use custom python code to load themselves:
convert_hf_to_gguf.py
523,524c523,524
< vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
< assert max(tokenizer.vocab.values()) < vocab_size
---
> vocab_size = self.hparams.get("vocab_size", len(tokenizer.get_vocab()))
> #assert max(tokenizer.vocab.values()) < vocab_size
528c528
< reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
---
> reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.get_vocab().items()}
4627c4627
< vocab_size = hparams.get("padded_vocab_size",hparams["vocab_size"])
---
> vocab_size = hparams.get("padded_vocab_size",hparams["padded_vocab_size"])
4650c4650
< self.gguf_writer.add_block_count(self.hparams.get("num_layers", self.hparams["num_hidden_layers"]))
---
> self.gguf_writer.add_block_count(self.hparams.get("num_layers", self.hparams["num_layers"]))
When queuing A LOT of models of a potentially problematic family (such as jais), you should test a few models first. Likely, we'll have to patch all the jais jobs to avoid Qx_0/1 quants, in which case there is a way to queue these right in the first place.
I converted them all manually to GGUF and tested most of the source GGUFs using llama.cpp; they all loaded and gave good output. I then tested if jais-family-6p7b works, which it did after I ran a llmc audit redo for some random quant that initially failed. Why would we need to patch all the jobs to avoid Qx_0/1 quants? If this is a model where a certain type of quant doesn't work why would a retry without me changing anything fix the issue?
Ah, and queuing them on nico1, as you did, is also a great idea, at least it gives you options. btw., don't bother doing anything about the jais jobs, when they fail, I can semi-automatically patch them. I think I can even give you a command to do that on nico1.
The main reason I queued them all on nico1 is because I manually provided the source GGUF for all of them. But having them on nico1 obviously also makes a lot of sense in case something goes wrong. Unfortunately, without screens I no longer have any idea how I would even monitor them should an error occur during quantization.
Also, I again ask you not to queue many models with "non-standard" nice levels unless you really need them to be done in a specific order. The scheduler has absolutely no chance to do anything useful if it is forced to do them in order (and it gets worse if the nice levels are in the high priority range and are off by one, as lower nice levels are an absolute override).
Does it even matter if they are all queued to nico1? I likely should have just pushed them directly to nico1 as all the source GGUFs are already there anyways. Regarding the order, I gave them all a 7, except the 3 to which I still have to somehow add the source GGUF, which I gave an 8 to make sure you have time to respond before they are done. I probably should have just queued them all as 0. But you are right, I should try to use standard nice values more often. I'm abusing them as an indicator way too often and am messing with your poor scheduler.
Also, unrelated, I will be so relieved to not see the lumikabra model anymore on nico1. It's been there for a month :)
Oh wow, what a long time. I noticed as well that currently models tend to spend a really long time on hosts. But we are making progress. Soon we will be done with all the priority 40 models and the 400 ones. I will celebrate when there is not a single model below 1000 left in the queue.
Other than that, VRR or all 3 monitors
Yup. The moment my TV was on 570, my hdmi output simply showed corruption whenever I paused a movie. Sigh.
Those seemed a bit more different.
Indeed :/
Why would we need to patch all the jobs to avoid Qx_0/1 quants? If this is a model where a certain type of quant doesn't work why would a retry without me changing anything fix the issue?
Because they don't support those quants. Or at least all the ones that failed today - not all have odd tensor sizes.
If this is a model where a certain type of quant doesn't work why would a retry without me changing anything fix the issue?
It wouldn't. The job would need to be changed to exclude the non-working quants, and it would have to be restarted. Worse, the temp file would need to be manually deleted, but I hope the new quantize script will take care of that in the future.
It's a bit difficult - apparently not all jais models have odd tensor sizes, so the best way might be to wait for them to fail and then patch the job. If it happens again, I'll make a command for us to do that.
The main reason I queued them all on nico1 is because I manually provided the source GGUF for all of them.
That's an even better reason :) If that becomes a common occurrence and we want to provide a SOURCE, I could probably add a dummy SOURCE quant, or some flag, that would automatically upload it. I think I can probably count the number of source quants on one hand, but the glm models would qualify for "hard to come by", although originally I mostly meant "not on huggingface" or something similar. But the smaller the model, the lower the threshold, I think :)
Does it even matter if they are all queued to nico1?
Actually no. I think the bigger issue today is that nico1 was full and couldn't even start quanting jobs anymore.
But we are making progress.
Yes, nico2, and the fact that I skipped around 70 imatrix 70b's... Also, the nice levels are currently a bit mucked up, because the level 400 models really were delayed level 0 models that mostly had thousands of downloads, and I didn't want them to be delayed even further by being treated like archive models. But level 40 has different scheduling rules, which I had to change a bit, causing the queue-up. I also tried to move some lower prio models up, so that they have a chance (lumikabras for example).
But long queues have to happen - basically every time imatrix computations are delayed or not done due to time of day restrictions. They should reduce now, but they didn't really have a chance in the last month.
Actually no. I think the bigger issue today is that nico1 was full and couldn't even start quanting jobs anymore.
Or rather, it's again one of those "nico filled the disk to the brim without thinking about consequences" days :)
Anyway, going to sleep - I changed things around to avoid the blocking caused by priority inversion and to clean stuff up. Or rather, to embrace the priority inversion and do the big jais models before the nice 0 models. Please be careful what you do tonight.
Actually no. I think the bigger issue today is that nico1 was full and couldn't even start quanting jobs anymore.
Or rather, it's again one of those "nico filled the disk to the brim without thinking about consequences" days :)
It wasn't that bad before or I would have put the large ones on /bpool. It is only so bad right now because Zireal-0 has done its Q8 quants and is uploading them while working on its Q6_K quants, which together are over 1 TB. I completely forgot about Zireal-0 being 684B due to the lack of a size indication in the name. I really need to be more careful about this in the future. This mistake was so avoidable, especially given that /bpool has over 5 TB of free SSD storage ready to be used.
and the fact that I skipped around 70 imatrix 70b's...
Oh shit... so we will never be done with the queue, I guess.
But long queues have to happen - basically every time imatrix computations are delayed or not done due to time of day restrictions. They should reduce now, but they didn't really have a chance in the last month.
For sure. Especially for when we do RPC imatrix computation.
interesting case: leia was pinging (even over the vpn, which is implemented by a userspace daemon, mlock'ed into memory), but ssh would connect and then give no response - kernel fine, userspace dead. no clue what the problem is (it is currently rebooting), but since the initial ping works, it tries to connect, and while that correctly times out after 90s, it's still a failure and the scheduler can't do anything useful.
there is always one more way that things can fail.
It wasn’t that bad before or I would have put the large ones on /bpool.
It was - Zirael has been in the queue for quite a while now. Again, the free disk space is deceptive and it can be eaten up in no time, especially when we lie about it by using it up without the scheduler knowing. If you had push-model'ed ALL jais jobs, are you sure the situation would have been fine? If not, then it was too many jobs for the budget.
You have to watch the free budget, but also the job queue itself. It might be especially difficult on nico1, because all other nodes are limited to smaller jobs, so there are fewer surprises, but that's the world we live in - nico1 is the only one that can deal with terabyte sized jobs, and it only works because those are specially treated and taken care of.
PS: I have adjusted zirael's size by a factor of ten, to kind of reduce its scheduling weight (the space is not really allocated on the working disk, it being on /bpool), but it is a bit risky.
ok, leia was taken down by a 7B's mmproj extraction. a vision model falling through the cracks. damn shit, i'll add some protection inside quantize.
(or rather, damn linux for not oom-killing the job. no weird nvidia drivers involved, and it didn't even have swap activated at the time)
update: sigh, years of uptime gone. yay, new kernel with security fixes
there is now a new disable/enable llmc verb, and it takes the host out of the scheduler. it is automatically used at the end of host-pause, so host-pause really is host-disable now. maybe I will change that, but I guess it doesn't really matter. the upshot is that host-pause will completely and safely take a host out, safe to reboot and other things.(*)
- in theory
jais-family-2p7b-chat is an interesting case - most of the failing jais models fail because of odd tensor sizes, and they cannot fall back to IQ4_NL or another format that handles those.
but that model fails on IQ4_NL because of odd tensor sizes. that smells like a bug,
or IQ4_NL always fails, too, and it's just the model size that triggers it.
there is a new command in nico1, /mlock-imatrix MODEL (e.g. ```/mlock-imatrix Quasar-3.3-Max```). It locks the model in /tmp in memory for imatrix computation. it will be cleaned up automatically at the end.
I have it because recently I had some models take really really long for no reason (other than not being cached, that is). locking them into memory allowed them to finish quickly. But it is rare enough to not risk automatic mlocking (although that would now be trivial to add as well).
hmm, I could even mlock as a separate job, so loading times disappear. but that's a headache.
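(For reference, a generic way to get the same effect, assuming vmtouch is installed; whether /mlock-imatrix works like this internally is not confirmed, and the path is an assumption:)

# Lock the model's pages into RAM so imatrix doesn't stream from disk;
# -t touches all pages, -l mlocks them, -d keeps a daemon holding the lock.
vmtouch -t -l -d /tmp/Quasar-3.3-Max.gguf
# ...run the imatrix computation...
# Release the lock afterwards by killing the vmtouch daemon.
pkill -f 'vmtouch -t -l -d /tmp/Quasar-3.3-Max.gguf'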
@mradermacher
http://hf.tst.eu/status.html has already been stuck for 3.5 hours, so I decided to execute llmc killall9. I got the following output:
nico1 ~# llmc killall9
back
nico2
leia
rich1
kaos
rain
nico1
NICO2
marco
llmjob: no process found
llmjob: no process found
llmjob: no process found
Killed llmjob(2401108) with signal 9
Killed llmjob(1381201) with signal 9
Killed llmjob(690406) with signal 9
Killed llmjob(1756300) with signal 9
Killed llmjob(1757562) with signal 9
llmjob: no process found
llmjob: no process found
However, the status page still doesn't work, and when I try llmc audit I get the following:
nico1 ~# llmc audit
: Unknown host
: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at /llmjob/share/bin/llmjob line 606.
So I tried llmc killall9 again. But still no success and audit is still broken.
I even tried llmc killall9 from nico2 without any success:
Yes, this seems like a common issue with multiple llmc commands. Even trying to pause/resume a host results in this:
nico2 /llmjob/share/bin# host-pause
pausing host...
: Unknown host
: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at /llmjob/share/bin/llmjob line 606.
nico2 /llmjob/share/bin# host-resume
enabling host...
: Unknown host
: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at /llmjob/share/bin/llmjob line 606.
What I'm not getting is the issue on /llmjob/share/bin/llmjob on line 606:
for my $worker (values %worker) {
(delete $worker->{handshake})->();
$worker->{req_}(flags => $FLAGS); #Line 606
}
Interesting. I messed around with the system a bit by directly running llmjob audit as I can't really break it much more anyways:
nico2 ~# /llmjob/share/bin/llmjob audit
: Unknown host
10.9.0.26: Connection timed out
10.9.0.27: Connection timed out
10.0.0.21: Connection timed out
10.0.0.19: Connection timed out
10.28.1.6: No route to host
^C
nico2: empty update, not applying
rich1: empty update, not applying
kaos: empty update, not applying
But sometimes the response is different:
nico2 ~# /llmjob/share/bin/llmjob audit
: Unknown host
/llmjob/share/bin/llmjob~ syntax OK
10.28.1.6: No route to host
nico1: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at /llmjob/share/bin/llmjob line 606.
nico2: empty update, not applying
kaos: empty update, not applying
I added some debugging code and the worker object looks fine:
: Unknown host
$VAR1 = {
'maxq' => 90,
'maxdu' => '1200000000000',
'qsize' => 8,
'pri' => 90,
'maxnice' => 99999,
'name' => 'rich1',
'drain' => sub { "DUMMY" },
'req' => sub { "DUMMY" },
'skip' => 0,
'ip' => '10.28.1.7',
'hfulimit' => 80,
'limit' => {
'hfd' => 4,
'quant' => 2
},
'req_' => sub { "DUMMY" },
'submit_score_modifier' => 80,
'extradu' => '700000000000',
'maxm' => '300000000000'
};
/llmjob/share/bin/llmjob~ syntax OK
10.28.1.6: No route to host
Wow this garbage explains the "wrong protocol magic" error:
/llmjob/share/bin/llmjob~ syntax OK
10.28.1.6: No route to host
: Unknown ho: Unknown host
$VAR1 = {
'name' => 'kaos',
'maxdu' => '8000000000000',
'pri' => 5,
'submit_score_modifier' => 2,
'maxm' => '50000000000',
'req_' => sub { "DUMMY" },
'qsize' => 6,
'ip' => '10.28.1.1',
'maxq' => 12,
'req' => sub { "DUMMY" },
'ionice' => '-c3',
'drain' => sub { "DUMMY" },
'extradu' => '500000000000',
'skip' => 0,
'maxnice' => 99999
};
$VAR1 = 'ieya...';
$VAR1 = {
'req' => sub { "DUMMY" },
'req_' => sub { "DUMMY" },
'limit' => {
'hfd' => 2,
'quant' => 2
},
'hfulimit' => 16,
'drain' => sub { "DUMMY" }
};
$VAR1 = undef;
: wrong protocol magic (), skipping.
Can't use an undefined value as a subroutine reference at ./llmjob line 610.
So strange, using llmc shell I can access them all without any issues:
llmc shell back
llmc shell rain
llmc shell kaos
llmc shell leia
llmc shell marco
llmc shell nico2
llmc shell rich1
llmc shell nico1
sorry for making you jump through all these hoops.
i was trying to improve interactivity for marco on his development machine when doing llm jobs, and just as we were finished, my cat walked over the keyboard. apparently, he pressed some number key (e.g. 4), ",", "ä" and a few other things, all perfectly fine vi commands in my config, which caused nico2 in the host definition to become uppercased (4) and saved (ä). that caused all the behaviour you saw, as we then had an additional host called NICO2 without a hostname/ip-address (thus the ": Unknown host").
the other side of llmc is a different program, so it was not affected, and since uppercase nico2 is not a syntax error, llmjob worked as long as it didn't try to contact all hosts.
when working, I have little capacity to monitor things, and the editor was open on another page, so it's only now that I saw it. and it was quite puzzling.
Interesting. I messed around with the system a bit by directly running llmjob audit as I can't really break it much more anyways:
You simply shouldn't have permissions to do anything, even when running llmjob, since all goes via ssh/rsh. You can mess up the local host, but that's unlikely. Basically, llmc contacts kaos to invoke certain llmjob commands that only make sense on kaos (well, if you had shell access, you could do quite a bit on any host, the only thing you don't have is the imatrix files and the additional job queue).
You did get to the root of the issue, though, or extremely close, namely the worker "object" without a {name}. the NICO2 was kind of fine, other than for the uppercasing, but I manually overwrite settings for "nico2" later, which resulted in a nonfunctional host.
It seems, though, that at least this time local scheduling did its thing - no lock was held.
I hope I interpreted your intent correctly and removed the pause.nico2. Apologies if that wasn't your intent.
Also, llmc disable NICO2 and/or nico2 might have helped in this situation, because that removes those workers early on. But I don't think this knowledge (if it even is true :) will be of much use in future disasters.
And while we are at it, my tentative plan is this:
now that we are through the nice level 40 models, and the queue is finally looking a lot cleaner, I want to look into maybe letting nico2 only do archive models automatically, and then actually switch it off at ~1700, and automatically on at ~0700:
- The chaotic 10-70b's per day time is over. At least for a week now. Still more than before, but definitely more reasonable.
- I started to delay imatrix quants of large models (such as 70b's that don't look like immediate hots). Currently, ~60 of them are still in some list here, and if they don't gather, say, 50 downloads, I think keeping them static only is the sane thing to do.
- nico1 is mostly through its super big models.
So unless things change drastically, this seems most reasonable.
@RichardErkhov @Guilherme34 @nicoboss @mradermacher congratulations on 30k repos - there, fixed it for you :)
To be honest, I didn't even notice it. I did notice the 1PB barrier, long ago when that happened, though...
Update: repos, not models, too
@nicoboss you can, btw., remove the ability of using fuse on my containers, and we can review what else you allowed that, by now, is no longer needed
@mradermacher
How can I queue a model so that it does imatrix in Q8 or provide static Q8 quants for imatrix? snowflake-arctic-base.gguf is 964 GB. For now I just softlinked the source GGUF from bpool to /tmp/quant and queued it using llmc add 0 si https://huggingface.co/Snowflake/snowflake-arctic-base worker nico1
@nicoboss you can, btw., remove the ability of using fuse on my containers, and we can review what else you allowed that, by now, is no longer needed
Great to know. I will remove it the next time I have to reboot your container.
now that we are through the nice level 40 models, and the queue is finally looking a lot cleaner
I'm so happy that we are finally done with all the lownice models! What an event to celebrate!
@RichardErkhov @Guilherme34 @nicoboss @mradermacher congratulations on 30k repos - there, fixed it for you :)
What an insane achievement. And almost 10K imatrix models. We have by far the most repositories of all users on HuggingFace, followed by @RichardErkhov in second place, as can be seen on the huggingface-leaderboard.
How can I queue a model so that it does imatrix in Q8 or provide static Q8 quants for imatrix?
You can't. Currently, when a job wants imatrix quants it will simply queue an imatrix job with essentially default parameters. And the only things the imatrix scheduler can do are download a quant from hf or copy the source gguf from another host.
And you can either try to steal the quant while it's being generated (i.e. hardlink the Q8_0.gguf~), or, realistically, run quantize manually (you can use llama llama-quantize to get the default llama binary in the container), which will be more efficient than downloading it from hf.
Since the imatrix job will not start (too big), there will be time for all that. I can either do the q8 quantize and edit the job, or you can provide the quant (as MODEL.Q8_0.gguf, preferably) and I just edit the job.
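For concreteness, a minimal sketch of that manual quantize step (paths and thread count are assumptions; llama-quantize takes the input gguf, the output gguf, the quant type and optionally a thread count):

# Produce the Q8_0 that the imatrix job will then use.
llama llama-quantize /tmp/quant/snowflake-arctic-base.gguf \
    /tmp/quant/snowflake-arctic-base.Q8_0.gguf Q8_0 32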
(to tickle that competitive nerve) Hmm, but since @RichardErkhov has one repo per model, and we have almost 2, it means it is unclear who has quantized more models. Should be pretty close :)
Nah, we have 20400 static repos alone. Not even close.
@nicoboss I've configured snowflake as background job, configured imatrix for Q8_0 and overrode it. I didn't make a Q8_0 because I wasn't sure what your plans are.
(in theory, you could provide any quant as /tmp/MODEL.gguf, but by configuring the job the imatrix file will keep the Q8_0 in the name).
Just so you know, the 7777 priority models scheduled to nico1 are all the missing models from Cognitive Computations/Eric Hartford. I already have the source model locally for all of them and will manually convert them into GGUF on /bpool and softlink them tomorrow. Once the source GGUF is there I will reprioritize them to 40 I guess.
sorry for making you jump through all these hoops.
No worries it was fun. I knew very well I could have just waited a few hours for you to fix it.
i was trying to improve interactivity for marco on his development machine when doing llm jobs
Oh, you mean interrupt latency? I'm also sometimes having audio stutter on StormPeak and I don't really know why, as I have not allocated all cores/threads to nico1 and I gave my main VM 5000 cpuunits while nico1 only has 100. I never really noticed it until I installed a GTX 980 as the new video and audio card, as the RTX 3080 is currently in CastlePeak and the RTX 2070s in Threadripper in case we need the RPC setup again. The GTX 980 has a bug where, if interrupt latency is too high, it starts playing all audio at what can be described as 0.5x speed until the VM or the monitors are restarted. But this really isn't an important issue for me as it happens relatively rarely, and I have a DisplayPort switch, so it is just a matter of pressing a button to nullroute the DisplayPort cables originating from StormPeak for a second to fix it. If I care in the future, I will try to follow https://gist.github.com/thiagokokada/ce23ef73b0585950c7f825beae646e9e as pinning CPUs to my main VM would for sure fix it.
and just as we were finished, my cat walked over the keyboard. apparently, he pressed some number key (e.g. 4), and saved (ä). that caused all the behaviour you saw, as we then had an additional host called NICO2 without a hostname/ip-address (thus the ": Unknown host").
Wow never thought a cat would have caused this mess. That's quite funny.
the other side of llmc is a different program, so was not affected, and since uppercase nico2 is not a syntax error, llmjob worked as long as it didn't try to contact all hosts.
I noticed the uppercase NICO2 duplicate and was confused about it, but didn't think it was related to the issue.
when working, I have little capacity to monitor things, and the editor was open on another page, so it's only now that I saw it. and it was quite puzzling.
Same for me. During work time (usually from 09:17 - 17:47 from Monday to Friday) I'm often quite busy as I focus on my job.
You simply shouldn't have permissions to do anything, even when running llmjob, since all goes via ssh/rsh. You can mess up the local host, but that's unlikely. Basically, llmc contacts kaos to invoke certain llmjob commands that only make sense on kaos
That is good because that way I can safely debug issues without messing anything up.
well, if you had shell access, you could do quite a bit on any host, the only thing you don't have is the imatrix files and the additional job queue.
I see the additional job queue on the status page and can even interact with it using "nuke" and "add" so I don't really need the raw queue.
You did get to the root of the issue, though, or extremely close, namely the worker "object" without a {name}. the NICO2 was kind of fine, other than for the uppercasing, but I manually overwrite settings for "nico2" later, which resulted in a nonfunctional host.
It was really satisfying that I was able to identify the issue despite never having written perl before and not being aware that the script needs to be copied to /root/s2/llmjob – I first kept trying to edit llmjob/share/llmjob.pm and my changes kept getting reverted.
It seems, though, that at least this time local scheduling did its thing - no lock was held.
nico1 worked perfectly fine, but nico2 stayed idle, and if I remember correctly it did so even before I tried to pause it.
I hope I interpreted your intent correctly and removed the pause.nico2. Apologies if that wasn't your intent.
Also, llmc disable NICO2 and/or nico2 might have helped in this situation, because that removes those workers early on. But I don't think this knowledge (if it even is true :) will be of much use in future disasters.
Damn, so by trying to pause it I was so close to fixing it. Obviously you did the right thing by unpausing it. I just messed around with it and was unable to see the status because the scheduler was down. It is quite strange though, as I'm quite sure I resumed it before going to bed, but resume was broken because of this issue so maybe it simply didn't work.
now that we are through the nice level 40 models, and the queue is finally looking a lot cleaner, I want to look into maybe letting nico2 only do archive models automatically, and then actually switch it off at ~1700, and automatically on at ~0700:
That sounds like a great idea. Feel free to play around with turning it on and off. No hurry: while this week I had awesome weather and during daytime produced way more energy than required, next week is rainy. We might also soon need to adapt the timeofday times due to days getting longer and switching to summer time in 3 weeks.
The chaotic 10-70b's per day time is over. At least for a week now. Still more than before, but definitely more reasonable.
The DeepSeek R1 hype is finally kind of over, but maybe it is just that the early part of the year is always busy, as it has been like this every year since the ChatGPT release. So many companies somehow decide to release their models in Q1.
I started to delay imatrix quants of large models (such as 70b's that don't look like immediate hots). Currently, ~60 of them are still in some list here, and if they don't gather, say, 50 downloads, I think keeping them static only is the sane thing to do.
I agree. We should mainly do imatrix quants for popular/important/high-quality models. For random merges with no description, I tend to not queue them for imatrix unless they are popular or use a very interesting model combination.
nico1 is mostly through its super big models.
That is great. That is what made me finally go back to the Snowflake models, which, thanks to our llama.cpp modifications, we should now be able to do properly despite our imatrix data not always activating all experts.
So unless things change drastically, this seems most reasonable.
Sounds reasonable to me as well.
@nicoboss I've configured snowflake as background job, configured imatrix for Q8_0 and overrode it. I didn't make a Q8_0 because I wasn't sure what your plans are.
(in theory, you could provide any quant as /tmp/MODEL.gguf, but by configuring the job the imatrix file will keep the Q8_0 in the name).
Awesome, thanks a lot! I will try to hardlink-snipe the Q8 if it is generated while I'm awake, and if I miss it, I will just manually provide it as you specified. Maybe I should write a hardlink snipe script at some point that automatically hardlinks a specific quant…
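A minimal sketch of what such a snipe script could look like (the working directory and naming pattern are assumptions; it just polls and hardlinks next to the original, since a hardlink has to stay on the same filesystem):

#!/bin/bash
# Usage: ./snipe.sh MODEL QUANT [DIR], e.g. ./snipe.sh snowflake-arctic-base Q8_0
# Polls the job's working directory and hardlinks the quant (including the
# temporary .gguf~ file) the moment it appears, so it survives the job's cleanup.
MODEL="$1" QUANT="$2" DIR="${3:-/tmp/quant}"

while true; do
    for f in "$DIR/$MODEL".*"$QUANT"*.gguf*; do
        [ -e "$f" ] || continue                                 # glob matched nothing
        [ -e "$f.sniped" ] || { ln "$f" "$f.sniped" && echo "sniped: $f"; }
    done
    sleep 10
done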
Snowflake models, which, thanks to our llama.cpp modifications, we should now be able to do properly despite our imatrix data not always activating all experts.
Were they already in when I switched upstream to you? Because rebuilding is something I need to trigger manually.
While I have you, did you make progress on the llama.cpp model verification hack? (switch, env var...)
Oh you mean interrupt latency?
No, he complained that his firefox sometimes takes a few seconds to react, e.g. when pausing a video. Turns out it happens just as well when no llmjob is running. I mock him for years that his box is always busying two cpu cores for steam, firefox, chromium, thunderbird and a bunch of other shit that should be at 0% cpu but happily "idles" at full cpu usage. For example, default "avatar animation" or whatever it is called in steam takes 1-2 cores of cpu even when the windows aren't mapped. Modern software. Marvels of technology.
Turned out restarting firefox fixed that problem, but I was experimenting with idle or batch priority to see if that makes it better when running cpu-intensive things such as games.
Sound works for him. Well, since we disabled his internal usb sound card and plugged a cheap external usb sound card in.
Wow never thought a cat would have caused this mess. That's quite funny.
Well, I am not 1000% sure it happened this way, but I remember that he stopped fully on the keyboard, something he doesn't normally do, and the resulting editing is consistent with something like that.
I first kept trying to edit llmjob/share/llmjob.pm and my changes kept getting reverted.
You can copy it wherever (else) you want, but since it's the single most edited file, it auto-copies itself to remote machines when it connects. The mechanism is way overdesigned, but for me it's fun to code to reduce latency.
It's not designed to run anywhere but on kaos, except some stuff (such as safe-exec or edit) that should in theory be in another script. It's also neither published nor in any way polished code :)
resume was broken because of this issue so maybe it simply didn't work.
resume should work - the server unlinks a file and then invokes llmjob (which then might fail in spectacular ways), but the unlinking should be independent of llmjob. it might have other issues, but I am not totally unhappy with uptime, given how much I play with the scripts while the stuff is running.
timeofday times due to days getting longer and switching to summer time in 3 weeks.
Shouldn't be so hard to somehow tie it to calculated sunset/sunrise times, actually...
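One possible sketch, assuming outbound network access, jq, and the public sunrise-sunset.org API (coordinates are placeholders; a local astronomical calculation would work just as well):

#!/bin/bash
# Fetch today's sunrise/sunset for the given coordinates (ISO 8601, UTC) so the
# scheduler's on/off window could be derived from them instead of fixed times.
LAT="47.4" LON="8.5"    # placeholder coordinates

resp=$(curl -s "https://api.sunrise-sunset.org/json?lat=$LAT&lng=$LON&formatted=0")
sunrise=$(echo "$resp" | jq -r '.results.sunrise')
sunset=$(echo "$resp" | jq -r '.results.sunset')

# Convert to epoch seconds for easy comparison against $(date +%s).
echo "sunrise: $(date -d "$sunrise" +%s)  sunset: $(date -d "$sunset" +%s)"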
For random merges with no description
I am talking about potentially interesting looking models with some description :/
Maybe I should write a hardlink snipe script at some point that automatically hardlinks a specific quant…
some job option to save a specific quant automatically might make more sense if this happens more often - hardcoding it to work only on nico1 might simplify things, since realistically that's the only place where we will do those models. they are not common enough at the moment, though, to warrant that handling, and they do cause considerable scheduling issues, enough to warrant manual oversight/hacks.
Once the source GGUF is there I will reprioritize them to 40 I guess.
Hmm, I just removed the scheduling hacks for prio 40, so they are treated like high priority models again. The problem is that the scheduler tries to reduce latency by providing static quants first, and also doesn't limit the queue very much for what should essentially be background models, so the queue will always be full.
We could queue such jobs at >50, but then they would not run at night. Which might be the right thing to do for these - the only reason speaking against it is that they already reserve disk space.
Or I could make scheduling classes in addition to or instead of nice levels, but it seems like a big complication. Maybe we should make it so that 50 is the cut-off for jobs handled as interactive jobs, and, say, 500 is the cut-off for jobs that can run at night on nico1.
I've made it so that >= 10 jobs do not run imatrix jobs at nice+1, and documented the nice levels in llmc help. that doesn't fix the queue scheduling, but we'll see if it works out.
If I care in the future, I will try to follow https://gist.github.com/thiagokokada/ce23ef73b0585950c7f825beae646e9e
Ha, I actually visited that page. It frustrates me to no end that seemingly all the pages that talk about qemu (that I use a lot) confuse it with libvirt (that I don't use), pretending they are somehow the same. The pollution on google makes it impossible to find useful kvm information anymore.
another interesting problem:
string_parse_kv_override: malformed KV override 'general.url=str:https://huggingface.co/mradermacher/llama3_circuit-breaker_lorra-10_target-10-20_lora-16-16_lr-1e-04_batch-8_checkpoint-8-i1-GGUF', value cannot exceed 127 chars
Don't we just LOVE arbitrary low limits (gguf can do strings up to 2**64 octets).
I'm doing the q8_0 for snowflake.
Were they already in when I switched upstream to you? Because rebuilding is something I need to trigger manually.
Yes that was part of my very first version.
While I have you, did you make progress on the llama.cpp model verification hack? (switch, env var...)
Same status as last weekend. It is done and fully functional, but I still have to merge it, which I did not have time for so far due to other priorities, but I will try to make some time this evening.
I mock him for years that his box is always busying two cpu cores for steam, firefox, chromium, thunderbird and a bunch of other shit that should be at 0% cpu but happily "idles" at full cpu usage.
I never really saw the appeal of keeping so much shit running in the background for no reason.
Sound works for him. Well, since we disabled his internal usb sound card and plugged a cheap external usb sound card in.
Internal sound cards always break if there is even the slightest amount of latency. I always just use the GPU as sound card and then connect my speakers to my monitor.
Shouldn't be so hard to somehow tie it to calculated sunset/sunrise times, actually...
Or ideally I just take the time to extract the cryptographic keys from the solar inverter. Getting them over UART is not hard but getting to the UART port is a pain.
I am talking about potentially interesting looking models with some description :/
Oh, sad. Then maybe we should do the most promising of them. But if almost nobody downloads them, it might not justify the resources.
they are not common enough at the moment
Yes, fair. By the way, snowflake-arctic-instruct.gguf will be ready in 2 hours as well. This will allow us to redo its imatrix quants, which currently are incomplete and below our quality standards due to the broken imatrix we used.
Hmm, I just removed the scheduling hacks for prio 40, so they are treated like high priority models again. The problem is that the scheduler tries to reduce latency by providing static quants first, and also doesn't limit the queue very much for what should essentially be background models, so the queue will always be full.
True, the local queue always being full is suboptimal, but as long as it only affects nico1 it should not be that big of an issue. How are they filling up the local queue anyways? Their source GGUFs are all stored on bpool and so should not take up any storage budget.
We could queue such jobs at >50, but then they would not run at night. Which might be the right thing to do for these - the only reason speaking against it is that they already reserve disk space.
I want them done in reasonable time as they are taking up dpool storage. They are not urgent but for sure of higher priority than all the historical models.
Or I could make scheduling classes in addition to or instead of nice levels, but it seems like a big complication. Maybe we should make it so that 50 is the cut-off for jobs handled as interactive jobs, and, say, 500 is the cut-off for jobs that can run at night on nico1.
Nice levels are super useful. They allow much more control than classes would. They have far greater use than just indicating the priority. The additional tasks list is sorted by nice level, so they can be abused to batch together certain tasks for further processing. I queued the Cognitive Computations tasks all as 7777 so the ones that successfully queued were together. I then tried converting the source model to GGUF for each of them, reprioritizing all the ones that were successful to 40 while nuking the remaining 7777 llama.cpp-incompatible ones.
I've made it so that >= 10 jobs do not run imatrix jobs at nice+1, and documented the nice levels in llmc help. that doesn't fix the queue scheduling, but we'll see if it works out.
Awesome and thanks for documenting them.
Ha, I actually visited that page. It frustrates me to no end that seemingly all the pages that talk about qemu (that I use a lot) confuse it with libvirt (that I don't use), pretending they are somehow the same. The pollution on google makes it impossible to find useful kvm information anymore.
It indeed is super annoying. Proxmox obviously also doesn't use libvirt, so all the XML they post is completely useless for me as well. All I can do is pass command line arguments directly to KVM or use a hook script that gets executed when a VM starts, so I would have to do something like:
#!/bin/bash
# Proxmox hookscript: after a VM starts, pin all of its QEMU threads to the
# CPU list configured for that VM.
vmid="$1"
phase="$2"
if [[ "$phase" == "post-start" ]]; then
    main_pid="$(< /run/qemu-server/$vmid.pid)"        # PID of the VM's main QEMU process
    cpuset="$(< /etc/pve/qemu-server/$vmid.cpuset)"   # per-VM CPU list, e.g. "0-15"
    taskset --cpu-list --all-tasks --pid "$cpuset" "$main_pid"
fi
And then on your LXC container I could do something like: lxc.cgroup2.cpuset.cpus: 0-50. I might play around with it a bit if I ever find the time, but as long as it is a relatively rare issue that can be fixed by pressing a single button, it doesn't even seem worth the effort trying to fix it.
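For completeness, registering a hookscript like the one above would presumably look roughly like this (the VM ID, snippet name and cpuset file format are placeholders matching the sketch above; the storage must have the snippets content type enabled):

# Store the script as a Proxmox snippet and attach it to the VM.
cp pin-cpus.sh /var/lib/vz/snippets/pin-cpus.sh
chmod +x /var/lib/vz/snippets/pin-cpus.sh
qm set 100 --hookscript local:snippets/pin-cpus.sh

# The per-VM CPU list the hookscript reads (format is an assumption).
echo "0-15" > /etc/pve/qemu-server/100.cpuset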
Don't we just LOVE arbitrary low limits (gguf can do strings up to 2**64 octets).
So stupid. No idea why they keep setting arbitrary limits. Luckily, they actually now removed the one that prevented FatLlama 1.7B from working, so it now should work on official llama.cpp. So us just patching it out was the right decision after all.
I'm doing the q8_0 for snowflake.
Awesome! Thanks a lot!
Yes that was part of my very first version.
While we are at that, did you ever finalise the (computational) performance measurements and have results I could use? Likewise the model evaluations? The only thing you gave me so far was the qwen measurements preview.
(I might be confused)
Same status as last weekend. It is done and fully functional, but I still have to merge it, which I did not have time for so far due to other priorities, but I will try to make some time this evening.
Last you said it was not fully functional (no way to switch it on/off), and you'd work on it monday or so. No pressure, just wondering.
I think I crashed nico1 with snowflake. I think it might be too tight, even when everything else is paused (I think a noquant job was running).
While I have you, did you make progress on the llama.cpp model verification hack? (switch, env var...)
@mradermacher It's done and merged into the mradermacher branch now! I made it toggleable using the existence of a DRYRUN environment variable.
I also implemented support for fake allocating and tracking pinned memory, which is memory located in RAM but used by CUDA, for example when we do -ngl 0. Besides that, I made it so a successful dry run makes llama.cpp exit with code 0, so checking the return code should be enough to tell whether any validation error occurred.
I made the output way easier to process programmatically by always following the same format (a small parsing sketch follows after the example output below):
#!/bin/bash
for file in /transfer/*.gguf; do
echo $file
DRYRUN="" ./llama-cli -m $file -c 2000 -ngl 10 2> >(grep -i '\[DRYRUN\]')
done
I expected there to be 2 GPUs, but one of the RTX 4090s somehow disappeared despite still being connected and used by the nvidia driver. In any case, here is how it looks with 1 GPU:
/transfer/Dolphin-2.9.1-Phi-3-Kensho-4.5B.gguf
[DRYRUN][PINNED]: 10813796352
[DRYRUN][GPU0]: 2265169920
[DRYRUN][GPU0]: 247726080
[DRYRUN][CPU]: 128256
/transfer/ExpTinyDolphin-2.8-1.1b.gguf
[DRYRUN][PINNED]: 1319329792
[DRYRUN][GPU0]: 880967680
[DRYRUN][GPU0]: 20643840
[DRYRUN][CPU]: 128008
/transfer/ExpTinyDolphin-2.8.2-1.1b.gguf
[DRYRUN][PINNED]: 1319329792
[DRYRUN][GPU0]: 880967680
[DRYRUN][GPU0]: 20643840
[DRYRUN][CPU]: 128008
/transfer/Samantha-1.1-70b.gguf
[DRYRUN][PINNED]: 120842518528
[DRYRUN][GPU0]: 17113415680
[DRYRUN][GPU0]: 82575360
[DRYRUN][CPU]: 128000
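Building on that, a small sketch of how a caller could check the exit code and total the reported numbers per device tag (budgeting logic left out; paths and parameters are the same illustrative ones as in the loop above):

#!/bin/bash
# Dry-run one gguf: a non-zero exit (or no DRYRUN lines at all) means the model
# failed validation; otherwise sum the reported allocations per device tag.
set -o pipefail
FILE="$1"

if ! out=$(DRYRUN="" ./llama-cli -m "$FILE" -c 2000 -ngl 10 2>&1 | grep '\[DRYRUN\]'); then
    echo "dry run failed for $FILE" >&2
    exit 1
fi

# Lines look like "[DRYRUN][GPU0]: 2265169920".
echo "$out" | awk -F'[][]' '{ v=$5; sub(/^: */,"",v); sum[$4]+=v }
    END { for (d in sum) printf "%-7s %.2f GiB\n", d, sum[d]/2^30 }'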
While we are at that, did you ever finalise the (computational) performance measurements and have results I could use? Likewise the model evaluations? The only thing you gave me so far was the qwen measurements preview.
We did complete the quality measurements project and almost completed the performance measurement project. There is just a single performance measurement run missing, which I kept delaying as it requires StormPeak to be paused for around 2 nights, and there just never was a good opportunity to block StormPeak since the quants we did were far more important. We really should do it soon, as the performance measurement project is still taking up 4 TB of SSD space.
I think I crashed nico1 with snowflake. I think it might be too tight, even when everything else is paused (I think a noquant job was running).
Perfect timing. That crash fixed the stupid NVidia issue I mentioned in my previous post that made one of the RTX 4090 GPUs disappear. Well, now you have plenty of RAM, as nothing else is running anymore for sure, so please try again. We know it will fit as we did snowflake Q8 in the past. I also used this opportunity to disable FUSE for your LXC container.
I stopped the quantisation tasks so you can soon try again without them getting in your way.
All quant tasks are paused now. How do I start the snowflake-arctic-base imatrix tasks? There doesn't seem to be an llmc command for it, and .force and .nobudget files seem not to work either. I'm generally confused how it is in a blocked/override state without there being an .override file for it, unless it looks at the one of its quants.
Nice, awesome, I see you started it! Thanks a lot. :D
snowflake is fucking fast
It is indeed extremely fast. Only 150 minutes ETA for a 480B model is insane. This is really surprising. It being MoE must for sure play a role. It is also surprisingly fast during inference despite its size.
This is the first time we are testing my llama.cpp modification allowing it to store partial imatrix data and everything is looking great so far:
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (87.50%)
save_imatrix: 16 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (87.50%)
save_imatrix: 16 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (87.50%)
save_imatrix: 16 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.1.ffn_up_exps.weight' has partial data (98.44%)
save_imatrix: 2 out of 128 experts are missing data
save_imatrix: 2 out of 128 experts are missing data - storing but be aware
save_imatrix: entry ' blk.1.ffn_down_exps.weight' has partial data (98.44%)
save_imatrix: 2 out of 128 experts are missing data
save_imatrix: 2 out of 128 experts are missing data - storing but be aware
save_imatrix: entry ' blk.1.ffn_gate_exps.weight' has partial data (98.44%)
save_imatrix: 2 out of 128 experts are missing data
save_imatrix: 2 out of 128 experts are missing data - storing but be aware
save_imatrix: storing only 382 out of 385 entries
[20]4.9080,[21]5.0727,[22]4.9414,[23]4.7952,[24]4.7523,[25]4.7700,[26]4.8153,[27]4.8050,[28]4.8800,[29]5.0040,
I will be asleep by the time snowflake is done. Once snowflake is done and you are available, please resume llmjob.nico1 and rm /tmp/quant/*.override (the overrides probably wouldn't even be needed as llmjob.nico1 is paused).
also, it's not even reporting partial tensors anymore:
[20]4.9080,[21]5.0727,[22]4.9414,[23]4.7952,[24]4.7523,[25]4.7700,[26]4.8153,[27]4.8050,[28]4.8800,[29]5.0040,
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (95.31%)
save_imatrix: 6 out of 128 experts are missing data
save_imatrix: Skipping expert with missing data!
save_imatrix: storing only 382 out of 385 entries
[30]5.1739,[31]5.1239,[32]4.9211,[33]4.7335,[34]4.6702,[35]4.7207,[36]4.6996,[37]4.5234,[38]4.3851,[39]4.3477,
save_imatrix: entry ' blk.0.ffn_down_exps.weight' has partial data (96.09%)
save_imatrix: 5 out of 128 experts are missing data
save_imatrix: 5 out of 128 experts are missing data - storing but be aware
save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (96.09%)
save_imatrix: 5 out of 128 experts are missing data
save_imatrix: 5 out of 128 experts are missing data - storing but be aware
save_imatrix: entry ' blk.0.ffn_up_exps.weight' has partial data (96.09%)
save_imatrix: 5 out of 128 experts are missing data
save_imatrix: 5 out of 128 experts are missing data - storing but be aware
[40]4.3175,[41]4.3034,[42]4.2968,[43]4.2471,[44]4.2497,[45]4.2217,[46]4.2225,[47]4.2511,[48]4.2841,[49]4.3631,[50]4.4170,[51]4.3116,[52]4.2104,[53]4.1270,[54]4.0686,[55]3.9928,[56]3.9960,[57]3.9820,[58]4.0313,[59]4.0980,[60]4.1769,[61]4.1423,[62]4.2102,[63]4.2486,[64]4.2864,[65]4.3346,[66]4.3636,[67]4.4270,[68]4.4747,[69]4.5066,[70]4.5360,[71]4.5895,[72]4.5785,[73]4.5662,[74]4.5537,[75]4.5753,[76]4.6067,[77]4.6239,[78]4.6004,[79]4.6077,[80]4.6247,[81]4.6102,[82]4.6116,[83]4.5931,[84]4.6066,[85]4.6164,[86]4.6175,[87]4.6238,[88]4.6391,[89]4.6282,[90]4.6241,[91]4.6300,[92]4.6154,[93]4.6024,
Wow, amazing, so it seems like we somehow covered all experts. Not really sure if this is because of our imatrix training data, some llama.cpp change that fixed an issue with the expert router, or if all the partial data together managed to create a full imatrix. In any case I'm very excited to test this model's imatrix quants.
No idea why the status page shows this:
0 511 snowflake-arctic-base error/255 (GPU-2d) 0/35 24.66s/c 91.8/150.0m(184.8-145.0) [231/365] 7.4162
The imatrix task seems to still be running perfectly fine without any issues, and the XX/365 even still increases on the status page. I tend towards blaming this on an error of the status page and not the imatrix task.
I suspect that the old job was still running and eventually exited. It only looks scary, though.
However, something else, you queued dolphin-2.5-mixtral-8x7b, but we already have static and dynamic quant repos of that. I set .override for it for the time being. Also, I saw that only by chance, so I can't say if that is true for other jobs.
Samantha-1.1-70b and dolphin-2.1-70b have the same issue - and those are the only three I checked.
I relied on llmc add to skip adding already existing models. I thought it would only add duplicates if I forced them.
I nuked dolphin-2.5-mixtral-8x7b as we already have it. No idea why llmc failed to detect that. Samantha-1.11-70b and Samantha-1.1-70b are strange. They all have imatrix quants but no static quants, which is likely what messed with the llmc existence check. They are also really old, so I recommend we requant them. Not sure why you believe that dolphin-2.1-70b has the same issue. I don't think you ever created quants for it, or at least I can't find them on HuggingFace, so I unblocked it.
I checked all the other CognitiveComputations models I queued and no other had this issue. So it really was just 3 models where the llmc existence check failed me, two of which lacked static quants.
Do we collect imatrix computation logs anywhere? llmc why only shows quantization logs. It would be interesting to know why we are missing experts for this one:
Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
The result will be garbage, so bailing out
llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization
main: failed to quantize model from './BlackSheep-MoE-4x3B.gguf'
Is there a way for me to check the log of running quantisation tasks? I'm curious whether the snowflake-arctic-base imatrix.dat has any tensors with a missing importance matrix.
I just synced the mradermacher branch with the latest llama.cpp master, so the next time we update we will not only have the GGUF validation and memory usage simulation but will also be based on the latest llama.cpp.
They all have imatrix quants but no static quants which is likely what messed with the llmc existence check.
The existence check is just a grep against the submit log, which only exists since June or so. It does not care about parameters, but it does care about the full URL. It was only meant to avoid submitting models already in the queue, unfortunately.
llmc why only shows quantization logs.
It shows both quantisation and imatrix logs, but only when we have them and only when they failed.
Is there a way for me to check the log of running quantisation tasks?
If you have enough filesystem access, they are in /llmjob/sdir/MODEL.log - for simplicity, this is /dev/shm on most nodes, which is why it's not visible in llmc shell, but moving it is on the todo list (somewhere). We have similar issues with /tmp being the default working dir. And /tmp, on marco, seems to be his main document storage directory, which is why it is not accessible via llmc shell either.
so the next time we update we will not only have the GGUF validation and memory usage simulation but will also be based on the latest llama.cpp.
Let's do it at once :)
We did complete the quality measurements project and almost completed the performance measurement project.
I thought you did a lot more than Qwen2.5 (the results I have). I loosely plan to let people choose the model to compare with, and the default will be a trimmed mean of all models.
Also, I will create a new thread soon; it takes chromium a minute to keep up with typing when the posts are unfolded :)
@nicoboss when booting nico2, it seems to get a different internal ip address (i saw ...110, .111) - would it be possible to give it a static address (e.g. 111, what it uses right now)?
@nicoboss when booting nico2, it seems to get a different internal ip address (i saw ...110, .111) - would it be possible to give it a static address (e.g. 111, what it uses right now)?
All IP addresses are static. The second-to-last octet represents the subnetwork (1,2=internet, 200=intranet) and the last octet always represents the VMID, which is a unique identifier for every Container/VM in the entire cluster. You likely mixed up nico1 and nico2 when checking IP addresses. eth0 and eth1 are physically separated networks using separate network equipment, including dedicated 10 Gbit switches and 10 Gbit cables, and so should be used for their intended purpose, which you actually seem to do when transferring data from nico2 to nico1, which I highly appreciate. Here is a list of the most important IP addresses to make things clear:
Node 'StormPeak'
eth0: 192.168.2.100
eth1: 192.168.200.100
Node 'CastlePeak'
eth0: 192.168.2.200
eth1: 192.168.200.1
Node 'Threadripper'
eth0: 192.168.1.200
eth1: 192.168.200.2
Container 108 (nico1) on node 'StormPeak'
eth0: 192.168.2.108
eth1: 192.168.200.108
Container 111 (nico2) on node 'CastlePeak'
eth0: 192.168.2.111
eth1: 192.168.200.111
Container 201 (RPC-GPU) on node 'CastlePeak'
eth0: 192.168.2.201
eth1: 192.168.200.201
Container 202 (RPC-GPU) on node 'StormPeak'
eth0: 192.168.2.202
eth1: 192.168.200.202
Container 203 (RPC-GPU) on node 'StormPeak'
eth0: 192.168.2.203
eth1: 192.168.200.203
Container 204 (RPC-GPU) on node 'Threadripper'
eth0: 192.168.1.204
eth1: 192.168.200.204
Container 107 (AI) on node 'StormPeak'
eth0: 192.168.2.107
eth1: 192.168.200.107
Why are we turning on nico2 at 22:00 if all tasks assigned to it are of nice level 1200 and are therefore timeofday-blocked until tomorrow morning?
nico2 nice size (static/imatrix) -- jobs 2/4-12 maxm 300 free 1483 budget 770 uploads 0 hfd 0 32c
1200 23 I BlackSheep-MoE-4x3B blocked/timeofday/imatrix
1200 8 sI Phi-3.5-mini-instruct-LoRA-128 blocked/timeofday/static
Edit: I see you already turned it off again. So it seems like you came to the same realization or were just testing the new functionality.
The existence check is just a grep against the submit log, which only exists since June or so. It does not care about parameters, but it does care about the full URL. It was only meant to avoid submitting models already in the queue, unfortunately.
Good to know. So I will use my own existence check based on the repolist again in the future. I'm surprised there are not more checks being performed when adding a model. It could check the repolist to see if a model already exists. It downloads the config.json and so could easily check that the architecture is supported and that no quantization section is present. But it's not important to add that, as I have all those checks in my model selection script, and for user-requested models I check the config.json manually anyway.
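Roughly the kind of check I mean, just as a sketch (the repo name is a placeholder and the architecture handling would of course have to follow whatever llama.cpp currently supports):
#!/bin/bash
# hypothetical pre-check: fetch config.json and reject repos that are already quantized
model="some-org/some-model"
cfg="$(curl -sfL "https://huggingface.co/$model/raw/main/config.json")" || { echo "$model: no config.json"; exit 1; }
if jq -e '.quantization_config' <<<"$cfg" >/dev/null; then
    echo "$model: has a quantization_config section, skipping"
else
    echo "$model: architecture $(jq -r '.architectures[0]' <<<"$cfg"), looks fine to queue"
fi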
It shows both quantisation and imatrix logs, but only when we have them and only when they failed.
Ah yes, the imatrix task itself didn't fail; only the quantization task that used the imatrix failed. Makes sense that it isn't there.
If you have enough filesystem access, they are in /llmjob/sdir/MODEL.log - for simplicity
Thanks a lot! This worked on nico1 and I was able to confirm that snowflake-arctic-base was doing well before it reached a quant that would trigger the "Missing importance matrix for tensor XX in a very low-bit quantization" issue.
this is /dev/shm on most nodes, which is why its not visible in llmc shell, but moving it is on the todo (somewhere).
No hurry. Me looking at it when there is no error is rare, and if I do, the model will usually be special in some way and so probably scheduled to nico1.
We have similar issues with /tmp being the default working dir.
Feel free to change the working directory to something else. At least for nico1, nico2 and rich1 there is no benefit in using /tmp. Actually for nico1 it already is /tmp/quant instead of /tmp.
And /tmp., on marco, seems to be his main document storage directory, which is why it is not accessible via llmc shell either.
Wow, what a strange location to put documents. Would be a shame if it just deleted itself on a reboot as it once did on rich1.
Let's do it at once :)
Please do. I'm looking forward to seeing it in action. I highly recommend setting a cgroup memory limit like you do for quantization tasks, in case I missed some strange code path on some niche model, but I tested my code on all the relatively diverse CognitiveComputations models, so I'm confident it works for the vast majority of models. I used it to check which of them are MoE. Turns out mixtral-1x22b-base surprisingly is, despite only containing a single expert.
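Roughly what I have in mind, assuming systemd manages the cgroups on that node (the 64G limit and the model path are placeholders; writing memory.max in the cgroup2 filesystem directly would work just as well):
# run the dry-run in a transient scope with a hard memory cap so a missed code path cannot take down the node
systemd-run --scope -p MemoryMax=64G -p MemorySwapMax=0 \
    env DRYRUN= ./llama-cli -m /tmp/quant/MODEL.gguf -ngl 0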
I thought you did a lot more than Qwen2.5 (the results I have). I loosely plan to let people choose the model to compare with, and the default will be a trimmed mean of all models.
I did more than just Qwen 2.5, but things got quite fragmented over time, so here are the important download links.
Qwen 2.5 series quant quality measurements: http://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
Other quant quality measurements: http://www.nicobosshard.ch/LLM-Eval_v2.tar.zst
Quant performance measurements: http://www.nicobosshard.ch/perfData.zip
Also, I will create a new thread soon; it takes chromium a minute to keep up with typing when the posts are unfolded :)
I'm already excited for the name! :D
Why are we turning on nico2 at 22:00 if all tasks assigned to it are of nice level 1200 and are therefore timeofday-blocked until tomorrow morning?
I am sorry, but you have to allow me to test things - I can't just write things perfectly on first try, usually :)
I'm surprised there are not more checks being performed when adding a model.
Well, it really is an internal tool. The things you wonder about were either not available at the time it was implemented, or were of no consequence (e.g. my frontend knows which architectures are supported). And at least in the past, there was no reliable list of supported architectures. I do note things, of course, but whether something will be implemented or not is a complex problem.
Feel free to change the working directory to something else.
For the jobs, they are already somewhere else (/llmjob/?dir), but /tmp is easy to type....
Wow what a strange location to put documents.
I agree. But there are stranger things. I once had an Austrian coworker who got his new laptop with a massive and unbelievable 16(!!!!)GB of RAM. Having so much RAM, he put everything in a ramdisk, claiming it was super fast to develop this way, despite us telling him it's a dumb idea. Till he accidentally rebooted once. He never mentioned this again.
Would be a shame if it just deleted itself on a reboot as it once did on rich1.
Well, that was outright sabotage - on our boxes, this has not happened once in the last three decades (not even when Debian introduced systemd, though it admittedly was close, as the Debian maintainer wisely implemented upgrade code for this case (so I guess it was not uncommon)) -- chances are therefore low. It's not as if files in /etc magically change normally :(
Regardless of the merits of using /tmp, I still would have preferred tracking down what or who mucked in my /etc and thought deleting files is a good thing to do -- and summarily execute it/him/her. It just makes zero sense to erase /tmp on reboot - cleaning up /tmp, yes, but on reboot, no, that's never correct. And whatever changes some files, will also change other important files, probably.
I did more than just Qwen 2.5, but things got quite fragmented over time, so here are the important download links.
I guess extracting all this is not going to be fun. Sigh.
In other news, the noquant step will now check the resulting gguf using DRYRUN. If everything is red soon, that's why.
All IP addresses are static.
Really strange - after I shut down and woke up nico2 this afternoon, I had to update two of my scripts (that I wrote today for nico2) from .110 to .111 so they would work again.
Update: yeah, I even had the commands in my scrollback, I tested with 192.168.2.108 (to reach nico1) and ..110 (to reach nico2). Very weird. It was even in the .rhosts
nico2 is no longer getting any imatrix. It seems like uploading the source GGUF from nico2 to nico1 no longer works, as I can’t see the source GGUFs on nico1 under /tmp. It already went idle after running out of budget because it lacks the required imatrix to complete the tasks. I was able to temporarily fix this by llmc audit removing the failures, but it will likely run out of work again soon unless you fix this:
ram budget 490 use 0
40+ ? aya-23-8b-oscar error/2 (from nico2)
51+ ? Llama-chatDoctor error/2 (from nico2)
52+ ? Llama-3-8B-Instruct-DB-Expert error/2 (from nico2)
53+ ? Shastra-Mistral-Math-DPO error/2 (from nico2)
54+ ? Viking-SlimSonnet-v1-7B error/2 (from nico2)
55+ ? llama3-1_8b_mlfoundations-dev-stackexchange_sports error/2 (from nico2)
56+ ? Llama-3-8B-Instruct-Coding-Expert error/2 (from nico2)
57+ ? Mistral-MetaMath-7b error/2 (from nico2)
1200 ? Llama3-DiscoLeo-Instruct-8B-32k-v0.1 error/2 (from nico2)
that's not the red I wanted to wake up to :) but yeah, something definitely has changed
I can imagine I got confused between eth0 and eth1 (and, indeed, between nico1 and nico2), but there is no .110 in either list. This is just very strange. Anyway, updating the imatrix transfer address fixed it.
I accidentally nuked BlackSheep-MoE-4x3B earlier today, as I had to keep auditing things all the time due to nico2 being so full, and once I forgot to skip its "Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization" error. The static quants are now done and the imatrix quants half-done. Can I just forcefully requeue it for imatrix quants or do I have to nukerepo the existing imatrix quants first? We will have to change it so it doesn't do the super-low-BPW imatrix quants, as they fail.
Anyway, updating the imatrix transfer address fixed it.
Thanks a lot for fixing it!
I can imagine I got confused between eth0 and eth1 (and, indeed, between nico1 and nico2), but there is no .110 in either list.
No idea why you keep mentioning 110, but in case you are curious, that is the LXC GUI container on Threadripper assigned to RichardErkhov for training his models using the 2070s GPU. 192.168.2.110 does not exist as RichardGUI is on Threadripper, which uses the 1-subnet, and 192.168.200.110 doesn't exist because RichardGUI has no intranet access.
Container 110 (RichardGUI) on node 'Threadripper'
eth0: 192.168.1.110
eth1: Not Present
I am sorry, but you have to allow me to test things - I can't just write things perfectly on first try, usually :)
Feel free to test as much as you want. Totally understandable that it might not end up perfect on the first try. And the same applies to me. I'm still waiting for the moment when some random special model messes with my memory calculation.
Well, it really is an internal tool. The things you wonder about were either not available at the time it was implemented, or were of no consequence (e.g. my frontend knows which architectures are supported). And at least in the past, there was no reliable list of supported architectures. I do note things, of course, but whether something will be implemented or not is a complex problem.
No problem, it's all really low priority.
For the jobs, they are already somewhere else (/llmjob/?dir), but /tmp is easy to type....
Not only that. It is also super easy to remember.
I agree. But there are stranger things. I once had an Austrian coworker who got his new laptop with a massive and unbelievable 16(!!!!)GB of RAM. Having so much RAM, he put everything in a ramdisk, claiming it was super fast to develop this way, despite us telling him it's a dumb idea. Till he accidentally rebooted once. He never mentioned this again.
Haha, that is so stupid. But to be fair, RAM disks are really fast; you just better make sure to automatically synchronize them to some persistent storage medium in near real time, because otherwise it is like never saving your documents. A system crash and all your work is gone.
Regardless of the merits of using /tmp, I still would have preferred tracking down what or who mucked in my /etc and thought deleting files is a good thing to do -- and summarily execute it/him/her.
It is almost certain that it was Mail-in-a-Box. One of their scripts must have somehow messed with it from outside the container. It is the same shit that forced us to reinstall the entire host because it turned it into a mail server. With "Mail-in-a-Box turns a fresh cloud computer into a working mail server." they don't mean installing a mail server on your server but replacing like half the OS with their mail server garbage, with no way to ever uninstall it again without reinstalling the entire server. It in fact replaces every single systemd service with its own and changes a ton of configurations everywhere, but it really shouldn’t have touched the container, which is on a separate SSD with its root file system located multiple folders deep.
In other news, the noquant step will now check the resulting gguf using DRYRUN. If everything is red soon, that's why.
How is it going so far? I have not seen anything red because of it. I would be surprised if it works flawlessly with every niche model.
Can I just forcefully requeue it for imatrix quants or do I have to nukerepo the existing imatrix quants first?
Always - existing quants will be skipped. And yes, the list of quants needs to be edited, so, sucks, but it has to be done manually either way.
No idea why you keep mentioning 110, but in case you are curious, that is the LXC GUI container
Because that was the address I used in all my scripts. Well, three scripts, really.
Feel free to test as much as you want.
There are even unexpected new issues :) For example, I can't update the scripts when it's down, or make a backup. A whole new level of inconvenience, with either exciting hacks to fix it or, best of all, fixed by ignoring the problem.
Not only that. It is also super easy to remember.
That's why I regret using /tmp/quant on nico1, rather than /tmp/imatrix or so. But there are a few other nodes with "exotic" locations, so it's best to use /llmjob/wdir. However, I don't work much on those nodes with exotic locations, and it's nice to have it in, say, /tmp, on those boxes where I am often (mostly kaos and nico1 these days).
Anyway, I wouldn't use /tmp now, but back when I started, all this was very... temporary....
And I am a great proponent of keeping wrong naming choices when they have been ingrained into my memory. Gives stuff flavour.
Haha, that is so stupid.
It was a source of fun for years.
One of their scripts must have somehow messed with it from outside the container.
Probably some find over everything. That is so scary, but yes, these things exist in the wild.
How is it going so far? [dryrun]
Well, most failures of this kind happen a while after queuing jobs, which I only did just now. And usually there are 1-2 per day.
However, the models most likely to have these failures are all through for today. That is suspicious, as I had 1-4 failures every day for the last week.
So, if we see red imatrix statuses, chances are it didn't work.
Regarding logs - I want to combine logs centrally (right now, there are a lot of local logs on nico1/rich1/nico2, and logging via nfs isn't ideal (although it hasn't failed me yet)). And maybe also keep all logs.
Problem is, that's an estimated gigabyte of logs each day, uncompressed, so for the job logs, I'd want to reduce them. I think the most intelligent way is to cut out some stuff, such as attribute printing for noquant, and quantize per-tensor logging, when archiving.
In any case, all of this is fun, and useful, but also not high priority stuff, so I am not sure when I go about it.
Oh, and btw., this is what I use - the switches are there mainly to hopefully reduce issues if I ever stumble over an unpatched llama
DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -b 0 -t 1 -sys "" -st
One minor concern is that llama-cli scans the current directory for files and crashes if it finds filenames it doesn't like (quality software - why does it even list all files). Not sure if that will ever be a concern.
And while you are patiently listening, these are the things on my todo list, if I ever find time:
- teach patchreadme to rename models and files to new upstream names (minimal fun)
- add the model overview page link (multiple failed attempts, not fun anymore)
- collect llmjob logs, and also job logs, better
- add an interactive vram calculator to the model overview
Regarding the model overview page link, I think the main progress I had with this was the realisation last week that it doesn't matter if we generate lots of updates anymore - we could update, say, a few hundred model pages per day, and it will be down in the noise. The main issue I had is that I really want the link to stand out, and each time I came up with an updated style, huggingface broke it by updates to their (undocumented, afaics) markdown syntax.
Always - existing quants will be skipped. And yes, the list of quants needs to be edited, so, sucks, but it has to be done manually either way.
Great to know.
There are even unexpected new issues :) For example, I can't update the scripts when it's down, or make a backup. A whole new level of inconvenience, with either exciting hacks to fix it or, best of all, fixed by ignoring the problem.
Wouldn't the scripts synchronize when it is available again? You can always turn it on if you need it, even if it is just to update a script. Backups are a good point. I should probably enable them for your containers. Actually, you using /tmp for all big files is really convenient, as it is like the only folder excluded by Proxmox backups.
Anyway, I wouldn't use /tmp now, but back when I started, all this was very... temporary....
And I am a great proponent of keeping wrong naming choices when they have been ingrained into my memory. Gives stuff flavour.
Yes even for me it would now be inconvenient to switch as I memorized the path so well.
Probably some find over everything. That is so scary, but yes, these things exist in the wild.
This is almost certainly what they did. Mail-in-a-Box is such a piece of trash. It ruined Richard's entire server and then didn't even work. On the newly installed server Richard then used Mailcow inside an LXC container and it worked perfectly fine and passed all security tests with very little configuration required.
Well, most failures of this kind happen a while after queuing jobs, which I only did just now. And usually there are 1-2 per day.
However, the models most likely to have these failures are all through for today. That is suspicious, as I had 1-4 failures every day for the last week.
So, if we see red imatrix statuses, chances are it didn't work.
Oh, let's hope for the best. No imatrix failure so far but a lot of imatrix tasks will only be started at 22:00 due to most of them currently being timeofday blocked.
Regarding logs - I want to combine logs centrally (right now, there are a lot of local logs on nico1/rich1/nico2, and logging via nfs isn't ideal (although it hasn't failed me yet)). And maybe also keep all logs.
Collecting all the logs would for sure make a lot of sense.
Problem is, that's an estimated gigabyte of logs each day, uncompressed, so for the job logs, I'd want to reduce them. I think the most intelligent way is to cut out some stuff, such as attribute printing for noquant, and quantize per-tensor logging, when archiving.
We are dealing with terabytes of models every day, so a few gigabytes of daily logs are not that bad. Especially if you store them compressed with zstandard level 18, their size will likely be almost irrelevant. But I agree that filtering out useless information, or not even having llama.cpp print it, makes sense.
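Just as a sketch of the kind of filtering and compression I mean (the grep patterns are only placeholders for whatever turns out to be the noisiest output):
# drop per-tensor spam before archiving, then compress with zstandard level 18
grep -v -e 'llama_model_loader: - tensor' -e 'converting to ' MODEL.log | zstd -18 > MODEL.log.zst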
In any case, all of this is fun, and useful, but also not high priority stuff, so I am not sure when I go about it.
Yes, exactly. While having all the logs would be nice, we currently have much more important things to work on.
Oh, and btw., this is what I use - the switches are there mainly to hopefully reduce issues if I ever stumble over an unpatched llama
DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -b 0 -t 1 -sys "" -st
Just so you know, DRYRUN is supposed to work with every llama.cpp executable that loads a model, so you are not limited to llama-cli.
One minor concern is that llama-cli scans the current directory for files and crashes if it finds filenames it doesn't like (quality software - why does it even list all files). Not sure if that will ever be a concern.
Then just don't use llama-cli but any other one that doesn't do this.
new discussion https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/4
With an awesome title as always: "Whimsical Waffle: The Curious Case of LLMs and Their Linguistic Shenanigans" :D
teach patchreadme to rename models and files to new upstream names (minimal fun)
Nice. No idea why everyone keeps renaming their models, but us having a different name makes our models hard to find, so automated renames would be quite useful.
collect llmjob logs, and also job logs, better
I'm looking forward to it but the times we need it are luckily relatively rare.
add an interactive vram calculator to the model overview
That would be amazing! There are quite a lot of factors that influence VRAM usage, but maybe you can find a pattern by playing around with dryrun.
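As a starting point, a dryrun sweep could already collect useful data points, assuming the [DRYRUN][GPU0] output format shown earlier (the model path, context size and layer counts are placeholders):
# sweep the offload layer count and sum up the reported GPU0 allocations
for ngl in 0 8 16 24 32; do
    gpu=$(DRYRUN="" ./llama-cli -m /tmp/quant/MODEL.gguf -c 4096 -ngl "$ngl" 2>&1 \
        | grep '\[DRYRUN\]\[GPU0\]' | awk '{sum += $2} END {print sum + 0}')
    echo "ngl=$ngl gpu0_bytes=$gpu"
done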
add the model overview page link (multiple failed attempts, not fun anymore).
Regarding the model overview page link, I think the main progress I had with this was the realisation last week that it doesn't matter if we generate lots of updates anymore - we could update, say, a few hundred model pages per day, and it will be down in the noise.
Hundreds of daily updates really don't matter. If someone follows us they get so much spam they won’t look at it anyways. What personally bothers me way more is that models always show the date when they were last updated, even when sorted by last created. But nothing we can do about it. I guess we can at least try to update them in chronological order, so the order stays the same. Or can we?!? HuggingFace uses Git and might trust the git commit and author dates, so you could spoof them by setting the GIT_COMMITTER_DATE and GIT_AUTHOR_DATE environment variables before committing and see if HuggingFace then uses the spoofed last-updated date.
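A quick test could look roughly like this (the date and the repo directory are of course just placeholders):
# commit a model card edit with spoofed dates and check whether HuggingFace shows them as the last-updated date
cd SomeModel-GGUF
GIT_AUTHOR_DATE="2024-01-01T00:00:00" GIT_COMMITTER_DATE="2024-01-01T00:00:00" \
    git commit -am "update model card"
git push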
The main issue I had is that I really want the link to stand out, and each time I came up with an updated style, huggingface broke it by updates to their (undocumented, afaics) markdown syntax.
We for sure want to have it stand out. While we are at it, we really should generally overhaul the rest of the model cards and make them perfect, as this will hopefully be the last time we ever edit all of them.