Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I assume that is in GB and not GiB, in which case 474 GiB might fit, as we have 503 GiB of RAM (after subtracting the RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
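For reference, the conversion works out roughly like this (a quick bash sketch; the 503 GiB usable-RAM figure is the one mentioned above):

```bash
# convert the reported size (decimal GB) to GiB and compare against usable RAM
size_gb=509
size_gib=$(( size_gb * 1000**3 / 1024**3 ))   # ~474 GiB
echo "model: ${size_gib} GiB of 503 GiB usable RAM"
```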
I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.
Q6_K is fine for me. Q8_0 might not fit without offloading, and it is unclear whether offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus, there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have the Q8_0 locally, you should give it a try and see if it fits, but if not, Q6_K is fine for me.
I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf,
so please give it a try and see if it fits. I believe it should fit if nothing else is running, as the model has so few layers. If it doesn't fit, use Q6_K instead.
474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf
I'll try an offload of 1 and 0, then Q6. Hopefully it does not crash.
I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
457.4g after warming up.
So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)
llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1,
and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
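For context, -ngl only counts whole layers, so -ngl 1 puts exactly one layer on a GPU and -ngl 0 keeps everything on the CPU. A rough sketch of the kind of invocation involved (file names are placeholders, not the actual job command):

```bash
# offload a single layer to the GPU; raise -ngl to offload more, 0 disables GPU offload
llama-imatrix -m snowflake-arctic-instruct.Q8_0.gguf \
    -f calibration.txt -o snowflake-arctic-instruct.imatrix -ngl 1
```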
I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.
I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:
Yes it is clearly streaming from SSD now:
Once the quantisation tasks are interrupted it should work without SSD streaming again.
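If you want to watch that, something along these lines shows it (just a sketch, not the exact commands used here): the quantize processes count as used memory, while the mmap'ed model only lives in page cache that the kernel can drop, forcing imatrix to re-read from SSD.

```bash
# MemAvailable shrinks when quantize tasks grab reserved RAM; the mmap'ed model
# shows up under Cached and can be evicted at any time
grep -E 'MemAvailable|^Cached' /proc/meminfo
# a sustained high disk read rate means imatrix is streaming from SSD again
dstat -d 5
```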
Regarding logs - I want to combine logs centrally (right now, there are a lot of local logs on nico1/rich1/nico2, and logging via nfs isn't ideal (although it hasn't failed me yet)). And maybe also keep all logs.
Problem is, that's an estimated gigabyte of logs each day, uncompressed, so for the job logs, I'd want to reduce them. I think the most intelligent way is to cut out some stuff, such as attribute printing for noquant, and quantize per-tensor logging, when archiving.
In any case, all of this is fun, and useful, but also not high priority stuff, so I am not sure when I go about it.
Oh, and btw., this is what I use - the switches are there mainly to hopefully reduce issues if I ever stumble over an unpatched llama
DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -b 0 -t 1 -sys "" -st
One minor concern is that llama-cli scans the current directory for files and crashes if it finds filenames it doesn't like (quality software - why does it even list all files). Not sure if that will ever be a concern.
And while you are patiently listening, these are the things on my todo list, if I ever find time:
- teach patchreadme to rename models and files to new upstream names (minimal fun)
- add the model overview page link (multiple failed attempts, not fun anymore)
- collect llmjob logs, and also job logs, better
- add an interactive vram calculator to the model overview
Regarding the model overview page link, I think the main progress I had with this was the realisation I had last week that it doesn't matter if we generate lots of updates anymore - we should update, say, a few hundred model pages per day, and it will be down in the noise. The main issue I had is that I really want the link to stand out, and each time I came up with an updated style, huggingface broke it by updates to their (undocumented, afaics) markdown syntax.
Always - existing quants will be skipped. And yes, the list of quants needs to be edited, which sucks, but it has to be done manually either way.
Great to know.
There are even unexpected new issues :) For example, I can't update the scripts when it's down, or make a backup. A whole new level of inconvenience with either exciting hacks to fix or, best of all, fixed by ignoring the problem.
Wouldn't the scripts synchronize when it is available again? You can always turn it on if you need it, even if it is just to update a script. Backups are a good point. I probably should enable them for your containers. Actually, you using /tmp for all big files is really convenient, as it is basically the only folder excluded from Proxmox backups.
Anyway, I wouldn't use /tmp now, but back when I started, all this was very... temporary....
And I am a great proponent of keeping wrong naming choices when they have been ingrained into my memory. Gives stuff flavour.
Yes even for me it would now be inconvenient to switch as I memorized the path so well.
Probably some find over everything. That is so scary, but yes, these things exist in the wild.
This is almost certainly what they did. Mail-in-a-Box is such a piece of trash. It ruined Richard's entire server and then didn't even work. On the newly installed server Richard then used Mailcow inside an LXC container and it worked perfectly fine and passed all security tests with very little configuration required.
Well, most failures of this kind happen a while after queuing jobs, which I only did just now. And usually there are 1-2 per day.
However, the models most likely to have these failures are all through for today. That is suspicious, as I had 1-4 failures every day for the last week.
So, if we see red imatrix statuses, chances are it didn't work.
Oh, let's hope for the best. No imatrix failures so far, but a lot of imatrix tasks will only start at 22:00, as most of them are currently timeofday blocked.
Regarding logs - I want to combine logs centrally (right now, there are a lot of local logs on nico1/rich1/nico2, and logging via nfs isn't ideal (although it hasn't failed me yet)). And maybe also keep all logs.
Collecting all the logs would for sure make a lot of sense.
Problem is, that's an estimated gigabyte of logs each day, uncompressed, so for the job logs, I'd want to reduce them. I think the most intelligent way is to cut out some stuff, such as attribute printing for noquant, and quantize per-tensor logging, when archiving.
We are dealing with terabytes of models every day, so a few gigabytes of daily logs are not that bad. Especially if you store them zstandard level 18 compressed, their size will likely be almost irrelevant. But I agree that filtering out useless information, or not even having llama.cpp print it, makes sense.
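As a rough illustration of what that could look like when archiving (a sketch; the log path and the filter patterns are made up, not the real job layout):

```bash
# drop the noisiest lines (per-tensor quantize output, noquant attribute dumps)
# and archive the rest at zstd level 18
grep -vE 'converting to |attribute ' job.log | zstd -18 -o job.log.zst
```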
In any case, all of this is fun, and useful, but also not high priority stuff, so I am not sure when I go about it.
Yes, exactly. While having all the logs would be nice, we currently have much more important things to work on.
Oh, and btw., this is what I use - the switches are there mainly to hopefully reduce issues if I ever stumble over an unpatched llama
DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -b 0 -t 1 -sys "" -st
Just so you know, DRYRUN is supposed to work with every llama.cpp executable that loads a model, so you are not limited to llama-cli.
One minor concern is that llama-cli scans the current directory for files and crashes if it finds filenames it doesn't like (quality software - why does it even list all files). Not sure if that will ever be a concern.
Then just don't use llama-cli but any other one that doesn't do this.
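Another option (just a sketch) would be to run the dry run from an empty scratch directory, so there is nothing in the current directory for llama-cli to trip over:

```bash
# run from an empty directory so llama-cli finds no stray files to choke on
# (assumes $SRC is an absolute path)
cd "$(mktemp -d)" && DRYRUN= llama llama-cli -m "$SRC.gguf~" --no-warmup -b 0 -t 1 -sys "" -st
```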
new discussion https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/4
With an awesome title as always: "Whimsical Waffle: The Curious Case of LLMs and Their Linguistic Shenanigans" :D
teach patchreadme to rename models and files to new upstream names (minimal fun)
Nice. No idea why everyone keeps renaming their models, but us having a different name makes our models hard to find, so automated renames would be quite useful.
collect llmjob logs, and also job logs, better
I'm looking forward to it but the times we need it are luckily relatively rare.
add an interactive vram calculator to the model overview
That would be amazing! There are quite a lot of factors that influence VRAM usage, but maybe you can find a pattern by playing around with DRYRUN.
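As a starting point for such a calculator, the KV cache is at least one big, predictable contributor; a back-of-the-envelope sketch with made-up example values (the real ones come from the GGUF metadata):

```bash
# f16 KV cache: 2 (K and V) * layers * context * kv_heads * head_dim * 2 bytes per element
n_layers=32; n_ctx=8192; n_kv_heads=8; head_dim=128; bytes_per_el=2
echo "$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_el / 1024 / 1024 )) MiB KV cache"
```

The weights of the offloaded layers and the compute buffers come on top of that, which is where measuring with DRYRUN would help.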
add the model overview page link (multiple failed attempts, not fun anymore).
Regarding the model overview page link, I think the main progress I had with this was the realisation I had last week that it doesn't matter if we generate lots of updates anymore - we should update, say, a few hundred model pages per day, and it will be down in the noise.
Hundreds of daily updates really don't matter. If someone follows us, they get so much spam they won't look at it anyways. What personally bothers me way more is that models always show the date when they were last updated, even if sorted by last created. But there is nothing we can do about it. I guess we can at least try to update them in chronological order, so the order stays the same. Or can we?!? HuggingFace uses Git and might trust the git commit and author dates, so you could spoof them by setting the GIT_COMMITTER_DATE and GIT_AUTHOR_DATE environment variables before committing with git and see if HuggingFace then uses the spoofed last-updated date.
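For what it's worth, the spoofing experiment would look roughly like this (a sketch; whether HuggingFace actually honours these dates is exactly what would need testing):

```bash
# set both git dates on the model-card commit, then check what HuggingFace displays
GIT_AUTHOR_DATE="2024-01-01T00:00:00" \
GIT_COMMITTER_DATE="2024-01-01T00:00:00" \
git commit -m "update model card" README.md
git push
```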
The main issue I had is that I really want the link to stand out, and each time I came up with an updated style, huggingface broke it by updates to their (undocumented, afaics) markdown syntax.
We for sure want to have it stand out. While we are at it, we really should do a general overhaul of the rest of the model cards and make them perfect, as this will hopefully be the last time we ever edit all of them.
@mradermacher @nicoboss I didn't read the messages you had, but just letting you know that apparently we again hit the rate limits lol. My rich1 is a bit idle right now, or more like has been for the last hour or so at least.
@RichardErkhov we now moved on to https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/4. The reason we managed to hit the rate limit is that we did a ton of small 1200-priority models today. @mradermacher Let's maybe queue imatrix quants for that snowflake arctic instruct on /apool so we are less likely to run into rate limits. In any case, it is 17:00 now, so nico1 and nico2 will be idle/off and we will for sure soon drop below the limit.