Velvet-Eclipse
This is my first model. I was stoked when I saw that you generated quants for it! But then it disappeared shortly afterwards.
Is there something wrong with the model? I think the unique 3x12B model is actually pretty solid so far (though not perfect).
I would love to get it on your quant list again if possible!
SuperbEmphasis/Velvet-Eclipse-v0.1-3x12B-MoE
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-3x12B-MoE
Is there something wrong with the model? I think the unique 3x12B model is actually pretty solid so far (though not perfect).
Yeah, it doesn't work in llama.cpp:
/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
What version of llama.cpp are you running?
I'm using llama.cpp with their official docker container without issue.
Well, to be specific, the GGUF quant from my repo works:
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-3x12B-MoE-Q4_K_M-GGUF
We are currently on 5122. Did you actually try generating an imatrix, or did you just run inference?
Just ran inference. I was trying to figure out how to make an imatrix dataset and what the layout should look like (parquet, json, jsonl?), but I had trouble working that out.
Maybe it is because I only have 3 experts? mergekit warns that llama.cpp won't be able to run the model, but it seemed to work okay. This is the first model I have ever made/merged ("made" is a bit of a stretch here...), so I am still learning.
Also, thank you for your efforts in general! I always go to your weighted quants. So I really appreciate how much time and effort you put into this!
Well, if mergekit warns, that should probably be taken seriously, but this crash is a tokenizer problem. Maybe newer llama.cpp versions have a workaround, or maybe it's a bug in llama.cpp or the converter. I'll retry once we upgrade next time.
In this case, from what I see, the mergekit warning is probably wrong. Few-expert MoEs should not be a problem per se.
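As for the imatrix dataset question: llama-imatrix doesn't take parquet/json/jsonl at all, it just reads a plain text file of calibration data via -f and writes the importance matrix with -o. A minimal sketch, with made-up file names (we use our own calibration text, so treat this as illustrative only):

llama-imatrix -m Velvet-Eclipse-v0.1-3x12B-MoE.f16.gguf \
    -f calibration.txt \
    -o Velvet-Eclipse.imatrix \
    -ngl 99

llama-quantize then picks up the result via --imatrix when producing the weighted quants.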
@mradermacher Please update to the latest version of our llama.cpp fork and try again. The latest version also finally supports MLA for V2/V3/R1, but heated discussions about MLA-related performance issues are still ongoing, so let's wait for those to conclude before we do them.
I also went ahead and made a 4x12 version
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE
:D
so cool :)
I also went ahead and made a 4x12 version
I'll queue both again, let's see if that fixes things...
so let's wait
most sane thing i heard in a week or so :-)
Thank you! I love using your quants.
I also used some pruning magic to turn the Mistral fine-tune FrankenMoE merge into something where the Q4 quant will fit within 24GB of VRAM with some good context. Could you add this one too, if the others work:
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE-EVISCERATED
:D
they work / also added / you get the nightshift mostly to yourself :)
nope, still fails the same way with llama 5179 (the 3x MoE). the 4x ones work.
@nicoboss any idea why he can run it (e.g. with llama-cli), while ours fails? our llama fails with his quant as well.
Weird... For what it's worth, I'm not running anything special.
docker run -itd -p 8080:8080 --name llamacpp -v /mnt/local-data/models:/models:z \
--security-opt label=type:nvidia_container_t \
--device nvidia.com/gpu=all \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/${MODEL} \
-c 12000 --host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 --flash-attn \
--cache-type-k q8_0 --cache-type-v q8_0
And my version:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro P6000, compute capability 6.1, VMM: yes
version: 5083 (7538246e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
something weird is going on. this problem occurs with upstream llama too, regardless of compile-time options. maybe asserts are simply switched off somehow in the versions where it works?
so, i debugged it a bit further. from what i can see in the code, llama crashes because it was coded to crash on this model: basically, when tokenizer.ggml.add_eos_token is true and the architecture is llama, it crashes (main.cpp:261) - and both are true for this model.
this is either a bug in llama.cpp or in the model, and it seems there is nothing we can fix or do about it, as it happens with all ggufs.
the failing check is present in 5083 as well, and afaics there is no way to escape it. so, somehow, you must have a version of llama.cpp that simply ignores assertion failures.
I've asked nico, he has a better grasp of llama.cpp. maybe he can pipe in with something (but he is quite busy atm.)
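if you want to double-check the flag yourself, the gguf python package (llama.cpp's gguf-py) ships a gguf-dump script; something like this should show it (the file name is just an example, and i haven't tried patching it, so take the second command as a sketch only):

pip install gguf
gguf-dump Velvet-Eclipse-v0.1-3x12B-MoE.Q4_K_M.gguf | grep add_eos_token
gguf-set-metadata Velvet-Eclipse-v0.1-3x12B-MoE.Q4_K_M.gguf tokenizer.ggml.add_eos_token 0

flipping tokenizer.ggml.add_eos_token off would probably sidestep the assert, but it may just paper over whatever the converter or merge did.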
That's interesting. Another difference is that I am not using llama-cli, I am using llama-server for the OpenAI API. I wonder if the check does not exist in the llama-server binary?
Thank you for looking into it! I have been learning a lot and I really appreciate all of your quants! LocalLLaMA on reddit wouldn't be the same without you guys!
If I could ask one more question: how are updates handled? For the "EVISCERATED" model, I am still tinkering with which layers I remove. The last revision I uploaded a couple of hours ago works pretty well, but gets a bit repetitive, which I think is a symptom of my LLM brain surgery. So if I update that model, does your process detect it and requant/upload as well?
Also, I would be super curious about the pipeline you use for handling all of this, if you have blogged or posted about it :D
Thank you!
I made a new pruned MoE that works way better.
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE-EVISCERATED
And I have no idea how your pipeline works, so apologies if this is annoying to you. I noticed a lot of people downloaded your imatrix gguf (which is awesome). But I'd love to get the better-working pruned one GGUF'd using your process!
That's interesting. Another difference is that I am not using llama-cli, I am using llama-server for the OpenAI API. I wonder if the check does not exist in the llama-server binary?
That's it. @nicoboss looks like a bug in llama.cpp, what do you think?
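(A quick way to see which tools carry the check, run inside a llama.cpp checkout; the symbol is the one from the assert above:

git grep -n "llama_vocab_get_add_eos"

In our tree it shows up in the cli and imatrix code but not in the server, which would explain why llama-server happily loads the model.)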
And I have no idea how your pipeline works, so apologies if this is annoying to you.
It's not annoying. Or rather, you are not annoying; llama.cpp is annoying: no stable releases, no quality control, etc. ;)
Anyway, our pipeline is pretty simple: we convert the model, quantize it, and possibly create an imatrix. Since there is a relatively high failure rate, we "recently" started to test-run models, so we catch problems before we do a costly static quantize + imatrix calculation.
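In llama.cpp terms it boils down to roughly this (names are placeholders, and the real thing is buried in a lot of scheduling/queueing glue):

python convert_hf_to_gguf.py ./Velvet-Eclipse-v0.1-4x12B-MoE --outtype f16 --outfile model.f16.gguf
llama-cli -m model.f16.gguf -p "Hello" -n 32              # quick test-run to catch broken models early
llama-quantize model.f16.gguf model.Q4_K_M.gguf Q4_K_M    # static quant
# plus the llama-imatrix step from earlier, feeding llama-quantize --imatrix for the weighted quants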
Unfortunately, both llama-imatrix and llama-cli do contain this check, and it looks pretty deliberate (still looks like a bug, imho, since the model clearly works. it currently affects >>150 models though, not just yours).
Thanks @mradermacher !
Could I request that:
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE-EVISCERATED (not the 3x12B model or the other 4x12B model) get an update? When this was first queued, I was experimenting with removing 4 layers, but that caused lots of repetition. I have since remade the model to prune only 2 layers from each of the four 12B experts, and it works much better, but anyone downloading the old quants will get poor results.
Thank you for your awesome work!
Sure, but may I suggest making a new repo next time? It's very hard to see (and practically impossible to link to) specific commits - even huggingface doesn't do it and will simply link to the wrong model. It would also allow us to keep both models, which is useful if the old model is not totally broken (I've seen lots of "improved" models that actually work worse, even though the creator thought they were simply better).