Velvet-Eclipse
This is my first model. I was stoked when I saw that you generated quants for it! But then it disappeared shortly afterwards.
Is there something wrong with the model? I think the unique 3x12B model is actually pretty solid so far (though not perfect).
I would love to get it on your quant list again if possible!
SuperbEmphasis/Velvet-Eclipse-v0.1-3x12B-MoE
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-3x12B-MoE
Is there something wrong with the model? I think the unique 3x12B model is actually pretty solid so far (though not perfect).
Yeah, it doesn't work in llama.cpp:
/llmjob/llama.cpp-cuda512/examples/imatrix/imatrix.cpp:470: GGML_ASSERT(!llama_vocab_get_add_eos(vocab)) failed
What version of llama.cpp are you running?
I'm using llama.cpp with their official docker container without issue.
Well, to be specific, the GGUF quant from my repo works:
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-3x12B-MoE-Q4_K_M-GGUF
We are currently on 5122. Did you actually try generating an imatrix, or did you just run inference?
Just ran inference. I was trying to figure out how to make an imatrix dataset and what the layout should look like (parquet, json, jsonl?), but I had trouble working that out.
Maybe it is because I only have 3 experts? mergekit warns that llama.cpp won't be able to run the model, but it seemed to work okay. This is the first model I have ever made/merged ("made" is a bit of a stretch here...), so I am still learning.
Also, thank you for your efforts in general! I always go to your weighted quants. So I really appreciate how much time and effort you put into this!
Well, if mergekit warns, that should probably be taken seriously, but this crash is a tokenizer problem. Maybe newer llama.cpp versions have a workaround, or maybe it's a bug in llama.cpp or the converter. I'll retry once we upgrade next time.
In this case, from what I see, the mergekit warning is probably wrong. Few-expert MoEs should not be a problem per se.
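As for the imatrix dataset question: llama-imatrix doesn't take parquet/json/jsonl at all, it just reads a plain text file of calibration data via -f and writes the importance matrix with -o. A minimal sketch, with made-up file names (we use our own calibration text, so treat this as illustrative only):

llama-imatrix -m Velvet-Eclipse-v0.1-3x12B-MoE.f16.gguf \
    -f calibration.txt \
    -o Velvet-Eclipse.imatrix \
    -ngl 99

llama-quantize then picks up the result via --imatrix when producing the weighted quants.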
@mradermacher Please update to the latest version of our llama.cpp fork and try again. The latest version also finally supports MLA for V2/V3/R1, but heated discussions about MLA-related performance issues are still ongoing, so let's wait for those to conclude before we do them.
I also went ahead and made a 4x12 version
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE
:D
so cool :)
I also went ahead and made a 4x12 version
I'll queue both again, let's see if that fixes things...
so let's wait
most sane thing i heard in a week or so :-)
Thank you! I love using your quants.
I also used some pruning magic to turn the Mistral fine-tune FrankenMoE merge into something where the Q4 quant will fit within 24GB of VRAM with some good context. Could you add this one too, if the others work:
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE-EVISCERATED
:D
they work / also added / you get the nightshift mostly to yourself :)
nope, still fails the same way with llama 5179 (the 3x MoE). the 4x ones work.
@nicoboss any idea why he can run it (e.g. with llama-cli), while ours fails? our llama fails with his quant as well.
Weird... For what it's worth, I'm not running anything special.
docker run -itd -p 8080:8080 --name llamacpp -v /mnt/local-data/models:/models:z \
--security-opt label=type:nvidia_container_t \
--device nvidia.com/gpu=all \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m /models/${MODEL} \
-c 12000 --host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 --flash-attn \
--cache-type-k q8_0 --cache-type-v q8_0
And my version:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro P6000, compute capability 6.1, VMM: yes
version: 5083 (7538246e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
something weird is going on. this problem occurs with upstream llama too, regardless of compile-time options. maybe asserts are simply switched off somehow in the versions where it works?
so, i debugged it a bit further. from what i can see in the code, llama crashes because it was coded to crash on this model: basically, when tokenizer.ggml.add_eos_token is true and the architecture is llama, it crashes (main.cpp:261) - and both are true for this model.
this is either a bug in llama.cpp or in the model, and it seems there is nothing we can fix or do about it, as it happens with all ggufs.
the failing check is present in 5083 as well, and afaics there is no way to escape it. so, somehow, you must have a version of llama.cpp that simply ignores assertion failures.
I've asked nico, he has a better grasp of llama.cpp. maybe he can pipe in with something (but he is quite busy atm.)
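if you want to double-check the flag yourself, the gguf python package (llama.cpp's gguf-py) ships a gguf-dump script; something like this should show it (the file name is just an example, and i haven't tried patching it, so take the second command as a sketch only):

pip install gguf
gguf-dump Velvet-Eclipse-v0.1-3x12B-MoE.Q4_K_M.gguf | grep add_eos_token
gguf-set-metadata Velvet-Eclipse-v0.1-3x12B-MoE.Q4_K_M.gguf tokenizer.ggml.add_eos_token 0

flipping tokenizer.ggml.add_eos_token off would probably sidestep the assert, but it may just paper over whatever the converter or merge did.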
That's interesting. Another difference is that I am not using llama-cli, I am using llama-server for the OpenAI API. I wonder if the check does not exist in the llama-server binary?
Thank you for looking into it! I have been learning a lot and I really appreciate all of your quants! LocalLLaMA on reddit wouldn't be the same without you guys!
If I could ask one more question: how are updates handled? For the "EVISCERATED" model, I am still tinkering with which layers I remove. The last revision I uploaded a couple of hours ago works pretty well, but gets a bit repetitive, which I think is a symptom of my LLM brain surgery. So if I update that model, does your process detect it and requant/upload as well?
Also, I would be super curious about the pipeline you use for handling all of this, if you have blogged or posted about it :D
Thank you!
I made a new pruned MoE that works way better.
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE-EVISCERATED
And I have no idea how your pipeline works, so apologies if this is annoying to you. I noticed a lot of people downloaded your imatrix gguf (which is awesome). But I'd love to get the better-working pruned one GGUF'd using your process!
That's interesting. Another difference is that I am not using llama-cli, I am using llama-server for the OpenAI API. I wonder if the check does not exist in the llama-server binary?
That's it. @nicoboss looks like a bug in llama.cpp, what do you think?
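(A quick way to see which tools carry the check, run inside a llama.cpp checkout; the symbol is the one from the assert above:

git grep -n "llama_vocab_get_add_eos"

In our tree it shows up in the cli and imatrix code but not in the server, which would explain why llama-server happily loads the model.)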
And I have no idea how your pipeline works, so apologies if this is annoying to you.
It's not annoying. Or rather, you are not annoying; llama.cpp is annoying: no stable releases, no quality control, etc. ;)
Anyway, our pipeline is pretty simple: we convert the model, quantize it, and possibly create an imatrix. Since there is a relatively high failure rate, we "recently" started to test-run models, so we catch problems before we do a costly static quantize + imatrix calculation.
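In llama.cpp terms it boils down to roughly this (names are placeholders, and the real thing is buried in a lot of scheduling/queueing glue):

python convert_hf_to_gguf.py ./Velvet-Eclipse-v0.1-4x12B-MoE --outtype f16 --outfile model.f16.gguf
llama-cli -m model.f16.gguf -p "Hello" -n 32              # quick test-run to catch broken models early
llama-quantize model.f16.gguf model.Q4_K_M.gguf Q4_K_M    # static quant
# plus the llama-imatrix step from earlier, feeding llama-quantize --imatrix for the weighted quants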
Unfortunately, both llama-imatrix and llama-cli do contain this check, and it looks pretty deliberate (still looks like a bug, imho, since the model clearly works. it currently affects >>150 models though, not just yours).
Thanks @mradermacher !
Could I request that:
https://huggingface.co/SuperbEmphasis/Velvet-Eclipse-v0.1-4x12B-MoE-EVISCERATED (not the 3x12B model or the other 4x12B model) get an update? When this was first queued, I was experimenting with removing 4 layers, but that caused lots of repetition. I have since remade the model to prune only 2 layers from each of the four 12B experts, and it works much better, but anyone downloading the old quants will get poor results.
Thank you for your awesome work!
Sure, but may I suggest making a new repo next time? It's very hard to see (and practically impossible to link to) specific commits - even huggingface doesn't do it and will simply link to the wrong model. It would also allow us to keep both models, which is useful if the old model is not totally broken (I've seen lots of "improved" models that actually work worse, even though the creator thought they were simply better).