GLM 4 32B too?
Hey! I'll take a look at it, but don't expect it anytime soon; there are too many things in my backlog.
I will give it a try, but it will be a couple of days: I'm currently creating a 12-headed (~0.4B) distilled version of Qwen2.5-0.5B-Instruct that I can use as the base for future draft models (i.e. instead of having to trim to 12 heads and then retrain every time for a new model).
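For anyone curious, the "trim to 12 heads" step is basically weight surgery along these lines. This is only a rough sketch, assuming Qwen2.5-0.5B's layout (14 query heads of dim 64 with 2 GQA KV heads, so only `q_proj`/`o_proj` change); which heads to keep is a placeholder choice here, not my actual selection criterion, and whether the config's `head_dim` is honored on reload depends on your transformers version:

```python
# Rough sketch of trimming Qwen2.5-0.5B from 14 to 12 query heads.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
head_dim = 64
keep = torch.arange(12)  # placeholder: keep the first 12 query heads
idx = (keep[:, None] * head_dim + torch.arange(head_dim)).flatten()

for layer in model.model.layers:
    attn = layer.self_attn
    # drop rows of the query projection and the matching columns of the output projection
    attn.q_proj.weight.data = attn.q_proj.weight.data[idx].clone()
    attn.q_proj.bias.data = attn.q_proj.bias.data[idx].clone()  # Qwen2 uses Q/K/V biases
    attn.o_proj.weight.data = attn.o_proj.weight.data[:, idx].clone()

model.config.num_attention_heads = 12
model.config.head_dim = 64  # pin head_dim: 896 is no longer divisible by 12
model.save_pretrained("qwen2.5-12head")  # reload from here before retraining,
# so the attention modules are rebuilt with the new head count
```

The KV heads can stay untouched since 12 query heads still divide evenly into 2 KV groups; the retraining afterwards is what recovers the lost quality.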
Once I have this, I should (hopefully) be able to create draft models using far less data. I can generate around 0.5B tokens per day using 7 GPUs, and I hope to have at least 2B tokens for creating the distilled version.
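The distillation objective itself is nothing exotic: a standard soft cross-entropy against the teacher's logits, something like this sketch (the temperature and exact loss details are assumptions, whatever ends up working best):

```python
# Sketch of a standard logit-distillation loss: soft cross-entropy of the
# student against the teacher's softened distribution (forward KL up to a
# constant that doesn't depend on the student).
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Both inputs are (batch, seq, vocab); returns a scalar loss."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # scale by t^2 to keep gradient magnitudes comparable across temperatures
    return -(teacher_probs * student_logp).sum(dim=-1).mean() * (t * t)
```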
If it works, I'll try GLM-4-32B-0414 first, as it seems a good test case: there are no tiny models available to use as a draft for it.
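For context, "use as a draft" would look something like this on the transformers side (a sketch: the GLM repo id and draft path are assumptions, and assisted generation needs the draft to share the target's tokenizer, which is exactly what the distill-and-retrain step is for):

```python
# Sketch of speculative decoding via transformers' assisted generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "THUDM/GLM-4-32B-0414"        # assumed HF repo id
draft_path = "path/to/12-head-glm-draft"  # hypothetical finished draft

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_path).to(target.device)

inputs = tok("Explain speculative decoding in one sentence.", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

llama.cpp's speculative decoding works the same way in principle: the small draft proposes tokens and the big model verifies them in a single batch.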
@alamios @jukofyork Thanks, looking forward to it!
There still seem to be problems with this model in llama.cpp:

https://github.com/ggml-org/llama.cpp/issues/12946

So I'll probably hold off until it's working 100%.