GLM 4 32B too?
Hey! I'll take a look at it, but don't expect it anytime soon; there are too many things in my backlog.
I will give it a try, but it will be a couple of days: I'm currently creating a 12-headed (~0.4B) distilled version of Qwen2.5-0.5B-Instruct that I can use as the base for future draft models (i.e. instead of having to trim to 12 heads and then retrain every time for a new model).
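For anyone curious, the "trim to 12 heads" step is basically weight surgery along these lines. This is only a rough sketch, assuming Qwen2.5-0.5B's layout (14 query heads of dim 64 with 2 GQA KV heads, so only `q_proj`/`o_proj` change); which heads to keep is a placeholder choice here, not my actual selection criterion, and whether the config's `head_dim` is honored on reload depends on your transformers version:

```python
# Rough sketch of trimming Qwen2.5-0.5B from 14 to 12 query heads.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
head_dim = 64
keep = torch.arange(12)  # placeholder: keep the first 12 query heads
idx = (keep[:, None] * head_dim + torch.arange(head_dim)).flatten()

for layer in model.model.layers:
    attn = layer.self_attn
    # drop rows of the query projection and the matching columns of the output projection
    attn.q_proj.weight.data = attn.q_proj.weight.data[idx].clone()
    attn.q_proj.bias.data = attn.q_proj.bias.data[idx].clone()  # Qwen2 uses Q/K/V biases
    attn.o_proj.weight.data = attn.o_proj.weight.data[:, idx].clone()

model.config.num_attention_heads = 12
model.config.head_dim = 64  # pin head_dim: 896 is no longer divisible by 12
model.save_pretrained("qwen2.5-12head")  # reload from here before retraining,
# so the attention modules are rebuilt with the new head count
```

The KV heads can stay untouched since 12 query heads still divide evenly into 2 KV groups; the retraining afterwards is what recovers the lost quality.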
Once I have this, I should (hopefully) be able to create draft models using far less data. I can generate around 0.5B tokens per day using 7 GPUs, and I hope to have at least 2B tokens for creating the distilled version.
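The distillation objective itself is nothing exotic: a standard soft cross-entropy against the teacher's logits, something like this sketch (the temperature and exact loss details are assumptions, whatever ends up working best):

```python
# Sketch of a standard logit-distillation loss: soft cross-entropy of the
# student against the teacher's softened distribution (forward KL up to a
# constant that doesn't depend on the student).
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Both inputs are (batch, seq, vocab); returns a scalar loss."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # scale by t^2 to keep gradient magnitudes comparable across temperatures
    return -(teacher_probs * student_logp).sum(dim=-1).mean() * (t * t)
```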
If it works, I'll try GLM-4-32B-0414 first, as it seems a good test case: there are no tiny models available to use as a draft for it.
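For context, "use as a draft" would look something like this on the transformers side (a sketch: the GLM repo id and draft path are assumptions, and assisted generation needs the draft to share the target's tokenizer, which is exactly what the distill-and-retrain step is for):

```python
# Sketch of speculative decoding via transformers' assisted generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "THUDM/GLM-4-32B-0414"        # assumed HF repo id
draft_path = "path/to/12-head-glm-draft"  # hypothetical finished draft

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_path).to(target.device)

inputs = tok("Explain speculative decoding in one sentence.", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

llama.cpp's speculative decoding works the same way in principle: the small draft proposes tokens and the big model verifies them in a single batch.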
@alamios @jukofyork Thanks, looking forward to it!
There still seem to be problems with this model in llama.cpp:

https://github.com/ggml-org/llama.cpp/issues/12946

So I'll probably hold off until it's working 100%.