Considering a distilled ~80B-parameter version
Hi Moonshot,
Have you ever considered creating a ~78B MoE distilled version of Kimi K2?
You might ask why 78B. It's pretty simple: a good Q4_0 quantisation brings it down to roughly 40 GB, which means anyone with two 20-24 GB GPUs could run the model at home in Q4 without any trouble and at very good speed. Anything much above ~80B parameters can no longer be run on two local GPUs...
The community has plenty of models in the 0-30B range for a single GPU, but unfortunately there is not a single model targeted at dual-GPU setups...
The 120B GPT-OSS doesn't fit into two 20 GB GPUs, and neither does the GLM Air model... If both were around 78B parameters, dual-GPU users could benefit a lot from running them in Q4.
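As a rough sanity check of that arithmetic, here is a small Python sketch (purely illustrative: it assumes the nominal Q4_0 rate of 18 bytes per 32-weight block, treats GLM Air as roughly 106B and GPT-OSS as 120B total parameters, and only budgets a fixed few GiB of headroom for KV cache and runtime buffers):

```python
# Rough back-of-the-envelope check: does an N-parameter model, quantised to
# Q4_0, fit into a given multi-GPU VRAM budget?
# Assumptions (mine): Q4_0 stores 32 weights per 18-byte block (~4.5 bits/weight);
# KV cache, activations and any higher-precision layers are only covered by a
# fixed headroom guess.

GIB = 1024 ** 3  # bytes per GiB

def q4_0_size_gib(n_params: float) -> float:
    """Approximate in-VRAM size of the Q4_0-quantised weights, in GiB."""
    bytes_per_weight = 18 / 32  # one 18-byte block holds 32 weights
    return n_params * bytes_per_weight / GIB

def fits(n_params: float, gpus: int, vram_gib: float, headroom_gib: float = 4.0) -> bool:
    """True if the quantised weights plus a little headroom fit into the
    combined VRAM of `gpus` cards."""
    return q4_0_size_gib(n_params) + headroom_gib <= gpus * vram_gib

# 78B (the requested size), ~106B (GLM-Air-class), 120B (GPT-OSS-class)
for n_params in (78e9, 106e9, 120e9):
    size = q4_0_size_gib(n_params)
    print(f"{n_params / 1e9:.0f}B -> ~{size:.1f} GiB at Q4_0, fits 2 x 24 GB: {fits(n_params, 2, 24)}")
```

With those assumptions, a ~78B model lands around 41 GiB and fits in 48 GB of combined VRAM, while the ~106B and 120B models do not.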
So would you consider creating a flash version that is approx. 80B parameters?
Have you ever thought about why GPT-OSS and GLM Air don't fit in 2 x 24GB GPUs? ^_^
why why tell me why
Would love to know why... So that NVIDIA can sell more big-memory GPUs for 10k USD? It can't be because of the new Ryzen AI PCs or the NVIDIA Sparks... neither of them has the compute... That's why two GPUs would be much more interesting, but your opinion would really be interesting...
Because a MoE model <100B is not powerful enough?
Wrong. Layer depth (how many layers) * (active parameters) == intelligence...
The additional parameters in a MoE are mostly just knowledge retrieval; that's why MoEs need more of them. But training ultra-high-layer-count LLMs is extremely difficult.
WIDTH (the model dimension) is what most labs scale, because it is much cheaper to train...
LAYERS (depth) is what they should scale...
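To make the width-versus-depth point a bit more concrete, here is a tiny sketch (my own illustration, assuming a plain dense transformer block costs roughly 12 * d_model^2 parameters and ignoring embeddings, MoE experts and routing) of how the same active-parameter budget can be spent on a deep-and-narrow or a shallow-and-wide model:

```python
# Illustration of the width-vs-depth trade-off argued above.
# Assumption (mine): one dense transformer block costs roughly
# 12 * d_model^2 parameters (4*d^2 attention + 8*d^2 for a 4x MLP);
# embeddings, MoE experts and routing are ignored, so only the shape
# of the trade-off is meaningful, not the exact numbers.

def block_params(d_model: int) -> int:
    """Approximate parameter count of one dense transformer block."""
    return 12 * d_model * d_model

def depth_for_budget(active_params: float, d_model: int) -> int:
    """How many layers a given active-parameter budget buys at width d_model."""
    return int(active_params // block_params(d_model))

ACTIVE_BUDGET = 12e9  # e.g. ~12B active (per-token) parameters

for d_model in (2048, 4096, 8192):
    layers = depth_for_budget(ACTIVE_BUDGET, d_model)
    print(f"d_model={d_model:5d} -> ~{layers:3d} layers at a {ACTIVE_BUDGET / 1e9:.0f}B active budget")
```

The point being: at a fixed active budget, halving the width buys roughly four times the depth, since each block's cost grows with the square of d_model.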