Negligible loss of PPL when using only 6 of 8 experts

#7
by whoisjeremylam

I'm running the IQ2_KS quant and thought I'd try playing with the -ser parameter.

Surprisingly, specifying -ser 7,1 and -ser 6,1 seems to slightly improve the perplexity, but let's just say it is effectively unchanged.

The parameters I held constant across runs were --ctx-size 512 --ubatch-size 512 -f wikitext-2-raw/wiki.test.raw --seed 1337.

| Setting | Final PPL estimate |
|---|---|
| baseline (8 experts) | 3.1894 +/- 0.01625 |
| -ser 7,1 | 3.1665 +/- 0.01600 |
| -ser 6,1 | 3.1756 +/- 0.01596 |
| -ser 5,1 | 3.2252 +/- 0.01614 |
| -ser 4,1 | 3.4532 +/- 0.01744 |
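
For reference, a minimal sketch of the invocation behind these runs, assuming ik_llama.cpp's llama-perplexity binary and a placeholder model path; only the -ser value changed between runs:

```bash
# Sketch of the perplexity runs above (assumes ik_llama.cpp's llama-perplexity;
# the model path is a placeholder). Only the -ser value changed between runs.
./build/bin/llama-perplexity \
  -m /models/Kimi-K2-IQ2_KS.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --ctx-size 512 --ubatch-size 512 --seed 1337 \
  -ser 6,1
```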

Running with -ser 6,1 improves token generation on my rig by ~19% - YMMV!

Oh nice, I'd not taken the time lately to check out the effects of -ser N,1 on perplexity. Good to know the effect seems minimal, though it's curious that the perplexity "improved" with it, which could indicate something else is going on. But it's another tool in the toolbox, and almost 20% faster TG is definitely nice!

Interesting find, I'll test it out on my rig. I use my local Kimi every day and am excited for tomorrow's model update. I'll update this post with TG results when I get home.


Update: I remoted into my server and edited my conf to restart my Kimi instance with the -ser 6,1 parameter. Before, I was getting a solid 16 t/s on a dry run; now I'm getting a good 18.6 t/s.
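
For anyone wanting to try the same, a rough sketch of the relaunch, assuming ik_llama.cpp's llama-server with a placeholder model path and context size (your offload/thread settings will differ):

```bash
# Minimal sketch of relaunching the server with expert reduction enabled
# (assumes ik_llama.cpp's llama-server; model path and context size are placeholders).
./build/bin/llama-server \
  -m /models/Kimi-K2-IQ2_KS.gguf \
  --ctx-size 32768 \
  -ser 6,1
```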

A welcome increase, but I can't help but wonder if this hinders overall model knowledge. By disabling two experts we are essentially lobotomizing two parts of the model's conjoined brain cells lmao. I'll run it like this for a few days and see if I notice any big knowledge degradation / hallucinations.

Update 2: At least for my primary use case (creative / health-science workflows), Kimi seems to be functioning as before...

Final update: It's interesting, even at IQ2_XXS with -ser 4,1 I'm getting good speeds (21 t/s) and pretty coherent answers. It passed a bunch of workplace-specific knowledge tests with flying colors. I'd say this is a decent way to speed up models that are otherwise on the slower side, at the cost of accuracy and some knowledge. Fun experiment! I'll keep the model at defaults, though, just so I have all the experts available to me.

I think with a model like Kimi this method works without major noticeable degradation solely because Kimi has 1 trillion parameters; it can afford to lop a few experts off?

> Oh nice, I'd not taken the time lately to check out the effects of -ser N,1 on perplexity. Good to know the effect seems minimal, though it's curious that the perplexity "improved" with it, which could indicate something else is going on. But it's another tool in the toolbox, and almost 20% faster TG is definitely nice!

I guess the perplexity is only representative of wikitext-2-raw. One could argue that wikitext-2-raw isn't a wide enough corpus, especially for maths and coding.
