Any benchmarks to support improvements?

#1
by vvekthkr - opened

Do you have any benchmark results to support improvements?

Hey,

Both @nightmedia and I are programmers by trade.
We eval the model's base operations first, to ensure no issues have been introduced, then eval performance vs. the base model.
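
For anyone who wants to reproduce this kind of base-vs-Brainstorm comparison, here is a minimal sketch using EleutherAI's lm-evaluation-harness (`pip install lm-eval`). The model ids are placeholders, not the actual repos; the tasks are the ones reported in the table below:

```python
# Minimal sketch: run the same task suite against the base model and the
# Brainstorm version, then compare scores. Model ids are hypothetical.
import lm_eval

TASKS = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "winogrande", "openbookqa", "piqa"]

for model_id in ["org/huihui-ablated-moe", "org/totalrecall-brainstorm"]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
    )
    print(model_id)
    for task, metrics in results["results"].items():
        print(f"  {task}: {metrics}")
```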

In testing we found the 42B has better long-context understanding and long code generation, and, most importantly, the model "thinks outside the box" for solutions.
The 53-55B models are even stronger.
Generally better coders.

The Brainstorm adapter has been used in over 100 models to date.
I created this adapter with the primary focus being "outside the box" thinking.

Here is an excerpt from a comparison between a Huihui ablated MoE and the Brainstorming version of it:

Improvements from TotalRecall (vs Huihui)

TotalRecall is a brainstorming-enhanced version of Huihui (i.e., built on top of Huihui). The impact is small but meaningful across tasks:

| Metric | Huihui (q6) | TotalRecall (q6) | Change (Δ) |
|---|---|---|---|
| arc_challenge | 0.378 | 0.387 | +0.009 |
| arc_easy | 0.434 | 0.447 | +0.013 |
| boolq | 0.434 | 0.447 | +0.013 |
| hellaswag | 0.618 | 0.648 | +0.030 |
| winogrande | 0.634 | 0.636 | +0.002 |
| openbookqa | 0.400 | 0.380 | -0.020 |
| piqa | 0.765 | 0.768 | +0.003 |
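
The Δ column is just the per-task difference between the two q6 runs; a quick Python sketch to double-check it, with the values copied verbatim from the table above:

```python
# Recompute the Δ column from the table above.
huihui = {"arc_challenge": 0.378, "arc_easy": 0.434, "boolq": 0.434,
          "hellaswag": 0.618, "winogrande": 0.634, "openbookqa": 0.400,
          "piqa": 0.765}
totalrecall = {"arc_challenge": 0.387, "arc_easy": 0.447, "boolq": 0.447,
               "hellaswag": 0.648, "winogrande": 0.636, "openbookqa": 0.380,
               "piqa": 0.768}
for task, base in huihui.items():
    print(f"{task}: {totalrecall[task] - base:+.3f}")
```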

Strongest gains:

- hellaswag: +0.030 (most significant increase) → brainstorming likely improved creative text generation (e.g., consistent story continuation with multiple plausible outcomes).
- arc_easy/boolq: +0.013 each → Huihui's ablation was "fixed" for simpler reasoning tasks.

-- reviewed by qwen3-jan-v1-256k-ctx-6b-brainstorm20x-qx6-mlx

I am running benchmarks on newer models, and I will come back to this one when I get the time to evaluate how much it adds; likely more on an undamaged model.

Appreciate the effort put into playing with these models.
