Any benchmarks to support improvements?
Do you have any benchmark results to support improvements?
Hey,
Both @nightmedia and I are programmers by trade.
We evaluate the model's base operations to ensure no issues have been introduced, then evaluate its performance versus the base model.
In testing we found the 42B has better long-context understanding and long code generation, and most importantly, the model "thinks outside the box" for solutions.
The 53-55B versions are even stronger, and generally better coders.
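For illustration, a basic version of that first sanity check (a quick generation smoke test before running the full benchmark suite) could look like the sketch below, using the Hugging Face transformers library; the model id is a placeholder, not one of the actual repos:

```python
# Quick smoke test: confirm the modified model still generates coherent output
# before running full benchmarks. The model id is a hypothetical placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/brainstorm-42b"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompts = [
    "Write a Python function that merges two sorted lists.",
    "Explain the difference between a process and a thread.",
]
for p in prompts:
    inputs = tok(p, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
```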
The Brainstorm adapter has been used in over 100 models to date.
I created this adapter with the primary focus being "outside the box" thinking.
Here is an excerpt from a comparison between a Huihui ablated MoE and the Brainstorming version of it:
Improvements from TotalRecall (vs Huihui)
TotalRecall is a brainstorming-enhanced version of Huihui (i.e., built on top of Huihui). The impact is small but meaningful across tasks:
| Metric | Huihui (q6) | TotalRecall (q6) | Change (Δ) |
|---|---|---|---|
| arc_challenge | 0.378 | 0.387 | +0.009 |
| arc_easy | 0.434 | 0.447 | +0.013 |
| boolq | 0.434 | 0.447 | +0.013 |
| hellaswag | 0.618 | 0.648 | +0.030 |
| winogrande | 0.634 | 0.636 | +0.002 |
| openbookqa | 0.400 | 0.380 | -0.020 |
| piqa | 0.765 | 0.768 | +0.003 |
Strongest gains:
- hellaswag: +0.030 (most significant increase) → brainstorming likely improved creative text generation (e.g., consistent story continuation with multiple plausible outcomes).
- arc_easy/boolq: +0.013 each → Huihui's ablation was "fixed" for simpler reasoning tasks.
--reviewed by qwen3-jan-v1-256k-ctx-6b-brainstorm20x-qx6-mlx
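These are standard lm-evaluation-harness tasks, so a comparison like the table above can be reproduced along these lines; this is only a sketch using the lm-eval Python API, with placeholder model paths, and the exact metric key names can differ by harness version and task:

```python
# Sketch: per-task accuracy comparison between a base model and its Brainstorm
# variant via lm-evaluation-harness. Model paths are hypothetical placeholders.
import lm_eval

TASKS = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "winogrande", "openbookqa", "piqa"]

def run(model_path: str) -> dict:
    out = lm_eval.simple_evaluate(
        model="hf",                              # HuggingFace backend
        model_args=f"pretrained={model_path}",
        tasks=TASKS,
        batch_size=8,
    )
    # Metric keys vary by task/version (e.g. "acc,none" vs "acc_norm,none")
    return {t: out["results"][t].get("acc,none") for t in TASKS}

base = run("your-org/base-model")            # placeholder path
variant = run("your-org/brainstorm-model")   # placeholder path

for t in TASKS:
    if base[t] is not None and variant[t] is not None:
        print(f"{t:15s} {base[t]:.3f} -> {variant[t]:.3f} (change {variant[t] - base[t]:+.3f})")
```

The printed deltas correspond to the Change (Δ) column above.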
I am running benchmarks on newer models, and I will come back to this one when I get the time to evaluate how much it adds; likely more on an undamaged model.
Appreciate the effort you put into playing with these models.