Why group size 32?

#1 by bibproj - opened

@cs2764
Hi Larry

I noticed you used a group size of 32 here. As far as I know, the default is 64, and I don't actually know what the group size does.
So, just out of curiosity: what is the reason for using a group size of 32?
From your model card it seems that it affects the average bits per weight. How does that work?

That's an excellent and insightful question. You're right: the default group size in many tools is 64, and changing it to 32 is a deliberate choice that prioritizes model accuracy.

Here’s a brief explanation:

Reason for Group Size 32 (Higher Accuracy): In quantization, we compress the model's weights by grouping them into small blocks and applying a unique set of compression parameters (a scale and zero-point) to each block. Using a smaller group size of 32 means we create more, smaller blocks. This allows the compression to be much more granular and tailored to the specific values within each tiny group, which significantly reduces the loss of information (quantization error). This is especially effective at handling outlier weights that can degrade the quality of larger groups. The result is a more accurate model that performs closer to its original, uncompressed version.
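To make that concrete, here is a minimal numpy sketch of group-wise quantization using a per-group min/max fit, where the group minimum plays the role of the zero-point. This is an illustration of the idea, not the exact scheme used for this model. Shrinking the group size from 64 to 32 measurably lowers the reconstruction error, especially once a few outlier weights are present:

```python
# A minimal sketch of group-wise 4-bit quantization (illustrative only, not
# the exact code behind this model): weights are split into groups, and each
# group gets its own scale and zero-point, so smaller groups adapt better to
# local outliers.
import numpy as np

def quantize_grouped(weights: np.ndarray, group_size: int = 32, bits: int = 4):
    """Quantize a 1-D weight vector in groups of `group_size`."""
    levels = 2**bits - 1                      # 15 levels for 4-bit codes
    groups = weights.reshape(-1, group_size)  # one row per group
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels          # per-group scale
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    q = np.round((groups - w_min) / scale)    # integer codes in [0, levels]
    return q.astype(np.uint8), scale, w_min   # codes + per-group metadata

def dequantize_grouped(q, scale, w_min):
    return (q.astype(np.float32) * scale + w_min).reshape(-1)

# Smaller groups -> lower reconstruction error, at the cost of more metadata.
rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w[::97] *= 8                                  # inject a few outlier weights
for gs in (64, 32):
    q, s, z = quantize_grouped(w, group_size=gs)
    err = np.abs(dequantize_grouped(w if False else q, s, z) - w).mean() if False else np.abs(dequantize_grouped(q, s, z) - w).mean()
    print(f"group size {gs}: mean abs error = {err:.5f}")
```

Running this, the group-size-32 pass reports a noticeably smaller mean error than the group-size-64 pass, because each outlier only stretches the dynamic range of its own 32-weight group rather than a 64-weight one.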

Effect on Average Bits Per Weight (The Trade-off): Your observation is spot on. The trade-off for this higher accuracy is a slight increase in the model's final size, which is reflected in the average bits per weight (BPW). Each group must store its own compression parameters (metadata). By halving the group size from 64 to 32, we double the number of groups, and therefore double the amount of metadata we need to save. This overhead increases the effective bits per weight from approximately 4.5 BPW (with a group size of 64) to 5.0 BPW (with a group size of 32).
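The arithmetic works out if each group stores about 32 bits of metadata, e.g. an fp16 scale plus an fp16 zero-point; that figure is an assumption here that reproduces the numbers quoted above, since the exact metadata layout depends on the quantization format. A quick back-of-the-envelope check:

```python
# Effective BPW = payload bits + metadata bits amortized over the group.
# The metadata_bits=32 default (fp16 scale + fp16 zero-point) is an assumption
# chosen to match the figures in the text; real formats vary slightly.
def effective_bpw(bits: int, group_size: int, metadata_bits: int = 32) -> float:
    return bits + metadata_bits / group_size

print(effective_bpw(4, group_size=64))  # -> 4.5 BPW
print(effective_bpw(4, group_size=32))  # -> 5.0 BPW
```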

In short, we chose a group size of 32 to achieve higher fidelity and performance, accepting a minor increase in model size as a worthwhile trade-off. This is a common practice for producing high-quality quantized models where performance is the top priority.

For a complete technical breakdown with performance benchmarks and detailed calculations, you can refer to this in-depth analysis: https://docs.google.com/document/d/1W_IsQnu0UJZnPYnWl5fpirvzsLLLLvoodWgLyWxxV5U/edit?usp=sharing

bibproj changed discussion status to closed
