Dynamic quantization
Hi, I noticed that some of the layers are stored not in the target quantization format, but in a higher-precision one.
Taking the file "wan2.1-t2v-14b-Q4_K_S.gguf" as an example:
- the main dtype is fp32
- some blocks are in Q4
- some blocks are in Q5
For example:
blocks.0.ffn.0.weight has Q4_K quantization
blocks.0.ffn.2.weight has Q5_K
By what logic did you decide which blocks to quantize in Q4 and which in a different format?
(fp32 is not in question, that part is clear)
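For reference, the per-tensor types above can be listed with the `gguf` Python package, roughly like this (the reader API may differ slightly between versions):

```python
# List each tensor's name and quantization type in the file.
# Rough sketch; the gguf reader API may vary between package versions.
from gguf import GGUFReader

reader = GGUFReader("wan2.1-t2v-14b-Q4_K_S.gguf")
for tensor in reader.tensors:
    # tensor.tensor_type is a GGMLQuantizationType enum (F32, Q4_K, Q5_K, ...)
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```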
Hi!
The rules are simplified versions of the llama.cpp ones here, with the key names adapted to image models:
https://github.com/ggml-org/llama.cpp/blob/958367bf530d943a902afa1ce1c342476098576b/src/llama.cpp#L18111
https://github.com/ggml-org/llama.cpp/blob/958367bf530d943a902afa1ce1c342476098576b/src/llama.cpp#L18168
We don't have the logic for keeping the first few blocks in higher precision (since we have two sets of blocks and no simple naming/count in the metadata), so most of the time it's the `attn_v` (or sometimes `attn_qkv`) weights + the `ffn_down` weights that are varied.
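As a rough illustration of that rule (not the actual conversion code; the key names and the one-step bump table below are simplifying assumptions):

```python
# Illustrative only: bump attn_v / attn_qkv / ffn_down one k-quant step above
# the target type, leave everything else at the target. The real conversion
# code and the key-name mapping for each model differ from this sketch.
BUMP = {"Q3_K": "Q4_K", "Q4_K": "Q5_K", "Q5_K": "Q6_K"}  # assumed one-step table
BUMPED_KEYS = ("attn_v", "attn_qkv", "ffn_down")

def pick_quant_type(tensor_name: str, target: str) -> str:
    if any(key in tensor_name for key in BUMPED_KEYS):
        return BUMP.get(target, target)
    return target

# e.g. a block's second FFN weight, if it maps to ffn_down, would come out one
# step above the Q4_K target, matching the Q5_K observed in the file above.
```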
`attn_v` is kept at a higher precision because it comes last in the attention (i.e. `(Q x K) x V`), while the other operand of the matmul at that point is a product of two weights (Q, K) after scaling/softmax.
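A toy single-head attention in numpy, just to make that data flow concrete (not the model's actual implementation):

```python
import numpy as np

def toy_attention(x, W_q, W_k, W_v):
    # Q and K only reach the output through the scaled scores and the softmax...
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # ...while V (the x @ W_v projection) is the other operand of the final
    # matmul, so W_v's quantization error feeds the output directly.
    return probs @ V
```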
The logic for `ffn_down` is that both `ffn_up` and `ffn_gate` feed into it, i.e. its output is what gets added back to the hidden state.
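And similarly for the FFN, a minimal sketch assuming a SiLU-gated MLP (the actual activation/wiring in the model may differ):

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def toy_ffn(x, W_up, W_gate, W_down):
    h = silu(x @ W_gate) * (x @ W_up)  # both ffn_gate and ffn_up feed into...
    return h @ W_down                  # ...ffn_down, whose output is what gets
                                       # added back to the hidden state (residual)
```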
At least that's how I understood it. I did some quick A/B tests originally and saw some improvements, but there's no testing backing this method other than llama.cpp using it and the theory behind it being relatively sound.