Dynamic quantization

#9 · by Deniaud

Hi, I noticed that you store some of the layers not in the target quantization format, but in a higher-precision one.

Taking the file "wan2.1-t2v-14b-Q4_K_S.gguf" as an example:

  • the main dtype is FP32
  • some of the blocks are in Q4
  • some of the blocks are in Q5

For example:
blocks.0.ffn.0.weight has Q4_K quantization
blocks.0.ffn.2.weight has Q5_K
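
(For reference, a minimal sketch of how the per-tensor types can be listed with the gguf Python package; attribute names may differ slightly between package versions.)

```python
# Minimal sketch: list the quantization type of each tensor in a GGUF file.
# Assumes the `gguf` Python package (pip install gguf).
from gguf import GGUFReader, GGMLQuantizationType

reader = GGUFReader("wan2.1-t2v-14b-Q4_K_S.gguf")
for tensor in reader.tensors:
    qtype = GGMLQuantizationType(tensor.tensor_type).name
    print(f"{tensor.name}: {qtype}")
```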

By what logic did you decide which blocks to quantize in Q4 and which in a different format?
(FP32 is not in question; everything is clear with it.)

Hi!
The rules are simplified versions of the llama.cpp ones here, with the key names adapted to image models:
https://github.com/ggml-org/llama.cpp/blob/958367bf530d943a902afa1ce1c342476098576b/src/llama.cpp#L18111
https://github.com/ggml-org/llama.cpp/blob/958367bf530d943a902afa1ce1c342476098576b/src/llama.cpp#L18168

We don't have the logic for keeping the first few blocks in higher precision (since we have two sets of blocks and no simple naming/count in the metadata), so most of the time it's the attn_v (or sometimes attn_qkv) weights and the ffn_down weights that get bumped to a higher quant type.
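
Roughly, the selection looks like the sketch below. The key-name patterns, the bump table, and pick_qtype() are illustrative assumptions for this explanation, not the actual conversion script; the real rules follow the linked llama.cpp code with key names adapted to image/video models (e.g. Wan uses ffn.2 for the down projection).

```python
# Sketch only: not the actual conversion script.

# Tensors matching these substrings get one quant level above the target.
BUMP_PATTERNS = ("attn_v.weight", "attn_qkv.weight", "ffn_down.weight")

# Target type -> (default type, bumped type); illustrative subset.
QUANT_TABLE = {
    "Q4_K_S": ("Q4_K", "Q5_K"),
    "Q5_K_S": ("Q5_K", "Q6_K"),
}

def pick_qtype(tensor_name: str, target: str) -> str:
    """Choose the quantization type for a single tensor."""
    default, bumped = QUANT_TABLE[target]
    if any(pattern in tensor_name for pattern in BUMP_PATTERNS):
        return bumped
    return default

# In a Q4_K_S file, the FFN down projection ends up one level higher:
print(pick_qtype("blocks.0.ffn_down.weight", "Q4_K_S"))  # Q5_K
print(pick_qtype("blocks.0.attn_q.weight", "Q4_K_S"))    # Q4_K
```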

attn_v is kept at a higher precision because it comes last in the attention (i.e. (Q x K) x V), and the other operand of that matmul is itself the product of two quantized weights (Q and K) after scaling/softmax.
The logic for ffn_down is that both ffn_up and ffn_gate feed into it, i.e. its output is what gets added back to the hidden state.
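
In data-flow terms, a toy transformer block (shapes and names are illustrative, not taken from any specific model) showing where those two weights sit:

```python
# Toy block showing where attn_v and ffn_down sit in the data flow.
# Purely illustrative; dimensions and layer names are made up.
import torch
import torch.nn.functional as F

d = 64
x = torch.randn(1, 16, d)                      # hidden state (batch, tokens, dim)
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
w_up, w_down = torch.randn(d, 4 * d), torch.randn(4 * d, d)

# Attention: V is the last weight applied; its matmul partner is
# softmax(Q @ K^T), itself built from two other quantized weights.
q, k, v = x @ wq, x @ wk, x @ wv
attn_out = F.softmax(q @ k.transpose(-1, -2) / d**0.5, dim=-1) @ v
x = x + attn_out                               # residual add

# FFN: the up (and gate, omitted here) projections feed into the down
# projection, and its output is what gets added back to the hidden state.
x = x + F.gelu(x @ w_up) @ w_down
```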

At least that's how I understood it. I did some quick A/B tests originally and saw some improvements, but there's no testing backing this method other than llama.cpp using it and the theory behind it being relatively sound.
