Dynamic quantization
Hi, I noticed that some of the layers are stored not in the target quantization format, but in a higher-precision one.
Taking the file "wan2.1-t2v-14b-Q4_K_S.gguf" as an example:
- the main dtype is fp32
- some blocks are in Q4
- some blocks are in Q5
For example:
blocks.0.ffn.0.weight has Q4_K quantization
blocks.0.ffn.2.weight has Q5_K
By what logic did you decide which blocks to quantize in Q4 and which in a different format?
(fp32 is not in question, that part is clear)
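For reference, the per-tensor types above can be listed with the `gguf` Python package, roughly like this (the reader API may differ slightly between versions):

```python
# List each tensor's name and quantization type in the file.
# Rough sketch; the gguf reader API may vary between package versions.
from gguf import GGUFReader

reader = GGUFReader("wan2.1-t2v-14b-Q4_K_S.gguf")
for tensor in reader.tensors:
    # tensor.tensor_type is a GGMLQuantizationType enum (F32, Q4_K, Q5_K, ...)
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```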
Hi!
The rules are simplified versions of the llama.cpp ones here, with the key names adapted to image models:
https://github.com/ggml-org/llama.cpp/blob/958367bf530d943a902afa1ce1c342476098576b/src/llama.cpp#L18111
https://github.com/ggml-org/llama.cpp/blob/958367bf530d943a902afa1ce1c342476098576b/src/llama.cpp#L18168
We don't have the logic for keeping the first few blocks in higher precision (since we have two sets of blocks and no simple naming/count in the metadata), so most of the time it's the `attn_v` (or sometimes `attn_qkv`) weights + the `ffn_down` weights that are varied.
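As a rough illustration of that rule (not the actual conversion code; the key names and the one-step bump table below are simplifying assumptions):

```python
# Illustrative only: bump attn_v / attn_qkv / ffn_down one k-quant step above
# the target type, leave everything else at the target. The real conversion
# code and the key-name mapping for each model differ from this sketch.
BUMP = {"Q3_K": "Q4_K", "Q4_K": "Q5_K", "Q5_K": "Q6_K"}  # assumed one-step table
BUMPED_KEYS = ("attn_v", "attn_qkv", "ffn_down")

def pick_quant_type(tensor_name: str, target: str) -> str:
    if any(key in tensor_name for key in BUMPED_KEYS):
        return BUMP.get(target, target)
    return target

# e.g. a block's second FFN weight, if it maps to ffn_down, would come out one
# step above the Q4_K target, matching the Q5_K observed in the file above.
```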
`attn_v` is kept at a higher precision because it comes last in the attention (i.e. `(Q x K) x V`), while the other operand of the matmul at that point is a product of two weights (Q, K) after scaling/softmax.
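A toy single-head attention in numpy, just to make that data flow concrete (not the model's actual implementation):

```python
import numpy as np

def toy_attention(x, W_q, W_k, W_v):
    # Q and K only reach the output through the scaled scores and the softmax...
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # ...while V (the x @ W_v projection) is the other operand of the final
    # matmul, so W_v's quantization error feeds the output directly.
    return probs @ V
```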
The logic for `ffn_down` is that both `ffn_up` and `ffn_gate` feed into it, i.e. its output is what gets added back to the hidden state.
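And similarly for the FFN, a minimal sketch assuming a SiLU-gated MLP (the actual activation/wiring in the model may differ):

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def toy_ffn(x, W_up, W_gate, W_down):
    h = silu(x @ W_gate) * (x @ W_up)  # both ffn_gate and ffn_up feed into...
    return h @ W_down                  # ...ffn_down, whose output is what gets
                                       # added back to the hidden state (residual)
```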
At least that's how I understood it. I did some quick A/B tests originally and saw some improvements, but there's no testing backing this method other than llama.cpp using it and the theory behind it being relatively sound.