Squeezing out tensor bits?
I have been tinkering with quantization and pruning to reduce model sizes. So far I've had modest success, producing versions that are on average 8% smaller with negligible loss of quality, and I think further reductions in the 10-15% range are realistic. Along the way, though, I've come across a behaviour I wasn't expecting!
Part of the process I'm following consists of quantizing the embedding and output layers aggressively. Since the embedding layer is more about lookup than complex computation, the relative distances between embedding vectors are usually preserved well enough, making this layer fairly robust to quantization. So far, so good.
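To make the "relative distances are preserved" point concrete, here's a quick toy check of the kind of measurement I mean, with a random matrix standing in for the real embedding table and a crude block quantizer standing in for the actual k-quants (the numbers are only illustrative; with the real model you'd use the actual tensor and the real quantization kernels):

```python
import numpy as np

def fake_block_quantize(w, bits=2, block=32):
    """Very rough stand-in for k-quant style block quantization:
    each run of `block` consecutive weights shares one scale and offset."""
    rows, cols = w.shape
    blocks = w.reshape(rows, cols // block, block)
    lo = blocks.min(axis=2, keepdims=True)
    hi = blocks.max(axis=2, keepdims=True)
    scale = np.maximum((hi - lo) / (2 ** bits - 1), 1e-12)
    codes = np.round((blocks - lo) / scale)          # integer codes 0 .. 2^bits - 1
    return (codes * scale + lo).reshape(rows, cols)  # dequantized weights

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 256)).astype(np.float32)  # stand-in for the embedding table
emb_q = fake_block_quantize(emb, bits=2)

def normalize(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

sim_fp = normalize(emb) @ normalize(emb).T     # pairwise cosine similarities, full precision
sim_q = normalize(emb_q) @ normalize(emb_q).T  # same after simulated 2-bit quantization

drift = np.abs(sim_fp - sim_q).mean()
np.fill_diagonal(sim_fp, -np.inf)              # ignore self-similarity for nearest neighbours
np.fill_diagonal(sim_q, -np.inf)
agreement = (sim_fp.argmax(axis=1) == sim_q.argmax(axis=1)).mean()
print(f"mean |delta cosine|: {drift:.4f}, nearest-neighbour agreement: {agreement:.2%}")
```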
The output layer, on the other hand, maps the final hidden state to the vocabulary logits, so even small changes to its weights could shift those logits, producing a different probability distribution over the vocabulary and ultimately incorrect word predictions. Or so I thought.
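This is roughly the failure mode I was expecting, sketched as a toy check: perturb the output projection, then see whether the top-1 token flips and how far the distribution moves. The tensors and the additive noise below are random stand-ins, not the real weights or the real Q2_K rounding; with the actual model you'd use the dequantized output tensor and hidden states captured from a forward pass:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden = 32_000, 512                  # toy sizes, not the real model's

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random stand-ins for the output projection and a final hidden state.
W_out = rng.normal(size=(vocab, hidden)).astype(np.float32) / np.sqrt(hidden)
h = rng.normal(size=hidden).astype(np.float32)

logits_fp = W_out @ h
noise = rng.normal(scale=0.02, size=W_out.shape).astype(np.float32)  # stand-in for quantization error
logits_q = (W_out + noise) @ h

p_fp, p_q = softmax(logits_fp), softmax(logits_q)
print("top-1 token unchanged:", logits_fp.argmax() == logits_q.argmax())
print("KL(p_fp || p_q):", float((p_fp * np.log(p_fp / p_q)).sum()))
```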
Surprisingly, I'm finding that even at Q2_K the loss of overall capability is minimal. Was this to be expected, or am I missing something?
I have published a version with all the test results if you want to give it a try: eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF
I'll upload other models as time allows.
Any ideas / clarifications / suggestions are very much welcomed!