lemonilia committed
Commit 3a9e559 · verified · 1 Parent(s): 287cf4e

Update README.md

Files changed (1)
  1. README.md (+55 -6)

README.md CHANGED

base_model:
  - mistralai/Mistral-Small-24B-Instruct-2501
---

Following suggestions from section 6.2 of the [Llama-3 paper](https://arxiv.org/abs/2407.21783) and discussions elsewhere, here are experimental extra-large GGUF quantizations of vanilla [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), where:

- `token_embd` and `output` are in FP16 precision;
- All attention layers are in FP16 precision;
- The first and final transformer layers are entirely in FP16 precision;
- The feed-forward network (FFN) layers in the intermediate transformer blocks are in _uniformly **low**_ precision.

For the same total model size, the computed perplexity values _do not appear to be better_ than those of smaller standard GGUF quantizations, but this quantization scheme might still help with real-world long-context performance and complex tasks while keeping the size contained. Your mileage may vary.

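As a quick sanity check of this tensor layout, the types stored in a downloaded file can be listed with the `gguf-dump` script from llama.cpp's `gguf` Python package. This is only a hedged sketch: the file name is a placeholder and the exact output format may differ between versions.

```bash
# Inspect tensor names and types (placeholder file name).
# token_embd/output, the attention tensors and blk.0/blk.39 should report F16,
# while the remaining FFN tensors should report the low-precision type (e.g. Q4_K).
pip install gguf
gguf-dump Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    | grep -E 'token_embd|output\.weight|attn_|ffn_|blk\.0\.|blk\.39\.'
```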

## Perplexity
Computed using `llama-perplexity` on a custom text file over 4 chunks, with n_ctx=32768, batch_size=2048 and n_seq=1 (an example invocation is sketched below the table).

| Quantization  | Size (GiB) | Perplexity | ΔPPL (vs. BF16) | Error (±) |
|:--------------|-----------:|-----------:|----------------:|----------:|
| BF16          | 43.9       | 7.2512     | 0.0000          | 0.06239   |
| Q6_K          | 18.0       | 7.2683     | 0.0171          | 0.06249   |
| **Q4_K_XXXL** | **18.3**   | **7.2884** | **0.0372**      | 0.06268   |
| Q4_K_M        | 13.3       | 7.3155     | 0.0643          | 0.06295   |
| **Q3_K_XXXL** | **15.9**   | **7.4084** | **0.1572**      | 0.06409   |
| Q3_K_M        | 10.7       | 7.4252     | 0.1740          | 0.06451   |

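For reference, a minimal sketch of the kind of `llama-perplexity` invocation behind these numbers, assuming a local llama.cpp build; the model and text file names are placeholders, since the custom evaluation text is not published here.

```bash
# Hypothetical example: 4 chunks at n_ctx=32768 with batch size 2048.
./llama-perplexity \
    -m Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    -f custom-eval-text.txt \
    -c 32768 \
    -b 2048 \
    --chunks 4
```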

## Method
These quantizations were made by naively modifying `llama-quant.cpp` in llama.cpp (specifically, the function `ggml_type llama_tensor_get_type()`), recompiling the project and invoking `llama-quantize` afterwards. In summary, I forced the F16 type for the attention tensors and for the entire first and last transformer layers of Mistral-Small-24B, and forced the Q4_K type for the FFN layers, which would otherwise be in mixed Q4_K and Q6_K precision. `token_embd` and `output` could be set to F16 precision directly with `llama-quantize`, using the flags `--output-tensor-type F16 --token-embedding-type F16` (see the example invocation below).

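A sketch of what the corresponding `llama-quantize` call might look like for the Q4_K_XXXL file, assuming the modified build described above; the input and output file names are placeholders.

```bash
# Hypothetical example: Q4_K_M is passed as the target ftype so that the modified
# LLAMA_FTYPE_MOSTLY_Q4_K_M branches in llama_tensor_get_type() take effect,
# while token_embd and output are forced to F16 via the two flags.
./llama-quantize \
    --output-tensor-type F16 \
    --token-embedding-type F16 \
    Mistral-Small-24B-Instruct-2501-BF16.gguf \
    Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    Q4_K_M
```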

Some pseudocode with the modifications to `llama-quant.cpp` in the case of the Q4_K_XXXL quantization:

```cpp
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);
    ...
    // Keep the entire first and last transformer layers (blk.0 / blk.39 for Mistral-Small-24B) in F16
    } else if (name.find("blk.0") != std::string::npos) {
        new_type = GGML_TYPE_F16;
    } else if (name.find("blk.39") != std::string::npos) {
        new_type = GGML_TYPE_F16;
    ...
    // Attention tensors in F16
    } else if (name.find("attn_k.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    // FFN down-projection uniformly in Q4_K instead of the default Q4_K/Q6_K mix
    } else if (name.find("ffn_down") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_Q4_K;
        }
    } else if (name.find("attn_output.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_qkv.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_F16;
    }
    ...
}
```