---
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
---

Following suggestions from Section 6.2 of the [Llama-3 paper](https://arxiv.org/abs/2407.21783) and discussions elsewhere, here are experimental extra-large GGUF quantizations of vanilla [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), where:

- `token_embd` and `output` are in FP16 precision;
- All attention layers are in FP16 precision;
- The entirety of the first and final transformer layers is in FP16 precision;
- Intermediate feed-forward network (FFN) layers are in _uniformly **low**_ precision.

For the same total model size, the computed perplexity values _do not appear to be better_ than those of smaller standard GGUF quantizations, but this quantization scheme may still help with real-world long-context performance and complex tasks while keeping the size limited. Your mileage may vary.

## Perplexity

Computed using `llama-perplexity` on a custom text file over 4 chunks, with n_ctx=32768, batch_size=2048 and n_seq=1.
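
For reference, a run with these parameters looks roughly like the sketch below; the model and text file names are placeholders, not the actual files used for the measurements.

```bash
# Illustrative only: placeholder file names; the flags mirror the
# parameters above (4 chunks, n_ctx=32768, batch_size=2048).
./llama-perplexity \
    -m Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    -f custom-text.txt \
    --chunks 4 \
    -c 32768 \
    -b 2048
```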

| Quantization  | Size (GiB) | Perplexity | ΔP         | Error (+/-) |
|:--------------|-----------:|-----------:|-----------:|------------:|
| BF16          | 43.9       | 7.2512     | 0.0000     | 0.06239     |
| Q6_K          | 18.0       | 7.2683     | 0.0171     | 0.06249     |
| **Q4_K_XXXL** | **18.3**   | **7.2884** | **0.0372** | 0.06268     |
| Q4_K_M        | 13.3       | 7.3155     | 0.0643     | 0.06295     |
| **Q3_K_XXXL** | **15.9**   | **7.4084** | **0.1572** | 0.06409     |
| Q3_K_M        | 10.7       | 7.4252     | 0.1740     | 0.06451     |

## Method

These quantizations were made by naively modifying `llama-quant.cpp` in llama.cpp (specifically, the function `llama_tensor_get_type()`), recompiling the project and then invoking `llama-quantize`. In summary, I forced the F16 type for the attention layers and for the first and last transformer layers of Mistral-Small-24B, and forced the Q4_K type for the FFN layers, which would otherwise be in mixed Q4_K and Q6_K precision. `token_embd` and `output` can be set to F16 precision with `llama-quantize` using the flags `--output-tensor-type F16 --token-embedding-type F16`.
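
With the recompiled binaries, the `llama-quantize` call itself is then an ordinary one; a sketch is shown below, where the input and output file names are placeholders rather than the actual paths used. Q4_K_M is the base ftype that the patched branches key on for the Q4_K_XXXL variant.

```bash
# Placeholder file names; the patched llama_tensor_get_type() applies the
# per-tensor overrides, and these flags keep token_embd and output in F16.
./llama-quantize \
    --output-tensor-type F16 \
    --token-embedding-type F16 \
    Mistral-Small-24B-Instruct-2501-BF16.gguf \
    Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    Q4_K_M
```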

Some pseudocode with the modifications to `llama-quant.cpp` in the case of the Q4_K_XXXL quantization:

```cpp
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);
    ...
    } else if (name.find("blk.0") != std::string::npos) {
        // keep the entire first transformer layer in F16
        new_type = GGML_TYPE_F16;
    } else if (name.find("blk.39") != std::string::npos) {
        // keep the entire final transformer layer (layer 39 of 40) in F16
        new_type = GGML_TYPE_F16;
    ...
    } else if (name.find("attn_k.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            // attention K projection in F16
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            // attention Q projection in F16
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("ffn_down") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            // uniform Q4_K for the FFN down projections (otherwise mixed Q4_K/Q6_K)
            new_type = GGML_TYPE_Q4_K;
        }
    } else if (name.find("attn_output.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            // attention output projection in F16
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_qkv.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_F16;
    }
    ...
    return new_type;
}
```