lemonilia committed
Commit 3a9e559 · verified · 1 Parent(s): 287cf4e

Update README.md

Files changed (1)
  1. README.md (+55 -6)

README.md CHANGED

base_model:
  - mistralai/Mistral-Small-24B-Instruct-2501
---

Following suggestions from section 6.2 of the [Llama-3 paper](https://arxiv.org/abs/2407.21783) and discussions elsewhere, here are experimental extra-large GGUF quantizations of vanilla [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), where:

- `token_embd` and `output` are in FP16 precision;
- All attention layers are in FP16 precision;
- The first and final transformer layers are entirely in FP16 precision;
- The feed-forward network (FFN) layers in the intermediate transformer blocks are in _uniformly **low**_ precision.

For the same total model size, the computed perplexity values _do not appear to be better_ than those of smaller standard GGUF quantizations, but this quantization scheme might still help with real-world long-context performance and complex tasks while keeping the size contained. Your mileage may vary.

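As a quick sanity check of this tensor layout, the types stored in a downloaded file can be listed with the `gguf-dump` script from llama.cpp's `gguf` Python package. This is only a hedged sketch: the file name is a placeholder and the exact output format may differ between versions.

```bash
# Inspect tensor names and types (placeholder file name).
# token_embd/output, the attention tensors and blk.0/blk.39 should report F16,
# while the remaining FFN tensors should report the low-precision type (e.g. Q4_K).
pip install gguf
gguf-dump Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    | grep -E 'token_embd|output\.weight|attn_|ffn_|blk\.0\.|blk\.39\.'
```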

## Perplexity
Computed using `llama-perplexity` on a custom text file over 4 chunks, with n_ctx=32768, batch_size=2048 and n_seq=1 (an example invocation is sketched below the table).

| Quantization  | Size (GiB) | Perplexity | ΔPPL (vs. BF16) | Error (±) |
|:--------------|-----------:|-----------:|----------------:|----------:|
| BF16          | 43.9       | 7.2512     | 0.0000          | 0.06239   |
| Q6_K          | 18.0       | 7.2683     | 0.0171          | 0.06249   |
| **Q4_K_XXXL** | **18.3**   | **7.2884** | **0.0372**      | 0.06268   |
| Q4_K_M        | 13.3       | 7.3155     | 0.0643          | 0.06295   |
| **Q3_K_XXXL** | **15.9**   | **7.4084** | **0.1572**      | 0.06409   |
| Q3_K_M        | 10.7       | 7.4252     | 0.1740          | 0.06451   |

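For reference, a minimal sketch of the kind of `llama-perplexity` invocation behind these numbers, assuming a local llama.cpp build; the model and text file names are placeholders, since the custom evaluation text is not published here.

```bash
# Hypothetical example: 4 chunks at n_ctx=32768 with batch size 2048.
./llama-perplexity \
    -m Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    -f custom-eval-text.txt \
    -c 32768 \
    -b 2048 \
    --chunks 4
```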

## Method
These quantizations were made by naively modifying `llama-quant.cpp` in llama.cpp (specifically, the function `ggml_type llama_tensor_get_type()`), recompiling the project and invoking `llama-quantize` afterwards. In summary, I forced the F16 type for the attention tensors and for the entire first and last transformer layers of Mistral-Small-24B, and forced the Q4_K type for the FFN layers, which would otherwise be in mixed Q4_K and Q6_K precision. `token_embd` and `output` could be set to F16 precision directly with `llama-quantize`, using the flags `--output-tensor-type F16 --token-embedding-type F16` (see the example invocation below).

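A sketch of what the corresponding `llama-quantize` call might look like for the Q4_K_XXXL file, assuming the modified build described above; the input and output file names are placeholders.

```bash
# Hypothetical example: Q4_K_M is passed as the target ftype so that the modified
# LLAMA_FTYPE_MOSTLY_Q4_K_M branches in llama_tensor_get_type() take effect,
# while token_embd and output are forced to F16 via the two flags.
./llama-quantize \
    --output-tensor-type F16 \
    --token-embedding-type F16 \
    Mistral-Small-24B-Instruct-2501-BF16.gguf \
    Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    Q4_K_M
```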

Some pseudocode with the modifications to `llama-quant.cpp` in the case of the Q4_K_XXXL quantization:

```cpp
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);
    ...
    // Keep the entire first and last transformer layers (blk.0 / blk.39 for Mistral-Small-24B) in F16
    } else if (name.find("blk.0") != std::string::npos) {
        new_type = GGML_TYPE_F16;
    } else if (name.find("blk.39") != std::string::npos) {
        new_type = GGML_TYPE_F16;
    ...
    // Attention tensors in F16
    } else if (name.find("attn_k.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    // FFN down-projection uniformly in Q4_K instead of the default Q4_K/Q6_K mix
    } else if (name.find("ffn_down") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_Q4_K;
        }
    } else if (name.find("attn_output.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_qkv.weight") != std::string::npos) {
        ...
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_F16;
    }
    ...
}
```