---
license: apache-2.0
base_model:
  - mistralai/Mistral-Small-24B-Instruct-2501
---

Following suggestions from section 6.2 in the Llama-3 paper and discussions elsewhere, here are experimental extra-large GGUF quantizations of vanilla Mistral-Small-24B-Instruct-2501, where:

  • token_embd and output are in FP16 precision;
  • All attention layers are in FP16 precision;
  • The entirety of the first and final transformer layers are in FP16 precision;
  • Intermediate feed-forward network (FFN) layers are in uniformly low precision.
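
As a quick sanity check, the resulting per-tensor types can be inspected with the gguf-dump script from llama.cpp's gguf-py Python package. This is only a sketch; the file name below is a placeholder, not one of the actual uploads:

pip install gguf
gguf-dump Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf
# The tensor listing should show F16 for token_embd, output, the attention
# projections and everything in blk.0/blk.39, and Q4_K (or Q3_K) for the
# remaining FFN tensors.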

At the same total model size, the computed perplexity values do not appear to be better than those of smaller standard GGUF quantizations, but supposedly this quantization scheme might help with real-world long-context performance and complex tasks while keeping size limited. Your mileage may vary.

Perplexity

Computed using llama-perplexity on a custom text file over 4 chunks, n_ctx=32768, batch_size=2048, n_seq=1
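
An invocation along these lines should reproduce that setup (a sketch; the model and text file paths are placeholders, and -c, -b and --chunks correspond to the parameters above):

./llama-perplexity -m Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf \
    -f custom-text.txt -c 32768 -b 2048 --chunks 4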

| Quantization | Size (GiB) | Perplexity | ΔP | Error (±) |
|--------------|------------|------------|--------|-----------|
| BF16 | 43.9 | 7.2512 | 0.0000 | 0.06239 |
| Q6_K | 18.0 | 7.2683 | 0.0171 | 0.06249 |
| Q4_K_XXXL | 18.3 | 7.2884 | 0.0372 | 0.06268 |
| Q4_K_M | 13.3 | 7.3155 | 0.0643 | 0.06295 |
| Q3_K_XXXL | 15.9 | 7.4084 | 0.1572 | 0.06409 |
| Q3_K_M | 10.7 | 7.4252 | 0.1740 | 0.06451 |

Method

These quantizations were made by naively modifying llama-quant.cpp in llama.cpp (specifically, the function ggml_type llama_tensor_get_type()), recompiling the project and invoking llama-quantize afterward. In summary, for Mistral-Small-24B I forced the F16 type for the attention layers and for the first and last transformer layers, and forced the Q4_K type for the FFN layers, which would otherwise use a mix of Q4_K and Q6_K precision. token_embd and output can be set to F16 precision directly with llama-quantize using the flags --output-tensor-type F16 --token-embedding-type F16.
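
For completeness, the overall procedure would look roughly like this, assuming a CMake build of llama.cpp and a BF16 source GGUF (file names are placeholders; the Q4_K_M target type is reused because that is the ftype being overridden in the pseudocode below):

# Rebuild llama.cpp after editing llama-quant.cpp.
cmake -B build
cmake --build build --config Release

# Quantize; token_embd and output are forced to F16 via the llama-quantize flags.
./build/bin/llama-quantize --output-tensor-type F16 --token-embedding-type F16 \
    Mistral-Small-24B-Instruct-2501-BF16.gguf \
    Mistral-Small-24B-Instruct-2501-Q4_K_XXXL.gguf Q4_K_M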

Some pseudocode with the modifications to llama-quant.cpp in the case of the Q4_K_XXXL quantization:

static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    ...
    } else if (name.find("blk.0") != std::string::npos) {
        // first transformer layer: keep every tensor in F16
        new_type = GGML_TYPE_F16;
    } else if (name.find("blk.39") != std::string::npos) {
        // last transformer layer (Mistral-Small-24B has 40 layers: blk.0 to blk.39)
        new_type = GGML_TYPE_F16;
    ...
    } else if (name.find("attn_k.weight") != std::string::npos) {
        ...
        // attention K projection forced to F16 (the stock Q4_K_M ftype is reused for the XXXL variant)
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        ...
        // attention Q projection forced to F16
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("ffn_down") != std::string::npos) {
        ...
        // FFN down projection: uniform Q4_K instead of the default Q4_K/Q6_K mix
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_Q4_K;
        }
    } else if (name.find("attn_output.weight") != std::string::npos) {
        // attention output projection forced to F16
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_F16;
        }
    } else if (name.find("attn_qkv.weight") != std::string::npos) {
        ...
        // fused QKV projection (where present) forced to F16
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_F16;
    }
    ...
}