InferenceIllusionist committed
Update README.md

README.md
# Mistral-Nemo-Instruct-12B-iMat-GGUF

> [!WARNING]
> <b>Important Note:</b> Inferencing is *only* available on this fork of llama.cpp at the moment: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo (all credit to iamlemec for his work on Mistral-Nemo support).
> Other front-ends like the main branch of llama.cpp, kobold.cpp, and text-generation-webui may not work as intended.
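
A minimal build-and-run sketch for that fork is below. This is an illustration, not project documentation: the `GGML_CUDA=1` flag, the `llama-cli` binary name, and the model filename are typical of llama.cpp builds from this period but vary by revision.

```sh
# Sketch: build the mistral-nemo fork and run one of the quants.
# Build flags, binary name, and model filename are illustrative assumptions.
git clone --branch mistral-nemo https://github.com/iamlemec/llama.cpp
cd llama.cpp
make -j                          # add GGML_CUDA=1 for an NVIDIA GPU build
./llama-cli -m ../Mistral-Nemo-Instruct-12B-iMat-Q5_K_M.gguf -p "Hello" -n 64
```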

Quantized from Mistral-Nemo-Instruct-2407 fp16
* Weighted quantizations were created using the fp16 GGUF and groups_merged.txt in 92 chunks with n_ctx=512
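
For concreteness, that recipe roughly corresponds to the imatrix flow in llama.cpp sketched below. This is a hedged illustration, not the exact commands used for this repo; filenames are placeholders and tool names vary by llama.cpp revision.

```sh
# Sketch of a weighted (imatrix) quantization pass; filenames are placeholders.
# 1) Compute the importance matrix from groups_merged.txt at n_ctx=512.
./llama-imatrix -m mistral-nemo-instruct-2407-f16.gguf \
    -f groups_merged.txt -c 512 -o imatrix.dat
# 2) Quantize the fp16 GGUF using that importance matrix (Q4_K_M shown).
./llama-quantize --imatrix imatrix.dat \
    mistral-nemo-instruct-2407-f16.gguf mistral-nemo-iMat-Q4_K_M.gguf Q4_K_M
```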

[...]

(Click on image to view in full size)
[<img src="https://i.imgur.com/mV0nYdA.png" width="920"/>](https://i.imgur.com/mV0nYdA.png)

> [!TIP]
> <b>Quant-specific Tips:</b>
> * If you are getting a `cudaMalloc failed: out of memory` error, try passing an argument for a lower context size in llama.cpp, e.g. for 8k: `-c 8192`
> * If all of your cards are Ampere generation or newer, you can use flash attention like so: `-fa`
> * Provided flash attention is enabled, you can also use quantized KV cache to save on VRAM, e.g. for 8-bit: `-ctk q8_0 -ctv q8_0`
> * Mistral recommends a temperature of 0.3 for this model (a combined invocation is sketched below)
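
Putting those tips together, one illustrative invocation might look like the following. The model filename and `-ngl 99` (full GPU offload) are assumptions; the other flags are the ones named above.

```sh
# Illustrative run combining the tips: 8k context, flash attention,
# 8-bit KV cache, and Mistral's recommended temperature of 0.3.
./llama-cli -m Mistral-Nemo-Instruct-12B-iMat-Q5_K_M.gguf \
    -c 8192 -fa -ctk q8_0 -ctv q8_0 --temp 0.3 -ngl 99
```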

Original model card can be found [here](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)