Update README.md

Please consider setting temperature = 0 to get consistent outputs.

- Transformers 4.47.1
- PyTorch 2.5.1+cu121

### Quantization

If your hardware is not sufficient to run the full model, you might want to reduce its size via quantization. This shrinks the model so that it can run on lower-spec hardware; however, some of the information stored in the model will be lost.

Caution: we advise against using the model in quantized form, as much of the fine-tuned information will be lost. We therefore cannot guarantee that the quantized model matches the performance of the unquantized variant provided by the domain expert.

If you decide to give the quantized model a try, you will probably not be able to load the full model into VRAM or RAM. We therefore recommend GGUF quantization, which lets you quantize the model without loading it fully into RAM.

To use GGUF quantization, the model first needs to be converted into the GGUF format. For this step the llama.cpp tools (release b5233) should be sufficient.

Our model has a specialized setup that is not automatically detected by the newer llama.cpp convert script, so we used the legacy converter with the `--vocab-type` flag:

```shell
python .\llama.cpp\examples\convert_legacy_llama.py .\ncos_model_directory\ --outfile ncos.gguf --vocab-type bpe
```

The resulting file can now be quantized with less than 5 GB of working memory.

Please read up on the different kinds of quantization and the parameters of each option to choose the right quantization scheme for your use case.
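
If you are unsure which quantization types your build supports, running the quantize tool without any arguments should print a usage message that includes the list of allowed quantization types (this reflects the b5233 release we used; check your own build):

```shell
# Print usage, including the list of allowed quantization types (b5233 behaviour)
.\llama.cpp\llama-quantize.exe
```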

Here, as an example, we apply an ad-hoc 4-bit quantization (q4_0):

```shell
.\llama.cpp\llama-quantize.exe .\ncos.gguf .\ncos-q4_0.gguf q4_0
```

The 4-bit version of the model is roughly 40 GB in size, which reduces the GPU requirements considerably. When running via the CPU option, the model can now even be run on high-end consumer setups.
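
As a quick sanity check, the quantized GGUF can also be loaded directly with llama.cpp's llama-cli on CPU. The invocation below is a minimal sketch under our assumptions (binary location as in the commands above, the prompt is a placeholder, temperature set to 0 as recommended earlier):

```shell
# Minimal CPU test of the quantized model; the prompt is only a placeholder
.\llama.cpp\llama-cli.exe -m .\ncos-q4_0.gguf --temp 0 -n 256 -p "Your prompt here"
```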

The easiest way to set up a local deployment is probably to use ollama (version v0.6.7) to serve the model, but you can also stick to the Gradio instructions above. It is important to read up on how to set up a Modelfile according to your use case; the preferred system prompt and some additional settings can be found in the config files of the model.
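
How the Modelfile should look depends on your setup; as a rough, hypothetical sketch (the FROM path, system prompt, and parameter below are placeholders to be replaced with the values from the model's config files), it might contain something like:

```
# Hypothetical Modelfile sketch: replace path, system prompt, and parameters with your own values
FROM .\ncos-q4_0.gguf
SYSTEM "Preferred system prompt from the model's config files goes here."
PARAMETER temperature 0
```

The GGUF can then be registered with ollama via the Modelfile: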

```shell
ollama create ncos-q40 -f .\ncos-gguf\Modelfile
```
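
Once created, the model can be queried locally; for example (the prompt is again just a placeholder):

```shell
# Run a single prompt against the locally registered model
ollama run ncos-q40 "Your prompt here"
```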

## Recommended Hardware

Running this model requires 2 or more 80 GB GPUs, e.g. NVIDIA A100, with at least 150 GB of free disk space.

FYI: We know that this is a demanding requirement; we produced the model as a proof-of-concept (PoC) implementation. However, we are planning to follow up the refinement of our model with a distilled version, conducted by domain experts who can supervise the process. To make use of the model on more limited hardware, you can of course follow standard quantization procedures as sketched in the 'use with transformers' section.