usernameisokaynow committed on
Commit 680c743 · verified · 1 Parent(s): 784e97b

Update README.md

Files changed (1)
  1. README.md +29 -1
README.md CHANGED
@@ -63,6 +63,34 @@ Please consider setting temperature = 0 to get consistent outputs.
  - Transformers 4.47.1
  - Pytorch 2.5.1+cu121
 
+ ### Quantization
+
+ In case your hardware is not sufficient to run the large model, you might want to consider reducing its size via quantization. This shrinks the model so that it can run on less powerful hardware, but some of the information stored in the model is lost in the process.
+ Caution: we advise against using the model in quantized form, as much of the fine-tuned information will be lost. We therefore cannot guarantee that a quantized model matches the performance of the un-quantized variant provided by the domain expert.
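+ As a rough back-of-the-envelope check (the parameter count here is an assumption for illustration, not a published figure): model size ≈ number of parameters × bits per weight / 8, so a ~70B-parameter model drops from roughly 140GB at 16-bit to roughly 35-45GB at 4-bit, which is in line with the q4_0 file size mentioned further down.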
+
+ If you decide to give the quantized model a try, you are probably not able to load the full model into VRAM or RAM. We therefore recommend GGUF quantization, which lets you quantize without loading the full model into RAM.
+ To use GGUF quantization, the model first needs to be converted into the GGUF format. For this step the llama.cpp (release: b5233) tools should be sufficient.
+ Our model has a specialized setup that is not automatically detected by the new llama.cpp convert function, so we used the legacy converter with the flag "--vocab-type".
+ ``` Shell
+ python .\llama.cpp\examples\convert_legacy_llama.py .\ncos_model_directory\ --outfile ncos.gguf --vocab-type bpe
+ ```
+ The resulting file can now be quantized with a working memory footprint of less than 5 GB.
+ Please read up on the different kinds of quantization and the parameters for each option to choose the right quantization scheme for your use case.
+ As an example, here we apply an ad-hoc 4-bit quantization, q4_0:
+ ``` Shell
+ .\llama.cpp\llama-quantize.exe .\ncos.gguf .\ncos-q4_0.gguf q4_0
+ ```
+ The 4-bit version of the model is roughly 40GB in size, which reduces the GPU requirements considerably. When running with the "CPU" option, the model can even be run on high-end consumer setups.
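+ For orientation only, a CPU-only test run of the quantized file could look like the sketch below; the binary name and flags are assumptions based on standard llama.cpp usage (release b5233), not part of our documented setup:
+ ``` Shell
+ # Hypothetical smoke test: -ngl 0 keeps all layers on the CPU, -p passes a short test prompt
+ .\llama.cpp\llama-cli.exe -m .\ncos-q4_0.gguf -ngl 0 -p "Hello"
+ ```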
+ The easiest way to set up a local deployment is probably to use the ollama library (version: v0.6.7) to serve the model, but you can also stick to the gradio instructions above. It is important to read up on how to set up a "Modelfile" for your use case; the preferred system prompt and some additional settings can be found in the config files of the model. A sketch of such a Modelfile is given after the command below.
+ ``` Shell
+ ollama create ncos-q40 -f .\ncos-gguf\Modelfile
+ ```
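+ A minimal Modelfile might look like the following. This is only a sketch: the file path, the parameter, and the placeholder system prompt are assumptions, and the actual values should be taken from the model's config files.
+ ```
+ # Hypothetical Modelfile sketch (ollama Modelfile syntax), not the shipped configuration
+ FROM ./ncos-q4_0.gguf
+ PARAMETER temperature 0
+ SYSTEM """<insert the preferred system prompt from the model's config files>"""
+ ```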
+
  ## Recommended Hardware
 
- Running this model requires 2 or more 80GB GPUs, e.g. NVIDIA A100, with at least 150GB of free disk space.
+ Running this model requires 2 or more 80GB GPUs, e.g. NVIDIA A100, with at least 150GB of free disk space.
+
+ FYI:
+ We know that this is a "sporty" requirement. We produced the model as a PoC (proof of concept) implementation. However, we are planning to follow up the refinement of our model with a distilled version, produced under the supervision of domain experts. To make use of the model on more limited hardware, you can of course follow standard quantization procedures as sketched in the 'use with transformers' section.
+