bhenrym14/airophin-13b-pntk-16k-GPTQ

bhenrym14 committed on Jul 25, 2023

Commit 360ba9f • 1 Parent(s): 7a6f333

Update README.md

Files changed (1)

README.md +1 -1

@@ -28,7 +28,7 @@ All training was performed with 1x RTX 6000 Ada.
 This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or Exllama (as of 7/21). There are three options to run this.
 1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16). This will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384` and `original_max_position_embeddings=4096`. A monkeypatch can be found [here](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_pntk_monkey_patch.py).
 2. Autogptq/GPTQ-for-Llama. Use these quantized weights. Make the same replacement as in 1.
-3. Use ExLLama, replacing the `model.py` file with the [modified version](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/exllama_pntk/model.py). Use `compress_pos_emb=1` and `alpha_value = 1` (defaults). The necessary scaling values should flow from the configuration file. If you have done this correctly, there should be a dump of indications in the console indicating the scaling factor used (should be 4). If not, be sure your client is importing exllama from where you replaced the file. (ooba was from sitepackages for me). I hacked this together very quickly so don't be surprised if something goes wrong.
+3. Use ExLLama, replacing the `model.py` file with the [modified version](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/exllama_pntk/model.py). Use `compress_pos_emb=1` and `alpha_value = 1` (defaults). The necessary scaling values should flow from the configuration file. If you have done this correctly, there should be a dump of indications in the console indicating the scaling factor used (should be 4). If not, be sure your client is importing exllama from where you replaced the file. (ooba imported from sitepackages for me). I hacked this together very quickly so don't be surprised if something goes wrong. It shouldn't break functionality with normal models (as long as the model config file does not have `original_max_embeddings` defined) but I haven't tested this.
 Please comment with any questions. This hasn't been extensively tested.
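
For reference, the "should be 4" in option 3 matches the ratio of the two context lengths given in option 1 (16384 / 4096). The sketch below only illustrates that arithmetic and the "normal models are unaffected" fallback; it is not the code in the modified `model.py`. The helper name `pntk_scale_factor` is hypothetical, and the config key is assumed to be `original_max_position_embeddings` as written in option 1 (option 3 spells it `original_max_embeddings`, so check your own config file).

```python
def pntk_scale_factor(config: dict) -> float:
    """Return the context-extension ratio implied by a model config dict.

    Hypothetical helper for illustration only. If the config has no
    `original_max_position_embeddings` key (assumed name), treat it as a
    normal model and return 1.0, i.e. no extra scaling.
    """
    original = config.get("original_max_position_embeddings")
    if original is None:
        return 1.0
    return config["max_position_embeddings"] / original


# Values taken from option 1 of the README.
airophin_cfg = {
    "max_position_embeddings": 16384,
    "original_max_position_embeddings": 4096,
}
print(pntk_scale_factor(airophin_cfg))  # 4.0

# A config without the extra key falls back to no scaling.
print(pntk_scale_factor({"max_position_embeddings": 4096}))  # 1.0
```

If the console reports a factor other than 4 for this model, the likely causes are the ones the README already names: the client is importing an unpatched exllama, or the config file is missing the original-length key.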