Update README.md
This is a finetune of Llama-2-13b, intended to extend the useful context window.
All training was performed with 1x RTX 6000 Ada.

**For the standard 4096 context length model using airoboros-gpt4-1.4.1 see: [Jon Durbin's airoboros-l2-13b-gpt4-1.4.1](https://huggingface.co/jondurbin/airoboros-l2-13b-gpt4-1.4.1)**
## How to Use
This model employs [Partial NTK RoPE Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or ExLlama (as of 7/21). There are three options to run this:
1. Transformers (use bnb for quantization). Use the [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16). This will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384` and `original_max_position_embeddings=4096`. A monkeypatch can be found [here](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_pntk_monkey_patch.py). See the loading sketch after this list.
2. AutoGPTQ/GPTQ-for-LLaMa. Use these quantized weights. Make the same embedding replacement as in option 1.
3. Use ExLlama, replacing the `model.py` file with the [modified version](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/exllama_pntk/model.py). Use `compress_pos_emb=1` and `alpha_value=1` (the defaults); the necessary scaling values should flow from the configuration file. If you have done this correctly, the console should print output indicating the scaling factor used (it should be 4). If not, make sure your client is importing ExLlama from the location where you replaced the file (for me, ooba was importing it from site-packages).
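For option 1, here is a minimal loading sketch. It is an illustration under assumptions, not a definitive recipe: the module name `llama_pntk_monkey_patch`, the function `replace_llama_rope_with_pntk`, and its keyword arguments stand in for whatever the linked monkeypatch actually exports, and the bitsandbytes settings are just one reasonable choice.

```python
# Hedged sketch for option 1: Transformers + bitsandbytes, with the partial-NTK
# RoPE monkeypatch applied BEFORE the weights are loaded. The patch import and
# its signature are assumptions -- check the linked llama_pntk_monkey_patch.py
# for the actual entry point that swaps LlamaRotaryEmbedding for
# LlamaPartNTKScaledRotaryEmbedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical name; the real patch lives in the repository linked above.
from llama_pntk_monkey_patch import replace_llama_rope_with_pntk

# Apply the embedding replacement with the scaling described above.
replace_llama_rope_with_pntk(
    max_position_embeddings=16384,
    original_max_position_embeddings=4096,
)

model_id = "bhenrym14/airophin-13b-pntk-16k-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

prompt = "Summarize the following document:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```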
Please comment with any questions. This hasn't been extensively tested.
## Motivation
Methods of extending the useful context window of LLMs have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these are linear position interpolation ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k), [Meta AI](https://arxiv.org/abs/2306.15595)) and [NTK-aware scaling](https://github.com/jquesnelle/scaled-rope). My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
Here I explore whether training on long sequences that have clear conceptual dependencies […]
| 512 | 7.62 | 8.24 | 7.90 | **7.23** |
| 1024 | 6.20 | 6.71 | 6.17 | **5.85** |
| 2048 | 5.38 | 5.87 | 5.23 | **5.07** |
| 4096 | 5.08 | 5.50 | 4.91 | **4.77** |
| 8192 | **4.90** | 5.32 | Not Tested | 57.1 |
| 12000 | **4.82** | 56.1 | Not Tested | Not Tested |
- This model is very competitive with the Llama-1 33b extended context variants.
- Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 54.9. While perhaps an insignificant difference, the fact that there isn't a clear performance regression despite the context extension is notable.
- Perplexity continues to decline out to 12000 tokens, the longest context length I tested (due to VRAM constraints).
- Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
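For reference, the perplexity figures above are a function of context length. The sketch below shows one generic way such numbers are computed (scoring non-overlapping fixed-length chunks of a long text); it is not necessarily the exact evaluation used for this table, and `eval_corpus.txt` is a placeholder.

```python
# Generic perplexity-vs-context-length sketch (not necessarily the exact
# evaluation behind the table above). Scores the model on non-overlapping
# chunks of `context_len` tokens and reports exp(mean NLL).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bhenrym14/airophin-13b-pntk-16k-fp16"  # fp16 weights linked above
context_len = 4096                                 # e.g. 512, 2048, 8192, 12000

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

# Placeholder corpus; any sufficiently long evaluation text works.
text = open("eval_corpus.txt").read()
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

losses = []
with torch.no_grad():
    for start in range(0, ids.size(1) - context_len + 1, context_len):
        chunk = ids[:, start:start + context_len]
        # labels == input_ids: Transformers shifts internally and returns the
        # mean next-token negative log-likelihood over the chunk.
        losses.append(model(chunk, labels=chunk).loss.item())

# Chunks are equal length, so the mean of per-chunk losses is the overall mean NLL.
ppl = torch.exp(torch.tensor(sum(losses) / len(losses)))
print(f"Perplexity @ {context_len} tokens: {ppl.item():.2f}")
```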
## Quantization: