bhenrym14 committed on
Commit 47c54ae
1 Parent(s): b5a589c

Update README.md

Files changed (1)
  1. README.md +5 -8
README.md CHANGED
@@ -6,7 +6,7 @@ datasets:
 ---
 
 
- # Airophin: A Partial NTK RoPE Scaled QLoRA Fine-tune of Llama-2-13b (GPTQ quantized)
 
 LoRA Weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-LoRA
 
@@ -15,7 +15,7 @@ fp16 weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pn
 ## Overview
 
 This is a finetune of Llama-2-13b, intended to extend the useful context window to 16384 tokens. There are two training phases:
- 1. It is first trained on a long-context (>7000 to 8192 token range, GPT4 only) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset. This amounts to roughly 110mm tokens, seen twice over two epochs. Airoboros-like training prompt was used, with partial NTK scaling applied. This took ~45 hours.
 2. The model was then finetuned on [Jon Durbin's Airoboros GPT4 1.4.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) for 3 epochs. This took ~17 hours.
 
  **This is a QLoRA fine-tune**.
@@ -24,17 +24,14 @@ All training was performed with 1x RTX 6000 Ada.
 
 ## How to Use
 
- This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or Exllama (as of 7/21). There are two options to run this:
 1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16).
 2. Autogptq/GPTQ-for-Llama. Use these quantized weights.
 
- Each method will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384`. A monkeypatch can be found here.
-
-
 ## Motivation
- Methods of extending the useful context window of LLM's have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these is linear position interpolation (https://kaiokendev.github.io/til#extending-context-to-8k) and [meta AI)](https://arxiv.org/abs/2306.15595)) and NTK aware scaling. My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
 
- Unfortunately it has also been shown that LLM's frequently struggle to attend to salient information in the middle of the context window. Attending to nearby tokens is essential to producing syntactically correct and semantically coherent sentences. Essential context is also most commonly found at the beginning of a context window. With this in mind, it is unsurprising LLMs often attend more strongly to these areas. However, this the learned model behavior results in an "extrapolated deemphasis" when such embeddings are scaled? This hypothesis may be supported by the material improvements in perplexity achieved by training on long sequences (not just including the RoPE scaling during the fine-tune).
 
 Here I explore whether training on long sequences that have clear conceptual dependencies residing in the middle of the context helps attenuate the difficulties in attending to middle-context tokens. When/if I have time, I hope to perform a more rigorous assessment of the performance with respect to this specific issue.
 
 
 ---
 
 
+ # Airophin: An NTK-by-Parts RoPE Scaled QLoRA Fine-tune of Llama-2-13b (GPTQ quantized)
 
 LoRA Weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-LoRA
 
 
  ## Overview
 
 This is a finetune of Llama-2-13b, intended to extend the useful context window to 16384 tokens. There are two training phases:
+ 1. It is first trained on a long-context (7000-8192 tokens) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset (GPT4 split only). This amounts to roughly 110 million tokens, seen twice over two epochs. An Airoboros-like training prompt was used, with partial NTK scaling applied. This took ~45 hours.
 2. The model was then finetuned on [Jon Durbin's Airoboros GPT4 1.4.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) for 3 epochs. This took ~17 hours.
 
  **This is a QLoRA fine-tune**.
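
For context, a minimal QLoRA setup of this kind (4-bit base model with LoRA adapters via bitsandbytes/peft) looks roughly like the sketch below. The rank, alpha, target modules, and other hyperparameters are illustrative placeholders rather than the settings actually used here, and the partial NTK RoPE patch described under "How to Use" would be applied before training.

```python
# Illustrative QLoRA sketch only -- hyperparameters and target modules are
# placeholders, not the configuration actually used for this fine-tune.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Small trainable LoRA adapters on top of the frozen, quantized base
lora_config = LoraConfig(
    r=16,                    # placeholder rank
    lora_alpha=32,           # placeholder scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Phase 1 would train on the long-context dolphin subset (sequences up to
# 16384 tokens), phase 2 on airoboros-gpt4-1.4.1; the training loop is omitted.
```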
 
 
 ## How to Use
 
+ This model employs [Partial NTK RoPE Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or Exllama (as of 7/21). There are two options for running this model, each of which requires replacing the `LlamaRotaryEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384`; a monkeypatch can be found here, and a rough sketch of the swap follows the list below:
 1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16).
  2. Autogptq/GPTQ-for-Llama. Use these quantized weights.
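
Concretely, the swap looks roughly like the sketch below for option 1 (Transformers + bitsandbytes with the fp16 weights). The import path and constructor arguments for `LlamaPartNTKScaledRotaryEmbedding` are placeholders that will depend on the monkeypatch you use; treat this as a sketch of the idea, not the exact patch.

```python
# Rough sketch of the rotary-embedding swap. The import path and constructor
# arguments for LlamaPartNTKScaledRotaryEmbedding are placeholders -- take the
# real class from the scaled-rope PR / monkeypatch referenced above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from scaled_rope.modeling_llama import LlamaPartNTKScaledRotaryEmbedding  # placeholder import

MODEL = "bhenrym14/airophin-13b-pntk-16k-fp16"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_4bit=True,        # bitsandbytes quantization (option 1)
    torch_dtype=torch.float16,
    device_map="auto",
)

# Replace the stock rotary embedding on every attention layer with the
# partial-NTK scaled variant, extended to 16384 positions.
for layer in model.model.layers:
    attn = layer.self_attn
    attn.rotary_emb = LlamaPartNTKScaledRotaryEmbedding(
        attn.head_dim,
        max_position_embeddings=16384,
        device=attn.rotary_emb.inv_freq.device,
    )

model.config.max_position_embeddings = 16384
```

The AutoGPTQ route (option 2) is analogous: load these quantized weights with `AutoGPTQForCausalLM.from_quantized` and apply the same per-layer swap to the wrapped Llama model.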
 
  ## Motivation
+ Methods of extending the useful context window of LLMs have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these are linear position interpolation ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k), [Meta AI](https://arxiv.org/abs/2306.15595)) and [NTK-aware scaling](https://github.com/jquesnelle/scaled-rope). My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
 
+ Unfortunately, it has also been shown that LLMs frequently struggle to attend to salient information in the middle of the context window. Attending to nearby tokens is essential to producing syntactically correct and semantically coherent sentences. Essential context is also most commonly found at the beginning of a context window. With this in mind, it is unsurprising that LLMs often attend more strongly to these areas. Does this learned model behavior result in an "extrapolated deemphasis" when such embeddings are scaled? This hypothesis may be supported by the material improvements in perplexity achieved by training on long sequences (not just including the RoPE scaling during the fine-tune).
 
 Here I explore whether training on long sequences that have clear conceptual dependencies residing in the middle of the context helps attenuate the difficulties in attending to middle-context tokens. When/if I have time, I hope to perform a more rigorous assessment of the performance with respect to this specific issue.