bhenrym14 committed on
Commit 0343b8e
1 Parent(s): 717e68d

Update README.md

Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -15,7 +15,7 @@ fp16 weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pn
  ## Overview

  This is a finetune of Llama-2-13b, intended to extend the useful context window to 16384 tokens. There are two training phases:
- 1. It is first trained on a long-context (7000-8192 tokens) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset (GPT4 split only). This amounts to roughly 110mm tokens, seen twice over two epochs. Airoboros-like training prompt was used, with partial NTK scaling applied. This took ~45 hours.
+ 1. It is first trained on a long-context (7000-8192 tokens) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset (GPT4 split only). This amounts to roughly 110mm tokens. Airoboros-like training prompt was used, with partial NTK scaling applied. This took ~20 hours.
  2. The model was then finetuned on [Jon Durbin's Airoboros GPT4 1.4.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) for 3 epochs. This took ~17 hours.

  **This is a QLoRA fine-tune**.
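For readers unfamiliar with the setup, a minimal sketch of a typical QLoRA configuration follows. The hyperparameters, target modules, and base-model id are illustrative assumptions, not the recipe used for this finetune:

```python
# Hedged QLoRA sketch -- the values below are assumptions, not this model's training config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit NF4 base weights, as in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",             # base model named in the Overview
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=64,                                    # illustrative rank/alpha/dropout
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module list
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)    # only the LoRA adapters are trained
```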
@@ -24,11 +24,10 @@ All training was performed with 1x RTX 6000 Ada.

  ## How to Use

- This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet mplemented natively in Transformers or Exllama (as of 7/21). There are two options to run this, each of which will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384`. A monkeypatch can be found here:
+ This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or Exllama (as of 7/21). There are two options to run this, each of which will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384` and `original_max_position_embeddings=4096`. A monkeypatch can be found here:
  1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16).
  2. Autogptq/GPTQ-for-Llama. Use these quantized weights.
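For illustration, here is a minimal sketch of the kind of monkeypatch described above, wired into option 1 (Transformers + bitsandbytes). The import path and constructor signature of `LlamaPartNTKScaledRotaryEmbedding` are assumptions, not taken from this repo; the actual patch lives in the linked scaled-rope PR and the fp16 weights repo:

```python
# Hedged sketch only -- consult the linked scaled-rope PR for the real monkeypatch.
# Assumed here: the import path below and the `original_max_position_embeddings` kwarg name.
import transformers.models.llama.modeling_llama as llama_modeling
from transformers import AutoModelForCausalLM, AutoTokenizer

from scaled_rope.LlamaPartNTKScaledRotaryEmbedding import LlamaPartNTKScaledRotaryEmbedding  # hypothetical path


class PatchedRotaryEmbedding(LlamaPartNTKScaledRotaryEmbedding):
    """Drop-in replacement forcing the 16384-token window (Llama-2's native window is 4096)."""

    def __init__(self, dim, max_position_embeddings=4096, base=10000, device=None):
        super().__init__(
            dim,
            max_position_embeddings=16384,
            original_max_position_embeddings=4096,  # assumed kwarg name
            base=base,
            device=device,
        )


# Patch *before* loading so LlamaAttention instantiates the scaled embedding.
llama_modeling.LlamaRotaryEmbedding = PatchedRotaryEmbedding

model_id = "bhenrym14/airophin-13b-pntk-16k-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
```

Option 2 (Autogptq/GPTQ-for-Llama) requires the same class swap before loading the quantized weights.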

- **Note: Due to an erroneous `max_position_embeddings` figure in the base model config file, the RoPE scaling factor was computed with `original_max_position_embeddings=2048` (llama-2 should be 4096). This resulted in a scaling factor of 8 instead of 4, despite passing a new `max_position_embeddings=16384`. This could have a negative to neutral performance impact. I intend on retraining this model with the proper scaling factor. If and when I do so, I will replace the weights in this repo and make note of this change at the top of this model card.**

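To make the arithmetic in the note above concrete (illustrative only): the scaling factor is the ratio of the target window to the assumed native window, so the bad 2048 figure doubled it.

```python
# Scale factor = target context / assumed native context (illustrates the note above).
max_position_embeddings = 16384
print(max_position_embeddings / 2048)  # 8.0 -- what the erroneous config value produced
print(max_position_embeddings / 4096)  # 4.0 -- what Llama-2's true 4096 window implies
```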
  ## Motivation
  Methods of extending the useful context window of LLMs have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these are linear position interpolation ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) and [NTK-aware scaling](https://github.com/jquesnelle/scaled-rope). My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
 