bhenrym14 committed
Commit 3fc3dba
1 Parent(s): 765aa18

Update README.md

Files changed (1)
  1. README.md +3 -6
README.md CHANGED
@@ -15,7 +15,7 @@ fp16 weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pn
  ## Overview
 
  This is a finetune of Llama-2-13b, intended to extend the useful context window to 16384 tokens. There are two training phases:
- 1. It is first trained on a long-context (>7000 to 8192 token range) subset of [dolphin](), a orca-like dataset. This amounts to roughly 110mm tokens, seen twice over two epochs. Airoboros-like training prompt was used. This took ~45 hours.
+ 1. It is first trained on a long-context (>7000 to 8192 token range, GPT4 only) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset. This amounts to roughly 110 million tokens, seen twice over two epochs. An Airoboros-like training prompt was used. This took ~45 hours.
  2. The model was then finetuned on [Jon Durbin's Airoboros 13B GPT4 1.4](https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.4) for 3 epochs. This took ~17 hours.
 
  **This is a QLoRA fine-tune**.
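For orientation, a minimal sketch of what a QLoRA setup of this kind generally looks like (4-bit NF4 base weights via bitsandbytes plus LoRA adapters via peft) is shown below. This is illustrative only: it is not the author's training script, and the base checkpoint and LoRA hyperparameters are placeholders.

```python
# Illustrative QLoRA setup: 4-bit NF4 base weights via bitsandbytes, LoRA adapters via peft.
# The base checkpoint and hyperparameters are placeholders, not the values used for this model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-13b-hf"  # assumed base checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,                     # placeholder rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```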
@@ -37,6 +37,7 @@ Methods of extending the useful context window of LLMs have gained significant
  Unfortunately, it has also been shown that LLMs frequently struggle to attend to salient information in the middle of the context window. Attending to nearby tokens is essential to producing syntactically correct and semantically coherent sentences, and relevant context is also most commonly found at the beginning of a context window. Perhaps the learned model behavior with respect to token position results in an "extrapolated de-emphasis" of middle positions when the position embeddings are scaled? This hypothesis would be supported by the material improvements in perplexity achieved by training on long sequences (rather than merely applying the RoPE scaling during the fine-tune).
 
  Here I explore whether training on long sequences that have clear conceptual dependencies residing in the middle of the context helps attenuate the difficulties in attending to middle-context tokens.
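To make the RoPE scaling concrete, the sketch below shows how linear position interpolation can be requested when loading a Llama checkpoint with `transformers`. The model path and the scaling factor of 4 (assuming a 4096 to 16384 token extension) are assumptions for illustration, not values confirmed by this card.

```python
# Illustrative only: requesting linear RoPE scaling ("position interpolation")
# when loading a Llama checkpoint with transformers. The path and the factor
# are assumptions; a factor of 4.0 corresponds to a 4096 -> 16384 extension.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/merged-fp16-weights"  # placeholder for the fp16 repo linked above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    rope_scaling={"type": "linear", "factor": 4.0},  # positions are divided by the factor
    device_map="auto",
)
```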
 
+
  ## Relative Performance (perplexity)
  | Model | Context (tokens) | Perplexity |
  | ---------------------------------------------------- | ----------- | ---------- |
@@ -53,14 +54,10 @@ Here I explore whether training on long sequences that have clear conceptual dep
  | **bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-GPTQ** | 4096 | **5.15** |
 
 
- - For contexts shorter than the original 2048, the original model has lower perplexity. This is consistent with the literature. The gap shrinks with context length, with the original becoming incoherent beyond this point.
- - In terms of perplexity, this model outperforms the SuperHOT variant at all tested context lengths. I haven't used models with the SuperHOT LoRA enough to have any sense of performance differences, but feedback on the 33b variant suggests it is particularly noticable at longer context lengths.
- - This comparison isn't perfect. I did use the 1.4.1 dataset, the quantization method is slightly different, and the finetuning method is different (QLoRA vs full). In short, there are other potentially influential variables responsible for these performance differences.
 
- This model could be a little undertrained. I'll update the weights if I end up training it longer and/or with better hyperparameters
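For context on how numbers like those in the table above can be produced, here is a rough chunked perplexity estimate at a fixed window size. The corpus (wikitext-2), the non-overlapping-window scheme, and the paths are illustrative assumptions, not the author's evaluation harness.

```python
# Rough chunked perplexity estimate at a fixed context length. The corpus
# (wikitext-2) and the non-overlapping-window scheme are illustrative choices,
# not necessarily the evaluation behind the table above.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/model"  # placeholder
context_len = 4096            # window size to evaluate at

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

nlls = []
with torch.no_grad():
    for start in range(0, ids.numel() - context_len, context_len):
        window = ids[start : start + context_len].unsqueeze(0).to(model.device)
        out = model(window, labels=window)  # loss = mean next-token NLL over the window
        nlls.append(out.loss.float())

ppl = torch.exp(torch.stack(nlls).mean()).item()
print(f"Perplexity @ {context_len} tokens: {ppl:.2f}")
```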
  ## Quantization:
 
- The merged model was quantized with AutoGPTQ (bits = 4, group_size = 128, desc_act = True).
+ The merged model was quantized with AutoGPTQ (bits = 4, group_size = 64, desc_act = True).
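A quantization run with those settings might look roughly like the sketch below, using AutoGPTQ's `BaseQuantizeConfig`. The paths and calibration data are placeholders rather than the author's actual setup.

```python
# Rough sketch of an AutoGPTQ 4-bit quantization with the settings named above.
# Calibration examples and paths are placeholders, not the author's actual setup.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

merged_fp16 = "path/to/merged-fp16-model"   # placeholder
out_dir = "path/to/quantized-output"        # placeholder

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=64,   # per the updated README (previously 128)
    desc_act=True,   # activation-order ("act-order") quantization
)

tokenizer = AutoTokenizer.from_pretrained(merged_fp16, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(merged_fp16, quantize_config)

# A couple of calibration samples for illustration; real runs use many long examples.
calib_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization calibration text would normally be much longer than this.",
]
examples = [tokenizer(t, return_tensors="pt") for t in calib_texts]

model.quantize(examples)
model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)
```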
 
  ## Prompting:
 