Update README.md
Browse files
README.md
CHANGED
@@ -34,9 +34,9 @@ Each method will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaled
|
|
34 |
## Motivation
|
35 |
Methods of extending the useful context window of LLM's have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these is linear position interpolation (https://kaiokendev.github.io/til#extending-context-to-8k) and [meta AI)](https://arxiv.org/abs/2306.15595)) and NTK aware scaling. My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
|
36 |
|
37 |
-
Unfortunately it has also been shown that LLM's frequently struggle to attend to salient information in the middle of the context window. Attending to nearby tokens is essential to producing syntactically correct and semantically coherent sentences.
|
38 |
|
39 |
-
Here I explore whether training on long sequences that have clear conceptual dependencies residing in the middle of the context helps attenuate the difficulties in attending to middle-context tokens.
|
40 |
|
41 |
## Relative Performance (perplexity)
|
42 |
| Model | Context (tokens) | Perplexity |
|
|
|
34 |
## Motivation
|
35 |
Methods of extending the useful context window of LLM's have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these is linear position interpolation (https://kaiokendev.github.io/til#extending-context-to-8k) and [meta AI)](https://arxiv.org/abs/2306.15595)) and NTK aware scaling. My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
|
36 |
|
37 |
+
Unfortunately it has also been shown that LLM's frequently struggle to attend to salient information in the middle of the context window. Attending to nearby tokens is essential to producing syntactically correct and semantically coherent sentences. Essential context is also most commonly found at the beginning of a context window. With this in mind, it is unsurprising LLMs often attend more strongly to these areas. However, this the learned model behavior results in an "extrapolated deemphasis" when such embeddings are scaled? This hypothesis may be supported by the material improvements in perplexity achieved by training on long sequences (not just including the RoPE scaling during the fine-tune).
|
38 |
|
39 |
+
Here I explore whether training on long sequences that have clear conceptual dependencies residing in the middle of the context helps attenuate the difficulties in attending to middle-context tokens. When/if I have time, I hope to perform a more rigorous assessment of the peformance with respect to this specific issue.
|
40 |
|
41 |
## Relative Performance (perplexity)
|
42 |
| Model | Context (tokens) | Perplexity |
|