ssz1111 committed on
Commit 3ebfa54 · verified · 1 Parent(s): 636aae0

Update README.md

Files changed (1)
  1. README.md +17 -3
README.md CHANGED
@@ -1,3 +1,17 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ # GATEAU-LLaMA-7B-1K-10K
+
+ Extending large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated.
+ The primary obstacle is constructing a high-quality long instruction-following dataset for long-context alignment.
+ Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples.
+ However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and limit final performance.
+ To bridge this gap, we address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies needed to handle instructions and lengthy input contexts.
+ We propose **GATEAU**, a novel framework designed to identify influential, high-quality samples enriched with long-range dependency relations by using two purpose-built techniques,
+ Homologous Models' Guidance (HMG) and Contextual Awareness Measurement (CAM).
+ Specifically, HMG measures how difficult the corresponding response is to generate because of long-range dependencies, by comparing the perplexity scores that two homologous models with different context windows assign to the response.
+ CAM, in turn, measures how difficult the long input context is to understand because of long-range dependencies, by evaluating whether the model's attention is focused on important segments.
+ Building on both methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby improving LLM performance.
+ Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations, and that the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
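
The HMG and CAM descriptions in the added README text can be illustrated with a minimal sketch. This is not GATEAU's implementation: the README does not state the scoring formulas, so the function names, the plain perplexity gap used for the HMG-style score, the cosine-based mismatch used for the CAM-style score, and the top-k selection helper are all assumptions made for illustration. Inputs are plain Python lists (per-token log-probabilities, per-segment attention mass, per-segment importance) that would come from real models in practice.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a response from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def hmg_style_score(short_ctx_logprobs: Sequence[float],
                    long_ctx_logprobs: Sequence[float]) -> float:
    """Hypothetical HMG-style score for one sample.

    Assumption: a plain perplexity gap between the short-window and the
    long-window homologous model. A larger gap suggests the response relies
    more heavily on long-range context.
    """
    return perplexity(short_ctx_logprobs) - perplexity(long_ctx_logprobs)


def cam_style_score(attention_mass: Sequence[float],
                    importance: Sequence[float]) -> float:
    """Hypothetical CAM-style score for one sample.

    Assumption: 1 - cosine similarity between the attention mass the model
    puts on each context segment and an importance weight per segment.
    A higher value means attention is less focused on important segments.
    """
    if len(attention_mass) != len(importance):
        raise ValueError("one attention value and one importance value per segment")
    dot = sum(a * b for a, b in zip(attention_mass, importance))
    norm = (math.sqrt(sum(a * a for a in attention_mass))
            * math.sqrt(sum(b * b for b in importance)))
    if norm == 0:
        raise ValueError("vectors must not be all zeros")
    return 1.0 - dot / norm


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring (most challenging) samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

How the two scores are normalized and combined before selection is left open here, since the README text does not specify it; `select_top_k` only shows the final "keep the most challenging samples" step on an already combined score list.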