ssz1111/GATEAU-1k-10k


ssz1111 committed on Sep 11, 2024

Commit 3ebfa54 · verified · 1 Parent(s): 636aae0

Update README.md

Files changed (1)

README.md +17 -3


@@ -1,3 +1,17 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+# GATEAU-LLaMA-7B-1K-10K
+The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated.
+The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment.
+Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples.
+However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance.
+To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts.
+We propose **GATEAU**, a novel framework designed to identify influential, high-quality samples enriched with long-range dependency relations by using two crafted methods:
+Homologous Models' Guidance (HMG) and Contextual Awareness Measurement (CAM).
+Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows.
+CAM, in turn, measures the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model’s attention is focused on important segments.
+Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs.
+Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations, and that the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
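
The README text above describes HMG as comparing the perplexity of a response under two homologous models with different context windows, then keeping the most challenging samples. A minimal sketch of that idea might look like the following. It is not the authors' implementation: the README does not say how the two perplexities are combined, so the plain difference used here, the assumption that the shorter-window model is the one that struggles, and the names `perplexity`, `hmg_style_score`, and `select_top_k` are all illustrative assumptions. The per-token log-probabilities are taken as given inputs rather than computed from real models.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a response from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def hmg_style_score(short_ctx_logprobs, long_ctx_logprobs):
    """Hypothetical score: perplexity gap between the two homologous models.

    Assumption (not from the README): a larger gap means the response is
    much harder for the short-window model, hinting at long-range dependencies.
    """
    return perplexity(short_ctx_logprobs) - perplexity(long_ctx_logprobs)


def select_top_k(scores, k):
    """Indices of the k highest-scoring samples, highest score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy data: (short-window log-probs, long-window log-probs) per sample.
samples = [
    ([-2.0, -2.0], [-0.5, -0.5]),  # short-window model struggles
    ([-1.0, -1.0], [-1.0, -1.0]),  # both models behave the same
]
scores = [hmg_style_score(short, long) for short, long in samples]
print(select_top_k(scores, k=1))  # → [0]
```

In practice the log-probabilities would come from scoring each response with the two models; CAM, which the README describes as attention-based, is not covered by this sketch.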