ffgcc commited on
Commit
702abd3
·
1 Parent(s): 11f05fa
.gitattributes CHANGED
@@ -33,3 +33,17 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ model-00005-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
37
+ model-00006-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
38
+ model-00007-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
39
+ model-00001-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
40
+ model-00002-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
41
+ model-00003-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
42
+ model-00004-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
43
+ config.json filter=lfs diff=lfs merge=lfs -text
44
+ generation_config.json filter=lfs diff=lfs merge=lfs -text
45
+ model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
46
+ special_tokens_map.json filter=lfs diff=lfs merge=lfs -text
47
+ tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
48
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
49
+ figure/LongMagpie.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,179 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
2
+
3
+ This repository contains the code, models and datasets for our paper [LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions].
4
+
5
+
6
+ ## Quick Links
7
+
8
+ - [Overview](#overview)
9
+ - [LongMagpie Models](#LongMagpie-models)
10
+ - [LongMagpie Datasets](#LongMagpie-datasets)
11
+ - [Datasets list](#datasets-list)
12
+ - [Train Llama-3-8B-LongMagpie-512K-Instruct](#train-LongMagpie512K)
13
+ - [Requirements](#requirements)
14
+ - [Evaluation](#evaluation)
15
+ - [Build your long-context instruction data](#build-long-data)
16
+ - [Bugs or Questions?](#bugs-or-questions)
17
+
18
+
19
+ <a id="overview"></a>
20
+
21
+ ## Overview
22
+
23
+
24
+ High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
25
+
26
+ <div style="text-align: center;">
27
+ <img src="figure/LongMagpie.png" width="700" height="350">
28
+ </div>
29
+
30
+ <a id="LongMagpie-models"></a>
31
+
32
+ ## LongMagpie Models
33
+
34
+ Our released models are listed as follows. You can import these models by using [HuggingFace's Transformers](https://github.com/huggingface/transformers). All models are trained on long-context instruction data synthesized by [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) model. In the following comparision, we choose [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) as a baseline model, which is trained with [Magpie instruction data](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.3-Pro-1M-v0.1). In addition, to maintain short-text performance, we propose a p-mix strategy that combines LongMagpie and [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) datasets, resulting in a performance-balanced model [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct).
35
+
36
+
37
+ #### The performance on [HELMET](https://github.com/princeton-nlp/HELMET) and [RULER](https://github.com/NVIDIA/RULER)
38
+
39
+ | Model | RULER Avg. | HELMET Avg. | HELMET Recall | HELMET RAG | HELMET ICL | HELMET Re-rank | HELMET LongQA |
40
+ |:-------------------------------|:-------:|:-------:|:------:|:-----:|:-----:|:-------:|:------:|
41
+ | [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | 88.00 | 59.92 | **98.63** | 62.70 | 81.00 | 26.41 | 30.89 |
42
+ | [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | **91.17** | 62.10 | 97.53 | 63.37 | **85.84** | 28.60 | 35.16 |
43
+ | [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | 89.70 | **62.11** | 95.96 | **64.17** | 85.12 | **29.61** | **35.71** |
44
+
45
+
46
+
47
+ #### The performance on [Longbench V2](https://github.com/THUDM/LongBench)
48
+
49
+ | Model | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
50
+ |--------------------------------------------|-------------|----------|----------|-----------|------------|----------|
51
+ | [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | 30.8 | 33.9 | 28.9 | 37.8 | 27.4 | **25.9** |
52
+ | [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | **34.4**| **38.5** |**31.8**| **41.7** |33 |25 |
53
+ | [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | 33 | 35.9 |31.2 |37.2 |**34.9**| 22.2 |
54
+
55
+
56
+
57
+
58
+
59
+
60
+
61
+ #### The performance on Short-context Benchmarks
62
+
63
+
64
+
65
+ | Model | Avg. | Hel. | Lam. | AR-C. | AR-E. | PIQA | Win. | Logiqa | MMLU |
66
+ |----------------------------|-------|-----------|----------------|---------------|----------|-------|------------|--------|-------|
67
+ | [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 0.6332 | 0.5773 | 0.7171 | 0.5316 | 0.8165 | 0.7889 | 0.7198 | 0.2765 | 0.6376 |
68
+ | [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | **0.6410** | **0.5953** | 0.7242 | 0.5188 | 0.8224 | **0.8079** | 0.7324 | **0.3041** | 0.6232 |
69
+ | [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | 0.6237 |0.5803 |0.7025 |0.4804| 0.8047| 0.7938 |0.7293| 0.278 |0.6209 |
70
+ | [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | **0.6410** | 0.5893 | **0.7355**| **0.5282**| **0.8279**| 0.8052| **0.734**| 0.2842| **0.6236** |
71
+
72
+
73
+
74
+
75
+ <a id="LongMagpie-datasets"></a>
76
+
77
+ ## LongMagpie Datasets
78
+
79
+ <a id="datasets-list"></a>
80
+
81
+ ### Datasets list
82
+
83
+ Our released datasets are listed as follows. All datasets are synthesized from the short-text datasets [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
84
+
85
+
86
+
87
+ | Dataset | Description |
88
+ |:-------------------------------|:--------|
89
+ | [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset) | Our synthesized 450k raw text files(refer to [infer_demo.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/infer_demo.py)). Each line of data contains context extracted from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), query generated by LongMapgie and answer. |
90
+ | [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) | Based on [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset), we used the MultiDoc method (refer to [multidoc_format.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/multidoc_format.py)) to extend the context length and transformed it into SFT dialogue format. |
91
+ | [LongMagpie_64k_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_64k_dataset) | We tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and concatenated it to a length of 64k (refer to [concat script](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data.py)), making it convenient to train using Document Mask technology. This dataset can be used to achieve the best long-text performance. |
92
+ | [LongMagpie_p-mix_64k_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_p-mix_64k_dataset) | To maintain short-text performance, we tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and mixed it with [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) using the p-mix strategy, concatenating to a length of 64k (refer to [p-mix.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data_p_mix.py)). This dataset can be used to achieve balanced long and short text performance. |
93
+
94
+
95
+ <a id="train-LongMagpie512K"></a>
96
+
97
+ ## Train Llama-3-8B-LongMagpie-512K-Instruct
98
+
99
+ <a id="requirements"></a>
100
+
101
+ ### Requirements
102
+
103
+ Run the following script to install the remaining dependencies and train the model.
104
+
105
+ ```bash
106
+ pip install -r requirements.txt
107
+ ```
108
+
109
+ ### Train
110
+
111
+ ```bash
112
+ bash train_sft.sh
113
+ ```
114
+
115
+
116
+ <a id="evaluation"></a>
117
+
118
+ ## Evaluation
119
+
120
+ Refer to the [HELMET](https://github.com/princeton-nlp/HELMET), [RULER](https://github.com/NVIDIA/RULER), and [Longbench V2](https://github.com/THUDM/LongBench) to evaluate the Instruct model.
121
+
122
+
123
+ <a id="build-long-data"></a>
124
+
125
+ ## Build your long-context instruction data
126
+
127
+
128
+ ### 1. Synthesizing Single-Document Q&A Data
129
+
130
+ Refer to [infer_demo.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/infer_demo.py). Each line of data contains context extracted from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), query generated by LongMapgie and answer.
131
+
132
+
133
+ ```bash
134
+ python longmagpie/infer_demo.py
135
+ ```
136
+
137
+ ### 2. Synthesizing Multi-Document Q&A Data
138
+
139
+ Based on [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset), we used the MultiDoc method (refer to [multidoc_format.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/multidoc_format.py)) to extend the context length and transformed it into SFT dialogue format.
140
+
141
+ ```bash
142
+ python longmagpie/multidoc_format.py
143
+ ```
144
+
145
+
146
+ ### 3. Dataset Concatenation
147
+
148
+ Following [ProLong](https://github.com/princeton-nlp/ProLong), we concatenate the datasets to a fixed 64k context length and train using Document Mask technology.
149
+
150
+ #### 3.1 Concatenating Document Q&A Datasets Only
151
+
152
+ We tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and concatenated it to a length of 64k (refer to [build_sft_data.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data.py)), making it convenient to train using Document Mask technology. This dataset can be used to achieve the best long-text performance.
153
+
154
+ ```bash
155
+ python longmagpie/build_sft_data.py
156
+ ```
157
+
158
+ #### 3.2 Using p-mix Strategy
159
+
160
+ To balance these capabilities, we introduce \textit{p}-Mix, a novel instruction data hybridization strategy. The core idea is twofold. First, to emulate the typical non-contextual start of general tasks, we sample a short-context instruction at the beginning of each training sequence. Second, we append subsequent data segments probabilistically to construct a mixed-context sequence up to length $L_{max}$. With probability $P_L$, a long-context instruction (generated by LongMagpie) is chosen; otherwise, with probability $1-P_L$, another short-context sample is chosen. This process repeats until approaching the target sequence length, ensuring each instance starts with a short, context-free instruction followed by a dynamically mixed sequence of long and short segments.
161
+
162
+ ```bash
163
+ python longmagpie/build_sft_data_p_mix.py
164
+ ```
165
+
166
+
167
+ <a id="bugs-or-questions"></a>
168
+
169
+ ## Bugs or questions?
170
+
171
+ If you have any questions related to the code or the paper, feel free to email Chaochen (`[email protected]`) and XingWu (`[email protected]`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!
172
+
173
+ <!-- ## Citation
174
+
175
+ Please cite our paper if you use LongMagpie in your work:
176
+
177
+ ```bibtex
178
+
179
+ ``` -->
config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4fc0d8f93206c0aafc94f170204edda028df2820bc4c78f84e66792305b192e3
3
+ size 771
figure/LongMagpie.png ADDED

Git LFS Details

  • SHA256: 70e3ea0255d36d310f3b6d640a2f42b16171da3dfd52736cc35d57ca41624107
  • Pointer size: 131 Bytes
  • Size of remote file: 286 kB
generation_config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88eac5bf673682f548fbf6e03281c033d882bcaf9166ba184a6c49014995cfcf
3
+ size 194
model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5d907a7d0ca10194879d3518e4c05488d7dc874744c8336d1e0a8c6cadb12e94
3
+ size 4886466168
model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5cc423509aeb7a9c5e8745537319350905ac0bba83956a0d5cee6dfc43ba22b
3
+ size 4832007448
model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5d6583ffcab682374a22ab02d3adb5de3c3784ab98fed5e59ad542ed06bd3ab2
3
+ size 4999813112
model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:96ddddb27f6b61dd68f141938828bc72c094993ad04b1dc3f307669487230322
3
+ size 4999813128
model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:50d7cdca0d81b1eb519e6eb1e24d35bdfa4d64550dfec0cc5c3681083d9d9010
3
+ size 4832007496
model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:39a49b903669bde90c8bcf5388363501c69701017322185fdd5c4906a179a47f
3
+ size 4999813120
model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c9bd47c5865d14ab8d6a4ae6307dcafe292f1db827f2657e4579455b4cd676a4
3
+ size 2571158184
model.safetensors.index.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bb83c0dcc965cf42c5bc8fa1b1d88eae170b5beb5b705297c33a6399be9d0d2d
3
+ size 23950
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f38c73729248f6c127296386e3cdde96e254636cc58b4169d3fd32328d9a8ec
3
+ size 296
tokenizer.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3c5cf44023714fb39b05e71e425f8d7b92805ff73f7988b083b8c87f0bf87393
3
+ size 17209961
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:da0e3a7cce6e4d787e85eb1c24d548420e0d7fe2c7a214e192795c46e40d75bb
3
+ size 50977