init

Files changed (16) hide show

.gitattributes +14 -0
README.md +179 -0
config.json +3 -0
figure/LongMagpie.png +3 -0
generation_config.json +3 -0
model-00001-of-00007.safetensors +3 -0
model-00002-of-00007.safetensors +3 -0
model-00003-of-00007.safetensors +3 -0
model-00004-of-00007.safetensors +3 -0
model-00005-of-00007.safetensors +3 -0
model-00006-of-00007.safetensors +3 -0
model-00007-of-00007.safetensors +3 -0
model.safetensors.index.json +3 -0
special_tokens_map.json +3 -0
tokenizer.json +3 -0
tokenizer_config.json +3 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,17 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+model-00005-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00006-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00007-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00001-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00002-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00003-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+model-00004-of-00007.safetensors filter=lfs diff=lfs merge=lfs -text
+config.json filter=lfs diff=lfs merge=lfs -text
+generation_config.json filter=lfs diff=lfs merge=lfs -text
+model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
+special_tokens_map.json filter=lfs diff=lfs merge=lfs -text
+tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+figure/LongMagpie.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED Viewed

	@@ -0,0 +1,179 @@

+## LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
+This repository contains the code, models and datasets for our paper [LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions].
+## Quick Links
+  - [Overview](#overview)
+  - [LongMagpie Models](#LongMagpie-models)
+  - [LongMagpie Datasets](#LongMagpie-datasets)
+    - [Datasets list](#datasets-list)
+  - [Train Llama-3-8B-LongMagpie-512K-Instruct](#train-LongMagpie512K)
+    - [Requirements](#requirements)
+  - [Evaluation](#evaluation)
+  - [Build your long-context instruction data](#build-long-data)
+  - [Bugs or Questions?](#bugs-or-questions)
+<a id="overview"></a>
+## Overview
+High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
+<div style="text-align: center;">
+  <img src="figure/LongMagpie.png"  width="700" height="350">
+</div>
+<a id="LongMagpie-models"></a>
+## LongMagpie Models
+Our released models are listed as follows. You can import these models by using [HuggingFace's Transformers](https://github.com/huggingface/transformers). All models are trained on long-context instruction data synthesized by [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) model. In the following comparision, we choose [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) as a baseline model, which is trained with [Magpie instruction data](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.3-Pro-1M-v0.1). In addition, to maintain short-text performance, we propose a p-mix strategy that combines LongMagpie and [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) datasets, resulting in a performance-balanced model [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct).
+#### The performance on [HELMET](https://github.com/princeton-nlp/HELMET) and [RULER](https://github.com/NVIDIA/RULER)
+|              Model              |  RULER Avg.  |  HELMET Avg.  | HELMET Recall | HELMET RAG  | HELMET ICL  | HELMET Re-rank | HELMET LongQA |
+|:-------------------------------|:-------:|:-------:|:------:|:-----:|:-----:|:-------:|:------:|
+| [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | 88.00 | 59.92 | **98.63** | 	62.70 | 	81.00 | 	26.41 | 	30.89  |
+| [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | **91.17** | 62.10 | 97.53 | 63.37 | **85.84** | 28.60 | 35.16 |
+| [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | 89.70 | **62.11** | 95.96  |	**64.17** |	85.12 |	**29.61** |	**35.71** |
+#### The performance on [Longbench V2](https://github.com/THUDM/LongBench)
+| Model                                      | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
+|--------------------------------------------|-------------|----------|----------|-----------|------------|----------|
+| [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct)          | 30.8        | 33.9     | 28.9     | 37.8      | 27.4       | **25.9**     |
+| [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct)          | **34.4**|	**38.5**	|**31.8**|	**41.7**	|33	|25    |
+| [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct)           | 33 |	35.9	|31.2	|37.2	|**34.9**|	22.2     |
+#### The performance on Short-context Benchmarks
+| Model                      | Avg.   | Hel. | Lam. | AR-C. | AR-E. | PIQA  | Win. | Logiqa | MMLU  |
+|----------------------------|-------|-----------|----------------|---------------|----------|-------|------------|--------|-------|
+| [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)   | 0.6332 | 0.5773    | 0.7171         | 0.5316        | 0.8165   | 0.7889 | 0.7198     | 0.2765 | 0.6376 |
+| [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | **0.6410** | **0.5953**    | 0.7242         | 0.5188        | 0.8224   | **0.8079** | 0.7324     | **0.3041** | 0.6232 |
+| [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | 0.6237	|0.5803	|0.7025	|0.4804|	0.8047|	0.7938	|0.7293|	0.278	|0.6209 |
+| [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | **0.6410** |	0.5893 |	**0.7355**|	**0.5282**|	**0.8279**|	0.8052|	**0.734**|	0.2842|	**0.6236** |
+<a id="LongMagpie-datasets"></a>
+## LongMagpie Datasets
+<a id="datasets-list"></a>
+### Datasets list
+Our released datasets are listed as follows. All datasets are synthesized from the short-text datasets [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
+|              Dataset              | Description |
+|:-------------------------------|:--------|
+| [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset) | Our synthesized 450k raw text files(refer to [infer_demo.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/infer_demo.py)). Each line of data contains context extracted from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), query generated by LongMapgie and answer. |
+| [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) | Based on [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset), we used the MultiDoc method (refer to [multidoc_format.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/multidoc_format.py)) to extend the context length and transformed it into SFT dialogue format. |
+| [LongMagpie_64k_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_64k_dataset) | We tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and concatenated it to a length of 64k (refer to [concat script](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data.py)), making it convenient to train using Document Mask technology. This dataset can be used to achieve the best long-text performance. |
+| [LongMagpie_p-mix_64k_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_p-mix_64k_dataset) | To maintain short-text performance, we tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and mixed it with [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) using the p-mix strategy, concatenating to a length of 64k (refer to [p-mix.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data_p_mix.py)). This dataset can be used to achieve balanced long and short text performance. |
+<a id="train-LongMagpie512K"></a>
+## Train Llama-3-8B-LongMagpie-512K-Instruct
+<a id="requirements"></a>
+### Requirements
+Run the following script to install the remaining dependencies and train the model.
+```bash
+pip install -r requirements.txt
+```
+### Train
+```bash
+bash train_sft.sh
+```
+<a id="evaluation"></a>
+## Evaluation
+Refer to the [HELMET](https://github.com/princeton-nlp/HELMET), [RULER](https://github.com/NVIDIA/RULER), and [Longbench V2](https://github.com/THUDM/LongBench) to evaluate the Instruct model.
+<a id="build-long-data"></a>
+## Build your long-context instruction data
+### 1. Synthesizing Single-Document Q&A Data
+Refer to [infer_demo.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/infer_demo.py). Each line of data contains context extracted from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), query generated by LongMapgie and answer.
+```bash
+python longmagpie/infer_demo.py
+```
+### 2. Synthesizing Multi-Document Q&A Data
+Based on [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset), we used the MultiDoc method (refer to [multidoc_format.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/multidoc_format.py)) to extend the context length and transformed it into SFT dialogue format.
+```bash
+python longmagpie/multidoc_format.py
+```
+### 3. Dataset Concatenation
+Following [ProLong](https://github.com/princeton-nlp/ProLong), we concatenate the datasets to a fixed 64k context length and train using Document Mask technology.
+#### 3.1 Concatenating Document Q&A Datasets Only
+We tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and concatenated it to a length of 64k (refer to [build_sft_data.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data.py)), making it convenient to train using Document Mask technology. This dataset can be used to achieve the best long-text performance.
+```bash
+python longmagpie/build_sft_data.py
+```
+#### 3.2 Using p-mix Strategy
+To balance these capabilities, we introduce \textit{p}-Mix, a novel instruction data hybridization strategy. The core idea is twofold. First, to emulate the typical non-contextual start of general tasks, we sample a short-context instruction at the beginning of each training sequence. Second, we append subsequent data segments probabilistically to construct a mixed-context sequence up to length $L_{max}$. With probability $P_L$, a long-context instruction (generated by LongMagpie) is chosen; otherwise, with probability $1-P_L$, another short-context sample is chosen. This process repeats until approaching the target sequence length, ensuring each instance starts with a short, context-free instruction followed by a dynamically mixed sequence of long and short segments.
+```bash
+python longmagpie/build_sft_data_p_mix.py
+```
+<a id="bugs-or-questions"></a>
+## Bugs or questions?
+If you have any questions related to the code or the paper, feel free to email Chaochen (`[email protected]`) and XingWu (`[email protected]`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!
+<!-- ## Citation
+Please cite our paper if you use LongMagpie in your work:
+```bibtex
+``` -->

config.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4fc0d8f93206c0aafc94f170204edda028df2820bc4c78f84e66792305b192e3
+size 771

figure/LongMagpie.png ADDED Viewed

Git LFS Details

SHA256: 70e3ea0255d36d310f3b6d640a2f42b16171da3dfd52736cc35d57ca41624107
Pointer size: 131 Bytes
Size of remote file: 286 kB

generation_config.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:88eac5bf673682f548fbf6e03281c033d882bcaf9166ba184a6c49014995cfcf
+size 194

model-00001-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5d907a7d0ca10194879d3518e4c05488d7dc874744c8336d1e0a8c6cadb12e94
+size 4886466168

model-00002-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c5cc423509aeb7a9c5e8745537319350905ac0bba83956a0d5cee6dfc43ba22b
+size 4832007448

model-00003-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5d6583ffcab682374a22ab02d3adb5de3c3784ab98fed5e59ad542ed06bd3ab2
+size 4999813112

model-00004-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:96ddddb27f6b61dd68f141938828bc72c094993ad04b1dc3f307669487230322
+size 4999813128

model-00005-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:50d7cdca0d81b1eb519e6eb1e24d35bdfa4d64550dfec0cc5c3681083d9d9010
+size 4832007496

model-00006-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:39a49b903669bde90c8bcf5388363501c69701017322185fdd5c4906a179a47f
+size 4999813120

model-00007-of-00007.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c9bd47c5865d14ab8d6a4ae6307dcafe292f1db827f2657e4579455b4cd676a4
+size 2571158184

model.safetensors.index.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bb83c0dcc965cf42c5bc8fa1b1d88eae170b5beb5b705297c33a6399be9d0d2d
+size 23950

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6f38c73729248f6c127296386e3cdde96e254636cc58b4169d3fd32328d9a8ec
+size 296

tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3c5cf44023714fb39b05e71e425f8d7b92805ff73f7988b083b8c87f0bf87393
+size 17209961

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:da0e3a7cce6e4d787e85eb1c24d548420e0d7fe2c7a214e192795c46e40d75bb
+size 50977