LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
This repository contains the code, models and datasets for our paper [LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions].
Quick Links
- Overview
- LongMagpie Models
- LongMagpie Datasets
- Train Llama-3-8B-LongMagpie-512K-Instruct
- Evaluation
- Build your long-context instruction data
- Bugs or Questions?
Overview
High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
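To make the self-synthesis step concrete, below is a minimal sketch of the query elicitation described above, following the abstract's description (document first, then the special tokens that open a user turn). The exact prompt layout lives in longmagpie/infer_demo.py; the Qwen2.5-style template and the 7B model here are illustrative assumptions (the paper's synthesis used a larger aligned model).

```python
# Minimal sketch of LongMagpie's query elicitation (see longmagpie/infer_demo.py
# for the actual implementation and prompt layout). Assumption: a Qwen2.5-style
# chat template; the 7B model is an illustrative stand-in.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

document = "..."  # a document sampled from fineweb-edu

# Put the document first, then the special tokens that normally open a user
# turn; the aligned model continues by writing a query relevant to the document.
prompt = document + "\n<|im_start|>user\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True)
query = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(query)  # harvested as the user query paired with `document`
```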

LongMagpie Models
Our released models are listed as follows. You can load these models with HuggingFace's Transformers. All models are trained on long-context instruction data synthesized from fineweb-edu with the Qwen/Qwen2.5-72B-Instruct model. In the following comparison, we choose Llama-3-8B-NExtLong-512K-Instruct, which is trained with Magpie instruction data, as the baseline model. In addition, to maintain short-context performance, we propose a p-mix strategy that combines the LongMagpie and UltraChat datasets, resulting in the performance-balanced model Llama-3-8B-LongMagpie-p-mix-512K-Instruct.
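A minimal loading sketch with Transformers is shown below; the Hub ID is a placeholder, so substitute the exact model path from the corresponding model card.

```python
# Minimal loading sketch; replace `model_id` with the exact Hub ID or a local
# path from the model card (the name below is a placeholder, not a verified ID).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Llama-3-8B-LongMagpie-512K-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps memory manageable for long contexts
    device_map="auto",
)
```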
The performance on HELMET and RULER
Model | RULER Avg. | HELMET Avg. | HELMET Recall | HELMET RAG | HELMET ICL | HELMET Re-rank | HELMET LongQA |
---|---|---|---|---|---|---|---|
Llama-3-8B-NExtLong-512K-Instruct | 88.00 | 59.92 | 98.63 | 62.70 | 81.00 | 26.41 | 30.89 |
Llama-3-8B-LongMagpie-512K-Instruct | 91.17 | 62.10 | 97.53 | 63.37 | 85.84 | 28.60 | 35.16 |
Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 89.70 | 62.11 | 95.96 | 64.17 | 85.12 | 29.61 | 35.71 |
The performance on Longbench V2
Model | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
---|---|---|---|---|---|---|
Llama-3-8B-NExtLong-512K-Instruct | 30.8 | 33.9 | 28.9 | 37.8 | 27.4 | 25.9 |
Llama-3-8B-LongMagpie-512K-Instruct | 34.4 | 38.5 | 31.8 | 41.7 | 33.0 | 25.0 |
Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 33.0 | 35.9 | 31.2 | 37.2 | 34.9 | 22.2 |
The performance on Short-context Benchmarks
Model | Avg. | HellaSwag | LAMBADA | ARC-C | ARC-E | PIQA | WinoGrande | LogiQA | MMLU |
---|---|---|---|---|---|---|---|---|---|
Meta-Llama-3-8B-Instruct | 0.6332 | 0.5773 | 0.7171 | 0.5316 | 0.8165 | 0.7889 | 0.7198 | 0.2765 | 0.6376 |
Llama-3-8B-NExtLong-512K-Instruct | 0.6410 | 0.5953 | 0.7242 | 0.5188 | 0.8224 | 0.8079 | 0.7324 | 0.3041 | 0.6232 |
Llama-3-8B-LongMagpie-512K-Instruct | 0.6237 | 0.5803 | 0.7025 | 0.4804 | 0.8047 | 0.7938 | 0.7293 | 0.2780 | 0.6209 |
Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 0.6410 | 0.5893 | 0.7355 | 0.5282 | 0.8279 | 0.8052 | 0.7340 | 0.2842 | 0.6236 |
LongMagpie Datasets
Datasets list
Our released datasets are listed as follows. All datasets are synthesized from the short-text dataset fineweb-edu.
Dataset | Description |
---|---|
LongMagpie_singledoc_longcontext_dataset | Our 450k synthesized raw samples (refer to infer_demo.py). Each line contains a context extracted from fineweb-edu, a query generated by LongMagpie, and an answer. |
LongMagpie_multidoc_longcontext_dataset | Built from LongMagpie_singledoc_longcontext_dataset by applying the MultiDoc method (refer to multidoc_format.py) to extend the context length and convert the data into SFT dialogue format. |
LongMagpie_64k_dataset | LongMagpie_multidoc_longcontext_dataset tokenized and concatenated to a length of 64k (refer to build_sft_data.py), making it convenient to train with document masking. Use this dataset for the best long-context performance. |
LongMagpie_p-mix_64k_dataset | To maintain short-context performance, LongMagpie_multidoc_longcontext_dataset is tokenized, mixed with UltraChat using the p-mix strategy, and concatenated to a length of 64k (refer to p-mix.py). Use this dataset for balanced long- and short-context performance. |
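As a quick way to inspect the single-document data, here is a minimal sketch that assumes the file is distributed as JSON lines with the context/query/answer fields described above; check the dataset card for the exact file names and format.

```python
# Minimal inspection sketch. Assumption: JSON-lines files with the fields
# described in the table above (context / query / answer); the file name is a
# placeholder.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="LongMagpie_singledoc_longcontext_dataset.jsonl",
    split="train",
)
example = ds[0]
print(example["context"][:200])
print(example["query"])
print(example["answer"][:200])
```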
Train Llama-3-8B-LongMagpie-512K-Instruct
Requirements
Run the following command to install the required dependencies.
pip install -r requirements.txt
Train
bash train_sft.sh
Evaluation
Refer to HELMET, RULER, and Longbench v2 to evaluate the instruct model.
Build your long-context instruction data
1. Synthesizing Single-Document Q&A Data
Refer to infer_demo.py. Each line of the output contains a context extracted from fineweb-edu, a query generated by LongMagpie, and an answer (the answering step is sketched after the command below).
python longmagpie/infer_demo.py
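After a query has been elicited for a document, the same model answers it with the document as context. The helper below is a hedged sketch of that answering step, not the repository's exact prompts (those live in infer_demo.py).

```python
# Sketch of the answering step: the aligned model answers the elicited query
# with the document as context. Prompt details are illustrative only.
def answer(model, tokenizer, document, query, max_new_tokens=512):
    messages = [{"role": "user", "content": f"{document}\n\n{query}"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
```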
2. Synthesizing Multi-Document Q&A Data
Based on LongMagpie_singledoc_longcontext_dataset, we use the MultiDoc method (refer to multidoc_format.py) to extend the context length and convert the data into SFT dialogue format; a minimal sketch of the idea follows the command below.
python longmagpie/multidoc_format.py
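The sketch below assumes distractor documents are sampled from the same corpus and that the gold document is shuffled among them; multidoc_format.py is the authoritative implementation.

```python
import random

# Sketch of the MultiDoc extension: pad the relevant document with unrelated
# documents to lengthen the context, then wrap the result as an SFT dialogue.
# Distractor sampling and ordering are illustrative, not the repo's exact logic.
def to_multidoc_sft(example, distractor_pool, n_distractors=4):
    docs = random.sample(distractor_pool, n_distractors) + [example["context"]]
    random.shuffle(docs)  # the gold document lands at a random position
    long_context = "\n\n".join(docs)
    return {
        "messages": [
            {"role": "user", "content": f"{long_context}\n\n{example['query']}"},
            {"role": "assistant", "content": example["answer"]},
        ]
    }
```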
3. Dataset Concatenation
Following ProLong, we concatenate the data to a fixed 64k context length and train with document masking.
3.1 Concatenating Document Q&A Datasets Only
We tokenize LongMagpie_multidoc_longcontext_dataset and concatenate it to a length of 64k (refer to build_sft_data.py), making it convenient to train with document masking; a minimal packing sketch follows the command below. This dataset achieves the best long-context performance.
python longmagpie/build_sft_data.py
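For intuition, here is a minimal packing sketch under the assumption that each dialogue is already tokenized; build_sft_data.py is the real implementation and also writes whatever metadata the trainer needs for document masking.

```python
# Greedy packing sketch: concatenate tokenized dialogues into fixed 64k-token
# sequences and keep per-dialogue end offsets so attention can later be
# restricted to each dialogue (document masking). Over-long dialogues would
# need truncation in practice; that is omitted here.
MAX_LEN = 64 * 1024

def pack_sequences(tokenized_dialogues):
    packed, buf, ends = [], [], []
    for ids in tokenized_dialogues:  # each `ids` is a list of token ids
        if buf and len(buf) + len(ids) > MAX_LEN:
            packed.append({"input_ids": buf, "doc_ends": ends})
            buf, ends = [], []
        buf.extend(ids)
        ends.append(len(buf))
    if buf:
        packed.append({"input_ids": buf, "doc_ends": ends})
    return packed
```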
3.2 Using p-mix Strategy
To balance these capabilities, we introduce p-mix, a novel instruction data hybridization strategy. The core idea is twofold. First, to emulate the typical non-contextual start of general tasks, we sample a short-context instruction at the beginning of each training sequence. Second, we append subsequent data segments probabilistically to construct a mixed-context sequence up to a maximum length L_max. With probability P_L, a long-context instruction (generated by LongMagpie) is chosen; otherwise, with probability 1 - P_L, another short-context sample is chosen. This process repeats until the target sequence length is approached, ensuring each instance starts with a short, context-free instruction followed by a dynamically mixed sequence of long and short segments. A minimal sketch of this sampling loop follows the command below.
python longmagpie/build_sft_data_p_mix.py
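The sampling loop below is a minimal sketch of the p-mix procedure described above; the mixing probability and maximum length are illustrative hyperparameters, and build_sft_data_p_mix.py is the authoritative implementation.

```python
import random

# Sketch of the p-mix sampling loop. P_LONG and MAX_LEN are illustrative;
# each pool element is assumed to be a tokenized instruction sample.
MAX_LEN = 64 * 1024
P_LONG = 0.7  # probability of appending a long-context (LongMagpie) sample

def build_p_mix_sequence(short_pool, long_pool):
    seq = list(random.choice(short_pool))  # always start with a short, context-free sample
    while len(seq) < MAX_LEN:
        pool = long_pool if random.random() < P_LONG else short_pool
        nxt = random.choice(pool)
        if len(seq) + len(nxt) > MAX_LEN:
            break
        seq.extend(nxt)
    return seq
```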
Bugs or Questions?
If you have any questions related to the code or the paper, feel free to email Chaochen ([email protected]) and Xing Wu ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!