
LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

This repository contains the code, models and datasets for our paper [LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions].

Overview

High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
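For intuition, the snippet below is a minimal sketch of this query-elicitation step, assuming a Llama-3-style instruct model and its chat-template special tokens; the exact prompt construction used in infer_demo.py may differ.

```python
# Minimal sketch of the LongMagpie idea, assuming a Llama-3-style instruct
# model and its chat-template tokens; infer_demo.py may differ in details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # any aligned long-context LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

document = "..."  # a document sampled from fineweb-edu

# The document is followed by the special tokens that normally precede a user
# turn; stopping the template here makes the aligned model auto-regressively
# "write" a contextually relevant user query about the document.
prompt = document + "<|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
query = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

# The harvested (document, query) pair is then answered with the ordinary chat
# template to complete the (context, query, answer) instruction triple.
messages = [{"role": "user", "content": document + "\n\n" + query}]
answer_inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
answer_out = model.generate(answer_inputs, max_new_tokens=512)
answer = tok.decode(answer_out[0, answer_inputs.shape[1]:], skip_special_tokens=True)
```

Repeating this over many fineweb-edu documents yields the document-query-answer triples described above.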

LongMagpie Models

Our released models are listed below. You can load them with Hugging Face Transformers. All models are trained on long-context instruction data synthesized from fineweb-edu with the Qwen/Qwen2.5-72B-Instruct model. In the following comparison, we use Llama-3-8B-NExtLong-512K-Instruct, which is trained with Magpie instruction data, as the baseline. In addition, to maintain short-context performance, we propose a p-mix strategy that combines the LongMagpie and UltraChat datasets, resulting in the performance-balanced model Llama-3-8B-LongMagpie-p-mix-512K-Instruct.

The performance on HELMET and RULER

| Model | RULER Avg. | HELMET Avg. | HELMET Recall | HELMET RAG | HELMET ICL | HELMET Re-rank | HELMET LongQA |
|---|---|---|---|---|---|---|---|
| Llama-3-8B-NExtLong-512K-Instruct | 88.00 | 59.92 | 98.63 | 62.70 | 81.00 | 26.41 | 30.89 |
| Llama-3-8B-LongMagpie-512K-Instruct | 91.17 | 62.10 | 97.53 | 63.37 | 85.84 | 28.60 | 35.16 |
| Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 89.70 | 62.11 | 95.96 | 64.17 | 85.12 | 29.61 | 35.71 |

The performance on Longbench V2

| Model | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|---|---|---|---|---|---|---|
| Llama-3-8B-NExtLong-512K-Instruct | 30.8 | 33.9 | 28.9 | 37.8 | 27.4 | 25.9 |
| Llama-3-8B-LongMagpie-512K-Instruct | 34.4 | 38.5 | 31.8 | 41.7 | 33.0 | 25.0 |
| Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 33.0 | 35.9 | 31.2 | 37.2 | 34.9 | 22.2 |

The performance on Short-context Benchmarks

| Model | Avg. | HellaSwag | LAMBADA | ARC-C | ARC-E | PIQA | WinoGrande | LogiQA | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Meta-Llama-3-8B-Instruct | 0.6332 | 0.5773 | 0.7171 | 0.5316 | 0.8165 | 0.7889 | 0.7198 | 0.2765 | 0.6376 |
| Llama-3-8B-NExtLong-512K-Instruct | 0.6410 | 0.5953 | 0.7242 | 0.5188 | 0.8224 | 0.8079 | 0.7324 | 0.3041 | 0.6232 |
| Llama-3-8B-LongMagpie-512K-Instruct | 0.6237 | 0.5803 | 0.7025 | 0.4804 | 0.8047 | 0.7938 | 0.7293 | 0.2780 | 0.6209 |
| Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 0.6410 | 0.5893 | 0.7355 | 0.5282 | 0.8279 | 0.8052 | 0.7340 | 0.2842 | 0.6236 |

LongMagpie Datasets

Datasets list

Our released datasets are listed below. All datasets are synthesized from the short-text dataset fineweb-edu.

| Dataset | Description |
|---|---|
| LongMagpie_singledoc_longcontext_dataset | Our 450k synthesized raw samples (refer to infer_demo.py). Each line contains a context extracted from fineweb-edu, a query generated by LongMagpie, and an answer. |
| LongMagpie_multidoc_longcontext_dataset | Built from LongMagpie_singledoc_longcontext_dataset with the MultiDoc method (refer to multidoc_format.py), which extends the context length and converts the data into SFT dialogue format. |
| LongMagpie_64k_dataset | LongMagpie_multidoc_longcontext_dataset tokenized and concatenated to a length of 64k (refer to the concat script), so training can use document masking. Use this dataset for the best long-context performance. |
| LongMagpie_p-mix_64k_dataset | To maintain short-context performance, LongMagpie_multidoc_longcontext_dataset is tokenized, mixed with UltraChat using the p-mix strategy, and concatenated to a length of 64k (refer to p-mix.py). Use this dataset for balanced long- and short-context performance. |
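As a usage note, the released data can be loaded with the Hugging Face `datasets` library; the repository ids below assume the datasets are hosted under the `caskcsg` organization, so adjust them if the actual paths differ.

```python
# Hypothetical loading example; the repo ids assume the datasets live under
# the caskcsg organization on the Hugging Face Hub -- adjust if they differ.
from datasets import load_dataset

sft_64k = load_dataset("caskcsg/LongMagpie_64k_dataset", split="train")
p_mix_64k = load_dataset("caskcsg/LongMagpie_p-mix_64k_dataset", split="train")

print(sft_64k)             # dataset size and features
print(sft_64k[0].keys())   # fields of one packed 64k training example
```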

Train Llama-3-8B-LongMagpie-512K-Instruct

Requirements

Run the following commands to install the dependencies and then train the model.

pip install -r requirements.txt

Train

bash train_sft.sh

Evaluation

Refer to the HELMET, RULER, and Longbench v2 benchmarks to evaluate the instruct models.

Build your long-context instruction data

1. Synthesizing Single-Document Q&A Data

Refer to infer_demo.py. Each line of the generated data contains a context extracted from fineweb-edu, a query generated by LongMagpie, and an answer.

python longmagpie/infer_demo.py
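Each output line can then be parsed as JSON; the file name and field names used below (`context`, `query`, `answer`) mirror the description above but are assumptions for illustration, so check the keys actually written by infer_demo.py.

```python
import json

# Assumed file name and field names for illustration; verify against the
# actual output of infer_demo.py.
with open("longmagpie_singledoc.jsonl") as f:
    for line in f:
        record = json.loads(line)
        context, query, answer = record["context"], record["query"], record["answer"]
        print(query[:80], "->", answer[:80])
        break  # inspect only the first record
```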

2. Synthesizing Multi-Document Q&A Data

Based on LongMagpie_singledoc_longcontext_dataset, we use the MultiDoc method (refer to multidoc_format.py) to extend the context length and transform the data into SFT dialogue format.

python longmagpie/multidoc_format.py
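As a rough illustration of the MultiDoc idea (not the exact logic of multidoc_format.py), the sketch below pads each single-document sample with distractor documents and rewrites it as an SFT-style dialogue; the field names are assumptions carried over from the single-document format.

```python
import random

def to_multidoc(sample, distractor_pool, num_distractors=4):
    """Rough MultiDoc sketch: mix the gold document with distractors to
    lengthen the context (assumed fields: context / query / answer)."""
    docs = random.sample(distractor_pool, num_distractors) + [sample["context"]]
    random.shuffle(docs)  # hide the gold document among the distractors
    long_context = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    return {
        "messages": [
            {"role": "user", "content": long_context + "\n\n" + sample["query"]},
            {"role": "assistant", "content": sample["answer"]},
        ]
    }
```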

3. Dataset Concatenation

Following ProLong, we concatenate the datasets to a fixed 64k context length and train with document masking.

3.1 Concatenating Document Q&A Datasets Only

We tokenize LongMagpie_multidoc_longcontext_dataset and concatenate it to a length of 64k (refer to build_sft_data.py), so training can use document masking. This dataset can be used to achieve the best long-context performance.

python longmagpie/build_sft_data.py
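The sketch below illustrates the packing step under simple assumptions: tokenized SFT examples are greedily concatenated into 64k-token sequences while per-example boundaries are recorded so attention can later be restricted to each document (document masking). build_sft_data.py is the authoritative implementation.

```python
MAX_LEN = 64 * 1024  # target packed sequence length

def pack_to_64k(tokenized_examples):
    """Greedy packing sketch: tokenized_examples is an iterable of token-id
    lists (one SFT dialogue each); boundaries feed the document mask."""
    packed, buffer, boundaries = [], [], []
    for ids in tokenized_examples:
        if buffer and len(buffer) + len(ids) > MAX_LEN:
            packed.append({"input_ids": buffer, "doc_boundaries": boundaries})
            buffer, boundaries = [], []
        boundaries.append((len(buffer), len(buffer) + len(ids)))
        buffer.extend(ids)
    if buffer:
        packed.append({"input_ids": buffer, "doc_boundaries": boundaries})
    return packed
```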

3.2 Using p-mix Strategy

To balance these capabilities, we introduce p-mix, an instruction data hybridization strategy. The core idea is twofold. First, to emulate the typical non-contextual start of general tasks, we sample a short-context instruction at the beginning of each training sequence. Second, we append subsequent data segments probabilistically to construct a mixed-context sequence up to length L_max: with probability P_L, a long-context instruction (generated by LongMagpie) is chosen; otherwise, with probability 1 - P_L, another short-context sample is chosen. This process repeats until the target sequence length is approached, ensuring each instance starts with a short, context-free instruction followed by a dynamically mixed sequence of long and short segments.

python longmagpie/build_sft_data_p_mix.py
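For concreteness, here is a minimal sketch of the p-mix sampling loop described above, with P_L and L_max as its two parameters; the helper below is hypothetical, and build_sft_data_p_mix.py is the authoritative implementation.

```python
import random

def build_p_mix_sequence(short_pool, long_pool, p_long=0.5, max_len=64 * 1024):
    """p-mix sketch: short_pool / long_pool are lists of tokenized instruction
    samples (lists of token ids); p_long is P_L (assumed default), max_len is L_max."""
    seq = list(random.choice(short_pool))  # always start with a short, context-free instruction
    while True:
        pool = long_pool if random.random() < p_long else short_pool  # long with prob. P_L
        nxt = random.choice(pool)
        if len(seq) + len(nxt) > max_len:  # stop when the next segment would exceed L_max
            break
        seq.extend(nxt)
    return seq
```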

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Chaochen ([email protected]) and XingWu ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!
