
LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

This repository contains the code, models and datasets for our paper [LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions].

Overview

High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
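For intuition, the snippet below is a minimal sketch of this query-elicitation step, assuming a Llama-3-style instruct model and its chat-template special tokens; the exact prompt construction used in infer_demo.py may differ.

```python
# Minimal sketch of the LongMagpie idea, assuming a Llama-3-style instruct
# model and its chat-template tokens; infer_demo.py may differ in details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # any aligned long-context LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

document = "..."  # a document sampled from fineweb-edu

# The document is followed by the special tokens that normally precede a user
# turn; stopping the template here makes the aligned model auto-regressively
# "write" a contextually relevant user query about the document.
prompt = document + "<|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
query = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

# The harvested (document, query) pair is then answered with the ordinary chat
# template to complete the (context, query, answer) instruction triple.
messages = [{"role": "user", "content": document + "\n\n" + query}]
answer_inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
answer_out = model.generate(answer_inputs, max_new_tokens=512)
answer = tok.decode(answer_out[0, answer_inputs.shape[1]:], skip_special_tokens=True)
```

Repeating this over many fineweb-edu documents yields the document-query-answer triples described above.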

LongMagpie Models

Our released models are listed below. You can load them with Hugging Face Transformers. All models are trained on long-context instruction data synthesized from fineweb-edu with the Qwen/Qwen2.5-72B-Instruct model. In the following comparison, we use Llama-3-8B-NExtLong-512K-Instruct, which is trained with Magpie instruction data, as the baseline. In addition, to maintain short-context performance, we propose a p-mix strategy that combines the LongMagpie and UltraChat datasets, resulting in the performance-balanced model Llama-3-8B-LongMagpie-p-mix-512K-Instruct.

The performance on HELMET and RULER

| Model | RULER Avg. | HELMET Avg. | HELMET Recall | HELMET RAG | HELMET ICL | HELMET Re-rank | HELMET LongQA |
|---|---|---|---|---|---|---|---|
| Llama-3-8B-NExtLong-512K-Instruct | 88.00 | 59.92 | 98.63 | 62.70 | 81.00 | 26.41 | 30.89 |
| Llama-3-8B-LongMagpie-512K-Instruct | 91.17 | 62.10 | 97.53 | 63.37 | 85.84 | 28.60 | 35.16 |
| Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 89.70 | 62.11 | 95.96 | 64.17 | 85.12 | 29.61 | 35.71 |

The performance on Longbench V2

| Model | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|---|---|---|---|---|---|---|
| Llama-3-8B-NExtLong-512K-Instruct | 30.8 | 33.9 | 28.9 | 37.8 | 27.4 | 25.9 |
| Llama-3-8B-LongMagpie-512K-Instruct | 34.4 | 38.5 | 31.8 | 41.7 | 33.0 | 25.0 |
| Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 33.0 | 35.9 | 31.2 | 37.2 | 34.9 | 22.2 |

The performance on Short-context Benchmarks

| Model | Avg. | HellaSwag | LAMBADA | ARC-C | ARC-E | PIQA | WinoGrande | LogiQA | MMLU |
|---|---|---|---|---|---|---|---|---|---|
| Meta-Llama-3-8B-Instruct | 0.6332 | 0.5773 | 0.7171 | 0.5316 | 0.8165 | 0.7889 | 0.7198 | 0.2765 | 0.6376 |
| Llama-3-8B-NExtLong-512K-Instruct | 0.6410 | 0.5953 | 0.7242 | 0.5188 | 0.8224 | 0.8079 | 0.7324 | 0.3041 | 0.6232 |
| Llama-3-8B-LongMagpie-512K-Instruct | 0.6237 | 0.5803 | 0.7025 | 0.4804 | 0.8047 | 0.7938 | 0.7293 | 0.2780 | 0.6209 |
| Llama-3-8B-LongMagpie-p-mix-512K-Instruct | 0.6410 | 0.5893 | 0.7355 | 0.5282 | 0.8279 | 0.8052 | 0.7340 | 0.2842 | 0.6236 |

LongMagpie Datasets

Datasets list

Our released datasets are listed below. All datasets are synthesized from the short-text dataset fineweb-edu.

| Dataset | Description |
|---|---|
| LongMagpie_singledoc_longcontext_dataset | Our 450k synthesized raw samples (refer to infer_demo.py). Each line contains a context extracted from fineweb-edu, a query generated by LongMagpie, and an answer. |
| LongMagpie_multidoc_longcontext_dataset | Built from LongMagpie_singledoc_longcontext_dataset with the MultiDoc method (refer to multidoc_format.py), which extends the context length and converts the data into SFT dialogue format. |
| LongMagpie_64k_dataset | LongMagpie_multidoc_longcontext_dataset tokenized and concatenated to a length of 64k (refer to the concat script), so training can use document masking. Use this dataset for the best long-context performance. |
| LongMagpie_p-mix_64k_dataset | To maintain short-context performance, LongMagpie_multidoc_longcontext_dataset is tokenized, mixed with UltraChat using the p-mix strategy, and concatenated to a length of 64k (refer to p-mix.py). Use this dataset for balanced long- and short-context performance. |
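As a usage note, the released data can be loaded with the Hugging Face `datasets` library; the repository ids below assume the datasets are hosted under the `caskcsg` organization, so adjust them if the actual paths differ.

```python
# Hypothetical loading example; the repo ids assume the datasets live under
# the caskcsg organization on the Hugging Face Hub -- adjust if they differ.
from datasets import load_dataset

sft_64k = load_dataset("caskcsg/LongMagpie_64k_dataset", split="train")
p_mix_64k = load_dataset("caskcsg/LongMagpie_p-mix_64k_dataset", split="train")

print(sft_64k)             # dataset size and features
print(sft_64k[0].keys())   # fields of one packed 64k training example
```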

Train Llama-3-8B-LongMagpie-512K-Instruct

Requirements

Run the following commands to install the dependencies and then train the model.

pip install -r requirements.txt

Train

bash train_sft.sh

Evaluation

Refer to the HELMET, RULER, and Longbench v2 benchmarks to evaluate the instruct models.

Build your long-context instruction data

1. Synthesizing Single-Document Q&A Data

Refer to infer_demo.py. Each line of the generated data contains a context extracted from fineweb-edu, a query generated by LongMagpie, and an answer.

python longmagpie/infer_demo.py
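Each output line can then be parsed as JSON; the file name and field names used below (`context`, `query`, `answer`) mirror the description above but are assumptions for illustration, so check the keys actually written by infer_demo.py.

```python
import json

# Assumed file name and field names for illustration; verify against the
# actual output of infer_demo.py.
with open("longmagpie_singledoc.jsonl") as f:
    for line in f:
        record = json.loads(line)
        context, query, answer = record["context"], record["query"], record["answer"]
        print(query[:80], "->", answer[:80])
        break  # inspect only the first record
```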

2. Synthesizing Multi-Document Q&A Data

Based on LongMagpie_singledoc_longcontext_dataset, we use the MultiDoc method (refer to multidoc_format.py) to extend the context length and transform the data into SFT dialogue format.

python longmagpie/multidoc_format.py
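As a rough illustration of the MultiDoc idea (not the exact logic of multidoc_format.py), the sketch below pads each single-document sample with distractor documents and rewrites it as an SFT-style dialogue; the field names are assumptions carried over from the single-document format.

```python
import random

def to_multidoc(sample, distractor_pool, num_distractors=4):
    """Rough MultiDoc sketch: mix the gold document with distractors to
    lengthen the context (assumed fields: context / query / answer)."""
    docs = random.sample(distractor_pool, num_distractors) + [sample["context"]]
    random.shuffle(docs)  # hide the gold document among the distractors
    long_context = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    return {
        "messages": [
            {"role": "user", "content": long_context + "\n\n" + sample["query"]},
            {"role": "assistant", "content": sample["answer"]},
        ]
    }
```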

3. Dataset Concatenation

Following ProLong, we concatenate the datasets to a fixed 64k context length and train with document masking.

3.1 Concatenating Document Q&A Datasets Only

We tokenize LongMagpie_multidoc_longcontext_dataset and concatenate it to a length of 64k (refer to build_sft_data.py), so training can use document masking. This dataset can be used to achieve the best long-context performance.

python longmagpie/build_sft_data.py
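The sketch below illustrates the packing step under simple assumptions: tokenized SFT examples are greedily concatenated into 64k-token sequences while per-example boundaries are recorded so attention can later be restricted to each document (document masking). build_sft_data.py is the authoritative implementation.

```python
MAX_LEN = 64 * 1024  # target packed sequence length

def pack_to_64k(tokenized_examples):
    """Greedy packing sketch: tokenized_examples is an iterable of token-id
    lists (one SFT dialogue each); boundaries feed the document mask."""
    packed, buffer, boundaries = [], [], []
    for ids in tokenized_examples:
        if buffer and len(buffer) + len(ids) > MAX_LEN:
            packed.append({"input_ids": buffer, "doc_boundaries": boundaries})
            buffer, boundaries = [], []
        boundaries.append((len(buffer), len(buffer) + len(ids)))
        buffer.extend(ids)
    if buffer:
        packed.append({"input_ids": buffer, "doc_boundaries": boundaries})
    return packed
```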

3.2 Using p-mix Strategy

To balance these capabilities, we introduce p-mix, an instruction data hybridization strategy. The core idea is twofold. First, to emulate the typical non-contextual start of general tasks, we sample a short-context instruction at the beginning of each training sequence. Second, we append subsequent data segments probabilistically to construct a mixed-context sequence up to length L_max: with probability P_L, a long-context instruction (generated by LongMagpie) is chosen; otherwise, with probability 1 - P_L, another short-context sample is chosen. This process repeats until the target sequence length is approached, ensuring each instance starts with a short, context-free instruction followed by a dynamically mixed sequence of long and short segments.

python longmagpie/build_sft_data_p_mix.py
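For concreteness, here is a minimal sketch of the p-mix sampling loop described above, with P_L and L_max as its two parameters; the helper below is hypothetical, and build_sft_data_p_mix.py is the authoritative implementation.

```python
import random

def build_p_mix_sequence(short_pool, long_pool, p_long=0.5, max_len=64 * 1024):
    """p-mix sketch: short_pool / long_pool are lists of tokenized instruction
    samples (lists of token ids); p_long is P_L (assumed default), max_len is L_max."""
    seq = list(random.choice(short_pool))  # always start with a short, context-free instruction
    while True:
        pool = long_pool if random.random() < p_long else short_pool  # long with prob. P_L
        nxt = random.choice(pool)
        if len(seq) + len(nxt) > max_len:  # stop when the next segment would exceed L_max
            break
        seq.extend(nxt)
    return seq
```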

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Chaochen ([email protected]) and XingWu ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!
