AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Abstract
Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs and often introduces domain-specific biases. Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method that requires no supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model on 1T training tokens with WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models, offering a promising and scalable path for reasoning-centric data selection.
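To make the scoring idea in the abstract concrete, here is a minimal sketch of AttentionInfluence-style data scoring. It is not the authors' implementation: GPT-2 is used only as a stand-in reference model because its Hugging Face implementation accepts a `head_mask` argument, the `important_heads` list is a hypothetical placeholder for heads identified as retrieval heads, and the tiny `corpus` is illustrative. The score follows the abstract's description of comparing the loss with and without the selected heads masked.

```python
# Sketch of AttentionInfluence-style scoring: rank samples by how much the
# LM loss increases when "important" attention heads are masked out.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Hypothetical (layer, head) pairs assumed to have been identified beforehand
# as retrieval heads on this reference model.
important_heads = [(3, 2), (5, 7), (9, 0)]

# Build a head mask: 1.0 keeps a head active, 0.0 zeroes its attention weights.
num_layers, num_heads = model.config.n_layer, model.config.n_head
head_mask = torch.ones(num_layers, num_heads, device=device)
for layer, head in important_heads:
    head_mask[layer, head] = 0.0

@torch.no_grad()
def lm_loss(text: str, mask=None) -> float:
    """Token-averaged causal LM loss, optionally with some heads masked."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
    return model(ids, labels=ids, head_mask=mask).loss.item()

def attention_influence_score(text: str) -> float:
    """Loss increase caused by disabling the important heads; larger values
    suggest the sample relies more on those heads under this proxy."""
    return lm_loss(text, mask=head_mask) - lm_loss(text)

corpus = [
    "Step 1: compute 17 * 24 = 408. Step 2: divide by 8 to get 51.",
    "The weather was pleasant and the park was quiet.",
]
ranked = sorted(corpus, key=attention_influence_score, reverse=True)
# Keep the top-scoring fraction of the corpus as the selected subset.
```

In practice one would run this scoring over the full pretraining corpus with a 1.3B-scale reference model and retain the highest-scoring documents; whether to normalize the difference by the base loss is a design choice not specified in the abstract.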
Community
AttentionInfluence: A simple, training-free, zero-supervision method to select reasoning-rich pretraining data—by just masking attention heads! 🧠✨
No labels. No retraining.
Just a 1.3B model, doing magic.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SkyLadder: Better and Faster Pretraining via Context Window Scheduling (2025)
- From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models (2025)
- Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (2025)
- QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining (2025)
- MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models (2025)
- Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data (2025)
- ICon: In-Context Contribution for Automatic Data Selection (2025)