Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
Abstract
LessIsMore is a training-free sparse attention mechanism that improves efficiency and generalization in reasoning tasks by aggregating token selections from local attention heads.
Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods.
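To make the unified cross-head selection concrete, here is a minimal PyTorch sketch, not the paper's implementation: the function name `select_unified_tokens`, the tensor shapes, and the use of the newest query's attention scores are assumptions for illustration. It sums head-local attention scores into one global ranking, keeps a recency window, and returns a single token subset shared by every head.

```python
import torch

def select_unified_tokens(attn_scores: torch.Tensor,
                          top_k: int,
                          recency_window: int) -> torch.Tensor:
    """
    attn_scores: [num_heads, seq_len] attention weights of the newest query
                 over the cached tokens, one row per local attention head.
    Returns indices of a single token subset shared across all heads.
    (Illustrative sketch; shapes and names are assumptions.)
    """
    seq_len = attn_scores.size(1)

    # Aggregate head-local scores into one global ranking instead of
    # maintaining a separate top-k subset per head.
    global_scores = attn_scores.sum(dim=0)                     # [seq_len]

    # Always keep the most recent tokens (the recency window).
    recent = torch.arange(max(seq_len - recency_window, 0), seq_len,
                          device=attn_scores.device)

    # Rank the remaining tokens globally and take the top-k among them.
    global_scores[recent] = float("-inf")
    k = min(top_k, seq_len - recent.numel())
    topk = torch.topk(global_scores, k).indices

    # One unified token set, reused by every head in later decoding steps.
    return torch.sort(torch.cat([topk, recent])).values
```

In this reading, the returned indices would gather one small slice of the shared KV cache per layer, so all heads attend to the same subset rather than each head maintaining its own.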
Community
We propose LessIsMore, a training-free sparse attention method that improves the efficiency of reasoning models while maintaining accuracy. It performs accurate token selection by unifying selections across attention heads and keeps a fixed-ratio recency window to preserve both accuracy and efficiency.
Empirically, LessIsMore preserves accuracy on mainstream reasoning tasks at sparsity of up to 87.5% without extending the generation length, consistently outperforming SOTA sparse attention baselines; moreover, it achieves a 1.10× average decoding speedup vs. full attention and a 1.13× end-to-end speedup vs. the SOTA sparse attention approach (a rough token-budget sketch follows the links below).
📄 Paper: https://arxiv.org/abs/2508.07101
💻 Code: https://github.com/DerrickYLJ/LessIsMore
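For intuition on what 87.5% sparsity means in tokens, here is a back-of-the-envelope sketch. Only the 1/8 retention rate follows from the quoted sparsity; the 25% recency share and the helper name `token_budget` are illustrative assumptions, not settings from the paper.

```python
# Rough token budget implied by 87.5% sparsity; the 25% recency share and the
# helper name `token_budget` are illustrative assumptions, not paper settings.
def token_budget(context_len: int, sparsity: float = 0.875,
                 recency_ratio: float = 0.25) -> tuple[int, int]:
    """Return (globally selected tokens, recency-window tokens)."""
    retained = int(context_len * (1.0 - sparsity))   # 12.5% of the context is attended
    recent = int(retained * recency_ratio)           # fixed-ratio recency window
    return retained - recent, recent

# Example: a 16,384-token context keeps 2,048 tokens in total.
print(token_budget(16384))  # -> (1536, 512) under the assumed 25% recency ratio
```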
Really impressive work! Love how the approach keeps things simple yet manages to boost reasoning efficiency without losing accuracy -- feels very practical.