Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
Abstract
LessIsMore is a training-free sparse attention mechanism that improves efficiency and generalization in reasoning tasks by aggregating token selections from local attention heads.
Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods.
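To make the unified cross-head selection concrete, here is a minimal PyTorch sketch, not the paper's implementation: the function name `select_unified_tokens`, the tensor shapes, and the use of the newest query's attention scores are assumptions for illustration. It sums head-local attention scores into one global ranking, keeps a recency window, and returns a single token subset shared by every head.

```python
import torch

def select_unified_tokens(attn_scores: torch.Tensor,
                          top_k: int,
                          recency_window: int) -> torch.Tensor:
    """
    attn_scores: [num_heads, seq_len] attention weights of the newest query
                 over the cached tokens, one row per local attention head.
    Returns indices of a single token subset shared across all heads.
    (Illustrative sketch; shapes and names are assumptions.)
    """
    seq_len = attn_scores.size(1)

    # Aggregate head-local scores into one global ranking instead of
    # maintaining a separate top-k subset per head.
    global_scores = attn_scores.sum(dim=0)                     # [seq_len]

    # Always keep the most recent tokens (the recency window).
    recent = torch.arange(max(seq_len - recency_window, 0), seq_len,
                          device=attn_scores.device)

    # Rank the remaining tokens globally and take the top-k among them.
    global_scores[recent] = float("-inf")
    k = min(top_k, seq_len - recent.numel())
    topk = torch.topk(global_scores, k).indices

    # One unified token set, reused by every head in later decoding steps.
    return torch.sort(torch.cat([topk, recent])).values
```

In this reading, the returned indices would gather one small slice of the shared KV cache per layer, so all heads attend to the same subset rather than each head maintaining its own.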
Community
We propose LessIsMore, a training-free sparse attention method that improves the efficiency of reasoning models while maintaining accuracy. It performs accurate token selection by unifying selections across attention heads and keeps a fixed-ratio recency window to preserve both accuracy and efficiency.
Empirically, LessIsMore preserves accuracy on mainstream reasoning tasks at sparsity of up to 87.5% without extending the generation length, consistently outperforming SOTA sparse attention baselines; moreover, it achieves a 1.10× average decoding speedup vs. full attention and a 1.13× end-to-end speedup vs. the SOTA sparse attention approach (a rough token-budget sketch follows the links below).
📄 Paper: https://arxiv.org/abs/2508.07101
💻 Code: https://github.com/DerrickYLJ/LessIsMore
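For intuition on what 87.5% sparsity means in tokens, here is a back-of-the-envelope sketch. Only the 1/8 retention rate follows from the quoted sparsity; the 25% recency share and the helper name `token_budget` are illustrative assumptions, not settings from the paper.

```python
# Rough token budget implied by 87.5% sparsity; the 25% recency share and the
# helper name `token_budget` are illustrative assumptions, not paper settings.
def token_budget(context_len: int, sparsity: float = 0.875,
                 recency_ratio: float = 0.25) -> tuple[int, int]:
    """Return (globally selected tokens, recency-window tokens)."""
    retained = int(context_len * (1.0 - sparsity))   # 12.5% of the context is attended
    recent = int(retained * recency_ratio)           # fixed-ratio recency window
    return retained - recent, recent

# Example: a 16,384-token context keeps 2,048 tokens in total.
print(token_budget(16384))  # -> (1536, 512) under the assumed 25% recency ratio
```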
Really impressive work! Love how the approach keeps things simple yet manages to boost reasoning efficiency without losing accuracy -- feels very practical.