arxiv:2508.07101

Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Published on Aug 9
· Submitted by drkylj on Aug 12
Abstract

LessIsMore is a training-free sparse attention mechanism that improves efficiency and generalization in reasoning tasks by aggregating token selections from local attention heads.

AI-generated summary

Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods.
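The abstract describes unified cross-head token selection only at a high level. The snippet below is a minimal PyTorch sketch of what such a selection step could look like: per-head attention scores are aggregated into one global ranking, a recency window is always kept, and a single shared token set is returned for all heads. The function name, shapes, and the sum-over-heads aggregation rule are illustrative assumptions, not the paper's exact implementation (see the linked code for that).

```python
# Minimal sketch of unified cross-head sparse token selection with a recency
# window. Shapes, names, and the aggregation rule are assumptions for
# illustration only.
import torch

def select_unified_tokens(attn_scores: torch.Tensor, budget: int, recency: int) -> torch.Tensor:
    """Pick one shared set of past-token indices for all attention heads.

    attn_scores: [num_heads, seq_len] attention weights from the current
                 decoding step (softmax-normalized per head).
    budget:      total number of past tokens to keep (budget > recency).
    recency:     number of most recent tokens that are always kept.
    """
    num_heads, seq_len = attn_scores.shape
    # Aggregate head-specific scores into one global ranking (sum over heads).
    global_scores = attn_scores.sum(dim=0)                              # [seq_len]
    # Indices of the most recent tokens, which are kept unconditionally.
    recent = torch.arange(seq_len - recency, seq_len, device=attn_scores.device)
    # Exclude recent tokens from the ranked pool, then take the top remainder.
    global_scores[recent] = float("-inf")
    topk = torch.topk(global_scores, k=budget - recency).indices
    # One unified, sorted index set that every head (and later layer) reuses.
    return torch.sort(torch.cat([topk, recent])).values
```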

Community

Paper author · Paper submitter

We propose LessIsMore, a training-free sparse attention method that improves the efficiency of reasoning models while maintaining accuracy. It performs accurate token selection via unified attention-head selection and keeps a recency window at a fixed ratio to preserve both accuracy and efficiency.

Empirically, LessIsMore preserves accuracy on mainstream reasoning tasks at up to 87.5% sparsity without extending the generation length, consistently outperforming SOTA sparse attention baselines; moreover, we achieve a 1.10× average decoding speedup vs. full attention and a 1.13× end-to-end speedup vs. the SOTA sparse attention approach (a rough numeric illustration of this budget follows the links below).

📄 Paper: https://arxiv.org/abs/2508.07101
💻 Code: https://github.com/DerrickYLJ/LessIsMore
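As a rough numeric illustration of the 87.5% sparsity figure and the fixed-ratio recency window mentioned above, the snippet below reuses the hypothetical select_unified_tokens sketch from the abstract section. The 4096-token context, 32 heads, and the one-quarter recency ratio are assumptions chosen only to make the arithmetic concrete.

```python
# Illustrative use of the sketch above: at 87.5% sparsity only 1/8 of the past
# tokens are attended to, and a fixed fraction of that budget is reserved for
# the recency window (all concrete numbers here are assumptions).
import torch

seq_len, num_heads = 4096, 32
sparsity = 0.875
budget = int(seq_len * (1 - sparsity))          # 512 tokens kept out of 4096
recency = budget // 4                           # assumed fixed recency ratio

attn_scores = torch.softmax(torch.randn(num_heads, seq_len), dim=-1)
kept = select_unified_tokens(attn_scores, budget=budget, recency=recency)
print(kept.shape)                               # torch.Size([512])
```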

Really impressive work! Love how the approach keeps things simple yet manages to boost reasoning efficiency without losing accuracy - feels very practical.
