Overflow Prevention Enhances Long-Context Recurrent LLMs
Abstract
A recent trend in LLMs is the development of recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained on extended contexts, they still underutilize long contexts. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and is effective on many long-context tasks: on LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results on the challenging LongBench v2 benchmark, where it is competitive with Transformers of equivalent size. Furthermore, our findings raise the question of whether recurrent models genuinely exploit long-range dependencies, since our single-chunk strategy delivers stronger performance even on tasks that presumably require cross-context relations.
Community
OPRM (Overflow Prevention for Recurrent Models) is a training-free inference method for long-context recurrent LLMs. By mitigating recurrent memory overflows, OPRM ensures reliable inference, leading to significant gains on both synthetic and real-world long-context tasks. In addition, OPRM naturally performs context extension, allowing the model to handle sequences far longer than those it was originally trained on, all while being faster than vanilla inference and requiring a surprisingly small memory footprint. A sketch of the chunked inference procedure is given after the links below.
Code: https://github.com/assafbk/OPRM
arXiv: https://arxiv.org/abs/2505.07793
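For illustration, here is a minimal sketch of what chunk-based overflow-prevention inference can look like on top of Hugging Face transformers. The `oprm_generate` helper, the default chunk size, and the entropy-based chunk-selection criterion are assumptions made for this sketch, not the paper's exact implementation; consult the repository above for the real thing.

```python
# Hedged sketch of chunk-based inference for a recurrent LLM:
# split the long context into chunks, score each chunk together with the
# query, then decode only from the most promising chunk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "tiiuae/falcon-mamba-7b-instruct"  # any recurrent LLM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def oprm_generate(context: str, query: str,
                  chunk_tokens: int = 2048, max_new_tokens: int = 128) -> str:
    """Hypothetical helper: chunked prefill + confidence-based chunk selection.

    The selection rule below (lowest entropy of the first answer-token
    distribution) is an assumption for illustration.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids[0]
    query_ids = tokenizer(query, return_tensors="pt").input_ids[0]
    chunks = [ctx_ids[i:i + chunk_tokens]
              for i in range(0, len(ctx_ids), chunk_tokens)]

    best_prompt, best_entropy = None, float("inf")
    for chunk in chunks:
        # Speculative prefill: each chunk is processed independently with the
        # query, so the recurrent state never absorbs more than one chunk.
        prompt_ids = torch.cat([chunk, query_ids])
        with torch.no_grad():
            logits = model(prompt_ids.unsqueeze(0).to(model.device)).logits[0, -1]
        probs = torch.softmax(logits.float(), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        if entropy < best_entropy:  # lower entropy ~ more confident chunk
            best_entropy, best_prompt = entropy, prompt_ids

    out = model.generate(best_prompt.unsqueeze(0).to(model.device),
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, best_prompt.shape[0]:],
                            skip_special_tokens=True)
```

Because each chunk is prefilled independently, the per-chunk passes can run in parallel and no single recurrent state has to compress the entire input, which is also why this style of inference can handle sequences longer than the model's training length.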