Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Abstract
This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
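To make the mechanism concrete, here is a minimal single-head PyTorch sketch of the idea described above: each segment blends standard masked local attention with a read from a fixed-size compressive memory via a learned sigmoid gate, then updates that memory with a linear-attention-style rule. The function names, shapes, and exact update form are illustrative assumptions for this sketch, not the authors' released implementation.

import torch
import torch.nn.functional as F

def elu_plus_one(x):
    # Nonlinearity commonly used for linear attention; an assumption here.
    return F.elu(x) + 1.0

def infini_attention_segment(q, k, v, memory, z, beta, causal_mask):
    """One Infini-attention step for a single head on one segment (sketch).
    q, k, v: (seg_len, d) projections for the current segment
    memory:  (d, d) compressive memory carried over from earlier segments
    z:       (d,)   normalization term carried with the memory
    beta:    learnable gating scalar (tensor with one element)
    """
    # 1) Retrieve long-term context from the compressive memory.
    sig_q = elu_plus_one(q)                                        # (seg_len, d)
    a_mem = (sig_q @ memory) / (sig_q @ z).clamp(min=1e-6).unsqueeze(-1)

    # 2) Standard masked (local) dot-product attention within the segment.
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    a_local = scores.softmax(dim=-1) @ v                           # (seg_len, d)

    # 3) Blend long-term and local attention with a learned sigmoid gate.
    g = torch.sigmoid(beta)
    out = g * a_mem + (1.0 - g) * a_local

    # 4) Update the compressive memory and its normalization term.
    sig_k = elu_plus_one(k)
    memory = memory + sig_k.T @ v                                  # (d, d)
    z = z + sig_k.sum(dim=0)                                       # (d,)
    return out, memory, z

# Illustrative usage: stream segments of a long sequence while the memory
# stays a fixed d x d matrix, i.e. bounded regardless of total length.
seg_len, d = 16, 64
memory, z = torch.zeros(d, d), torch.zeros(d)
beta = torch.zeros(1)  # would be a trained parameter in practice
mask = torch.ones(seg_len, seg_len).tril().bool()
for _ in range(4):  # four segments of a longer stream
    q = k = v = torch.randn(seg_len, d)
    out, memory, z = infini_attention_segment(q, k, v, memory, z, beta, mask)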
Community
Awesome work! Any chance of publishing the code too?
Excellent work! I'm curious, is the gating scalar β the only additional parameter that requires training?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Long-Context Language Modeling with Parallel Context Encoding (2024)
- LongHeads: Multi-Head Attention is Secretly a Long Context Processor (2024)
- CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory (2024)
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (2024)
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (2024)
I'm working on a PyTorch implementation. Come join me in the repo if you want to help!
https://github.com/jlamprou/Infini-Attention
Here's a fully working implementation repo!
https://github.com/Beomi/InfiniTransformer
(@glamprou's repo inspired me a lot! Thanks ☺️)
Llama-3 is out!
I updated my repo (https://github.com/Beomi/InfiniTransformer) to train Llama-3 with 1M seq len 🤩
An implementation of Infini-attention on Gemma 2B for 10M context - https://github.com/mustafaaljadery/gemma-2B-10M
Unlocking Infinite Context: Meet Infini-attention for Transformers!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/