You Do Not Fully Utilize Transformer's Representation Capacity
Abstract
In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
Community
We introduce Layer-Integrated Memory (LIMe), a simple yet powerful modification to multi-head attention that lets the model directly attend to representations from all earlier layers. By doing so, LIMe alleviates representation collapse in Transformers and yields consistent performance gains with only minimal computational overhead.
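As a rough illustration of the mechanism described above, here is a minimal sketch (not the authors' implementation) of a multi-head attention block whose keys and values are built from a learned, per-head softmax-weighted mixture over all earlier layers' hidden states, while queries still come from the current residual stream. Names such as `LIMeStyleAttention` and `layer_router`, and details like per-head routing weights, are illustrative assumptions; the paper's actual lookup mechanisms may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LIMeStyleAttention(nn.Module):
    """Sketch: attention over a learned per-head mixture of earlier layers' states."""

    def __init__(self, d_model: int, n_heads: int, n_prev_layers: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Per-head key/value projections applied to the mixed memory stream.
        self.k_proj = nn.Parameter(torch.randn(n_heads, d_model, self.d_head) * d_model**-0.5)
        self.v_proj = nn.Parameter(torch.randn(n_heads, d_model, self.d_head) * d_model**-0.5)
        # One routing logit per (head, earlier layer); softmax over layers (hypothetical name).
        self.layer_router = nn.Parameter(torch.zeros(n_heads, n_prev_layers))

    def forward(self, x: torch.Tensor, prev_states: list[torch.Tensor]) -> torch.Tensor:
        # x: (B, T, D) current layer input.
        # prev_states: hidden states of all earlier layers, each (B, T, D).
        B, T, _ = x.shape
        stacked = torch.stack(prev_states, dim=1)                # (B, L, T, D)
        mix = F.softmax(self.layer_router, dim=-1)               # (H, L)
        mixed = torch.einsum("hl,bltd->bhtd", mix, stacked)      # (B, H, T, D)

        # Queries from the current stream; keys/values from the mixed memory.
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)  # (B, H, T, Dh)
        k = torch.einsum("bhtd,hde->bhte", mixed, self.k_proj)   # (B, H, T, Dh)
        v = torch.einsum("bhtd,hde->bhte", mixed, self.v_proj)   # (B, H, T, Dh)

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)


# Toy usage: a block attending over the embeddings and two earlier layer outputs.
if __name__ == "__main__":
    B, T, D = 2, 16, 64
    prev = [torch.randn(B, T, D) for _ in range(3)]
    attn = LIMeStyleAttention(d_model=D, n_heads=4, n_prev_layers=3)
    print(attn(prev[-1], prev).shape)  # torch.Size([2, 16, 64])
```

Because the mixture is taken over hidden states that already exist in the forward pass, this kind of layer lookup keeps the memory footprint essentially unchanged and adds only the small routing and projection cost per head.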
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Layer by Layer: Uncovering Hidden Representations in Language Models (2025)
- On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach (2025)
- Masked Generative Nested Transformers with Decode Time Scaling (2025)
- Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction (2024)
- A Unified Perspective on the Dynamics of Deep Transformers (2025)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (2025)
- SAFR: Neuron Redistribution for Interpretability (2025)