You Do Not Fully Utilize Transformer's Representation Capacity
Abstract
In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
Community
We introduce Layer-Integrated Memory (LIMe), a simple yet powerful modification to multi-head attention that lets the model directly attend to representations from all earlier layers. By doing so, LIMe alleviates representation collapse in Transformers and yields consistent performance gains with only minimal computational overhead.
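As a rough illustration of the mechanism described above, here is a minimal sketch (not the authors' implementation) of a multi-head attention block whose keys and values are built from a learned, per-head softmax-weighted mixture over all earlier layers' hidden states, while queries still come from the current residual stream. Names such as `LIMeStyleAttention` and `layer_router`, and details like per-head routing weights, are illustrative assumptions; the paper's actual lookup mechanisms may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LIMeStyleAttention(nn.Module):
    """Sketch: attention over a learned per-head mixture of earlier layers' states."""

    def __init__(self, d_model: int, n_heads: int, n_prev_layers: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Per-head key/value projections applied to the mixed memory stream.
        self.k_proj = nn.Parameter(torch.randn(n_heads, d_model, self.d_head) * d_model**-0.5)
        self.v_proj = nn.Parameter(torch.randn(n_heads, d_model, self.d_head) * d_model**-0.5)
        # One routing logit per (head, earlier layer); softmax over layers (hypothetical name).
        self.layer_router = nn.Parameter(torch.zeros(n_heads, n_prev_layers))

    def forward(self, x: torch.Tensor, prev_states: list[torch.Tensor]) -> torch.Tensor:
        # x: (B, T, D) current layer input.
        # prev_states: hidden states of all earlier layers, each (B, T, D).
        B, T, _ = x.shape
        stacked = torch.stack(prev_states, dim=1)                # (B, L, T, D)
        mix = F.softmax(self.layer_router, dim=-1)               # (H, L)
        mixed = torch.einsum("hl,bltd->bhtd", mix, stacked)      # (B, H, T, D)

        # Queries from the current stream; keys/values from the mixed memory.
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)  # (B, H, T, Dh)
        k = torch.einsum("bhtd,hde->bhte", mixed, self.k_proj)   # (B, H, T, Dh)
        v = torch.einsum("bhtd,hde->bhte", mixed, self.v_proj)   # (B, H, T, Dh)

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)


# Toy usage: a block attending over the embeddings and two earlier layer outputs.
if __name__ == "__main__":
    B, T, D = 2, 16, 64
    prev = [torch.randn(B, T, D) for _ in range(3)]
    attn = LIMeStyleAttention(d_model=D, n_heads=4, n_prev_layers=3)
    print(attn(prev[-1], prev).shape)  # torch.Size([2, 16, 64])
```

Because the mixture is taken over hidden states that already exist in the forward pass, this kind of layer lookup keeps the memory footprint essentially unchanged and adds only the small routing and projection cost per head.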
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Layer by Layer: Uncovering Hidden Representations in Language Models (2025)
- On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach (2025)
- Masked Generative Nested Transformers with Decode Time Scaling (2025)
- Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction (2024)
- A Unified Perspective on the Dynamics of Deep Transformers (2025)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (2025)
- SAFR: Neuron Redistribution for Interpretability (2025)