arxiv:2503.24067

TransMamba: Flexibly Switching between Transformer and Mamba

Published on Mar 31
Submitted by andyyang on Apr 7
Abstract

Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advances in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains, but Mamba still struggles with unstable contextual learning and weaker multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx) and can therefore dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design a Memory Converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation happens. TransPoint scheduling is also thoroughly explored for further improvements. Extensive experiments demonstrate that TransMamba achieves superior training efficiency and performance compared to baselines and validate a deeper consistency between the Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
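
To make the shared-parameter idea concrete, below is a minimal PyTorch sketch of a layer whose projection matrices are reused as QKV in attention mode and as CBx in a simplified SSM mode. The class name `SharedDualLayer`, the Q↔C / K↔B / V↔x pairing, and the bare linear recurrence (no decay, gating, or multi-head structure) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDualLayer(nn.Module):
    """Illustrative layer whose projections serve as QKV (attention mode)
    or as CBx (SSM mode), depending on which side of the TransPoint a
    token falls. Simplified: no decay, gating, or multi-head structure."""

    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        # Shared parameter matrices, paired as Q<->C, K<->B, V<->x (assumed pairing).
        self.q_or_c = nn.Linear(d_model, d_state, bias=False)
        self.k_or_b = nn.Linear(d_model, d_state, bias=False)
        self.v_or_x = nn.Linear(d_model, d_model, bias=False)

    def attention_mode(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> causal softmax attention output.
        q, k, v = self.q_or_c(h), self.k_or_b(h), self.v_or_x(h)
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        causal = torch.triu(torch.ones(h.shape[1], h.shape[1], dtype=torch.bool, device=h.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    def ssm_mode(self, h: torch.Tensor, state: torch.Tensor):
        # h: (batch, seq, d_model), state: (batch, d_state, d_model).
        # Linear recurrence: the state accumulates outer products B_t x_t,
        # and each output reads the state with C_t.
        c, b, x = self.q_or_c(h), self.k_or_b(h), self.v_or_x(h)
        outputs = []
        for t in range(h.shape[1]):
            state = state + b[:, t].unsqueeze(-1) * x[:, t].unsqueeze(-2)
            outputs.append(torch.einsum("bs,bsd->bd", c[:, t], state))
        return torch.stack(outputs, dim=1), state
```

Because the same weights back both modes, a sequence can be split at a TransPoint: tokens before it go through `attention_mode`, tokens after it through `ssm_mode`, which is the switching behavior the abstract describes.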

Community

Paper author · Paper submitter

This paper proposes a novel framework called TransMamba, which unifies the Transformer and Mamba architectures by sharing parameter matrices (e.g., QKV and CBx). This allows the model to dynamically switch between attention and SSM mechanisms at different token lengths and layers.
To implement this idea, the paper designs a Memory Converter that keeps information consistent when transforming from Transformer to Mamba; a rough sketch of what such a conversion might look like is given below. It also introduces the concept of a TransPoint, which determines at which position the model switches from operating as a Transformer to operating as Mamba.
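
As a rough illustration of what such a converter might compute, the sketch below folds an attention prefix's keys and values into a single SSM-style state by summing outer products, in the spirit of the linear-attention view of SSM recurrences. This is an assumed simplification, not the paper's exact Memory Converter, and the function name `memory_converter` is hypothetical.

```python
import torch

def memory_converter(k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Summarize an attention prefix into an SSM-compatible state (assumed form).

    Simplification: the state is the sum over prefix tokens of the outer
    product k_t v_t^T, with decay/selectivity terms omitted.
      k: (batch, prefix_len, d_state)
      v: (batch, prefix_len, d_model)
    Returns a state of shape (batch, d_state, d_model) that can seed the
    SSM recurrence for the tokens after the TransPoint.
    """
    return torch.einsum("bts,btd->bsd", k, v)
```

Carrying this state into the SSM recurrence means no prefix information is dropped at the switch, which is the "seamless information flow" the authors aim for.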

Key Findings:

  • Memory Converter: ensures information consistency during the transformation from Transformer to Mamba.
  • TransPoint scheduling strategy: determines when to switch between attention and SSM at different layers and token lengths. We designed a schedule that makes TransMamba both effective and efficient (an illustrative schedule is sketched after this list).
  • Even when trained with one TransPoint schedule, the model still works when performing inference with a different schedule, which indicates a genuine consistency between the Transformer and Mamba paradigms.
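
The following is a minimal sketch of what a per-layer TransPoint schedule could look like, together with a usage example mirroring the third finding (train with one schedule, infer with another). The heuristic of giving earlier layers a longer attention prefix and the function name `transpoint_schedule` are assumptions for illustration; the paper selects its schedules empirically.

```python
def transpoint_schedule(num_layers: int, seq_len: int, attn_fraction: float = 0.25) -> list[int]:
    """Return, for each layer, the token index at which that layer switches
    from attention to the SSM. Illustrative heuristic: the attention prefix
    shrinks linearly with depth."""
    points = []
    for layer in range(num_layers):
        frac = attn_fraction * (1.0 - layer / max(num_layers - 1, 1))
        points.append(int(seq_len * frac))
    return points

# Train with one schedule...
train_points = transpoint_schedule(num_layers=24, seq_len=2048, attn_fraction=0.25)
# ...and, per the finding above, run inference with a different one.
infer_points = transpoint_schedule(num_layers=24, seq_len=8192, attn_fraction=0.1)
```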

