|
--- |
|
library_name: transformers |
|
license: mit |
|
pipeline_tag: image-text-to-text |
|
--- |
|
|
|
|
# MaTVLM Model Card |
|
|
|
## Introduction |
|
With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. |
|
However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. |
|
In this work, we present MaTVLM, a hybrid model built by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with the corresponding attention weights to accelerate convergence. We then employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of the differential distillation loss within our training framework.
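The attention-to-Mamba-2 initialization above can be sketched as follows. This is an illustrative sketch, not the official implementation: the layer names, dimensions, and the exact Q→C, K→B, V→x correspondence (motivated by the attention/SSD duality) are assumptions for demonstration.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: initialize a Mamba-2 layer's input projections from a
# pre-trained attention layer's Q/K/V weights, so the hybrid layer starts
# close to the attention layer it replaces.
d_model = 512

# Stand-ins for the pre-trained VLM's attention projections.
attn_q = nn.Linear(d_model, d_model, bias=False)
attn_k = nn.Linear(d_model, d_model, bias=False)
attn_v = nn.Linear(d_model, d_model, bias=False)

# Mamba-2 projections of matching shape (illustrative layer layout).
mamba_C = nn.Linear(d_model, d_model, bias=False)
mamba_B = nn.Linear(d_model, d_model, bias=False)
mamba_x = nn.Linear(d_model, d_model, bias=False)

# Copy the attention weights in as the initialization.
with torch.no_grad():
    mamba_C.weight.copy_(attn_q.weight)  # Q -> C projection
    mamba_B.weight.copy_(attn_k.weight)  # K -> B projection
    mamba_x.weight.copy_(attn_v.weight)  # V -> x projection
```

Starting from the attention weights, rather than random initialization, is what lets the subsequent distillation converge in a single stage.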
|
We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to $3.6\times$ faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance.
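The single-stage distillation described above can be sketched as a standard logit-distillation objective. The temperature, weighting, and loss combination below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hypothetical sketch: KL divergence to the frozen teacher's soft
    targets, combined with cross-entropy on the ground-truth tokens."""
    # Soft targets from the teacher, smoothed by the temperature.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard-label cross-entropy on the ground-truth next tokens.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Toy example: 4 token positions over a 10-token vocabulary.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
```

If the student's logits exactly match the teacher's, the KL term vanishes, so a pure-distillation loss (`alpha=1.0`) goes to zero.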
|
|
|
Paper: [https://arxiv.org/abs/2503.13440](https://arxiv.org/abs/2503.13440) |
|
|
|
Code: [https://github.com/hustvl/MaTVLM](https://github.com/hustvl/MaTVLM) |
|
|
|
## Citation |
|
If you find MaTVLM useful in your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry.
|
|
|
```bibtex |
|
@misc{li2025matvlmhybridmambatransformerefficient, |
|
title={MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling}, |
|
author={Yingyue Li and Bencheng Liao and Wenyu Liu and Xinggang Wang}, |
|
year={2025}, |
|
eprint={2503.13440}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2503.13440}, |
|
} |
|
``` |
|