No changes needed #1
opened by nielsr (HF staff)

README.md CHANGED
@@ -1,15 +1,18 @@
 ---
+library_name: transformers
 license: mit
 pipeline_tag: image-text-to-text
-library_name: transformers
 ---
+
+```markdown
 # MaTVLM Model Card
 
 ## Introduction
 With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers.
 However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks.
 In this work, we present a hybrid model MaTVLM
-by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework.
+by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the
+MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework.
 We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to $3.6\times$ faster inference than the teacher model while reducing GPU memory consumption by 27.5\%, all without compromising performance.
 
 Paper: [https://arxiv.org/abs/2503.13440](https://arxiv.org/abs/2503.13440)
@@ -29,4 +32,5 @@ If you find MaTVLM is useful in your research or applications, please consider g
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2503.13440},
 }
+```
 ```
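The front-matter metadata (`pipeline_tag: image-text-to-text`, `library_name: transformers`) implies the checkpoint is meant to be loaded through the transformers image-text-to-text pipeline. Below is a minimal sketch under that assumption; the repo id is a placeholder, and `trust_remote_code=True` is assumed only because the hybrid Mamba-2/transformer layers may ship as custom modeling code.

```python
from transformers import pipeline

# Placeholder repo id for the MaTVLM checkpoint on the Hub; substitute the real one.
pipe = pipeline(
    "image-text-to-text",
    model="hustvl/MaTVLM",    # assumption, not confirmed by this card
    trust_remote_code=True,   # assumption: custom hybrid architecture
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])
```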
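The card's claim that Mamba-2 layers are initialized "with corresponding attention weights" can be pictured as copying the replaced layer's projection matrices into the SSM's input and readout projections. The sketch below is one plausible mapping based on the attention/SSM correspondence, not the paper's exact recipe; every attribute name (`q_proj`, `B_proj`, etc.) is a hypothetical stand-in for whatever the real implementation exposes.

```python
import torch

@torch.no_grad()
def init_mamba2_from_attention(attn, mamba2):
    """Illustrative only: initialize a Mamba-2 layer from the attention layer it replaces.

    Assumed correspondence (a sketch, not MaTVLM's exact recipe):
      queries -> C (readout) projection, keys -> B (input-gate) projection,
      values  -> x (input) projection,   attention output proj -> Mamba-2 output proj.
    All module attributes below are hypothetical names.
    """
    mamba2.C_proj.weight.copy_(attn.q_proj.weight)
    mamba2.B_proj.weight.copy_(attn.k_proj.weight)
    mamba2.x_proj.weight.copy_(attn.v_proj.weight)
    mamba2.out_proj.weight.copy_(attn.o_proj.weight)
```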
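The single-stage distillation described in the introduction is a teacher-student setup: the frozen pre-trained VLM provides soft targets and the hybrid student is trained to match them alongside the usual language-modeling objective. A generic sketch follows, with illustrative temperature and weighting rather than the paper's actual hyper-parameters or its exact differential distillation loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic distillation objective (a sketch, not MaTVLM's exact loss):
    KL between the student and the teacher's temperature-softened distribution,
    plus cross-entropy on the ground-truth next tokens.
    T and alpha are illustrative values, not taken from the paper."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kl + (1.0 - alpha) * ce
```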