arxiv:2502.01341
Suyuchen Wang
sheryc
AI & ML interests
Playing with LLMs
Recent Activity
Reacted to ahmed-masry's post about 10 hours ago
Happy to announce AlignVLM, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs).
Read the paper: https://huggingface.co/papers/2502.01341
What's the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance.
Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret.
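The idea of mapping a vision feature to a weighted average of text embeddings can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the names `softmax` and `align_connector`, the single projection matrix `W`, and the toy dimensions are all hypothetical and chosen only for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_connector(vision_feature, W, text_embeddings):
    """Map one vision feature to a weighted average of text embeddings.

    vision_feature: list of d_v floats
    W: hypothetical projection with vocab_size rows of d_v floats
    text_embeddings: vocab_size rows of d_t floats (the LLM embedding table)
    """
    # One logit per vocabulary entry.
    logits = [sum(w * x for w, x in zip(row, vision_feature)) for row in W]
    # Non-negative weights that sum to 1.
    weights = softmax(logits)
    d_t = len(text_embeddings[0])
    # Convex combination of the embedding rows.
    output = [
        sum(p * emb[j] for p, emb in zip(weights, text_embeddings))
        for j in range(d_t)
    ]
    return output, weights


random.seed(0)
d_v, d_t, vocab_size = 4, 3, 5  # toy sizes, chosen for illustration only
W = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(vocab_size)]
E = [[random.gauss(0, 1) for _ in range(d_t)] for _ in range(vocab_size)]
feature = [random.gauss(0, 1) for _ in range(d_v)]

out, weights = align_connector(feature, W, E)
```

Because the weights are non-negative and sum to 1, `out` always lies inside the convex hull of the rows of `E`, which is one way to read "a space that the LLM can effectively interpret".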
How does it perform?
We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis trained under similar configurations. The results? ALIGN outperforms them all on diverse document understanding tasks.
Meet the AlignVLM Model Family!
We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
AlignVLM surpasses all Base VLMs trained under similar configurations.
Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5.
What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
ALIGN Connector: Minimal drop (1.67%), showing its high robustness!
MLP Connector: Severe degradation (25.54%), struggling with noisy inputs.
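The noise test described above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the evaluation code behind the reported numbers: `add_gaussian_noise` and `convex_connector` are hypothetical helpers, and the weights are random toy values. It shows one plausible reason a weighted-average connector may tolerate noise: its output is a convex combination of text embeddings, so every coordinate stays within the range spanned by the embedding table regardless of how large the perturbation is.

```python
import math
import random


def add_gaussian_noise(features, mean=0.0, std=3.0, rng=random):
    """Add i.i.d. Gaussian noise (mean 0, std 3 by default) to each value."""
    return [x + rng.gauss(mean, std) for x in features]


def convex_connector(feature, W, E):
    """Hypothetical connector: softmax weights over rows of E, then average."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in W]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [
        sum((e / total) * emb[j] for e, emb in zip(exps, E))
        for j in range(len(E[0]))
    ]


rng = random.Random(0)
d_v, d_t, vocab_size = 4, 3, 6  # toy sizes, chosen for illustration only
W = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(vocab_size)]
E = [[rng.gauss(0, 1) for _ in range(d_t)] for _ in range(vocab_size)]
clean = [rng.gauss(0, 1) for _ in range(d_v)]
noisy = add_gaussian_noise(clean, mean=0.0, std=3.0, rng=rng)

out_noisy = convex_connector(noisy, W, E)

# Per-coordinate bounds implied by the embedding table.
lower = [min(emb[j] for emb in E) for j in range(d_t)]
upper = [max(emb[j] for emb in E) for j in range(d_t)]
in_range = all(
    lo - 1e-9 <= v <= hi + 1e-9
    for lo, v, hi in zip(lower, out_noisy, upper)
)
```

An unconstrained MLP output has no such bound, so large input noise can push it far from anything resembling a text embedding. The sketch only demonstrates the bounded-output property; it does not reproduce the benchmark figures quoted above.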
Code & model weights coming soon! Stay tuned!
Authored a paper about 18 hours ago
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Upvoted a paper about 19 hours ago
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Organizations
models
None public yet
datasets
20
sheryc/instruct-concat-100-tokenized-llama3 • 16k • 35
sheryc/instruct-concat-100-tokenized-llama1-2 • 16k • 33
sheryc/wiki40b_it_test_1k_instances_processed • 1k • 35
sheryc/wiki40b_it_test_1k_instances_processed_keep_title • 1k • 37
sheryc/wiki40b_en_test_1k_instances_processed • 1k • 37
sheryc/wiki40b_en_test_1k_instances_processed_keep_title • 1k • 35
sheryc/wiki40b_fr_test_1k_instances_processed • 1k • 36
sheryc/wiki40b_ja_test_1k_instances_processed_keep_title • 1k • 35
sheryc/wiki40b_ko_test_1k_instances_processed_keep_title • 1k • 32
sheryc/wiki40b_fr_test_1k_instances_processed_keep_title • 1k • 33