X-Fusion: Introducing New Modality to Frozen Large Language Models
Abstract
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
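The dual-tower design described above can be illustrated with a short PyTorch sketch. The snippet below is a hypothetical reading of the abstract, not the paper's implementation: each block keeps a frozen copy of the pretrained language attention/MLP weights and adds a trainable vision-specific copy, with attention computed jointly over the mixed token sequence so text and image tokens can exchange information. All names (`DualTowerLayer`, `d_model`, `n_heads`) and the exact token-routing scheme are assumptions for illustration only.

```python
# Hypothetical sketch of a dual-tower transformer block: a frozen language tower
# plus a trainable vision tower, with joint attention over the multimodal sequence.
import torch
import torch.nn as nn


class DualTowerLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Language tower: stands in for the pretrained LLM layer; kept frozen.
        self.txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
        for p in list(self.txt_attn.parameters()) + list(self.txt_mlp.parameters()):
            p.requires_grad = False
        # Vision tower: modality-specific copy of the same layer shape; trainable.
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_image: (seq,) bool mask marking vision tokens.
        # Both towers attend over the full mixed sequence; each token's output is
        # taken from the tower that matches its modality.
        txt_out, _ = self.txt_attn(x, x, x)
        img_out, _ = self.img_attn(x, x, x)
        mask = is_image.view(1, -1, 1)
        h = x + torch.where(mask, img_out, txt_out)
        h = h + torch.where(mask, self.img_mlp(h), self.txt_mlp(h))
        return h


if __name__ == "__main__":
    layer = DualTowerLayer()
    tokens = torch.randn(2, 10, 512)                    # e.g. 6 text + 4 image tokens
    is_image = torch.tensor([False] * 6 + [True] * 4)
    print(layer(tokens, is_image).shape)                # torch.Size([2, 10, 512])
```

Because only the vision-tower parameters receive gradients in this sketch, the language tower's behavior on pure text is unchanged, which is the property the abstract emphasizes (preserving language capabilities while adding a new modality).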
Community
The following related papers were recommended by the Semantic Scholar API:
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (2025)
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)
- FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding (2025)
- MMRL: Multi-Modal Representation Learning for Vision-Language Models (2025)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering (2025)
- Transfer between Modalities with MetaQueries (2025)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation (2025)