un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Abstract
A generative model framework, unCLIP, is inverted to improve CLIP's ability to capture detailed visual information while maintaining text alignment.
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving this goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding; in other words, it inverts the CLIP image encoder. Compared with discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. We therefore propose to invert unCLIP (dubbed un^2CLIP) to improve the CLIP model. In this way, the improved image encoder gains unCLIP's visual detail capturing ability while simultaneously preserving its alignment with the original text encoder. We evaluate the improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un^2CLIP significantly outperforms the original CLIP and previous CLIP-improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
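To make the inversion idea concrete, below is a minimal PyTorch sketch of one possible fine-tuning step: the CLIP image encoder is trained through a frozen unCLIP generator so that its embedding supports reconstructing the image, while an alignment term keeps the new embeddings close to the original CLIP embedding space. The module interfaces (`clip_image_encoder`, `unclip_generator(noisy_images, t, cond=z)`), the single-noise-level stand-in for the full diffusion objective, and the weight `lam` are illustrative assumptions, not the authors' released implementation.

```python
# Conceptual sketch of an un^2CLIP-style training step (assumptions noted above).
import torch
import torch.nn.functional as F


def un2clip_step(images, clip_image_encoder, frozen_clip_image_encoder,
                 unclip_generator, optimizer, lam=1.0):
    """One fine-tuning step: train the CLIP image encoder so that its embedding
    lets the frozen unCLIP generator reconstruct the image (detail term),
    while staying close to the original CLIP embedding space (alignment term)."""
    # Embed the images with the trainable encoder.
    z = clip_image_encoder(images)

    # Detail term: a simplified denoising loss through the frozen unCLIP
    # generator conditioned on z. Real unCLIP uses a full diffusion objective;
    # here a single randomly sampled noise level stands in for it.
    noise = torch.randn_like(images)
    t = torch.rand(images.shape[0], device=images.device)   # noise level in [0, 1]
    alpha = (1.0 - t).view(-1, 1, 1, 1)
    noisy_images = alpha.sqrt() * images + (1.0 - alpha).sqrt() * noise
    pred_noise = unclip_generator(noisy_images, t, cond=z)  # generator stays frozen
    detail_loss = F.mse_loss(pred_noise, noise)

    # Alignment term: keep the new embeddings close to those of the frozen
    # original encoder, so alignment with the CLIP text encoder is preserved.
    with torch.no_grad():
        z_orig = frozen_clip_image_encoder(images)
    align_loss = 1.0 - F.cosine_similarity(z, z_orig, dim=-1).mean()

    loss = detail_loss + lam * align_loss
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the frozen generator into the encoder
    optimizer.step()
    return loss.item()
```

Only the image encoder receives optimizer updates; the unCLIP generator and the original encoder act as fixed references, which is what lets the encoder absorb the generator's detail-capturing ability without drifting away from the shared image-text space.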
Community
A work that improves CLIP's visual detail capturing ability by inverting the unCLIP generative model, which itself inverts the CLIP image encoder.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Implicit Inversion turns CLIP into a Decoder (2025)
- Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis (2025)
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation (2025)
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception (2025)
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation (2025)
- Decoupled Global-Local Alignment for Improving Compositional Understanding (2025)
- FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations (2025)
Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 0