Abstract
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
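The abstract describes the objective only at a high level: a vision encoder is paired with a multimodal decoder that autoregressively predicts raw image patches and caption tokens. The sketch below is one illustrative way such an objective could be wired up in PyTorch. It is not the authors' implementation; all module names, dimensions, the simple causal mask, the unit loss weighting, and the patch/text ordering are assumptions, and details such as the encoder's attention masking and patch normalization differ in the actual paper.

```python
# Illustrative sketch (assumptions throughout, not the AIMV2 reference code):
# a bidirectional vision trunk encodes image patches, then a causal multimodal
# decoder predicts the next raw patch (regression) and the next caption token
# (cross-entropy) from the concatenated [image; text] sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIMV2StyleSketch(nn.Module):
    def __init__(self, patch_dim=768, d_model=1024, vocab_size=32000,
                 n_enc_layers=24, n_dec_layers=12, n_heads=16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)   # vision trunk
        self.text_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_dec_layers)   # causal multimodal decoder
        self.pixel_head = nn.Linear(d_model, patch_dim)                 # regresses raw patch values
        self.text_head = nn.Linear(d_model, vocab_size)                 # predicts next text token

    def forward(self, patches, text_ids):
        # patches: (B, N, patch_dim) flattened raw image patches
        # text_ids: (B, T) caption token ids
        img_feats = self.encoder(self.patch_embed(patches))
        seq = torch.cat([img_feats, self.text_embed(text_ids)], dim=1)
        L = seq.size(1)
        # standard causal (next-position) mask over the joint sequence
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1).to(seq.device)
        h = self.decoder(seq, mask=causal)
        N = patches.size(1)
        # next-patch regression: decoder position i predicts raw patch i+1
        pixel_loss = F.mse_loss(self.pixel_head(h[:, :N - 1]), patches[:, 1:])
        # next-token prediction over the caption, starting from the last image position
        logits = self.text_head(h[:, N - 1:-1])
        text_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids.reshape(-1))
        return pixel_loss + text_loss

# Usage (shapes are hypothetical):
# patches = torch.randn(2, 196, 768); text = torch.randint(0, 32000, (2, 32))
# loss = AIMV2StyleSketch()(patches, text); loss.backward()
```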
Community
This is an automated message from Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training (2024)
- Contrastive Localized Language-Image Pre-Training (2024)
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives (2024)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models (2024)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities (2024)
- RobustFormer: Noise-Robust Pre-training for images and videos (2024)
- Understanding the Benefits of SimCLR Pre-Training in Two-Layer Convolutional Neural Networks (2024)
Is there any repo or example showing how to use AIMV2 for CLIP-style tasks? Thank you!
Models citing this paper: 17
Datasets citing this paper: 0