SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published 8 days ago • 118
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution Paper • 2307.06304 • Published Jul 12, 2023 • 30
Three Towers: Flexible Contrastive Learning with Pretrained Image Models Paper • 2305.16999 • Published May 26, 2023 • 2
PaLI-X: On Scaling up a Multilingual Vision and Language Model Paper • 2305.18565 • Published May 29, 2023 • 3