PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Abstract
PictSure is an in-context learning framework that enhances few-shot image classification by optimizing embedding models' architecture, pretraining, and fine-tuning strategies to improve out-of-domain performance.
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that both training success and out-of-domain performance depend heavily on how the embedding models are pretrained. Consequently, PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
Community
TL;DR of "PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers"
The paper introduces PictSure, a vision-only in-context learning (ICL) framework for few-shot image classification (FSIC) that emphasizes the critical role of image embedding models. Unlike prior ICL methods that rely on language-supervised embeddings (like CLIP), PictSure uses purely visual features and transformer-based inference to classify images without any fine-tuning.
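To make the inference setup concrete, here is a minimal sketch of how an ICL classifier can consume a few-shot episode: each support image's embedding (from a frozen visual encoder) is concatenated with a one-hot label, and the query image is appended with an empty label slot before the sequence is passed to a transformer. The function name and token layout are illustrative assumptions, not PictSure's actual interface.

```python
import numpy as np

def build_icl_sequence(support_embs, support_labels, query_emb, num_classes):
    """Build the token sequence for an ICL transformer (hypothetical layout):
    each support embedding is concatenated with its one-hot label; the query
    embedding gets an all-zero 'unknown' label slot and goes last."""
    tokens = []
    for emb, lbl in zip(support_embs, support_labels):
        one_hot = np.zeros(num_classes)
        one_hot[lbl] = 1.0  # label conditioning for this support example
        tokens.append(np.concatenate([emb, one_hot]))
    # Query token: embedding plus a zeroed label slot to be predicted.
    tokens.append(np.concatenate([query_emb, np.zeros(num_classes)]))
    return np.stack(tokens)  # shape: (n_support + 1, emb_dim + num_classes)
```

A transformer trained on such sequences can then read off the query's class from its output position, with no gradient updates at test time.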
Key contributions include:
- A systematic analysis of how embedding architecture (ResNet vs. ViT), pretraining strategies (e.g., triplet loss), and training dynamics affect FSIC performance.
- Evidence that pretrained, frozen encoders (in particular ViTs trained with a triplet loss) generalize better, notably to out-of-domain datasets such as medical imagery.
- Results showing that PictSure, despite being significantly smaller, outperforms models like CAML on out-of-domain tasks while maintaining competitive in-domain performance.
The study highlights that embedding quality is more critical than model size or semantic alignment for generalization in low-data visual classification scenarios.
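The triplet-loss pretraining mentioned above can be sketched as follows. This is the standard triplet margin loss, not PictSure's exact training code: it pulls an anchor embedding toward a same-class positive and pushes a different-class negative at least a margin away.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors:
    penalize episodes where the positive (same class) is not at
    least `margin` closer to the anchor than the negative."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)
```

Encoders pretrained this way produce embedding spaces where class structure is expressed by distance alone, which is what a purely visual ICL classifier relies on.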