arXiv:2506.14842

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Published on Jun 16 · Submitted by cwolff on Jun 19

Abstract

AI-generated summary: PictSure is an in-context learning framework that enhances few-shot image classification by optimizing the embedding model's architecture, pretraining, and fine-tuning strategies to improve out-of-domain performance.

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that both training success and out-of-domain performance depend heavily on how the embedding models are pretrained. Consequently, PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.

Community

Paper author · Paper submitter

TL;DR of "PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers"

The paper introduces PictSure, a vision-only in-context learning (ICL) framework for few-shot image classification (FSIC) that emphasizes the critical role of image embedding models. Unlike prior ICL methods that rely on language-supervised embeddings (like CLIP), PictSure uses purely visual features and transformer-based inference to classify images without any fine-tuning.
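To make the pipeline concrete, here is a minimal PyTorch sketch of that inference flow. This is not the authors' implementation or the pictsure-library API; the class name, token layout, and hyperparameters are illustrative assumptions. It only assumes a frozen encoder that maps an image batch to fixed-size embedding vectors.

```python
import torch
import torch.nn as nn

class ICLFewShotClassifier(nn.Module):
    """Minimal sketch of ICL-based FSIC: a frozen, pretrained encoder
    embeds support and query images, and a transformer attends over
    (embedding + label) tokens to predict the query label without any
    gradient-based adaptation."""

    def __init__(self, encoder: nn.Module, emb_dim: int, num_classes: int):
        super().__init__()
        # Frozen visual encoder, assumed to map images to (N, emb_dim)
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # num_classes support labels plus one "unknown" token for the query
        self.label_emb = nn.Embedding(num_classes + 1, emb_dim)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8,
                                           batch_first=True)  # emb_dim % 8 == 0
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, support_x, support_y, query_x):
        # support_x: (B, S, C, H, W); support_y: (B, S); query_x: (B, C, H, W)
        B, S = support_y.shape
        with torch.no_grad():  # embeddings stay frozen
            sup = self.encoder(support_x.flatten(0, 1)).view(B, S, -1)
            qry = self.encoder(query_x).unsqueeze(1)
        unknown = torch.full((B, 1), self.label_emb.num_embeddings - 1,
                             dtype=torch.long, device=support_y.device)
        tokens = torch.cat([sup + self.label_emb(support_y),
                            qry + self.label_emb(unknown)], dim=1)
        return self.head(self.transformer(tokens)[:, -1])  # query logits
```

In a setup like this, only the transformer and the label embeddings are trained on episodes; at test time, new classes are handled purely in context by swapping the support set, which is what "without any fine-tuning" means here.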

Key contributions include:

  • A systematic analysis of how embedding architecture (ResNet vs. ViT), pretraining strategies (e.g., triplet loss), and training dynamics affect FSIC performance.
  • Evidence that pretrained, frozen encoders (especially ViTs pretrained with a triplet loss) generalize better, particularly to out-of-domain datasets such as medical imagery; a triplet-loss sketch follows this list.
  • Results showing that PictSure, despite being significantly smaller, outperforms models like CAML on out-of-domain tasks while maintaining competitive in-domain performance.
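The triplet objective referenced above is standard metric learning. A minimal sketch of what one such pretraining step could look like follows; the anchor/positive/negative sampling, normalization, and margin are assumptions for illustration, not the paper's exact recipe.

```python
import torch.nn.functional as F

def triplet_step(encoder, anchor, positive, negative, margin=1.0):
    """One illustrative triplet-loss pretraining step: pull embeddings
    of same-class images together and push different-class images
    apart, shaping a metric space that transfers to few-shot use."""
    za = F.normalize(encoder(anchor), dim=-1)    # anchor batch
    zp = F.normalize(encoder(positive), dim=-1)  # same class as anchors
    zn = F.normalize(encoder(negative), dim=-1)  # different class
    return F.triplet_margin_loss(za, zp, zn, margin=margin)
```

After metric pretraining of this kind, the encoder can be frozen and only the in-context transformer trained, matching the frozen-encoder finding above.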

The study highlights that embedding quality is more critical than model size or semantic alignment for generalization in low-data visual classification scenarios.
