arXiv:2506.14842

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Published on Jun 16 · Submitted by cwolff on Jun 19

Abstract

AI-generated summary: PictSure is an in-context learning framework that enhances few-shot image classification by optimizing the embedding model's architecture, pretraining, and fine-tuning strategies to improve out-of-domain performance.

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that both training success and out-of-domain performance depend heavily on how the embedding models are pretrained. Consequently, PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.

Community

Paper author · Paper submitter

TL;DR of "PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers"

The paper introduces PictSure, a vision-only in-context learning (ICL) framework for few-shot image classification (FSIC) that emphasizes the critical role of image embedding models. Unlike prior ICL methods that rely on language-supervised embeddings (like CLIP), PictSure uses purely visual features and transformer-based inference to classify images without any fine-tuning.
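To make the pipeline concrete, here is a minimal PyTorch sketch of that inference flow. This is not the authors' implementation or the pictsure-library API; the class name, token layout, and hyperparameters are illustrative assumptions. It only assumes a frozen encoder that maps an image batch to fixed-size embedding vectors.

```python
import torch
import torch.nn as nn

class ICLFewShotClassifier(nn.Module):
    """Minimal sketch of ICL-based FSIC: a frozen, pretrained encoder
    embeds support and query images, and a transformer attends over
    (embedding + label) tokens to predict the query label without any
    gradient-based adaptation."""

    def __init__(self, encoder: nn.Module, emb_dim: int, num_classes: int):
        super().__init__()
        # Frozen visual encoder, assumed to map images to (N, emb_dim)
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # num_classes support labels plus one "unknown" token for the query
        self.label_emb = nn.Embedding(num_classes + 1, emb_dim)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8,
                                           batch_first=True)  # emb_dim % 8 == 0
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, support_x, support_y, query_x):
        # support_x: (B, S, C, H, W); support_y: (B, S); query_x: (B, C, H, W)
        B, S = support_y.shape
        with torch.no_grad():  # embeddings stay frozen
            sup = self.encoder(support_x.flatten(0, 1)).view(B, S, -1)
            qry = self.encoder(query_x).unsqueeze(1)
        unknown = torch.full((B, 1), self.label_emb.num_embeddings - 1,
                             dtype=torch.long, device=support_y.device)
        tokens = torch.cat([sup + self.label_emb(support_y),
                            qry + self.label_emb(unknown)], dim=1)
        return self.head(self.transformer(tokens)[:, -1])  # query logits
```

In a setup like this, only the transformer and the label embeddings are trained on episodes; at test time, new classes are handled purely in context by swapping the support set, which is what "without any fine-tuning" means here.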

Key contributions include:

  • A systematic analysis of how embedding architecture (ResNet vs. ViT), pretraining strategies (e.g., triplet loss), and training dynamics affect FSIC performance.
  • Evidence that pretrained, frozen encoders (especially ViTs pretrained with a triplet loss) generalize better, particularly to out-of-domain datasets such as medical imagery; a triplet-loss sketch follows this list.
  • Results showing that PictSure, despite being significantly smaller, outperforms models like CAML on out-of-domain tasks while maintaining competitive in-domain performance.
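The triplet objective referenced above is standard metric learning. A minimal sketch of what one such pretraining step could look like follows; the anchor/positive/negative sampling, normalization, and margin are assumptions for illustration, not the paper's exact recipe.

```python
import torch.nn.functional as F

def triplet_step(encoder, anchor, positive, negative, margin=1.0):
    """One illustrative triplet-loss pretraining step: pull embeddings
    of same-class images together and push different-class images
    apart, shaping a metric space that transfers to few-shot use."""
    za = F.normalize(encoder(anchor), dim=-1)    # anchor batch
    zp = F.normalize(encoder(positive), dim=-1)  # same class as anchors
    zn = F.normalize(encoder(negative), dim=-1)  # different class
    return F.triplet_margin_loss(za, zp, zn, margin=margin)
```

After metric pretraining of this kind, the encoder can be frozen and only the in-context transformer trained, matching the frozen-encoder finding above.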

The study highlights that embedding quality is more critical than model size or semantic alignment for generalization in low-data visual classification scenarios.
