PictSure: Few-Shot Image Classification with Context Learning

Model Description

PictSure is a few-shot image classification model that leverages context images to make predictions on new, unseen images. The model combines pre-trained image encoders with a transformer architecture to enable effective few-shot learning from only a handful of labeled examples per class. More details can be found on our paper page.

Key Features

  • Few-shot Learning: Classify images with only a few examples per class
  • Context-Aware: Uses context images and labels to inform predictions
  • Flexible Architecture: Supports both ResNet and Vision Transformer (ViT) backbones as well as other custom backbones
  • Transformer-Based: Employs transformer encoders for sequence processing
  • Easy Integration: Simple API for setting context and making predictions

Model Architecture

PictSure consists of several key components (sketched in code after the architecture details below):

  1. Embedding Network: Pre-trained ResNet18 or custom Vision Transformer for feature extraction
  2. Projection Layers: Linear projections for image features and label embeddings
  3. Transformer Encoder: Multi-head attention mechanism with configurable heads and layers
  4. Classification Head: Final linear layer for class prediction

Architecture Details

  • Image Resolution: 224 × 224 pixels
  • Feature Dimension: 1024D (concatenated image + label projections)
  • Default Configuration:
    • ResNet model: 8 attention heads, 4 transformer layers, ~53M parameters
    • ViT model: 8 attention heads, 4 transformer layers, ~128M parameters
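
To make the data flow concrete, here is a minimal PyTorch sketch of how the four components fit together. It is illustrative only: the module names, the 512/512 split of the 1024D feature, and the use of a placeholder label for the query position are assumptions, not the released implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PictSureSketch(nn.Module):
    """Illustrative sketch only -- not the released PictSure implementation."""

    def __init__(self, num_classes=10, d_model=1024, nhead=8, num_layers=4):
        super().__init__()
        # 1. Embedding network: ImageNet pre-trained ResNet18 with the
        #    classifier removed, yielding 512D features.
        backbone = resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()
        self.embedding_net = backbone
        # 2. Projection layers: image features and labels each mapped to
        #    d_model/2, so their concatenation gives the 1024D transformer
        #    input (the even split is an assumption).
        self.img_proj = nn.Linear(512, d_model // 2)
        self.label_emb = nn.Embedding(num_classes + 1, d_model // 2)  # last index = "unknown" query label
        # 3. Transformer encoder: 8 heads, 4 layers by default.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # 4. Classification head.
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, context_imgs, context_labels, query_img):
        # context_imgs: (N, 3, 224, 224), context_labels: (N,), query_img: (1, 3, 224, 224)
        ctx = self.img_proj(self.embedding_net(context_imgs))            # (N, 512)
        ctx = torch.cat([ctx, self.label_emb(context_labels)], dim=-1)   # (N, 1024)
        qry = self.img_proj(self.embedding_net(query_img))               # (1, 512)
        unk = self.label_emb.weight[-1].expand(qry.size(0), -1)          # placeholder label
        qry = torch.cat([qry, unk], dim=-1)                              # (1, 1024)
        seq = torch.cat([ctx, qry], dim=0).unsqueeze(0)                  # (1, N+1, 1024)
        return self.head(self.encoder(seq)[:, -1])                       # query-position logits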

Intended Use

Primary Use Cases

  • Few-shot image classification in scenarios with limited labeled data
  • Meta-learning applications where rapid adaptation to new classes is required
  • Educational and research purposes in computer vision and machine learning
  • Prototyping classification systems with minimal training data

Limitations

  • Requires context images to be set before making predictions
  • Performance depends on the quality and representativeness of context examples
  • Limited to classification tasks (not suitable for detection or segmentation)
  • Input images must be resized to 224×224 pixels

How to Use

Installation

pip install torch torchvision
pip install huggingface_hub
pip install PictSure

Basic Usage

from PictSure import PictSure
from PIL import Image

# Load pre-trained model
model = PictSure.from_pretrained("pictsure/pictsure-resnet")

# Prepare context images and labels
context_images = [
    Image.open("cat1.jpg"),
    Image.open("cat2.jpg"),
    Image.open("dog1.jpg"),
    Image.open("dog2.jpg")
]
context_labels = [0, 0, 1, 1]  # 0 for cat, 1 for dog

# Set context
model.set_context_images(context_images, context_labels)

# Make prediction on new image
test_image = Image.open("unknown_animal.jpg")
prediction = model.predict(test_image)
print(f"Predicted class: {prediction}")

Training Data

The pre-trained models were trained on curated datasets for few-shot learning evaluation:

  • Encoder models: ImageNet pre-trained features (ResNet18/ViT)
  • Meta-Training: ImageNet-21k
  • Validation: Standard few-shot learning benchmarks, e.g., mini-ImageNet, PlantDoc, and BoneBreak

Data Preprocessing

  • Images resized to 224×224 pixels
  • Normalized per channel (see the snippet below):
    • Mean: [0.4914, 0.4822, 0.4465]
    • Std: [0.2023, 0.1994, 0.2010]
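
This pipeline can be reproduced with torchvision; a minimal sketch, assuming PIL inputs and the statistics listed above:

from PIL import Image
from torchvision import transforms

# Resize and normalize exactly as described in the preprocessing list above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2023, 0.1994, 0.2010]),
])

tensor = preprocess(Image.open("cat1.jpg")).unsqueeze(0)  # shape (1, 3, 224, 224)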

Evaluation

Performance Metrics

The model is evaluated using standard few-shot learning metrics:

  • Accuracy: Overall classification accuracy
  • Few-shot Accuracy: Performance in 1-shot and 5-shot scenarios (see the evaluation sketch below)
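
Few-shot accuracy is conventionally estimated by averaging over many randomly sampled N-way K-shot episodes. The sketch below shows one way to run such an evaluation using the set_context_images/predict API from Basic Usage; the images_by_class helper (class name -> list of PIL images) and the episode parameters are assumptions for illustration:

import random

def evaluate_episodes(model, images_by_class, n_way=5, k_shot=1, n_query=15, n_episodes=100):
    """Estimate N-way K-shot accuracy over randomly sampled episodes.

    images_by_class: dict mapping class name -> list of PIL images (assumed
    helper); each class needs at least k_shot + n_query images.
    """
    correct, total = 0, 0
    for _ in range(n_episodes):
        classes = random.sample(list(images_by_class), n_way)
        context_imgs, context_labels, queries = [], [], []
        for idx, cls in enumerate(classes):
            imgs = random.sample(images_by_class[cls], k_shot + n_query)
            context_imgs += imgs[:k_shot]             # support set
            context_labels += [idx] * k_shot
            queries += [(img, idx) for img in imgs[k_shot:]]  # held-out queries
        model.set_context_images(context_imgs, context_labels)  # API from Basic Usage
        for img, target in queries:
            correct += int(model.predict(img) == target)
            total += 1
    return correct / total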

Model Variants

Model      Backbone  Parameters  Model Size  Performance
ResPreAll  ResNet18  53M         ~200MB      Balanced speed/accuracy
ViTPreAll  ViT-Base  128M        ~500MB      Higher accuracy

Ethical Considerations

Potential Biases

  • The model inherits biases from ImageNet and ImageNet-21k pre-training
  • Performance may vary across different demographic groups or geographic regions
  • Context examples significantly influence predictions and may introduce bias

Responsible Use

  • Validate performance on your specific use case and demographic groups
  • Be aware of potential biases in context image selection
  • Consider fairness implications when deploying in production systems
  • Ensure diverse and representative context examples

Limitations and Risks

Technical Limitations

  • Context Dependency: Requires good context examples for optimal performance
  • Computational Requirements: Transformer architecture requires significant memory
  • Fixed Architecture: Pre-trained models have fixed class numbers and architecture
  • Image Size: Limited to 224×224 input resolution

Potential Risks

  • Misclassification: Incorrect predictions in critical applications
  • Bias Amplification: May amplify biases present in context images
  • Overfitting to Context: May not generalize beyond provided examples

Citation

@misc{schiesser2025pictsure,
      title={PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers}, 
      author={Lukas Schiesser and Cornelius Wolff and Sophie Haas and Simon Pukrop},
      year={2025},
      eprint={2506.14842},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.14842}, 
}

Model Card Contact

For questions about this model card or the PictSure model, open an issue in the GitHub repository.

Changelog

Version 1.0 (Current)

  • Initial release with ResNet and ViT backbones
  • Support for HuggingFace Hub integration
  • CLI tools for model management
  • Comprehensive documentation and examples