PictSure: Few-Shot Image Classification with Context Learning
Model Description
PictSure is a few-shot learning model for image classification that uses labeled context images to make predictions on new, unseen images. The model combines pre-trained image encoders with a transformer architecture, so it can classify with only a few examples per class. More details can be found on our paper page.
Key Features
- Few-shot Learning: Classify images with only a few examples per class
- Context-Aware: Uses context images and labels to inform predictions
- Flexible Architecture: Supports both ResNet and Vision Transformer (ViT) backbones as well as other custom backbones
- Transformer-Based: Employs transformer encoders for sequence processing
- Easy Integration: Simple API for setting context and making predictions
Model Architecture
PictSure consists of several key components:
- Embedding Network: Pre-trained ResNet18 or custom Vision Transformer for feature extraction
- Projection Layers: Linear projections for image features and label embeddings
- Transformer Encoder: Multi-head attention mechanism with configurable heads and layers
- Classification Head: Final linear layer for class prediction
Architecture Details
- Image Resolution: 224 × 224 pixels
- Feature Dimension: 1024D (concatenated image + label projections)
- Default Configuration:
  - ResNet model: 8 attention heads, 4 transformer layers, ~53M parameters
  - ViT model: 8 attention heads, 4 transformer layers, ~128M parameters
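As a rough sanity check on these sizes, the sketch below counts the parameters of a standard transformer encoder stack with the width and depth listed above. The feed-forward width `D_FF` is not stated in this card, so the value 2048 is an assumption, as is the standard layer layout with biases and two layer norms per layer. The function name is hypothetical. The count covers only the transformer layers, not the image backbone, the projection layers, or the classification head.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one standard transformer encoder layer (with biases)."""
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                 # scale and shift, two norms
    return attention + feed_forward + layer_norms


D_MODEL = 1024   # feature dimension from this card
NUM_LAYERS = 4   # transformer layers from this card
D_FF = 2048      # ASSUMPTION: not stated in this card

per_layer = encoder_layer_params(D_MODEL, D_FF)
print(f"per layer: {per_layer:,}")
print(f"{NUM_LAYERS} layers: {NUM_LAYERS * per_layer:,}")
```

Under these assumptions the four layers contribute about 33.6M parameters. The number of attention heads does not change the count, because the heads split the same projection matrices.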
Intended Use
Primary Use Cases
- Few-shot image classification in scenarios with limited labeled data
- Meta-learning applications where rapid adaptation to new classes is required
- Educational and research purposes in computer vision and machine learning
- Prototyping classification systems with minimal training data
Limitations
- Requires context images to be set before making predictions
- Performance depends on the quality and representativeness of context examples
- Limited to classification tasks (not suitable for detection or segmentation)
- Input images must be resized to 224 × 224 pixels
How to Use
Installation
pip install torch torchvision
pip install huggingface_hub
pip install PictSure
Basic Usage
from PictSure import PictSure
from PIL import Image
# Load pre-trained model
model = PictSure.from_pretrained("pictsure/pictsure-resnet")
# Prepare context images and labels
context_images = [
    Image.open("cat1.jpg"),
    Image.open("cat2.jpg"),
    Image.open("dog1.jpg"),
    Image.open("dog2.jpg"),
]
context_labels = [0, 0, 1, 1]  # 0 for cat, 1 for dog
# Set context
model.set_context_images(context_images, context_labels)
# Make prediction on new image
test_image = Image.open("unknown_animal.jpg")
prediction = model.predict(test_image)
print(f"Predicted class: {prediction}")
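When the context comes from a mapping of class names to image files, a small helper can keep the integer labels and class names aligned. The sketch below is plain Python and does not call PictSure. `build_context` and the file names are hypothetical, and it assumes, as in the example above, that labels are integers starting at 0.

```python
from typing import Dict, List, Tuple


def build_context(files_by_class: Dict[str, List[str]]) -> Tuple[List[str], List[int], List[str]]:
    """Flatten {class name: [image paths]} into parallel path and label lists."""
    class_names = sorted(files_by_class)          # fixed order -> stable label ids
    paths, labels = [], []
    for label, name in enumerate(class_names):
        for path in files_by_class[name]:
            paths.append(path)
            labels.append(label)
    return paths, labels, class_names


paths, labels, class_names = build_context({
    "cat": ["cat1.jpg", "cat2.jpg"],
    "dog": ["dog1.jpg", "dog2.jpg"],
})
print(paths)        # ['cat1.jpg', 'cat2.jpg', 'dog1.jpg', 'dog2.jpg']
print(labels)       # [0, 0, 1, 1]
print(class_names)  # ['cat', 'dog']
```

The paths can then be opened with `Image.open` and passed together with `labels` to `model.set_context_images`. If `predict` returns an integer class index (an assumption to check against the package), `class_names[prediction]` recovers the class name.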
Training Data
The pre-trained models were trained on curated datasets for few-shot learning evaluation:
- Encoder models: ImageNet pre-trained features (ResNet18/ViT)
- Meta-Training: ImageNet-21k
- Validation: Standard few-shot learning benchmarks, e.g. mini-ImageNet, PlantDoc and BoneBreak
Data Preprocessing
- Images resized to 224 × 224 pixels
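- Normalized per channel with the following statistics (these match the commonly used CIFAR-10 values rather than the standard ImageNet mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225]):
  - Mean: [0.4914, 0.4822, 0.4465]
  - Std: [0.2023, 0.1994, 0.2010]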
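The normalization step amounts to a per-channel shift and scale. A minimal plain-Python sketch for a single RGB pixel, using the mean and standard deviation listed above, is shown below. The function name is hypothetical, and scaling the 8-bit values to [0, 1] before normalizing is an assumption (it is the usual torchvision convention).

```python
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then normalize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, MEAN, STD)
    )


print(normalize_pixel((128, 128, 128)))
```

In practice the same arithmetic is applied to every pixel of the resized image, typically through a tensor library rather than a Python loop.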
Evaluation
Performance Metrics
The model is evaluated using standard few-shot learning metrics:
- Accuracy: Overall classification accuracy
- Few-shot Accuracy: Performance in 1-shot and 5-shot scenarios
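A common way to obtain such numbers is episodic evaluation: sample N classes, K context images per class, and a disjoint set of query images, then average the accuracy over many episodes. The sketch below shows the sampling and scoring logic in plain Python. `sample_episode`, `episode_accuracy`, and the `classify` callable are hypothetical stand-ins, not part of the PictSure API, and the protocol used in the paper may differ.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (item id or path, episode-local label)


def sample_episode(items_by_class: Dict[str, List[str]], n_way: int, k_shot: int,
                   n_query: int, rng: random.Random) -> Tuple[List[Example], List[Example]]:
    """Sample an N-way K-shot episode with disjoint context and query sets."""
    classes = rng.sample(sorted(items_by_class), n_way)
    context, query = [], []
    for label, name in enumerate(classes):
        picked = rng.sample(items_by_class[name], k_shot + n_query)
        context += [(item, label) for item in picked[:k_shot]]
        query += [(item, label) for item in picked[k_shot:]]
    return context, query


def episode_accuracy(classify: Callable[[Sequence[Example], str], int],
                     context: Sequence[Example], query: Sequence[Example]) -> float:
    """Fraction of query items whose predicted label matches the true label."""
    correct = sum(classify(context, item) == label for item, label in query)
    return correct / len(query)


# Toy data: item ids encode their class, so a lookup "classifier" is always right.
data = {name: [f"{name}_{i}" for i in range(10)] for name in ("cat", "dog", "fox")}


def prefix_classifier(context, item):
    label_of = {ctx_item.split("_")[0]: label for ctx_item, label in context}
    return label_of[item.split("_")[0]]


rng = random.Random(0)
context, query = sample_episode(data, n_way=2, k_shot=5, n_query=3, rng=rng)
print(episode_accuracy(prefix_classifier, context, query))  # 1.0
```

To evaluate a real model, `classify` would wrap the calls that set the context and predict on one query image. Reporting the mean over many episodes, for both K=1 and K=5, gives the 1-shot and 5-shot accuracies.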
Model Variants
| Model | Backbone | Parameters | Model Size | Performance |
|---|---|---|---|---|
| ResPreAll | ResNet18 | 53M | ~200MB | Balanced speed/accuracy |
| ViTPreAll | ViT-Base | 128M | ~500MB | Higher accuracy |
Ethical Considerations
Potential Biases
- The model inherits biases from ImageNet and ImageNet-21k pre-training
- Performance may vary across different demographic groups or geographic regions
- Context examples significantly influence predictions and may introduce bias
Responsible Use
- Validate performance on your specific use case and demographic groups
- Be aware of potential biases in context image selection
- Consider fairness implications when deploying in production systems
- Ensure diverse and representative context examples
Limitations and Risks
Technical Limitations
- Context Dependency: Requires good context examples for optimal performance
- Computational Requirements: Transformer architecture requires significant memory
- Fixed Architecture: Pre-trained models have a fixed number of classes and a fixed architecture
- Image Size: Limited to 224 × 224 input resolution
Potential Risks
- Misclassification: Incorrect predictions in critical applications
- Bias Amplification: May amplify biases present in context images
- Overfitting to Context: May not generalize beyond provided examples
Citation
@misc{schiesser2025pictsure,
title={PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers},
author={Lukas Schiesser and Cornelius Wolff and Sophie Haas and Simon Pukrop},
year={2025},
eprint={2506.14842},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.14842},
}
Model Card Contact
For questions about this model card or the PictSure model, open an issue in the GitHub repository.
Changelog
Version 1.0 (Current)
- Initial release with ResNet and ViT backbones
- Support for HuggingFace Hub integration
- CLI tools for model management
- Comprehensive documentation and examples