IMAGENETTE

IMAGENETTE is an image classification model fine-tuned from the SigLIP 2 vision-language encoder google/siglip2-base-patch16-224. It uses the SiglipForImageClassification architecture to classify images into the 10 categories of the Imagenette dataset, a 10-class subset of ImageNet.
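The base checkpoint name encodes the input resolution (224 × 224 pixels) and the patch size (16 × 16 pixels). Assuming standard ViT-style patchification, and assuming the fine-tuned model keeps the base checkpoint's 224-pixel input, the number of patch tokens the vision encoder processes per image works out as:

```latex
N_{\text{patches}} = \left(\frac{224}{16}\right)^{2} = 14^{2} = 196
```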

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features https://arxiv.org/pdf/2502.14786

ImageNet Large Scale Visual Recognition Challenge https://arxiv.org/pdf/1409.0575

Classification Report:
                  precision    recall  f1-score   support

           tench     0.9885    0.9834    0.9859       963
english springer     0.9843    0.9822    0.9832       955
 cassette player     0.9544    0.9486    0.9515       993
       chain saw     0.9257    0.8998    0.9125       858
          church     0.9654    0.9798    0.9726       941
     French horn     0.9757    0.9665    0.9711       956
   garbage truck     0.8883    0.9761    0.9301       961
        gas pump     0.9366    0.9044    0.9202       931
       golf ball     0.9925    0.9716    0.9819       951
       parachute     0.9821    0.9708    0.9764       960

        accuracy                         0.9590      9469
       macro avg     0.9593    0.9583    0.9586      9469
    weighted avg     0.9597    0.9590    0.9591      9469
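The summary rows follow from the per-class rows. The macro average is the unweighted mean over the 10 classes, and the weighted average weights each class by its support. In single-label classification, weighted-average recall therefore equals overall accuracy. The sketch below recomputes these rows from the rounded per-class values, so the last digit can differ slightly from the report. The report's layout resembles scikit-learn's classification_report output, but that is an assumption about how it was produced. REPORT, f1 and averages are hypothetical names.

```python
# Per-class (precision, recall, f1, support) copied from the report above.
REPORT = {
    "tench": (0.9885, 0.9834, 0.9859, 963),
    "english springer": (0.9843, 0.9822, 0.9832, 955),
    "cassette player": (0.9544, 0.9486, 0.9515, 993),
    "chain saw": (0.9257, 0.8998, 0.9125, 858),
    "church": (0.9654, 0.9798, 0.9726, 941),
    "French horn": (0.9757, 0.9665, 0.9711, 956),
    "garbage truck": (0.8883, 0.9761, 0.9301, 961),
    "gas pump": (0.9366, 0.9044, 0.9202, 931),
    "golf ball": (0.9925, 0.9716, 0.9819, 951),
    "parachute": (0.9821, 0.9708, 0.9764, 960),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def averages(report):
    """Return macro averages, support-weighted averages, and total support."""
    rows = list(report.values())
    total = sum(r[3] for r in rows)
    # Columns 0, 1, 2 are precision, recall, F1.
    macro = tuple(sum(r[k] for r in rows) / len(rows) for k in range(3))
    weighted = tuple(sum(r[k] * r[3] for r in rows) / total for k in range(3))
    return macro, weighted, total


macro, weighted, total = averages(REPORT)
print("support:", total)
print("macro    P/R/F1:", [round(x, 4) for x in macro])
print("weighted P/R/F1:", [round(x, 4) for x in weighted])

# Sanity check one row: F1 recomputed from its own precision and recall.
p, r, _, _ = REPORT["garbage truck"]
print("garbage truck F1:", round(f1(p, r), 4))
```

The garbage truck row shows why its F1 (0.9301) sits below its recall: high recall paired with the lowest precision in the table pulls the harmonic mean down.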

Label Space: 10 Classes

The model predicts one of the following image classes:

0: tench
1: english springer
2: cassette player
3: chain saw
4: church
5: French horn
6: garbage truck
7: gas pump
8: golf ball
9: parachute
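When fine-tuning a classification head with Hugging Face Transformers, the label space is commonly supplied as a pair of id2label and label2id dictionaries, for example as keyword arguments to from_pretrained. The sketch below builds both from the class list above. It illustrates that general pattern and is not the author's training script. LABELS is a hypothetical name, and the from_pretrained call is left commented out because it needs network access.

```python
# Class names in index order, copied from the label list above.
LABELS = [
    "tench", "english springer", "cassette player", "chain saw", "church",
    "French horn", "garbage truck", "gas pump", "golf ball", "parachute",
]

# Integer index -> class name, and the inverse mapping.
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in id2label.items()}

# Hypothetical fine-tuning setup (downloads the base checkpoint, so commented out):
# from transformers import SiglipForImageClassification
# model = SiglipForImageClassification.from_pretrained(
#     "google/siglip2-base-patch16-224",
#     num_labels=len(LABELS), id2label=id2label, label2id=label2id,
# )

print(id2label[6], label2id["parachute"])
```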

Install Dependencies

pip install -q transformers torch pillow gradio hf_xet

Inference Code

import gradio as gr
from transformers import AutoImageProcessor, SiglipForImageClassification
from PIL import Image
import torch

# Load model and processor
model_name = "prithivMLmods/IMAGENETTE"
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

# Label mapping
id2label = {
    "0": "tench",
    "1": "english springer",
    "2": "cassette player",
    "3": "chain saw",
    "4": "church",
    "5": "French horn",
    "6": "garbage truck",
    "7": "gas pump",
    "8": "golf ball",
    "9": "parachute"
}

def classify_image(image):
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()
    
    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }

    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(num_top_classes=3, label="Image Classification"),
    title="IMAGENETTE - SigLIP2 Classifier",
    description="Upload an image to classify it into one of 10 categories from the Imagenette dataset."
)

if __name__ == "__main__":
    iface.launch()
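Inside classify_image, a softmax converts the logits into probabilities. These are rounded to three decimals and keyed by class name, and gr.Label(num_top_classes=3) displays only the three most probable classes. The dependency-free sketch below mirrors that post-processing on a made-up logit vector, so the steps can be checked without downloading the model. softmax_probs and top_k are hypothetical helper names, and the logits are invented for illustration.

```python
import math

LABELS = [
    "tench", "english springer", "cassette player", "chain saw", "church",
    "French horn", "garbage truck", "gas pump", "golf ball", "parachute",
]


def softmax_probs(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(prediction, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    return sorted(prediction.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Invented logits for illustration only (not real model output).
logits = [0.1, -1.2, 0.3, 2.5, -0.4, 0.0, 4.0, 1.0, -2.0, 0.2]
probs = softmax_probs(logits)

# Same dictionary shape that classify_image returns.
prediction = {LABELS[i]: round(p, 3) for i, p in enumerate(probs)}
print(top_k(prediction))
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged but keeps large logits from overflowing.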

Intended Use

IMAGENETTE is designed for:

Educational purposes and model benchmarking.
Demonstrating the performance of SigLIP2 on a small but diverse classification task.
Serving as a reference for fine-tuning workflows on vision-language models.

Collection including prithivMLmods/IMAGENETTE: Multiclass Image Classification 05142025