---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224
tags:
  - image-classification
  - animals
  - vision-transformer
  - vit
  - transfer-learning
  - generated_from_trainer
datasets:
  - imagefolder
metrics:
  - accuracy
model-index:
  - name: vit-90-animals
    results:
      - task:
          name: Image Classification
          type: image-classification
        dataset:
          name: iamsouravbanerjee/animal-image-dataset-90-different-animals
          type: imagefolder
          config: default
          split: train
          args: default
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.9796296296296296
---

# vit-90-animals


## Model description

This model is a fine-tuned version of the Vision Transformer google/vit-base-patch16-224, trained on the Animal Image Dataset (90 Different Animals) from Kaggle to classify images into 90 different animal species. It achieves high accuracy on unseen data and was trained with supervised learning. The model can be used for general-purpose image classification in the animal domain and serves as a comparison baseline for zero-shot classification models such as CLIP.

The model achieves the following results on the evaluation set:

- Loss: 0.0840
- Accuracy: 0.9796
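
A minimal inference sketch using the `image-classification` pipeline (the hub id `maceythm/vit-90-animals` is inferred from this card; substitute a local checkpoint path if the model is stored elsewhere):

```python
from transformers import pipeline

# Hub id inferred from this model card; replace with a local
# checkpoint path if needed.
classifier = pipeline("image-classification", model="maceythm/vit-90-animals")

# Any image path, URL, or PIL.Image works here.
predictions = classifier("photo_of_an_animal.jpg")
print(predictions[0])  # e.g. {'label': 'zebra', 'score': 0.99}
```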

## Intended uses & limitations

### Intended uses

- Animal image classification (educational, demo, prototyping)
- Benchmarking against zero-shot classification models
- Use in Gradio interfaces or image analysis tools (see the sketch after this list)
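
As a sketch of the Gradio use case above (the hub id is inferred from this card; the interface details are illustrative):

```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("image-classification", model="maceythm/vit-90-animals")

def predict(image):
    # gr.Label expects a {label: score} mapping.
    return {p["label"]: p["score"] for p in classifier(image)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=5),
    title="vit-90-animals",
)

if __name__ == "__main__":
    demo.launch()
```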

### Limitations

- The model is limited to the 90 animal classes it was trained on
- It may not generalize well to image domains outside its training distribution
- Performance can degrade with poor image quality or occlusions

## Training and evaluation data

The model was trained on a dataset of 5,400 animal images spanning 90 distinct classes. The dataset was obtained from Kaggle and, according to its creator, was originally sourced from Google Images. The train/validation/test split was 80/10/10, and the label distribution is relatively balanced across classes.
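
A sketch of how such a split can be produced with the `datasets` library (the `data_dir` path is a placeholder, and the exact split procedure used for this model is not recorded here):

```python
from datasets import load_dataset

# Placeholder path; point this at the extracted Kaggle dataset,
# organized as one subdirectory per class.
dataset = load_dataset("imagefolder", data_dir="path/to/animals")["train"]

# 80/10/10 train/validation/test split.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 4320 / 540 / 540
```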

Evaluation was conducted on the test split and compared to results from a zero-shot model (openai/clip-vit-large-patch14) using the same label set.
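
A comparison along these lines can be reproduced with the `zero-shot-image-classification` pipeline (a sketch; the short label list stands in for the full 90 class names):

```python
from transformers import pipeline

clip = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-large-patch14",
)

# Illustrative subset; the actual comparison used the same
# 90 class names as the fine-tuned model.
labels = ["zebra", "elephant", "butterfly"]
result = clip("photo_of_an_animal.jpg", candidate_labels=labels)
print(result[0])  # highest-scoring label first
```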

## Training procedure

- Base model: google/vit-base-patch16-224
- Fine-tuning method: Supervised training using the Hugging Face Trainer class
- Data augmentation: Applied during training (e.g., RandomHorizontalFlip, ColorJitter); see the sketch after this list
- Training duration: ~5 epochs per run, both with and without augmentation
- Optimizer: AdamW (default settings)
- Evaluation metrics: Accuracy, precision, and recall
- Best performance (no augmentation): 98.3% test accuracy
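
A minimal sketch of the augmentation pipeline described above, using torchvision (the parameter values are illustrative assumptions, not the recorded training configuration):

```python
from torchvision import transforms

# Applied to training images only; evaluation images are resized
# and normalized without augmentation. Values are assumptions.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.Resize((224, 224)),  # ViT-base/16 expects 224x224 inputs
    transforms.ToTensor(),
    # Mean/std of 0.5 match the ViT image processor defaults.
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```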

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0003
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 5
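
A sketch of how these hyperparameters map onto `TrainingArguments` (the `output_dir` and the evaluation/logging strategies are assumptions; the rest follows the list above):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-90-animals",  # assumed output directory
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",  # betas/epsilon left at the defaults listed above
    lr_scheduler_type="linear",
    num_train_epochs=5,
    eval_strategy="epoch",  # assumed: matches the per-epoch results below
    logging_strategy="epoch",
)
```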

## Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.2021        | 1.0   | 270  | 0.3500          | 0.9611   |
| 0.2978        | 2.0   | 540  | 0.1766          | 0.9685   |
| 0.1886        | 3.0   | 810  | 0.1500          | 0.9685   |
| 0.1706        | 4.0   | 1080 | 0.1409          | 0.9685   |
| 0.1678        | 5.0   | 1350 | 0.1373          | 0.9667   |

## Framework versions

- Transformers 4.50.0
- Pytorch 2.6.0+cu124
- Datasets 3.4.1
- Tokenizers 0.21.1