---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224
tags:
  - image-classification
  - animals
  - vision-transformer
  - vit
  - transfer-learning
  - generated_from_trainer
datasets:
  - imagefolder
metrics:
  - accuracy
model-index:
  - name: vit-90-animals
    results:
      - task:
          name: Image Classification
          type: image-classification
        dataset:
          name: iamsouravbanerjee/animal-image-dataset-90-different-animals
          type: imagefolder
          config: default
          split: train
          args: default
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.9796296296296296
---

# vit-90-animals


## Model description

This model is a fine-tuned version of the Vision Transformer google/vit-base-patch16-224, trained on the Animal Image Dataset (90 Different Animals) from Kaggle to classify images into 90 different animal species. It achieves high accuracy on unseen data and was trained with supervised learning. The model can be used for general-purpose image classification in the animal domain and serves as a comparison baseline for zero-shot classification models such as CLIP.

The model achieves the following results on the evaluation set:

- Loss: 0.0840
- Accuracy: 0.9796
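
A minimal inference sketch using the `image-classification` pipeline (the hub id `maceythm/vit-90-animals` is inferred from this card; substitute a local checkpoint path if the model is stored elsewhere):

```python
from transformers import pipeline

# Hub id inferred from this model card; replace with a local
# checkpoint path if needed.
classifier = pipeline("image-classification", model="maceythm/vit-90-animals")

# Any image path, URL, or PIL.Image works here.
predictions = classifier("photo_of_an_animal.jpg")
print(predictions[0])  # e.g. {'label': 'zebra', 'score': 0.99}
```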

## Intended uses & limitations

### Intended uses

- Animal image classification (educational, demo, prototyping)
- Benchmarking against zero-shot classification models
- Use in Gradio interfaces or image analysis tools (see the sketch after this list)
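
As a sketch of the Gradio use case above (the hub id is inferred from this card; the interface details are illustrative):

```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("image-classification", model="maceythm/vit-90-animals")

def predict(image):
    # gr.Label expects a {label: score} mapping.
    return {p["label"]: p["score"] for p in classifier(image)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=5),
    title="vit-90-animals",
)

if __name__ == "__main__":
    demo.launch()
```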

### Limitations

- The model is limited to the 90 animal classes it was trained on
- It may not generalize well to image domains outside its training distribution
- Performance can degrade with poor image quality or occlusions

## Training and evaluation data

The model was trained on a dataset of 5,400 animal images spanning 90 distinct classes. The dataset was obtained from Kaggle and, according to its creator, was originally sourced from Google Images. The train/validation/test split was 80/10/10, and the label distribution is relatively balanced across classes.
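
A sketch of how such a split can be produced with the `datasets` library (the `data_dir` path is a placeholder, and the exact split procedure used for this model is not recorded here):

```python
from datasets import load_dataset

# Placeholder path; point this at the extracted Kaggle dataset,
# organized as one subdirectory per class.
dataset = load_dataset("imagefolder", data_dir="path/to/animals")["train"]

# 80/10/10 train/validation/test split.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # 4320 / 540 / 540
```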

Evaluation was conducted on the test split and compared to results from a zero-shot model (openai/clip-vit-large-patch14) using the same label set.
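
A comparison along these lines can be reproduced with the `zero-shot-image-classification` pipeline (a sketch; the short label list stands in for the full 90 class names):

```python
from transformers import pipeline

clip = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-large-patch14",
)

# Illustrative subset; the actual comparison used the same
# 90 class names as the fine-tuned model.
labels = ["zebra", "elephant", "butterfly"]
result = clip("photo_of_an_animal.jpg", candidate_labels=labels)
print(result[0])  # highest-scoring label first
```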

## Training procedure

- Base model: google/vit-base-patch16-224
- Fine-tuning method: Supervised training using the Hugging Face Trainer class
- Data augmentation: Applied during training (e.g., RandomHorizontalFlip, ColorJitter); see the sketch after this list
- Training duration: ~5 epochs per run, both with and without augmentation
- Optimizer: AdamW (default settings)
- Evaluation metrics: Accuracy, precision, and recall
- Best performance (no augmentation): 98.3% test accuracy
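
A minimal sketch of the augmentation pipeline described above, using torchvision (the parameter values are illustrative assumptions, not the recorded training configuration):

```python
from torchvision import transforms

# Applied to training images only; evaluation images are resized
# and normalized without augmentation. Values are assumptions.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.Resize((224, 224)),  # ViT-base/16 expects 224x224 inputs
    transforms.ToTensor(),
    # Mean/std of 0.5 match the ViT image processor defaults.
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```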

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0003
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 5
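
A sketch of how these hyperparameters map onto `TrainingArguments` (the `output_dir` and the evaluation/logging strategies are assumptions; the rest follows the list above):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vit-90-animals",  # assumed output directory
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",  # betas/epsilon left at the defaults listed above
    lr_scheduler_type="linear",
    num_train_epochs=5,
    eval_strategy="epoch",  # assumed: matches the per-epoch results below
    logging_strategy="epoch",
)
```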

## Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 1.2021        | 1.0   | 270  | 0.3500          | 0.9611   |
| 0.2978        | 2.0   | 540  | 0.1766          | 0.9685   |
| 0.1886        | 3.0   | 810  | 0.1500          | 0.9685   |
| 0.1706        | 4.0   | 1080 | 0.1409          | 0.9685   |
| 0.1678        | 5.0   | 1350 | 0.1373          | 0.9667   |

## Framework versions

- Transformers 4.50.0
- Pytorch 2.6.0+cu124
- Datasets 3.4.1
- Tokenizers 0.21.1