---
license: apache-2.0
tags:
  - image-classification
  - vision-transformer
  - pytorch
  - oxford-pets
library_name: torch
datasets:
  - cvdl/oxford-pets
language: []
model-index:
  - name: ViTPets
    results:
      - task:
          type: image-classification
        dataset:
          name: Oxford Pets
          type: cvdl/oxford-pets
        metrics:
          - type: accuracy
            value: 9
---

# ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐶🐱

This model is a Vision Transformer (ViT) trained from scratch on the Oxford Pets dataset. It classifies images of cats and dogs into 37 different breeds.

## Model Summary

- Architecture: Custom Vision Transformer (ViT)
- Input resolution: 128×128
- Patch size: 16×16 (the resulting token geometry is sketched below)
- Embedding dimension: 240
- Number of Transformer blocks: 12
- Number of heads: 4
- MLP ratio: 2.0
- Dropout: 10% on attention and MLP
- Framework: PyTorch
- Dataset: Oxford Pets (via 🤗 cvdl/oxford-pets)
- Loss: CrossEntropyLoss
- Optimizer: SGD with LR = 0.00257
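
These hyperparameters fix the token geometry: a 128×128 input split into 16×16 patches gives an 8×8 grid of 64 patches. A minimal arithmetic sketch follows; the `+ 1` class token is an assumption based on standard ViT designs, as the card does not state whether one is used:

```python
# Token geometry implied by the hyperparameters above. Standalone
# arithmetic only; this is not the repository's model.py.
img_size, patch_size, embed_dim, n_heads, mlp_ratio = 128, 16, 240, 4, 2.0

n_patches = (img_size // patch_size) ** 2  # 8 x 8 grid -> 64 patches
seq_len = n_patches + 1                    # + [CLS] token (assumed) -> 65
head_dim = embed_dim // n_heads            # 240 / 4 heads -> 60 dims per head
mlp_hidden = int(embed_dim * mlp_ratio)    # 240 * 2.0 -> 480 hidden units

print(n_patches, seq_len, head_dim, mlp_hidden)  # 64 65 60 480
```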

## Training Setup

- Device: Multi-GPU (4 GPUs)
- Batch size: 256 (64 × 4 GPUs)
- Early stopping: patience 50, delta 1e-6 (see the loop sketch below)
- Logging: TensorBoard
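
The training script itself is not part of this card; the following is a minimal sketch of a loop consistent with the settings above (SGD at LR 0.00257, CrossEntropyLoss, data parallelism over 4 GPUs, early stopping, TensorBoard logging). The data-parallel wrapper, the loop structure, and `train_loader`/`val_loader` are assumptions, not the author's actual code:

```python
# Hypothetical training loop matching the documented settings.
# `train_loader` and `val_loader` are assumed DataLoaders yielding
# (image, label) batches from cvdl/oxford-pets.
import torch
from torch.utils.tensorboard import SummaryWriter
from model import ViT

device = "cuda"
vit = ViT(img_size=(128, 128), patch_size=16, in_channels=3, embed_dim=240,
          n_classes=37, n_blocks=12, n_heads=4, mlp_ratio=2.0,
          qkv_bias=True, block_drop_p=0.1, attn_drop_p=0.1)
model = torch.nn.DataParallel(vit).to(device)  # 4 GPUs x 64 images = batch 256
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.00257)
writer = SummaryWriter()

best_val, bad_epochs, patience, delta = float("inf"), 0, 50, 1e-6
for epoch in range(1_000):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                       for x, y in val_loader) / len(val_loader)
    writer.add_scalar("loss/val", val_loss, epoch)

    # Early stopping: quit after `patience` epochs without an
    # improvement greater than `delta`.
    if best_val - val_loss > delta:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```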

## How to Use

```python
from model import ViT  # the ViT class shipped in this repository
import torch

# Instantiate the architecture with the hyperparameters listed above.
model = ViT(
    img_size=(128, 128),
    patch_size=16,
    in_channels=3,
    embed_dim=240,
    n_classes=37,
    n_blocks=12,
    n_heads=4,
    mlp_ratio=2.0,
    qkv_bias=True,
    block_drop_p=0.1,
    attn_drop_p=0.1,
)

# Load the trained weights (map_location lets this work on CPU-only
# machines) and switch to inference mode.
model.load_state_dict(torch.load("ViTPets.pth", map_location="cpu"))
model.eval()
```
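
A sketch of single-image inference on the loaded model follows. The file name `my_pet.jpg` is a placeholder, and the resize-only transform is an assumption; use whatever preprocessing the model was trained with:

```python
# Hypothetical single-image inference. The exact training-time
# preprocessing is not documented here, so this transform is an
# assumption (resize to the model's 128x128 input, no normalization).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((128, 128)),   # match the model's input resolution
    transforms.ToTensor(),
])

image = Image.open("my_pet.jpg").convert("RGB")  # placeholder path
batch = preprocess(image).unsqueeze(0)           # shape: (1, 3, 128, 128)

with torch.no_grad():
    logits = model(batch)                        # shape: (1, 37)
    probs = logits.softmax(dim=-1)

pred = probs.argmax(dim=-1).item()
print(f"predicted breed index: {pred} (p={probs[0, pred].item():.3f})")
```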