ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐶🐱
This model is a Vision Transformer (ViT) trained from scratch on the Oxford Pets dataset. It classifies images of cats and dogs into 37 different breeds.
Model Summary
- Architecture: Custom Vision Transformer (ViT)
- Input resolution: 128x128
- Patch size: 16x16
- Embedding dimension: 240
- Number of Transformer blocks: 12
- Number of heads: 4
- MLP ratio: 2.0
- Dropout: 10% on attention and MLP
- Framework: PyTorch
- Dataset: Oxford Pets (via 🤗 cvdl/oxford-pets)
- Loss: CrossEntropyLoss
- Optimizer: SGD with LR = 0.00257
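For orientation, the settings above fix the token geometry of the encoder. The short sketch below just checks those numbers; whether this custom ViT also prepends a [CLS] token is an assumption about the implementation in `model.py`.

```python
# Token geometry implied by the configuration above.
img_size, patch_size, in_channels, embed_dim = 128, 16, 3, 240

n_patches = (img_size // patch_size) ** 2              # 8 * 8 = 64 patch tokens per image
values_per_patch = patch_size * patch_size * in_channels  # 16 * 16 * 3 = 768 pixel values per patch

# Each patch is linearly projected from 768 values down to the 240-dim embedding.
print(n_patches, values_per_patch, embed_dim)          # 64 768 240
```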
Training Setup
- Device: Multi-GPU (4 GPUs)
- Batch size: 256 (64 × 4 GPUs)
- Early stopping: patience 50, delta 1e-6
- Logging: TensorBoard
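The training script itself is not included here. The following is a minimal single-process sketch of the setup described above (CrossEntropyLoss, SGD with LR = 0.00257, per-GPU batch size 64); the random placeholder tensors only make the loop runnable and should be replaced by a DataLoader built from the cvdl/oxford-pets dataset.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from model import ViT

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViT(img_size=(128, 128), patch_size=16, in_channels=3, embed_dim=240,
            n_classes=37, n_blocks=12, n_heads=4, mlp_ratio=2.0,
            qkv_bias=True, block_drop_p=0.1, attn_drop_p=0.1).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.00257)

# Placeholder data so the loop runs end to end; real training uses 128x128 RGB
# crops from Oxford Pets with labels in [0, 36].
images = torch.randn(64, 3, 128, 128)
labels = torch.randint(0, 37, (64,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

model.train()
for epoch in range(1):  # single epoch for the sketch
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)  # model(x) is assumed to return class logits
        loss.backward()
        optimizer.step()
```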
How to Use
```python
from model import ViT
import torch

# Instantiate the architecture with the same hyperparameters used for training
model = ViT(
    img_size=(128, 128),
    patch_size=16,
    in_channels=3,
    embed_dim=240,
    n_classes=37,
    n_blocks=12,
    n_heads=4,
    mlp_ratio=2.0,
    qkv_bias=True,
    block_drop_p=0.1,
    attn_drop_p=0.1,
)

# Load the released checkpoint (mapped to CPU so this also works without a GPU)
model.load_state_dict(torch.load("ViTPets.pth", map_location="cpu"))
model.eval()
```
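With the weights loaded, single-image inference looks roughly like the sketch below. The file name `my_pet.jpg` is just an example, and the ImageNet-style normalization constants are an assumption; use whatever preprocessing the training pipeline applied.

```python
from PIL import Image
from torchvision import transforms

# Resize to the model's 128x128 input; normalization should match training
# (the stats below are an ImageNet-style assumption).
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("my_pet.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)   # shape: (1, 3, 128, 128)

with torch.no_grad():
    logits = model(batch)                # assumed shape: (1, 37)
    probs = logits.softmax(dim=-1)
    pred = probs.argmax(dim=-1).item()   # index of the predicted breed (0-36)
print(pred)
```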