rwightman (HF Staff) committed
Commit c7e047a · verified · 1 Parent(s): 9f86144

Files changed (4)
  1. README.md +180 -0
  2. config.json +34 -0
  3. model.safetensors +3 -0
  4. pytorch_model.bin +3 -0
README.md ADDED
@@ -0,0 +1,180 @@
---
tags:
- image-classification
- timm
- transformers
library_name: timm
license: apache-2.0
datasets:
- imagenet-1k
---
# Model card for naflexvit_base_patch16_par_gap.e300_s576_in1k

A NaFlexViT (Native-Aspect Flexible Vision Transformer) image classification model. This variant, with aspect-preserving 2D position embeddings, was pretrained on ImageNet-1k by Ross Wightman. NaFlexViT is based on the NaFlex ViT changes proposed in SigLIP 2, with a number of timm tweaks, enabling training with dynamic batch sizing that maintains native aspect ratios and flexible resolutions w/ variable patch sizes. The model was trained using the NaFlex data loader, which supports variable sequence lengths and resolutions during training, and uses RandAugment, MixUp, CutMix, and grayscale augmentation on top of standard random resize + crop (RRC). It was optimized with NAdamW and a cosine learning rate schedule.

Training command:
```
train.py --data-dir /data/imagenet/ --amp --amp-dtype bfloat16 --model <name> --naflex-loader -b 64 --opt nadamw --lr 3e-4 --warmup-lr 0 --sched-on-updates --aa rand-m8-inc1-mstd1.0 --weight-decay .1 --grayscale-prob .1 --drop-path 0.2 --reprob 0 --mixup 0.8 --cutmix 1.0 --remode pixel -j 8
```

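The card describes training at flexible resolutions and native aspect ratios. As a sketch of what that means at inference time (not from the original card; it assumes the plain tensor forward pass accepts any input whose height and width are multiples of the 16-pixel patch size — if that does not hold for your timm version, use the fixed 384 x 384 transform from the usage examples below):

```python
import torch
import timm

model = timm.create_model('naflexvit_base_patch16_par_gap.e300_s576_in1k', pretrained=True)
model = model.eval()

# Assumption: only H and W divisible by the 16px patch size are required.
# 384x384 -> 24x24 = 576 tokens (the eval setting), 256x512 -> 16x32 = 512 tokens.
for size in [(384, 384), (256, 512)]:
    x = torch.randn(1, 3, *size)
    with torch.no_grad():
        out = model(x)
    print(size, out.shape)  # expected: torch.Size([1, 1000]) in both cases
```
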

## Model Details
- **Model Type:** Image classification / feature backbone
- **Model Stats:**
  - Params (M): 86.6
  - GMACs: 55.9
  - Activations (M): 102.3
  - Image size: 384 x 384
- **Papers:**
  - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models
  - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
  - Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution: https://arxiv.org/abs/2307.06304
  - FlexiViT: One Model for All Patch Sizes: https://arxiv.org/abs/2212.08013
- **Dataset:** ImageNet-1k
- **Training:**
  - Sequence Lengths: [128, 256, 576, 784, 1024]
  - Epochs: 300
  - Batch Size: 64 per GPU (4 GPUs) @ seq-len 1024
  - Optimizer: NAdamW
  - Learning Rate: 3e-4
  - Weight Decay: 0.1
  - Augmentation: RandAugment (m=8), MixUp (0.8), CutMix (1.0), Grayscale (0.1)
  - Drop Path: 0.2
  - AMP dtype: bfloat16
- **Architecture:**
  - Variant: base
  - Patch Size: 16x16
  - Positional Embedding: aspect-preserving 2D position embedding
  - Pooling: global average pooling (GAP)

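The "s576" tag in the model name and the 576-token eval sequence length in the comparison table below follow directly from the 384 x 384 image size and 16 x 16 patches; a quick check (not from the original card):

```python
# 384 / 16 = 24 patches per side, so a square 384x384 image yields
# 24 * 24 = 576 patch tokens, matching the e300_s576_in1k tag.
patch, img = 16, 384
tokens = (img // patch) ** 2
print(tokens)  # 576
```
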
## Model Usage
### Image Classification
```python
from urllib.request import urlopen
from PIL import Image
import torch
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('naflexvit_base_patch16_par_gap.e300_s576_in1k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```

### Feature Map Extraction
```python
from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'naflexvit_base_patch16_par_gap.e300_s576_in1k',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 768, 24, 24])
    #  torch.Size([1, 768, 24, 24])
    #  torch.Size([1, 768, 24, 24])

    print(o.shape)
```

### Image Embeddings
```python
from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'naflexvit_base_patch16_par_gap.e300_s576_in1k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 580, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
```

## Model Comparison
| Model | Top-1 Acc | Top-5 Acc | Params (M) | Eval Seq Len |
|:---|:---:|:---:|:---:|:---:|
| naflexvit_base_patch16_par_gap.e300_s576_in1k | 83.67 | 96.45 | 86.63 | 576 |
| naflexvit_base_patch16_parfac_gap.e300_s576_in1k | 83.63 | 96.41 | 86.46 | 576 |
| naflexvit_base_patch16_gap.e300_s576_in1k | 83.50 | 96.46 | 86.63 | 576 |

## Citation
```bibtex
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```
```bibtex
@article{tschannen2025siglip,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Tschannen, Michael and Gritsenko, Alexey and Wang, Xiao and Naeem, Muhammad Ferjad and Alabdulmohsin, Ibrahim and Parthasarathy, Nikhil and Evans, Talfan and Beyer, Lucas and Xia, Ye and Mustafa, Basil and others},
  journal={arXiv preprint arXiv:2502.14786},
  year={2025}
}
```
```bibtex
@article{dehghani2023navit,
  title={Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
  author={Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and others},
  journal={arXiv preprint arXiv:2307.06304},
  year={2023}
}
```
```bibtex
@article{beyer2022flexivit,
  title={FlexiViT: One Model for All Patch Sizes},
  author={Beyer, Lucas and Izmailov, Pavel and Kolesnikov, Alexander and Caron, Mathilde and Kornblith, Simon and Zhai, Xiaohua and Minderer, Matthias and Tschannen, Michael and Alabdulmohsin, Ibrahim and Pavetic, Filip},
  journal={arXiv preprint arXiv:2212.08013},
  year={2022}
}
```
config.json ADDED
@@ -0,0 +1,34 @@
{
  "architecture": "naflexvit_base_patch16_par_gap",
  "num_classes": 1000,
  "num_features": 768,
  "global_pool": "avg",
  "pretrained_cfg": {
    "tag": "e300_s576_in1k",
    "custom_load": false,
    "input_size": [
      3,
      384,
      384
    ],
    "fixed_input_size": false,
    "interpolation": "bicubic",
    "crop_pct": 1.0,
    "crop_mode": "center",
    "mean": [
      0.5,
      0.5,
      0.5
    ],
    "std": [
      0.5,
      0.5,
      0.5
    ],
    "num_classes": 1000,
    "pool_size": null,
    "first_conv": "embeds.proj",
    "classifier": "head",
    "license": "apache-2.0"
  }
}
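These `pretrained_cfg` fields are what `timm.data.resolve_model_data_config` reads when building the evaluation transform used in the README examples. A minimal sketch (the printed values are the expectation implied by the config above, not captured output):

```python
import timm

# Resolve the preprocessing config from the model's pretrained_cfg.
model = timm.create_model('naflexvit_base_patch16_par_gap.e300_s576_in1k', pretrained=True)
data_config = timm.data.resolve_model_data_config(model)
print(data_config)
# expected to reflect the config above:
# input_size (3, 384, 384), interpolation 'bicubic', crop_pct 1.0,
# crop_mode 'center', mean (0.5, 0.5, 0.5), std (0.5, 0.5, 0.5)
```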
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:79f93a1d9cc3f7ae71efa58d323e6e0d363b4f891037b3c152ea7489c7300acc
size 346551016
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:978fcc788a4bb89a4d9a3aa70470acf6cd83e7b221913af1436752e1bf4fb4b1
size 346599779