---
license: cc-by-4.0
datasets:
- imagenet-1k
metrics:
- accuracy
pipeline_tag: image-classification
language:
- en
tags:
- convnext
- convolutional neural network
- simpool
- dino
- computer vision
- deep learning
---

# Self-supervised ConvNeXt-S model with SimPool

ConvNeXt-S model with SimPool (no gamma), trained on ImageNet-1k for 100 epochs with [DINO](https://arxiv.org/abs/2104.14294) self-supervision.

SimPool is a simple attention-based pooling method applied at the end of the network, introduced in this ICCV 2023 [paper](https://arxiv.org/pdf/2309.06891.pdf) and released in this [repository](https://github.com/billpsomas/simpool/).
Disclaimer: This model card was written by the author of SimPool, i.e. [Bill Psomas](http://users.ntua.gr/psomasbill/).

## Motivation

Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? 
As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem?

## Method

SimPool is a simple attention-based pooling mechanism that replaces the default pooling in both convolutional and transformer encoders. For transformers, we completely discard the [CLS] token.
Interestingly, we find that, whether supervised or self-supervised, SimPool improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases.
One could thus call SimPool universal.
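
To make the idea of attention-based pooling concrete, here is a minimal sketch in plain Python. It is a generic illustration, not the SimPool implementation from the repository: the function name `attention_pool`, the use of the average-pooled feature as the query, and the absence of learned projections and normalization layers are all assumptions made for brevity.

```python
import math


def attention_pool(features):
    """Pool N feature vectors (each of length d) into a single vector.

    Hypothetical simplification: no learned projections or normalization,
    so this only illustrates the pooling pattern, not SimPool itself.
    """
    n, d = len(features), len(features[0])

    # Initial query: global average of all feature vectors (assumption).
    query = [sum(f[j] for f in features) / n for j in range(d)]

    # Scaled dot-product score between the query and each location.
    scores = [
        sum(q * x for q, x in zip(query, f)) / math.sqrt(d) for f in features
    ]

    # Numerically stable softmax over locations gives the attention map.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Pooled representation: attention-weighted sum of the features.
    pooled = [sum(a * f[j] for a, f in zip(attn, features)) for j in range(d)]
    return pooled, attn
```

The returned `attn` list plays the role of a spatial attention map: one weight per location, summing to 1. With a ConvNeXt backbone, `features` would correspond to the flattened spatial positions of the last feature map.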

## Evaluation with k-NN

| k       | top-1 (%) | top-5 (%) |
| ------- | --------- | --------- |
| 10      | 68.768  | 86.012  |
| 20      | 68.644  | 87.704  |
| 100     | 66.528  | 88.864  |
| 200     | 65.094  | 88.624  |
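
For readers unfamiliar with k-NN evaluation of frozen features, the sketch below shows one common form: cosine similarity between a query feature and the training features, followed by similarity-weighted voting among the `k` nearest neighbours. This card does not specify the exact protocol used for the numbers above, so the helper names (`l2_normalize`, `knn_predict`), the exponential weighting, and the `temperature=0.07` default are assumptions for illustration only.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def knn_predict(query, train_feats, train_labels, k=10, temperature=0.07):
    """Return candidate labels ranked by weighted votes of the k nearest neighbours."""
    q = l2_normalize(query)

    # Cosine similarity between the query and every training feature.
    sims = [
        (sum(a * b for a, b in zip(q, l2_normalize(feat))), label)
        for feat, label in zip(train_feats, train_labels)
    ]

    # Keep the k most similar training samples.
    nearest = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]

    # Each neighbour votes for its label with weight exp(similarity / temperature).
    votes = defaultdict(float)
    for sim, label in nearest:
        votes[label] += math.exp(sim / temperature)

    return sorted(votes, key=votes.get, reverse=True)
```

Under this scheme a prediction counts as top-1 correct when the true label is `ranked[0]`, and top-5 correct when it appears in `ranked[:5]`, where `ranked` is the list returned by `knn_predict`.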

## BibTeX entry and citation info

```
@misc{psomas2023simpool,
      title={Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?}, 
      author={Bill Psomas and Ioannis Kakogeorgiou and Konstantinos Karantzalos and Yannis Avrithis},
      year={2023},
      eprint={2309.06891},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

```
@inproceedings{liu2022convnet,
  title={A convnet for the 2020s},
  author={Liu, Zhuang and Mao, Hanzi and Wu, Chao-Yuan and Feichtenhofer, Christoph and Darrell, Trevor and Xie, Saining},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={11976--11986},
  year={2022}
}
```